Saturday, July 10, 2010

F# Async workflow application: a Flickr crawler

In data mining, much time is spent on collecting and preprocessing raw data. Quite a few data mining research tasks require downloading data from the Internet, e.g. articles from Wikipedia, photos from Flickr, Google search results, reviews from Amazon, etc.

General-purpose downloaders such as wget are usually not powerful or specialized enough for this kind of task, so we often need to write a crawler of our own.

Recently I have been working on an image-related research project, in which I need to download a lot of tagged images from Flickr. I am aware of Flickr downloadr, which uses the Flickr API to download images. However, 1) it only downloads licensed photos and 2) it cannot download the tags of a photo. Thus I decided to write one myself.

The input of the crawler is a tag query (e.g. “dog”), the number of photos to download, and the disk folder in which to store the downloaded images.

Because the number of photos is quite big, downloading them in parallel is critical. In .Net there are several ways to do parallel computing; for IO-intensive tasks like this one, the Async workflow is the best fit, as the small sketch below illustrates.
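Here is a minimal sketch of the pattern (not part of the crawler itself; the function name and urls are placeholders): each download is described as an Async value, and Async.Parallel fans them out without tying up a thread per request.

open System.Net

// describe one request as an Async value; nothing runs yet
let headPage (url:string) =
    async {
        let req = WebRequest.Create(url)
        use! resp = req.AsyncGetResponse() // non-blocking wait for the response
        return url, resp.ContentLength
    }

[ "http://example.com"; "http://example.org" ]
|> List.map headPage
|> Async.Parallel          // combine them into one async that runs them together
|> Async.RunSynchronously
|> Array.iter (fun (url, len) -> printfn "%s -> %d bytes" url len)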

Tutorials for Async workflow

As the Async workflow is one of the key features of F#, there are quite a few tutorials online for F# Async programming. Providing one more in this blog would be repetitious.

Luke Hoban’s PDC 2009 talk, F# for Parallel and Asynchronous Programming, is very good for beginners.

Don Syme wrote a series of three articles on the topic. Highly recommended for experienced F# users!

The Flickr crawler

To write a crawler for a web site like Flickr, we need to 1) design the downloading strategy and 2) analyze the page structure of Flickr.

My strategy is to issue a search query for the given tag; from each search result page (as shown below), the url of every image page is extracted, and from each image page the image and its tags are then crawled.


Figure 1. Flickr search result page.


Figure 2. Flickr image page with an image and its tags.

So first we need a function to download a web page asynchronously:

open System.IO
open System.Net

let fetchUrl (url:string) =
    async {
        try
            let req = WebRequest.Create(url) :?> HttpWebRequest
            // pretend to be a browser; some sites reject unknown user agents
            req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
            req.Method <- "GET"
            req.AllowAutoRedirect <- true
            req.MaximumAutomaticRedirections <- 4
            let! response1 = req.AsyncGetResponse()
            let response = response1 :?> HttpWebResponse
            use stream = response.GetResponseStream()
            use streamreader = new System.IO.StreamReader(stream)
            return! streamreader.AsyncReadToEnd() // an extension from the F# PowerPack
        with
            _ -> return "" // if there's any exception, just return an empty string
    }

 

fetchUrl pretends to be a Mozilla browser and follows up to four redirects if the url is slightly off. The current exception handling is simplistic: on any exception we just return an empty string for the page. Notice that the return type of the function is Async<string>, so it cannot be used to download images, since images are binary data, not text.
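For example, a single page can be fetched synchronously from F# Interactive (the search url below is just an illustration):

let page = fetchUrl "http://www.flickr.com/search/?q=dog&m=tags" |> Async.RunSynchronously
printfn "fetched %d characters" page.Length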

So the next task is to write a function to download images:

let getImage (imageUrl:string) =
    async {
        try
            let req = WebRequest.Create(imageUrl) :?> HttpWebRequest
            req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
            req.Method <- "GET"
            req.AllowAutoRedirect <- true
            req.MaximumAutomaticRedirections <- 4
            let! response1 = req.AsyncGetResponse()
            let response = response1 :?> HttpWebResponse
            use stream = response.GetResponseStream()
            // copy the response stream into memory, 4KB at a time
            let ms = new MemoryStream()
            let bytesRead = ref 1
            let buffer = Array.create 0x1000 0uy
            while !bytesRead > 0 do
                bytesRead := stream.Read(buffer, 0, buffer.Length)
                ms.Write(buffer, 0, !bytesRead)
            return ms.ToArray()
        with
            _ -> return Array.create 0 0uy // if there's any exception, just return an empty image
    }
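For example (a sketch; the image url and file path are placeholders), a single image can be downloaded and saved like this:

let bytes = getImage "http://farm5.static.flickr.com/1234/example.jpg" |> Async.RunSynchronously
if bytes.Length > 0 then
    System.IO.File.WriteAllBytes(@"D:\tmp\example.jpg", bytes)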

 

Next we write the code to extract the url of an image and its tags from an image page (see Figure 2):

// a small scraping helper: the text between `head` and the next double quote
let getBetween (page:string) (head:string) =
    let len = head.Length
    let idx = page.IndexOf(head)
    let idx2 = page.IndexOf('"', idx + len)
    page.Substring(idx + len, idx2 - idx - len)

let getImageUrlAndTags (page:string) =
    // the image url sits inside the photoImgDiv element
    let header = "class=\"photoImgDiv\">"
    let idx = page.IndexOf(header)
    let url = getBetween (page.Substring(idx)) "<img src=\""

    // the tags are listed in the keywords meta element
    let header2 = "<meta name=\"keywords\" content=\""
    let tagStr = getBetween page header2

    let s = tagStr.Split([|','|], System.StringSplitOptions.RemoveEmptyEntries)
    let tags = s |> Array.map (fun t -> t.Trim())
    url, tags
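The main loop below also relies on a helper, getImageUrls, which collects the photo-page links from a search result page; its definition is not shown in this post. Here is a minimal sketch in the same IndexOf style, assuming (hypothetically) that each photo page is linked as href="/photos/{user}/{id}/" in the search result markup:

// a sketch only: the link pattern is an assumption about Flickr's markup
let getImageUrls (page:string) =
    let pattern = "<a href=\"/photos/"
    let rec collect idx acc =
        let i = page.IndexOf(pattern, idx)
        if i < 0 then List.rev acc
        else
            let start = i + "<a href=\"".Length
            let stop = page.IndexOf('"', start)
            let url = "http://www.flickr.com" + page.Substring(start, stop - start)
            collect stop (url :: acc)
    collect 0 [] |> Seq.distinct |> Seq.toList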

 

Finally, write a function to work through every search result page, parse it and download the images it links to:

let getImagesWithTag (tag:string) (pages:int) =
    let rooturl = @"http://www.flickr.com/search/?q=" + tag + "&m=tags&s=int"
    seq {
        for i = 1 to pages do
            let url = rooturl + "&page=" + i.ToString()
            printfn "url = %s" url
            let page = fetchUrl url |> Async.RunSynchronously
            let imageUrls = getImageUrls page
            // the image file name is the last component of its url
            let getName (iurl:string) =
                let s = iurl.Split '/'
                s.[s.Length - 1]

            (* download all images in one search result page in parallel *)
            let images =
                imageUrls
                |> Seq.map (fun url -> fetchUrl url)
                |> Async.Parallel
                |> Async.RunSynchronously
                |> Seq.map (fun page ->
                    async {
                        let iurl, tags = getImageUrlAndTags page
                        let! icontent = getImage iurl // let! keeps the download asynchronous
                        let iname = getName iurl
                        return iname, icontent, tags
                    })
                |> Async.Parallel
                |> Async.RunSynchronously
            yield! images
    }

with a driver function to write all the images to disk:

let downloadImagesWithTag (tag:string) (pages:int) (folder:string) =
    let images = getImagesWithTag tag pages
    images
    |> Seq.iter (fun (name, content, tags) ->
        let fname = folder + name
        File.WriteAllBytes(fname, content)        // the image bytes
        File.WriteAllLines(fname + ".tag", tags)  // its tags, one per line
        )
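Note that folder is joined to the file name by plain string concatenation, so it must end with a path separator, as in the example below.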

 

We’re done! A Flickr image crawler in only about 120 lines of code. Let’s download some images!

downloadImagesWithTag "sheep" 5 @"D:\WORK\ImageData\flickr\sheep\"


It takes less than 5 minutes to download 300 sheep pictures from Flickr.

Discussions

1. One of the strengths of F# Async programming is how easy it makes exception handling. In this example each exception is handled immediately inside the workflow; we could instead propagate it to an upper-level function and handle it there. However, that would require more thinking. One possibility is sketched below.
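For example (a sketch, not the code used above), suppose fetchUrl did not contain its internal try/with and simply let exceptions escape; a caller could then recover with Async.Catch:

// assumes a hypothetical variant of fetchUrl without its internal try/with
let fetchUrlOrEmpty (url:string) =
    async {
        // Async.Catch wraps the outcome: Choice1Of2 = result, Choice2Of2 = exception
        let! result = fetchUrl url |> Async.Catch
        match result with
        | Choice1Of2 page -> return page
        | Choice2Of2 ex ->
            printfn "failed on %s: %s" url ex.Message
            return ""
    }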

2. Images are currently downloaded in parallel only within a single search result page, while the search result pages themselves are processed sequentially. The program could be modified to process multiple search result pages in parallel as well. Done that way, we would have a hierarchical parallel program: 1) at the first level, multiple search result pages are processed in parallel, and 2) at the second level, the images within each search result page are downloaded in parallel. A sketch of the first level follows.
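For example (a sketch, not part of the crawler above), the search result pages could be fetched in parallel with the same fetchUrl, leaving the per-page image downloads as they are:

let getSearchPages (tag:string) (pages:int) =
    let rooturl = @"http://www.flickr.com/search/?q=" + tag + "&m=tags&s=int"
    [ for i in 1 .. pages -> fetchUrl (rooturl + "&page=" + string i) ]
    |> Async.Parallel          // level 1: fetch all search result pages at once
    |> Async.RunSynchronously  // each page's images would then be processed as before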