Monday, April 18, 2011

Extracting top words from titles and abstracts in MIX11 presentations

 

The videos from the MIX11 conference (Apr. 11-14) are available at http://channel9.msdn.com/Events/MIX/MIX11. To get a sense of the current hot topics, I wrote an F# script that counts the most frequent words in the titles and abstracts of all the presentations.

Title statistics:

"windows" : 41
"phone" : 26
"azure" : 14
"web" : 14
"html5" : 11
"silverlight" : 9
"net" : 9
"7" : 8
"applications" : 8
"data" : 8
"platform" : 7
"application" : 7
"javascript" : 7
"new" : 7
"ux" : 6
"what’s" : 6
"asp" : 6
"boot" : 5
"camp" : 5
"building" : 5

The title statistics suggest that Windows Phone and the Azure cloud platform are the hottest topics. HTML5, the next standard for web pages, has clearly become a focus for Microsoft. Silverlight still gets attention. It is a good companion to HTML5: I think it will be used mostly for in-house web applications, while HTML5 has broader support and is available in browsers across platforms.

Abstract statistics:

"web" : 62
"session" : 55
"windows" : 49
"new" : 40
"applications" : 36
"phone" : 36
"learn" : 32
"we’ll" : 29
"use" : 25
"using" : 24
"silverlight" : 23
"data" : 23
"net" : 22
"javascript" : 21
"azure" : 21
"developers" : 20
"come" : 17
"microsoft" : 17
"features" : 17
"one" : 16

 

And the whole F# program is actually short: only about 40 lines! Everything is there: downloading the web page, extracting the titles and abstracts, removing stopwords, counting the words, sorting...

 

Code Snippet
    open System
    open System.Net
    open System.Text.RegularExpressions

    // Download a page and return its body as a string.
    let fetchUrlSimple (url: string) =
        let req = WebRequest.Create(url)
        use response = req.GetResponse()
        use stream = response.GetResponseStream()
        use streamreader = new System.IO.StreamReader(stream)
        streamreader.ReadToEnd()

    // Return the k most frequent non-stopword words across all documents.
    let topKWords (docs: string seq) k =
        let separator = [|' '; '\r'; '\n'; '-'; '.'; ',' ; '\t'; '!'; '?'; '\''; ';'; '/' |]
        let stopwords = Set("a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".Split ',')

        docs
        // Tokenize each document, lowercase the words and drop stopwords.
        |> Seq.map (fun doc ->
            doc.Split(separator, StringSplitOptions.RemoveEmptyEntries)
            |> Seq.map (fun word -> word.ToLower())
            |> Seq.filter (fun word -> not (stopwords.Contains(word)))
            )
        |> Seq.concat
        // Count the occurrences of each word and sort by descending count.
        |> Seq.groupBy (fun x -> x)
        |> Seq.map (fun (word, wordSeq) -> (word, wordSeq |> Seq.length))
        |> Seq.sortBy (fun (_, wordCnt) -> - wordCnt)
        // Seq.truncate, unlike Seq.take, does not throw when there are fewer than k words.
        |> Seq.truncate k
        |> Seq.toList

    let rawpage = fetchUrlSimple @"http://channel9.msdn.com/Events/MIX/MIX11"
    let titles, abstracts =
        // Replace HTML entities (e.g. &amp;) with a space.
        let page = Regex.Replace(rawpage, "&#?[a-z0-9]+;", " ")
        let titleMatches = Regex.Matches(page, "class=\"title\">(.*?)</a>")
        // '.' does not match '\n', so flatten newlines before matching multi-line descriptions.
        let abstractMatches = Regex.Matches(page.Replace("\n"," "), "class=\"description\">(.*?)</div>")
        let matchesToSeq (matches: MatchCollection) =
            seq {
                for m in matches do
                    yield m.Groups.[1].Value
            }
        matchesToSeq titleMatches, matchesToSeq abstractMatches

    // Evaluate these two lines in F# Interactive to print the top 20 words.
    topKWords titles 20
    topKWords abstracts 20
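
The extraction step can also be tried without downloading anything. The sketch below runs the same two regular expressions on a small hand-written HTML fragment. The fragment is made up for illustration (the real Channel 9 markup may differ), and `captures` is a hypothetical helper name, not part of the script above.

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for the listing page; not the real Channel 9 HTML.
let samplePage =
    [ "<a href=\"/s1\" class=\"title\">Windows Phone &amp; Azure</a>"
      "<div class=\"description\">Learn what is new"
      "in Windows Phone.</div>"
      "<a href=\"/s2\" class=\"title\">HTML5 for Web Developers</a>"
      "<div class=\"description\">Come see HTML5 demos.</div>" ]
    |> String.concat "\n"

// Replace HTML entities such as &amp; with a space, as the script does.
let cleaned = Regex.Replace(samplePage, "&#?[a-z0-9]+;", " ")

// Collect capture group 1 of every match into a list.
let captures (pattern: string) (text: string) =
    [ for m in Regex.Matches(text, pattern) -> m.Groups.[1].Value ]

let sampleTitles = captures "class=\"title\">(.*?)</a>" cleaned

// '.' does not match '\n' by default, so newlines are flattened first;
// otherwise the first description, which spans two lines, would be missed.
let sampleAbstracts =
    captures "class=\"description\">(.*?)</div>" (cleaned.Replace("\n", " "))

printfn "%A" sampleTitles
printfn "%A" sampleAbstracts
```

Passing `RegexOptions.Singleline` to `Regex.Matches` would be another way to let `.` cross line breaks; replacing the newlines just keeps the sketch close to the original script.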

 

My colleague Defu Lian wrote a C# version of the topKWords function using LINQ:

Code Snippet
    // Requires: using System; using System.Collections.Generic; using System.Linq;
    static List<Tuple<string, int>> topKWords(IEnumerable<string> docs, int k)
    {
        char[] separator = { ' ', '\r', '\n', '-', '.', ',', '\t', '!', '?', '\'', ';', '/' };
        var stopwords = new HashSet<string>("a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".Split(','));
        // Tokenize, lowercase and filter each document, then count and rank the words.
        return docs.SelectMany(doc => doc.Split(separator, StringSplitOptions.RemoveEmptyEntries)
                            .Select(word => word.ToLower())
                            .Where(word => !stopwords.Contains(word)))
            .GroupBy(a => a)
            .Select(wordgroup => Tuple.Create<string,int>(wordgroup.Key, wordgroup.Count()))
            .OrderByDescending(w2c => w2c.Item2)
            .Take(k)
            .ToList();
    }
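
As a possible variation on the F# side, the group-then-count pair can be collapsed into a single `Seq.countBy` call, which makes the pipeline read much like the LINQ chain. This is only a sketch: `countTopWords`, the shortened separator and stopword lists, and the sample documents are assumptions made for illustration, and `Seq.sortByDescending` needs F# 4.0 or later.

```fsharp
open System

// Hypothetical helper: returns the k most frequent words that are not stopwords.
let countTopWords (stopwords: Set<string>) (k: int) (docs: seq<string>) =
    let separators = [| ' '; '.'; ','; '-' |]   // shortened list, for the sketch only
    docs
    |> Seq.collect (fun doc -> doc.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    |> Seq.map (fun word -> word.ToLowerInvariant())
    |> Seq.filter (fun word -> not (stopwords.Contains word))
    |> Seq.countBy id              // (word, count) pairs in one step
    |> Seq.sortByDescending snd    // highest count first
    |> Seq.truncate k              // tolerates fewer than k distinct words
    |> Seq.toList

// Made-up sample documents.
let sampleDocs =
    [ "Building Windows Phone applications"
      "Windows Azure and the web"
      "The web platform" ]

let top = countTopWords (Set.ofList [ "and"; "the" ]) 2 sampleDocs
printfn "%A" top
```

In this sample, "windows" and "web" each occur twice and every other word once, so those two words make up the result.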

1 comment:

  1. Very good. I have a demo at http://www.youtube.com/watch?v=c41mmbuXBZQ and the source at http://pastebin.com/qy96KkGL, but I'm looking for the Hyperz.SharpLeech.Engine lib and I can't find it.

    Take a look
