Monday, January 28, 2013

A simple GUI tool to help find the right people

 

Today our professor asked us to find a list of researchers who are suitable as program committees for the big data conference:

2013 IEEE International Conference on Big Data (IEEE BigData 2013)

IEEE BigData is a brand new conference and does not have a “circle” yet. That means finding PCs is harder than other more established conferences. Our solution is to search the PCs in other major conferences such as WWW/KDD and find researchers who are doing big data.

After I count the number of PCs in SIGIR conference, I know that I need a tool to help me. SIGIR’11 has 223 researchers in its community! Think about repeating the following tasks for 223 times:

image

I want to automate the following the tasks:

image

 

There are two tasks that are hard:

1. Getting emails automatically. Some researchers simply don’t write emails on their webpages. To find the email, one have to open one of their research papers.

2. Decide if a researcher is related to big data. Actually we can formulate the problem into a binary classification task, which however requires some training data. So we go back to human labeling. Currently I define a set of keywords that indicate BigData:

let keywords =
    [ "large"
      "scale"
      "big"
      "data"
      "hadoop"
      "mapreduce"
      "distributed"
      "system"
      "cloud"
      "computing"
      "parallel"
    ]

and for each home page, I list the subset of keywords that occur in it. If no keyword is in his home page, then it is unlikely the researcher is doing something related to BigData. So the keyword filter is served as a helper for the human to decide.

I write an F# program and the code can be find on BitBucket. Here is a screen shot of running program:

image

 

When I click “WANT THIS”, the program inserts a record to the log file.

There two windows, one is for Google search and the other is for the page of the first search result. I keep two pages because occasionally the first result is not correct, I can then click other search results.

Some technical details:

1. WebBrowser in F# with only four lines!

let form = new Form(Visible=true, Text="Google Search Result")
let content = new WebBrowser()
content.Navigate("http://www.google.com")
form.Controls.Add(content)

2. Finding ralevtive path in an F# script (this makes the experiment easier to reproduce)

let folder = __SOURCE_DIRECTORY__ + @"\..\..\data\"

let logfile = folder + "log.csv"
let namelistfile = folder + "sigir10.txt"

3. Details in WebBrowser class

content.DocumentCompleted.Add(fun _ –> doing something when the page is fully loaded)

Tuesday, December 11, 2012

A reading list on F# and other ML languages

 

Yin Zhu

This page will be updated regularly.    last updated on 11 Dec 2012.

 

For F#, Don Syme has posted a blog page:

How to reference F# in a research paper?

Most of the papers therein are, of course, written by Don himself, the creator of F#.

The above page is on F#, particularly on the novel points in F#, e.g. active patterns, async IO, quotations, etc. But the most powerful part of F# is probably its ML heritage. To have a full and authoritative review of the core of F#, we must to back to the original papers of these ideas. This also motivated me to read some of these papers. For a few of them, I have quite a good understanding. However, as I haven’t had any formal training in programming languages (nor do I intend to – data mining is my primary job!), I only have a shallow understanding of most papers. So, I set up this page to keep track on interesting papers that I have read and want to re-read.

General ML

Paul Hudak. The conception, evolution, and application of functional programming languages. ACM Computing Surveys. 1989. pdf

A comprehensive survey. Tutorial on lambda calculus and its relationship to FP languages. Note the year it published, Haskell was not mature at that time.

Andrew Appel. A Critique of Standard ML. 1992. pdf

This paper gives a detailed review on ML language features, mostly from a user’s perspective. The concerns Appel raises are what a typical ML programmer would raise, e.g. the efficiency of the low-level implementation of datatypes.

David B. MacQueen. Reflections on Standard ML. 1993. pdf

This paper is another review on ML, but from a language designer’s perspective.

Module systems

Robert Harper and Benjamin C. Pierce. Design Considerations for ML-Style Module Systems. Chapter 8 in Advanced Topics in Types and Programming Languages. MIT. 2005. book link

An up-to-date review of the ML module system. It starts with a very general introduction, independent of any concrete implementations. At the end of it, it compares different of implementations (SML and OCaml), and compares the ML module system briefly to C and Java. This chapter does not build on the type book of Pierce and can be read alone.

David MacQueen. Modules  for  Standard  ML. 1984. link

Module extension to ML was primary designed by David MacQueen, and this paper is the first paper on ML modules. MacQueen wrote several subsequent papers on ML modules, e.g. implementation issues and separate compilation of SML modules.

Derek Dreyer. Understanding and Evolving the ML Module System. PhD thesis. 2005. pdf

To learn to use a concrete ML module system, the best way is probably to read tutorials and code. For OCaml, check Jason Hickey's introduction to Objective Caml, OCaml manual, many OCaml library designs(Map, Set, and particularly Jean-Christophe FilliĆ¢tre’s ocamlgraph).

Ad-hoc polymorphism

ML does not support typeclass/ad-hoc polymorphism. Attracted by the elegance of Haskell typeclass, many ML programmers want to use typeclass in ML. Without typeclass, ML programmers have to mimic a typeclass, e.g. dictionary passing is used for implementing F# powerpack’s matrix library.

Stefan Wehr and Manuel M. T. Chakravarty. ML Modules and Haskell Type Classes: A Constructive Comparison. 2008. pdf

It implements ML’s full module system using Haskell typeclass; and implement simple typeclass suing ML’s modules. Constructor typeclass is not implemented and everyone said it is hard or impossible (but I haven’t seen on the web why….)

Jeremy Yallop. Practical Generic Programming with OCaml. ML workshop’07. slides.

Another implementation of typeclass (again simple typelcass) using OCaml.

John Peterson and Mark P. Jones. Implementing Type Classes. PLDI’93. paper.

Philip Wadler and Stephen Blott. How to make ad-hoc polymorphism less ad hoc. POPL’88. paper.

The original paper of typeclass actually has suggested a simple implementation using explicit dictionary passing, and this technique as poor man’s ad-hoc polymorphism can be used in ML.

The realpower of Haskell comes from constructor typeclasses – Functor, Monads, etc, are all based on it. (I know how to use them in simple cases, however I have no idea how it is implemented and why it cannot be implemented in .Net; people keep saying that “to support higher-order typeclasses, the IL of .Net needs to be changed” however they never give any clue on why…)

Mark P. Jones. A system of constructor classes: overloading and implicit higher-order polymorphism. FPCA '93. paper.

Inside ML polymorphism

From a programmer’ view, ML polymorphism is quite easy to understand and use. However, normal programmers have little knowledge of the underlying theory and the implementation details of it. Also .Net and Java generics are heavily influenced by ML polymorphism (the core designers are from FP community). If one wants to know the generics in Java or .Net thoroughly, understanding ML polymorphism seems a good pre-course.

Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences. 1978. pdf

This paper is the second most cited paper in ML literature (the first being the definition of Standard ML), the breakthrough in this paper, plus the ML module system, gives the modern shape of ML. The Hindley-Milner type inference scheme is also described in this paper.

Gilad Bracha, Martin Odersky, David Stoutamire and Philip Wadler. Making the future safe for the past: Adding Genericity to the Java Programming Language. OOPSLA'98.pdf

Andrew Kennedy and Don Syme. Design and Implementation of Generics for the .NET Common Language Runtime. PPLDI’01. pdf

These two papers are not directly related to ML. But since Java and .NET are the two major programming platforms today, understanding their implementations of generics is critical to programmers today.

Courses and tutorials

The following courses are taught by active researchers in this field. The instructors are not only good teachers of functional programming, but also successful researchers.

Cornell CS3110. Data Structures and Functional Programming. Uses OCaml. As the title says, data structures and OCaml are taught in an interleaving way. Well designed assignments. More theoretical stuff, e.g. fixpoints.

Harvard CS51. Introduction to Computer Science II: Abstraction and Design. Uses OCaml. Introductory course. Half OCaml and half Java. Focused on core programming abstractions, e.g. higher-order functions, modules, data types, lazy, objects, parallelism.

IT University of Copenhagen. Programs as data. Uses F# to implement a small compiler. Focuses on compilers implementation. At the end of the course, it also introduces Scala, and advanced abstractions, e.g. continuations and Monads.

Yale CS421. Compilers and Interpreters. Uses SML to teach compilers.

OCaml tutorials/books.

Tuesday, November 13, 2012

Reflection on Martin Odersky’s Scala Course at Coursera

Some F# news first:

Our team F# and Composite Networks, got the 4th place at the Stackoverflow Closed-Question prediction competition. And check out our beautiful visualization, “mapping the world’s programmers”, in which the location parser is written in F# and is open sourced. I will write more about this Stackoverflow project in future posts, but today is about Scala… 

--
Today I finished the last assignment of the Scala course at Coursera (code repository link removed due to course restriction). I must say this course is so well designed! As I am already familiar with F#, walking through the course is easy. There are a few things that are new to me or that I only had vague understanding previously. So after this course, I really feel that I have learned something and this is also why I’d like to write this reflection here, to share my fresh feeling of it.

Before this course, my only experience with Scala is an email judge program, which frequently scans an Email account and receive the new submissions from students and auto-score them. I used Scala because Java platform has a quite nice email library and has the example code for Gmail and also because I like to play new languages. The realtime scoreboard for that course is still online – it was really cool to watch this leaderboard when students made submissions. The source code is also in the repository. This program is written in one day (with copy&pate from a few snippets online). Scala is friendly to new users – as in day one I already become productive in it.

In the following, I will summarize the examples used in the course, unique language feature in Scala, homework  and a short language comparison to F# and C#.

1. Examples

The lectures, though mainly on simple materials, are well designed and well explained. Some of the classical examples are presented in the Scala language:

A. Newton’s method for calculating square root. Used as an example in higher order functions, stream/lazy evaluation. Later used to illustrate the concept of fix-point, but Odersky didn't go further to the fix-point in lambda calculus.

B. Rational number class. Numbers are the fundamental types in any programming languages. However usually they are built-in, and their implementations are hidden in the compilers or standard libraries. Implementing a Rational number is a great exercise in learning any language. From the number type, it can go as deep as class/object, comparison, type classes, passing operation dictionaries, etc.

C. Implementing functions as objects. The general idea is simple. What’s not simple is the variant. 

abstract class Function1[-S,+T] { 
  def apply(x: S): T 
}

There are rules and design patterns for when to use covariant/contravariant/non-variant. In Scala, the type checking for variant is much stronger than Java. This also adds complexity to the Scala type system and makes learning more involved. A complicated type system gives more power to library designers though. I will talk about this later.


D. Use List and binary search tree to implement IntSet. Scala does not have datatypes, so the implementation follows Scala's OO style.


E. Expression problem. This is a classical problem to discuss functional design and OO design. The best discussion I found was Ralf Lammel’s C9 lectures.


F. Fibonacci sequence and prime number sequence.


G. Classical search problem: n-queen for depth-first search (DFS) and pouring water for breath-first search (BFS).

Basically every example is implemented in several Scala ways, to show different language features of Scala, from simple to more advanced ones. For example, Fibonacci sequence first appeared in simple recursive functions and later in Streams as an example to show lazy evaluation/computing on demand.

The implementation for the examples is quite OO and trait is heavily used in all examples – actually you cannot avoid OO in Scala for everything concrete is an object.

2. Concepts

Besides the examples, I learn more about the following concepts:

A. Dynamic dispatching vs higher order functions. This is an exercise left at the end of one lecture. Odersky didn’t discuss it, but it is quite interesting to think about it. What I can think of is that .toString/.hashCode/.compare these universal methods associated with any object can be implemented as higher order functions.

Say in OO design, we have a Rational class that implements Order interface

class Rational implements Order { override int compare(other) { }}

In functional design, we would have (immutable) data and functions defined for them:

module Rational =
  type t = int * int
  let compare (a1, b1) (a2, b2) = a1*b2 – a2 * b2

When implementing a sort function, OO approach depends on dynamic method dispatch, while functional design uses higher order functions:

let sort (compare: Rational.t -> Rational.t -> int) (Rational.t list) = ...

The problem also reminds me of dictionary passing style for emulating type classes.


B. Subtyping vs generics, variant. (Lecture 4.4 & 4.5) I was confused about covariant and contravarint in C# and Java (there are major differences between the two). The variant problem becomes a problem when OO-style inheritance marries with FP-style generics. And definitely Scala has a very good solution in its language design. After the lecture and some more reading, I am clearer. But I think what I really need is to read through Scala’s standard library design to get more real-world examples and design lessons.

Variant + implicits are very powerful abstraction tools though the latter seems ad hoc.


C. By value, by name, and lazy. The lecture on Stream is amazingly clear, and it clearly differentiates three concepts: by value, by name and lazy.


D. for comprehensions are translated as map/filter/flatMap higher order functions. For compressions are easier to read than higher order functions. For example, Python, the language that keeps readability in its heart, encourages for comprehension over map/lambda syntax. This is particularly true when nested comprehensions are used, e.g. in the 8-queen example used in the lecture:

def queens(n: Int) = {
 def placeQueens(k: Int): Set[List[Int]] = {
   if (k == 0) Set(List()) 
   else
   for {
     queens <- placeQueens(k - 1)
     col <- 0 until n
     if isSafe(col, queens)
   } yield col :: queens
 }
 placeQueens(n)
}


3. Homework

The homework is well designed, which, as Ordersky reveals in an email to the email list, is from several TAs’ hard work. There are six projects; the skeleton of each project is already there, and each project has several holes to fill in. The primal exercise of doing these projects is thinking in types. Every function is fixed with a type signature, and the implementation I am asked to fill is usually one-liners. Placing the one-line-of-code that fits the type signature well is the primal problem in my head. After seeing six type-oriented design, I think when next time I need to design a project myself in Scala, it will come more naturally OO. (If you see my Scala email judge, you will immediately notice that I used Scala in an ML style…)

The other aspect of the homework is interesting. E.g. Representing a set as a function seems quite neat (though impractical because of efficiency), and using breadth-first-search for Bloxorz game is quite fun.

The hoemwork project also teaches test-driven development, for each project has a test suite, which is usually incomplete and requires the user to fill in more test cases.


4. Scala vs F#/C#


Qua work of art, the work of art cannot be interpreted; there is nothing to interpret; we can only criticize it according to standards, in comparison to other works of art. – T. S. Eliot, Hamlet and his problems, 1921.

This section is not about the course. But it is always interesting to compare languages, and in comparison, we learn.

My impression of Scala is that it is closer to C# than to any functional language. Scala is Java redesigned. The the core of Scala is OO. Then functional programming concepts and others are added based on its OO core. This is the same situation with C#. For language features, it seems to me that only Trait is unique in Scala, and Scala’s treat on variant is more sound. For other features, C# has them or can emulate them with little extra syntax. The fact that Scala is more complicate-designed than C# does not mean it is a better C#. I have seen discussions online arguing that Scala is too complicated and some early enthusiasts go back to Java after their projects become bigger and harder to manage.

On the functional side, Scala program can be very declarative, e.g. when using its immutable collections; however it is a another style of declarativeness which is different from ML family languages (F#/OCaml, and to some extend Haskell). In ML languages, the concept of function is everywhere – from simple function, to operators, to data constructors, functors, to even type constructors, it is a very consistent applicative style. While in Scala, from the very beginning, you think of an object, and then think about what can we do to this object.  This major difference becomes more evident when you actually begin to design a non-trivial project.


If Java platform has a soundly implemented ML, I would prefer that to Scala. However there is no ML there. If cross platform, concise and functional, performance of a statics language, a lot of libraries, etc, are kept in mind, then Scala seems to be a good choice among the very few. And recently the EPFL + Stanford joint team on parallel computing DSLs using Scala seems quite interesting and useful. There are also light-weight distributed frameworks written in Scala, e.g. Spark at Berkeley.


Summary


It is always entertaining to learn a new programming language, and I learned quite a lot by following this course. Thanks to Prof. Martin Ordersky and his TAs. It would be great if there will be a new and more advanced course next year. I would be interested in:

1. Design and implementation of the Scala collection library. Very few people write collection libraries, but many programmers need to design APIs for others to use. The Scala collection library is a gold mine to learn modern API design patterns.

2. Concurrent and parallel programming in Scala, using Actor, and other DSLs.

I will still use F# primarily for my data mining projects. And when I need a Java library or serious cross platform, I can rely on Scala.

Monday, October 1, 2012

Setting OCaml/F# Environment in Emacs

For those who don’t know, I did the first 50 Project Euler problem about three years ago (to teach myself F#) and uploaded them at fsharp-euler wikispace. Since then, however, I haven’t had time to do any update.
By update, I mean writing up more explanations on the existing programs, rather than more programs for new Euler problems. Because for learning the basic syntax and core library of a functional language, solving the first 50 should be enough. Solving more problems adds little to the conquering of the language itself, though would be a great mental exercise.
During this weekend, I had the time to play with OCaml on Project Euler problems, and to set up OCaml in Windows. The following shows the Emacs screenshot.
I use Emacs with Tuareg mode and Auto-Complete mode, and I also find Tuareg’s library browser (Menu –> Tuareg –> OCaml library … ) useful. Auto-Complete mode is also automatically hooked with Tuareg mode to provide key word completion.

emacs-ocaml

The completion dictionary is limited to OCaml keywords only. I used this little trick by scanning the whole library fold and get all function names using F#….

(* get all val names from ocaml lib mli files*)
let files = Directory.GetFiles(@"D:\OCaml\lib", "*.mli")
let only_names = 
    files |> Array.map Path.GetFileNameWithoutExtension
for mli in files do
    let Cap (s: string) = 
        s.Substring(0,1).ToUpper() + s.Substring(1)

            
    let onlyname = Path.GetFileNameWithoutExtension mli |> Cap
    let lines = File.ReadAllLines mli
    let funcs = 
        lines
        |> Array.map (fun (line:string) ->
            let s = line.Split(' ')
      
            if s.[0] = "val" || s.[0] = "external" then
                if s.[1].EndsWith(":") then
                    (true, s.[1].Substring(0, s.[1].Length-1))
                elif s.[1].EndsWith("(") then
                    (false, "")
                else 
                    (true, s.[1])
                    
            else 
                (false, ""))
        |> Array.filter (fun (s, _) -> s)
        |> Array.map (fun (_, f) -> onlyname + "." + f)
    
    for f in funcs do
        printfn "%s" f



By appending the output to the tuareg dict file (on my machine, it is ~\.emacs.d\elpa\auto-complete-20120922.1815\dict\tuareg-mode), we now get the nice Intellisense for OCaml libraries:





image


We can apply this trick to other OCaml libraries, and automate the updating of the dictionary file. Poor man’s intellisense, haha! Don’t worry about the length of the word list, the PHP list is over 100KB, and auto complete mode handles just fine!


There are other things to enhance, e.g. when we move cursor up and down along the completion choices, the corresponding function signature would be displayed at the mini buffer. But I am not familiar with elisp, so I opened the library reference buffer on the right window to check the signatures occasionally. For most of them, F# and OCaml have the same usage patterns, e.g. the main object is usually the last parameter of a function, and seeing the function name alone gives me sense of how to use it.

In Windows, we have the great Visual Studio IDE, but if you use F# in Linux, using F#-mode and auto complete mode will be handy.

Tuesday, September 18, 2012

Another tooling discussion for data mining, F#, OCaml, and Python.

As an F# programmer, I always keep an eye on its cousin, OCaml. OCaml and F# share a large portion of syntax, and an average F# programmer can pick up basic OCaml very quickly, say in one afternoon. And recently, the development of OCaml seems accelerated – exciting language features are added to the new releases: first-class modules (in 3.12) and Generalized abstract data types (GADTs) (in 4.0).  And for a good comparison in features of the two languages, please refer to this SO question.

Today, I’d like to start a short discussion on their application on a specific area, data mining.

The first question the reader may wonder is that “why do you want to use OCaml for data mining?” Yes. If F# makes everything convenient, there would less point for me to consider other languages. The main point for F# is that its cross platform support is poor. Maybe the language itself is fine as there is the Mono tool chain for F#, but the whole ecosystem is not cross-platform friendly. There is always one or two .Net libraries that are written specific for Windows, to name two libraries that I use: R.Net and Sho. Besides the library part, the Mono runtime has poor performance for the .net parallel library, and sometimes may crash. When I was in Microsoft Research Asia, we don’t have any cross-platform issues – every server is powerful with more than 12 cores and 32GB memory and is well installed with Visual Studio and other other tools. When back to HKUST, most of the servers are installed Linux for no good reasons (Win Server licenses are very cheap for our school, the IT guys are simply not experienced enough to make Win servers secured.)

With such constraints, we do have successful stories with F#. Recently I led a team winning the semantic place predict task in Nokia Mobile Data Challenge. Details here. The feature extraction program is written in F# and C#, and we can generate all features (from simple counting to FFT) from 50GB of text data within two hours! Thanks to F#’s asynchronous programming and other functional features!

But after that, I think we really need a better Windows server, or we have to use a language that is more friendly with Linux (at the same time, runs OK with Windows for I use Windows for development).

Currently I use Python. I write scripts in my laptop and when it works correctly on a small dataset, I put the program to the server to get it running on the full dataset, and usually the processing costs one hours to one night. Python is easy to write, and as it claims, easy indeed to read. But at the language level, cod reuse in a dynamic language is generally harder than a static language. The interface is too flexible and I usually dive into the struggle of which-one-to-choose.

On the other side, static language encourages refactoring and provides static types to facilitate code reuse. Consider the following F# code snippet:

 

    // a common pattern for accessing Stackoverflow csv file, question by question
let csv_iter (csvfile: string) (mapper: CsvReader -> string option) (mapfile: string) =
use csv = new CsvReader(new System.IO.StreamReader(csvfile),true)
let headers = csv.GetFieldHeaders()
let fieldCount = csv.FieldCount

use sw = new StreamWriter(mapfile)
use bad = new StreamWriter(mapfile + ".badids")
while csv.ReadNextRecord() do
match
mapper csv with
| Some text -> sw.WriteLine(text)
| None
->
bad.WriteLine(csv.[0])
sw.WriteLine()

sw.Close()
bad.Close()



From the function signature alone, one may guess what this function does. Basically, it process each record in a hugh csv file by a mapper function. The result of the function could be None, which means some errors occurs inside the mapper, or Some string that is a normal result. And all the results are stored in a file line by line. Let’s focus on the mapper function here, we have string option type in the function signature, which means we have a more specific requirement on the mapper than only requiring it to return a string. We have encoded the logic into the types. This is actually the main point of functional programming – types say more things than other languages. If we write this function in Python, the mapper would still return a string, however, the logic that the string can be nil is encoded in the implementation, but not in the type signature. In this case, we have to resort to natural language to comment the Python function that if the mapper gets error, please return None. And this comment is obviously less strict than types.



The above code is only the first version of this csv iterator; several enhancements can be added, e.g. we can define a more detailed type for mapper result, other than to use string option, which is too simple:



    type MapResult = 
MapError
of string * string
| NormalResult
of string



Now, the error has two parts: the error message and the best result string the mapper can get, which is set to empty string in the above csv_iter. In the stackoverflow example, if I cannot parse the question body, maybe I can use its title as a resort.



Say this is my old mapper, which gets the pure text from a question body, excluding its code and other annotations :



    let mapper_getbody_textonly (csv: CsvReader) = 
try
let
feat = createZero()
let markdoc = csv.[7] |> Markdown.Parse
calcTopParagraphs feat markdoc
Some (
let text = feat.PlainString.ToString()
text.Replace(
'\n', ' ').Replace('\r', ' ')
)
with _ ->
None

Now I have to change the code a little bit to make it compile:

    let mapper_getbody_textonly (csv: CsvReader) = 
try
let
feat = createZero()
let markdoc = csv.[7] |> Markdown.Parse
calcTopParagraphs feat markdoc
NormalResult (
let text = feat.PlainString.ToString()
text.Replace(
'\n', ' ').Replace('\r', ' ')
)
with _ ->
MapError ("markdown parser error", csv.[0])



In Python, the compiler will not tell you that you need to change the mapper! So you have to keep in mind a list of old mappers and change them one by one to suit the new interface. In F#, it is required and you don’t need to keep the mapper list as the compiler will tell you to change once you use an old one with the new interface.



With strict types, we are forced and helped by the compiler to change and refactor our code gradually. This is the main point I want to make about code reuse.



I keep 80% code in F# as long as my Windows laptop can run it. When it does not, say the processing costs more than 6 hours, I will write the rest 20% of code in Python and put it on the Linux server. This is my current resort.



So this is my little complain on Python, or on the dynamic languages in general. Let’s get back to OCaml at last. OCaml is great functional language. Among the practical functional languages, I think Haskell is the only one that is more interesting than OCaml in the functional aspect. However, Haskell is definitely not ready for large scale data mining tasks – its memory usage is unacceptable not to mention performance issues. So I was serious on using OCaml for my current task and writing all code in OCaml.



But the librariy availability on OCaml ecosystem bites me! I could not find a CSV library that supports streaming access and Unicode. I did find a simple Markdown parser in OCaml, but it parses a lot of SO Markdowns wrongly. Data mining practitioners usually depend a lot of tools for theirs tasks at hand, from solid CSV and database libraries, to linear algebra libraries, to NLP tagger, etc.. In this aspect, .Net and Python have better libraries than OCaml does.



So this ends my story in using OCaml this time.



I will probably come to OCaml in the future, and will try it for several times, I think. It is really a good language, and is more clean than F# in the functional aspect, e.g. .Net objects can still be null in F#, but there is really no null in OCaml. If everything in a project is better written from scratch, then OCaml can be a very good candidate.

Monday, October 31, 2011

Any numerical computing environment on Java platform?

Recently Tomas Petricek and I have co-written a book chapter, Numerical Computing in F#. It is a free online book chapter for his Manning book: Real World Functional Programming. This book chapter is a survey on numerical computing for .Net languages, specially for F#. I will write another article on this book chapter and present some materials that were cut down from final version on MSDN.

But today, I am looking for things on our competitor’s side -- numerical computing environments for Java platform. I am just wondering what Java guys are doing for numerical computing.

Looking at the mature numerical systems such as Matlab, Mathematica and S-plus/R, we can summarize three general elements of such a system:

1. Math library. Math library is a must for serious numerical computing. It is hard to build well-tested numerical procedures. This kind of low level stuff, e.g. linear algebra routines and FFT, is usually available as a reusable software component.

2. Interactive/Console. The Read-eval-print loop (REPL) is very important for exploratory analysis because it gives a quick feedback on some small fragments of code.

3. Plot/chart library. A plot explains what’s going on clearly and quickly!

I did some web search and found quite a few blog articles; I want to mention articles by Mikio L. Braun. He also maintains his jblas library, which is a JNI wrapper to ATLAS, the highly-optimized BLAS library.

I am aware Incanter, which is a Clojure project aiming to be R in Java platform. But I don't think rewriting basic things, e.g. PCA and linear regression, for Java Platform is a good idea since a stable implementation for these algorithms usually take years because there is just too much details in numerical algorithms.

This time, I found a new library, ScalaLab, which is a on-going project for numerical computing in Scala.

Let’s first talk the library a little bit. For matrix and linear algebra, there are also two quite stable Java libraries: colt and JAMA. The linear algebra functionality of the two libraries should be intact. But the performance is quite bad compared to native code, even several times slower to pure .Net code. [The comparison to .Net code is my personal experience. ] But one should not care too much about performance as the bottleneck of a numerical project differs from projects to projects. A working system is the most important thing, various methods could help to improve the performance. Btw, the performance difference is usually not noticeable when we work on small or sample datasets. I use R quite often recently, the matrix library associated with its Windows version is from Netlib, a normal native performance one. I didn’t have any performance complain over months.

For REPL, basically it is a language issue. After reading through online, I think I will like something built on Scala, such as the ScalaLab environment. Among the mature/quite-mature JVM languages, Scala is most similar to F#. The advantage of Scala to F# is that Scala is a component language, which means Scala has better OO features for library designers and software architects. However, on the small/low level of programming (e.g. a small function), F# feels much more functional and pure than Scala. This, I think, is partially because F# “IS” an ML, while Scala borrows some syntax from Java and C#. The other reason is that OO is at the heart of Scala – everything is an object. Scala’s OO model is greatly influenced by Smalltalk’s. I haven’t learned any Smalltalk. But after some learning in Scala, I can appreciate why Alan Kay, the inventor of Smalltalk, says, “Actually I made up the term "object-oriented", and I can tell you I did not have C++ in mind. “. So even functional programming in Scala is OO-emulated, while F# has a functional base first and then the implementation using .Net is transparent to its syntax. Ok… the above comparison is just off the topic. Anyway, Scala can serve as a good static script language with a REPL.

For the plotting library, JFreeChart seems to be the standard. It is also the plotting library used in Incanter [For this part, I think Incanter really does a great job!]. I am not sure whether there are some advanced or commercial plotting libraries for Java. But I know that .Net has tons of that! Anyway, JFreeChart is ok for daily use.

The last thing, which I haven’t listed above, is parallel computing! Java platform, or more specific, the Java Virtual Machine, proves to be one of the most stable multithreaded systems ever built. For example, Hadoop is written in Java. Although people in Google said that Hadoop is way slow compared to their C++ MapReduce. A lot of big companies are using it. Two months ago, I read the following nice paper by Martin Odersky and Kunle Olukotun:

Language Virtualization for Heterogeneous Parallel Computing

It is just published and already has 28 citations on Google Scholar. The idea is to build a DSL, which serves a middle ware, to write high level programs, e.g. equations involving some matrix operations. Underlying the DSL, there are complicated optimizations going on to transform it and make it parallel using heterogeneous technologies, e.g. GPU and multi-core. So the programmer does not need to know anything about parallel computing and his program is executed in parallel utilizing the multi-cores! The DSL is an internal one built in Scala, another example showing the great expressiveness of Scala. Since such idea is not new, the novel part of the paper is to demonstrate how Scala perfectly fits the requirements of such a paradigm.

I don’t quite have a conclusion for this post. Just some random thoughts on numerical computing in Java and an advertisement for my articles on MSDN微笑

Wednesday, September 14, 2011

A functional red black tree with dynamic order statistics in F#

Yin Zhu

14 September 2011

(The code for this article is at codeplex.)

Binary search trees play an extremely important role in computer science. They were hot research topics back to 1970s and early 1980s. While the plain binary search tree is of little usage, balanced search trees are now being used everywhere. For example, most C++ STL implementations use Red Black Tree for its set and map containers. The F# Core library uses a functional version of AVL tree for Set and Map, which are both persistent/functional containers.

Special binary search trees also have other functionalities other than implementing standard set or map containers. For example, splay tree has an efficient split operation – splitting a dynamic set into two in O(log n) time: one is less than the given key; the other is greater than or equal to the given key. This operation could be used to optimize network flow algorithms. Details could be found in Tarjan’s classical (also tiny) data structures book published in 1983.

In this post, I’d like to implement the classical Red Black Tree in F#. I searched on the internet and found that there were some Red Black Tree code snippets in F#, but none of them are complete. Some only implemented the insert operation, which is relatively easy. Some are incorrect.

My main purpose for implementing the Red Black Tree in F# is to test whether I am still good at implementing complex data structures (I implemented quite a few classical data structures and algorithms during my early undergraduates for ACM/ICPC competitions). But there are new things: 1) the implementation language is F# rather than C++, 2) the implementation should be functional rather than imperative, and 3) the interface design should be general, and allows code reuse.

One function I always miss from standard set containers is dynamic order statistics – finding the ith smallest element in a dynamic set. The query time should be in O(log n). To accomplish this, we need add a size field to the internal tree node. Adding extra fields into a data structure is called augmenting data structures (Chapter 14, CLRS). Augmenting data structures is useful when a function is not built in a classical data structure but can be added by genuinely modifying the classical data structure. It is a very important skill when solving algorithmic competition problems.

I divided this article into several sections. I will first introduce the basics of red black trees, and then we move the insertion part, and order statistics, and then the hardest part – deletion. I have a section describing the testing procedures that ensure the correctness of the implementation. At last, I will leave the specifics of Red Black Trees, and cover the tree traversal in general and show some more functional stuff: continuation and a continuation Monad.

Red Black Tree Basics

A Red Black Tree is a binary search tree with the following extra properties:

1. The root is black and the leaf nodes are black (note: leaf nodes do not contain keys, only internal nodes contain keys).

2. A red node’s two children must be black.

3. For any subtree rooted at x, all the simple paths down to the leaf nodes contain the same number of black nodes.

With the above properties, the Red Black tree is able to perform queries such as search, insert and delete all in O(log n) time.

The F# definition of the tree is given below:

type Color =
Red | Black | DoubleBlack

type RBTree<'a when 'a:comparison> =
| Node of Color * 'a * int * RBTree<'a> * RBTree<'a>
| Leaf of Color



Notice that a node could be Red or Black. For black, there are two cases Black and DoubleBlack. The DoubleBlack color is used only when we are dealing with deletion, and in a valid Red Black tree, there would be of no DoubleBlack node. I will explain DoubleBlack in the delete section.



A Red Black Tree has two kinds of nodes: internal nodes storing keys and dummy leaf nodes. The internal node has a color field, the key filed, the size of the subtree and left and right children. The leaf node has two possible colors: Black or DoubleBlack.



Let’s write two simple recursive functions working on the tree structure to get us warmed up:



let rec contains x = function
| Leaf _ -> false
| Node (_, y, _, a, b) ->
if
x = y then true
elif
x < y then contains x a
else contains x b

let rec depth = function
| Leaf _ -> 0
| Node (_, _, _, a, b) ->
1 + max (depth a) (depth b)



The first is to test whether a key x is in the tree or not. The second is to calculate the depth of a tree.



Insert



For insertion, there is a known trick found by Chris Okasaki in this paper (Red-black trees in a functional setting). Okasaki is also the author of the famous book, Purely Functional Data Structures.



The idea of insertion is: always insert a Red node at some leaf node so that property 3 holds, then fix property 2 up to the root. There are four cases where property 2 could break. For all the four cases, the grand parent is always black, and there will be two red nodes right in the below. Please see the picture on page 3 of the paper.



Then the buttom-up fixing goes up the root. The root may be changed into Red. To hold property 1, simply change the color of the root to Black.



let insert x (t: RBTree<'a>) =
let balance = function
| Black, z, size, Node (Red, y, _, Node (Red, x, _, a, b), c), d
| Black, z, size, Node (Red, x, _, a, Node (Red, y, _, b, c)), d
| Black, x, size, a, Node (Red, z, _, Node (Red, y, _, b, c), d)
| Black, x, size, a, Node (Red, y, _, b, Node (Red, z, _, c, d))->
Node (Red, y, size, Node (Black, x, (getSize a b)+1, a, b), Node (Black, z, (getSize c d)+1, c, d))
| color, b, _, c, d -> // grandparent is red, does not change
Node (color, b, (getSize c d)+1, c, d)

let rec ins = function
| Leaf _ ->
Node (Red, x, 1, Leaf(Black), Leaf(Black))
| Node (color, y, size, a, b) as s ->
if
x < y then balance(color, y, (getSize a b)+2, ins a, b)
elif x > y then balance(color, y, (getSize a b)+2, a, ins b)
else s

match ins t with
| Node (Red, y, size, a, b)->
Node (Black, y, size, a, b)
| t -> t



Order statistics



Let’s have some rest before going to the monstrous delete operation. To get the nth smallest key in the tree, we can add a size field to each node, which is the size of the subtree rooted at the node. Thus the size of Node(_, _, _, left, right) is defined as



size = left.size + right.size + 1



and, as a boundary condition, a leaf node has size 0.



With this size field, the nth function is given below:



let nth n t =
let rec nth' n = function
| Leaf _->
failwith "can't find the nth member"
| Node (_, v, size, l, r)->
let
lsize=getSize l (Leaf(Black))
if lsize+1=n then
v
elif lsize+1>n then
nth' n l
else
nth' (n-lsize-1) r
if n >= size t || n < 0 then
failwith "nth out of range"
else
nth' (n+1) t



Delete



The idea of delete is simple (if we don’t care about the three properties of Red Black Trees): if the node x to be deleted has less than two children, then simply delete it, and lift its child (if any) to its position; if the node has two children, that means there must a node that is just bigger than x in its right child, replace x with that value, now the problem reduces to delete the smallest node in x’s right child.



First let’s write the function to get the smallest value in a tree:



let rec min = function
| Leaf _->
None
| Node (_, value, _, Leaf(Black), _)->
Some(value)
| Node (_, _, _, l, r)->
min l





Then we continue to code the delete logic:



        let rec del value = function
| Leaf(color) ->
failwith "RBT delete failed because x is not in the tree"
| Node (color, y, _, a, b) as s ->
if
value < y then balance(color, y, 0, del value a, b)
elif value > y then balance(color, y, 0, a, del value b)
else
match
(a, b) with
| (Leaf(Black), Leaf(Black)) -> // without non-leaf children
Leaf(Black++color)
| (Leaf(Black), Node (nc, nv, size, nl, nr))
| (Node (nc, nv, size, nl, nr), Leaf(Black)) -> // with a single child
Node (color++nc, nv, size, nl, nr)
| (l, r) ->
//find the successor (smallest element in a right child),
//replace the key with the successor's
//and delete relevant node
match (min r) with
| Some(v) ->
balance(color, v, 0, l, del v r)
| None ->
failwith "impossible: can't find the successor"




The hard part is how to balance the tree after the deletion of the node. If we delete a black node, then Property 3 cannot hold anymore. So we must make rotations and do some rearrangements on colors to make Property 3 to hold again.



Think this problem in another way: after the deletion of a Black node and we connect its single child or leaf node into it, the node becomes darker. Conceptually Property 3 still holds if we think the node has two “Black”s in it. We mark this node as DoubleBlack and the task of balance is to eliminate this DoubleBlack node!



I defined an operator ++ to plus two colors:



let (++) color1 color2 =
match color1, color2 with
| Red, Black
| Black, Red ->
Black
| Black, Black ->
DoubleBlack
| _, _ ->
Red



Let’s define the invariant for the balance function as follows: no node under pv is DoubleBlack.



Consider pv is the root of the subtree, and its left child is rooted at x, which is a DoubleBlack. Let’s consider all the cases for its right child. (After this is done, then we can work out when x is located as the right similarly.)



Case 1. If pv is black and its right child rv is red. Then do a left-rotation and make the DoubleBlack node x one level down (so that we reduce the problem):



image





(notations in the picture: nodes in circle are single nodes with colors, blue represents either red or black; a symbol represents a tree rooted at the symbol, for some cases the colors of the symbols are also noted.)



Case 2: If pv is of anycolor, and its right child is black. The three sub cases are shown below:



clip_image004



clip_image006



clip_image008



And the balance procedure is as follows:



let rec balance = function
// invariant: the two children rooted at node pv have same "black" depth
// but the root node itself may become a "double" black node
| Black, pv, _, x, Node (Red, rv, size, ra, rb) when (isDoubleBlack x)-> // case 1
balance(Black, rv, 0, balance(Red, pv, 0, x, ra), rb) // do left-rotate and continue balance
| pc, pv, _, x, Node (Black, rv, size, ra, rb) when (isDoubleBlack x)-> // case 2.a 2.b and 2.c
if isBlack ra && isBlack rb then // 2.a: reduce a black on both side
let tempNode=Node (Red, rv, size, ra, rb)
Node (pc++Black, pv, (getSize x tempNode)+1, blackify x, tempNode) // reduces the double node to root
elif isBlack rb then // 2.b: do a right rotation in the right child, so rb becomes red and recudes to the "else" case
match ra with
| Node (_, rav, _, raa, rbb)->
let
tempNode1= Node (Red, rv, (getSize rbb rb)+1, rbb, rb)
let tempNode2=Node (Black, rav, (getSize raa tempNode1)+1, raa, tempNode1)
balance(pc, pv, 0, x, tempNode2)
| _->
failwith "impossible error"
else // 3.c
let tempNode=Node (Black, pv, (getSize x ra)+1, blackify x, ra)
Node (pc, rv, (getSize tempNode rb)+1, tempNode, blackify rb)

// when doubleblack x is on the right
| Black, pv, _, Node (Red, lv, _, la, lb), x when (isDoubleBlack x)->
balance(Black, lv, 0, la, balance(Red, pv, 0, lb, x))
| pc, pv, _, Node (Black, lv, size, la, lb), x when (isDoubleBlack x)->
if
isBlack la && isBlack lb then
let
tempNode=Node (Red, lv, (getSize la lb)+1, la, lb)
Node (pc++Black, pv, (getSize tempNode x)+1, tempNode, blackify x)
elif isBlack la then
match
lb with
| Node (_, _, lbv, lba, lbb)->
let
tempNode1=Node (Red, lv, (getSize la lb)+1, la, lba)
let tempNode2=Node (Black, lbv, (getSize tempNode1 lbb)+1, tempNode1, lbb)
balance(pc, pv, 0, tempNode2, x)
| _->
failwith "impossible error"
else
let
tempNode=Node (Black, pv, (getSize lb x)+1, lb, blackify x)
Node (pc, lv, (getSize la tempNode)+1, blackify la, tempNode)

| a, b, _, c, d->
Node (a, b, (getSize c d)+1, c, d)



Remember that this balance only balances the two children of any tree. Thus the root can still be DoubleDark (similar case in insert). We need a final check on the root:



match del x t with
| Node (_, y, size, a, b) ->
Node (Black, y, size, a, b)
| Leaf _ -> Leaf(Black)



Validation for the implementation



It is important to have some test to verify the correctness of the above implementation. My idea is to do some random insertions and deletions, and then after each insertion or deletion, check whether the tree still holds the three properties.



The validate function is given below:



let validate t =
let rec validate = function
| Leaf Black -> true, 0
| Leaf DoubleBlack ->
printfn "DoubleBlack in a Leaf node"
false, 0
| Leaf Red ->
printfn "Red in a Leaf node"
false, 0
| Node (Black, _, _, a, b) ->
let
lr, ln = validate a
let rr, rn = validate b
lr && rr && (ln = rn), ln + 1
| Node (Red, _, _, a, b) ->
match
a, b with
| Node (Red, _, _, _, _), _
| _, Node (Red, _, _, _, _) ->
let
lr, ln = validate a
printfn "Red-Red in a tree"
false, ln
| _ ->
let
lr, ln = validate a
let rr, rn = validate b
lr && rr && (ln = rn), ln
| Node (DoubleBlack, _, _, a, _) ->
let
lr, ln = validate a
printfn "DoubleBlack in an internal node"
false, ln
match t with
| Leaf Black -> true, 0
| Leaf _ -> false, 0
| Node (Black, _, _, _, _) ->
validate t
| Node (Red, _, _, _, _)
| Node (DoubleBlack, _, _, _, _) ->
printfn "root must be Black"
false, 0



I did use the above test case to catch one bug in the implementation, the original code for the final deletion step is:



match del x t with

| Node (_, y, size, a, b) ->


Node (Black, y, size, a, b)


| t -> t




This code has a bug that when the tree is deleted to empty, its representation is Leaf (DoubleBlack) rather than Leaf(Black)!



I did a performance test against with F#’s set, which uses AVL tree:



    // Fisher-Yates shuffle
let randperm n =
let p = Array.init n (fun i -> i)
let r = new System.Random(1)
for i=n-1 downto 1 do
let
j = r.Next(i+1) // random number 0 <= j <= i
let tmp = p.[j]
p.[j] <- p.[i]
p.[i] <- tmp
p

let lst = randperm 1000000 |> Seq.toList

let rb = RBTree.ofList lst |> ignore

open Microsoft.FSharp.Collections

let av = Set.ofList lst |> ignore

//timing on a AMD E-350 Netbook (1.6 GHz)
// Real: 00:00:44.635, CPU: 00:00:44.694, GC gen0: 218, gen1: 61, gen2: 4
//
// val rb : unit = ()
//
// >
// Real: 00:00:17.448, CPU: 00:00:18.766, GC gen0: 76, gen1: 20, gen2: 2
//
// val av : unit = ()
//






As we can see, for a large set with one million elements, my implementation costs about two times of that of AVL tree in F# Core library. That was not bad. I shall devote more time into performance and see if we can do more improvement. One optimization is to reduce redundant tree node allocation.



Tree traversal



Let’s talk some interesting stuff about tree traversal at last. I did implement some traversal functions:



let rec map f = function
| Leaf _ as l -> l
| Node (color, x, size, l, r) -> Node (color, f x, size, map f l, map f r)

let rec iter f = function
| Leaf _ -> ()
| Node (color, x, size, l, r) ->
iter f l
f x
iter f r

let rec fold f t acc =
match t with
| Leaf _ -> acc
| Node (color, x, size, l, r) -> fold f l (f x (fold f r acc))


They are straightforward recursive functions, and not tail-recursive. For a balanced tree like red black tree, tail-recursive is not important because the tree does not go too deep (a tree with 1 to 1000000 in it has a depth of 26). But it is just fun to make tem tail-recursive by using continuations. The standard accumulating variable trick does not work here because we have two recursive calls.



Let’s write the tail-recursive form of map function:



let map2 f t =
let rec map f t k =
match t with
| Leaf _ as l -> k l
| Node (color, x, size, l, r) ->
map f l (fun lmap ->
map f r (fun rmap -> k (Node(color, f x, size, lmap, rmap))))

map f t (fun x -> x)



We can also use a Monad to abstract the continuation pattern out:



type Cont<'a,'r> = ('a -> 'r) -> 'r

type ContBuilder() =
member x.Return (a):Cont<'a,'r> = fun k -> k a
member x.Bind (c:Cont<'a,'r>, f:'a -> Cont<'b,_>) =
(fun k -> c (fun a -> f a k))
member this.Delay(f) = f()

let cont = ContBuilder()

let map3 f t =
let rec map f = function
| Leaf _ as t -> cont { return t }
| Node (color, x, size, l, r) ->
cont {
let! lm = map f l
let! rm = map f r
return Node (color, f x, size, lm, rm)
}
map f t (fun x -> x)



A note on <'a when 'a:comparison>



In our definition of the Red Black Tree:



type RBTree<'a when 'a:comparison> =
| Node of Color * 'a * int * RBTree<'a> * RBTree<'a>
| Leaf of Color



The type signature has an ad hoc construct: when ‘a: comparison. Actually the only other constraint on ‘a is equality. F# only supports these two type constraints.



Type constraints behave like type classes in Haskell. :equality is Eq, :comparison is Ord. However, type classes in Haskell are much flexible.



Details could be found on this blog article by Don Syme.



Summary



In this article, I have implemented a functional Red Black Tree with dynamic order statistics. The tree is complete and has been well tested. Before any performance tuning, the running speed of it is within a small factor (2-3) of the AVL tree in F# Core library.



This article shows many important features of functional programming: 1) pattern matching and algebraic data types help implement complex data structures; 2) recursion is everywhere and the correctness of the program could be proved by inductive reasoning; and finally 3) a light introduction to tree traversal and continuation and Monad in F#.