Tuesday, December 11, 2012

A reading list on F# and other ML languages

Yin Zhu

This page will be updated regularly. last updated on 11 Dec 2012.

For F#, Don Syme has posted a blog page:

How to reference F# in a research paper?

Most of the papers therein are, of course, written by Don himself, the creator of F#.

The above page is on F#, particularly on the novel points in F#, e.g. active patterns, async IO, quotations, etc. But the most powerful part of F# is probably its ML heritage. To have a full and authoritative review of the core of F#, we must to back to the original papers of these ideas. This also motivated me to read some of these papers. For a few of them, I have quite a good understanding. However, as I haven’t had any formal training in programming languages (nor do I intend to – data mining is my primary job!), I only have a shallow understanding of most papers. So, I set up this page to keep track on interesting papers that I have read and want to re-read.

General ML

Paul Hudak. The conception, evolution, and application of functional programming languages. ACM Computing Surveys. 1989. pdf

A comprehensive survey. Tutorial on lambda calculus and its relationship to FP languages. Note the year it published, Haskell was not mature at that time.

Andrew Appel. A Critique of Standard ML. 1992. pdf

This paper gives a detailed review on ML language features, mostly from a user’s perspective. The concerns Appel raises are what a typical ML programmer would raise, e.g. the efficiency of the low-level implementation of datatypes.

David B. MacQueen. Reflections on Standard ML. 1993. pdf

This paper is another review on ML, but from a language designer’s perspective.

Module systems

Robert Harper and Benjamin C. Pierce. Design Considerations for ML-Style Module Systems. Chapter 8 in Advanced Topics in Types and Programming Languages. MIT. 2005. book link

An up-to-date review of the ML module system. It starts with a very general introduction, independent of any concrete implementations. At the end of it, it compares different of implementations (SML and OCaml), and compares the ML module system briefly to C and Java. This chapter does not build on the type book of Pierce and can be read alone.

David MacQueen. Modules for Standard ML. 1984. link

Module extension to ML was primary designed by David MacQueen, and this paper is the first paper on ML modules. MacQueen wrote several subsequent papers on ML modules, e.g. implementation issues and separate compilation of SML modules.

Derek Dreyer. Understanding and Evolving the ML Module System. PhD thesis. 2005. pdf

To learn to use a concrete ML module system, the best way is probably to read tutorials and code. For OCaml, check Jason Hickey's introduction to Objective Caml, OCaml manual, many OCaml library designs(Map, Set, and particularly Jean-Christophe Filliâtre’s ocamlgraph).

Ad-hoc polymorphism

ML does not support typeclass/ad-hoc polymorphism. Attracted by the elegance of Haskell typeclass, many ML programmers want to use typeclass in ML. Without typeclass, ML programmers have to mimic a typeclass, e.g. dictionary passing is used for implementing F# powerpack’s matrix library.

Stefan Wehr and Manuel M. T. Chakravarty. ML Modules and Haskell Type Classes: A Constructive Comparison. 2008. pdf

It implements ML’s full module system using Haskell typeclass; and implement simple typeclass suing ML’s modules. Constructor typeclass is not implemented and everyone said it is hard or impossible (but I haven’t seen on the web why….)

Jeremy Yallop. Practical Generic Programming with OCaml. ML workshop’07. slides.

Another implementation of typeclass (again simple typelcass) using OCaml.

John Peterson and Mark P. Jones. Implementing Type Classes. PLDI’93. paper.

Philip Wadler and Stephen Blott. How to make ad-hoc polymorphism less ad hoc. POPL’88. paper.

The original paper of typeclass actually has suggested a simple implementation using explicit dictionary passing, and this technique as poor man’s ad-hoc polymorphism can be used in ML.

The realpower of Haskell comes from constructor typeclasses – Functor, Monads, etc, are all based on it. (I know how to use them in simple cases, however I have no idea how it is implemented and why it cannot be implemented in .Net; people keep saying that “to support higher-order typeclasses, the IL of .Net needs to be changed” however they never give any clue on why…)

Mark P. Jones. A system of constructor classes: overloading and implicit higher-order polymorphism. FPCA '93. paper.

Inside ML polymorphism

From a programmer’ view, ML polymorphism is quite easy to understand and use. However, normal programmers have little knowledge of the underlying theory and the implementation details of it. Also .Net and Java generics are heavily influenced by ML polymorphism (the core designers are from FP community). If one wants to know the generics in Java or .Net thoroughly, understanding ML polymorphism seems a good pre-course.

Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences. 1978. pdf

This paper is the second most cited paper in ML literature (the first being the definition of Standard ML), the breakthrough in this paper, plus the ML module system, gives the modern shape of ML. The Hindley-Milner type inference scheme is also described in this paper.

Gilad Bracha, Martin Odersky, David Stoutamire and Philip Wadler. Making the future safe for the past: Adding Genericity to the Java Programming Language. OOPSLA'98.pdf

Andrew Kennedy and Don Syme. Design and Implementation of Generics for the .NET Common Language Runtime. PPLDI’01. pdf

These two papers are not directly related to ML. But since Java and .NET are the two major programming platforms today, understanding their implementations of generics is critical to programmers today.

Courses and tutorials

The following courses are taught by active researchers in this field. The instructors are not only good teachers of functional programming, but also successful researchers.

Cornell CS3110. Data Structures and Functional Programming. Uses OCaml. As the title says, data structures and OCaml are taught in an interleaving way. Well designed assignments. More theoretical stuff, e.g. fixpoints.

Harvard CS51. Introduction to Computer Science II: Abstraction and Design. Uses OCaml. Introductory course. Half OCaml and half Java. Focused on core programming abstractions, e.g. higher-order functions, modules, data types, lazy, objects, parallelism.

IT University of Copenhagen. Programs as data. Uses F# to implement a small compiler. Focuses on compilers implementation. At the end of the course, it also introduces Scala, and advanced abstractions, e.g. continuations and Monads.

Yale CS421. Compilers and Interpreters. Uses SML to teach compilers.

OCaml tutorials/books.

Tuesday, November 13, 2012

Reflection on Martin Odersky’s Scala Course at Coursera

Some F# news first:

Our team F# and Composite Networks, got the 4th place at the Stackoverflow Closed-Question prediction competition. And check out our beautiful visualization, “mapping the world’s programmers”, in which the location parser is written in F# and is open sourced. I will write more about this Stackoverflow project in future posts, but today is about Scala…

--
Today I finished the last assignment of the Scala course at Coursera (~~code repository link removed~~ due to course restriction). I must say this course is so well designed! As I am already familiar with F#, walking through the course is easy. There are a few things that are new to me or that I only had vague understanding previously. So after this course, I really feel that I have learned something and this is also why I’d like to write this reflection here, to share my fresh feeling of it.

Before this course, my only experience with Scala is an email judge program, which frequently scans an Email account and receive the new submissions from students and auto-score them. I used Scala because Java platform has a quite nice email library and has the example code for Gmail and also because I like to play new languages. The realtime scoreboard for that course is still online – it was really cool to watch this leaderboard when students made submissions. The source code is also in the repository. This program is written in one day (with copy&pate from a few snippets online). Scala is friendly to new users – as in day one I already become productive in it.

In the following, I will summarize the examples used in the course, unique language feature in Scala, homework and a short language comparison to F# and C#.

1. Examples

The lectures, though mainly on simple materials, are well designed and well explained. Some of the classical examples are presented in the Scala language:

A. Newton’s method for calculating square root. Used as an example in higher order functions, stream/lazy evaluation. Later used to illustrate the concept of fix-point, but Odersky didn't go further to the fix-point in lambda calculus.

B. Rational number class. Numbers are the fundamental types in any programming languages. However usually they are built-in, and their implementations are hidden in the compilers or standard libraries. Implementing a Rational number is a great exercise in learning any language. From the number type, it can go as deep as class/object, comparison, type classes, passing operation dictionaries, etc.

C. Implementing functions as objects. The general idea is simple. What’s not simple is the variant.

abstract class Function1[-S,+T] { 
  def apply(x: S): T 
}

There are rules and design patterns for when to use covariant/contravariant/non-variant. In Scala, the type checking for variant is much stronger than Java. This also adds complexity to the Scala type system and makes learning more involved. A complicated type system gives more power to library designers though. I will talk about this later.

D. Use List and binary search tree to implement IntSet. Scala does not have datatypes, so the implementation follows Scala's OO style.

E. Expression problem. This is a classical problem to discuss functional design and OO design. The best discussion I found was Ralf Lammel’s C9 lectures.

F. Fibonacci sequence and prime number sequence.

G. Classical search problem: n-queen for depth-first search (DFS) and pouring water for breath-first search (BFS).

Basically every example is implemented in several Scala ways, to show different language features of Scala, from simple to more advanced ones. For example, Fibonacci sequence first appeared in simple recursive functions and later in Streams as an example to show lazy evaluation/computing on demand.

The implementation for the examples is quite OO and trait is heavily used in all examples – actually you cannot avoid OO in Scala for everything concrete is an object.

2. Concepts

Besides the examples, I learn more about the following concepts:

A. Dynamic dispatching vs higher order functions. This is an exercise left at the end of one lecture. Odersky didn’t discuss it, but it is quite interesting to think about it. What I can think of is that .toString/.hashCode/.compare these universal methods associated with any object can be implemented as higher order functions.

Say in OO design, we have a Rational class that implements Order interface

class Rational implements Order { override int compare(other) { }}

In functional design, we would have (immutable) data and functions defined for them:

module Rational =
  type t = int * int
  let compare (a1, b1) (a2, b2) = a1*b2 – a2 * b2

When implementing a sort function, OO approach depends on dynamic method dispatch, while functional design uses higher order functions:

let sort (compare: Rational.t -> Rational.t -> int) (Rational.t list) = ...

The problem also reminds me of dictionary passing style for emulating type classes.

B. Subtyping vs generics, variant. (Lecture 4.4 & 4.5) I was confused about covariant and contravarint in C# and Java (there are major differences between the two). The variant problem becomes a problem when OO-style inheritance marries with FP-style generics. And definitely Scala has a very good solution in its language design. After the lecture and some more reading, I am clearer. But I think what I really need is to read through Scala’s standard library design to get more real-world examples and design lessons.

Variant + implicits are very powerful abstraction tools though the latter seems ad hoc.

C. By value, by name, and lazy. The lecture on Stream is amazingly clear, and it clearly differentiates three concepts: by value, by name and lazy.

D. for comprehensions are translated as map/filter/flatMap higher order functions. For compressions are easier to read than higher order functions. For example, Python, the language that keeps readability in its heart, encourages for comprehension over map/lambda syntax. This is particularly true when nested comprehensions are used, e.g. in the 8-queen example used in the lecture:

def queens(n: Int) = {
 def placeQueens(k: Int): Set[List[Int]] = {
   if (k == 0) Set(List()) 
   else
   for {
     queens <- placeQueens(k - 1)
     col <- 0 until n
     if isSafe(col, queens)
   } yield col :: queens
 }
 placeQueens(n)
}

3. Homework

The homework is well designed, which, as Ordersky reveals in an email to the email list, is from several TAs’ hard work. There are six projects; the skeleton of each project is already there, and each project has several holes to fill in. The primal exercise of doing these projects is thinking in types. Every function is fixed with a type signature, and the implementation I am asked to fill is usually one-liners. Placing the one-line-of-code that fits the type signature well is the primal problem in my head. After seeing six type-oriented design, I think when next time I need to design a project myself in Scala, it will come more naturally OO. (If you see my Scala email judge, you will immediately notice that I used Scala in an ML style…)

The other aspect of the homework is interesting. E.g. Representing a set as a function seems quite neat (though impractical because of efficiency), and using breadth-first-search for Bloxorz game is quite fun.

The hoemwork project also teaches test-driven development, for each project has a test suite, which is usually incomplete and requires the user to fill in more test cases.

4. Scala vs F#/C#

Qua work of art, the work of art cannot be interpreted; there is nothing to interpret; we can only criticize it according to standards, in comparison to other works of art. – T. S. Eliot, Hamlet and his problems, 1921.

This section is not about the course. But it is always interesting to compare languages, and in comparison, we learn.

My impression of Scala is that it is closer to C# than to any functional language. Scala is Java redesigned. The the core of Scala is OO. Then functional programming concepts and others are added based on its OO core. This is the same situation with C#. For language features, it seems to me that only Trait is unique in Scala, and Scala’s treat on variant is more sound. For other features, C# has them or can emulate them with little extra syntax. The fact that Scala is more complicate-designed than C# does not mean it is a better C#. I have seen discussions online arguing that Scala is too complicated and some early enthusiasts go back to Java after their projects become bigger and harder to manage.

On the functional side, Scala program can be very declarative, e.g. when using its immutable collections; however it is a another style of declarativeness which is different from ML family languages (F#/OCaml, and to some extend Haskell). In ML languages, the concept of function is everywhere – from simple function, to operators, to data constructors, functors, to even type constructors, it is a very consistent applicative style. While in Scala, from the very beginning, you think of an object, and then think about what can we do to this object. This major difference becomes more evident when you actually begin to design a non-trivial project.

If Java platform has a soundly implemented ML, I would prefer that to Scala. However there is no ML there. If cross platform, concise and functional, performance of a statics language, a lot of libraries, etc, are kept in mind, then Scala seems to be a good choice among the very few. And recently the EPFL + Stanford joint team on parallel computing DSLs using Scala seems quite interesting and useful. There are also light-weight distributed frameworks written in Scala, e.g. Spark at Berkeley.

Summary

It is always entertaining to learn a new programming language, and I learned quite a lot by following this course. Thanks to Prof. Martin Ordersky and his TAs. It would be great if there will be a new and more advanced course next year. I would be interested in:

1. Design and implementation of the Scala collection library. Very few people write collection libraries, but many programmers need to design APIs for others to use. The Scala collection library is a gold mine to learn modern API design patterns.

2. Concurrent and parallel programming in Scala, using Actor, and other DSLs.

I will still use F# primarily for my data mining projects. And when I need a Java library or serious cross platform, I can rely on Scala.

Monday, October 1, 2012

Setting OCaml/F# Environment in Emacs

For those who don’t know, I did the first 50 Project Euler problem about three years ago (to teach myself F#) and uploaded them at fsharp-euler wikispace. Since then, however, I haven’t had time to do any update.
By update, I mean writing up more explanations on the existing programs, rather than more programs for new Euler problems. Because for learning the basic syntax and core library of a functional language, solving the first 50 should be enough. Solving more problems adds little to the conquering of the language itself, though would be a great mental exercise.
During this weekend, I had the time to play with OCaml on Project Euler problems, and to set up OCaml in Windows. The following shows the Emacs screenshot.
I use Emacs with Tuareg mode and Auto-Complete mode, and I also find Tuareg’s library browser (Menu –> Tuareg –> OCaml library … ) useful. Auto-Complete mode is also automatically hooked with Tuareg mode to provide key word completion.

The completion dictionary is limited to OCaml keywords only. I used this little trick by scanning the whole library fold and get all function names using F#….

(* get all val names from ocaml lib mli files*)

let files = Directory.GetFiles(@"D:\OCaml\lib", "*.mli")

let only_names = 
    files |> Array.map Path.GetFileNameWithoutExtension
for mli in files do
    let Cap (s: string) = 
        s.Substring(0,1).ToUpper() + s.Substring(1)

            
    let onlyname = Path.GetFileNameWithoutExtension mli |> Cap
    let lines = File.ReadAllLines mli
    let funcs = 
        lines
        |> Array.map (fun (line:string) ->
            let s = line.Split(' ')
      
            if s.[0] = "val" || s.[0] = "external" then
                if s.[1].EndsWith(":") then
                    (true, s.[1].Substring(0, s.[1].Length-1))
                elif s.[1].EndsWith("(") then
                    (false, "")
                else 
                    (true, s.[1])
                    
            else 
                (false, ""))
        |> Array.filter (fun (s, _) -> s)
        |> Array.map (fun (_, f) -> onlyname + "." + f)
    
    for f in funcs do
        printfn "%s" f

By appending the output to the tuareg dict file (on my machine, it is ~\.emacs.d\elpa\auto-complete-20120922.1815\dict\tuareg-mode), we now get the nice Intellisense for OCaml libraries:

We can apply this trick to other OCaml libraries, and automate the updating of the dictionary file. Poor man’s intellisense, haha! Don’t worry about the length of the word list, the PHP list is over 100KB, and auto complete mode handles just fine!

There are other things to enhance, e.g. when we move cursor up and down along the completion choices, the corresponding function signature would be displayed at the mini buffer. But I am not familiar with elisp, so I opened the library reference buffer on the right window to check the signatures occasionally. For most of them, F# and OCaml have the same usage patterns, e.g. the main object is usually the last parameter of a function, and seeing the function name alone gives me sense of how to use it.

In Windows, we have the great Visual Studio IDE, but if you use F# in Linux, using F#-mode and auto complete mode will be handy.

Tuesday, September 18, 2012

Another tooling discussion for data mining, F#, OCaml, and Python.

As an F# programmer, I always keep an eye on its cousin, OCaml. OCaml and F# share a large portion of syntax, and an average F# programmer can pick up basic OCaml very quickly, say in one afternoon. And recently, the development of OCaml seems accelerated – exciting language features are added to the new releases: first-class modules (in 3.12) and Generalized abstract data types (GADTs) (in 4.0). And for a good comparison in features of the two languages, please refer to this SO question.

Today, I’d like to start a short discussion on their application on a specific area, data mining.

The first question the reader may wonder is that “why do you want to use OCaml for data mining?” Yes. If F# makes everything convenient, there would less point for me to consider other languages. The main point for F# is that its cross platform support is poor. Maybe the language itself is fine as there is the Mono tool chain for F#, but the whole ecosystem is not cross-platform friendly. There is always one or two .Net libraries that are written specific for Windows, to name two libraries that I use: R.Net and Sho. Besides the library part, the Mono runtime has poor performance for the .net parallel library, and sometimes may crash. When I was in Microsoft Research Asia, we don’t have any cross-platform issues – every server is powerful with more than 12 cores and 32GB memory and is well installed with Visual Studio and other other tools. When back to HKUST, most of the servers are installed Linux for no good reasons (Win Server licenses are very cheap for our school, the IT guys are simply not experienced enough to make Win servers secured.)

With such constraints, we do have successful stories with F#. Recently I led a team winning the semantic place predict task in Nokia Mobile Data Challenge. Details here. The feature extraction program is written in F# and C#, and we can generate all features (from simple counting to FFT) from 50GB of text data within two hours! Thanks to F#’s asynchronous programming and other functional features!

But after that, I think we really need a better Windows server, or we have to use a language that is more friendly with Linux (at the same time, runs OK with Windows for I use Windows for development).

Currently I use Python. I write scripts in my laptop and when it works correctly on a small dataset, I put the program to the server to get it running on the full dataset, and usually the processing costs one hours to one night. Python is easy to write, and as it claims, easy indeed to read. But at the language level, cod reuse in a dynamic language is generally harder than a static language. The interface is too flexible and I usually dive into the struggle of which-one-to-choose.

On the other side, static language encourages refactoring and provides static types to facilitate code reuse. Consider the following F# code snippet:

    // a common pattern for accessing Stackoverflow csv file, question by question
    let csv_iter (csvfile: string) (mapper: CsvReader -> string option) (mapfile: string) = 
        use csv = new CsvReader(new System.IO.StreamReader(csvfile),true)
        let headers = csv.GetFieldHeaders()
        let fieldCount = csv.FieldCount
    
        use sw = new StreamWriter(mapfile)
        use bad = new StreamWriter(mapfile + ".badids")
        while csv.ReadNextRecord() do
            match mapper csv with
            | Some text -> sw.WriteLine(text)
            | None -> 
                bad.WriteLine(csv.[0])
                sw.WriteLine()

        sw.Close()
        bad.Close()

From the function signature alone, one may guess what this function does. Basically, it process each record in a hugh csv file by a mapper function. The result of the function could be None, which means some errors occurs inside the mapper, or Some string that is a normal result. And all the results are stored in a file line by line. Let’s focus on the mapper function here, we have string option type in the function signature, which means we have a more specific requirement on the mapper than only requiring it to return a string. We have encoded the logic into the types. This is actually the main point of functional programming – types say more things than other languages. If we write this function in Python, the mapper would still return a string, however, the logic that the string can be nil is encoded in the implementation, but not in the type signature. In this case, we have to resort to natural language to comment the Python function that if the mapper gets error, please return None. And this comment is obviously less strict than types.

The above code is only the first version of this csv iterator; several enhancements can be added, e.g. we can define a more detailed type for mapper result, other than to use string option, which is too simple:

    type MapResult = 
        MapError of string * string
        | NormalResult of string

Now, the error has two parts: the error message and the best result string the mapper can get, which is set to empty string in the above csv_iter. In the stackoverflow example, if I cannot parse the question body, maybe I can use its title as a resort.

Say this is my old mapper, which gets the pure text from a question body, excluding its code and other annotations :

    let mapper_getbody_textonly (csv: CsvReader) = 
        try
            let feat = createZero()
            let markdoc = csv.[7] |> Markdown.Parse
            calcTopParagraphs feat markdoc
            Some (
                let text = feat.PlainString.ToString()
                text.Replace('\n', ' ').Replace('\r', ' ')
                )
        with _ ->
            None

Now I have to change the code a little bit to make it compile:

    let mapper_getbody_textonly (csv: CsvReader) = 
        try
            let feat = createZero()
            let markdoc = csv.[7] |> Markdown.Parse
            calcTopParagraphs feat markdoc
            NormalResult (
                let text = feat.PlainString.ToString()
                text.Replace('\n', ' ').Replace('\r', ' ')
                )
        with _ ->
            MapError ("markdown parser error", csv.[0])

In Python, the compiler will not tell you that you need to change the mapper! So you have to keep in mind a list of old mappers and change them one by one to suit the new interface. In F#, it is required and you don’t need to keep the mapper list as the compiler will tell you to change once you use an old one with the new interface.

With strict types, we are forced and helped by the compiler to change and refactor our code gradually. This is the main point I want to make about code reuse.

I keep 80% code in F# as long as my Windows laptop can run it. When it does not, say the processing costs more than 6 hours, I will write the rest 20% of code in Python and put it on the Linux server. This is my current resort.

So this is my little complain on Python, or on the dynamic languages in general. Let’s get back to OCaml at last. OCaml is great functional language. Among the practical functional languages, I think Haskell is the only one that is more interesting than OCaml in the functional aspect. However, Haskell is definitely not ready for large scale data mining tasks – its memory usage is unacceptable not to mention performance issues. So I was serious on using OCaml for my current task and writing all code in OCaml.

But the librariy availability on OCaml ecosystem bites me! I could not find a CSV library that supports streaming access and Unicode. I did find a simple Markdown parser in OCaml, but it parses a lot of SO Markdowns wrongly. Data mining practitioners usually depend a lot of tools for theirs tasks at hand, from solid CSV and database libraries, to linear algebra libraries, to NLP tagger, etc.. In this aspect, .Net and Python have better libraries than OCaml does.

So this ends my story in using OCaml this time.

I will probably come to OCaml in the future, and will try it for several times, I think. It is really a good language, and is more clean than F# in the functional aspect, e.g. .Net objects can still be null in F#, but there is really no null in OCaml. If everything in a project is better written from scratch, then OCaml can be a very good candidate.