Tuesday, July 9, 2013

Unicode Tips in Python 2 and R

Most of the time, I don't need to deal with different encodings at all. When possible, I use ASCII characters. And when there is a little processing of Chinese characters or other Unicode characters, I use .Net or JVM languages, in which every string is Unicode and, of course, characters are displayed as characters (not as unreadable escaped strings or Unicode code points).

Recently, however, I have been working on projects involving Chinese Weibo data, and I have run into some Unicode problems when using Python and R. (I use Python for data crawling and processing, and R for modeling and visualization.)

This post is a note I am starting today; I will update it as I encounter new Unicode problems…

Python 2.7

Lesson 1: Normal strings and Unicode strings are two different types. Do not mix them.

When writing Unicode string literals, we put the prefix u before the string:

>>> type(u'')
<type 'unicode'>
>>> type('')
<type 'str'>

Notice that the types of the two strings are different: one is 'unicode' and the other is 'str'. Python is a dynamic language and sometimes does smart things across the two types. But we should use Python as if it were a strictly statically typed language: never mix the two types, and when needed, do the conversions explicitly. See the following example:
>>> u'abcd'.find('b')
1
>>> u'你好'.find('好')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    u'你好'.find('好')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xba in position 0: ordinal not in range(128)


The first find operation works fine because it involves no non-ASCII characters. The second fails because the two types do not work with each other: Python implicitly tries to decode the byte string '好' to unicode with the default ascii codec, and the decoding fails on the non-ASCII bytes.
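A minimal sketch of the explicit conversions (assuming the source file is saved as UTF-8, so the byte-string literal holds UTF-8 bytes):

    # -*- coding: utf-8 -*-
    s = '你好'                     # str: the raw UTF-8 bytes of the two characters
    u = s.decode('utf-8')          # explicit str -> unicode
    print u.find(u'好')            # 1, both operands are now unicode
    print u.encode('utf-8') == s   # True, the explicit unicode -> str round trip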



The error in this example seems easy to avoid. However, when you load a set of text files from different sources, you may end up mixing Unicode strings and normal strings without noticing.



Lesson 2: When passing a Unicode string to a function not written by you, check carefully whether that function supports Unicode.



It seems that if we keep all strings as Unicode strings in Python 2.7, we won't have any problems. We can add the u prefix to every non-ASCII string literal in the file, and use the codecs module to read Unicode files:



    with codecs.open(unicode_text_file, 'r', encoding='utf-8') as f:
        # read the file; all strings you get are of unicode type


Everything seems fine until we call functions/packages that were written only for normal strings, and we wish they worked for Unicode strings. For example, the csv module in the official Python 2.7 release does not work with Unicode files. You cannot do things like:



    with codecs.open(file_profile, 'r', encoding='utf-8') as f:
        reader = csv.reader(f)
        head = reader.next()


The third line throws a UnicodeDecodeError just as above. This is the pain point in Python 2.7 – you cannot expect all the common libraries that work nicely with normal strings to also work nicely with Unicode strings, and a script that used to work on ASCII files can suddenly fail on a Unicode file.
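A common workaround is to feed the csv module plain UTF-8 bytes and decode each field afterwards, instead of handing it a codecs file object (a sketch along the lines of the recipes in the Python 2.7 csv documentation, assuming the file is UTF-8):

    import csv

    def utf8_csv_rows(path):
        # csv handles UTF-8 encoded bytes fine as long as it never sees
        # unicode: open in binary mode and decode field by field.
        with open(path, 'rb') as f:
            for row in csv.reader(f):
                yield [cell.decode('utf-8') for cell in row]

    rows = utf8_csv_rows(file_profile)
    head = next(rows)   # the header row, now a list of unicode strings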



For library writers, it is a pain too: sometimes they need to write a special version for Unicode. I once worked with a human name parser in Python. European names can have accented letters, but that library only accepts ASCII strings. To use the parser, I had to:



1. Convert the name string into a normal Python string by removing all accents.

2. Parse it using the library.

3. Convert the result back to a Unicode string using the .decode method.
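For steps 1 and 3, the usual trick is Unicode normalization; here is a minimal sketch with the standard unicodedata module (the parse_name call is hypothetical, standing in for the library):

    import unicodedata

    def strip_accents(name):
        # NFKD splits an accented letter into base letter + combining mark;
        # encoding with 'ignore' then drops the marks: u'José' -> 'Jose'
        return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore')

    ascii_name = strip_accents(u'José')   # step 1: plain str for the library
    parsed = parse_name(ascii_name)       # step 2: hypothetical parser call
    result = parsed.decode('ascii')       # step 3: back to unicode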



R



Lesson 1: In R, there is only one string type, character. This type can store anything that can be represented as a byte array. Think of R's character type as C's char[] and nothing more.



In Python, we have two types for two kinds of strings. In R, we have only one type for all kinds of strings.



My first problem was displaying Unicode characters in R. R's default GUI and command line console cannot even display Chinese on a non-Chinese Windows. (On a Chinese Windows, you can only get a Chinese-language version of R. Chinese characters are fine, but all info/error messages are also in Chinese, translated from English, and the translations are just weird Chinese.)



When R is installed, it picks up the system encoding. On Linux and Mac OS, the system encoding is usually UTF-8, and R uses that. Windows is different, however; the following is my R session info:



> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252


R's default GUI then uses that encoding (here, Windows-1252) to display all strings. Of course, it fails on most Unicode characters.



Luckily, the editor and the console in RStudio can display Chinese characters; RStudio must be doing something smart:



> as.character(china_mapdata$NAME[1])
[1] "黑龙江省"
> "黑龙江省"
[1] "黑龙江省"
> as.character(china_mapdata$NAME[1]) == "黑龙江省"
[1] FALSE


The string in the data frame displays the same as the literal (the literal is represented as Unicode in the RStudio console), yet the two are not equal. Two strings that display the same can differ in their internal representations:



> nchar("黑龙江省")
[1] 4
> nchar(as.character(china_mapdata$NAME[1]))
[1] 8


Both are of type character. However, the representations are different: the literal is stored as UTF-8 (4 chars * 3 bytes/char = 12 bytes) in memory, while the string read from the data file is stored in a 2-bytes-per-character encoding (4 chars * 2 bytes/char = 8 bytes), which is also why nchar reports 8 for it:



> charToRaw("黑龙江省")
[1] e9 bb 91 e9 be 99 e6 b1 9f e7 9c 81
> charToRaw(as.character(china_mapdata$NAME[1]))
[1] ba da c1 fa bd ad ca a1


It turns out the Chinese characters in the file from which I load the data frame use the GB2312 encoding. Because it is a binary file, I don't have a simple way to change its encoding in place, but here is the workaround I found:



# First write the data frame to disk
write.csv(china_mapdata, file = 'china_map.csv')
# In EmEditor, open it as GB2312, and SaveAs UTF-8
# Load the utf-8 file
china_mapdata <- read.csv('china_map.utf8.csv', encoding='UTF-8')
# Test
as.character(china_mapdata$NAME[1]) == "黑龙江省" # should be TRUE now
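Since Python is already part of my pipeline, the manual EmEditor step can also be scripted (a minimal sketch, assuming the csv really is GB2312):

import codecs

# rewrite china_map.csv from GB2312 to UTF-8
with codecs.open('china_map.csv', 'r', encoding='gb2312') as src:
    with codecs.open('china_map.utf8.csv', 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(line)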


After converting all the Chinese characters to Unicode, I can now follow the map example here (the author of that post, I believe, uses a Chinese Windows and therefore does not hit my problem; every system except a Chinese Windows will run into this encoding problem, though):



ggplot(zhejiang, aes(x = long, y = lat, group = group, fill = NAME)) +
  geom_polygon(fill = "beige") +
  geom_path(colour = "grey40") +
  ggtitle("中华人民共和国浙江省") +
  geom_point(x = 120.12, y = 30.16, fill = FALSE) +
  annotate("text", x = 120.3, y = 30, label = "杭州市")


But when exporting the plot as a PDF, the Chinese characters do not display correctly:



[image: PDF output with the Chinese characters missing]



After searching online, I found the solution in this SO question: specifying the font explicitly solves the problem:



cairo_pdf("example.pdf", family = "FangSong")
ggplot(zhejiang, aes(x = long, y = lat, group = group, fill = NAME)) +
  geom_polygon(fill = "beige") +
  geom_path(colour = "grey40") +
  ggtitle("中华人民共和国浙江省") +
  geom_point(x = 120.12, y = 30.16, fill = FALSE) +
  annotate("text", x = 120.3, y = 30, label = "杭州市")
dev.off()


The correct PDF output:



[image: PDF output with the Chinese characters displayed correctly]



We can also change the font to other Chinese fonts; refer to their names here.



Some R functions can recognize Unicode strings, e.g. Encoding. I think the recognition is based on the encoding flag R attaches to each string. But GB2312 strings are not recognized (Encoding outputs 'unknown' for them). Magically, RStudio on Windows (English version, locale set to Simplified Chinese) recognizes both kinds of strings and displays them correctly.



Lesson 2: When dealing with Unicode in R, use a Chinese Windows (for Chinese text only), or use Linux/Mac OS (which default to UTF-8); otherwise you cannot display Unicode well.



See how Unicode strings may be displayed in R's console (English Windows):



> head(zhejiang)
    X.U.FEFF.  AREA PERIMETER BOU2_4M_ BOU2_4M_ID ADCODE93 ADCODE99                     NAME
223       222 9.277    26.825      224         33   330000   330000 <U+6D59><U+6C5F><U+7701>
224       223 0.000     0.103      225       1626   330000   330000 <U+6D59><U+6C5F><U+7701>
225       224 0.000     0.052      226       1634   330000   330000 <U+6D59><U+6C5F><U+7701>


The <U+xxxx> symbols indicate that R knows they are Unicode characters but cannot display them. These symbols appear in your plots as well, which is just meaningless. If R cannot display the characters as characters, how can we know what they are, let alone use them in a plot?



My solution to this problem is to use Linux with a UTF-8 locale. This is the session info in my Linux virtual machine:



> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C


Summary



Conceptually, Unicode in Python and R is quite simple. In Python, Unicode strings and normal strings have different types. In R, all strings are represented as byte arrays (the good old C char[]!), so from the type alone you cannot tell whether a string is Unicode. I think Python's two-type design is good because the runtime exceptions force the programmer to separate normal strings and Unicode strings clearly (though it is a headache, which is why Python 3 treats all strings as Unicode strings).



What adds to the complexity is the functions that manipulate Unicode strings, and especially the display of Unicode characters on screen. A Unicode character is, after all, just 2-4 bytes in memory, nothing special. When using Python and R for data analysis, we usually work in a REPL (read-eval-print loop), where displaying Unicode as escaped strings is meaningless; one cannot even tell whether a character is Chinese by reading the escaped string! What's worse, different terminals may display the same character differently. For example, in a Unicode-enabled Python console:



>>> print u'你'
你
>>> u'你'
u'\u4f60'


In a non-Unicode console:



>>> u'你'
u'\xc4\xe3'
>>> print u'你'
Äã
>>> print u'\u4f60'

>>> u'你' == u'\u4f60'
False
(In this example, the problem is not that the terminal cannot display Unicode characters; it is that the characters typed into the terminal do not reach Python as Unicode!)
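If you must work in such a console, the fix is to read bytes and decode them with the console's encoding explicitly (a sketch; note that sys.stdin.encoding can be None when input is piped):

import sys

raw = raw_input('Enter a name: ')       # str: bytes in the console's encoding
enc = sys.stdin.encoding or 'utf-8'     # fall back if Python cannot detect it
name = raw.decode(enc)                  # a proper unicode object, e.g. u'\u4f60'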


My suggestion for the terminal and editor is:



Use a Unicode terminal and a Unicode text editor when working with Python and R. For example, RStudio is Unicode-capable while Rgui.exe isn't; the PyDev plugin and PyScripter are, while the default IDLE isn't.



And in Python, always put



# -*- coding: utf-8 -*-


at the beginning of the file, and save all source files as UTF-8.
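A minimal sketch of what the declaration buys you (assuming the file really is saved as UTF-8):

# -*- coding: utf-8 -*-
s = '你好'      # str literal: kept as the raw UTF-8 bytes
u = u'你好'     # unicode literal: decoded at compile time using the declared coding
print len(s)   # 6  (bytes)
print len(u)   # 2  (characters)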

Monday, January 28, 2013

A simple GUI tool to help find the right people

 

Today our professor asked us to find a list of researchers who would be suitable program committee (PC) members for a big data conference:

2013 IEEE International Conference on Big Data (IEEE BigData 2013)

IEEE BigData is a brand new conference and does not have a "circle" yet, which makes finding PC members harder than for more established conferences. Our solution is to go through the PC lists of other major conferences such as WWW/KDD and find the researchers who are working on big data.

After counting the number of PC members of the SIGIR conference, I knew I needed a tool to help me: SIGIR'11 has 223 researchers on its program committee! Think about repeating the following tasks 223 times:

[image: the manual lookup workflow for one researcher]

I want to automate the following tasks:

[image: the tasks to automate]

 

There are two tasks that are hard:

1. Getting email addresses automatically. Some researchers simply don't put their email address on their webpage; to find it, one has to open one of their research papers.

2. Deciding whether a researcher is related to big data. We could formulate this as a binary classification task, but that requires training data, so we fall back to human labeling. Currently I define a set of keywords that indicate big data:

let keywords =
    [ "large"
      "scale"
      "big"
      "data"
      "hadoop"
      "mapreduce"
      "distributed"
      "system"
      "cloud"
      "computing"
      "parallel"
    ]

and for each home page, I list the subset of keywords that occur in it. If no keyword appears on a researcher's home page, it is unlikely that the researcher is doing anything related to big data. The keyword filter thus serves as a helper for the human making the decision.

I wrote an F# program; the code can be found on BitBucket. Here is a screenshot of the running program:

[image: screenshot of the running program]

 

When I click "WANT THIS", the program appends a record to the log file.

There are two windows: one for the Google search and the other for the page of the first search result. I keep both because occasionally the first result is not the right one; I can then click the other search results.

Some technical details:

1. A WebBrowser in F# in only four lines (plus an open)!

open System.Windows.Forms

let form = new Form(Visible=true, Text="Google Search Result")
let content = new WebBrowser()
content.Navigate("http://www.google.com")
form.Controls.Add(content)

2. Finding relative paths in an F# script (this makes the experiment easier to reproduce)

let folder = __SOURCE_DIRECTORY__ + @"\..\..\data\"

let logfile = folder + "log.csv"
let namelistfile = folder + "sigir10.txt"

3. Details in WebBrowser class

content.DocumentCompleted.Add(fun _ -> (* do something when the page is fully loaded *) ())

Tuesday, December 11, 2012

A reading list on F# and other ML languages

 

Yin Zhu

This page will be updated regularly. Last updated on 11 Dec 2012.

 

For F#, Don Syme has posted a blog page:

How to reference F# in a research paper?

Most of the papers therein are, of course, written by Don himself, the creator of F#.

The above page is about F#, particularly the novel points in F#, e.g. active patterns, async IO, quotations, etc. But the most powerful part of F# is probably its ML heritage. To get a full and authoritative review of the core of F#, we must go back to the original papers on these ideas. This also motivated me to read some of those papers. A few of them I understand quite well. However, as I haven't had any formal training in programming languages (nor do I intend to; data mining is my primary job!), I only have a shallow understanding of most of them. So I set up this page to keep track of interesting papers that I have read and want to re-read.

General ML

Paul Hudak. The conception, evolution, and application of functional programming languages. ACM Computing Surveys. 1989. pdf

A comprehensive survey, with a tutorial on the lambda calculus and its relationship to FP languages. Note the year it was published: Haskell was not mature at that time.

Andrew Appel. A Critique of Standard ML. 1992. pdf

This paper gives a detailed review of ML language features, mostly from a user's perspective. The concerns Appel raises are what a typical ML programmer would raise, e.g. the efficiency of the low-level implementation of datatypes.

David B. MacQueen. Reflections on Standard ML. 1993. pdf

This paper is another review of ML, but from a language designer's perspective.

Module systems

Robert Harper and Benjamin C. Pierce. Design Considerations for ML-Style Module Systems. Chapter 8 in Advanced Topics in Types and Programming Languages. MIT. 2005. book link

An up-to-date review of the ML module system. It starts with a very general introduction, independent of any concrete implementation. At the end, it compares different implementations (SML and OCaml) and briefly compares the ML module system to C and Java. This chapter does not depend on the rest of Pierce's types book and can be read alone.

David MacQueen. Modules for Standard ML. 1984. link

The module extension to ML was primarily designed by David MacQueen, and this is the first paper on ML modules. MacQueen wrote several subsequent papers on ML modules, e.g. on implementation issues and on separate compilation of SML modules.

Derek Dreyer. Understanding and Evolving the ML Module System. PhD thesis. 2005. pdf

To learn to use a concrete ML module system, the best way is probably to read tutorials and code. For OCaml, check Jason Hickey's introduction to Objective Caml, the OCaml manual, and many OCaml library designs (Map, Set, and particularly Jean-Christophe Filliâtre's ocamlgraph).

Ad-hoc polymorphism

ML does not support typeclasses/ad-hoc polymorphism. Attracted by the elegance of Haskell typeclasses, many ML programmers want to use them in ML. Without language support, ML programmers have to mimic typeclasses; e.g. dictionary passing is used to implement the F# PowerPack's matrix library.

Stefan Wehr and Manuel M. T. Chakravarty. ML Modules and Haskell Type Classes: A Constructive Comparison. 2008. pdf

It implements ML's full module system using Haskell typeclasses, and implements simple typeclasses using ML modules. Constructor typeclasses are not implemented; everyone says that is hard or impossible (but I haven't seen an explanation of why on the web…).

Jeremy Yallop. Practical Generic Programming with OCaml. ML workshop’07. slides.

Another implementation of typeclasses (again, simple typeclasses), in OCaml.

John Peterson and Mark P. Jones. Implementing Type Classes. PLDI’93. paper.

Philip Wadler and Stephen Blott. How to make ad-hoc polymorphism less ad hoc. POPL’88. paper.

The original typeclass paper actually suggests a simple implementation using explicit dictionary passing, and this technique, a poor man's ad-hoc polymorphism, can be used in ML.

The real power of Haskell comes from constructor typeclasses: Functor, Monad, etc. are all based on them. (I know how to use them in simple cases; however, I have no idea how they are implemented or why they cannot be implemented on .Net. People keep saying that "to support higher-order typeclasses, the IL of .Net needs to be changed", but they never give any clue as to why…)

Mark P. Jones. A system of constructor classes: overloading and implicit higher-order polymorphism. FPCA '93. paper.

Inside ML polymorphism

From a programmer's view, ML polymorphism is quite easy to understand and use. However, normal programmers have little knowledge of its underlying theory and implementation details. Also, .Net and Java generics are heavily influenced by ML polymorphism (the core designers came from the FP community). If one wants to understand generics in Java or .Net thoroughly, understanding ML polymorphism seems a good prerequisite.

Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences. 1978. pdf

This paper is the second most cited paper in the ML literature (the first being the definition of Standard ML). The breakthrough in this paper, plus the ML module system, gives ML its modern shape. The Hindley-Milner type inference algorithm is also described in this paper.

Gilad Bracha, Martin Odersky, David Stoutamire and Philip Wadler. Making the future safe for the past: Adding Genericity to the Java Programming Language. OOPSLA'98. pdf

Andrew Kennedy and Don Syme. Design and Implementation of Generics for the .NET Common Language Runtime. PLDI'01. pdf

These two papers are not directly related to ML. But since Java and .NET are the two major programming platforms, understanding their implementations of generics is valuable for programmers today.

Courses and tutorials

The following courses are taught by active researchers in this field. The instructors are not only good teachers of functional programming, but also successful researchers.

Cornell CS3110. Data Structures and Functional Programming. Uses OCaml. As the title says, data structures and OCaml are taught in an interleaved way. Well designed assignments. More theoretical material, e.g. fixpoints.

Harvard CS51. Introduction to Computer Science II: Abstraction and Design. Uses OCaml. Introductory course. Half OCaml and half Java. Focused on core programming abstractions, e.g. higher-order functions, modules, data types, lazy, objects, parallelism.

IT University of Copenhagen. Programs as data. Uses F# to implement a small compiler. Focuses on compiler implementation. At the end of the course, it also introduces Scala and advanced abstractions, e.g. continuations and monads.

Yale CS421. Compilers and Interpreters. Uses SML to teach compilers.

OCaml tutorials/books.

Tuesday, November 13, 2012

Reflection on Martin Odersky’s Scala Course at Coursera

Some F# news first:

Our team, F# and Composite Networks, took 4th place in the Stackoverflow closed-question prediction competition. Check out our beautiful visualization, "mapping the world's programmers", whose location parser is written in F# and is open sourced. I will write more about this Stackoverflow project in future posts, but today is about Scala…

--
Today I finished the last assignment of the Scala course at Coursera (code repository link removed due to course restrictions). I must say this course is very well designed! As I am already familiar with F#, walking through the course was easy. Still, a few things were new to me, or things of which I previously had only a vague understanding. So after this course I really feel I have learned something, which is why I'd like to write this reflection and share my fresh impressions.

Before this course, my only experience with Scala was an email judge program, which periodically scans an email account, receives new submissions from students, and auto-scores them. I used Scala because the Java platform has a quite nice email library with example code for Gmail, and also because I like to play with new languages. The realtime scoreboard for that course is still online; it was really cool to watch the leaderboard as students made submissions. The source code is also in the repository. The program was written in one day (with copy&paste from a few snippets online). Scala is friendly to new users: on day one I was already productive in it.

In the following, I summarize the examples used in the course, the unique language features of Scala, the homework, and a short comparison to F# and C#.

1. Examples

The lectures, though mainly on simple materials, are well designed and well explained. Some of the classical examples are presented in the Scala language:

A. Newton's method for calculating square roots. Used as an example of higher order functions and of stream/lazy evaluation. Later used to illustrate the concept of a fixed point, though Odersky didn't go as far as fixed points in the lambda calculus.

B. A rational number class. Numbers are fundamental types in any programming language, but they are usually built-in, with their implementations hidden in the compiler or standard library. Implementing a rational number type is a great exercise when learning any language: from the number type, one can go as deep as classes/objects, comparison, typeclasses, passing operation dictionaries, etc.

C. Implementing functions as objects. The general idea is simple. What's not simple is variance.

abstract class Function1[-S,+T] { 
  def apply(x: S): T 
}

There are rules and design patterns for when to make a type parameter covariant/contravariant/invariant. In Scala, the type checking for variance is much stronger than in Java. This also adds complexity to the Scala type system and makes learning more involved; on the other hand, a sophisticated type system gives more power to library designers. I will talk about this later.


D. Using lists and binary search trees to implement IntSet. Scala does not have ML-style datatypes, so the implementation follows Scala's OO style.


E. The expression problem. This is a classical problem for discussing functional design vs OO design. The best discussion I have found is Ralf Lammel's C9 lectures.


F. Fibonacci sequence and prime number sequence.


G. Classical search problems: n-queens for depth-first search (DFS) and pouring water for breadth-first search (BFS).

Basically, every example is implemented in several Scala ways, to show different language features from simple to more advanced. For example, the Fibonacci sequence first appears as a simple recursive function and later as a Stream, an example of lazy evaluation/computation on demand.

The implementations of the examples are quite OO, and traits are heavily used throughout; indeed, you cannot avoid OO in Scala, for everything concrete is an object.

2. Concepts

Besides the examples, I learned more about the following concepts:

A. Dynamic dispatch vs higher order functions. This is an exercise left at the end of one lecture. Odersky didn't discuss it, but it is quite interesting to think about. What I can come up with is that the universal methods associated with any object, .toString/.hashCode/.compare, can instead be supplied as higher order functions.

Say that in an OO design we have a Rational class that implements an Order interface (in Scala-like pseudocode):

class Rational extends Order { override def compare(other: Rational): Int = ... }

In functional design, we would have (immutable) data and functions defined for them:

module Rational =
  type t = int * int
  let compare (a1, b1) (a2, b2) = a1 * b2 - a2 * b1

When implementing a sort function, the OO approach depends on dynamic method dispatch, while the functional design takes a higher order function:

let sort (compare: Rational.t -> Rational.t -> int) (xs: Rational.t list) = ...

The problem also reminds me of dictionary passing style for emulating type classes.


B. Subtyping vs generics, and variance. (Lectures 4.4 & 4.5.) I used to be confused about covariance and contravariance in C# and Java (there are major differences between the two). Variance becomes a problem when OO-style inheritance marries FP-style generics, and Scala definitely has a very good solution in its language design. After the lecture and some more reading, I am clearer on it. But I think what I really need is to read through the design of Scala's standard library to get more real-world examples and design lessons.

Variance + implicits are very powerful abstraction tools, though the latter seems ad hoc.


C. By value, by name, and lazy. The lecture on Stream is amazingly clear, and it cleanly differentiates the three evaluation strategies: by value, by name, and lazy.


D. For comprehensions are translated into the map/filter/flatMap higher order functions. For comprehensions are easier to read than chains of higher order functions; for example, Python, the language that keeps readability at its heart, encourages comprehensions over map/lambda syntax (see the Python sketch after the Scala code below). This is particularly true when nested comprehensions are used, e.g. in the n-queens example from the lecture:

def queens(n: Int) = {
  def placeQueens(k: Int): Set[List[Int]] = {
    if (k == 0) Set(List())
    else
      for {
        queens <- placeQueens(k - 1)
        col <- 0 until n
        if isSafe(col, queens)
      } yield col :: queens
  }
  placeQueens(n)
}
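The same contrast is easy to show in Python; a minimal sketch comparing a nested comprehension with the map/filter/lambda chain it corresponds to:

import itertools

# nested comprehension: ordered pairs with distinct coordinates
pairs = [(x, y) for x in range(3) for y in range(3) if x != y]

# the corresponding map/filter/lambda chain, much harder to read
pairs2 = list(itertools.chain.from_iterable(
    map(lambda x: map(lambda y: (x, y),
                      filter(lambda y: x != y, range(3))),
        range(3))))

assert pairs == pairs2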


3. Homework

The homework is well designed, which, as Odersky revealed in an email to the mailing list, is the result of several TAs' hard work. There are six projects; the skeleton of each is provided, and each has several holes to fill in. The primary exercise in these projects is thinking in types: every function comes with a fixed type signature, and the implementation to fill in is usually a one-liner, so finding the line of code that fits the type signature is the main problem to solve. After seeing six type-oriented designs, I think that next time I design a project in Scala myself, the OO style will come more naturally. (If you look at my Scala email judge, you will immediately notice that I used Scala in an ML style…)

The other aspect of the homework is that it is fun. E.g., representing a set as a function is quite neat (though impractical for efficiency reasons), and using breadth-first search to solve the Bloxorz game is quite entertaining.

The homework also teaches test-driven development: each project has a test suite, usually incomplete, and the student is asked to fill in more test cases.


4. Scala vs F#/C#


Qua work of art, the work of art cannot be interpreted; there is nothing to interpret; we can only criticize it according to standards, in comparison to other works of art. – T. S. Eliot, Hamlet and his problems, 1921.

This section is not about the course. But it is always interesting to compare languages, and in comparison, we learn.

My impression of Scala is that it is closer to C# than to any functional language. Scala is Java redesigned: the core of Scala is OO, with functional programming concepts and other features added on top of that OO core. This is the same situation as with C#. Among the language features, it seems to me that only traits are unique to Scala, along with its sounder treatment of variance; the other features C# either has or can emulate with a little extra syntax. The fact that Scala's design is more complicated than C#'s does not make it a better C#. I have seen online discussions arguing that Scala is too complicated, and that some early enthusiasts went back to Java after their projects became bigger and harder to manage.

On the functional side, Scala programs can be very declarative, e.g. when using the immutable collections; however, it is another style of declarativeness, different from that of the ML family (F#/OCaml, and to some extent Haskell). In ML languages, the concept of a function is everywhere: simple functions, operators, data constructors, functors, even type constructors; it is a very consistent applicative style. In Scala, from the very beginning, you think of an object and then think about what can be done with that object. This major difference becomes more evident when you begin to design a non-trivial project.


If the Java platform had a soundly implemented ML, I would prefer it to Scala. However, there is no ML there. If cross-platform support, a concise functional style, the performance of a static language, a wealth of libraries, etc. are what you need, then Scala seems a good choice among the very few. Recently, the joint EPFL + Stanford work on parallel computing DSLs in Scala also looks quite interesting and useful, and there are lightweight distributed frameworks written in Scala, e.g. Spark at Berkeley.


Summary


It is always entertaining to learn a new programming language, and I learned quite a lot by following this course. Thanks to Prof. Martin Odersky and his TAs. It would be great if there were a new, more advanced course next year. I would be interested in:

1. The design and implementation of the Scala collection library. Very few people write collection libraries, but many programmers need to design APIs for others to use, and the Scala collection library is a gold mine for learning modern API design patterns.

2. Concurrent and parallel programming in Scala, using actors and other DSLs.

I will still use F# primarily for my data mining projects. And when I need a Java library or serious cross-platform support, I can rely on Scala.

Monday, October 1, 2012

Setting OCaml/F# Environment in Emacs

For those who don't know, I did the first 50 Project Euler problems about three years ago (to teach myself F#) and uploaded the solutions to the fsharp-euler wikispace. Since then, however, I haven't had time to update it.
By update, I mean writing more explanations for the existing programs rather than programs for new Euler problems: for learning the basic syntax and core library of a functional language, solving the first 50 should be enough. Solving more adds little to mastering the language itself, though it would be a great mental exercise.
This weekend, I had time to play with OCaml on Project Euler problems and to set up OCaml on Windows. The following shows an Emacs screenshot.
I use Emacs with Tuareg mode and Auto-Complete mode, and I also find Tuareg's library browser (Menu -> Tuareg -> OCaml library …) useful. Auto-Complete mode is automatically hooked into Tuareg mode to provide keyword completion.

[image: Emacs with Tuareg and Auto-Complete modes editing OCaml]

The completion dictionary is limited to OCaml keywords only, so I used a little trick: scan the whole library folder and extract all the function names, using F#…

(* get all val names from ocaml lib mli files *)
open System.IO

let files = Directory.GetFiles(@"D:\OCaml\lib", "*.mli")

for mli in files do
    // capitalize the file name to get the module name, e.g. "list" -> "List"
    let Cap (s: string) =
        s.Substring(0,1).ToUpper() + s.Substring(1)

    let onlyname = Path.GetFileNameWithoutExtension mli |> Cap
    let lines = File.ReadAllLines mli
    let funcs =
        lines
        |> Array.map (fun (line: string) ->
            let s = line.Split(' ')
            // keep "val name ..." and "external name ..." declarations
            if s.Length > 1 && (s.[0] = "val" || s.[0] = "external") then
                if s.[1].EndsWith(":") then
                    (true, s.[1].Substring(0, s.[1].Length - 1))
                elif s.[1].EndsWith("(") then
                    (false, "")   // skip operator definitions like ( + )
                else
                    (true, s.[1])
            else
                (false, ""))
        |> Array.filter (fun (keep, _) -> keep)
        |> Array.map (fun (_, f) -> onlyname + "." + f)

    for f in funcs do
        printfn "%s" f



By appending the output to the tuareg dict file (on my machine, ~\.emacs.d\elpa\auto-complete-20120922.1815\dict\tuareg-mode), we now get nice Intellisense for the OCaml libraries:





[image: auto-completion popup listing OCaml library functions]


We can apply this trick to other OCaml libraries and automate the updating of the dictionary file. A poor man's Intellisense, haha! Don't worry about the length of the word list: the PHP dictionary is over 100KB, and Auto-Complete mode handles it just fine!


There are other things that could be enhanced, e.g. displaying the corresponding function signature in the minibuffer as the cursor moves up and down the completion choices. But I am not familiar with elisp, so instead I keep the library reference buffer open in the right window to check signatures occasionally. For most functions, F# and OCaml have the same usage patterns, e.g. the main object is usually the last parameter of a function, so seeing the function name alone gives me a sense of how to use it.

On Windows we have the great Visual Studio IDE, but if you use F# on Linux, F# mode plus Auto-Complete mode will be handy.

Tuesday, September 18, 2012

Another tooling discussion for data mining, F#, OCaml, and Python.

As an F# programmer, I always keep an eye on its cousin, OCaml. OCaml and F# share a large portion of their syntax, and an average F# programmer can pick up basic OCaml very quickly, say in one afternoon. Recently, the development of OCaml seems to have accelerated: exciting language features have been added in the new releases, first-class modules (in 3.12) and generalized algebraic data types (GADTs) (in 4.0). For a good feature comparison of the two languages, please refer to this SO question.

Today, I'd like to start a short discussion of their application to a specific area: data mining.

The first question the reader may ask is: "why do you want to use OCaml for data mining?" If F# made everything convenient, there would be little point in considering other languages. The main problem with F# is that its cross-platform support is poor. Maybe the language itself is fine, as there is the Mono tool chain for F#, but the whole ecosystem is not cross-platform friendly. There are always one or two .Net libraries written specifically for Windows; to name two that I use: R.Net and Sho. Besides the library issue, the Mono runtime has poor performance for the .Net parallel library and sometimes crashes. When I was at Microsoft Research Asia, we didn't have any cross-platform issues: every server was powerful, with more than 12 cores and 32GB of memory, and well provisioned with Visual Studio and other tools. Back at HKUST, most of the servers run Linux for no good reason (Windows Server licenses are very cheap for our school; the IT guys are simply not experienced enough to secure Windows servers).

Even with such constraints, we have had success stories with F#. Recently I led a team that won the semantic place prediction task in the Nokia Mobile Data Challenge. Details here. The feature extraction program is written in F# and C#, and we can generate all the features (from simple counts to FFTs) from 50GB of text data within two hours, thanks to F#'s asynchronous programming and other functional features!

But after that project, I think we really need a better Windows server, or we have to use a language that is friendlier to Linux (and that, at the same time, runs fine on Windows, since I use Windows for development).

Currently I use Python. I write scripts on my laptop, and when they work correctly on a small dataset, I move them to the server to run on the full dataset; the processing usually takes from an hour to a whole night. Python is easy to write and, as it claims, indeed easy to read. But at the language level, code reuse in a dynamic language is generally harder than in a static language: interfaces are too flexible, and I usually end up struggling over which design to choose.

On the other side, a static language encourages refactoring and provides static types that facilitate code reuse. Consider the following F# code snippet:

 

// a common pattern for accessing the Stackoverflow csv file, question by question
let csv_iter (csvfile: string) (mapper: CsvReader -> string option) (mapfile: string) =
    use csv = new CsvReader(new System.IO.StreamReader(csvfile), true)
    let headers = csv.GetFieldHeaders()
    let fieldCount = csv.FieldCount

    use sw = new StreamWriter(mapfile)
    use bad = new StreamWriter(mapfile + ".badids")
    while csv.ReadNextRecord() do
        match mapper csv with
        | Some text -> sw.WriteLine(text)
        | None ->
            bad.WriteLine(csv.[0])
            sw.WriteLine()

    sw.Close()
    bad.Close()



From the function signature alone, one can guess what this function does. Basically, it processes each record in a huge csv file with a mapper function. The result of the mapper can be None, meaning some error occurred inside it, or Some string, a normal result, and all results are written to a file line by line. Let's focus on the mapper here: the string option type in the signature puts a more specific requirement on the mapper than merely asking it to return a string. We have encoded the logic into the types. This is actually a main point of functional programming – types say more than in other languages. If we wrote this function in Python, the mapper would still return a string; however, the fact that the result can be None would be encoded in the implementation, not in a type signature. We would have to resort to natural language, commenting on the Python function that the mapper should return None on error, and such a comment is obviously less strict than a type.
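To make the contrast concrete, here is a sketch of the hypothetical Python counterpart: the None convention exists only in the docstring, not in any checked signature:

import csv

def csv_iter(csvfile, mapper, mapfile):
    """Apply mapper to each record; mapper must return the mapped
    text, or None on error -- a convention the language never checks."""
    with open(csvfile, 'rb') as f:
        reader = csv.reader(f)
        reader.next()                      # skip the header row
        with open(mapfile, 'w') as sw, open(mapfile + '.badids', 'w') as bad:
            for row in reader:
                text = mapper(row)
                if text is None:           # error inside the mapper
                    bad.write(row[0] + '\n')
                    sw.write('\n')
                else:
                    sw.write(text + '\n')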



The above code is only the first version of this csv iterator; several enhancements can be made. E.g., we can define a more detailed type for the mapper result instead of string option, which is too coarse:



type MapResult =
    | MapError of string * string
    | NormalResult of string



Now an error carries two parts: the error message and a second string, e.g. the best result the mapper could still produce, or the id of the failing record. In the Stackoverflow example, if I cannot parse the question body, maybe I can fall back to its title.



Say this is my old mapper, which extracts the pure text from a question body, excluding its code and other annotations:



let mapper_getbody_textonly (csv: CsvReader) =
    try
        let feat = createZero()
        let markdoc = csv.[7] |> Markdown.Parse
        calcTopParagraphs feat markdoc
        Some (
            let text = feat.PlainString.ToString()
            text.Replace('\n', ' ').Replace('\r', ' ')
        )
    with _ ->
        None

Now I have to change the code a little bit to make it compile:

let mapper_getbody_textonly (csv: CsvReader) =
    try
        let feat = createZero()
        let markdoc = csv.[7] |> Markdown.Parse
        calcTopParagraphs feat markdoc
        NormalResult (
            let text = feat.PlainString.ToString()
            text.Replace('\n', ' ').Replace('\r', ' ')
        )
    with _ ->
        MapError ("markdown parser error", csv.[0])



In Python, no compiler will tell you that you need to change the mapper! You have to keep a mental list of the old mappers and change them one by one to fit the new interface. In F#, the change is enforced: you don't need to keep that list, because the compiler will complain as soon as you use an old mapper with the new interface.



With strict types, we are forced and helped by the compiler to change and refactor our code gradually. This is the main point I want to make about code reuse.



I keep 80% of the code in F#, as long as my Windows laptop can run it. When it cannot, say when the processing takes more than 6 hours, I write the remaining 20% in Python and put it on the Linux server. This is my current compromise.



So that is my little complaint about Python, or about dynamic languages in general. Let's finally get back to OCaml. OCaml is a great functional language. Among the practical functional languages, I think Haskell is the only one more interesting than OCaml in the functional aspect. However, Haskell is definitely not ready for large scale data mining tasks: its memory usage is unacceptable, not to mention the performance issues. So I was serious about using OCaml for my current task and writing all the code in OCaml.



But library availability in the OCaml ecosystem bit me! I could not find a CSV library that supports streaming access and Unicode. I did find a simple Markdown parser in OCaml, but it parses a lot of SO Markdown incorrectly. Data mining practitioners depend on a lot of tools for the tasks at hand, from solid CSV and database libraries, to linear algebra libraries, to NLP taggers, etc. In this respect, .Net and Python have better libraries than OCaml does.



So ends my story of using OCaml, this time.



I will probably come back to OCaml in the future, and will try it several more times, I think. It is really a good language, and cleaner than F# in the functional aspect; e.g. .Net objects can still be null in F#, but there is really no null in OCaml. If everything in a project is best written from scratch, then OCaml can be a very good candidate.

Monday, October 31, 2011

Any numerical computing environment on Java platform?

Recently Tomas Petricek and I co-wrote a book chapter, Numerical Computing in F#. It is a free online chapter for his Manning book Real World Functional Programming. The chapter is a survey of numerical computing for .Net languages, especially F#. I will write another article about this chapter and present some material that was cut from the final version on MSDN.

But today I am looking at our competitor's side -- numerical computing environments for the Java platform. I am just wondering what the Java guys are doing for numerical computing.

Looking at the mature numerical systems such as Matlab, Mathematica and S-plus/R, we can summarize three general elements of such a system:

1. A math library. A math library is a must for serious numerical computing: well-tested numerical procedures are hard to build. This kind of low level stuff, e.g. linear algebra routines and FFT, is usually available as a reusable software component.

2. An interactive console. The read-eval-print loop (REPL) is very important for exploratory analysis because it gives quick feedback on small fragments of code.

3. Plot/chart library. A plot explains what’s going on clearly and quickly!

I did some web searching and found quite a few blog articles; I want to mention those by Mikio L. Braun. He also maintains the jblas library, a JNI wrapper around ATLAS, the highly-optimized BLAS library.

I am aware of Incanter, a Clojure project aiming to be the R of the Java platform. But I don't think rewriting basic things, e.g. PCA and linear regression, for the Java platform is a good idea, since a stable implementation of these algorithms usually takes years; there are just too many details in numerical algorithms.

This time, I found a new library, ScalaLab, an ongoing project for numerical computing in Scala.

Let's first talk about the libraries a little. For matrices and linear algebra, there are two quite stable Java libraries: Colt and JAMA. The linear algebra functionality of the two should be complete, but the performance is quite bad compared to native code, even several times slower than pure .Net code. [The comparison to .Net code is from my personal experience.] One should not care too much about performance, though, as the bottleneck of a numerical project differs from project to project; a working system is the most important thing, and various methods can help improve performance later. Besides, the performance difference is usually not noticeable when working on small or sample datasets. I have used R quite often recently; the matrix library shipped with its Windows version is from Netlib, with ordinary native performance, and I haven't had a single performance complaint in months.

For the REPL, it is basically a language issue. After reading around online, I think I would like something built on Scala, such as the ScalaLab environment. Among the mature or nearly mature JVM languages, Scala is the most similar to F#. Scala's advantage over F# is that Scala is a component language: it has better OO features for library designers and software architects. However, at the small, low level of programming (e.g. a small function), F# feels much more functional and pure than Scala. This, I think, is partly because F# "IS" an ML, while Scala borrows some syntax from Java and C#. The other reason is that OO is at the heart of Scala: everything is an object, and Scala's OO model is greatly influenced by Smalltalk's. I haven't learned any Smalltalk, but after some learning in Scala, I can appreciate why Alan Kay, the inventor of Smalltalk, said: "Actually I made up the term 'object-oriented', and I can tell you I did not have C++ in mind." So even functional programming in Scala is emulated through OO, while F# has a functional base first, and its implementation on .Net is transparent to the syntax. OK… the above comparison is off topic. Anyway, Scala can serve as a good static scripting language with a REPL.

For the plotting library, JFreeChart seems to be the standard. It is also the plotting library used in Incanter [for this part, I think Incanter does a really great job!]. I am not sure whether there are more advanced or commercial plotting libraries for Java, but I know .Net has tons of them! Anyway, JFreeChart is OK for daily use.

The last element, which I haven't listed above, is parallel computing! The Java platform, or more specifically the Java Virtual Machine, has proven to be one of the most stable multithreaded systems ever built. For example, Hadoop is written in Java; although people at Google say Hadoop is much slower than their C++ MapReduce, a lot of big companies are using it. Two months ago, I read the following nice paper by Martin Odersky and Kunle Olukotun:

Language Virtualization for Heterogeneous Parallel Computing

It was just published and already has 28 citations on Google Scholar. The idea is to build a DSL, serving as middleware, in which to write high level programs, e.g. equations involving matrix operations. Underneath the DSL, complicated optimizations transform the program and parallelize it using heterogeneous technologies, e.g. GPUs and multi-core. So the programmer does not need to know anything about parallel computing, yet his program is executed in parallel utilizing all the cores! The DSL is an internal one built in Scala, another example of Scala's great expressiveness. Since the idea itself is not new, the novel part of the paper is demonstrating how well Scala fits the requirements of such a paradigm.

I don't quite have a conclusion for this post; just some random thoughts on numerical computing on the Java platform, and an advertisement for my articles on MSDN. :)