Saturday, May 16, 2015

Is Rust good for data mining?


Rust 1.0 is just released. This is great achievement for Rust team! I have watched Rust for some time. One phase on its website summaries this language rather well: “zero-cost abstraction”. I was attracted to Rust by Poss’s article: Rust for functional programmers. But the more I know the language, the more I find that its syntax-level similarity to Haskell and ML is only superficial, what Rust really wants to be is a system programming language with safe and fine-granularity memory management.

C has “zero-cost”, however it is hard to build high-level abstractions in it. If we program concurrency in C, we shall explicitly use a thread library (e.g. OpenMP, PThreads) or we need to go lower to implement a thread using more primitive functions at the OS level.

On the other extreme, languages such as Haskell and Scala have “abstractions”. They can build beautiful APIs that others can use and don’t care about how they are implemented. There is a cost in abstraction. The memory usage of Haskell and Scala is harder to predict than C programs. When we chain functions such as map/filter, we don’t know exactly how many objects are created as immediate objects; it depends the underlying library implementation and the optimization ability of the compiler and these two are not common knowledge to the applications programmer . The extra cost also comes from the constant hidden inside the big O – using the iterator interface would has extra cost on each .next() function call while iterating a bare array in C is much faster.

How is it possible for Rust to have the zero-cost abstraction? Basically it achieves it through memory management system. This is the real innovative part of Rust, though some theory is built in academic papers and previous small languages, Rust is the first to provide a best engineering on it. In F#, we can easily write

col |> (fun e -> e*e) |> Seq.filter (fun e -> e%3==0) |> Seq.sum

and we don’t know how memory is managed. Will map and filter create many small objects to be GCed? The F# program can ignore these questions. He can of course dig into Seq module’s source code and the final compiled IL code to know the details. But by default, the F# programmer need not care.

If the same program is written in Rust, the programmer has to control exactly how each object is created. He knows it by writing extra annotations to the code. This adds burden to the programmer’s mind!

In data mining, writing correct code for the numerics and the algorithmic logics is already hard, how would a data miner want to put the programming issues in his mind? I would not. This is of course because I am not familiar with Rust’s borrow system. I believe after enough training, my skill can reach a state of caring less and less about memory when programming data mining applications. But in the first place, why should I? Fine memory control is not the primary issue of data mining applications. If performance is not that critical, any static language such as F# and Scala would have a fine performance. Need more performance? Code in C++, allocate all the memory deterministically, and avoid big-object copy and dynamic heap allocation when critical components are running!

Saturday, April 11, 2015

Playing Clojure with a simple setup


I am always trying to find a scripting language which could replace Python for data preprocessing tasks. I love Python for everything except its performance [think about implementing a dynamic programming with two nested and busy loops]. There are many choices, F#/OCaml/Scala(succinct & typed!), go (at least typed…), etc. I also checked Clojure several years ago. But at that time, I knew little Lisp and I felt that Clojure is bundled with too much abstractions under a dynamic type system which would hurt performance. But recently I happen to see a Clojure code snippet which has type annotations (type hints in Clojure’s terminology)!

With type hints and other tricks, Clojure can be as fast as Java [1] (i.e., the same order of speed with F# and OCaml). Of course, the type hinted and highly imperative Clojure programs are even harder to read than their Java equivalents. But at least Clojure provides a way to achieve this goal in Clojure itself. Once we find the performance bottle neck, we don’t need any other tool, we just add type hints and if it does not work, simply write it in a more careful way in Clojure (e.g., keeping use unboxed primitive values and Java primitive arrays.) The good thing is that we don’t need to leave Clojure. Because I want to use Clojure as a quick scripting language, I don’t really want to create a project and setup a Makefile-kind of thing – I just want a single file that contains all the program. If we want to speed up Python, we can use Cython or writing C and use the compiled dynamic library in Python. This is just too much for the purpose of scripting. And the final product would be hard to deploy, think about somebody has no Cython installed, or transferring binaries to a new machine, etc. Just a nightmare. For Clojure, the compiler and the library are in the same jar (e.g. clojure-1.6.0.jar), the deployment of the script is really easy.


Set up Emacs with a Clojure Interpreter

No leiningen, no cider, no maven! Just download clojure.jar and two .el files. Then you are done! Actually you only need to have a Java compiler installed to have compile Clojure because Clojure compiler is written in Java. We really need to give a thump up for Rich Hickey for creating a production-quality language with so small size source code (less than 3MB for Clojure 1.6). That’s amazing!

Since I intend to use Clojure for data processing (or in a fancier phase, data science), an interpreter is a must. I need to program-and-test on the go [the same working routine I am under F#, R, and Matlab]. Unlike F# or Scala, the ideal IDE for Clojure is no IDE. Clojure programs needs no dot magic while for F# and Scala it is quite nice to type “.” and then find available functions under a module or object. So I have no plan to install the Cider. I use clojure-mode and inf-clojure, both of which are distributed as a single .el file. Or you can use (M-x list-packages, in Emacs 24 only) to install these two modules. I would suggest to install auto-complete mode and the dictionary for Clojure. The dictionary contains a long list of common Clojure keywords. But this is optional as once we get familiar with Clojure’s functions, we need no auto complete for keywords [Emacs’s M-/ is good enough]. For example, I never use auto complete minor mode when programming R scripts under ESS mode.

Put these into .emacs:

(setq inferior-lisp-program "C:\\Progra~1\\Java\\jdk1.7.0\\bin\\java  -cp d:\\home\\clojure-1.7.0-alpha5\\clojure-1.7.0-alpha5.jar clojure.main") ; clojure REPL command line
(add-hook 'clojure-mode-hook #'inf-clojure-minor-mode)
(setq inf-clojure-buffer "*inferior-lisp*")

To start the interpreter, use M-x run-lisp. The shortcuts are the same with elisp modes (e.g. the *scratch* buffer). Two most useful ones:

C-x C-e: send the last expression to the interpreter. And expression is all the content between two parenthesis. So we can use this short cut to send multi-line function definition, or single-line printf to the interpreter.

C-x C-r: send the clipboard to the interpreter.


Run Clojure scripts from command line

REPL is great for playing with data and making sure every function works as expected. In the end, we usually put all functions into a script file which can run as a task. If we just use Java standard library and Clojure core library, we can simply type the following command:

java -cp clojure.jar clojure.main script.clj

and we can create .bat, an Linux alias or whatever to make the command short. Something like:

alias runclj=”java -cp /path/to/clojure.jar clojure.main”

However when our script is dependent on some libraries, managing the dependency can be tricky. I searched online, it seems that everyone is recommending leiningen to load a library or manage a script that contains multiple files. For some reason, I have to avoid using leiningen.

I find a simple solution. That is the good and old load-file command in LISP world (In Emacs, I sometimes use load-file to update my .emacs without restarting Emacs).

And it seems that the Clojure community has learnt from the practice of Javascript. That is, if a library is small enough, then distribute it as a single file! I checked several clojure libraries that I am interested: data.csv, clojure-csv, data.json. All these libraries are distributed as one or two .clj files!

The following script first loads two libraries: numeric-tower and json, and then do a test:

(load-file "d:\\home\\numeric_tower.clj")
(load-file "d:\\home\\json.clj")
(load-file "d:\\home\\json_compat_0_1.clj")

(ns example
(:require [clojure.math.numeric-tower :as math])
(:require [ :as json]))

; use numeric-tower library
(defn- sqr
"Uses the numeric tower expt to square a number"
(math/expt x 2))

(println (sqr 100))

; use json library
(println (json/read-str "{\"a\":1,\"b\":2}"))


Notice that the last line of json.clj is: (load "json_compat_0_1"). load is similar to load-file, but it loads the .class files. Since we don’t pre-compile json_compat_0_1, we need to comment out this line and load json_compat_0_1 explicitly in our script.

Since I am usually dealing with large amount of data, I don’t care about the compiling and JVM startup time, which is negligible compared to the total time spent on number crunching.


My first Clojure programs

I’d like to thank Prof. Dan Grossman’s brilliant course at Coursera first. I never believe in books like Seven languages in seven weeks. Repeatedly writing three-line programs in different languages only give you an illusion that you’ve learned. Prof. Grossman’s course is truly “three languages in one semester”. One of his three languages is Racket/Scheme [the other two being ML and Ruby]. I did the two Racket assignments (Assignment 4&5). Assignment 5 is on how to implement a simple interpreter in Racket. That’s about all my training in lisp.

I solved some Hacker Rank problems using Clojure. The following includes my solutions to three problems. Two of them are dynamic programming problems. I focus on dynamic programming because the loops and array operations are slow in Python and I wish to test their speed in Clojure.

Problem: two strings

Find if there is a substring that appears in both A and B.


(use '[clojure.set :only (intersection)])
(def n (Integer/parseInt (read-line)))

(defn toset [line]

(defn compare1 [line1 line2]
(if (= 0 (count (intersection (toset line1) (toset line2))))
(println "NO")
(println "YES")))

(loop [i 0]
(when (< i n)
(let [line1 (read-line)
line2 (read-line)]
(compare1 line1 line2)
(recur (inc i))


Problem: Candies

A very common interview question which tests dynamic programming and its memorization technique.


(def T (Integer. (read-line)))

(def v
(->> (range T)
(map (fn [_] (read-line)))
(map #(Integer. %))
(into [])))

(def inf 2000000)

(def f (int-array T inf))

(defn dp
(if (< (aget f i) inf)
(aget f i)
(= i 0) (let [right (if (> (v i) (v (inc i))) (inc (dp (inc i))) 1)]
(aset f i right)
(= i (dec T)) (let [left (if (> (v i) (v (dec i))) (inc (dp (dec i))) 1)]
(aset f i left)
:else (let [right (if (<= (v i) (v (inc i))) 1 (inc (dp (inc i))))
left (if (<= (v i) (v (dec i))) 1 (inc (dp (dec i))))
dp-i (max right left)]
(aset f i dp-i)

(println (apply + (map dp (range T))))

Problem: Stock Maximize

Given a sequence of stock prices, find the maximum profit.

(set! *warn-on-reflection* true)

(defn line->ints
(clojure.string/split line #" ")
(map #(Integer/parseInt %))
(into [])

(defn find-sell-points
"Find decreasing local-maximums"
(let [n (count v)
stack (int-array n)]
(aset stack 0 1) ; the first sell point at 1
(loop [i 2 j 0 in-while false add-to false]
(println j (take 4 stack))
(if (< i n)
(let [v-top (v (aget stack j))
v-i (v i)]
(if in-while
(if (>= v-i v-top) ; in-while
(if (= 0 j) (recur i j false true) (recur i (dec j) true false))
(recur i (inc j) false true))
(if add-to
(do (aset stack j i) (recur (inc i) j false false))
(if (>= v-i v-top)
(recur i j true false)
(do (aset stack (inc j) i) (recur (inc i) (inc j) false false))))))
[stack j]))))

(defn simulate-trade
"Sell at pos"
[v pos]
(loop [i 0
j 0
sum 0]
(if (< i (count pos))
(if (<= j (pos i))
(recur i (inc j) (+ sum (max 0 (- (v (pos i)) (v j)))))
(recur (inc i) (inc (pos i)) sum))

(let [T (Integer/parseInt (read-line))]
(loop [test 0]
(when (< test T)
(let [_ (read-line)
x (read-line)
v (line->ints x)
[pos j] (find-sell-points v)
res (simulate-trade v (into [] (take (inc j) pos)))]
(println res))
(recur (inc test)))))

I may write a longer post on how to learn Clojure basics by solving Hacker Rank problems. My inspiration is from the following page:

which shows how to use Clojure to solve the first 33 Project Euler problems. The page layout is so nice: on the left is the explanation, and the code is well aligned on the right. For each problem there is also a Docs section, which lists the new Clojure core functions that are introduced in the solution. By following all the 33 solutions, we may have seen usage examples of many functions listed on the Clojure cheat sheet.


Good readings on Clojure

I borrowed two Clojure books from a university library. But most of my readings are based on online materials. I list them below:

Adam Bard’s Clojure blog

The Clojure Style Guide

Clojure Cheatsheet

Introducing HipHip (Array): Fast and flexible numerical computation in Clojure




[1] I have seen a few questions asked on Stackoverflow and other places that concern the performance of Clojure. Here is a list of them: