Saturday, May 16, 2015

Is Rust good for data mining?


Rust 1.0 is just released. This is great achievement for Rust team! I have watched Rust for some time. One phase on its website summaries this language rather well: “zero-cost abstraction”. I was attracted to Rust by Poss’s article: Rust for functional programmers. But the more I know the language, the more I find that its syntax-level similarity to Haskell and ML is only superficial, what Rust really wants to be is a system programming language with safe and fine-granularity memory management.

C has “zero-cost”, however it is hard to build high-level abstractions in it. If we program concurrency in C, we shall explicitly use a thread library (e.g. OpenMP, PThreads) or we need to go lower to implement a thread using more primitive functions at the OS level.

On the other extreme, languages such as Haskell and Scala have “abstractions”. They can build beautiful APIs that others can use and don’t care about how they are implemented. There is a cost in abstraction. The memory usage of Haskell and Scala is harder to predict than C programs. When we chain functions such as map/filter, we don’t know exactly how many objects are created as immediate objects; it depends the underlying library implementation and the optimization ability of the compiler and these two are not common knowledge to the applications programmer . The extra cost also comes from the constant hidden inside the big O – using the iterator interface would has extra cost on each .next() function call while iterating a bare array in C is much faster.

How is it possible for Rust to have the zero-cost abstraction? Basically it achieves it through memory management system. This is the real innovative part of Rust, though some theory is built in academic papers and previous small languages, Rust is the first to provide a best engineering on it. In F#, we can easily write

col |> (fun e -> e*e) |> Seq.filter (fun e -> e%3==0) |> Seq.sum

and we don’t know how memory is managed. Will map and filter create many small objects to be GCed? The F# program can ignore these questions. He can of course dig into Seq module’s source code and the final compiled IL code to know the details. But by default, the F# programmer need not care.

If the same program is written in Rust, the programmer has to control exactly how each object is created. He knows it by writing extra annotations to the code. This adds burden to the programmer’s mind!

In data mining, writing correct code for the numerics and the algorithmic logics is already hard, how would a data miner want to put the programming issues in his mind? I would not. This is of course because I am not familiar with Rust’s borrow system. I believe after enough training, my skill can reach a state of caring less and less about memory when programming data mining applications. But in the first place, why should I? Fine memory control is not the primary issue of data mining applications. If performance is not that critical, any static language such as F# and Scala would have a fine performance. Need more performance? Code in C++, allocate all the memory deterministically, and avoid big-object copy and dynamic heap allocation when critical components are running!

Saturday, April 11, 2015

Playing Clojure with a simple setup


I am always trying to find a scripting language which could replace Python for data preprocessing tasks. I love Python for everything except its performance [think about implementing a dynamic programming with two nested and busy loops]. There are many choices, F#/OCaml/Scala(succinct & typed!), go (at least typed…), etc. I also checked Clojure several years ago. But at that time, I knew little Lisp and I felt that Clojure is bundled with too much abstractions under a dynamic type system which would hurt performance. But recently I happen to see a Clojure code snippet which has type annotations (type hints in Clojure’s terminology)!

With type hints and other tricks, Clojure can be as fast as Java [1] (i.e., the same order of speed with F# and OCaml). Of course, the type hinted and highly imperative Clojure programs are even harder to read than their Java equivalents. But at least Clojure provides a way to achieve this goal in Clojure itself. Once we find the performance bottle neck, we don’t need any other tool, we just add type hints and if it does not work, simply write it in a more careful way in Clojure (e.g., keeping use unboxed primitive values and Java primitive arrays.) The good thing is that we don’t need to leave Clojure. Because I want to use Clojure as a quick scripting language, I don’t really want to create a project and setup a Makefile-kind of thing – I just want a single file that contains all the program. If we want to speed up Python, we can use Cython or writing C and use the compiled dynamic library in Python. This is just too much for the purpose of scripting. And the final product would be hard to deploy, think about somebody has no Cython installed, or transferring binaries to a new machine, etc. Just a nightmare. For Clojure, the compiler and the library are in the same jar (e.g. clojure-1.6.0.jar), the deployment of the script is really easy.


Set up Emacs with a Clojure Interpreter

No leiningen, no cider, no maven! Just download clojure.jar and two .el files. Then you are done! Actually you only need to have a Java compiler installed to have compile Clojure because Clojure compiler is written in Java. We really need to give a thump up for Rich Hickey for creating a production-quality language with so small size source code (less than 3MB for Clojure 1.6). That’s amazing!

Since I intend to use Clojure for data processing (or in a fancier phase, data science), an interpreter is a must. I need to program-and-test on the go [the same working routine I am under F#, R, and Matlab]. Unlike F# or Scala, the ideal IDE for Clojure is no IDE. Clojure programs needs no dot magic while for F# and Scala it is quite nice to type “.” and then find available functions under a module or object. So I have no plan to install the Cider. I use clojure-mode and inf-clojure, both of which are distributed as a single .el file. Or you can use (M-x list-packages, in Emacs 24 only) to install these two modules. I would suggest to install auto-complete mode and the dictionary for Clojure. The dictionary contains a long list of common Clojure keywords. But this is optional as once we get familiar with Clojure’s functions, we need no auto complete for keywords [Emacs’s M-/ is good enough]. For example, I never use auto complete minor mode when programming R scripts under ESS mode.

Put these into .emacs:

(setq inferior-lisp-program "C:\\Progra~1\\Java\\jdk1.7.0\\bin\\java  -cp d:\\home\\clojure-1.7.0-alpha5\\clojure-1.7.0-alpha5.jar clojure.main") ; clojure REPL command line
(add-hook 'clojure-mode-hook #'inf-clojure-minor-mode)
(setq inf-clojure-buffer "*inferior-lisp*")

To start the interpreter, use M-x run-lisp. The shortcuts are the same with elisp modes (e.g. the *scratch* buffer). Two most useful ones:

C-x C-e: send the last expression to the interpreter. And expression is all the content between two parenthesis. So we can use this short cut to send multi-line function definition, or single-line printf to the interpreter.

C-x C-r: send the clipboard to the interpreter.


Run Clojure scripts from command line

REPL is great for playing with data and making sure every function works as expected. In the end, we usually put all functions into a script file which can run as a task. If we just use Java standard library and Clojure core library, we can simply type the following command:

java -cp clojure.jar clojure.main script.clj

and we can create .bat, an Linux alias or whatever to make the command short. Something like:

alias runclj=”java -cp /path/to/clojure.jar clojure.main”

However when our script is dependent on some libraries, managing the dependency can be tricky. I searched online, it seems that everyone is recommending leiningen to load a library or manage a script that contains multiple files. For some reason, I have to avoid using leiningen.

I find a simple solution. That is the good and old load-file command in LISP world (In Emacs, I sometimes use load-file to update my .emacs without restarting Emacs).

And it seems that the Clojure community has learnt from the practice of Javascript. That is, if a library is small enough, then distribute it as a single file! I checked several clojure libraries that I am interested: data.csv, clojure-csv, data.json. All these libraries are distributed as one or two .clj files!

The following script first loads two libraries: numeric-tower and json, and then do a test:

(load-file "d:\\home\\numeric_tower.clj")
(load-file "d:\\home\\json.clj")
(load-file "d:\\home\\json_compat_0_1.clj")

(ns example
(:require [clojure.math.numeric-tower :as math])
(:require [ :as json]))

; use numeric-tower library
(defn- sqr
"Uses the numeric tower expt to square a number"
(math/expt x 2))

(println (sqr 100))

; use json library
(println (json/read-str "{\"a\":1,\"b\":2}"))


Notice that the last line of json.clj is: (load "json_compat_0_1"). load is similar to load-file, but it loads the .class files. Since we don’t pre-compile json_compat_0_1, we need to comment out this line and load json_compat_0_1 explicitly in our script.

Since I am usually dealing with large amount of data, I don’t care about the compiling and JVM startup time, which is negligible compared to the total time spent on number crunching.


My first Clojure programs

I’d like to thank Prof. Dan Grossman’s brilliant course at Coursera first. I never believe in books like Seven languages in seven weeks. Repeatedly writing three-line programs in different languages only give you an illusion that you’ve learned. Prof. Grossman’s course is truly “three languages in one semester”. One of his three languages is Racket/Scheme [the other two being ML and Ruby]. I did the two Racket assignments (Assignment 4&5). Assignment 5 is on how to implement a simple interpreter in Racket. That’s about all my training in lisp.

I solved some Hacker Rank problems using Clojure. The following includes my solutions to three problems. Two of them are dynamic programming problems. I focus on dynamic programming because the loops and array operations are slow in Python and I wish to test their speed in Clojure.

Problem: two strings

Find if there is a substring that appears in both A and B.


(use '[clojure.set :only (intersection)])
(def n (Integer/parseInt (read-line)))

(defn toset [line]

(defn compare1 [line1 line2]
(if (= 0 (count (intersection (toset line1) (toset line2))))
(println "NO")
(println "YES")))

(loop [i 0]
(when (< i n)
(let [line1 (read-line)
line2 (read-line)]
(compare1 line1 line2)
(recur (inc i))


Problem: Candies

A very common interview question which tests dynamic programming and its memorization technique.


(def T (Integer. (read-line)))

(def v
(->> (range T)
(map (fn [_] (read-line)))
(map #(Integer. %))
(into [])))

(def inf 2000000)

(def f (int-array T inf))

(defn dp
(if (< (aget f i) inf)
(aget f i)
(= i 0) (let [right (if (> (v i) (v (inc i))) (inc (dp (inc i))) 1)]
(aset f i right)
(= i (dec T)) (let [left (if (> (v i) (v (dec i))) (inc (dp (dec i))) 1)]
(aset f i left)
:else (let [right (if (<= (v i) (v (inc i))) 1 (inc (dp (inc i))))
left (if (<= (v i) (v (dec i))) 1 (inc (dp (dec i))))
dp-i (max right left)]
(aset f i dp-i)

(println (apply + (map dp (range T))))

Problem: Stock Maximize

Given a sequence of stock prices, find the maximum profit.

(set! *warn-on-reflection* true)

(defn line->ints
(clojure.string/split line #" ")
(map #(Integer/parseInt %))
(into [])

(defn find-sell-points
"Find decreasing local-maximums"
(let [n (count v)
stack (int-array n)]
(aset stack 0 1) ; the first sell point at 1
(loop [i 2 j 0 in-while false add-to false]
(println j (take 4 stack))
(if (< i n)
(let [v-top (v (aget stack j))
v-i (v i)]
(if in-while
(if (>= v-i v-top) ; in-while
(if (= 0 j) (recur i j false true) (recur i (dec j) true false))
(recur i (inc j) false true))
(if add-to
(do (aset stack j i) (recur (inc i) j false false))
(if (>= v-i v-top)
(recur i j true false)
(do (aset stack (inc j) i) (recur (inc i) (inc j) false false))))))
[stack j]))))

(defn simulate-trade
"Sell at pos"
[v pos]
(loop [i 0
j 0
sum 0]
(if (< i (count pos))
(if (<= j (pos i))
(recur i (inc j) (+ sum (max 0 (- (v (pos i)) (v j)))))
(recur (inc i) (inc (pos i)) sum))

(let [T (Integer/parseInt (read-line))]
(loop [test 0]
(when (< test T)
(let [_ (read-line)
x (read-line)
v (line->ints x)
[pos j] (find-sell-points v)
res (simulate-trade v (into [] (take (inc j) pos)))]
(println res))
(recur (inc test)))))

I may write a longer post on how to learn Clojure basics by solving Hacker Rank problems. My inspiration is from the following page:

which shows how to use Clojure to solve the first 33 Project Euler problems. The page layout is so nice: on the left is the explanation, and the code is well aligned on the right. For each problem there is also a Docs section, which lists the new Clojure core functions that are introduced in the solution. By following all the 33 solutions, we may have seen usage examples of many functions listed on the Clojure cheat sheet.


Good readings on Clojure

I borrowed two Clojure books from a university library. But most of my readings are based on online materials. I list them below:

Adam Bard’s Clojure blog

The Clojure Style Guide

Clojure Cheatsheet

Introducing HipHip (Array): Fast and flexible numerical computation in Clojure




[1] I have seen a few questions asked on Stackoverflow and other places that concern the performance of Clojure. Here is a list of them:

Saturday, October 4, 2014

A list of references in machine learning and programming languages


Today I have to dispose some papers that I’ve printed during my PhD study. Most of them are research papers or notes, and I had great joy when I first read them. For some of them, I constantly refer back to. Since all these materials are freely available online, I now write down this list and may find them more easily. (In terms of reading, I still prefer papers, however they consume too much space.)

04 Oct 2014

Machine learning/statistics

Minka, A comparison of numerical optimizers for logistic regression.

          Estimating a Gamma distribution.

          Beyond Newton’s method.

Steyvers, Multidimensional scaling.

von Luxburg, A tutorial of spectral clustering.

Das et. al, Google news personalization: scalable online collaborative filtering.

Hofmann, Probabilistic latent semantic analysis.

Deerwester et. al, Indexing by latent semantic analysis.

Yi Wang, Distributed gibbs sampling of latent dirichlet allocation, the gritty details.

Blei et. al, latent dirichlet allocation.

Herbrich et. al, TreeSkill: a bayesian skill rating system.

                      Web-scale bayesian click-through rate prediction for Sponsored search..

                      Matchbox: large scale online bayesian recommendations

Wickham, Tidy data.

                The split-apply-combine strategy for data analysis.

DiMaggio, Mapping great circles tutorial.

Breiman, Statistical modeling: the two cultures. (with comments)

Wagstaff, Machine learning that matters.

Shewchuk, An introduction to the conjugate gradient method without the agonizing pain.

Mohan et. al, Web-search ranking with initialized gradient boosted regression trees.

Burges, From RankNet to LambdaRank to KambdaMART: an overview.



Goodger, Code Like a Pythonista: Idiomatic Python.

C++ FAQ, inheritance.

Wadler: Monads for functional programming.

            How to make ad-hoc polymorphism less ad hoc.

            Comprehending monads.

            The essence of functional programming.

Jones: A system of constructor classes: overloading and implicit higher-order polymorphism.

          Functional programming with overloading and higher-order polymorphism.

          Implementing type classes.

          Monad transformers and modular interpreters.

Yorgey, The Typeclassopedia.

Augustsson, Implementing Haskell overloading.

Damas and Milner, Pricipal type-schemes for functional programs.

Wehr et. al, ML modules and haskell type classes: a constructive comparison.

Bernardy et. al, A comparison of C++ concepts and Haskell type classes.

Garcia et. al, A comparative study of language support for generic programming.

Conchon et. al, Designing a generic graph library using ML functors.

Fenwick, A New Data Structure for Cumulative Frequency Tables.

Petr Mitrchev, Fenwick tree range updates. and his blog.

Flanagan et. al, The essence of compiling with continuations.

Haynes et. al, Continuations and coroutines.

Notes on CLR garbage collector.

Odersky et. al, an overview of the scala programming language.

                       lightweight modular staging: a pragmatic approach to runtime code genration and compiled dsls.

                       javascript as an embedded dsl

                       StagedSAC: a case study in performance-oriented dsl development

F# language reference, chapter 5, types and type constraints.

Remy, Using, undertanding, and unraveling the ocaml language.

Carette et. al, Finally tagless, partially evaluated.

Gordon et. al, A model-learner pattern for bayesian reasoning.

                     Measure transformer semantics for bayesian machine learning

Daniels et. al, Experience report: Haskell in computational biology.

Kiselyo et. al, Embedded probabilistic programming.

Hudak, Conception, evollution, and application of functional programming languages.

Hudak et al, A gentle introduction to Haskell 98.

                  A history of Haskell: Being lazy with class.

Yallop et. al, First-class modules: hidden power and tantalizing promises.

Lammel et. al, Software extension and integration with type classes.

HaskellWiki, OOP vs type classes.

                   All about monads.


Course notes

Tao, linear algebra.

MIT 18.440 Probability and Random Variables.OCW link.

MIT 18.05 Introduction to Probability and Statistics. OCW link.

MIT 6.867 Machine learning. Good problem sets.  OCW link.

MIT 18.01 Single Variable Calculus. OCW link.

Princeton COS424, Interacting with Data.

Stanford CS229. Machine learning.

Cornell CS3110, Data Structures and Functional Programming.

FISH 554 Beautiful Graphics in R. (formally FISH507H, course website not available now)

Haskell course notes, CS240h, Stanford. recent offering.

Liang Huang’s course notes, Python, Programming Languages.


Balanced Search Trees Made Simple (in C#)

Fast Ensembles of Sparse Trees (fest)

An implementation of top-down splaying.

Largest Rectangle in a Histogram.

Monday, July 14, 2014

R Notes: Functions

R's semantics is modeled after Scheme. Scheme is a functional language and R is functional too. I am writing about the functions in R and many R's strange usages are just syntax sugars of special function calls.

What is rownames(x) <- c('A','B','C')?

y <- c(1, 2, 3, 4, 5, 6)
x <- matrix(y, nrow = 3, ncol = 2)
rownames(x) <- c("A", "B", "C")

##   [,1] [,2]
## A 1 4
## B 2 5
## C 3 6

How can we assign c('A','B','C') to the return value of rownames(x), yet the assignment effect is done on x?! This is because the statement is a syntax sugar for the following function call:

x <- `rownames<-`(x, c("A1", "B1", "C1"))

First, rownames<- is indeed a function! Because it has special characters, to apply it we need put " around the function name. Second, "rownames<-"(x, c('A1','B1','C1')) is a pure function call. It returns a new copy of x and does not change the row names of x. To take the assignment effect, we must assign the return value back to x explicitly.

The technical term for functionname<- is replacement function. The general rule:

f(x) <- y is transformed as x <- "f<-(x,y)".

Besides rownames, there are many other functions that have a twin replacement function, for example, diag and length.

A special case is index operator "[". Its replacement function is "[<-":

y <- c(1, 2, 3)

## [1] 1

`[<-`(y, 2, 100)

## [1]   1 100   3

More examples at R language defination 3.4.4.

Operators are functions

Operators in R are indeed function calls.

"+"(c(1,2), 2)  # same as c(1,2) + 2

## [1] 3 4

Because operators are functions, we can define new operators using function syntax:

# strict vector add
"%++%" <- function(x, y) {
n <- length(x)
m <- length(y)
if (n != m) {
stop("length does not match")
} else {
x + y

Self-defined operators become more powerful when used with S3 objects (see below).

S3 functions

There are two object systems in R, S3 and S4. S3 objects are lists, so its fields are accessed by the same operator to access list fields $. S4 objects are intended to be safer than S4 objects. Its fields are accessed by @. Many packages, especially those with a long history, such as lars, rpart and randomForest, use S3 objects. S3 objects are easier to use and understand.

Knowing how to implement an S3 object is very useful when we need to check the source code of a specific R package. And when we want to release a small piece of code for others to use, S3 objects provide a simple interface.

The easiest way to understand S3 objects is the following analogy:

S3 object/R list object C struct
S3 functions functions on struct

struct MyVec {
int *A;
int n;

int safe_get(struct MyVec vec, int i) {
if (i<0 || i>=vec.n) {
fprintf(stderr, "index error");
return vec.A[i];

In R, the S3 object is implemented as:

vec <- list()
vec$A <- c(1, 2, 3)
vec$n <- length(vec$A)
class(vec) <- "MyVec"

In the S3 object system, the method names cannot be set freely as those in C. They must follow a pattern: “functionname.classname”. Here my class name is MyVec, so all the methods names must end with .MyVec.

"[.MyVec" <- function(vec, i) {
if (i <= 0 || i > vec$n) {
stop("index error")

Let's implement the replacement function too:

"[<-.MyVec" <- function(vec, i, value) {
if (i <= 0 || i > vec$n)
stop("index error")
vec$A[i] <- value

Let's play with MyVec objects:


## [1] 3

vec[2] <- 100

## Error: index error

We can also add other member functions for MyVec such as summary.MyVec, print.MyVec, plot.Vec, etc. To use these functions, we don't have to specify the full function names, we can just use summary, print, and plot. R will inspect the S3 class type (in our case, it is MyVec) and find the corresponding functions automatically.

Default parameters and ...

Many functions in R have a long list of parameters. For example plot function. It would becomes tedious and even impossible for the end user to assign the values for every parameter. So to have a clean interface, R supports default parameters. A simple example below:

add2 <- function(a, b = 10) {
a + b

## [1] 15

What I want to emphasize in this section is ..., which is called variable number of arguments. And it is universal in R to implement good function interfaces. If you read the documents of R's general functions such as summary and plot, most of their interfaces include ....

Consider the following case: I am implementing a statistical function garchFit, say GARCH model calibration, I used a optimizer optim which has a lot of parameters. Now I need to think about the API of my GARCH calibration function because I want others to use it as well. Shall I expose parameters of optim in garchFit's parameters? Yes, since I want to give the users of my function some freedom in optimizing. But as we know a single procedure in optim such as l-bfgs would have many parameters. On one side, I want to give the user the option to specify these parameters, on the the side, if I expose all of them in my garchFit, the parameter list would go too long. ... comes to the rescue! See the following example:

f1 <- function(a = 1, ...) {
a * f2(...)
f2 <- function(b = 5, ...) {
b * f3(...)
f3 <- function(c = 10) {

## [1] 50

f1(a = 1, b = 2)

## [1] 20

f1(c = 3)

## [1] 15

A simple user of f1 would only need to study its exposed parameter a, while advanced users have options to specify parameters in f2 and f3 when calling f1.

Global variables

Global variables are readly accessible from any R functions. However, to change the value of a global variable, we need a special assignment operator <<-. Python has similar rules. See the following example:

a <- 100
foo <- function() {
b <- a # get a's value
a <- 10 # change a's value fails, (actually done: creates a local variable a, and assign 10)
c(a, b)

## [1]  10 100


## [1] 100

boo <- function() {
a <<- 10

## [1] 10

Here our scopes have two layers “global” and top-layer functions (foo and boo). When there are more layers, i.e., nested functions, <<- operator finds the variable with the same name in the closet layer for assignment. But it is generally very bad practice to have same variable names across different function layers (except for variables like i, j, x). assign is more general, check ?assign for document.

Variable scope

I think this is the most tricky part of R programming for C programmers. Because block syntax {...} does not introduce a new scope in R while C/C++/Java/C#/etc all introduce a new scope! In R, only a function introduce a new scope.

Please see my previous post: Subtle Variable Scoping in R

quote, subsititude, eval, etc.

Many language-level features of R such as debugging function trace is implemented in R itself, rather than by interpreter hack because R supports meta-programming. I will write another post for these special functions.

Saturday, July 12, 2014

R Notes: vectors


R is different from C family languages. It has a C syntax, but a Lisp semantics. Programmers from C/C++/Java world would find many usages in R adhoc and need to memorize special cases. This is because they use R from a C's perspective. R is a very elegant language if we unlearn some C concepts and know R’s rules. I am writing several R notes to explain several important R language rules. This is the first note.


The atomicity of R vectors

The atomic data structure in R is vector. This is so different from any C family language. In C/C++, built-in types such as int and char are atomic data structures while C array (a continuous data block in memory) is obviously not the simplest type. In R, vector is indeed the most basic data structure. There is no scalar data structure in R – you cannot have a scalar int in R as int x = 10 in C.

The atomicity of R vectors is written in many documents. The reason that it is usually skipped by R learners is that many R users come from C in which array is a composite data structure. Many seemingly special cases in R language all comes from the atomicity of R vectors. And I will try to cover them coherently.


x <- 10  # equivalent to x <- c(10)
x # or equivalent to print(x)

## [1] 10

y <- c(10, 20)

## [1] 10 20

What does [1] mean in the output? It means that the output is a vector and from index 1, the result is ... x is a vector of length 1, so its value is [1] 10, while y is a vector of length 2, so its value is [1] 10 20. For a vector with longer length, the output contains more indices to assist human reading:

z <- 1:25

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25

Vectors with different types

Though vectors in R are atomic. There are different vectors: int vector, float vector, complex vector, character vector and logical vector. Int and float vectors are numeric vectors. In above, we have seen int vectors. Let's see more types of vectors below:

x <- c(1, 2.1)

## [1] "numeric"

y <- c("a", "bb", "123")

## [1] "character"

z <- complex(real = 1, imaginary = 1)

## [1] "complex"

Notice that in R, string (In R's term: character type) is like int, float, logical types. It is not a vector of chars. R does not differentiate between a character and a sequences of characters. R has a set of special functions such as paste and strsplit for string processing, however R's character type is not a composite type and it is not a vector of chars either!

matrix and array

Matrix is a vector with augmented properties and this makes matrix an R class. Its core data structure is still a vector. See the example below:

y <- c(1, 2, 3, 4, 5, 6)
x <- matrix(y, nrow = 3, ncol = 2)

## [1] "matrix"

rownames(x) <- c("A", "B", "C")
colnames(x) <- c("V1", "V2")

## $dim
## [1] 3 2
## $dimnames
## $dimnames[[1]]
## [1] "A" "B" "C"
## $dimnames[[2]]
## [1] "V1" "V2"


##   V1 V2
## A 1 4
## B 2 5
## C 3 6


## [1] 1 2 3 4 5 6

In R, arrays are less frequently used. A 2D arrays is indeed a matrix. To find more: ?array. We can say that an array/matrix is a vector (augmented with dim and other properties). But we cannot say that a vector is an array. In OOP terminology, array/matrix is a subtype of vector.


Because the fundamental data structure in R is vector, all the basic operators are defined on vectors. For example, + is indeed vector addition while adding two vectors with length 1 is just a special case.

When the lengths of the two vectors are not of the same length, then the shorter one is repeated to the same length as the longer one. For example:

x <- c(1, 2, 3, 4, 5)
y <- c(1)
x + y # y is repeated to (1,1,1,1,1)

## [1] 2 3 4 5 6

z <- c(1, 2)
x + z # z is repeated to (1,2,1,2,1), a warning is triggered

## Warning: longer object length is not a multiple of shorter object length

## [1] 2 4 4 6 6

+,-,*,/,etc. are vector operators. When they are used on matrices, their semantics are the same when dealing with vectors – a matrix is treated as a long vector concatenated column by column. So do not expect all of them to work properly as matrix operators! For example:

x <- c(1, 2)
y <- matrix(1:6, nrow = 2)
x * y

##      [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 4 8 12

For matrix multiplication, we shall use the dedicated operator:

x %*% y  # 1 x 2 * 2 x 3 = 1 x 3

##      [,1] [,2] [,3]
## [1,] 5 11 17

y %*% x  # dimension does not match, c(1,2) is a row vector, not a col vector!

## Error: non-conformable arguments

The single-character operators are all operated on vectors and would expect generate a vector of the same length. So &, |, etc, are vector-wise logic operators.  While &&, ||, etc are special operators that generates a logic vector with length 1 (usually used in IF clauses).

x <- c(T, T, F)
y <- c(T, F, F)
x & y


x && y

## [1] TRUE

math functions

All R math functions take vector inputs and generate vector outputs. For example:


## [1] 2.718


## [1] 2.718

exp(c(1, 2))

## [1] 2.718 7.389

sum(matrix(1:6, nrow = 2))  # matrix is a vector, for row/col sums, use rowSums/colSums

## [1] 21

cumsum(c(1, 2, 3))

## [1] 1 3 6

which.min(c(3, 1, 2))

## [1] 2

sqrt(c(3, 2))

## [1] 1.732 1.414


NA is a valid value. NULL means empty.


## [1] NA



c(NA, 1)

## [1] NA  1

c(NULL, 1)

## [1] 1



*I find Knitr integrated with RStudio IDE is very helpful to write tutorials.

Sunday, March 30, 2014

Book review: The Geometry of Multivariate Statistics


This book was written by the late professor Thomas D. Wickends, and is by far the best book on understanding linear regression I have read. As written in the preface, most books on multivariate statistics approach it via two aspects: algebra and computation. Algebraic approach to the multivariate statistics, and multiple regression in particular, is a self-contained one. It is built on linear algebra and probability and therefore can use these two to develop all the theorems and algorithms, and give rigorous proves on the correctness and error bounds of the methods. Computational approach is to teach the students on how to run the software and interpret the results. For example, reading the p-values of the regression coefficients. However the computational approach, because of its lack of diving into the theories, does not tell one how F-test and the corresponding p-values will change when the properties of the data change. And the computational approach would leave statements like “insignificant p-values don’t mean no correlation with the target” at a superficial level without any deep understanding. The geometrical approach taken by Prof. Wickends has the exact aim: to understand the meanings behind the linear algebra equations and the output of the software!

Geometry is a perfect tool for understanding multivariate statistics because first it can help understand linear algebra (quite a few linear algebra books take a geometry approach) and second it can help understand probability distribution as well. For one-dimension distributions, we think in terms of PDF and CDF curves. And for multivariate distributions, the pdf contour in the high dimensional space is elliptical. This book combines these two geometrical understandings in a unified framework and develops several key concepts of multivariate statistics in only 160 pages!

The main feature of this book is that every concept is developed geometrically, without using linear algebra or calculus at all. In viewing the data points, there are two spaces to view them: variable space (row space) and subject space (column space). When the number of variables exceeds three, it becomes hard to think about/visualize the data points in the variable space, yet it is still conceptually very easy to see them in the subject space! And most of the concepts in this book are developed in this subject space, only occasionally the author switches to the variable space to provide another view (e.g. in the PCA chapter). For example, the linear regression equation:


(all vectors here are demeaned, so there is no offset parameter a)

Basically this equation defines a linear space that are spanned by x_1 to x_p. The aim of the linear regression is to make \hat{y} as close to the target y as possible. In the subject space:


Where e is the residual vector, which is perpendicular to the space defined by xs when the distance (|e|) is minimized. By setting the perpendicular constraints as equations, we can drive the normal equation of linear regression, which in an algebraic approach is usually defined directly as X’X b = X’y rather than derived from geometrical intuition.

Other multivariate statistics book (e.g. Johnson & Wichern) though at places introduce a geometrical explanation. These bits are however embedded inside a large set of linear algebra equations. That makes the geometrical explanation in those book only supportive for the algebra equations, and therefore are not self-contained! While in this book, everything has to be developed geometrically, there is no corner that the author can hide a geometry concept that has not developed before. For example, this book gives a very good explain of the demean operation. In multiple regression, the demean operation introduces a space defined by a 1 vector (1,1,…1)’, and the original y and xs share their mean components in this same space. And this space is perpendicular to the residual vector e, therefore e’1 = 0! (If e’1!=0, then e has a non-zero projection to 1-space, which can then be absorbed into the constant offset in the regression equation.)

For practitioners, section 4.2 (goodness of fit) and chapter 5 of this book are most useful. The coefficient of determination is interpreted geometrically as:


Chapter 5’s title is Configurations of multiple regression vectors, which explains how the regression becomes for different configurations of xs. When the predictors (xs) are highly correlated, the regression becomes unstable (more analysis in Chapter 6). And the book proposes several approaches to deal with the multicollinearity problem. One of them is not to minimize |e| alone (as the position of the regression space defined by the collinear xs is not stable), and |e| + some criterion that stabilizes the position of the regression space. This would lead to ridge regression and other variants. Ridge regression is l2-regularized regression, and the regularization is to “control model complexity”. And in the situation of multicollinearity, we can a much better understanding of what on earth is model complexity through geometry!

Other points in this book: The geometry of linear algebra meets the geometry of probability in Chapter 6 (tests). Geometry interpretation of analysis of variance (Sec 3.4 & Chapter 8), PCA (Chapter 9), and CCA (Chapter 10). Each chapter of this book refreshes some of my memory and adds something new, geometrically. Strongly recommended for anyone who works with regression problems and does data mining in general.

This last paragraph is not about this particular book, but is about my motivation of reading statistics books recently. Many data miners understand regression as loss minimization, probably plus knowing that a small p-value is good indication of a useful feature (this is indeed not accurate). I was such a data miner until half a year ago. But few of them know how the p-value in multiple regression (and in other models, e.g. logistic regression) is actually calculated. Software packages nowadays are very advanced and make models like linear regression/pca/cca seem to be as simple as one line of code in R/Matlab. But if we don’t know them by hand and by heart and understand the various properties of these simple models, how can we feel confident when applying them? On the other side, as researchers in data mining, how can we confidently modify existing models to avoid their weakness in a particular application if we only have a shallow understanding of the weakness of a model? Therefore, for example, we cannot stop our understanding of regularization at only one sentence “controlling model complexity.” We need go deeper than that, to know what is behind that abstract term “model complexity”. This is why recently while writing my phd thesis, I am also learning and re-learning statistics and machine learning models. I should have done this say four years ago! But it is never too late to learn, and learning them after doing several successful data mining projects gives me a new perspective.

Monday, October 7, 2013

A Collection of Interview Questions on Arrays


One dimensional array is probably the most important data types in programming languages. First of all, computer memory supports random access and can be treated as a big array. Second, other data structures can be implemented/simulated using arrays, e.g. strings, stacks, queues, binary heaps, linked lists, binary search trees, graphs can all be implemented using arrays. Because of this fundamentality, questions on arrays are commonly asked during a technical interview. The hardness level of such questions may vary from implementing a strcpy function to applying algorithmic tricks. In this post, I will try to cover a diverse set of these array questions, and I will keep this post updated when encountering new problems.

Coding interview questions usually test two aspects:

1. The ability to write correct and stylish programs. The interviewee should write syntactically correct C-family language code, and prove the correctness of the program by checking boundary conditions, loop invariants, etc. If the problem is relatively easy, not only the correctness but also the cleanness and readability of the code count.

2. The ability to think quickly and solve hard algorithmic problems. Google/Facebook/MSFT tend to ask tricky questions in their interviews. Most experienced programmers don’t think that using ACM/ICPC and algorithmic problems really test a programmer’s ability to code. However, this is a good way to see if the interviewee is clever or not (and clever people learn and adapt fast). More importantly, the coding of these problems will show how can you compact several coding tricks in, say, 20 lines of code.

Actually many problems here test both abilities, and in many cases writing correct programs with a good style weighted more than knowing the algorithmic trick to a specific problem.

Several problems in my collection are taken from a problem series by Chen Liren, link. If you read Chinese, I highly recommend this series, which contains many good questions and experience reports from real interviews.



The problem descriptions and some keywords use bold fonts.


Maximum subarray problem. This problem is so famous that it has its own Wikipedia page. It was popularized by Jon Bentley in his CACM column Programming Pearls and later collected in the book with the same name.

This problem can be solved in O(n) with O(1) space using dynamic programming. The DP solution itself is not very interesting. What interests me is two others solutions, one using a technique called cumulative sum, and the other using a more general technique -- divide and conquer.

Cumulative sum is a very useful pre-computing technique to allow O(1) range queries. It costs O(n^2) for the pre-computing thought.

Divide and conquer applies on a lot of sequence problems, including many sorting and searching algorithms. The trick is to design an O(n) merging step. In this problem O(n) merging is simple. In other problems, e.g. closest point pair, O(n) merging is non-trivial.

An extension of this problem is to find the maximum submatrix. An O(n^3) algorithm is expected.

Another variant of this problem: find two disjoints subarrays A and B such that |sum(A)-sum(B)| is maximized. Suppose A is on the left and B is on the right, for position i, we need the maximum/minimum subarray in range [1,i], and [i+1,n].

Given an array that is a rotation of a sorted array (elements are unique). The rotation of an array is to split the original array into two parts and concatenate the left part to the tail of the right part. E.g., {3,4,5,1,2} is a rotation of {1,2,3,4,5}. Now given this rotated array, find the smallest element.

Obviously we can do a linear scan and find the first number that is smaller than its left number. But linear scan costs O(n). Because the original array is sorted, it is natural to know binary search might get the running time to O(log n). In designing any binary search algorithm, it is important to prove that its correctness using loop invariant. [Bentley’s pearls and CLRS both cover loop invariant.] The algorithm sketch:

l=1, r=n

while a[l]>a[r] do
if (l+1==r) then
return r
if a[m]>a[l] then
return a[l]

Another binary search problem: given a sorted array, if A[i] == i, i is called magic index. Find the magic index.

Another binary search problem: given an array A=BC where B is a sorted array with non-increasing order, and C is a sorted array with non-decreasing order. Find B and C.

Not binary search, but incremental search: given an array A in which any two consecutive numbers diff by 1. Find the index of a target number T. E.g. A= {4,5,6,5,6,7,8,9,10,9}, and T=9.


Given an array of integers, write an algorithm that shuffles the array so that all odd numbers and ahead of even numbers.

This problem tests the understanding of quicksort algorithm. In quick sort’s partition procedure, once we fix a pivot, there is an O(n) algorithm that puts all elements less than or equal to the pivot to the left, and all elements that are greater to the right. In this problem, we shall redefine the comparison function that any odd number is “smaller” than any even number.

If C/C++ used, then write a general partition procedure which takes a comparison function pointer(C)/comparison operator(C++) as input would be nice.

A variant of this problem: 3-way partition. Given an array and i, design an in-place algorithm that puts all number < A[i] to the left, and all numbers = A[i] in the middle, and the rest in the right.

Another variant of partition algorithm: sort an array of 1/2/3 (don’t count # of 1,2,3), use a new partition algorithm instead.


Inversions. Given an array, calculates the numbers of pairs (i,j) such that A[i]>A[j].

This problem has several solutions. One is to modify the merge procedure in merge sort, to count the inversions in each merge procedure. Other solutions require using advanced data structures, and we don’t cover them here.

long long res; 

void merge(int l, int m, int r)
int n1=m-l+1, n2=r-m;
for (int i=0; i<n1; i++) L[i]=a[l+i];
L[n1] = INF;
for (int i=0; i<n2; i++) R[i]=a[m+i+1];
R[n2] = INF;

int j=0,k=0;
for (int i=l; i<=r; i++)
if (L[j]<=R[k]) a[i]=L[j++];
else { a[i]=R[k++]; res+=n1-j; }

void merge_sort(int l, int r)
if (l<r) {
int m=(l+r)/2;

Majority. Given an array of n integers, find the number that occurs more n/2 times, or report that such a number does not exist. E.g., in array {1,2,2,2,3}, 2 is such a number.

Solution 1: If such a number exists, then the n/2-th number in the sorted array must equal to it. Then we can first use a linear algorithm to find the n/2-th number in an unsorted array. In the literature, this k-th number problem is called selection problem, which has a complicated linear algorithm by Blum et. al. (four of the five authors of this paper were awarded Turing award!), and a O(n)-average time algorithm using quicksort variant.

Solution 2: a property of this problem: removing any two numbers that are different -> the frequent number remains the same. A walk through of the array using this property can get a candidate of the frequent number. Another linear scan checks whether the candidate occurs more than n/2 times.

A variant of this problem is to find a number that occurs more than n/3 times. The above two solutions still apply.


Given an array of integers, concatenate all numbers in some order so that the combined string is lexicological smallest. E.g., {3,32,321}, solution is 321323.

Define a new comparison operator of two numbers and sort the array. The new comparator is

compare x y 
if str(xy) > str(yx) then
else if str(xy) < str(yx) then

The tricky part is to prove that this does yield the smallest string. To prove this order is correct, we need to prove three properties: reflexive, symmetric and transitive.

Given a sorted array and an integer x, output the number of times this integer occurs in the array.

Linear scan costs O(n). A straightforward binary search can spot the first occurrence of this integer, and we continue to count until the next number is not equal to it. However, this integer can occur O(n) times. A workaround is to do binary search again at the first occurrence. Suppose the first occurrence is at position i, and there are m numbers to the right. First, we check position i + m/2, if equal, we check i + m/2 + m/4, if not equal, we check i + m/4, and so on. However, this is non-trivial to get correct because there are a few boundary conditions.

The best solution to this problem is to use two binary search functions in C++ STL: lower_bound and upper_bound. lower_bound finds the smallest position i such that A[i]>=x, upper_bound finds the smallest position i such that A[i]>x. The solution to this problem is simply upper_bound(x)-lower_bound(x). The two functions already take care of boundary conditions and cases where x does not occur in the array.

Given an array of integers, all integers occur exactly twice except for one. Find the only integer that occurs once.

Use bit trick. x xor x = 0. (if a bit occurs twice, set it to 0)

Variant: Given an array, except for one X, all other numbers occur exactly 3 times. Find X.

Now bit trick does not work. However, the idea is the same. If a bit occurs 3 times, set it to 0.


Integers and sum. Subproblem 1: given a sorted array, find two numbers that sum to m. Subproblem 2: the array is not sorted. Subproblem 3: find four numbers that sum to m.

For subproblem 1, there is a trick called ruler reading. Keep two pointers to the left and right of the array, and move the pointers until the sum if m. O(n)

For subproblem 2, hasing or sorting.

For subproblem 3, there is trick called two-head search. First, we get the sums of all pairs (O(n^2)), sort these sums or put them in a hash table. for a+b+c+d=m, we now only need to enumerate a and b, and check whether c+d exists using binary search or hashing.

Another problem using ruler reading technique: given an array of positive integers, and S. Find a subarray with minimal length and the sum of the subarray is >= S.

Arrange 0 to n-1 in a circle, starting from 0, delete every m-th number. Output the delete numbers in sequence. E.g, {0,1,2,3}, m=2, the sequence is 1,3,2,0.

Solution: O(n), in every step, use module operation to find the exact number.

Given an array of integers, find a maximum subset that constitutes a sequence of consecutive integers. E.g., {15, 7, 12, 6, 14, 13, 9, 11}, the answer is {11,12,…,15}.

Obvious solution is to sort the numbers in O(n log n) and then do linear scan. The solution requires a running time better than O(n log n). Hint: make the consecutive numbers into same set, and use disjoint set algorithms to have O(n log log n) running time.

Given k sorted arrays, find a range [a,b] with minimal length such that each array has at least one element in [a,b].


1: 4, 10, 15, 24, 26

2: 0, 9, 12, 20

3: 5, 18, 22, 30

Minimal range is [20,24].

If we have all the numbers in k arrays sorted, and mark the numbers with labels 1 to k, then our minimal range should contain all k labels. This range checking can actually be done while we do merge sort of the k arrays. Hint: set a pointer for each array.

Given an array of 0/1, find the longest subarray such that the number of 0s and the number of 1s are the same.

Using the cumulative sum technique, the problem can be solved in O(n^2). Hint: fill 0 with -1, and use cumulative sum technique again.

Largest Rectangle in a Histogram. Given an array of heights, calculate the rectangle with largest area.

Solution: pre-compute two arrays L and R, L[i] indicates the most left position where h[L[i]]>=h[i].The following C++ code snippet calculates L (R would be similar):

int t = 0; 
for (int i=0; i<n; i++) {
while (t>0 && h[st[t-1]] >= h[i]) t--;
L[i] = t==0?0:(st[t-1]+1);

Many other solutions to this problem:

Variant: Given an matrix with 0/1, find the max submatrix with only 1s.

A much easier problem: given an unsorted array A, find in linear time, i and j (j>i), such that A[j] – A[i] is maximized.

In-place shuffle array: 1, 2, 3, …, 2n => 1, n+1, 2, n+2, etc. solution: Peiyush Jain, A Simple In-Place Algorithm for In-Shu?e.

Coins change problem. Given an array A and X, find how many subsets of A sum to X. Solution: DP.

Given two arrays S and T. Each is a permutation of 0 to n-1. By swapping 0 and another number repeatedly, change S to T. If minimal swaps are required, how to do it?

Solution: get a position correct in every step. If minimal swaps are required, use BFS/short path algorithm.

Given an unsorted array, find the longest arithmetic progression. Case 1: keep order of the original array. Case 2: no need to keep order.


For case 1: use this hash data structure to store the n(n-1)/2 pairs of difference:

map<int, set<pair<int, int> >

difference -> sorted set of (i,j) pairs [difference = a[j]-a[i] ]

For case 2: since the order does not matter, we first sort the array in O(nlogn), and then the longest arithmetic progression is a sub sequence of the sorted array. This can be solved by DP:

dp[i][j] = dp[j][k] + 1, if i<j<k, a[i]-a[j]=a[j]-a[k]

dp[i][n-1] = 2, for all i < n-1


Given an array of n unique numbers from 1 to n, delete 1/2/k numbers, how to identify the deleted number/numbers?

For k=1, calculate sum.

For k=2, calculate sum and sum of squares.

For k=3, calculate sum, sum of squares, and sum of cubes.

Time: O(nk)

Another solutions: count the 0/1 bit at each position. Time: O(n log n)

An array of length n+1 is populated with integers in the range [1, n]. Find a duplicate integer (just one, not all) in linear time with O(1) space. The array is read-only and may not be modified. Variation: what if the array may be written into but must be left unmodified by the algorithm?

Solution: and

The best solution requires solving several nontrivial subproblems, which are actually linked list problems!

Given an unsorted array A of n distinct natural numbers, find the first natural number that does not appear in the array.


I first saw this problem when reading R. Bird’s pearls of functional algorithm design, in which this problem is the first pearl. There are at least two good solutions (favor different tradeoffs) to this problem:

1. From 0 to n, there must be an integer that does not appear in A. We use an extra bool array with size n+1 and scan the array to mark occurred numbers. Then we can scan the bool array to find the first non-mark number as the solution. This solution is very easy to implement and costs O(n) time and O(n) space. Can we reduce the space cost?

2. Divide and conquer. Let’s check this property: from low to low+n, there must be a number that does not occur in A where low is the smallest (possible) number in A. We just need to guarantee that A does not have numbers smaller than low. Besides low, if we also know the upper bound value of A (we define upper bound as the number that is bigger than all numbers in A). So there are at most high-low distinct numbers in A. If and only if high-low==n, the numbers in [low, high) all occur in A once. If high-low>n, then there is a number in [low, high) that does not occur in A.

With this property, we can design a divide and conquer algorithm that in each step discard half of the numbers, the following is an implementation from Bird’s book:

minfree xs          = minfrom 0 (length xs, xs)
minfrom a (n, xs) | n == 0 = a
| m == b-a = minfrom b (n-m, vs) -- discard the numbers < b
| otherwise = minfrom a (m, us) -- discard the numbers >= b
where (us, vs) = partition (< b) xs -- partition in quicksort
b = a + 1 + n div 2
m = length us

This method uses O(n) time and O(log n) space. But the space cost can be reduced to O(1) if we write a iterative version of the above procedure. And different from linear selection algorithm which is only O(n) on average case, the algorithm here is strictly O(n).



Two problems on permutation arrays:

Find the next permutation. This function is implemented in C++ STL. Its Wikipedia page has a solution. The following is an OCaml implementation:

let next_permutation a = 
let swap i j =
let tmp = a.(i) in

a.(i) <- a.(j);
a.(j) <- tmp;
let n = Array.length a in
let k = ref 0 in
let i = ref (-1) in
(* find largest i such that a(i) < a(i+1) *)
while !k<n-1 do
if a.(!k) < a.(!k + 1) then
i := !k;
k := !k + 1
if !i = -1 then None (* if such i does not exist, no next permutation *)
else begin
(* find largest j such that a(j) > a(i) *)
let k = ref (!i + 1) in
let j = ref !k in
while !k<n do
if a.(!k)>a.(!i) then
j := !k;
k := !k + 1
(* swap i j *)
swap !i !j;
(* reverse all after index i *)
let k1 = ref (!i + 1) in
let k2 = ref (n - 1) in
while !k1 < !k2 do
swap !k1 !k2;
k1 := !k1 + 1;
k2 := !k2 - 1;
Some a;

Find the inverse of a permutation. E.g. A = [4,2,1,3] is a permutation, its inverse is B = [3,2,4,1], i.e. B[A[i]] = i (index starts from 1).

Solution: inverse several circle linked lists.


Problems with no good solutions yet:


Given an array A with length n, filled with 1 to n. Some numbers occur multiple times, some don’t occur. Identify these two groups of numbers in O(n) time and O(1) space.

Given an array, some numbers occur once, some twice, and only one X three time. Find X. (No solution yet)