Tuesday, May 4, 2010

Why F# is the language for data mining

UPDATE:
25 Sep 2012: the library libml is on Bitbucket: https://bitbucket.org/blackswift/libml/. The code was written more than two years ago and has not been maintained since. It was meant to accompany this article, not to be used in practice.

F# is a newly released language in Visual Studio 2010, the fourth language on the .Net platform officially supported by Microsoft, alongside C#, VB.Net and C++/CLI. It was a research project within Microsoft Research Cambridge for more than 5 years, and its origins date back to the ML language designed by Turing Award winner Robin Milner in the 1970s.
As a new yet old statically typed functional language designed for the .Net platform, it will spread quickly in the coming years. It is such a joy to explore its use in data mining, or more broadly in statistical computing, that I now share some of my personal experience with it by trying to answer one question:
Why F# is the language for data mining.
This article is divided into 3 sections:
1. a review of data mining environments and a summary of their most favored features.
2. the answer: I list 7 reasons with 2 short examples.
3. a preview of a data mining library in F#, which will be open sourced this summer.
Before we get to the answer, let's look at some existing data mining environments.

Data mining environments

Commercial:
SAS and SPSS are the main commercial data mining software packages. SAS has a mature macro language and reasonable GUI support; SPSS has good GUI support. Their data mining algorithms are written in C/C++, but the user doesn't need to know about the underlying languages: the data mining API is provided by a macro language (SAS) or a GUI (SPSS). A macro language or GUI makes it much easier for users to perform data mining tasks.
There are also other products, e.g. the decision tree family developed by Salford.
Matlab is perhaps the most popular software in the data mining community, especially in machine learning. Although Matlab is not free, it is easily available in schools and research institutions, and there are a lot of free Matlab programs for data mining tasks. Matlab is so popular that putting a data mining algorithm name + "matlab" into Google gets what you need. Another reason is that matrix operations in Matlab are efficient and convenient, and the data tables in data mining are either dense or sparse matrices.
Free/Open source:
There is quite a lot of open source software of varying quality. Here's a list of those I am interested in:
1. R. R is for statistics. The core of R contains the R language and the stable statistical procedures; the data mining part is a set of external packages. Most of the data mining models are available, e.g. decision trees, neural networks, SVM, etc. R is a very capable language; however, it is not popular in the data mining and machine learning communities, where Matlab is preferred. R's core does not support sparse data, so many extensions do not consider this support either, which is extremely important: a lot of datasets in data mining are sparse, so a tool designed for sparse and large-scale data from the beginning is critical.
2. Weka. Weka is the most famous open source software in the data mining and machine learning field. Data mining has three perspectives: database, machine learning and statistics [2]; Weka comes from the second. Weka is widely used in data mining courses, where instructors can show a lot of data mining algorithms to students, turning the formulas in the book into several mouse clicks and the resulting figures and graphs. But who uses Weka for serious data analysis? As far as I know, most people only use Weka's decision tree implementation J48, because decision trees are really useful sometimes, Weka is handy for building a tree, and its results are close to the standard C4.5's. But its usage is usually on dense datasets, as building a tree on a sparse dataset costs a huge amount of memory in Weka. Weka does support sparse datasets; however, the performance is very bad: the speed is slow and the memory usage is unacceptable. The design of Weka is too general and thus sacrifices speed.
Weka is written in Java, so extensions to Weka, e.g. a different boosting scheme for trees, are usually written in Java too. This kind of task is better done in an interpreted scripting language like R or Matlab, not in Java. There are JVM languages with interactive shells, most notably Scala and Clojure; calling Weka from Scala or Clojure would be a good idea.
3. Orange. Orange is written in Python and C++. All data mining algorithms are coded in C++ for performance's sake; Python is used as the scripting language to call the algorithms and to build the GUI. Using Python in data mining is a great idea: it is interpreted, concise to read and write, and easy to bind to C/C++. Orange only supports dense datasets.
4. Lush. Designed by Prof. Yann LeCun, a big figure in machine learning and a definite expert. There are not many researchers who write great programs and great papers at the same time; Prof. LeCun is one of the few. Lush is a Lisp environment with bindings to algorithms written in C/C++. The neural network code in it is mature. Lisp is good and very expressive; however, nowadays not many people write it.
Having seen this set of data mining software, we can summarize

the most favored features of a data mining environment:

1. Data mining algorithms are built in. The implementations should be reliable and efficient, and should support dense and sparse datasets equally well. They should also be robust, e.g. effectively handle missing values, large datasets, etc.

2. A scripting language and an interactive shell. The scripting language should make data preprocessing, calling data mining procedures, visualization and reporting easy.

3. Math: matrices and vectors (both sparse and dense), linear algebra, common numerical optimization. Sometimes we need to design new algorithms or modify existing ones to suit a specific data mining application. In that case these math tools are really important; e.g. a simple logistic regression could be done in several lines of code if these math modules are available (see the sketch after this list).

4. GUI is preferred, but not a must. A GUI is good for data exploration and visualization, e.g. Excel is used as a front end for a lot of Business Intelligence solutions because of its great GUI and stability. When performing data mining, most of the time we don't need a GUI (just as when we use Matlab, we seldom use its START menu and prefer the command line).
5. An integrated system. You use tool A to download the data, tool B to clean the data, tool C to do some matrix operations on the data; finally you get a dataset and put it into an SVM command line tool. That's not the end: you still need to parse the output of the SVM to get the result. And consider that you may need to repeat this process several times. What's wrong here? The tools are scattered. The Unix philosophy "do one thing and do it well" does not apply here, because the data structures passed from one step to the next are very complex. This is why we need an integrated system, in which at least data processing and model building are connected seamlessly, as in Matlab.
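
To make point 3 concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent, written directly against plain float arrays. All the names here (sigmoid, dot, logisticRegression) are illustrative, not part of any library:

let sigmoid z = 1.0 / (1.0 + exp (-z))
let dot (w: float[]) (x: float[]) = Array.fold2 (fun acc a b -> acc + a * b) 0.0 w x

// xs: feature rows; ys: labels in {0.0, 1.0}; rate: learning rate
let logisticRegression (xs: float[][]) (ys: float[]) rate iters =
    let w = Array.zeroCreate xs.[0].Length
    for _ in 1 .. iters do
        for i in 0 .. xs.Length - 1 do
            let err = ys.[i] - sigmoid (dot w xs.[i])    // prediction error
            for j in 0 .. w.Length - 1 do
                w.[j] <- w.[j] + rate * err * xs.[i].[j] // gradient step
    w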

Weka has all 5 points; however, none of them are done very well. E.g. Java is hard to use as a scripting language: you need to write many lines of code and compile them to perform an easy task, and using Java also requires you to be very familiar with Weka's Java class hierarchy, which is just huge. Java could also be used for data preprocessing, but few use it for this purpose; Python, Perl or F# are more appropriate.

F#

The following lists the notable features that I think are helpful for data mining.
1. Succinct, yet statically typed. These two features work together. When we write programs, we need to think about both the programming logic and the program text itself, e.g. writing the loop, defining a variable, writing its type, etc. The programming logic is the core thing, but every language has a syntax to obey, so we write a lot of other things, which is distracting because you must think about more things at the same time. More succinct means you don't need to write so many 'declarations' in your programs. Python is also succinct, but it is not statically typed, which means type checking is done at runtime, and dynamic typing is slow. In most cases F# does not need written types, the rest of its syntax is also succinct, and yet it is statically typed. Combining the two means we can focus on the programming logic while keeping the performance.
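As a small illustration of succinctness plus static typing, the following function is fully statically typed although only one annotation is written (wordCount is a made-up example, not library code):

let wordCount (text: string) =
    text.Split([|' '|], System.StringSplitOptions.RemoveEmptyEntries)
    |> Seq.map (fun w -> w.ToLower())
    |> Seq.countBy id
    |> Seq.toList
// inferred type: string -> (string * int) list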
2. Functional programming style. F# is functional, so the ABCs of functional programming also apply to F#. Here A could be 'functions as first-class values, or higher-order functions', which means you could write:
a |> split |> Seq.filter notStopWord |> Seq.map stemmer
to perform string splitting, filtering with a self-defined filter and word stemming in a single line!
In an OOP language, this would require a pipeline design pattern, which is far more complex than the higher-order functions here.
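Here is a runnable version of that one-liner, with toy stand-ins for split, notStopWord and stemmer (all three are hypothetical helpers, not library code):

let split (s: string) = s.Split([|' '|]) |> Seq.ofArray
let notStopWord w = not (List.exists ((=) w) ["the"; "a"; "of"; "is"])
let stemmer (w: string) = if w.EndsWith("s") then w.Substring(0, w.Length - 1) else w

let a = "the cats of athens"
let tokens = a |> split |> Seq.filter notStopWord |> Seq.map stemmer
// tokens = seq ["cat"; "athen"]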
3. Functional and parallel. A function is pure if it does not change anything outside the function, and every call with the same arguments returns the same result (this is a weaker version of purity). Based on this, we can easily parallelize pure functions. In data mining, in most cases we can simply split the data into several parts to get parallel execution.
As an example, the following is a piece of code that performs multiple k-means clusterings at the same time:
if isParallel then
    let dorepeats =
        {0..restart-1} |> PSeq.map (fun _ -> doClustering(ds, k, maxiter, init, algo) :?> kmeansResult)
    dorepeats |> PSeq.toList |> List.min :> IClusterResult
    (* use async:
    let dore =
        {0..restart-1} |> Seq.map (fun _ -> async { return doClustering(ds, k, maxiter, init, algo) :?> kmeansResult })
    dore |> Async.Parallel |> Async.RunSynchronously |> Array.min :> IClusterResult
    *)
else
    //printfn "no parallel"
    let dorepeats =
        {0..restart-1} |> Seq.map (fun _ -> doClustering(ds, k, maxiter, init, algo) :?> kmeansResult)
    dorepeats |> Seq.min :> IClusterResult
Note that the difference between sequential code and parallel code is minor.
4. OOP. At the high level of designing an application, the design patterns come from the application itself; a data mining application has its own working patterns, and design patterns are what OOP is good at. F# provides limited but concise support for OOP. E.g. the following is the interface for a learner and a predictor:
type IPredictor =
    abstract member Predict: gvector -> int
    abstract member Predict: datavecs -> int array
    abstract member Predict: dataset2 -> float * (int array)

type IProbPredictor =
    abstract member Predict: gvector -> int * (float array)
    abstract member Predict: datavecs -> (int array) * (float array array)
    abstract member Predict: dataset2 -> float * (int array) * (float array array)

type ILearner =
    abstract member Learn: dataset2 * parameters -> IPredictor
A classification algorithm has two components: a learner and a predictor. All learners implement the ILearner interface; all predictors implement the IPredictor interface, and a predictor also implements the IProbPredictor interface if it supports probability output. Because the code is so concise, I can change these interfaces more easily than in C#. A toy implementation is sketched below.
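Here is a minimal sketch of a trivial majority-class learner against these interfaces. The concrete types gvector, datavecs, dataset2 and parameters are simplified stand-ins for the library's own types (in a standalone script, define them before the interface declarations above); both types here are illustrative, not library code:

type gvector = float []
type datavecs = gvector []
type dataset2 = { Vectors : datavecs; Labels : int [] }
type parameters = Map<string, float>

type majorityPredictor(label : int) =
    interface IPredictor with
        member this.Predict (v : gvector) = label
        member this.Predict (vs : datavecs) = Array.create vs.Length label
        member this.Predict (ds : dataset2) =
            // returns (accuracy on the dataset, predictions)
            let preds = Array.create ds.Vectors.Length label
            let correct = ds.Labels |> Seq.filter ((=) label) |> Seq.length
            float correct / float ds.Labels.Length, preds

type majorityLearner() =
    interface ILearner with
        member this.Learn (ds : dataset2, p : parameters) =
            // pick the most frequent label in the training set
            let majority = ds.Labels |> Seq.countBy id |> Seq.maxBy snd |> fst
            majorityPredictor(majority) :> IPredictor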
5. Interpreted language. This is simply a must! Everyone wants to write a single line that performs data loading and model building together and see the result immediately; F# Interactive provides exactly this.
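For instance, an F# Interactive session could look like this (loadCsv and kmeans are hypothetical names standing in for the library's functions):

> let data = loadCsv "train.csv";;
> let model = kmeans data 5;;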
6. .Net! If F# had the above 5 but not this one, it would not be successful. Being tied to .Net has a lot of advantages, most notably: a) a stable environment and a lot of stable libraries; b) writing components in C# when needed; c) components written in F# can be used by other .Net languages with no effort; d) Visual Studio 2008 and 2010 are very good IDEs; e) Microsoft supports it, and the 'evil company' is more responsible for its products than others; and finally f) the performance of .Net is very good.
7. IO and preprocessing! This point is subsumed in the previous points; however, it is so important that I separate it out. A lot of software simply assumes that data mining is the process of applying an algorithm to a dataset. This is WRONG! It is said that 80% of data mining work is preparing the data to get a dataset [1]. To prepare the data we need to do a lot of IO: from the Internet, databases, XML files, etc. .Net provides the most stable yet simple way to do these tasks, and the F# language is also well suited to data preprocessing. As an interpreted language, we can test and see the result immediately.
I use two examples from the code in my data mining library to show F#'s strength.
Example 1. Word Tokenizer.
The following code is the Tokenizer module for text mining. It is short, yet can perform various tokenizing tasks: removing stop words, stemming, bi-grams, etc. The conciseness comes from F# as a functional language and the .Net library.
// assumes: open System, with isStopWord and stemmer defined elsewhere in the library
module Tokenizer =
    let sep = [|' '; '\r'; '\n'; '-'; '.'; ','; '\t'; '!'; '?'; '\''|]

    let unigramToken (s:string) =
        s.Split(sep, StringSplitOptions.RemoveEmptyEntries)
        |> Seq.filter (fun w -> not (isStopWord w))
        |> Seq.map (fun s -> s.ToLower())

    let unigramTokenWithStem s = s |> unigramToken |> Seq.map stemmer

    let bigramToken (s:string) =
        let uni = s.Split(sep, StringSplitOptions.RemoveEmptyEntries)
        // concatenate adjacent words to form bi-grams
        let bi = seq { for i = 0 to uni.Length-2 do yield uni.[i] + uni.[i+1] }
        bi |> Seq.filter (fun w -> not (isStopWord w)) |> Seq.map (fun s -> s.ToLower())

    let bigramTokenWithStem s = s |> bigramToken |> Seq.map stemmer

    let defaultToken = unigramTokenWithStem
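A quick usage example (the exact output depends on the library's actual stemmer and stop word list):

"The cats are running" |> Tokenizer.defaultToken |> Seq.toList
// stop words removed, the remaining words lower-cased and stemmed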
Example 2. Parallel Tasks.
You have probably seen similar code before as an example of F# asynchronous programming. Here I changed the 'example' code so that it disguises itself as a Mozilla browser (useful when you try to crawl Google search results); it also does some retries and supports exception handling. The code is so short!! Using it to download 10000 posts in parallel (I needed some special posts from a question answering forum for my research) took only half an hour.
// assumes: open System.Net and open Microsoft.FSharp.Control.WebExtensions
// (for AsyncGetResponse); AsyncReadToEnd comes from the F# PowerPack
let fetchUrl (url:string) =
    async {
        try
            let req = WebRequest.Create(url) :?> HttpWebRequest
            req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
            req.Method <- "GET"
            req.AllowAutoRedirect <- true
            req.MaximumAutomaticRedirections <- 4
            let! response1 = req.AsyncGetResponse()
            let response = response1 :?> HttpWebResponse
            use stream = response.GetResponseStream()
            use streamreader = new System.IO.StreamReader(stream)
            return! streamreader.AsyncReadToEnd()
        with
        | _ -> return "" // if there's any exception, just return an empty string;
                         // checked: pages that cannot be retrieved are mostly already
                         // deleted, not failures of the downloader
    }
let getPages pageIds rooturl =
    pageIds |> Seq.map (fun id -> let name = rooturl + id.ToString() in fetchUrl name) |> Async.Parallel |> Async.RunSynchronously
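Calling it is then a one-liner (the URL here is just illustrative):

let pages = getPages [1 .. 10000] "http://example.com/post/"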

A data mining library in F#

In the past month, I have written a prototype of an (incomplete) data mining system in F#. The more I write, the more I feel that it is the language for data mining. The library will be open sourced this summer after functionality completion and testing. I'd like to present some highlights here:
1. Datasets and IO.
Dense and sparse datasets are equally supported! Various data formats are supported for both reading and saving: CSV files, Weka's ARFF format, the libsvm/SVMlight data format, and Excel tables. The library does not have its own data format; processed data are stored in different formats for different needs. Using a built-in database, e.g. the light-weight SQLite, will be considered in the future.
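As a flavor of the sparse format support, here is a minimal sketch of parsing one line of the libsvm/SVMlight format, "label index:value index:value ..." (parseLibsvmLine is an illustrative name, not the library's API):

let parseLibsvmLine (line : string) =
    let parts = line.Split([|' '|], System.StringSplitOptions.RemoveEmptyEntries)
    let label = float parts.[0]
    let features =
        parts.[1..]
        |> Array.map (fun p -> let kv = p.Split(':') in int kv.[0], float kv.[1])
    label, features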
There is a set of preprocessing functions for datasets, e.g. filling missing values, feature value normalization, etc.; one is sketched below.
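A minimal sketch of one such preprocessing step, min-max normalization of a dense feature column to [0, 1] (normalizeColumn is an illustrative name):

let normalizeColumn (col : float []) =
    let lo, hi = Array.min col, Array.max col
    if hi = lo then Array.map (fun _ -> 0.0) col
    else col |> Array.map (fun v -> (v - lo) / (hi - lo))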
2. Data mining algorithms.
A few state-of-the-art data mining algorithms are implemented:
classification: decision tree family, logistic regression, support vector machines (by compiling libsvm to .Net using C++/CLI), neural networks, Naive Bayes, KNN.
regression: simple least squares linear regression, neural networks and SVM (using libsvm).
clustering: k-means and its variants, EM on Gaussian Mixture Models, spectral clustering.
dimension reduction: SOM, PCA, LDA.
feature selection: mutual information, decision trees, logistic regression with the Wald test.
misc: Probabilistic Latent Semantic Analysis (PLSA), co-clustering.
association rule and sequence mining: I have a plan in mind, but the actual implementation may be delayed.
The library is mainly for data mining, not for machine learning research, so only the most useful models are implemented. I hope that after several years they will be very stable with few bugs. Trying to make it big would be dangerous. E.g. Weka is big in the sense of its algorithm coverage; however, the standard statistical and data mining workhorse, logistic regression, is badly implemented there. If even logistic regression cannot be applied to a modest text classification problem, what shall we expect of Weka's other algorithms?
Also, implementing a lot of immature models is less meaningful for practical data mining. E.g. the data mining module in SQL Server 2008 only implements a few state-of-the-art models. Novel algorithms are better implemented on demand for a specific application or dataset.
3. Parallel.
Some algorithms have parallel implementations at the algorithm level, e.g. k-means clustering. At a higher level, I define a set of tasks using F#'s type system, e.g. the classification task is defined as
type classificationEvaluationTask =
    | CrossValidation of int * dataset2 * ILearner * parameters
    | RandomSplit of float * dataset2 * ILearner * parameters
    | TrainTest of dataset2 * dataset2 * ILearner * parameters
    | LeaveOneOut of dataset2 * ILearner * parameters
Data mining tasks in the library run in parallel by default, e.g. performing a classification task and a clustering task at the same time, or doing cross validation with different parameters to choose the best parameter of a model. A major technique to boost the performance of classifiers is ensemble learning, which bags a lot of models together; parallelism really helps in this kind of learning task. Of course, parallel programming could be done in C++/Java using threads too. The advantage of F# is that the interface is very easy to use: executing various tasks in parallel needs only a single line.
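As a sketch of what that single line looks like, assume a dispatcher over the union cases above (runTask and the run* helpers are assumed names, not the library's actual API):

let runTask task =
    match task with
    | CrossValidation (folds, ds, learner, ps) -> runCrossValidation folds ds learner ps
    | RandomSplit (ratio, ds, learner, ps) -> runRandomSplit ratio ds learner ps
    | TrainTest (train, test, learner, ps) -> runTrainTest train test learner ps
    | LeaveOneOut (ds, learner, ps) -> runLeaveOneOut ds learner ps

// running a whole batch of tasks in parallel is then a one-liner:
let results = tasks |> PSeq.map runTask |> PSeq.toList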
4. A service provider design pattern for data mining algorithms. An algorithm, e.g. KNN, may have several implementations; the F# implementation is called by default, but the user could also start a service provider and call that provider's implementation, as sketched below. A candidate for such a wrapper is OpenCV's machine learning component, for its robustness, speed and friendly license.
The ML library in OpenCV only supports dense datasets, as machine learning tasks in computer vision usually use dense data formats. I haven't found a good service provider for sparse datasets yet. But for large scale sparse datasets, the common practice is to use SVMs, logistic regression and the tree family, for which I already have code in C/C++.
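A minimal sketch of the service-provider idea (all names here are assumed, not the library's actual API): the same ILearner interface is backed by different engines.

type Provider =
    | Native          // the default F# implementation
    | OpenCVProvider  // an assumed wrapper over OpenCV's ML component

let getKnnLearner provider : ILearner =
    match provider with
    | Native -> nativeKnn () :> ILearner          // assumed F# KNN learner
    | OpenCVProvider -> openCvKnn () :> ILearner  // assumed OpenCV-backed learner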
5. Visualization. Limited support is already done by extending Luca Bolognese's code snippet. The visualization is mainly for understanding the data and diagnosing the algorithms, e.g. plotting a scatter of two columns of a data table, or drawing the objective value through iterations. For high quality visualization, choose Mathematica, Matlab, R or Excel :)
6. Math library. It is a pity that there is no stable yet free math library for .Net. For data mining, a math library is really important. I use the matrix and vector implementation in F#'s PowerPack. The matrix type in the PowerPack supports both dense and sparse matrices; however, a sparse vector is not implemented, so I implemented the sparse vector myself. The PowerPack has now deprecated its math provider, which I picked up and use as the linear algebra module. So actually, I made little effort to get matrices supported.

Other math functions are implemented on demand. Some gradient based optimization solvers are implemented to solve linear regression and logistic regression, e.g. conjugate gradient descent and LBFGS, two state-of-the-art gradient based solvers. Currently I have implemented them in F#; the performance is modest compared to my C++ implementation, and code tuning is yet to be done. Some probability distributions are implemented for EM and Naive Bayes. Some important statistical algorithms, e.g. various tests, may also be implemented in the future.
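For example, the Gaussian density used by EM and Naive Bayes takes only a couple of lines (a minimal sketch; gaussianPdf is an illustrative name):

let gaussianPdf mu sigma x =
    exp (- (x - mu) * (x - mu) / (2.0 * sigma * sigma)) / (sigma * sqrt (2.0 * System.Math.PI))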

Acknowledgment
I'd like to thank Haipeng Zhang for reviewing this article for me.

[1] S. Zhang, C. Zhang, and Q. Yang. Data preparation for data mining. Applied Artificial Intelligence, 17(5-6): 375-381, 2003.
[2] Z.-H. Zhou. Three perspectives of data mining. Artificial Intelligence, 143(1): 139-146, 2003.