Sunday, September 19, 2010

WekaSharp: more features

1. An F# wrapper for Weka. The minimal wrapper in F# for Weka.

2. More features. (this post) This post will contain improvement over the minimal wrapper, e.g. more Dataset processing function, some plot functionality, etc.

3. Tutorial/Usage examples. This post is for end users, who might have no interested in reading the implementation details, but rather knowing how to use this wrapper to perform data mining tasks in .Net.

Here lists a set of features that have been added to WekaSharp.

Parameters for data mining algorithms.

F# features: named parameters and pattern matching.

Data mining algorithm usually have some parameters to set before applying to the data. E.g. the K-means clustering needs the user to provide the number of clusters, the Support Vector Machine classification algorithm need the user to provide the kernel type of the SVM to use, etc.

Also different algorithms have different set of parameters. In Weka, algorithm parameters are encoded as a string (similar to the parameter style of command line tools). However, a first time user would not be able to remember what does “-R 100” mean for a specific algorithm. So we’d like to have a parameter maker for each algorithm. F# fully supports this task with its named parameters.

Let’s see an example. The following code is for making parameters for Logistic Regression:

type LogReg() =
static member DefaultPara = "-R 1.0E08 -M -1"
static member MakePara(?maxIter, ?ridge) =
let maxIterStr =
let m = match maxIter with
| Some (v) -> v
| None -> -1
"-M " + m.ToString()
let ridgeStr =
let r = match ridge with
| Some(v) -> v
| None -> 1.0E08
"-R " + r.ToString()
maxIterStr + " " + ridgeStr


and this is an example usage of it:

let logRegPara = LogReg.MakePara (ridge = 1.0)

You can see that besides named parameter functionality, we can also only supply part of the parameters.

Another highlight F# feature in this the Parameter module is pattern matching. Sometimes, the parameters could be quite complicated, e.g. the nearest neighbor search algorithm used in KNN. There are 4 search algorithms, each of them is capable for some of the distance metrics. Discriminate unions and pattern matching make the parameter generating task easy:

    type DistanceFunction = Euclidean | Chebyshev | EditDistance | Manhattan
static member
GetString(distanceFunction) =
let d = match distanceFunction with
| Some (v) -> v
| None -> DistanceFunction.Euclidean
match d with
| DistanceFunction.Euclidean -> "weka.core.EuclideanDistance -R first-last"
| DistanceFunction.Chebyshev -> "weka.core.ChebyshevDistance -R first-last"
| DistanceFunction.EditDistance -> "weka.core.EditDistance -R first-last"
| DistanceFunction.Manhattan -> "weka.core.ManhattanDistance -R first-last"

            let distanceFunctionStr = DistanceFunction.GetString distanceFunction 

let searchAlgorithmStr =
let s = match searchAlgorithm with
| Some (v) -> v
| None -> NNSearchAlgorithm.LinearSearch
match s with
| NNSearchAlgorithm.LinearSearch ->
"weka.core.neighboursearch.LinearNNSearch" + " -A " + "\"" + distanceFunctionStr + "\""
| NNSearchAlgorithm.BallTree ->
"weka.core.neighboursearch.BallTree" + " -A " + "\"" + distanceFunctionStr + "\""
| NNSearchAlgorithm.CoverTree ->
"weka.core.neighboursearch.CoverTree" + " -A " + "\"" + distanceFunctionStr + "\""
| NNSearchAlgorithm.KDTree ->
"weka.core.neighboursearch.KDTree" + " -A " + "\"" + distanceFunctionStr + "\""


If you are interested, you can browse the Java code(in Weka’s GUI component) for this task and appreciate how succinct F# gets.


More Dataset manipulating functions

Weka’s dataset class(Instances) is very hard to use. Normally, a dataset is simply a table, with each row representing an instance and each column representing a feature. The dataset class in Weka is too general to be manipulated easily. Weka provides various filters to manipulate datasets. However, these filters are hard to remember their names and applying a filtering needs several lines of code.

I first write a function that transforms a 2 dimensional array into Weka dataset; it can transform the in-memory data more easily into a dataset.

let from2DArray (d:float[,]) (intAsNominal:bool)  

The parameter intAsNominal=true means that if a columns are all integers, then treat this column as a nominal column. Otherwise all columns are treated as real ones.

I also wrapper common filters into single functions, that means, applying one filter only uses one line code and they are all put in Dataset module.


The parallel processing

The PSeq module (in F# PowerPack) makes writing computing-bound parallel programs effortless. And in data mining, when it comes to parameter tuning or trying different models on different datasets, it would be fast to run these tasks in parallel on a multi-core machine.

The following code shows two functions performing a bulk of classification tasks:

let evalBulkClassify (tasks:ClaEvalTask seq) = 
|> evalClassify
|> Seq.toList

let evalBulkClassifyParallel (task:ClaEvalTask seq) =
|> evalClassify
|> PSeq.toList

The latter one is the parallel version, which simply change the Seq in the sequential version to PSeq.

Sunday, September 5, 2010

Numeric Performance in C, C# and Java

By occasion, I find the following draft by Prof. Peter Sestoft:

Numeric Performance in C, C# and Java

He implemented several typical numerical programs in C, C# and Java, and tested their running time.

The conclusion in the paper is that managed code is also performant! Read the paper for the details.

Prof. Sestoft is an expert in compiler design and compiler writing. The draft also contains the generated byte code with his annotations. Enjoy!

Just a note:
Although the C code in the paper is native, it does not take the advantage of the modern CPU design. A fully optimized implementation, which is very hard for normal programmers to write, is faster than the native code. This is why Matlab/Numpy's matrix multiplication is faster than your C code. Matlab uses Intel MKL optimized implementation.

Friday, September 3, 2010

WekaSharp: Tutorial for using Weka in F#/.Net

There are 3 posts in this series:

1. An F# wrapper for Weka. The minimal wrapper in F# for Weka.

2. More features. (not available yet) This post will contain improvement over the minimal wrapper, e.g. more Dataset processing function, some plot functionality, etc.

3. Tutorial/Usage examples. (this post) This post is for end users, who might have no interested in reading the implementation details, but rather knowing how to use this wrapper to perform data mining tasks in .Net.


The easiest way is to use Visual Studio 2010 with the latest F# PowerPack. Download the release/source code at and change the dll references to ones at the bin\Release folder.

Visual Studio 2008 should also be able open the project file. Only the parallel functionality cannot be used in 2008 as Parallel Sequence is only added in .Net 4.0.

The support for Mono is not prepared yet. However, it should be easy to figure out.

As all necessary runtimes are also included in the release, so the user does not need to download Weka or IKVM runtimes.

Quick Guide

The user is encourage to look at the script code in Script.fsx and run them to have a general feeling of the wrapper. These examples cover typical usage of WekaSharp.

Reading the other two posts is also very useful for using WekaSharp, and more importantly, for changing the source code of WekaSharp on your own need.

The License

As Weka is GPL, I must make WekaSharp under GPL.

The Efficiency Concern

People may concern how fast is the wrapper compared to the original Weka in Java. I haven’t thoroughly tested this yet. Some casual tests show that they are just the same, at least on Windows platforms. For instance, I run a 3-fold cross validation on the letter dataset using J48 decision trees, both the wrapper and the Weka (run from GUI) use about 20 seconds. It is quite surprising that IKVM’s compiler does so good a job.

As I will show later, the WekaSharp has some parallel constructs, which enable us to utilize multi-core more conveniently.

I’ve provided a module to do profiling. The design is similar to that of Matlab, but better:) You can use Profile.tic() and Profile.toc(“str”) pair to time the execution of F# code. Or You can get multiple timers by using Profile.getWatch().

The Dataset and IO

The Dataset module contains functions to manipulate datasets. Most of the functions are pure, i.e., they don’t change the input dataset and create a new dataset after processing.

Current 4 types of data files are supported: Weka’s ARFF format, LibSvm, Csv and SvmLight. Both loading and saving are supported, i.e. there are 8 functions.

The following code shows how to load an ARFF dataset and set the last column as the label:

let sonar =
|> Dataset.readArff
|> Dataset.setClassIndexWithLastAttribute

You can also do the two steps in one function:

let iris =
@"C:\Program Files\Weka-3.6\data\iris.arff"
|> Dataset.readArffLastAttributeAsLabel


Data mining algorithms usually have parameters, sometimes very complicated parameters. F# provides very convenient syntax construct for optional parameters.

E.g. the full set parameters of an SVM has the SVM type, the kernel type, the C value, and the parameters for the kernel, etc. Maybe you just need to set the C value: Parameter.SVM.MakePara(C = c).

Here Parameter is a module for parameters setting. For each data mining algorithm, there are two members:

* DefaultPara. The default parameter string, which is the same as ones used in Weka GUI.

* MakePara. A function to make different parameters.

Here are several more complicated examples of .MakePara method for different algorithms:

Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0)
Parameter.KMeans.MakePara(K=5, maxIter=10, seed=10)

Parameter.KNN.MakePara(distanceFunction = Parameter.Manhattan, K = 3)

and you can also use the IDE support to find which parameters are supported in a data mining algorithm:


Classification and its evaluation

WekaSharp supports most of the common classification algorithms:

type ClassifierType =
J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear | LinReg | AdaBoost | Bagging

There are three types of classification tasks:

type ClaEvalTask =
| CrossValidation of int * Dataset * ClassifierType * parameters
| RandomSplit of float * Dataset * ClassifierType * parameters
| TrainTest of Dataset * Dataset * ClassifierType * parameters

To run this task, you need the evalClassify method in Eval module. The following code shows a complete example using J48 as the classifier:

(* playing decision trees on Iris dataset *)
// load the dataset
let iris =
@"C:\Program Files\Weka-3.6\data\iris.arff"
|> Dataset.readArff
|> Dataset.setClassIndexWithLastAttribute

// describe 3 kinds of classification tasks
let j48Tt = TrainTest(iris, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Cv = CrossValidation(5, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Rs = RandomSplit(0.7, iris, ClassifierType.J48, Parameter.J48.DefaultPara)

// perform the task and get result
let ttAccuracy = j48Tt |> Eval.evalClassify |> Eval.getAccuracy
let cvAccuracy = j48Cv |> Eval.evalClassify |> Eval.getAccuracy
let rsAccuracy = j48Rs |> Eval.evalClassify |> Eval.getAccuracy

The evalClassify function returns a result object, you can use “.” to that object in the IDE to find out various types of results available. In the above, we use predefined functions to get the accuracy from it.

Clustering and its evaluation

Performing clustering is very similar to classification. You can define two types of clustering:

type CluEvalTask =
| ClusterWithLabel of Dataset * ClustererType * parameters
| DefaultCluster of Dataset * ClustererType * parameters

ClusterWithLabel means that you will need to use the label of the dataset to do evaluation. DefaultCluster does not require the dataset has label (actually, it does now allow datasets to have labels either), so the result will only contain the clustering assignments, but not accuracy, etc.

The following code shows a complete clustering example:

let irisLabeled =
@"C:\Program Files\Weka-3.6\data\iris.arff"
|> Dataset.readArffLastAttributeAsLabel

let kmeansTask = ClusterWithLabel(irisLabeled, ClustererType.KMeans, Parameter.KMeans.MakePara(K=3))
let emTask = ClusterWithLabel(irisLabeled, ClustererType.EM, Parameter.EM.MakePara(K=3))
let dbscanTask = ClusterWithLabel(irisLabeled, ClustererType.DBScan, Parameter.DBScan.DefaultPara)

let kmeansResult = Eval.evalClustering kmeansTask |> Eval.getClusterSummary
let emResult = Eval.evalClustering emTask |> Eval.getClusterSummary
let dbscanResult = Eval.evalClustering dbscanTask |> Eval.getClusterSummary

Bulk & parallel processing tasks

Sometimes, you need to run multiple tasks. E.g. you need to run the same task multiple times to see the mean and variance of the result, or you need to try different parameters for an algorithm, or you simply have different data mining tasks to run.

The following example shows how to create a bulk of tasks and run them:

// load the data set
let sonar =
|> Dataset.readArff
|> Dataset.setClassIndexWithLastAttribute

// set different parameters
let Cs = [0.01; 0.1; 1.; 10.; 50.; 100.; 500.; 1000.; 2000.; 5000. ]

// make the tasks with the parameter set
let tasks =
|> (fun c -> Parameter.SVM.MakePara(C = c))
|> (fun p -> CrossValidation(3, sonar, ClassifierType.SVM, p))

// the accuracy result
let results =
|> Eval.evalBulkClassify
|> Eval.getAccuracy
Profile.toc("sequential time: ")

Here I created different SVM tasks for different C values, run them and get the accuracy as a list.

F# provides very easy syntax to perform multiple tasks at the same time. Thus I provide evalBulkClassifyParallel method:

let resultsParallel =
|> Eval.evalBulkClassifyParallel
|> Eval.getAccuracy
Profile.toc("parallel (PSeq) time: ")

// sequential time: : 9767.804800 ms
// parallel (PSeq) time: : 6154.715500 ms

As the profiling shows, on a two-core machine, parallel executing the tasks does boost the speed.


is not finished, but it is still quite usable. To plot the different accuracies for different Cs in the SVM, we can use:

// do the plot
lc.column(y = results, xname = "differnet C", yname = "Accuracy", title = "SVM on iris",
isValueShownAsLabel = true ) |> display

Making Datasets from Memory

All the above examples use data from data files. In practice, we might want to convert the data in memory, e.g. in an array into a Weka dataset. The following examples shows how to create dataset from F# arrays:

(* create dataset from F# arrays *)

// make the data array
let data = [| 0.; 0.;
1.; 1.;
0.; 1.;
1.; 0.; |]
let xorArray = Array2D.init 4 2 (fun i j -> data.[i*2 + j])

// make weka dataset from array
let xor0 = Dataset.from2DArray xorArray false

// add labels
let xor = xor0 |> Dataset.addClassLabels ["T"; "T"; "F"; "F"]

// make svm tasks

let rbfTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.DefaultPara)
let linearTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0) )

// rbf svm gets 100% accuracy
let rbfAccuracy = rbfTask |> Eval.evalClassify |> Eval.getAccuracy
// linear svm does not work on XOR data set
let linearAccuracy = linearTask |> Eval.evalClassify |> Eval.getAccuracy


As a user of WekaSharp, I already benefit from it as I can process the data and run data mining algorithms over it both in F#. Originally, I needed to write my data into ARFF file, call Weka, parse the output to get the result. And the F# solution is far more declarative.

Another benefit is that we can write extensions to Weka in F#!


Monday, August 23, 2010

WekaSharp: An F# wrapper for Weka

There are 3 posts in this series:

1. An F# wrapper for Weka. (this post) The minimal wrapper in F# for Weka.

2. More features. This post will contain improvement over the minimal wrapper, e.g. more Dataset processing function, some plot functionality, etc.

3. Tutorial/Usage examples. This post is for end users, who might have no interested in reading the implementation details, but rather knowing how to use this wrapper to perform data mining tasks in .Net.

WekaSharp is available at CodePlex:

Weka is one of the most widely used data mining software package. It contains over 50 data mining algorithms, a good GUI support and well written documents. Because of its good features, a lot data mining courses use it as an illustrative software.

Its GPL license also allows it to be used freely in academic.

However, Weka GUI enables us to perform data loading, filtering and classifier building in an interactive way, sometimes programming is still needed. For instance, GUI has poor support for parameter selection and processing 100 data sets. In these tasks, scripting is a better way. However, Java language is not declarative and it does not support interactive shell either.

F# well supports these two features, also data processing in F# in general is far better than Java. So F# and Weka are good combination.

Compile Weka library to a .Net dll

As everybody knows, Weka is written in Java. To use it within .Net framework seamlessly, we need to compile it into a .Net assembly. Fortunately, IKVM project is made for such purpose.

There is an online tutorial for compiling weka.jar to weka.dll. However, this tutorial does not mention how to use external jar libraries in Weka, e.g. libsvm.jar. Because Java’ class loading mechanism is different from .Net’s. We cannot simply compile different jars into separate dlls. The simplest way is to compile weka.jar and libsvm.jar to a single dll:

>ikvmc -sharedclassloader -target:library weka.jar libsvm.jar

Notice that there is a WekaDotNet project at sourceforge. The project is compiled using Visual J#, which is already deprecated in Visual Studio family. Using IKVM to compile jar into .Net is an easier and more stable way.

The overall structure of the wrapper

Based on the Weka class hierarchy, I’ve defined several separate modules for different kinds of functionality:

* Common – contains all common types over the whole wrapper. E.g. the union type for various classifiers.

* Dataset – contains IO routines to read/save various datasets and functions (e.g. split) to manipulate/preprocess datasets.

* Parameter – contains functions to make parameters for the data mining algorithms, all the default parameters are also provided.

* Classify – contains standard classifiers.

* Cluster – contains standard clustering algorithms.

* Eval – contains functions to perform different kinds of classification/clustering tasks and output their result (e.g. accuracy/F1 measure) out.

In the following sections, I will present each module in details.

Common module

In this module, I try to write all types, including classifier types, clustering algorithm types and some evaluation-task types. This module is marked as AutoOpen, i.e., the types in the modules are easily accessible from other modules.

module Common =
let NYI() = failwith "Not yet implemented!"

type Dataset = core.Instances

type parameters = string

type DatafileType =
Arff | LibSVM | Csv | Svmlight

type ClassifierType =
J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear

type ClaEvalTask =
| CrossValidation of int * Dataset * ClassifierType * parameters
| RandomSplit of float * Dataset * ClassifierType * parameters
| TrainTest of Dataset * Dataset * ClassifierType * parameters

type ClustererType =
KMeans | EM | DBScan

I’ve redefined Weka Instances class as Dataset, and use a string for parameters.

Dataset module

This module includes functions to read/save 4 kinds of data files: Weka’s ARFF, libsvm, Csv and SvmLight. It also has a function randomSplit to randomly split a dataset with a given ratio.

This module currently is far from complete. For instance, we can only load datasets from 4 kinds of disk files, and no way to build a dataset using in-memory data, e.g. an array of vectors. We also only provide one data preprocessing step – random split, many other common preprocessing steps are needed.

I’d like to use another post to finish these. In this post, delivering the whole working wrapper (although incomplete) is more important.

Here is some of the implementation:

    let readDataFile (ftype: DatafileType) (fname:string) =
let loader =
match ftype with
| Arff -> new converters.ArffLoader() :> AbstractFileLoader
| LibSVM -> new converters.LibSVMLoader() :> AbstractFileLoader
| Csv -> new converters.CSVLoader() :> AbstractFileLoader
| Svmlight -> new converters.SVMLightLoader() :> AbstractFileLoader
| _ -> failwith "dataset loading error"

its input is a datafiletype and the file name. Based on this function, 4 concrete functions are defined:

let readArff = readDataFile DatafileType.Arff
    let readLibsvm = readDataFile DatafileType.LibSVM
let readCsv = readDataFile DatafileType.Csv
let readSvmlight = readDataFile DatafileType.Svmlight

The data saving functions have similar implementation.

Parameter module

This module contains functions that create parameters for different data mining algorithms. As Weka uses a space-separated string as parameters, it would be unclear to the first-time user that what does “-R 1.0 –M -1” mean for a logistic regression.

I have provided a default parameter string for each algorithm.

Classify module

In this module, I have wrapped most commonly used classification algorithms: C4.5 decision tree (named J48 in Weka), NaiveBayes, Logistic Regression, Multilayer perception neural net, Support Vector Machines (SVM).

The getClassifier function accepts a classifier type and the an option string, and returns a classifier instance:

let getClassifier (ctype:ClassifierType) (option:string)  =
let classifier =
match ctype with
| J48 -> new classifiers.trees.J48() :> classifiers.Classifier
| NaiveBayes -> new classifiers.bayes.NaiveBayes() :> classifiers.Classifier
//| KNN ->new classifiers.lazy.IBk() :> weka.classifiers.Classifier
| LogReg -> new classifiers.functions.Logistic() :> classifiers.Classifier
| NeuralNet -> new classifiers.functions.MultilayerPerceptron() :> classifiers.Classifier
| SVM -> new classifiers.functions.LibSVM() :> classifiers.Classifier
| LibLinear -> new classifiers.functions.LibLINEAR() :> classifiers.Classifier
| _ -> failwith "not supported"

classifier.setOptions(core.Utils.splitOptions option)
and buildClassifier takes one more parameter – the dataset to build a trained classifier:

let buildClassifier (ctype:ClassifierType) (option:string) (ds:Dataset) =
checkDataset ds
let classifier = getClassifier ctype option
Handy functions like getJ48, getSVM are also defined:

let getJ48 (option:string) = getClassifier ClassifierType.J48 option
let getSVM (option:string) = getClassifier ClassifierType.SVM option

There are several issues:

1. KNN (Weka class: IBk) is not supported as its namespace in Weka is core.classifiers.lazy. But lazy is a keyword in F#, thus cannot appear in a namespace.

2. The IO wrapper for classifiers are not provided yet.

Cluster module

Similar to Classify module, there are getClusterer and buildClusterer in Cluster module. There are also shortcuts to concrete clustering algorithms such as getKmeans and buildKmeans.

Eval module

This module contains evaluation methods for classification and clustering. For classification, I’ve defined a task discrete union:
type ClaEvalTask =
    | CrossValidation of int * Dataset * ClassifierType * parameters
| RandomSplit of float * Dataset * ClassifierType * parameters
| TrainTest of Dataset * Dataset * ClassifierType * parameters
classifyEval function is to do such a task:

let rec classifyEval (task:ClaEvalTask) =
let rnd = new java.util.Random(System.DateTime.Now.Ticks)
match task with
| TrainTest (dsTrain, dsTest, ctype, para) ->
Classify.checkDataset dsTrain
Classify.checkDataset dsTest
let eval = new classifiers.Evaluation(dsTrain)
let cl = Classify.buildClassifier ctype para dsTrain
eval.evaluateModel(cl, dsTest) |> ignore
| CrossValidation(cv, ds, ctype, para) ->
Classify.checkDataset ds
let eval = new classifiers.Evaluation(ds)
let cl = Classify.getClassifier ctype para
eval.crossValidateModel(cl, ds, cv, rnd, Array.empty)
| RandomSplit(ratio, ds, ctype, para) ->
Classify.checkDataset ds
let train, test = Dataset.randomSplit ratio ds
classifyEval (TrainTest(train, test, ctype, para))

the result of this function is an evaluation result object. We can use its properties to get various evaluation metrics, or use the shortcut functions in the Eval module:

let getAccuracy (eval:classifiers.Evaluation) = eval.pctCorrect()
let getPrecison (eval:classifiers.Evaluation) (classIdx:int) = eval.precision(classIdx)
let getRecall (eval:classifiers.Evaluation) (classIdx:int) = eval.recall(classIdx)
let getF1 (eval:classifiers.Evaluation) (classIdx:int) = eval.fMeasure(classIdx)
let getAUC (eval:classifiers.Evaluation) (classIdx:int) = eval.areaUnderROC(classIdx)
let getClassifySummary (eval:classifiers.Evaluation) = eval.toSummaryString()


Two features of the wrapper are

1) a declarative wrapper to Weka
. See the following code to see how declarative the wrapper is:

// load the dataset
let dataset =
@"C:\Program Files\Weka-3.6\data\iris.arff"
|> Dataset.readArff
|> Dataset.setClassIndexWithLastAttribute

// describe the cross validation task using SVM as the classifier
let classifyTask = CrossValidation(5, dataset, ClassifierType.SVM, Parameter.SVM.defPara)

// perform the cross validation task and then get the average accuracy
let cvAccuracy =
|> Eval.classifyEval
|> Eval.getAccuracy
The code is understandable even to those who do not know F#.

2) interactively perform data mining algorithms.

F# interactive is well integrated in Visual Studio. It is very stable and convenient to use. The powerful VS IDE provides good Intellisense for F#.

Saturday, August 14, 2010

Reading F# Projects, Part III: The F# math providers.


We’ve covered the usage of Math providers in the matrix series. In this post, the whole source code is decomposed.

As I find it time consuming to covert One Note notes into a blog format. Please see the attached pdf file.




Reading F# Projects, Part II: F# Set

The design patterns in a data structure project like F# Set is simple. However, understanding it really needs time as data structures involve heavy generic programming and the type system in F# (or any other static FP language) is very complex.

In this post, we will explore two aspects of F# Set:

1. The general implementation structure for a generic data structure. After knowing the pattern, we could write other data structures in F# and these data structures are generic enough.

2. Know something about generics in F# and type inference. We will come to this part in later readings.

In F# source code, the immutable set implementation (set.fs) is a good one for us to start for data structures. Why not more fundamental data structures like F# sequences? Because those ones are more involved in .Net platform and F# Set is nearly pure F# code. Also because:

I love trees!

F# Set and Map are implemented using balanced binary search tress, to be exact, AVL tress. Balanced BSTs support insert, delete and find operations all in O(log n). So it is widely used for implementing dynamic sets.

In general, trees are the most beautiful data structures and have so many variants. They are also efficient, e.g. R tree and friends are the fundamental data structures used in every database. Quadtrees are used for computational geometry in theory and spatial database in practice.

The F# Set is implemented by an immutable AVL tree. Here immutable means that every node, once created, its values could be be changed. In CS literature, researchers use “persistent” to name immutable data structures. If you do 100 insertions to an empty tree, you can have 101 version of trees, where common nodes are shared to reduce the space cost. It is proved that in such a tree, inserting operation still costs O(log n) time, and O(log n) space for duplicating some nodes in the tree (usually along the insertion path).

As this post is not intended to be a tutorial on data structures, I won’t go into details for the tree implementation. We mainly study how an industry-strength data structure is implemented in F#.

The main structure

set.fs contains about 840 lines of code. The code skeleton is shown below:

namespace Microsoft.FSharp.Collections
type SetTree<'T> when 'T : comparison =
| SetEmpty // height = 0
| SetNode of 'T * SetTree<'T> * SetTree<'T> * int // height = int
| SetOne of 'T // height = 1
module internal SetTree =
// functions:
// indert, delete, find, etc.
// split, rebalance, etc.

type SetIterator<'T> when 'T : comparison =
{ mutable stack: SetTree<'T> list; // invariant: always collapseLHS result
mutable started : bool // true when MoveNext has been called
type Set<'T when 'T : comparison >(comparer:IComparer<'T>, tree: SetTree<'T>) =
//member functions:
// Add, Remove, Contains, etc.
// - + Intersection, etc.
// GetHashCode, Equals, ToString
//implement interfaces:
// IComparable
// ICollection<'T>
// IEnumerable<'T>
// IEnumerable
module Set =
// set module functions:
// add, isEmpty, etc.

Under Microsoft.FSharp.Collections namespace, there are three things: SetTree<’T> type, Set module and SetTree module.

SetTree is a type constructor, the parameter is a comparable type ‘T.

Most of the code is in SetTree internal module:

the first part contains the code to manipulate the tree structure, e.g. inserting a node, deleting a node, etc.

Type SetIterator<’T> and related functions are for implementing IEnumerable interface.
Set<’T> (comparer, tree) is the implementation for Set.

Set module contains the common function to manipulate a set.

In a summary, A container, like Set, usually follows this pattern:

1. An implementation of the underlying data structure, here it is AVL tree.

2. A module for manipulating the data structures; members functions also can do this. Usually we keep functions in both places.

3. Implement the IEnumerable and IEnumerable<’T> interface, so that the container/data structure could be treated as a sequence.

These are kind of rules for implementing new containers in F#, e.g. F# Map.

Detail 1. The generic

It is meaningless if the Set implementation only supports integers, or floats. We want it to be generic, i.e., support any type. Since the underlying implementation is based on trees, so the base type should also support comparison.

The other part of generic, which is more subtle, is the comparer. There are situations, we have different string sets. However, the rules to order them are not the same, i.e., strings maybe compared using different rules. So we should add another parameter into consideration. (Refer to Functor in OCaml if you want to see a more elegant more to deal with this situation.)

Although F# Set currently does not support different comparers. Its internal implementation keeps comparer in mind. (Because F# supports OOP, the different comparers problem could be partially solved by inherit the original element type. Anyway, this is non-elegant.)

We know that once a Set has some concrete values in it, the type is clear. Here are two ways to initialize a Set:

static let empty : Set<'T> =
    let comparer = LanguagePrimitives.FastGenericComparer<'T>
new Set<'T>(comparer, SetEmpty)

new (elements : seq<'T>) =
let comparer = LanguagePrimitives.FastGenericComparer<'T>
new Set<_>(comparer,SetTree.ofSeq comparer elements)

We can see that they get the comparer for type ‘T and use that comparer to pass into the constructor of type Set<’T>.

Detail 2. The comparer

The above code get the comparer for type ‘T using FastGenericComparer, which is in a big file prim-types.fs. The primitive type module should be discussed separately and thoroughly. There are quite a few ad hoc tricks in the implementation.

Here we just need to know that it looks up a table to find a comparer and if the type is in the primitive types, then there are optimizations for them.

Detail 3. From the empty to concrete

F# allows you to write [], Set.Empty. These empties seem to be dynamic. Actually no. You cannot write sth like:

let empties = Array.create 100 Set.empty

the compiler gives the following error:

error FS0030: Value restriction. The value 'empties' has been inferred to have generic type
    val empties : Set<'_a> [] when '_a : comparison  

Because the compiler cannot generalize these empties into a single type. It is safe to write

Set.empty + Set([1;2;3])

Because the types of the two sets could be generalized.

Same situation applies to the zeros in different number types.

Detail 4. The Iterator

interface IEnumerable<'T> with
s.GetEnumerator() = SetTree.mkIEnumerator s.Tree

interface IEnumerable with
s.GetEnumerator() = (SetTree.mkIEnumerator s.Tree :> IEnumerator)
Both of the IEnumerable interfaces are implemented. Notice that difference between the two: there is a upcast in the type. Let’s see the actual implementation:

let mkIterator s = { stack = collapseLHS [s]; started = false }

let mkIEnumerator s =
    let i = ref (mkIterator s)
{ new IEnumerator<_> with
x.Current = current !i
interface IEnumerator with
x.Current = box (current !i)
member x.MoveNext() = moveNext !i
member x.Reset() = i := mkIterator s
interface System.IDisposable with
x.Dispose() = () }


An exercise is to read map.fs in F# core. The implementation of F# Map is very similar to F# Set.

Reading F# Projects, Part I: The Common Knowledge

During the programming of my data mining library, I occasionally refer to the design of other libraries. E.g. data mining toolkits WEKA and MALLET are well designed. I also refer to the design of F# libraries. The examples on F# books mainly tell us how to program on small level, i.e. how to manipulate list or sequence and some language features. However, designing in F# is often covered in a general way.

Thus, it is beneficial for new F# users to read well designed F# libraries. In this series, the design of the following source code will be presented: (the list is tentative to change.)

* some parts of F# Core and PowerPack.

* F# math providers -- service design pattern and PInvoke in F#.

* TrueSkill – how a specific data mining model is implemented.

* FParsec -- the best example of computation expression/monad.

Before going to any concrete projects, we review the common issues/tips/concepts that may be involved in the process of the project design in F#.

I will add more topics here as we go through different F# programs.

The interface/signature file .fsi

In Java or C#, the implementation and interface are put in one. In C/C++, the interface (.h) and the implementation (.c/c++) are separated. Either design is good and has advantages.

For F#, I think the later is better. In .fsi, all the public functions/types are commented carefully. These commented can be used to generated documents for the program, or used by the IDE Intellisense.

The actual implementation is in the .fs files. In this file, comments are in places where actually necessary for the program who implements it, e.g. a formula, or a refer to a page number in a book. Because the program who writes the function definitely know what every parameter mean. If this kind of comments are put into the implementation, then the programs look less succinct.

So usually, the programmer doing the implementation write the implementation in .fs. When the implementation is stable or ready to ship out, the corresponding .fsi interface file is generated and well commented manually.

Here is the MSDN page on signature file.

Namespaces and Modules

The latest F# version requires that an F# source file (.fs) starts with a namespace, that means all the following code is under this namespace(An exception: there could be multiple namespaces in a source file.)

Modules are similar to namespaces. You can view that a group of functions are put into a namespace.

The usage difference between namespaces and modules, in my option, is that namespace is broader concept. E.g. Microsoft.FSharp.Collections names contains a lot of standard data structures, where the manipulating functions for each data structure are put into a module.

You can also view modules as a class containing only static members.

Classes and Interfaces

F# interfaces are just like interfaces in C#/Java. Interface is a quite universal concept occurring different programming paradigms. Even in the pure FP Haskell, type classes are similar to interfaces.

F# classes are different from classes in OO languages. First, F# encourages an immutable programming fashion for classes: you have only one main constructor, once the object is constructed the values remain unchanged. You can also have mutable member fields in a class. The constraints on F# classes are for safer programs, although sometimes causes some inconvenience.

Besides classes, F# also has Records, Enums and Discrete Unions.

Extension to existing classes and modules

In F#, you can extend an existing .net class by using “type .. with ..” construct:

type System.Net.WebRequest with
member this.AsyncGetResponse() : Async<System.Net.WebResponse>=
async {… the implementation }
(from F# source code: control.fs)

In this piece of code, System.Net.WebRequest is an existing class in .Net. We add a member function AsyncGetResponse into this class. This is like lightweight inheritance by saving from creating a new class.

The attributes

In .Net, attributes associate declarative information with the code. Usually they occur in production/formal code. Here is a tutorial.

Here I list some commonly used attributes in F#:


To see a full list, and the exact meaning. No resource is better than the F# Specification, Chapter 16. Special Attributes and Types.

Saturday, July 10, 2010

F# Async workflow application: a Flickr crawler

In data mining, much time is spent on collecting raw data and preprocessing data. Quite a few data mining research tasks require to download data from Internet, e.g. Wikipedia articles from Wikipedia, photos from Flickr, Google search results, reviews from Amazon, etc.

Usually general purpose crawlers, such as wget, are not sufficiently powerful and specialized in downloading data from Internet. Writing a crawler on our own is often required.

Recently I am doing an image related research project, in which I need to download a lot of tagged images from Flickr. I am aware that there is a Flickr downloadr, which uses Flickr API to download images. However 1) it only downloads licensed photos and 2) it cannot download the tags of a photo. Thus I decided to write one myself.

The input of the is a tag query, e.g. “dog”, # of photos to download and the disk folder to store the downloaded images.

Because the number of photos is quite big, so downloading them in parallel is critical. In .Net, there are several ways to do parallel computing. For IO intensive tasks, Async workflow is the best.

Tutorials for Async workflow

As Async workflow is one of the key features of F#, there are quite a few tutorials online for F# Async programming. Providing one more in this blog would be repetious.

Luke Hoban’s PDC 2009 talk, F# for Parallel and Asynchronous Programming, is very good for beginners.

Don Syme wrote 3 articles in a series. Highly recommended for experienced F# users!

The Flickr crawler

To write a crawler for a web site like Flickr, we need to 1) design the downloading strategy and 2) analyze the structures of Flickr.

My strategy is to use the search query to search images with some specific tags and from the result page(as shown below), the url of each image is extracted, from which the image and its tags are then crawled.


Figure 1. Flickr search result page.


Figure 2. Flickr image page with an image and its tags.

So first we need a function to download a web page asynchronously :

  1. let fetchUrl (url:string) =
  2.     async {
  3.         try
  4.             let req = WebRequest.Create(url) :?> HttpWebRequest
  5.             req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)";
  6.             req.Method <- "GET";
  7.             req.AllowAutoRedirect <- true;
  8.             req.MaximumAutomaticRedirections <- 4;
  9.             let! response1 = req.AsyncGetResponse()
  10.             let response = response1 :?> HttpWebResponse
  11.             use stream = response.GetResponseStream()
  12.             use streamreader = new System.IO.StreamReader(stream)
  13.             return! streamreader.AsyncReadToEnd() // .ReadToEnd()
  14.         with
  15.             _ -> return "" // if there's any exception, just return an empty string
  16.     }


fetchUrl pretends to be a Mozilla browser and can do some redirections if the url is slightly invalid. The current exception handling is very easy – just return the empty string for the web page. Notice that the return type of the function is Async<string>, thus it cannot be used to download images as images are of binary format, not text.

So the next task is to write a function to download images:

  1. let getImage (imageUrl:string) =
  2.     async {
  3.         try
  4.             let req = WebRequest.Create(imageUrl) :?> HttpWebRequest
  5.             req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)";
  6.             req.Method <- "GET";
  7.             req.AllowAutoRedirect <- true;
  8.             req.MaximumAutomaticRedirections <- 4;
  9.             let! response1 = req.AsyncGetResponse()
  10.             let response = response1 :?> HttpWebResponse
  11.             use stream = response.GetResponseStream()
  12.             let ms = new MemoryStream()
  13.             let bytesRead = ref 1
  14.             let buffer = Array.create 0x1000 0uy
  15.             while !bytesRead > 0 do
  16.                 bytesRead := stream.Read(buffer, 0, buffer.Length)
  17.                 ms.Write(buffer, 0, !bytesRead)
  18.             return ms.ToArray();
  20.         with
  21.             _ -> return Array.create 0 0uy // if there's any exception, just return an empty image
  22.     }


Next we write the code to get the url of an image and its tags from an image page(see Figure 2):

  1. let getBetween (page:string) (head:string) =
  2.     let len = head.Length
  3.     let idx = page.IndexOf(head)
  4.     let idx2 = page.IndexOf('"', idx+len)
  5.     let between = page.Substring(idx+len, idx2 - idx - len)
  6.     between
  9. let getImageUrlAndTags (page:string) =
  10.     let header = "class=\"photoImgDiv\">"
  11.     let idx = page.IndexOf(header)
  12.     let url = getBetween (page.Substring(idx)) "<img src=\""
  14.     let header2 = "<meta name=\"keywords\" content=\""
  15.     let tagStr = getBetween page header2
  17.     let s = tagStr.Split([|','|], System.StringSplitOptions.RemoveEmptyEntries)
  18.     let tags = s |> (fun t -> t.Trim())
  19.     url, tags


Finally, write a function to work through every search result page, parse the result page and download the images in that result page:

  1. let getImagesWithTag (tag:string) (pages:int) =
  2.     let rooturl = @""+tag+"&m=tags&s=int"
  3.     seq {
  4.         for i=1 to pages do
  5.             let url = rooturl + "&page=" + i.ToString()
  6.             printfn "url = %s" url
  7.             let page = fetchUrl url |> Async.RunSynchronously
  8.             let imageUrls = getImageUrls page
  9.             let getName (iurl:string) =
  10.                 let s = iurl.Split '/'
  11.                 s.[s.Length-1]
  13.             (* images in every search page *)
  14.             let images =
  15.                 imageUrls
  16.                 |> (fun url -> fetchUrl url)
  17.                 |> Async.Parallel
  18.                 |> Async.RunSynchronously
  19.                 |> (fun page ->
  20.                     async {
  21.                         let iurl, tags = getImageUrlAndTags page
  22.                         let icontent = getImage iurl |> Async.RunSynchronously
  23.                         let iname = getName iurl
  24.                         return iname, icontent, tags
  25.                     })
  26.                 |> Async.Parallel
  27.                 |> Async.RunSynchronously
  28.             yield! images
  29.         }

with a driver function to write all the images into hard disk:

  1. let downloadImagesWithTag (tag:string) (pages:int) (folder:string) =
  2.     let images = getImagesWithTag tag pages
  3.     images
  4.     |> Seq.iter (fun (name, content, tags) ->
  5.         let fname = folder + name
  6.         File.WriteAllBytes(fname, content)
  7.         File.WriteAllLines(fname + ".tag", tags)
  8.         )


We’ve done! A Flickr image crawler in only about 120 lines of code. Let’s download some images!

downloadImagesWithTag "sheep" 5 @"D:\WORK\ImageData\flickr\sheep\"


It only costs less than 5 minutes to download 300 sheep pictures from Flickr.


1. One of the strengths of F# Async programming is its ease for exception handling. In this example, the exception is handled immediately, we can also propagate the exception into an upper level function and handle it there. However, that would require more thinking….

2. I only parallelly download images in one search page. The program could be modified to parallelly process multiple search result pages, which is done sequentially now. If done in this way, we can see that, we can build a hierarchical parallel program: 1) at the first level, multiple search result pages are processed parallelly and 2) at the second level, images in a research result pages are downloaded parallelly.

Friday, June 4, 2010

A note on efficiency in F#, part I: small things first

In this blog, I often talk about efficiency aside data mining as efficiency really matters in this field. Previously, we’ve talked about how to use PInvoke and C++/CLI to use C++ code to boost the performance. In this post, I’d like to start a series on F# itself. As a functional programming language, F# programs are kind of high-level, thus harder to reason its performance than imperative programs. Not the big-O thing, it is the constant. This series contains little tips about performance.

Note that these tips are just facts to let you know. There is no performance guidance that one should follow. There is always a treadoff, high level functional constructs usually are slower than their pure imperative equivalents. However, they are safer and have better abstraction. It usually depends when you really need to do some optimization. E.g. if you could guarantee that your program runs less than 0.5 second, then optimizing it would usually become less meaningful, unless you have another program calling this program a million times.

Also remember that

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" (by Donald Knuth)

1. Stream IO and Seq module

Think about that we need to process a large text file, say, 1TB. Then it is infeasible to load the file into memory. The approach is to treat the text file as a stream and scan this stream to get useful information.

  1. open System.IO
  2. let readLinesSeq = File.ReadLines

.Net 4.0 provides ReadLines, the type of its return value is seq<string>.

In lower versions of .Net, one needs to implement this function:

  1. let readLines filePath = seq {
  2. use sr = new StreamReader (filePath)
  3. while not sr.EndOfStream do
  4. yield sr.ReadLine ()
  5. }

Once we have this function, we can process it as as a sequence.

While prototyping a model, we may make mistakes, and we also know that scanning the whole file would cost a lot of time. If we find an error after scanning the whole scanning, that would waste a lot of time. So we would like to take the first few lines of the file as a small test set for the debugging purpose. By treating a file as a sequence, we could simply use

  1. let lines100 = "filename" |> readLines |> Seq.take 100

to get a small test set.

Here is an quite old post in 2006. At that time, there seemed to be no Seq module. The idea is the same.

2. Loop, Seq module and the tail-recursive version of “while”

One of my first F# programs is a prime number test function. I first used an ugly for loop with mutable variables:

  1. let isPrime1 n =
  2. let m = int(sqrt (float n))
  3. let mutable flag = true
  4. for i=2 to m do
  5. if n % i = 0 then
  6. flag <- false
  7. flag

You can see that I clearly didn’t know how to early stop/break the loop when flag is assigned to false. Because I could not find break, so I crazily use exception to break out the loop:-)

  1. exception Break
  2. let isPrime1' n =
  3. let m = int(sqrt (float n))
  4. let mutable flag = true
  5. try
  6. for i=2 to m do
  7. if n % i = 0 then
  8. flag <- false
  9. raise Break
  10. with
  11. _ -> ()
  12. flag

And based on my FP experience on Scheme, I immediately wrote a tail-recursive function, which behaves like a “while” loop:

  1. let isPrime2 n =
  2. let m = int(sqrt (float n))
  3. let rec loop div =
  4. if div > m then true
  5. elif n % div = 0 then false
  6. else loop (div+1)
  7. loop 2

I can also explicitly use a while loop with a mutable flag:

  1. let isPrime3 n =
  2. let m = int(sqrt (float n))
  3. let mutable flag = true
  4. let mutable div = 2
  5. while flag && div <= m do
  6. if n % div = 0 then
  7. flag <- false
  8. div <- div + 1
  9. flag

And I finally find the elegant way using Seq:

  1. let isPrime n = { (float(n)))} |> Seq.forall (fun i->n%i<>0)

However, the last version would be slower than others except for the first one. Because a sequence is actually an object implements IEnumerable interface. If the object currently points to 10, to get 11, it needs to call a Next function to get it. There is definitely some overhead here.

So pay attention when you use Seq to do the job of a for loop or while loop when the loop is executed for a huge number of times.

3. List and Array

F# lists are single linked lists. Thus, Lists use more memory as it needs to keep a next pointer in every node. Lists are also slow. Let’s do a test:

  1. let n = 10000000
  2. let a = List.init n (fun i-> i)
  3. let b = Array.init n (fun i-> i)

creating list a uses about 2.2 seconds and about 200MB memory, while creating array b only uses 0.12 second and about 40MB memory, which equals our estimate n * 4 bytes/int ~= 40MB.

4. Access lists, arrays and sequences

We know that the three modules List, Array and Seq have similar functions. E.g. all of them have the function map, which applies a function to the existing list, array or sequence to generate a new one. However, the three maps are implemented differently, with array most efficiently, list second, the most general seq the last:

  1. // form F# source code: local.fs
  2. // note: the actuall implementation for list is in local.fs, not in list.fs
  3. let map f x =
  4. match x with
  5. | [] -> []
  6. | [h] -> [f h]
  7. | (h::t) ->
  8. let cons = freshConsNoTail (f h)
  9. mapToFreshConsTail cons f t
  10. cons
  11. // form F# source code: array.fs
  12. let map (f: 'T -> 'U) (array : 'T[]) : 'U[]=
  13. checkNonNull "array" array
  14. let inputLength = array.Length
  15. let result = Microsoft.FSharp.Primitives.Basics.Array.zeroCreateUnchecked inputLength
  16. Parallel.For(0, inputLength, fun i ->
  17. result.[i] <- f array.[i]) |> ignore
  18. result
  19. // form F# source code: seq.fs
  20. let map f (e : IEnumerator<_>) : IEnumerator<_>=
  21. upcast
  22. { new MapEnumerator<_>() with
  23. member this.DoMoveNext (curr : byref<_>) =
  24. if e.MoveNext() then
  25. curr <- (f e.Current)
  26. true
  27. else
  28. false
  29. member this.Dispose() = e.Dispose()
  30. }

We can see that the is actually implemented using ParallelFor. The most general implementation is based one the IEnumerable interface in .Net, which is relatively inefficient compared to a for loop.

However, by using or, we create another array or list. While only creates an auxiliary sequence, which costs nearly no memory!

So usually, I use to map arrays or lists to keep memory usage small. When performance really matters, I use array and for loop, i.e. I write imperative code. In my experience, writing imperative code in F# sometimes still introduces some runtime errors, while this is rare in pure functional code.

When I only need to iterate an array, I will directly use Array.iter since it is efficient, and use no extra memory.

All in all: performance costs. Do some quick estimation before you write imperative code. Also know the low-level implementation of your daily-use libraries is critical for you to estimate the performance.

5. Access F# matrices and vectors

The Vector<’T> type is implemented using 1-D array and the Matrix<’T> type is implemented using 2-D array for dense case and row-sparse for sparse case. Details are in a matrix series post.

Let m be a matrix, we could also use m.[i,j] to access the element, which is actually a function call to the the internal array. Thus there is some overhead. Some profiling is reported in a previous post.

The way to get rid of this kind of overhead is to program on the internal arrays directly. As in this answer in Stackoverflow.

Tuesday, May 4, 2010

Why F# is the language for data mining

25 Sep 2012, the library libml is on bitbucket: The code was written more than two years ago and has not been maintained since then. It was to accompany this article, but not to be used practically.

F# is a newly released language in Visual Studio 2010, as the 4-th language in .Net platform officially supported by Microsoft, aside with C#, VB.Net and C++/CLI. It has been a research project within Microsoft Research Cambridge for more than 5 years. Its origin could date back to the ML language designed by Turning Award winner Robin Milner in 1970s.
As a new yet old static-typed functional language designed for .Net platform, it will spread quickly in the coming years. It is such enjoyment to explore its usage in data mining, or more broadly, statistical computing area that I now share some my personal experience using it by trying to answer one question:
Why F# is the language for data mining.
This article is divided into 3 sections:
1. a review of data mining environments and a feature summary from them.
2. the answer. I list 7 reasons with 2 short examples.
3. a preview of a data mining library in F#, which will be open sourced this summer.
Before we go to the answer of this question, let’s see some existing data mining environments.

Data mining environments

SAS and SPSS are the main commercial data mining software. SAS has a mature macro language, also a reasonable GUI support. SPSS has a good GUI support. Their data mining algorithms are written in C/C++. The user doesn’t need to know about the underlying languages (C/C++), the data mining API is provided by a macro language (SAS) or GUI (SPSS). Using a macro language or GUI greatly eases users to perform data mining tasks.
There are also other products, e.g. the decision tree family developed by Salford.
Matlab is perhaps the most popular software in data mining community, especially in machine learning community. Although Matlab is not free, it is easily available in schools and research institutions. There are a lot of free programs in Matlab to perform data mining tasks. Matlab is so popular that putting an data mining algorithm name + "matlab" into Google gets what you need. The other reason is that the matrix operations in Matlab are efficient and convenient considering that the data tables in data mining are either dense or sparse matrices.
Free/Open source:
There are quite a lot of open source software of different quality. Here’s a list I am interested:
1. R. R is for statistics. The core part of R contains the R languages and the stable statistical procedures. The data mining part is a set of external packages. Most of the data mining models are available, e.g. Decision Trees, Neural Network, SVM, etc. R is a very capable language, however, it is not popular in data mining community. For data mining and machine learning, Matlab is more popular than R. R core does not support sparse data, thus a lot of extensions do not consider this support, which is extremely important. As in data mining, a lot of datasets are sparse, a tool that has sparse and large scale design in the beginning is very critical for performing data mining.
2. Weka. Weka is the most famous open source software in data mining and machine learning field. If data mining has three perspectives: database, machine learning and statistics[2]. Weka is from the second. Weka is widely used in data mining courses, in which instructors are able to show a lot of data mining algorithms to the students, turning the formulas in the book into several mouse clicks and the resulting figures and graphs. Who use Weka for serious data analysis? As I know, most people only use Weka’s decision tree implementation J48, because decision tree is really useful sometimes and Weka is handy to run to get the tree and its results are close to the standard C4.5's. But its usage is often on dense data set, as building a tree on sparse data sets costs a huge amount of memory in Weka. Weka does support sparse dataset, however, the performance is very bad, the speed is slow, the memory usage is unacceptable. The design in Weka is too general, thus sacrifices the speed.
Weka is written in Java. So the extensions to Weka, e.g. a different boosting scheme for trees, are usually written in Java. This kind of task is better done in a language like R or Matlab, i.e. a interpreted script language, not in Java. There are interpreted languages in JVM, most notably, Scala and Clojure. Calling Weka from Scala or Clojure will be a good idea.
3. Orange. Orange is written in Python and C++. All data mining algorithms are coded in C++ for performance’s sake. Python is used as the script language to call the algorithms and the language to build the GUI. Using Python in data mining is a great idea. It is interpreted, an concise language to read and write, easy binding to C/C++. Orange only supports dense data set.
4. Lush. Designed by Prof. Yann LeCun. He is a big figure in machine learning, a definite expert. There are not so many researchers, who write great programs, and write great papers at the same time. Prof. LeCun is one of the few. Lush is a Lisp environment with bindings to algorithms written in C/C++. The neural network in it is mature. Lisp is good, very expressive. However, nowadays, not many people write it.
after we have seen a set of data mining software, we can draw a conclusion on

the most favored features of a data mining environment:

1. Data mining algorithms are built in. The implementation should reliable and efficient. It should supports dense and sparse datasets equally well. It should also be robust, e.g. effectively handle missing values, large datasets, etc.

2. script language and iterative shell. The script language could be used to do data preprocessing, calling data mining procedures, visualization and report easily.

3. Math: Matrix and vector (both sparse and dense), linear algebra, common numerical optimizations. Sometimes, we need to design new algorithms or modified existing algorithms to suite one specific data mining application. In this case, these math tools are really important, e.g. a simple logistic regression could be done in several lines of code if these math modules are available.

4. GUI is preferred, but not a must. GUI is good for data exploration and visualization, e.g. Excel is used a front end for a lot of Business Intelligence solutions for its great GUI and stability. When performing data mining, most of the time, we don't need GUI (just as when we use Matlab, we seldom use its START menu and prefer command line).
5. A integrated system. You use tool A do download the data, use tool B to do clean the data, use tool C to do some matrix operations on the data, finally you get a dataset and put it into an SVM command line tool. Not the end, you still need to parse the ouput of the SVM to get the result. And think about you need to repeat this process for several times. What's wrong here? Tools are scattered. The Unix philosophy "do one thing and do it well" does not apply here because the data structure from one thing to the other is very complex. This is why we need a integrated system, at least data processing and model building are connected seamlessly, like in Matlab.

Weka has all the 5 points. However, none of them are very good. E.g. Java is hard to be used as a script language as you need to write a lot of lines of code and compile them to perform an easy task, and using Java also requires you to be very familiar with Weka's java classes, which is just huge. Java could also be used to do data preprocessing, however, few use it for this purpose. Python, Perl or F# are more appropriate.


Following lists the notable features, which I think are helpful for data mining.
1. Succinct, static typing at the same time. These two features are combined together. When we write programs, we need to think about the programming logic and the programs, e.g. write the loop, define a variable, write the type for it, etc. Programming logic is the core thing, but every language has a syntax to obey, so we need to write a lot of other things, which could be distractive during programming because you need to think about more things at the same time. More succinct means you don’t need to write so many `declarations’ for your programs. Python is also succinct, however it is not static typing, which means typing checking is done during runtime. Dynamic typing is slow. In a lot of cases, F# does not need to write types, its other syntax is also succinct, it is also static typing at the same time. Combining these two together means we can focus on the programming logic and at the same time keep the performance.
2. Functional and programming style. F# is functional, thus the A, B, C of a functional programming also applies to F#. Here A could be ‘function as the first class or high order functions’, which means you could write:
a |> split |> Seq.filter notStopWord |> stemmer
to perform string split, filtering using a self defined filter and doing word stemming in a single line!
In an OOP language, this would require to use a pipe line design pattern, which is far more complex than high order functions here.
3. Functional and parallel. A function is pure if it does not change anything outside the function, and every time you pass the same parameter into it, it gets the same result (this is a weaker version of purity). Based on this, we can easily parallel pure functions. In data mining, most of cases, we can simply split the data into several parts to get parallel execution.
As an example, following is a piece of code to perform multiple k-means clusterings at the same time:
if isParallel then
let dorepeats =
{0..restart-1} |> (fun _ -> doClustering(ds, k, maxiter, init, algo) :?> kmeansResult)
dorepeats |> PSeq.toList |> List.min :> IClusterResult
(* use async
let dore =
{0..restart-1} |> (fun _ -> async { return doClustering(ds, k, maxiter, init, algo) :?> kmeansResult })
dore |> Async.Parallel |> Async.RunSynchronously |> Array.min :> IClusterResult
//printfn "no parallel"
let dorepeats =
{0..restart-1} |> (fun _ -> doClustering(ds, k, maxiter, init, algo) :?> kmeansResult)
dorepeats |> Seq.min :> IClusterResult
Note that the difference between sequential code and parallel code is minor.
4. OOP. At a high level of designing an application, the design patterns come from the application itself. A data mining application has its own working patterns. Design patterns are what OOP is good at. F# provides limited, but concise support for OOP. E.g. the following is the interface for a learner and predictor:
type IPredictor =
abstract member Predict: gvector -> int
abstract member Predict: datavecs -> int array
abstract member Predict: dataset2 -> float * (int array)
type IProbPredictor =
abstract member Predict: gvector -> int * (float array)
abstract member Predict: datavecs -> (int array) * (float array array)
abstract member Predict: dataset2 -> float * (int array) * (float array array)
type ILearner =
abstract member Learn: dataset2 * parameters -> IPredictor
A classification algorithm has two components, a learner and predictor. All learners implement ILearner interface. All predictors implement the IPredictor interface,some predictors also implement the IProbPredictor interface if it supports probability output. Because the code is so concise, I can change this interface more easily than in C#.
5. Interpret language. This is simply a must! Everyone wants to write a single line to perform data loading and model building together and see the result immediately.
6. .Net! If F# has the above 5, but without this one, it will not be successful. Tied with .Net has a lot of advantages, most notably a) a stable environment and a lot of stable libraries, b) writing components in C# when needed, c) the components in F# could be used by other .Net languages with no effort d) Visual Studio 2008 and 2010 are very good IDEs and finally e) Microsoft supports it, the `evil company’ is more responsible for its products than others and at last f) the performance of .Net is very good.
7. IO and preprocessing! This point is subsumed in previous points. However it is so important that I separate it out. A lot of software simply assume that data mining is the processing applying an algorithm to a dataset. This is WRONG! It is said that 80% of the data mining work is to preparing the data to get a dataset[1]. To prepare the data, we need to do a lot of IOs, from Internet, databases, xml files etc. .Net provides the most stable yet simple way to do these tasks. F# language is also well suited to perform data preprocessing. As an interpreted language, we can test and see the result immediately.
I use two examples from the code in my data mining library as examples to show F#'s strength.
Example 1. Wrod Tokenizer.
The following code is the Tokenizer module for text mining. It is so short, yet can perform various tokenizing tasks, removing stop words, doing stemming, bi-gram etc.. The conciseness comes from F# as a functional language and .Net library.
module Tokenizer =
let sep = [|' '; '\r'; '\n'; '-'; '.'; ',' ; '\t'; '!'; '?'; '\'' |]

let unigramToken (s:string) =
s.Split(sep, StringSplitOptions.RemoveEmptyEntries) |> Seq.filter (fun w -> not (isStopWord w)) |> (fun s -> s.ToLower())

let unigramTokenWithStem s = s |> unigramToken |> stemmer

let bigramToken (s:string) =
let uni = s.Split(sep, StringSplitOptions.RemoveEmptyEntries)
let bi = seq { for i=0 to uni.Length-2 do yield (uni.[i]+uni.[i+1]) }
bi |> Seq.filter (fun w -> not (isStopWord w)) |> (fun s -> s.ToLower())

let bigramTokenWithStem s = s |> bigramToken |> stemmer

let defaultToken = unigramTokenWithStem
Example 2. Parallel Tasks
You probably see similar code before as an example of the F# asynchronous programming. Here I changed the 'example' code to let it disguises as a Mozilla browser(this is useful when you try to crawling Google search results), also it does some retries and support exception handling. The code is so short!! Using it downloading 10000 posts (I need some special posts from a question answer forum for my research) in parallel costs only half an hour.
let fetchUrl (url:string) =
async {
let req = WebRequest.Create(url) :?> HttpWebRequest
req.UserAgent <- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"; req.Method <- "GET"; req.AllowAutoRedirect <- true;
req.MaximumAutomaticRedirections <- 4;
let! response1 = req.AsyncGetResponse() let response = response1 :?> HttpWebResponse
use stream = response.GetResponseStream()
use streamreader = new System.IO.StreamReader(stream)
return! streamreader.AsyncReadToEnd() // .ReadToEnd()
_ -> return "" // if there's any exception, just return an empty string
// checked and found that pages that cannot be retrieved mainly because the pages are already deleted
// not because of the downloader
let getPages pageIds rooturl =
pageIds |> (fun id -> let name = rooturl + id.ToString() in fetchUrl name) |> Async.Parallel |> Async.RunSynchronously

A data mining library in F#

In the past month, I have written a prototype of an incomplete data mining system in F#. The more I write, the more I feel that it is the language for data mining. The library will be open sourced this summer after functionality completion and testing. I’d like to pronounce some highlights here:
1. Datasets and IO.
Dense and Sparse datasets are equally support! Various data formats are supported, csv files, weka's ARFF format, libsvm/svm light data format, Excel table, for both read and save. The library does not have its own data format. Processed data are stored in different formats for different needs. Using a built-in database will be considered in the future, e.g. the light-weight SQLite.
There are a set of preprocessing functions for datasets, e.g., filling missing values, feature value normalization, etc.
2. Data mining algorithms.
A few State-of-the-art data mining algorithms are implemented:
classification: decision tree family, logistic regression, support vector machines (by compiling libsvm to .Net using C++/CLI), neural networks, Naive Bayes, KNN.
regression: simple least square linear regression, neural networks and SVM(using libsvm).
clustering: k-means and its variants, EM on Gaussian Mixture Model, spectral clustering
dimension reduction: SOM, PCA, LDA.
feature selection: Mutual Information, decision trees, logistic regression with Wald test.
MISC: Probability Latent Semantic Analysis(PLSA), co-clustering
association rule and sequence mining: I have a plan in mind, but the actual implementation may delay.
The library is mainly for data mining, not for machine learning research. So only the most useful models are implemented. I hope after several years, the they will be very stable with few bugs. Trying to make it big would be dangerous. E.g. Weka is big in the sense of its algorithms, however, the standard statistical and data mining workhouse, logistic regression, is badly implemented there. If even logistic regression cannot apply to a modest text classification problem, what shall we expect Weka's other algorithms?
Also, implement a lot of immature models is less meaningful for practical data mining. E.g. the data mining module in SQL Server 2008 only implements a few state-of-the-art models. The novel algorithms are suitable to be implemented on demand for a specific application or dataset.
3. Parallel.
Some algorithms have parallel implementations in algorithm level, e.g. k-means clustering. At a higher level, I define a set of tasks using the type systems in F#, e.g. the classification task is defined as
type classificationEvaluationTask =
| CrossValidation of int * dataset2 * ILearner * parameters
| RandomSplit of float * dataset2 * ILearner * parameters
| TrainTest of dataset2 * dataset2 * ILearner * parameters
| LeaveOneOut of dataset2 * ILearner * parameters
Data mining tasks in the library run in parallel on default, e.g. performing a classification task and a clustering task at the same time, or, doing cross validation on different parameters to choose the best parameter of a model at the same time. A major technique to boost the performance of classifiers is ensemble learning, which bags a lot of models together. Parallel really helps a lot in this kind of learning task. Of course, parallel programming could be done in C++/Java using threads too. The advantage of using F# is that the interface is very easy to use, parallel executing various tasks at the same time needs only a single line.
4. A service provider design pattern for data mining algorithms. An algorithm, e.g. KNN will have several implementations, F# implementation is called on default, the user could also start a service provider and call the implementation of that provider. A candidate for such a wrapper is OpenCV’s machine learning component for its robustness, speed and friendly license.
ML library in OpenCV only supports dense dataset as machine learning tasks in CV usually use dense data format. I haven't found a good service provider for sparse data sets yet. But for large scale sparse data sets, the common practice is to use SVM, Logistic regression and tree family, which I already have code in C/C++.
5. Visualization. A limited support is already done by extending Luca Bolognese' code snippet. The visualization is mainly for understanding the data and diagnosing the algorithms, e.g. plotting the scatter of two columns of a data table, drawing the objective values through iterations. For high quality visualization, choose Mathematica, Matlab, R or Excel:)
6. Math library. It is a pity that there's no stable, yet free math library in .Net. For data mining, a math library is really important. I use the matrix and vector implementation in F#'s PowerPack. The matrix type in PowerPack supports both dense and sparse matrix, however sparse vector is not implemented. So I implemented the sparse vector myself. PowerPack now deprecates its math provider, which I pick up and use as the linear algebra module. So actually, I made little effort to get matrix supported.

Other math functions are implemented on demand. Some gradient based optimization solvers are implemented to solve linear regression and logistic regression, e.g. conjugate gradient decedent and LBFGS, which are two state-of-the-art gradient based solvers. Currently I implemented them in F#, the performance is modest compared to my C++ implementation. Code tuning is yet to be done. Some probability distributions are implemented for EM and Naive Bayes. Some important statistical algorithms, e.g., various tests, are also considered to be implemented in the future.

I'd like to thank Haipeng Zhang for reviewing this article for me.

[1] S. Zhang, C. Zhang, and Q. Yang.
Data preparation for data mining. Applied Artificial Intelligence, 2003, 17(5-6): 375-381.
[2] Z.-H. Zhou. Three perspectives of data mining. Artificial Intelligence, 2003, 143(1): 139-146.