Sunday, September 19, 2010
WekaSharp: more features
There are 3 posts in this series:
1. An F# wrapper for Weka. The minimal wrapper in F# for Weka.
2. More features. (this post) This post contains improvements over the minimal wrapper, e.g. more dataset processing functions, some plot functionality, etc.
3. Tutorial/Usage examples. This post is for end users, who might have no interest in reading the implementation details, but rather want to know how to use this wrapper to perform data mining tasks in .Net.
Below is a list of features that have been added to WekaSharp.
Parameters for data mining algorithms.
F# features: named parameters and pattern matching.
Data mining algorithms usually have some parameters to set before they are applied to the data. E.g. K-means clustering needs the user to provide the number of clusters, the Support Vector Machine classifier needs the user to provide the kernel type, etc.
Different algorithms also have different sets of parameters. In Weka, algorithm parameters are encoded as a string (similar to the parameter style of command line tools). However, a first-time user would not be able to remember what “-R 100” means for a specific algorithm. So we’d like to have a parameter maker for each algorithm. F# fully supports this task with its named parameters.
Let’s see an example. The following code is for making parameters for Logistic Regression:
type LogReg() =
    static member DefaultPara = "-R 1.0E08 -M -1"
    static member MakePara(?maxIter, ?ridge) =
        let maxIterStr =
            let m =
                match maxIter with
                | Some v -> v
                | None -> -1
            "-M " + m.ToString()
        let ridgeStr =
            let r =
                match ridge with
                | Some v -> v
                | None -> 1.0E08
            "-R " + r.ToString()
        maxIterStr + " " + ridgeStr
and this is an example usage of it:
let logRegPara = LogReg.MakePara (ridge = 1.0)
You can see that besides the named-parameter functionality, we can also supply only a subset of the parameters.
Another F# feature highlighted in the Parameter module is pattern matching. Sometimes the parameters can be quite complicated, e.g. the nearest neighbor search algorithm used in KNN. There are 4 search algorithms, each of which works with only some of the distance metrics. Discriminated unions and pattern matching make the parameter generating task easy:
type DistanceFunction = Euclidean | Chebyshev | EditDistance | Manhattan
    with
    static member GetString(distanceFunction) =
        let d =
            match distanceFunction with
            | Some v -> v
            | None -> DistanceFunction.Euclidean
        match d with
        | DistanceFunction.Euclidean -> "weka.core.EuclideanDistance -R first-last"
        | DistanceFunction.Chebyshev -> "weka.core.ChebyshevDistance -R first-last"
        | DistanceFunction.EditDistance -> "weka.core.EditDistance -R first-last"
        | DistanceFunction.Manhattan -> "weka.core.ManhattanDistance -R first-last"

// inside KNN.MakePara: translate the optional search algorithm and
// distance function into Weka's option string
let distanceFunctionStr = DistanceFunction.GetString distanceFunction
let searchAlgorithmStr =
    let s =
        match searchAlgorithm with
        | Some v -> v
        | None -> NNSearchAlgorithm.LinearSearch
    match s with
    | NNSearchAlgorithm.LinearSearch ->
        "weka.core.neighboursearch.LinearNNSearch" + " -A " + "\"" + distanceFunctionStr + "\""
    | NNSearchAlgorithm.BallTree ->
        "weka.core.neighboursearch.BallTree" + " -A " + "\"" + distanceFunctionStr + "\""
    | NNSearchAlgorithm.CoverTree ->
        "weka.core.neighboursearch.CoverTree" + " -A " + "\"" + distanceFunctionStr + "\""
    | NNSearchAlgorithm.KDTree ->
        "weka.core.neighboursearch.KDTree" + " -A " + "\"" + distanceFunctionStr + "\""
If you are interested, you can browse the Java code (in Weka’s GUI component) for the same task and appreciate how much more succinct the F# is.
More Dataset manipulating functions
Weka’s dataset class (Instances) is quite hard to use. Normally a dataset is simply a table, with each row representing an instance and each column representing a feature. The dataset class in Weka is too general to be manipulated easily. Weka provides various filters to manipulate datasets, but their names are hard to remember and applying a filter takes several lines of code.
I first wrote a function that transforms a 2-dimensional array into a Weka dataset, so that in-memory data can be turned into a dataset more easily.
let from2DArray (d:float[,]) (intAsNominal:bool)
The parameter intAsNominal=true means that if a column contains only integers, that column is treated as a nominal column; otherwise all columns are treated as real-valued ones.
I also wrapped common filters into single functions, so applying one filter takes only one line of code; they are all in the Dataset module.
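For example, a one-line filter application might look like this (Dataset.standardize is a hypothetical name used only for illustration; check the Dataset module for the actual wrapper names):

// hypothetical one-liner: standardize all numeric columns of a dataset
let irisStd = iris |> Dataset.standardize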
Parallel processing
The PSeq module (in the F# PowerPack) makes writing compute-bound parallel programs effortless. And in data mining, when it comes to parameter tuning or trying different models on different datasets, it is much faster to run these tasks in parallel on a multi-core machine.
The following code shows two functions performing a bulk of classification tasks:
let evalBulkClassify (tasks:ClaEvalTask seq) =
    tasks
    |> Seq.map evalClassify
    |> Seq.toList

let evalBulkClassifyParallel (tasks:ClaEvalTask seq) =
    tasks
    |> PSeq.map evalClassify
    |> PSeq.toList
The latter is the parallel version, which simply changes Seq in the sequential version to PSeq.
Sunday, September 5, 2010
Numeric Performance in C, C# and Java
In this paper, Prof. Sestoft implemented several typical numerical programs in C, C# and Java, and measured their running times.
The conclusion of the paper is that managed code is also performant! Read the paper for the details.
Prof. Sestoft is an expert in compiler design and compiler writing. The draft also contains the generated byte code with his annotations. Enjoy!
Just a note:
Although the C code in the paper is native, it does not take advantage of modern CPU design. A fully optimized implementation, which is very hard for ordinary programmers to write, is faster than such straightforward native code. This is why Matlab/Numpy's matrix multiplication is faster than your own C code: Matlab uses Intel MKL's optimized implementation.
Friday, September 3, 2010
WekaSharp: Tutorial for using Weka in F#/.Net
There are 3 posts in this series:
1. An F# wrapper for Weka. The minimal wrapper in F# for Weka.
2. More features. (not available yet) This post will contain improvements over the minimal wrapper, e.g. more dataset processing functions, some plot functionality, etc.
3. Tutorial/Usage examples. (this post) This post is for end users, who might have no interest in reading the implementation details, but rather want to know how to use this wrapper to perform data mining tasks in .Net.
Installation
The easiest way is to use Visual Studio 2010 with the latest F# PowerPack. Download the release/source code at http://wekasharp.codeplex.com/ and change the dll references to the ones in the bin\Release folder.
Visual Studio 2008 should also be able to open the project file; only the parallel functionality cannot be used there, as parallel sequences were only added in .Net 4.0.
Support for Mono is not prepared yet; however, it should be easy to figure out.
All necessary runtimes are included in the release, so the user does not need to download Weka or the IKVM runtime.
Quick Guide
The user is encouraged to look at the script code in Script.fsx and run it to get a general feel for the wrapper. These examples cover typical usage of WekaSharp.
Reading the other two posts is also very useful for using WekaSharp, and more importantly, for adapting the source code of WekaSharp to your own needs.
The License
As Weka is licensed under the GPL, WekaSharp must be under the GPL as well.
The Efficiency Concern
People may wonder how fast the wrapper is compared to the original Weka in Java. I haven’t tested this thoroughly yet. Some casual tests show that they are just about the same, at least on Windows. For instance, for a 3-fold cross validation on the letter dataset using J48 decision trees, both the wrapper and Weka (run from the GUI) take about 20 seconds. It is quite surprising that IKVM’s compiler does such a good job.
As I will show later, WekaSharp also has some parallel constructs, which make it convenient to utilize multiple cores.
I’ve provided a module to do profiling. The design is similar to that of Matlab, but better :) You can use the Profile.tic() and Profile.toc(“str”) pair to time the execution of F# code, or you can get multiple timers by using Profile.getWatch().
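For example, a minimal timing sketch (reusing the j48Cv cross-validation task defined in the classification section below):

// time a single classification task; toc prints the message and elapsed time
Profile.tic()
let cvAcc = j48Cv |> Eval.evalClassify |> Eval.getAccuracy
Profile.toc("J48 cross validation: ")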
The Dataset and IO
The Dataset module contains functions to manipulate datasets. Most of the functions are pure, i.e., they don’t change the input dataset and create a new dataset after processing.
Currently 4 types of data files are supported: Weka’s ARFF format, LibSvm, Csv and SvmLight. Both loading and saving are supported, i.e. there are 8 IO functions in total.
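For example, a load/save round trip (the reader functions are shown below; the saver name and its argument order are my assumptions, mirroring the readers):

// load a csv file, then write it back out in ARFF format
// (Dataset.saveArff is an assumed name mirroring Dataset.readArff)
let ds = Dataset.readCsv @"D:\temp\mydata.csv"
Dataset.saveArff @"D:\temp\mydata.arff" ds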
The following code shows how to load an ARFF dataset and set the last column as the label:
let sonar =
    @"D:\temp\datasets-UCI\UCI\sonar.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute
You can also do the two steps in one function:
let iris =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArffLastAttributeAsLabel
Parameters
Data mining algorithms usually have parameters, sometimes very complicated ones. F# provides a very convenient syntax construct for optional parameters.
E.g. the full parameter set of an SVM includes the SVM type, the kernel type, the C value, the parameters of the kernel, etc. Maybe you just need to set the C value: Parameter.SVM.MakePara(C = c).
Here Parameter is a module for parameters setting. For each data mining algorithm, there are two members:
* DefaultPara. The default parameter string, the same as the one used in the Weka GUI.
* MakePara. A function to make custom parameters.
Here are several more complicated examples of .MakePara method for different algorithms:
Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0)
Parameter.KMeans.MakePara(K=5, maxIter=10, seed=10)
Parameter.KNN.MakePara(distanceFunction = Parameter.Manhattan, K = 3)
and you can also use the IDE’s Intellisense to find out which parameters a data mining algorithm supports.
Classification and its evaluation
WekaSharp supports most of the common classification algorithms:
type ClassifierType =
    J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear | LinReg | AdaBoost | Bagging
There are three types of classification tasks:
type ClaEvalTask =
    | CrossValidation of int * Dataset * ClassifierType * parameters
    | RandomSplit of float * Dataset * ClassifierType * parameters
    | TrainTest of Dataset * Dataset * ClassifierType * parameters
To run such a task, you use the evalClassify function in the Eval module. The following code shows a complete example using J48 as the classifier:
(* playing decision trees on Iris dataset *)
// load the dataset
let iris =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute
// describe 3 kinds of classification tasks
let j48Tt = TrainTest(iris, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Cv = CrossValidation(5, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Rs = RandomSplit(0.7, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
// perform the task and get result
let ttAccuracy = j48Tt |> Eval.evalClassify |> Eval.getAccuracy
let cvAccuracy = j48Cv |> Eval.evalClassify |> Eval.getAccuracy
let rsAccuracy = j48Rs |> Eval.evalClassify |> Eval.getAccuracy
The evalClassify function returns a result object; you can use “.” on that object in the IDE to find the various types of results available. Above, we used predefined functions to get the accuracy from it.
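Besides accuracy, the Eval module also has shortcut functions for per-class metrics (their definitions appear in the implementation post below); they take the evaluation object plus a class index:

// per-class F1 and AUC from the same evaluation result
let evalResult = j48Cv |> Eval.evalClassify
let f1Class0 = Eval.getF1 evalResult 0
let aucClass0 = Eval.getAUC evalResult 0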
Clustering and its evaluation
Performing clustering is very similar to classification. There are two types of clustering tasks:
type CluEvalTask =
    | ClusterWithLabel of Dataset * ClustererType * parameters
    | DefaultCluster of Dataset * ClustererType * parameters
ClusterWithLabel means that the label of the dataset is used for evaluation. DefaultCluster does not require the dataset to have labels (actually, it does not allow the dataset to have labels either), so the result only contains the cluster assignments, but not accuracy, etc.
The following code shows a complete clustering example:
let irisLabeled =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArffLastAttributeAsLabel
let kmeansTask = ClusterWithLabel(irisLabeled, ClustererType.KMeans, Parameter.KMeans.MakePara(K=3))
let emTask = ClusterWithLabel(irisLabeled, ClustererType.EM, Parameter.EM.MakePara(K=3))
let dbscanTask = ClusterWithLabel(irisLabeled, ClustererType.DBScan, Parameter.DBScan.DefaultPara)
let kmeansResult = Eval.evalClustering kmeansTask |> Eval.getClusterSummary
let emResult = Eval.evalClustering emTask |> Eval.getClusterSummary
let dbscanResult = Eval.evalClustering dbscanTask |> Eval.getClusterSummary
Bulk & parallel processing tasks
Sometimes you need to run multiple tasks: e.g. run the same task multiple times to see the mean and variance of the result, try different parameters for an algorithm, or simply run different data mining tasks.
The following example shows how to create a bulk of tasks and run them:
// load the data set
let sonar =
    @"D:\temp\datasets-UCI\UCI\sonar.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute
// set different parameters
let Cs = [0.01; 0.1; 1.; 10.; 50.; 100.; 500.; 1000.; 2000.; 5000. ]
// make the tasks with the parameter set
let tasks =
    Cs
    |> List.map (fun c -> Parameter.SVM.MakePara(C = c))
    |> List.map (fun p -> CrossValidation(3, sonar, ClassifierType.SVM, p))
Profile.tic()
// the accuracy result
let results =
    tasks
    |> Eval.evalBulkClassify
    |> List.map Eval.getAccuracy
Profile.toc("sequential time: ")
Here I created SVM tasks with different C values, ran them, and got the accuracies as a list.
F# provides very easy syntax for performing multiple tasks at the same time, so I also provide the evalBulkClassifyParallel function:
Profile.tic()
let resultsParallel =
    tasks
    |> Eval.evalBulkClassifyParallel
    |> List.map Eval.getAccuracy
Profile.toc("parallel (PSeq) time: ")
// sequential time: : 9767.804800 ms
// parallel (PSeq) time: : 6154.715500 ms
As the profiling shows, on a two-core machine, executing the tasks in parallel does boost the speed.
Plotting
The plotting module is not finished, but it is already quite usable. To plot the accuracies for the different Cs in the SVM experiment above, we can use:
// do the plot
lc.column(y = results, xname = "different C", yname = "Accuracy", title = "SVM on sonar",
          isValueShownAsLabel = true) |> display
Making Datasets from Memory
All the above examples use data from files. In practice, we might want to convert in-memory data, e.g. an array, into a Weka dataset. The following example shows how to create a dataset from F# arrays:
(* create dataset from F# arrays *)
// make the data array
let data = [| 0.; 0.;
              1.; 1.;
              0.; 1.;
              1.; 0.; |]
let xorArray = Array2D.init 4 2 (fun i j -> data.[i*2 + j])
// make weka dataset from array
let xor0 = Dataset.from2DArray xorArray false
// add labels
let xor = xor0 |> Dataset.addClassLabels ["T"; "T"; "F"; "F"]
// make svm tasks
let rbfTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.DefaultPara)
let linearTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0) )
// rbf svm gets 100% accuracy
let rbfAccuracy = rbfTask |> Eval.evalClassify |> Eval.getAccuracy
// linear svm does not work on XOR data set
let linearAccuracy = linearTask |> Eval.evalClassify |> Eval.getAccuracy
Conclusion
As a user of WekaSharp, I already benefit from it: I can both process the data and run data mining algorithms over it in F#. Previously, I needed to write my data into an ARFF file, call Weka, and parse the output to get the result. The F# solution is far more declarative.
Another benefit is that we can write extensions to Weka in F#!
Enjoy!
Monday, August 23, 2010
WekaSharp: An F# wrapper for Weka
There are 3 posts in this series:
1. An F# wrapper for Weka. (this post) The minimal wrapper in F# for Weka.
2. More features. This post will contain improvements over the minimal wrapper, e.g. more dataset processing functions, some plot functionality, etc.
3. Tutorial/Usage examples. This post is for end users, who might have no interest in reading the implementation details, but rather want to know how to use this wrapper to perform data mining tasks in .Net.
WekaSharp is available at CodePlex: http://wekasharp.codeplex.com/.
Weka is one of the most widely used data mining software packages. It contains over 50 data mining algorithms, good GUI support and well-written documentation. Because of these good features, a lot of data mining courses use it as illustrative software.
Its GPL license also allows it to be used freely in academia.
However, while the Weka GUI enables us to perform data loading, filtering and classifier building interactively, sometimes programming is still needed. For instance, the GUI has poor support for parameter selection or for processing 100 datasets. For such tasks, scripting is a better way. But Java is not declarative, and it does not have an interactive shell either.
F# supports both of these well, and data processing in F# is in general far better than in Java. So F# and Weka are a good combination.
Compile Weka library to a .Net dll
As everybody knows, Weka is written in Java. To use it seamlessly within the .Net framework, we need to compile it into a .Net assembly. Fortunately, the IKVM project is made for exactly this purpose.
There is an online tutorial for compiling weka.jar into weka.dll. However, that tutorial does not mention how to use external jar libraries with Weka, e.g. libsvm.jar. Because Java’s class loading mechanism is different from .Net’s, we cannot simply compile different jars into separate dlls. The simplest way is to compile weka.jar and libsvm.jar into a single dll:
>ikvmc -sharedclassloader -target:library weka.jar libsvm.jar
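Once compiled, weka.dll can be referenced like any .Net assembly. A sketch for an F# script follows; the exact IKVM runtime dll names depend on the IKVM version and are an assumption here:

// reference the IKVM runtime and the compiled Weka assembly in an .fsx script
#r @"IKVM.OpenJDK.Core.dll"   // IKVM class-library dll (name varies by IKVM version)
#r @"weka.dll"                // the output of the ikvmc command above

open weka.core
// Instances has a constructor taking a java.io.Reader
let iris = new Instances(new java.io.FileReader(@"iris.arff"))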
Notice that there is a WekaDotNet project at SourceForge. That project was compiled using Visual J#, which has already been deprecated from the Visual Studio family. Using IKVM to compile the jars into .Net is an easier and more stable way.
The overall structure of the wrapper
Based on the Weka class hierarchy, I’ve defined several separate modules for different kinds of functionality:
* Common – contains all common types over the whole wrapper. E.g. the union type for various classifiers.
* Dataset – contains IO routines to read/save various datasets and functions (e.g. split) to manipulate/preprocess datasets.
* Parameter – contains functions to make parameters for the data mining algorithms, all the default parameters are also provided.
* Classify – contains standard classifiers.
* Cluster – contains standard clustering algorithms.
* Eval – contains functions to perform different kinds of classification/clustering tasks and output their results (e.g. accuracy/F1 measure).
In the following sections, I will present each module in detail.
Common module
In this module, I put all the common types, including the classifier types, clustering algorithm types and evaluation-task types. The module is marked AutoOpen, i.e., the types in it are easily accessible from the other modules.
[<AutoOpen>]
module Common =
    let NYI() = failwith "Not yet implemented!"

    type Dataset = core.Instances
    type parameters = string

    type DatafileType =
        Arff | LibSVM | Csv | Svmlight

    type ClassifierType =
        J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear

    type ClaEvalTask =
        | CrossValidation of int * Dataset * ClassifierType * parameters
        | RandomSplit of float * Dataset * ClassifierType * parameters
        | TrainTest of Dataset * Dataset * ClassifierType * parameters

    type ClustererType =
        KMeans | EM | DBScan
I’ve aliased Weka’s Instances class as Dataset, and use a plain string for parameters.
Dataset module
This module includes functions to read/save the 4 kinds of data files: Weka’s ARFF, LibSvm, Csv and SvmLight. It also has a function randomSplit to randomly split a dataset with a given ratio. This module is currently far from complete. For instance, we can only load datasets from the 4 kinds of disk files, and there is no way to build a dataset from in-memory data, e.g. an array of vectors. We also provide only one data preprocessing step, random split; many other common preprocessing steps are needed.
I’d like to use another post to finish these. In this post, delivering a whole working wrapper (although incomplete) is more important.
Here is some of the implementation:

let readDataFile (ftype: DatafileType) (fname:string) =
    let loader =
        match ftype with
        | Arff -> new converters.ArffLoader() :> AbstractFileLoader
        | LibSVM -> new converters.LibSVMLoader() :> AbstractFileLoader
        | Csv -> new converters.CSVLoader() :> AbstractFileLoader
        | Svmlight -> new converters.SVMLightLoader() :> AbstractFileLoader
    try
        loader.setFile(new java.io.File(fname))
        loader.getDataSet()
    with
    | _ -> failwith "dataset loading error"
Its input is a data file type and the file name. Based on this function, 4 concrete functions are defined:

let readArff = readDataFile DatafileType.Arff
let readLibsvm = readDataFile DatafileType.LibSVM
let readCsv = readDataFile DatafileType.Csv
let readSvmlight = readDataFile DatafileType.Svmlight
The data saving functions have similar implementations.
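A sketch of what the saver side plausibly looks like, assuming it mirrors readDataFile (the ArffSaver/LibSVMSaver/CSVSaver/SVMLightSaver classes and the setInstances/setFile/writeBatch methods are Weka’s, but this wrapper code itself is illustrative):

let saveDataFile (ftype: DatafileType) (fname:string) (ds:Dataset) =
    // pick the Weka saver matching the requested file type
    let saver =
        match ftype with
        | Arff -> new converters.ArffSaver() :> AbstractFileSaver
        | LibSVM -> new converters.LibSVMSaver() :> AbstractFileSaver
        | Csv -> new converters.CSVSaver() :> AbstractFileSaver
        | Svmlight -> new converters.SVMLightSaver() :> AbstractFileSaver
    saver.setInstances(ds)
    saver.setFile(new java.io.File(fname))
    saver.writeBatch()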
Parameter module
This module contains functions that create parameters for the different data mining algorithms. As Weka uses a space-separated string for parameters, it would be unclear to a first-time user what “-R 1.0 -M -1” means for a logistic regression.
I have provided a default parameter string for each algorithm.
Classify module
In this module, I have wrapped the most commonly used classification algorithms: C4.5 decision tree (named J48 in Weka), NaiveBayes, Logistic Regression, Multilayer perceptron neural net, and Support Vector Machines (SVM).

The getClassifier function accepts a classifier type and an option string, and returns a classifier instance:

let getClassifier (ctype:ClassifierType) (option:string) =
    let classifier =
        match ctype with
        | J48 -> new classifiers.trees.J48() :> classifiers.Classifier
        | NaiveBayes -> new classifiers.bayes.NaiveBayes() :> classifiers.Classifier
        //| KNN -> new classifiers.lazy.IBk() :> weka.classifiers.Classifier
        | LogReg -> new classifiers.functions.Logistic() :> classifiers.Classifier
        | NeuralNet -> new classifiers.functions.MultilayerPerceptron() :> classifiers.Classifier
        | SVM -> new classifiers.functions.LibSVM() :> classifiers.Classifier
        | LibLinear -> new classifiers.functions.LibLINEAR() :> classifiers.Classifier
        | _ -> failwith "not supported"
    classifier.setOptions(core.Utils.splitOptions option)
    classifier

buildClassifier takes one more parameter, the dataset, and returns a trained classifier:

let buildClassifier (ctype:ClassifierType) (option:string) (ds:Dataset) =
    checkDataset ds
    let classifier = getClassifier ctype option
    classifier.buildClassifier(ds)
    classifier

Handy functions like getJ48 and getSVM are also defined:

let getJ48 (option:string) = getClassifier ClassifierType.J48 option
let getSVM (option:string) = getClassifier ClassifierType.SVM option
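A quick usage sketch (illustrative lines of my own, assuming a labeled dataset ds and a default parameter string member following the Parameter.SVM.defPara convention used in the conclusion below):

// train a J48 decision tree on a dataset using its default options
let tree = Classify.buildClassifier ClassifierType.J48 Parameter.J48.defPara ds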
There are several issues:
1. KNN (Weka class: IBk) is not supported yet, as its namespace in Weka is classifiers.lazy, and lazy is a keyword in F# and thus cannot appear in a namespace path.
2. The IO wrappers for classifiers are not provided yet.
Cluster module
Similar to the Classify module, there are getClusterer and buildClusterer in the Cluster module. There are also shortcuts for concrete clustering algorithms, such as getKmeans and buildKmeans.
Eval module
This module contains evaluation methods for classification and clustering. For classification, I’ve defined a task discrete union:type ClaEvalTask =
| CrossValidation of int * Dataset * ClassifierType * parametersclassifyEval function is to do such a task:
| RandomSplit of float * Dataset * ClassifierType * parameters
| TrainTest of Dataset * Dataset * ClassifierType * parameters
let rec classifyEval (task:ClaEvalTask) =
    let rnd = new java.util.Random(System.DateTime.Now.Ticks)
    match task with
    | TrainTest (dsTrain, dsTest, ctype, para) ->
        Classify.checkDataset dsTrain
        Classify.checkDataset dsTest
        let eval = new classifiers.Evaluation(dsTrain)
        let cl = Classify.buildClassifier ctype para dsTrain
        eval.evaluateModel(cl, dsTest) |> ignore
        eval
    | CrossValidation(cv, ds, ctype, para) ->
        Classify.checkDataset ds
        let eval = new classifiers.Evaluation(ds)
        let cl = Classify.getClassifier ctype para
        eval.crossValidateModel(cl, ds, cv, rnd, Array.empty)
        eval
    | RandomSplit(ratio, ds, ctype, para) ->
        Classify.checkDataset ds
        let train, test = Dataset.randomSplit ratio ds
        classifyEval (TrainTest(train, test, ctype, para))
The result of this function is an evaluation object. We can use its properties to get various evaluation metrics, or use the shortcut functions in the Eval module:

let getAccuracy (eval:classifiers.Evaluation) = eval.pctCorrect()
let getPrecison (eval:classifiers.Evaluation) (classIdx:int) = eval.precision(classIdx)
let getRecall (eval:classifiers.Evaluation) (classIdx:int) = eval.recall(classIdx)
let getF1 (eval:classifiers.Evaluation) (classIdx:int) = eval.fMeasure(classIdx)
let getAUC (eval:classifiers.Evaluation) (classIdx:int) = eval.areaUnderROC(classIdx)
let getClassifySummary (eval:classifiers.Evaluation) = eval.toSummaryString()
Conclusion
Two features of the wrapper are:

1) A declarative interface to Weka. See the following code for how declarative the wrapper is:

// load the dataset
let dataset =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute
// describe the cross validation task using SVM as the classifier
let classifyTask = CrossValidation(5, dataset, ClassifierType.SVM, Parameter.SVM.defPara)
// perform the cross validation task and then get the average accuracy
let cvAccuracy =
    classifyTask
    |> Eval.classifyEval
    |> Eval.getAccuracy

The code is understandable even to those who do not know F#.
2) The ability to perform data mining algorithms interactively.
F# Interactive is well integrated into Visual Studio. It is very stable and convenient to use, and the powerful VS IDE provides good Intellisense for F#.

Saturday, August 14, 2010
Reading F# Projects, Part III: The F# math providers.
We’ve covered the usage of the math providers in the matrix series. In this post, the whole source code is decomposed.
As I find it time-consuming to convert OneNote notes into a blog format, please see the attached PDF file:
PDF.
Reading F# Projects, Part II: F# Set
The design patterns in a data structure project like F# Set are simple. However, understanding them really takes time, as data structures involve heavy generic programming, and the type system in F# (or any other static FP language) is very complex.
In this post, we will explore two aspects of F# Set:
1. The general implementation structure for a generic data structure. After learning this pattern, we could write other data structures in F# that are generic enough.
2. Something about generics and type inference in F#. We will come to this part in later readings.
In the F# source code, the immutable set implementation (set.fs) is a good place to start with data structures. Why not a more fundamental data structure like F# sequences? Because those are more entangled with the .Net platform, while F# Set is nearly pure F# code. Also because:
I love trees!
F# Set and Map are implemented using balanced binary search trees, to be exact, AVL trees. Balanced BSTs support insert, delete and find operations all in O(log n), so they are widely used for implementing dynamic sets.
In general, trees are the most beautiful data structures and have so many variants. They are also efficient: e.g. the R-tree and its relatives are fundamental data structures in every database, and quadtrees are used in computational geometry in theory and in spatial databases in practice.
The F# Set is implemented by an immutable AVL tree. Here immutable means that once a node is created, its values cannot be changed. In the CS literature, researchers use “persistent” to describe immutable data structures. If you do 100 insertions into an empty tree, you get 101 versions of the tree, where common nodes are shared to reduce the space cost. It can be proved that in such a tree an insertion still costs O(log n) time, plus O(log n) space for duplicating some nodes (usually along the insertion path).
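The user-visible consequence is easy to check in F# Interactive:

// persistence in action: Add returns a new set, and s1 itself is unchanged
let s1 = Set.ofList [1; 2; 3]
let s2 = s1.Add 4
printfn "%A" s1   // set [1; 2; 3]
printfn "%A" s2   // set [1; 2; 3; 4]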
As this post is not intended to be a tutorial on data structures, I won’t go into details for the tree implementation. We mainly study how an industry-strength data structure is implemented in F#.
The main structure
set.fs contains about 840 lines of code. The code skeleton is shown below:
namespace Microsoft.FSharp.Collections

type SetTree<'T> when 'T : comparison =
    | SetEmpty                                          // height = 0
    | SetNode of 'T * SetTree<'T> * SetTree<'T> * int   // height = int
    | SetOne of 'T                                      // height = 1

module internal SetTree =
    // functions:
    //   insert, delete, find, etc.
    //   split, rebalance, etc.

    type SetIterator<'T> when 'T : comparison =
        { mutable stack: SetTree<'T> list;  // invariant: always collapseLHS result
          mutable started : bool            // true when MoveNext has been called
        }

type Set<'T when 'T : comparison >(comparer:IComparer<'T>, tree: SetTree<'T>) =
    // member functions:
    //   Add, Remove, Contains, etc.
    // operators:
    //   - + Intersection, etc.
    // overrides:
    //   GetHashCode, Equals, ToString
    // implemented interfaces:
    //   IComparable
    //   ICollection<'T>
    //   IEnumerable<'T>
    //   IEnumerable

module Set =
    // set module functions:
    //   add, isEmpty, etc.
Under the Microsoft.FSharp.Collections namespace, there are four things: the SetTree<'T> type, the internal SetTree module, the Set<'T> type, and the Set module.
Most of the code is in the internal SetTree module.
The SetIterator<'T> type and its related functions are for implementing the IEnumerable interface.
Set<'T>(comparer, tree) is the public implementation of Set.
The Set module contains the common functions for manipulating a set.
In summary, a container like Set usually follows this pattern:
1. A type with member functions implementing the data structure.
2. A module of functions for manipulating the data structure; member functions can also do this, and usually we keep the functions in both places.
These are rules of thumb for implementing new containers in F#, e.g. F# Map.
Detail 1. Genericity
It would be meaningless if the Set implementation only supported integers or floats. We want it to be generic, i.e., to support any type. Since the underlying implementation is based on trees, the element type must also support comparison.
The other, more subtle, part of genericity is the comparer. There are situations where we have several string sets, but the rules used to order them are not the same, i.e., the strings may be compared using different rules. So we should take another parameter into consideration. (Refer to functors in OCaml if you want to see a more elegant way to deal with this situation.)
Although F# Set currently does not support different comparers, its internal implementation keeps the comparer in mind. (Because F# supports OOP, the different-comparers problem could be partially solved by inheriting from the original element type; anyway, this is inelegant.) We know that once a Set has some concrete values in it, the type is clear. Here are two ways to initialize a Set:
static let empty : Set<'T> =
    let comparer = LanguagePrimitives.FastGenericComparer<'T>
    new Set<'T>(comparer, SetEmpty)

new (elements : seq<'T>) =
    let comparer = LanguagePrimitives.FastGenericComparer<'T>
    new Set<_>(comparer, SetTree.ofSeq comparer elements)
We can see that both get the comparer for type 'T and pass that comparer into the constructor of Set<'T>.
Detail 2. The comparer
The above code gets the comparer for type 'T using FastGenericComparer, which lives in the big file prim-types.fs. The primitive types module deserves a separate, thorough discussion; there are quite a few ad hoc tricks in its implementation. Here we just need to know that it looks up a table to find a comparer, and if the type is one of the primitive types, there are optimizations for it.
Detail 3. From the empty to concrete
F# allows you to write [] and Set.empty. These empties seem to be dynamic. Actually they are not. You cannot write something like:

let empties = Array.create 100 Set.empty
the compiler gives the following error:
error FS0030: Value restriction. The value 'empties' has been inferred to have generic type
val empties : Set<'_a> [] when '_a : comparison
This is because the compiler cannot generalize these empties into a single type. It is safe to write

Set.empty + Set([1;2;3])

because the types of the two sets can be generalized.
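The standard workaround is to pin down the element type with an annotation, so there is nothing left to generalize:

// an explicit type annotation resolves the value restriction
let empties : Set<int> [] = Array.create 100 Set.empty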
The same situation applies to the zeros of the different number types.
Detail 4. The Iterator
Both of the IEnumerable interfaces are implemented. Notice the difference between the two: there is an upcast in the non-generic one:

interface IEnumerable<'T> with
    member s.GetEnumerator() = SetTree.mkIEnumerator s.Tree

interface IEnumerable with
    override s.GetEnumerator() = (SetTree.mkIEnumerator s.Tree :> IEnumerator)

Let’s see the actual implementation of the enumerator:
let mkIterator s = { stack = collapseLHS [s]; started = false }

let mkIEnumerator s =
    let i = ref (mkIterator s)
    { new IEnumerator<_> with
          member x.Current = current !i
      interface IEnumerator with
          member x.Current = box (current !i)
          member x.MoveNext() = moveNext !i
          member x.Reset() = i := mkIterator s
      interface System.IDisposable with
          member x.Dispose() = () }
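Since the iterator collapses the left spine first (collapseLHS), enumeration yields the elements in ascending order:

// elements come out sorted regardless of insertion order
let s = Set.ofList [3; 1; 2]
for x in s do printfn "%d" x   // prints 1, 2, 3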
Exercise
An exercise is to read map.fs in the F# core library. The implementation of F# Map is very similar to that of F# Set.