F# and Data Mining: WekaSharp: An F# wrapper for Weka

There are 3 posts in this series:

1. An F# wrapper for Weka. (this post) The minimal wrapper in F# for Weka.

2. More features. This post will improve on the minimal wrapper, e.g. more Dataset processing functions, some plotting functionality, etc.

3. Tutorial/Usage examples. This post is for end users who might have no interest in the implementation details and would rather learn how to use this wrapper to perform data mining tasks in .NET.

WekaSharp is available at CodePlex: http://wekasharp.codeplex.com/.

Weka is one of the most widely used data mining software packages. It contains over 50 data mining algorithms, good GUI support and well-written documentation. Because of these features, a lot of data mining courses use it as illustrative software.

Its GPL license also allows it to be used freely in academia.

Although the Weka GUI lets us perform data loading, filtering and classifier building interactively, sometimes programming is still needed. For instance, the GUI has poor support for parameter selection and for processing 100 data sets. For these tasks, scripting is a better approach. However, the Java language is not declarative, and it does not support an interactive shell either.

F# supports both features well, and data processing in F# is in general far better than in Java. So F# and Weka are a good combination.

Compile Weka library to a .Net dll

Weka is written in Java. To use it seamlessly within the .NET Framework, we need to compile it into a .NET assembly. Fortunately, the IKVM project is made for this purpose.

There is an online tutorial for compiling weka.jar to weka.dll. However, this tutorial does not mention how to use external jar libraries in Weka, e.g. libsvm.jar. Because Java’s class loading mechanism is different from .NET’s, we cannot simply compile different jars into separate dlls. The simplest way is to compile weka.jar and libsvm.jar into a single dll:

>ikvmc -sharedclassloader -target:library weka.jar libsvm.jar

Notice that there is a WekaDotNet project on SourceForge. That project is compiled using Visual J#, which is already deprecated in the Visual Studio family. Using IKVM to compile jars into .NET assemblies is an easier and more stable way.

The overall structure of the wrapper

Based on the Weka class hierarchy, I’ve defined several separate modules for different kinds of functionality:

* Common – contains the types shared across the whole wrapper, e.g. the union type for the various classifiers.

* Dataset – contains IO routines to read/save various datasets and functions (e.g. split) to manipulate/preprocess datasets.

* Parameter – contains functions to make parameters for the data mining algorithms; all the default parameters are also provided.

* Classify – contains standard classifiers.

* Cluster – contains standard clustering algorithms.

* Eval – contains functions to perform different kinds of classification/clustering tasks and output their results (e.g. accuracy/F1 measure).

In the following sections, I will present each module in detail.

Common module

In this module, I define all the types, including classifier types, clustering algorithm types and some evaluation-task types. This module is marked as AutoOpen, i.e., the types in the module are directly accessible from other modules.

[<AutoOpen>]
module Common =
    let NYI() = failwith "Not yet implemented!"

    type Dataset = core.Instances

    type parameters = string

    type DatafileType =
        Arff | LibSVM | Csv | Svmlight

    type ClassifierType =
        J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear

    type ClaEvalTask =
        | CrossValidation of int * Dataset * ClassifierType * parameters
        | RandomSplit of float * Dataset * ClassifierType * parameters
        | TrainTest of Dataset * Dataset * ClassifierType * parameters

    type ClustererType =
        KMeans | EM | DBScan

I’ve aliased Weka’s Instances class as Dataset, and I use a plain string for parameters.
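To illustrate why a union type is convenient for describing tasks, here is a minimal sketch that pattern-matches on ClaEvalTask. It is not part of the wrapper: the record used for Dataset is a stand-in for core.Instances so that the snippet runs without weka.dll, and describeTask is a hypothetical helper.

```fsharp
// Stand-in for core.Instances (assumption, so the sketch needs no weka.dll).
type Dataset = { Name: string; Rows: int }

type parameters = string

type ClassifierType =
    J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear

type ClaEvalTask =
    | CrossValidation of int * Dataset * ClassifierType * parameters
    | RandomSplit of float * Dataset * ClassifierType * parameters
    | TrainTest of Dataset * Dataset * ClassifierType * parameters

// Hypothetical helper: render a task as a one-line description.
// The compiler warns if a case of the union is left unhandled.
let describeTask (task: ClaEvalTask) =
    match task with
    | CrossValidation (folds, ds, ctype, para) ->
        sprintf "%d-fold CV of %A on %s with options '%s'" folds ctype ds.Name para
    | RandomSplit (ratio, ds, ctype, para) ->
        sprintf "random split (ratio %g) of %A on %s with options '%s'" ratio ctype ds.Name para
    | TrainTest (train, test, ctype, para) ->
        sprintf "train %A on %s, test on %s with options '%s'" ctype train.Name test.Name para

let iris = { Name = "iris"; Rows = 150 }
let description = describeTask (CrossValidation (5, iris, SVM, ""))
```

Each case carries everything needed to run the task, so a task can be built in one place and executed or logged in another.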

Dataset module

This module includes functions to read/save 4 kinds of data files: Weka’s ARFF, LibSVM, CSV and SVMlight. It also has a function randomSplit to randomly split a dataset with a given ratio.
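The idea behind a random split can be sketched on a plain array. This is not the wrapper's implementation, which works on Weka Instances; the function below is a hypothetical in-memory analogue that assumes the ratio is the fraction of rows that goes to the training part, and it takes an explicit seed so that results are reproducible.

```fsharp
// Hypothetical in-memory analogue of Dataset.randomSplit (not the wrapper's code).
// ratio is assumed to be the fraction of rows placed in the first (training) part.
let randomSplitArray (seed: int) (ratio: float) (rows: 'a[]) =
    if ratio < 0.0 || ratio > 1.0 then
        invalidArg "ratio" "ratio must be between 0 and 1"
    let rnd = System.Random(seed)
    let shuffled = Array.copy rows
    // Fisher-Yates shuffle: swap each position with a random earlier-or-equal one.
    for i in shuffled.Length - 1 .. -1 .. 1 do
        let j = rnd.Next(i + 1)
        let tmp = shuffled.[i]
        shuffled.[i] <- shuffled.[j]
        shuffled.[j] <- tmp
    let nTrain = int (System.Math.Round(ratio * float shuffled.Length))
    // The first nTrain shuffled rows form the training part, the rest the test part.
    Array.splitAt nTrain shuffled

let train, test = randomSplitArray 42 0.7 [| 1 .. 10 |]
```

With 10 rows and a ratio of 0.7, the first part has 7 rows and the second has 3; which rows land in which part depends on the seed.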

This module is currently far from complete. For instance, we can only load datasets from 4 kinds of disk files, and there is no way to build a dataset from in-memory data, e.g. an array of vectors. We also provide only one data preprocessing step, random split; many other common preprocessing steps are needed.

I’d like to use another post to finish these. In this post, delivering the whole working wrapper (although incomplete) is more important.

Here is some of the implementation:

let readDataFile (ftype: DatafileType) (fname:string) =
    let loader =
        match ftype with
        | Arff -> new converters.ArffLoader() :> AbstractFileLoader
        | LibSVM -> new converters.LibSVMLoader() :> AbstractFileLoader
        | Csv -> new converters.CSVLoader() :> AbstractFileLoader
        | Svmlight -> new converters.SVMLightLoader() :> AbstractFileLoader
    try
        loader.setFile(new java.io.File(fname))
        loader.getDataSet()
    with
        | _ -> failwith "dataset loading error"

Its inputs are a DatafileType and the file name. Based on this function, 4 concrete functions are defined:

let readArff = readDataFile DatafileType.Arff
let readLibsvm = readDataFile DatafileType.LibSVM
let readCsv = readDataFile DatafileType.Csv
let readSvmlight = readDataFile DatafileType.Svmlight

The data saving functions are implemented similarly.

Parameter module

This module contains functions that create parameters for different data mining algorithms. As Weka uses a space-separated string as parameters, it would be unclear to a first-time user what “-R 1.0 -M -1” means for a logistic regression.

I have provided a default parameter string for each algorithm.
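As an illustration of the idea, here is a minimal sketch of a typed parameter builder for logistic regression. The names LogRegOptions, defaultLogReg and toOptionString are hypothetical and are not necessarily what the Parameter module uses. The sketch relies on Weka's Logistic class taking -R for the ridge value and -M for the maximum number of iterations (-1 means iterate until convergence).

```fsharp
// Hypothetical typed view of the logistic regression options (not the wrapper's API).
type LogRegOptions =
    { Ridge: float          // -R: ridge value in the log-likelihood
      MaxIterations: int }  // -M: maximum iterations, -1 = until convergence

// Defaults matching Weka's Logistic class.
let defaultLogReg = { Ridge = 1.0e-8; MaxIterations = -1 }

// Turn the record into the space-separated option string that Weka expects.
let toOptionString (o: LogRegOptions) : string =
    sprintf "-R %g -M %d" o.Ridge o.MaxIterations

// Override one field and keep the other default.
let customPara = toOptionString { defaultLogReg with Ridge = 1.0 }
```

Named fields document what each flag means at the call site, while the rest of the wrapper can keep passing plain strings.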

Classify module

In this module, I have wrapped the most commonly used classification algorithms: C4.5 decision tree (named J48 in Weka), Naive Bayes, logistic regression, multilayer perceptron neural net, and support vector machines (SVM).

The getClassifier function accepts a classifier type and an option string, and returns a classifier instance:

let getClassifier (ctype:ClassifierType) (option:string) =
    let classifier =
        match ctype with
        | J48 -> new classifiers.trees.J48() :> classifiers.Classifier
        | NaiveBayes -> new classifiers.bayes.NaiveBayes() :> classifiers.Classifier
        //| KNN ->new classifiers.lazy.IBk() :> weka.classifiers.Classifier
        | LogReg -> new classifiers.functions.Logistic() :> classifiers.Classifier
        | NeuralNet -> new classifiers.functions.MultilayerPerceptron() :> classifiers.Classifier
        | SVM -> new classifiers.functions.LibSVM() :> classifiers.Classifier
        | LibLinear -> new classifiers.functions.LibLINEAR() :> classifiers.Classifier
        | _ -> failwith "not supported"

    classifier.setOptions(core.Utils.splitOptions option)
    classifier

buildClassifier takes one more parameter, the dataset, and returns a trained classifier:

let buildClassifier (ctype:ClassifierType) (option:string) (ds:Dataset) =
    checkDataset ds
    let classifier = getClassifier ctype option
    classifier.buildClassifier(ds)
    classifier

Handy functions like getJ48, getSVM are also defined:

let getJ48 (option:string) = getClassifier ClassifierType.J48 option
let getSVM (option:string) = getClassifier ClassifierType.SVM option

There are several issues:

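1. KNN (Weka class: IBk) is not supported because its namespace in Weka is weka.classifiers.lazy, and lazy is a keyword in F#, so it cannot be written directly in a namespace path. Escaping the name with double backticks (classifiers.``lazy``.IBk) should work around this, but it is not done in this version.

2. The IO wrapper for classifiers is not provided yet.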

Cluster module

Similar to the Classify module, the Cluster module has getClusterer and buildClusterer. There are also shortcuts to concrete clustering algorithms such as getKmeans and buildKmeans.

Eval module

This module contains evaluation methods for classification and clustering. For classification, I’ve defined a task discriminated union:

type ClaEvalTask =
    | CrossValidation of int * Dataset * ClassifierType * parameters
    | RandomSplit of float * Dataset * ClassifierType * parameters
    | TrainTest of Dataset * Dataset * ClassifierType * parameters

The classifyEval function performs such a task:

let rec classifyEval (task:ClaEvalTask) =
    let rnd = new java.util.Random(System.DateTime.Now.Ticks)
    match task with
    | TrainTest (dsTrain, dsTest, ctype, para) ->
        Classify.checkDataset dsTrain
        Classify.checkDataset dsTest
        let eval = new classifiers.Evaluation(dsTrain)
        let cl = Classify.buildClassifier ctype para dsTrain
        eval.evaluateModel(cl, dsTest) |> ignore
        eval
    | CrossValidation(cv, ds, ctype, para) ->
        Classify.checkDataset ds
        let eval = new classifiers.Evaluation(ds)
        let cl = Classify.getClassifier ctype para
        eval.crossValidateModel(cl, ds, cv, rnd, Array.empty)
        eval
    | RandomSplit(ratio, ds, ctype, para) ->
        Classify.checkDataset ds
        let train, test = Dataset.randomSplit ratio ds
        classifyEval (TrainTest(train, test, ctype, para))

The result of this function is an evaluation result object. We can call its methods to get various evaluation metrics, or use the shortcut functions in the Eval module:

let getAccuracy (eval:classifiers.Evaluation) = eval.pctCorrect()

let getPrecison (eval:classifiers.Evaluation) (classIdx:int) = eval.precision(classIdx)
let getRecall (eval:classifiers.Evaluation) (classIdx:int) = eval.recall(classIdx)
let getF1 (eval:classifiers.Evaluation) (classIdx:int) = eval.fMeasure(classIdx)
let getAUC (eval:classifiers.Evaluation) (classIdx:int) = eval.areaUnderROC(classIdx)
let getClassifySummary (eval:classifiers.Evaluation) = eval.toSummaryString()
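For readers who want to see what these metrics compute, here is a small self-contained sketch that derives them from the confusion counts of one class. It does not call Weka; the BinaryCounts record and the function names are hypothetical. Note that Weka's pctCorrect returns a percentage (0 to 100), so the accuracy below is scaled the same way.

```fsharp
// Hypothetical confusion counts for one class treated as "positive".
type BinaryCounts = { TP: int; FP: int; FN: int; TN: int }

// Fraction of predicted positives that are truly positive.
let precision (c: BinaryCounts) = float c.TP / float (c.TP + c.FP)

// Fraction of true positives that were found.
let recall (c: BinaryCounts) = float c.TP / float (c.TP + c.FN)

// Harmonic mean of precision and recall.
let f1 (c: BinaryCounts) =
    let p, r = precision c, recall c
    2.0 * p * r / (p + r)

// Percentage of correctly classified instances, on a 0-100 scale.
let accuracyPct (c: BinaryCounts) =
    100.0 * float (c.TP + c.TN) / float (c.TP + c.FP + c.FN + c.TN)

let counts = { TP = 40; FP = 10; FN = 20; TN = 30 }
let p = precision counts      // 40 / 50
let r = recall counts         // 40 / 60
let f = f1 counts             // 80 / 110
let acc = accuracyPct counts  // 70 out of 100 instances
```

The sketch does not guard against empty denominators; with floating-point division, a class that is never predicted gives NaN for precision rather than an exception.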

Conclusion

Two features of the wrapper are:

1) A declarative wrapper to Weka. The following code shows how declarative the wrapper is:

// load the dataset
let dataset =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute

// describe the cross validation task using SVM as the classifier
let classifyTask = CrossValidation(5, dataset, ClassifierType.SVM, Parameter.SVM.defPara)

// perform the cross validation task and then get the average accuracy
let cvAccuracy =
    classifyTask
    |> Eval.classifyEval
    |> Eval.getAccuracy

The code is understandable even to those who do not know F#.

2) Interactive execution of data mining algorithms.

F# Interactive is well integrated into Visual Studio. It is very stable and convenient to use. The powerful VS IDE also provides good IntelliSense for F#.
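The parameter selection use case mentioned at the beginning of this post can be sketched in the same declarative style. The snippet below is only an illustration: selectBest is a hypothetical helper, the candidate strings assume that -C is the cost option of Weka's LibSVM wrapper, and fakeEvaluate stands in for a real evaluation so that the snippet runs without weka.dll.

```fsharp
// Hypothetical helper: score every candidate option string and keep the best one.
// With the wrapper, `evaluate` could be something like
//   fun para -> CrossValidation(5, dataset, ClassifierType.SVM, para)
//               |> Eval.classifyEval |> Eval.getAccuracy
let selectBest (evaluate: string -> float) (candidates: string list) =
    candidates
    |> List.map (fun para -> para, evaluate para)
    |> List.maxBy snd

// Candidate option strings (assumes -C sets the SVM cost).
let candidates = [ "-C 0.1"; "-C 1.0"; "-C 10.0" ]

// Made-up scores, used only so the sketch runs on its own.
let fakeScores = Map.ofList [ "-C 0.1", 90.0; "-C 1.0", 96.0; "-C 10.0", 94.0 ]
let fakeEvaluate (para: string) = fakeScores.[para]

let bestPara, bestScore = selectBest fakeEvaluate candidates
```

Because the evaluator is passed in as a function, the same helper works for any classifier type and any evaluation task; it throws on an empty candidate list, which List.maxBy does not accept.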

Monday, August 23, 2010