Friday, September 3, 2010

WekaSharp: Tutorial for using Weka in F#/.Net

There are 3 posts in this series:

1. An F# wrapper for Weka. The minimal wrapper in F# for Weka.

2. More features. (not available yet) This post will contain improvements over the minimal wrapper, e.g. more Dataset processing functions, some plotting functionality, etc.

3. Tutorial/Usage examples. (this post) This post is for end users who might have no interest in the implementation details and would rather learn how to use this wrapper to perform data mining tasks in .Net.

Installation

The easiest way is to use Visual Studio 2010 with the latest F# PowerPack. Download the release/source code at http://wekasharp.codeplex.com/ and change the DLL references to the ones in the bin\Release folder.

Visual Studio 2008 should also be able to open the project file. Only the parallel functionality cannot be used in 2008, as Parallel Sequence was only added in .Net 4.0.

Mono is not supported yet. However, it should be easy to figure out.

All necessary runtimes are included in the release, so the user does not need to download the Weka or IKVM runtimes separately.

Quick Guide

The user is encouraged to look at the script code in Script.fsx and run it to get a general feel for the wrapper. These examples cover typical usage of WekaSharp.

Reading the other two posts is also very useful for using WekaSharp and, more importantly, for changing the source code of WekaSharp to suit your own needs.

The License

As Weka is licensed under the GPL, WekaSharp must be released under the GPL as well.

The Efficiency Concern

People may be concerned about how fast the wrapper is compared to the original Weka in Java. I haven’t thoroughly tested this yet. Some casual tests show that they are about the same, at least on Windows platforms. For instance, I ran a 3-fold cross validation on the letter dataset using J48 decision trees; both the wrapper and Weka (run from the GUI) took about 20 seconds. It is quite surprising that IKVM’s compiler does such a good job.

As I will show later, WekaSharp has some parallel constructs, which enable us to use multiple cores more conveniently.

I’ve provided a module to do profiling. The design is similar to that of Matlab, but better :) You can use the Profile.tic() and Profile.toc("str") pair to time the execution of F# code, or you can get multiple timers by using Profile.getWatch().
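
To make the tic/toc idea concrete, here is a minimal sketch of how such a timer pair could be built on System.Diagnostics.Stopwatch. The module name ProfileSketch is made up for this illustration, and the code is not WekaSharp's actual Profile source; it only mirrors the three function names mentioned above.

```fsharp
open System.Diagnostics

// Hypothetical stand-in for a tic/toc profiler (not WekaSharp's actual source).
module ProfileSketch =
    // One shared stopwatch backs the tic/toc pair.
    let private watch = Stopwatch()

    // Reset the shared timer and start it.
    let tic () = watch.Restart()

    // Print the label with the elapsed time and return the milliseconds.
    let toc (label: string) =
        let ms = watch.Elapsed.TotalMilliseconds
        printfn "%s: %f ms" label ms
        ms

    // Hand out an independent timer that is already running.
    let getWatch () = Stopwatch.StartNew()

// Time a small computation.
ProfileSketch.tic ()
let total = Seq.sum (seq { 1L .. 1000000L })
let elapsedMs = ProfileSketch.toc "summing"
```

A single shared stopwatch keeps the tic/toc pair simple, while getWatch returns separate Stopwatch instances for cases where several timings overlap.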

The Dataset and IO

The Dataset module contains functions to manipulate datasets. Most of the functions are pure, i.e., they don’t change the input dataset but return a new dataset after processing.

Currently, four types of data files are supported: Weka’s ARFF format, LibSvm, Csv and SvmLight. Both loading and saving are supported, i.e. there are 8 functions.
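
For readers who have not seen an ARFF file before, the following sketch writes a tiny one by hand. It assumes only the basic ARFF layout (a @relation line, one @attribute line per column, then @data followed by comma-separated rows). The helper name toArff is hypothetical and is not one of the Dataset module's functions.

```fsharp
open System.Globalization

// Hypothetical helper (not a WekaSharp function): render numeric features
// plus one nominal class attribute as ARFF text.
let toArff (relation: string) (featureNames: string list) (classValues: string list) (rows: (float list * string) list) =
    let header =
        [ yield sprintf "@relation %s" relation
          for name in featureNames do
              yield sprintf "@attribute %s numeric" name
          yield sprintf "@attribute class {%s}" (String.concat "," classValues)
          yield "@data" ]
    // One comma-separated line per instance, with the class label last.
    let formatRow (features: float list, label: string) =
        let cells = features |> List.map (fun v -> v.ToString(CultureInfo.InvariantCulture))
        String.concat "," (cells @ [ label ])
    String.concat "\n" (header @ List.map formatRow rows)

let arffText =
    toArff "xor" [ "x1"; "x2" ] [ "T"; "F" ]
        [ ([ 0.0; 0.0 ], "T"); ([ 1.0; 1.0 ], "T"); ([ 0.0; 1.0 ], "F"); ([ 1.0; 0.0 ], "F") ]

printfn "%s" arffText
```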

The following code shows how to load an ARFF dataset and set the last column as the label:

let sonar =
    @"D:\temp\datasets-UCI\UCI\sonar.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute


You can also do the two steps in one function:



let iris =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArffLastAttributeAsLabel


Parameters



Data mining algorithms usually have parameters, sometimes very complicated ones. F# provides a very convenient syntax for optional parameters.



E.g. the full set of parameters of an SVM includes the SVM type, the kernel type, the C value, the parameters for the kernel, etc. Maybe you just need to set the C value: Parameter.SVM.MakePara(C = c).



Here Parameter is a module for setting parameters. For each data mining algorithm, there are two members:



* DefaultPara. The default parameter string, which is the same as the one used in the Weka GUI.



* MakePara. A function to make different parameters.



Here are several more complicated examples of the MakePara method for different algorithms:



Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0)
Parameter.KMeans.MakePara(K=5, maxIter=10, seed=10)
Parameter.KNN.MakePara(distanceFunction = Parameter.Manhattan, K = 3)
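
If you are curious how a MakePara-style member can be written, the sketch below uses F# optional arguments (the ?name syntax) together with defaultArg. The type name KMeansParaSketch, the default values, and the option letters in the resulting string are assumptions made for this illustration; they are not taken from WekaSharp's source.

```fsharp
// Hypothetical stand-in showing a MakePara-style member built with
// F# optional arguments; this is not WekaSharp's actual source.
type KMeansParaSketch private () =
    // Default option string (values chosen for illustration only).
    static member DefaultPara = "-N 2 -I 500 -S 10"

    // ?K, ?maxIter and ?seed are optional; defaultArg fills in the gaps.
    static member MakePara(?K: int, ?maxIter: int, ?seed: int) =
        let k = defaultArg K 2
        let iters = defaultArg maxIter 500
        let s = defaultArg seed 10
        sprintf "-N %d -I %d -S %d" k iters s

// Callers name only the arguments they care about.
let onlyK = KMeansParaSketch.MakePara(K = 5)
let allThree = KMeansParaSketch.MakePara(K = 5, maxIter = 10, seed = 10)
printfn "%s" onlyK
printfn "%s" allThree
```

Because every argument is optional, calling the sketch with no arguments gives back the same string as DefaultPara.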



You can also use the IDE's autocompletion to find out which parameters a data mining algorithm supports.



Classification and its evaluation



WekaSharp supports most of the common classification algorithms:



type ClassifierType =
    J48 | LogReg | NeuralNet | KNN | NaiveBayes | SVM | LibLinear | LinReg | AdaBoost | Bagging


There are three types of classification tasks:



type ClaEvalTask =
    | CrossValidation of int * Dataset * ClassifierType * parameters
    | RandomSplit of float * Dataset * ClassifierType * parameters
    | TrainTest of Dataset * Dataset * ClassifierType * parameters




To run a task, use the evalClassify function in the Eval module. The following code shows a complete example using J48 as the classifier:



(* playing decision trees on Iris dataset *)
// load the dataset
let iris =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute

// describe 3 kinds of classification tasks
let j48Tt = TrainTest(iris, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Cv = CrossValidation(5, iris, ClassifierType.J48, Parameter.J48.DefaultPara)
let j48Rs = RandomSplit(0.7, iris, ClassifierType.J48, Parameter.J48.DefaultPara)

// perform the task and get result
let ttAccuracy = j48Tt |> Eval.evalClassify |> Eval.getAccuracy
let cvAccuracy = j48Cv |> Eval.evalClassify |> Eval.getAccuracy
let rsAccuracy = j48Rs |> Eval.evalClassify |> Eval.getAccuracy







The evalClassify function returns a result object; type “.” after that object in the IDE to see the various kinds of results available. In the code above, we use predefined functions to get the accuracy from it.
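
To give a feeling for what a k-fold cross-validation task has to do with the data, here is a simplified sketch that only computes the train/test index split for each fold. The helper foldIndices is hypothetical. Weka itself also randomizes (and for nominal classes stratifies) the data before splitting, which this sketch leaves out.

```fsharp
// Hypothetical helper: split the indices 0 .. n-1 into k folds and return,
// for each fold, the pair (training indices, test indices).
let foldIndices (k: int) (n: int) : (int list * int list) list =
    [ for fold in 0 .. k - 1 do
          // Instance i belongs to the test set of fold (i mod k).
          let isTest i = i % k = fold
          let test, train = List.partition isTest [ 0 .. n - 1 ]
          yield (train, test) ]

// Five folds over 150 instances (the size of the iris dataset).
let folds = foldIndices 5 150
folds
|> List.iteri (fun i (train, test) ->
    printfn "fold %d: %d training rows, %d test rows" i train.Length test.Length)
```

Each instance appears in exactly one test set, so averaging the per-fold accuracies uses every instance for testing once.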



Clustering and its evaluation



Performing clustering is very similar to classification. You can define two types of clustering tasks:



type CluEvalTask =
    | ClusterWithLabel of Dataset * ClustererType * parameters
    | DefaultCluster of Dataset * ClustererType * parameters



ClusterWithLabel means that the labels of the dataset are used for evaluation. DefaultCluster does not require the dataset to have labels (actually, it does not allow datasets to have labels either), so the result will only contain the cluster assignments, but not accuracy, etc.



The following code shows a complete clustering example:



let irisLabeled =
    @"C:\Program Files\Weka-3.6\data\iris.arff"
    |> Dataset.readArffLastAttributeAsLabel

let kmeansTask = ClusterWithLabel(irisLabeled, ClustererType.KMeans, Parameter.KMeans.MakePara(K=3))
let emTask = ClusterWithLabel(irisLabeled, ClustererType.EM, Parameter.EM.MakePara(K=3))
let dbscanTask = ClusterWithLabel(irisLabeled, ClustererType.DBScan, Parameter.DBScan.DefaultPara)


let kmeansResult = Eval.evalClustering kmeansTask |> Eval.getClusterSummary
let emResult = Eval.evalClustering emTask |> Eval.getClusterSummary
let dbscanResult = Eval.evalClustering dbscanTask |> Eval.getClusterSummary
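
One simple way to score a clustering against known labels is purity: assign each cluster its majority class and count how many instances agree. The sketch below is only an illustration of that idea with a hypothetical helper named purity and made-up toy data. It is not necessarily the measure that Weka or getClusterSummary reports.

```fsharp
// Hypothetical helper: fraction of instances whose class equals the
// majority class of the cluster they were assigned to.
let purity (assignments: int list) (labels: string list) : float =
    let correct =
        List.zip assignments labels
        |> List.groupBy fst
        |> List.sumBy (fun (_, members) ->
            // Size of the largest class inside this cluster.
            members |> List.countBy snd |> List.map snd |> List.max)
    float correct / float (List.length labels)

// Toy data, invented for this example.
let clusters = [ 0; 0; 0; 1; 1; 1; 1; 2 ]
let classes = [ "a"; "a"; "b"; "b"; "b"; "b"; "c"; "c" ]
let score = purity clusters classes
printfn "purity = %.3f" score
```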



Bulk & parallel processing tasks



Sometimes, you need to run multiple tasks. E.g. you need to run the same task multiple times to see the mean and variance of the result, or you need to try different parameters for an algorithm, or you simply have different data mining tasks to run.



The following example shows how to create a batch of tasks and run them:



// load the data set
let sonar =
    @"D:\temp\datasets-UCI\UCI\sonar.arff"
    |> Dataset.readArff
    |> Dataset.setClassIndexWithLastAttribute

// set different parameters
let Cs = [0.01; 0.1; 1.; 10.; 50.; 100.; 500.; 1000.; 2000.; 5000. ]

// make the tasks with the parameter set
let tasks =
    Cs
    |> List.map (fun c -> Parameter.SVM.MakePara(C = c))
    |> List.map (fun p -> CrossValidation(3, sonar, ClassifierType.SVM, p))

Profile.tic()
// the accuracy result
let results =
    tasks
    |> Eval.evalBulkClassify
    |> List.map Eval.getAccuracy
Profile.toc("sequential time: ")




Here I create SVM tasks for different C values, run them, and get the accuracies as a list.
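
Once you have one accuracy per parameter value, picking the best parameter or summarizing repeated runs takes only a few lines of plain F#. The numbers in this sketch are invented purely for illustration; they are not results from the sonar experiment.

```fsharp
// Made-up accuracies, one per C value (illustration only, not measured results).
let cValues = [ 0.01; 0.1; 1.0; 10.0 ]
let accuracies = [ 0.61; 0.72; 0.78; 0.75 ]

// Pair each parameter with its accuracy and keep the best pair.
let bestC, bestAccuracy =
    List.zip cValues accuracies
    |> List.maxBy snd

// Mean and sample variance, useful when the same task is repeated several times.
let mean = List.average accuracies
let variance =
    (accuracies |> List.sumBy (fun a -> (a - mean) ** 2.0))
    / float (accuracies.Length - 1)

printfn "best C = %g with accuracy %g" bestC bestAccuracy
printfn "mean = %f, variance = %f" mean variance
```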



F# provides a very easy syntax for performing multiple tasks at the same time, so I also provide an evalBulkClassifyParallel function:



Profile.tic()
let resultsParallel =
    tasks
    |> Eval.evalBulkClassifyParallel
    |> List.map Eval.getAccuracy
Profile.toc("parallel (PSeq) time: ")

// sequential time: : 9767.804800 ms
// parallel (PSeq) time: : 6154.715500 ms


As the profiling shows, on a two-core machine, executing the tasks in parallel does boost the speed.
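
The same pattern can be tried without WekaSharp at all. The sketch below runs independent tasks with Array.Parallel.map from the F# core library instead of PSeq, so it is a swapped-in technique rather than what evalBulkClassifyParallel does internally. The fakeTask function is a made-up placeholder for an expensive evaluation.

```fsharp
open System.Threading

// Made-up placeholder for an expensive evaluation: wait, then return a number.
let fakeTask (c: float) : float =
    Thread.Sleep 50
    c * 2.0

let inputs = [| 0.01; 0.1; 1.0; 10.0 |]

// Sequential: the tasks run one after another.
let sequentialResults = inputs |> Array.map fakeTask

// Parallel: Array.Parallel.map spreads the independent tasks over worker
// threads and returns the results in input order.
let parallelResults = inputs |> Array.Parallel.map fakeTask

printfn "%A" parallelResults
```

This only works safely because the tasks do not share mutable state; each one reads its own input and produces its own result.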



Plotting



The plotting functionality is not finished, but it is still quite usable. To plot the accuracies for the different Cs in the SVM, we can use:



// do the plot
lc.column(y = results, xname = "different C", yname = "Accuracy", title = "SVM on sonar",
          isValueShownAsLabel = true ) |> display



Making Datasets from Memory



All the above examples use data from data files. In practice, we might want to convert data in memory, e.g. in an array, into a Weka dataset. The following example shows how to create a dataset from F# arrays:



(* create dataset from F# arrays *)

// make the data array
let data = [| 0.; 0.;
              1.; 1.;
              0.; 1.;
              1.; 0.; |]
let xorArray = Array2D.init 4 2 (fun i j -> data.[i*2 + j])

// make weka dataset from array
let xor0 = Dataset.from2DArray xorArray false

// add labels
let xor = xor0 |> Dataset.addClassLabels ["T"; "T"; "F"; "F"]

// make svm tasks

let rbfTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.DefaultPara)
let linearTask = TrainTest(xor, xor, ClassifierType.SVM, Parameter.SVM.MakePara(kernelType = Parameter.SVMKernelType.LinearKernel, C = 10.0) )

// rbf svm gets 100% accuracy
let rbfAccuracy = rbfTask |> Eval.evalClassify |> Eval.getAccuracy
// linear svm does not work on XOR data set
let linearAccuracy = linearTask |> Eval.evalClassify |> Eval.getAccuracy
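
The Array2D.init call in the example above relies on the flat array being laid out row by row. This standalone sketch uses only the F# core library and shows the index arithmetic by building the matrix and reading the rows back. The names flat, matrix, and rowsBack are chosen just for this illustration.

```fsharp
// Flat, row-major data: four rows with two columns each.
let flat = [| 0.; 0.; 1.; 1.; 0.; 1.; 1.; 0. |]
let rowCount, colCount = 4, 2

// Element (i, j) of the matrix lives at flat index i * colCount + j.
let matrix = Array2D.init rowCount colCount (fun i j -> flat.[i * colCount + j])

// Read the rows back to check the layout.
let rowsBack =
    [ for i in 0 .. rowCount - 1 ->
        [ for j in 0 .. colCount - 1 -> matrix.[i, j] ] ]

printfn "%A" rowsBack
```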



Conclusion



As a user of WekaSharp, I already benefit from it, as I can both process the data and run data mining algorithms over it in F#. Originally, I needed to write my data into an ARFF file, call Weka, and parse the output to get the result. The F# solution is far more declarative.



Another benefit is that we can write extensions to Weka in F#!



Enjoy!

8 comments:

  1. Hi Yin. Thanks for this, it's a great post. I am interested in installing it and trying it out. I have Visual Studio 2010 with F# and I have downloaded your F# program from CodePlex, but I do not quite get what you mean when you say 'change the dll references to ones at the bin\Release folder'.
    Could you explain further, please? Thanks again.

  2. @Trisia:

    WekaSharp uses IKVM.NET, which provides the Java runtime. bin\Release contains all these IKVM runtimes. You need to reference them in your project.

  3. Ethan Zhu:

    The wrapper is great!

    It seems that "libML.dll" is not in the directory bin\Release

    And the Scripts.fsx is still working without "libML.dll"

  4. Hi,

    libml is not released yet. But you can see its source code at https://bitbucket.org/blackswift/libml/overview

    I have been too busy recently to write articles and polish the code...

    btw libML.dll is not necessary for using WekaSharp.

  5. Hello Yin, thank you so much for this wrapper and the work you have put into it. I have a question about installing the wrapper in VS 2013. I downloaded the code from CodePlex and updated the references as you stated, and the build succeeded. However, when I run the sample script I get an error: "A project with an Output Type of Class Library cannot be started directly." Has anyone else encountered this? Thank you very much.

    Replies
    1. Check the properties of the project and make sure its Output type is Console Application.

  6. In the new version of this wrapper where do you point the references?

  7. I have written the program below in Visual Studio 2010 but am not getting the expected output.

    expected output is :

    Hello Java, from C#!
    Performing 66% split evaluation.
    49 out of 51 correct (96.078431372549%)


    code is :

    using System;
    //using System.Collections.Generic;
    //using System.Linq;
    //using System.Web;
    //using System.Web.UI;
    //using System.Web.UI.WebControls;

    public partial class _Default : System.Web.UI.Page
    {
    class DefaultClass
    {
    public static void Default(string[] args)
    {
    Console.WriteLine("hello java, fronm C#!");
    classifyTest();
    }
    const int percentSplit = 66;
    public static void classifyTest()
    {
    try
    {
    weka.core.Instances insts = new weka.core.Instances(new java.io.FileReader("C:/Users/jmk/Desktop/iris.arff"));
    insts.setClassIndex(insts.numAttributes() - 1);

    weka.classifiers.Classifier c1 = new weka.classifiers.trees.J48();
    Console.WriteLine("Performing" + percentSplit + "% split evalutaion.");

    //randomized the order of the instances in the dataset.

    weka.filters.Filter myRandom = new weka.filters.unsupervised.instance.Randomize();
    myRandom.setInputFormat(insts);
    insts = weka.filters.Filter.useFilter(insts, myRandom);

    int trainSize = insts.numInstances() * percentSplit / 100;
    int testSize = insts.numInstances() - trainSize;
    weka.core.Instances train = new weka.core.Instances(insts, 0, trainSize);

    c1.buildClassifier(train);
    int numCorrect = 0;
    for (int i = trainSize; i < insts.numInstances(); i++)
    {
    weka.core.Instance currentInst = insts.instance(i);
    double predictedClass = c1.classifyInstance(currentInst);
    if (predictedClass == insts.instance(i).classValue())
    numCorrect++;
    }
    Console.WriteLine(numCorrect + "out of " + testSize + "correct (" + (double)((double)numCorrect / (double)testSize * 100.0) + "%)");


    }
    catch (java.lang.Exception ex)
    {
    ex.printStackTrace();
    }
    }
    }

    protected void Page_Load(object sender, EventArgs e)
    {

    }
    }
