Monday, October 31, 2011

Any numerical computing environment on the Java platform?

Recently Tomas Petricek and I co-wrote a book chapter, Numerical Computing in F#. It is a free online chapter accompanying his Manning book, Real World Functional Programming. The chapter is a survey of numerical computing for .NET languages, especially F#. I will write another article about this chapter and present some material that was cut from the final version on MSDN.

But today I am looking at our competitor’s side: numerical computing environments for the Java platform. I am just wondering what the Java guys are doing for numerical computing.

Looking at mature numerical systems such as MATLAB, Mathematica and S-PLUS/R, we can identify three general elements of such a system:

1. Math library. A math library is a must for serious numerical computing. It is hard to build well-tested numerical procedures, so this kind of low-level code, e.g. linear algebra routines and FFT, is usually available as a reusable software component.

2. Interactive/Console. The read-eval-print loop (REPL) is very important for exploratory analysis because it gives quick feedback on small fragments of code.

3. Plot/chart library. A plot explains what’s going on clearly and quickly!

I did some web searching and found quite a few blog articles; I particularly want to mention the articles by Mikio L. Braun. He also maintains the jblas library, a JNI wrapper around ATLAS, a highly optimized BLAS implementation.

I am aware of Incanter, a Clojure project aiming to be R on the Java platform. But I don't think rewriting basic things, e.g. PCA and linear regression, for the Java platform is a good idea, since a stable implementation of these algorithms usually takes years: there are just too many details in numerical algorithms.

This time, I found a new library, ScalaLab, which is an ongoing project for numerical computing in Scala.

Let’s first talk about the libraries a little bit. For matrices and linear algebra, there are also two quite stable Java libraries: Colt and JAMA. The linear algebra functionality of the two libraries should be sound. But the performance is quite bad compared to native code, and even several times slower than pure .NET code. [The comparison to .NET code is from my personal experience.] But one should not worry too much about performance, as the bottleneck of a numerical project differs from project to project. A working system is the most important thing; various methods can help improve performance afterwards. By the way, the performance difference is usually not noticeable when working on small or sample datasets. I have been using R quite often recently; the matrix library shipped with its Windows version is from Netlib, a native one with ordinary performance. I haven’t had any performance complaints in months.
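To make "pure Java linear algebra" a bit more concrete, here is a minimal sketch of a naive dense matrix multiply that uses only the standard library. It is not how Colt or JAMA implement multiplication; the class name MatMulSketch, the array-of-arrays layout and the loop order are my own assumptions for illustration. Code of this shape is what a tuned native BLAS is being compared against.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Colt, JAMA or jblas.
final class MatMulSketch {

    // Multiplies an n-by-k matrix a with a k-by-m matrix b using plain nested loops.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        int k = b.length;
        int m = b[0].length;
        if (a[0].length != k) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            // The i-p-j order keeps the innermost loop walking along rows of b and c.
            for (int p = 0; p < k; p++) {
                double aip = a[i][p];
                for (int j = 0; j < m; j++) {
                    c[i][j] += aip * b[p][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // Prints [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

On small matrices like this one the difference from a native library is invisible, which matches the point above about small or sample datasets.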

For the REPL, it is basically a language issue. After reading around online, I think I would like something built on Scala, such as the ScalaLab environment. Among the mature or fairly mature JVM languages, Scala is the most similar to F#. The advantage of Scala over F# is that Scala is a component language, which means Scala has better OO features for library designers and software architects. However, at the small, low level of programming (e.g. a small function), F# feels much more functional and pure than Scala. This, I think, is partly because F# “IS” an ML, while Scala borrows some syntax from Java and C#. The other reason is that OO is at the heart of Scala: everything is an object. Scala’s OO model is greatly influenced by Smalltalk’s. I haven’t learned any Smalltalk. But after learning some Scala, I can appreciate why Alan Kay, the inventor of Smalltalk, says, “Actually I made up the term "object-oriented", and I can tell you I did not have C++ in mind.” So even functional programming in Scala is emulated with objects, while F# has a functional base first, and its implementation on top of .NET is transparent to its syntax. OK, the above comparison is off topic. Anyway, Scala can serve as a good statically typed scripting language with a REPL.

For the plotting library, JFreeChart seems to be the standard. It is also the plotting library used in Incanter [For this part, I think Incanter really does a great job!]. I am not sure whether there are advanced or commercial plotting libraries for Java, but I know that .NET has tons of them! Anyway, JFreeChart is OK for daily use.

The last thing, which I haven’t listed above, is parallel computing! The Java platform, or more specifically the Java Virtual Machine, has proved to be one of the most stable multithreaded systems ever built. For example, Hadoop is written in Java. Although people at Google have said that Hadoop is much slower than their C++ MapReduce, a lot of big companies are using it. Two months ago, I read the following nice paper, whose authors include Martin Odersky and Kunle Olukotun:

Language Virtualization for Heterogeneous Parallel Computing

It has just been published and already has 28 citations on Google Scholar. The idea is to build a DSL, which serves as middleware, for writing high-level programs, e.g. equations involving matrix operations. Underneath the DSL, complicated optimizations transform the program and make it parallel using heterogeneous technologies, e.g. GPUs and multi-core CPUs. So the programmer does not need to know anything about parallel computing, and the program is still executed in parallel on multiple cores! The DSL is an internal one built in Scala, another example of Scala's great expressiveness. Since the idea itself is not new, the novel part of the paper is demonstrating how well Scala fits the requirements of such a paradigm.
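To illustrate the general idea only (this is not the paper's system, and it does none of the paper's optimizations or GPU work), here is a toy internal DSL sketched in Java. The user writes a vector expression, nothing is computed until eval() is called, and the runtime is then free to choose how to execute it; here it simply makes one fused pass over the elements using a parallel stream. All the names (VectorDslSketch, Expr, vec, plus, times, eval) are made up for this sketch.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical toy DSL; every name here is an assumption made for illustration.
final class VectorDslSketch {

    /** A deferred vector expression: nothing is computed until eval() is called. */
    interface Expr {
        int length();

        double at(int i);

        // Element-wise addition, recorded as a new expression node.
        default Expr plus(Expr other) {
            if (other.length() != length()) {
                throw new IllegalArgumentException("length mismatch");
            }
            Expr self = this;
            return new Expr() {
                public int length() { return self.length(); }
                public double at(int i) { return self.at(i) + other.at(i); }
            };
        }

        // Multiplication by a scalar, also recorded rather than executed.
        default Expr times(double k) {
            Expr self = this;
            return new Expr() {
                public int length() { return self.length(); }
                public double at(int i) { return k * self.at(i); }
            };
        }

        // The "backend": one pass over the indices, spread across cores by a parallel stream.
        default double[] eval() {
            return IntStream.range(0, length()).parallel().mapToDouble(this::at).toArray();
        }
    }

    // Leaf node wrapping concrete data (copied so later changes by the caller have no effect).
    static Expr vec(double... values) {
        double[] copy = values.clone();
        return new Expr() {
            public int length() { return copy.length; }
            public double at(int i) { return copy[i]; }
        };
    }

    public static void main(String[] args) {
        Expr x = vec(1, 2, 3);
        Expr y = vec(10, 20, 30);
        // Reads like math (2x + y); the user code never mentions threads.
        double[] result = x.times(2).plus(y).eval();
        // Prints [12.0, 24.0, 36.0]
        System.out.println(Arrays.toString(result));
    }
}
```

The user code in main contains no parallel constructs; swapping the body of eval() for a different execution strategy would not change it. Doing this kind of thing with readable syntax and real optimizations is where a more expressive host language helps.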

I don’t quite have a conclusion for this post. Just some random thoughts on numerical computing in Java and an advertisement for my articles on MSDN :)

5 comments:

  1. You may be interested:
    http://www.reddit.com/r/scala/comments/f16wm/scientific_programming_in_scala/

  2. Scientific programming these days also generally requires the ability to access a database or an ETL (Extract, Transform, Load) tool for massaging data into the right format. The above 3 environments generally support these tasks, although I must say there is a ton of room for improvement.

    I like your blog. Very informative and a lot of clear descriptions.

  3. You could check out http://nd4j.org, https://github.com/deeplearning4j/nd4j, https://github.com/deeplearning4j/nd4s
