Saturday, April 3, 2010

Matrix and linear algebra in F#, Part IV: profile your program, find the bottleneck and speed it up: using matrix multiplication as an example

We will use the face recognition program we developed in Part III to illustrate how to profile programs.

The performance problem of the face recognition program is that it takes about 25 seconds to process 280 faces (each face has 10304=112*92 pixels). I want to find which part takes most of the time. Is it due to the eigen decomposition?

Profile

Profiling means knowing how much time and memory every part of your program take. Good profiling tools could give a detail report on the running behavior of a program. In .Net, there are several good tools for this purpose. Some of them are free.

However, these tools are too big to use in a explorative/interactive programming style. Following Matlab’s tic and toc, I have written them for F#:

let stopWatch = new System.Diagnostics.Stopwatch()

let tic() =
stopWatch.Reset()
stopWatch.Start()

let toc(msg:string) =
stopWatch.Stop()
printfn "%s %f ms" msg stopWatch.Elapsed.TotalMilliseconds
stopWatch.Reset()
stopWatch.Start()
I also use the same style for my C programs:

#include <time.h>
time_t stopWatch;
void tic()
{
stopWatch = clock();
}
void toc(const char *msg)
{
double s = (clock() - stopWatch) * 1000.0 / CLOCKS_PER_SEC;
printf("%s: %.2lf ms\n", msg, s);
stopWatch = clock();
}
This implementation enables us to use a sequence of tocs to profile a code block line by line without the need to write tic every time. However this simple profiler can not be nested as there is only a global timer.

After putting several tocs into my eigCov:


tic()
let colm = colMean faces
toc("mean face")
let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])
toc("minus mean face")
let L = A.Transpose * A
toc("get L")
let val1, vec1 = eig L
toc("lapack eig")
let v = val1 * (1.0 / float (ncol - 1))
let u = A * vec1.Transpose
toc("get u")
// normalize eigen vectors
let mutable cnt = 0
for i=0 to ncol-1 do
if
abs v.[i] < 1e-8 then
u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0
else
cnt <- cnt + 1
let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())
u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )
toc("normolize u")
It gets the following output:

mean face 639.891700 ms
minus mean face 195.484600 ms
get L 12885.751500 ms
lapack eig 247.015700 ms
get u 7169.749100 ms
normolize u 623.919700 ms

The bottle neck is in "Get L” and “get u”, both of which are matrix multiplication operations. The main computation part “lapack eig” actually costs very little time (as the 280-by-280 matrix is not big).

The matrix multiplication in “get L” is 280x10304 multiplies 10304x280, which is quite large!


Matrix Multiplication

Matrix multiplication is a great example to show that why we need well tuned libraries to perform numerical computations.

Table: the time cost for self-made(not optimized) implementations





















implementations operator * for matrix type 3 F# loops(use matrix) 3 F# loops(use float Array2D) 3 C loops (native)
time(seconds) 13.07 38.81 13.52 8.10
I used the following 3 loops:

for (i=0; i<N; i++)
for (j=0; j<N; j++) {
double s = 0;
for (k=0; k<M; k++)
s += a[i][k] * b[k][j];
c[i][j] = s;
}
The two F# programs are similar to the above one in C. We can find that if we use a lot of F# matrix element access operator [,], the performance is very poor. The * operator of matrix type does not have any optimization in it, so it costs the same amount of time as 3-loops (Array2D) does. Native code is faster than the managed ones as it does not do boundary checking for index.

Matlab, R and Numpy/Python all wrap optimized matrix implementations. Let us next see how they perform:

Table: the time cost for optimized implementations

















Software Matlab 2009a R 2.10.1 Numpy 1.3.0
time 0.25 1.98 0.35
Matlab is the fastest, but it uses 200% CPU. While R and Numpy are single threaded. R takes 2 seconds. Matlab uses Intel MKL, Numpy uses Lapack(maybe optimized version), I don’t know what algebra routines R uses.

We can see the difference between optimized versions and non-optimized ones is very large. A similar report from a .Net math software vender is here. The conclusion is that we should use optimized matrix multiplication procedure when matrices are large.


Add optimized matrix multiplication support to math-provider

The math-provider has made an wrapper for dgemm_, which is a BLAS procedure for computing:

C  :=
alpha*op( A )*op( B ) + beta*C
where op (A) = A or A’.
This is general case of matrix multiplication. The math-provider also has an wrapper for dgemv_, matrix-vector multiplication.

However, they are not directly callable from Lapack service provider.

add the following code to linear_algebra_service.fs:
let
MM a b =
  Service().dgemm_(a,b)

and add the following to linear_algebra.fs:
let MM A B =
if HaveService() then
LinearAlgebraService.MM A B
else
A * B
and add the following to linear_algebra.fsi:
/// BLAS matrix multiplication
val MM: A:matrix -> B:matrix -> matrix
Now we define a local name for it:

let mm = LinearAlgebra.MM 
Using BLAS matrix multiplication routine costs about 4 seconds, which is faster than native C’s performance(8 seconds), but worse than Matlab’s or Numpy’s. The reason is that the BLAS(Netlib BLAS 3.1.1) is not optimized for my platform. If I use ATLAS for BLAS and use an Intel Fortran Compiler, the performance will be close to Numpy’s or Matlab’s.

By using BLAS’s matrix multiplication routine for the face recognition, the total running reduces from 25 seconds to 9 seconds.


Attachment



wait.

2 comments:

  1. I just did a test on Java. It costs 24 seconds on average to perform A*A', where A is a 280*10304 random matrix.

    ReplyDelete
  2. This blog post provides valuable insights into profiling techniques and their importance for optimizing code.

    ReplyDelete