We will use the face recognition program we developed in Part III to illustrate how to profile programs.

The performance problem of the face recognition program is that it takes about 25 seconds to process 280 faces (each face has 10304=112*92 pixels). I want to find which part takes most of the time. Is it due to the eigen decomposition?

## Profile

Profiling means knowing how much time and memory every part of your program take. Good profiling tools could give a detail report on the running behavior of a program. In .Net, there are several good tools for this purpose. Some of them are free.

However, these tools are too big to use in a explorative/interactive programming style. Following Matlab’s tic and toc, I have written them for F#:

let stopWatch = new System.Diagnostics.Stopwatch()I also use the same style for my C programs:

let tic() =

stopWatch.Reset()

stopWatch.Start()

let toc(msg:string) =

stopWatch.Stop()

printfn "%s %f ms" msg stopWatch.Elapsed.TotalMilliseconds

stopWatch.Reset()

stopWatch.Start()

#include <time.h>This implementation enables us to use a sequence of

time_t stopWatch;

void tic()

{

stopWatch = clock();

}

void toc(const char *msg)

{

double s = (clock() - stopWatch) * 1000.0 / CLOCKS_PER_SEC;

printf("%s: %.2lf ms\n", msg, s);

stopWatch = clock();

}

**toc**s to

*profile a code block line by line*without the need to write

**tic**every time. However this simple profiler can not be nested as there is only a global timer.

After putting several **toc**s into my eigCov:

tic()It gets the following output:

let colm = colMean faces

toc("mean face")

let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])

toc("minus mean face")

let L = A.Transpose * A

toc("get L")

let val1, vec1 = eig L

toc("lapack eig")

let v = val1 * (1.0 / float (ncol - 1))

let u = A * vec1.Transpose

toc("get u")

// normalize eigen vectors

let mutable cnt = 0

for i=0 to ncol-1 do

if abs v.[i] < 1e-8 then

u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0

else

cnt <- cnt + 1

let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())

u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )

toc("normolize u")

mean face 639.891700 ms

minus mean face 195.484600 ms

get L 12885.751500 ms

lapack eig 247.015700 ms

get u 7169.749100 ms

normolize u 623.919700 ms

The matrix multiplication in “get L” is 280x10304 multiplies 10304x280, which is quite large!

## Matrix Multiplication

Matrix multiplication is a great example to show that why we need well tuned libraries to perform numerical computations.Table: the time cost for self-made(not optimized) implementations

implementations | operator * for matrix type | 3 F# loops(use matrix) | 3 F# loops(use float Array2D) | 3 C loops (native) |

time(seconds) | 13.07 | 38.81 | 13.52 | 8.10 |

for (i=0; i<N; i++)The two F# programs are similar to the above one in C. We can find that if we use a lot of F# matrix element access operator [,], the performance is very poor. The * operator of matrix type does not have any optimization in it, so it costs the same amount of time as 3-loops (Array2D) does. Native code is faster than the managed ones as it does not do boundary checking for index.

for (j=0; j<N; j++) {

double s = 0;

for (k=0; k<M; k++)

s += a[i][k] * b[k][j];

c[i][j] = s;

}

Matlab, R and Numpy/Python all wrap optimized matrix implementations. Let us next see how they perform:

Table: the time cost for optimized implementationsSoftware | Matlab 2009a | R 2.10.1 | Numpy 1.3.0 |

time | 0.25 | 1.98 | 0.35 |

We can see the difference between optimized versions and non-optimized ones is very large. A similar report from a .Net math software vender is here. The conclusion is that **we should use optimized matrix multiplication procedure when matrices are large**.

## Add optimized matrix multiplication support to math-provider

The math-provider has made an wrapper for dgemm_, which is a BLAS procedure for computing:C :=This is general case of matrix multiplication. The math-provider also has an wrapper for dgemv_, matrix-vector multiplication.

alpha*op( A )*op( B ) + beta*C

where op (A) = A or A’.

However, they are not directly callable from Lapack service provider.

add the following code to linear_algebra_service.fs:let MM a b =

Service().dgemm_(a,b)

and add the following to linear_algebra.fs:

let MM A B =and add the following to linear_algebra.fsi:

if HaveService() then

LinearAlgebraService.MM A B

else

A * B

/// BLAS matrix multiplication

val MM: A:matrix -> B:matrix -> matrixNow we define a local name for it:

`let mm = LinearAlgebra.MM `

Using BLAS matrix multiplication routine costs about **4 seconds**, which is faster than native C’s performance(8 seconds), but worse than Matlab’s or Numpy’s. The reason is that the BLAS(Netlib BLAS 3.1.1) is not optimized for my platform. If I use ATLAS for BLAS and use an Intel Fortran Compiler, the performance will be close to Numpy’s or Matlab’s.

By using BLAS’s matrix multiplication routine for the face recognition, the total running reduces from 25 seconds to 9 seconds.

## Attachment

wait.

I just did a test on Java. It costs 24 seconds on average to perform A*A', where A is a 280*10304 random matrix.

ReplyDelete