We will use the face recognition program we developed in Part III to illustrate how to profile programs.
The performance problem of the face recognition program is that it takes about 25 seconds to process 280 faces (each face has 10304=112*92 pixels). I want to find which part takes most of the time. Is it due to the eigen decomposition?
Profile
Profiling means knowing how much time and memory every part of your program take. Good profiling tools could give a detail report on the running behavior of a program. In .Net, there are several good tools for this purpose. Some of them are free.
However, these tools are too big to use in a explorative/interactive programming style. Following Matlab’s tic and toc, I have written them for F#:
let stopWatch = new System.Diagnostics.Stopwatch()I also use the same style for my C programs:
let tic() =
stopWatch.Reset()
stopWatch.Start()
let toc(msg:string) =
stopWatch.Stop()
printfn "%s %f ms" msg stopWatch.Elapsed.TotalMilliseconds
stopWatch.Reset()
stopWatch.Start()
#include <time.h>This implementation enables us to use a sequence of tocs to profile a code block line by line without the need to write tic every time. However this simple profiler can not be nested as there is only a global timer.
time_t stopWatch;
void tic()
{
stopWatch = clock();
}
void toc(const char *msg)
{
double s = (clock() - stopWatch) * 1000.0 / CLOCKS_PER_SEC;
printf("%s: %.2lf ms\n", msg, s);
stopWatch = clock();
}
After putting several tocs into my eigCov:
tic()It gets the following output:
let colm = colMean faces
toc("mean face")
let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])
toc("minus mean face")
let L = A.Transpose * A
toc("get L")
let val1, vec1 = eig L
toc("lapack eig")
let v = val1 * (1.0 / float (ncol - 1))
let u = A * vec1.Transpose
toc("get u")
// normalize eigen vectors
let mutable cnt = 0
for i=0 to ncol-1 do
if abs v.[i] < 1e-8 then
u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0
else
cnt <- cnt + 1
let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())
u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )
toc("normolize u")
mean face 639.891700 ms
minus mean face 195.484600 ms
get L 12885.751500 ms
lapack eig 247.015700 ms
get u 7169.749100 ms
normolize u 623.919700 ms
The matrix multiplication in “get L” is 280x10304 multiplies 10304x280, which is quite large!
Matrix Multiplication
Matrix multiplication is a great example to show that why we need well tuned libraries to perform numerical computations.Table: the time cost for self-made(not optimized) implementations
implementations | operator * for matrix type | 3 F# loops(use matrix) | 3 F# loops(use float Array2D) | 3 C loops (native) |
time(seconds) | 13.07 | 38.81 | 13.52 | 8.10 |
for (i=0; i<N; i++)The two F# programs are similar to the above one in C. We can find that if we use a lot of F# matrix element access operator [,], the performance is very poor. The * operator of matrix type does not have any optimization in it, so it costs the same amount of time as 3-loops (Array2D) does. Native code is faster than the managed ones as it does not do boundary checking for index.
for (j=0; j<N; j++) {
double s = 0;
for (k=0; k<M; k++)
s += a[i][k] * b[k][j];
c[i][j] = s;
}
Matlab, R and Numpy/Python all wrap optimized matrix implementations. Let us next see how they perform:
Table: the time cost for optimized implementationsSoftware | Matlab 2009a | R 2.10.1 | Numpy 1.3.0 |
time | 0.25 | 1.98 | 0.35 |
We can see the difference between optimized versions and non-optimized ones is very large. A similar report from a .Net math software vender is here. The conclusion is that we should use optimized matrix multiplication procedure when matrices are large.
Add optimized matrix multiplication support to math-provider
The math-provider has made an wrapper for dgemm_, which is a BLAS procedure for computing:C :=This is general case of matrix multiplication. The math-provider also has an wrapper for dgemv_, matrix-vector multiplication.
alpha*op( A )*op( B ) + beta*C
where op (A) = A or A’.
However, they are not directly callable from Lapack service provider.
add the following code to linear_algebra_service.fs:let MM a b =
Service().dgemm_(a,b)
and add the following to linear_algebra.fs:
let MM A B =and add the following to linear_algebra.fsi:
if HaveService() then
LinearAlgebraService.MM A B
else
A * B
/// BLAS matrix multiplication
val MM: A:matrix -> B:matrix -> matrixNow we define a local name for it:
let mm = LinearAlgebra.MM
Using BLAS matrix multiplication routine costs about 4 seconds, which is faster than native C’s performance(8 seconds), but worse than Matlab’s or Numpy’s. The reason is that the BLAS(Netlib BLAS 3.1.1) is not optimized for my platform. If I use ATLAS for BLAS and use an Intel Fortran Compiler, the performance will be close to Numpy’s or Matlab’s.By using BLAS’s matrix multiplication routine for the face recognition, the total running reduces from 25 seconds to 9 seconds.
Attachment
wait.
I just did a test on Java. It costs 24 seconds on average to perform A*A', where A is a 280*10304 random matrix.
ReplyDeleteThis blog post provides valuable insights into profiling techniques and their importance for optimizing code.
ReplyDelete