The following graphs compare naïve and
(non-ATLAS) BLAS implementations of matrix multiplication in C. Both were run on the same Intel
Core Duo Linux machine at 1.66 GHz.
The computation is $C = AB$, where
$A$, $B$, and $C$ are all full matrices in
$\mathbb{R}^{n \times n}$. For the smaller matrices, at least one second was
spent repeating the multiplication to get a higher-precision result.
Code was compiled with gcc 4.3.1 and options -O2 -funroll-loops. No performance improvement was observed with -O3. (See computation speeds for some reference measurements on the same computer.)
/* Naive triple loop over column-major n-by-n matrices: C = A*B */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        C[i + j*n] = 0;
        for (k = 0; k < n; k++)
            C[i + j*n] += A[i + k*n] * B[k + n*j];
    }
/* Fortran BLAS call computing C = 1.0*A*B + 0.0*C, where charN = 'N'
   (no transpose), one_d = 1.0 and zero_d = 0.0 */
dgemm_(&charN, &charN, &n, &n, &n, &one_d, A, &n, B, &n, &zero_d, C, &n);
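For concreteness, here is a minimal, self-contained sketch of how both snippets could be driven. The dgemm_ prototype, the time_mult() helper, clock()-based timing and the default size of 500 are illustrative assumptions rather than the exact harness behind the graphs; in particular, some BLAS builds expect different integer widths for the dimension arguments, or extra hidden string-length arguments after the character parameters.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed Fortran-77 BLAS prototype: column-major data, every argument
   passed by pointer. */
void dgemm_(const char *transa, const char *transb, const int *m, const int *n,
            const int *k, const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb, const double *beta,
            double *c, const int *ldc);

static int n;                 /* matrix size */
static double *A, *B, *C;     /* n-by-n column-major matrices */
static char charN = 'N';
static double one_d = 1.0, zero_d = 0.0;

/* Naive triple loop, as in the first snippet above. */
static void naive(void)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            C[i + j*n] = 0;
            for (k = 0; k < n; k++)
                C[i + j*n] += A[i + k*n] * B[k + n*j];
        }
}

/* BLAS call, as in the second snippet above. */
static void blas(void)
{
    dgemm_(&charN, &charN, &n, &n, &n, &one_d, A, &n, B, &n, &zero_d, C, &n);
}

/* Repeat mult() until at least one second has elapsed; return seconds per call. */
static double time_mult(void (*mult)(void))
{
    int reps = 0;
    clock_t start = clock(), elapsed;
    do {
        mult();
        reps++;
        elapsed = clock() - start;
    } while (elapsed < CLOCKS_PER_SEC);
    return (double)elapsed / CLOCKS_PER_SEC / reps;
}

int main(int argc, char **argv)
{
    int i;
    n = (argc > 1) ? atoi(argv[1]) : 500;
    A = malloc(n*n * sizeof *A);
    B = malloc(n*n * sizeof *B);
    C = malloc(n*n * sizeof *C);
    for (i = 0; i < n*n; i++) {
        A[i] = rand() / (double)RAND_MAX;
        B[i] = rand() / (double)RAND_MAX;
    }
    printf("n = %d: naive %g s, BLAS %g s per multiply\n",
           n, time_mult(naive), time_mult(blas));
    free(A); free(B); free(C);
    return 0;
}

With a system BLAS installed this could be built with something like gcc -O2 -funroll-loops bench.c -lblas, although the library name and link flags vary between systems.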
For comparison, a simple multiply-add was timed with vectors of length 200 (i.e.,
using level 2 cache only). This reference time was scaled by $n^3$, the number of
multiply-adds in a naive $n \times n$ multiplication, and plotted
on the same chart below. Apparently BLAS even schedules instructions better
than would be possible by just repeating this primitive operation.
z[i] += x[i] * y[i];
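A rough sketch of this reference measurement follows, using the same clock()-based repeat-for-at-least-one-second timing as above; the size n = 1000 and the $n^3$ scaling (the multiply-add count of a naive $n \times n$ multiplication) are illustrative assumptions, not the exact procedure behind the plotted curve.

#include <stdio.h>
#include <time.h>

#define LEN 200   /* short vectors, so the data stays in cache */

int main(void)
{
    double x[LEN], y[LEN], z[LEN];
    double per_madd;
    int i, n = 1000, reps = 0;
    clock_t start, elapsed;

    for (i = 0; i < LEN; i++) { x[i] = 1.0; y[i] = 2.0; z[i] = 0.0; }

    /* Time the multiply-add loop, repeating for at least one second. */
    start = clock();
    do {
        for (i = 0; i < LEN; i++)
            z[i] += x[i] * y[i];
        reps++;
        elapsed = clock() - start;
    } while (elapsed < CLOCKS_PER_SEC);

    /* Mean time for one multiply-add ... */
    per_madd = (double)elapsed / CLOCKS_PER_SEC / reps / LEN;

    /* ... scaled by the n^3 multiply-adds a naive n-by-n multiply performs.
       Printing z[0] stops the compiler optimising the timed loop away. */
    printf("estimated floor for n = %d: %g s (z[0] = %g)\n",
           n, per_madd * (double)n * n * n, z[0]);
    return 0;
}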
(All three graphs show the same data: the first and second use (different) logarithmic scales on the y-axis, while the third has logarithmic scales on both the x- and the y-axes.)