I recently noticed that the same code had vastly different run times on the same machine in different virtual environments. That looked a little suspicious to me. Digging deeper, I realized that numpy was linked against different BLAS libraries in the different virtual environments, which had a tremendous effect on performance. It turns out that taking a closer look at which implementation is actually used under the hood on your machine, and possibly compiling an implementation that is tailored to your very system, can yield significant speed–ups almost for free, without changing a single line of your code.

BLAS & LAPACK

BLAS is the low–level part of your system that is responsible for efficiently performing numerical linear algebra, i.e. all the heavy number crunching. More precisely:

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits.

Together with something like LAPACK, it is the cornerstone of today's scientific software packages.

LAPACK (Linear Algebra Package) is a standard software library for numerical linear algebra. It provides routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition.
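To make this a bit more tangible: numpy and scipy call these routines behind the scenes, but scipy also exposes the raw BLAS and LAPACK routines directly. Here is a minimal sketch (assuming a scipy installation) that calls one routine from each layer:

import numpy as np
from scipy.linalg import blas, lapack

x = np.random.random(1000)
A = np.random.random((100, 100))
B = np.random.random((100, 100))

# BLAS level 1: dot product of two vectors (ddot)
d = blas.ddot(x, x)

# BLAS level 3: general matrix-matrix multiplication C = alpha * AB (dgemm)
C = blas.dgemm(alpha=1.0, a=A, b=B)

# LAPACK: Cholesky factorization of a symmetric positive definite matrix (dpotrf)
spd = np.dot(A, A.T) + 100 * np.eye(100)  # construct a matrix that is guaranteed SPD
L, info = lapack.dpotrf(spd, lower=1)     # info == 0 signals success

In everyday code you would of course just call np.dot or np.linalg.cholesky and let numpy dispatch to whichever implementation it is linked against.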

Now, as mentioned above, BLAS & LAPACK merely define a standard interface; what really matters is the implementation you choose. There are many different implementations under different licenses, which all have their advantages and disadvantages, and it can sometimes be difficult to choose the most beneficial one for your machine.

Different Implementations

Default BLAS & LAPACK

By default, you will most likely only have the standard BLAS & LAPACK implementation on your system (e.g. the Arch Linux packages blas/cblas and lapack). While this works out of the box and is certainly sufficient if you don't care about performance, it has considerable shortcomings compared to more recent implementations, e.g. it does not fully support multi–core computations. We will see how its overall performance compares to the other implementations below.

Automatically Tuned Linear Algebra Software (ATLAS)

The main purpose of the ATLAS project is to provide a portable high–performance implementation for many different platforms. To achieve this, they use a complex mechanism to automatically determine the optimal parameters for the respective hardware during compilation. ATLAS has become quite popular in recent years, since it uses the BSD license and is platform independent. When compiled properly, it probably provides a reasonably good BLAS library for most performance–critical applications on most platforms.

On Arch Linux you can use the atlas-lapack package from the repository, which will download and compile ATLAS on your machine. However, before starting the compilation, make sure to disable CPU throttling in your OS, since it will mess up the timings when ATLAS benchmarks itself during compilation. To do this, I had to run sudo cpupower frequency-set -g performance on my machine. Additionally, I had to disable intel_pstate, which actually required me to reboot my system. After this, the compilation process ran smoothly and ATLAS determined the optimal settings for my machine, which ended up taking a couple of hours.

Intel Math Kernel Library (Intel MKL)

The Intel Math Kernel Library (Intel MKL) is a proprietary library that is hand–optimized specifically for Intel processors. As of February 2017, it is free of charge for some use cases. You can read more about the Community Licensing here.

If you are using Arch Linux you can obtain it by installing the intel-mkl package; otherwise, search your repository or download it directly from Intel's developer page. The installation process is straightforward: just follow the instructions provided.

OpenBLAS

OpenBLAS is another popular open–source implementation, based on a fork of GotoBLAS2. Its FAQ claims that OpenBLAS achieves performance comparable to Intel MKL on Intel's Sandy Bridge CPUs. Furthermore, OpenBLAS is well known for its multi-threading features and apparently scales very nicely with the number of cores.

Obtain a copy from openblas.net or GitHub and follow the installation instructions, which will walk you through the compilation and installation process.

Time for Benchmarks

A reliable comparison can only be based on hard numbers. For this purpose, I drafted a small Python script, based on an existing one that I found in this Stack Overflow thread. Here it is:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from __future__ import print_function

import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((size // 2, size // 4))
F = np.random.random((size // 2, size // 2))
F = np.dot(F, F.T)  # make F symmetric positive definite for the Cholesky factorization
G = np.random.random((size // 2, size // 2))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

The script performs matrix–matrix multiplication, vector–vector multiplication, singular value decomposition (SVD), Cholesky factorization and eigendecomposition, and averages the timing results over multiple runs (the choice of operations and sizes is of course somewhat arbitrary). This is how you can find out which BLAS implementation numpy is using under the hood:

import numpy as np
np.__config__.show()

You can run the script successively using different BLAS implementations, either in different virtual environments or with a different BLAS configuration in numpy's site.cfg (see the sketch below). For a fair comparison, you should set your CPU to a constant clock frequency and not perform any other heavy lifting while the script is running, especially when it makes use of multiple cores on your machine.
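For reference, if you build numpy from source, a site.cfg stanza for, say, an OpenBLAS build might look roughly like the following sketch; the paths are assumptions for an installation under /opt/OpenBLAS and will differ on your system:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib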

Let's find out how the different implementations compare in terms of performance on my laptop!

Default BLAS & LAPACK

Dotted two 4096x4096 matrices in 64.22 s.
Dotted two vectors of length 524288 in 0.80 ms.
SVD of a 2048x1024 matrix in 10.31 s.
Cholesky decomposition of a 2048x2048 matrix in 6.74 s.
Eigendecomposition of a 2048x2048 matrix in 53.77 s.

ATLAS

Dotted two 4096x4096 matrices in 3.46 s.
Dotted two vectors of length 524288 in 0.73 ms.
SVD of a 2048x1024 matrix in 2.02 s.
Cholesky decomposition of a 2048x2048 matrix in 0.51 s.
Eigendecomposition of a 2048x2048 matrix in 29.90 s.

Intel MKL

Dotted two 4096x4096 matrices in 2.44 s.
Dotted two vectors of length 524288 in 0.75 ms.
SVD of a 2048x1024 matrix in 1.34 s.
Cholesky decomposition of a 2048x2048 matrix in 0.40 s.
Eigendecomposition of a 2048x2048 matrix in 10.07 s.

OpenBLAS

Dotted two 4096x4096 matrices in 3.97 s.
Dotted two vectors of length 524288 in 0.74 ms.
SVD of a 2048x1024 matrix in 1.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.46 s.
Eigendecomposition of a 2048x2048 matrix in 32.95 s.

Today's Lesson

We can see that there are tremendous differences in run time. Unsurprisingly, the default implementation is the slowest in most regards (by far!). Intel's Math Kernel Library (Intel MKL) performs best, which is again not truly surprising, given that I use an Intel i5 processor in my machine. ATLAS and OpenBLAS are in the same ballpark and do reasonably well. All implementations except the default one make use of all cores on my machine, which can explain some, but not all, of the performance gains. You can control the number of threads used by setting the respective environment variables (e.g. OMP_NUM_THREADS, see the snippet below).
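For instance, a quick way to experiment with the thread count is to set these variables before numpy (and with it the BLAS library) is loaded; which variable is honored depends on the implementation, so the following sketch simply sets the common ones:

import os

# These must be set before numpy is imported, since the BLAS library
# typically reads them only once, when it is loaded.
os.environ['OMP_NUM_THREADS'] = '4'       # respected by most OpenMP-based builds
os.environ['OPENBLAS_NUM_THREADS'] = '4'  # OpenBLAS-specific override
os.environ['MKL_NUM_THREADS'] = '4'       # Intel MKL-specific override

import numpy as np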

The difference between the slowest and the fastest implementation is quite remarkable. If the low–level math computation is actually your bottleneck, and you are not already making use of all cores on your machine, you can expect speed–ups of up to a factor of roughly \(25\) (in my case \(64.22\,\mathrm{s} / 2.44\,\mathrm{s} \approx 26\) for the matrix multiplication) or something within that order of magnitude. This is roughly equivalent to getting the run time down from a full nightly run, i.e. from leaving the office in the afternoon to the first cup of coffee in the morning, to only half an hour.

Further Thoughts

  • Always keep an eye on which BLAS library numpy is actually using on your machine. Use np.__config__.show() to find out.
  • By the way, it also matters which BLAS implementation scipy is linked against (see the snippet after this list).
  • If you care about performance, whatever BLAS library you decide to use in the end should be tailored to your machine's very hardware. If you don't mind the trouble, go ahead and compile it yourself.
  • As a rule of thumb, if your machine has a reasonably modern Intel processor, you're probably best off using Intel MKL.
  • It seems that, once again, the Anaconda distribution saves your ass by providing sane defaults, i.e. shipping with Intel MKL by default in my case. If you've obtained numpy through conda, you are probably doing fine.
  • There are further libraries, such as Eigen or Boost, that are worth a look. If you have an AMD processor, take a look at ACML.
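As a quick check for the first two points, something along these lines shows which implementation each package is linked against:

import numpy as np
import scipy

np.__config__.show()     # the BLAS/LAPACK numpy was built against
scipy.__config__.show()  # scipy may well be linked against a different one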

Resources