Dr. Yves J. Hilpisch
The Python Quants GmbH
July 2013
Numba is an open-source, NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code, especially for use with the NumPy run-time and SciPy modules.
A simple example illustrates how Numba works. We start with a heavily looping function.
import time
from math import sin, pi

def f(n):
    t0 = time.time()
    result = 0.0
    for i in range(n):
        for j in range(n * i):
            result += sin(pi / 2)
    return int(result), time.time() - t0
This function is executed to measure pure Python speed.
n = 500
res_py = f(n)
print "Number of Loops %8d" % res_py[0]
print "Time in Sec for Python %8.3f" % res_py[1]
Number of Loops 62375000
Time in Sec for Python  161.194
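As a quick sanity check, the reported loop count can be derived analytically: the inner loop runs n * i times for i = 0, ..., n - 1, so the total number of additions is n² (n - 1) / 2. A minimal sketch:

```python
# total iterations of the inner loop for n = 500:
# sum over i of n * i = n * (0 + 1 + ... + (n - 1)) = n**2 * (n - 1) // 2
n = 500
loops = sum(n * i for i in range(n))
print(loops)  # 62375000, matching the output above
```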
We then compile the function just-in-time with Numba. Just a single line of code.
import numba as nb
f_nb = nb.autojit(f)
Get execution speed.
res_nb = f_nb(n)
print "Number of Loops %8d" % res_nb[0]
print "Time in Sec for Python %8.3f" % res_nb[1]
Number of Loops 62375000
Time in Sec for Python    0.080
The speed-up is considerable – especially given the marginal additional effort.
print "Number of Loops %8d" % res_py[0]
print "Speed-up of Numba %8d" % (res_py[1] / res_nb[1])
Number of Loops 62375000
Speed-up of Numba     2004
The simulation of the movement of N bodies in relation to each other, where N is a large number and a body can be a star/planet or a small particle, is an important problem in physics.
The basic simulation algorithm involves a number of iterations ("loops") since one has to determine the force that every single body exerts on all the other bodies in the system to be simulated (e.g. a galaxy or a system of atomic particles).
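The nested iteration means that each of the N bodies interacts with the N - 1 others, i.e. N(N - 1) force evaluations per time step. A quick illustration (the helper name is ours, not from the algorithm itself):

```python
# number of pairwise force evaluations per time step for N bodies:
# every body is paired with every other body (i != j)
def force_evaluations(N):
    return N * (N - 1)

print(force_evaluations(1000))  # 999000 evaluations per step
```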
The algorithm lends itself pretty well to parallelization. However, the example code that follows does not use any parallelization techniques. It rather illustrates what speed-ups are achievable with Numba alone.
A description and discussion of the algorithm is found in the recent article Vladimirov, Andrey and Vadim Karpusenko (2013): "Test-Driving Intel Xeon-Phi Coprocessors with a Basic N-Body Simulation." White Paper, Colfax International.
We first use pure Python to implement the algorithm. We start by generating random vectors of particle locations and a zero vector for the speed of each particle.
import random
nParticles = 1000
particle = []
for i in range(nParticles):
particle.append([random.gauss(0.0, 1.0) for j in range(3)])
particlev = [[0, 0, 0] for x in particle]
Next, we define the simulation function for the movements of each particle given their gravitational interaction.
import time
from math import sqrt

def nbody(particle, particlev):  # lists as input
    t0 = time.time(); nSteps = 5; dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j][0] - particle[i][0]
                    dy = particle[j][1] - particle[i][1]
                    dz = particle[j][2] - particle[i][2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
            particlev[i][0] += dt * Fx
            particlev[i][1] += dt * Fy
            particlev[i][2] += dt * Fz
        for i in range(nParticles):
            particle[i][0] += particlev[i][0] * dt
            particle[i][1] += particlev[i][1] * dt
            particle[i][2] += particlev[i][2] * dt
    return time.time() - t0
We then execute the function to measure execution speed.
%time ti_py = nbody(particle, particlev)
CPU times: user 1min 9s, sys: 8 ms, total: 1min 9s
Wall time: 1min 10s
The basic algorithm allows for some vectorization of operations by using NumPy.
import numpy as np
particle = np.random.standard_normal((nParticles, 3))
particlev = np.zeros_like(particle)
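The vectorization in the function below relies on NumPy broadcasting: subtracting a single particle's position (shape (3,)) from the whole position array (shape (nParticles, 3)) yields all difference vectors in one operation. A minimal sketch with a toy array:

```python
import numpy as np

# three particles in 3D space
pos = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])

# broadcasting: difference of every particle's position to particle 1
dp = pos - pos[1]

# squared distances follow as row-wise sums, as in the function below
drSquared = np.sum(dp ** 2, axis=1)
print(drSquared)  # [1. 0. 5.]; entry 1 is zero (particle vs. itself)
```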
Here is the somewhat shorter NumPy function.
def nbody_np(particle, particlev):  # NumPy arrays as input
    t0 = time.time(); nSteps = 5; dt = 0.01
    for step in range(1, nSteps + 1, 1):
        Fp = np.zeros((nParticles, 3))
        for i in range(nParticles):
            dp = particle - particle[i]
            drSquared = np.sum(dp ** 2, axis=1)
            h = drSquared + np.sqrt(drSquared)
            # drPowerN32 = np.where(h > 0., 1. / h, 1E-10)
            drPowerN32 = 1. / np.maximum(h, 1E-10)
            Fp += -(dp.T * drPowerN32).T
        particlev += dt * Fp
        particle += particlev * dt
    return time.time() - t0
This function is again executed for speed measurement.
%time ti_np = nbody_np(particle, particlev)
CPU times: user 712 ms, sys: 2 ms, total: 714 ms
Wall time: 718 ms
The speed-up already is considerable.
ti_py / ti_np
97.68499579758402
Now, we implement a somewhat adjusted Python/NumPy function that uses NumPy array indexing.
from math import sqrt

def nbody(particle, particlev):  # NumPy arrays as input
    t0 = time.time(); nSteps = 5; dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j, 0] - particle[i, 0]
                    dy = particle[j, 1] - particle[i, 1]
                    dz = particle[j, 2] - particle[i, 2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
            particlev[i, 0] += dt * Fx
            particlev[i, 1] += dt * Fy
            particlev[i, 2] += dt * Fz
        for i in range(nParticles):
            particle[i, 0] += particlev[i, 0] * dt
            particle[i, 1] += particlev[i, 1] * dt
            particle[i, 2] += particlev[i, 2] * dt
    return time.time() - t0
This function is then compiled with Numba.
import numba as nb
nbody_nb = nb.autojit(nbody)
firstrun = nbody_nb(particle, particlev)  # first call triggers the compilation
This again is executed for speed comparisons.
%time ti_nb = nbody_nb(particle, particlev)
CPU times: user 103 ms, sys: 1 ms, total: 104 ms
Wall time: 105 ms
This again leads to a considerable speed-up versus NumPy.
ti_np / ti_nb # speed-up vs. NumPy function
6.858800002733168
However, the speed-up in comparison to the pure Python solution is even more impressive.
ti_py / ti_nb # speed-up vs. pure Python function
670.0018494434587
The Python Quants GmbH – the company Web site
Dr. Yves J. Hilpisch – my personal Web site
Derivatives Analytics with Python – my new book