
Dr. Yves J. Hilpisch
The Python Quants GmbH
PyData, New York City – 08. November 2013
Python for Performance Computing:
This tutorial focuses on
It does not address such important issues like
A fundamental Python stack for interactive data analytics and visualization should at least contain the following libraries tools:
It is best to use either the Python distribution Anaconda or the Web-based analytics environment Wakari. Both provide almost "complete" Python environments.
When it comes to performance critical operations, a number of additional libraries can be used to speed-up code execution.
All these are also part of Anaconda and Wakari.
We define a convenience function to compare the performance of a set of Python functions executed on the same data or on different data.
def perf_comp_data(func_list, data_list, rep=3, number=1):
from timeit import repeat
res_list = {}
for name in enumerate(func_list):
stmt = name[1] + '(' + data_list[name[0]] + ')'
setup = "from __main__ import " + name[1] + ', ' + data_list[name[0]]
results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
res_list[name[1]] = sum(results) / rep
res_sort = sorted(res_list.iteritems(), key=lambda (k, v): (v, k))
for item in res_sort:
rel = item[1] / res_sort[0][1]
print 'function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1] \
+ 'relative: %6.1f' % rel
A typical compute intensive operation is to evaluate a complex mathematical expression on a large array of data.
First, let us define an example function/numerical expression.
from math import *
def f(x):
return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)
Second, as our benchmark case a pure Python implementation.
I = 200000
a_py = range(I)
def f1(a):
res = []
for x in a:
res.append(f(x))
return res
One can also use different paradigms to implement the same function, like iterators or the eval function.
def f2(a):
return [f(x) for x in a]
def f3(a):
ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
return [eval(ex) for x in a]
Third, a vectorized implementations with NumPy.
import numpy as np
a_np = np.arange(I)
def f4(a):
return (np.abs(np.cos(a)) ** 0.5 +
np.sin(2 + 3 * a))
Fourth, we use numexpr to evaluate the numerical expression.
import numexpr as ne
def f5(a):
ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
ne.set_num_threads(1)
return ne.evaluate(ex)
def f6(a):
ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
ne.set_num_threads(4)
return ne.evaluate(ex)
Now, let's compare the performance of all different function implementations.
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np']
perf_comp_data(func_list, data_list, rep=1)
function: f6, av. time sec: 0.00627, relative: 1.0 function: f5, av. time sec: 0.01631, relative: 2.6 function: f4, av. time sec: 0.01861, relative: 3.0 function: f2, av. time sec: 0.15558, relative: 24.8 function: f1, av. time sec: 0.27338, relative: 43.6 function: f3, av. time sec: 6.27276, relative: 1000.1
Depending on the respective algorithm you implement with NumPy, memory layout can play a non-negligible role.
import numpy as np
np.zeros((3, 3), dtype=np.float64, order='C')
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
The way you initialize NumPy array can have a significant influence on the performance of operations on these arrays (given a certain size of the array).
Consider the C-like, i.e. row-wise, storage.
c = np.array([[ 1., 1., 1.],
[ 2., 2., 2.],
[ 3., 3., 3.]], order='C')
In this case, the 1s, the 2s and the 3s are stored next to each other.
By contrast, consider the Fortran-like, i.e. column-wise, storage.
f = np.array([[ 1., 1., 1.],
[ 2., 2., 2.],
[ 3., 3., 3.]], order='F')
Now, the data is stored in a way that 1, 2, 3 are next to each other for each column.
Let's see whether the memory layout makes a difference in some way when the array is large.
x = np.random.standard_normal((3, 1500000))
C = np.array(x, order='C')
F = np.array(x, order='F')
x = 0.0
Let's implement standard operations on the C-like layout array.
First, calculating sums.
%timeit C.sum(axis=0)
100 loops, best of 3: 11.1 ms per loop
%timeit C.sum(axis=1)
100 loops, best of 3: 5.93 ms per loop
Second, standard deviations.
%timeit C.std(axis=0)
10 loops, best of 3: 69.7 ms per loop
%timeit C.std(axis=1)
10 loops, best of 3: 31.6 ms per loop
Now the Fortran-like layout. Sums first.
%timeit F.sum(axis=0)
10 loops, best of 3: 35.6 ms per loop
%timeit F.sum(axis=1)
10 loops, best of 3: 26.4 ms per loop
Standard deviations second.
%timeit F.std(axis=0)
10 loops, best of 3: 117 ms per loop
%timeit F.std(axis=1)
10 loops, best of 3: 70.8 ms per loop
C = 0.0; F = 0.0
Currently, and in the foreseeable future as well, most corporate data is stored in SQL databases. We take this as a starting point and generate a SQL table with dummy data.
filename = 'numbers'
import sqlite3 as sq3 # imports the SQLite library
# Remove files if they exist
import os
try:
os.remove(filename + '.db')
os.remove(filename + '.h5')
os.remove(filename + '.hs')
os.remove(filename + '.csv')
except:
pass
We create a table and populate the table with the dummy data.
query = 'CREATE TABLE numbs (no1 real, no2 real, no3 real, no4 real, no5 real)'
con = sq3.connect(filename + '.db')
con.execute(query)
con.commit()
# Writing Random Numbers to SQLite3 Database
rows = 5000000 # 5,000,000 rows for the SQL table (ca. 250 MB in size)
nr = np.random.standard_normal((rows, 5))
%time con.executemany('INSERT INTO numbs VALUES(?, ?, ?, ?, ?)', nr)
con.commit()
CPU times: user 1min 2s, sys: 962 ms, total: 1min 3s Wall time: 1min 3s
We can now read the data into a NumPy array.
# Reading a couple of rows from SQLite3 database table
array(con.execute('SELECT * FROM numbs').fetchmany(3)).round(3)
array([[ 0.149, -0.874, 0.736, -0.551, -1.399],
[ 1.299, -0.281, -1.416, 0.995, 0.337],
[ 0.87 , -1.057, 0.544, -1.416, -2.613]])
# Reading the whole table into an NumPy array
%time result = array(con.execute('SELECT * FROM numbs').fetchall())
CPU times: user 39.3 s, sys: 1.83 s, total: 41.1 s Wall time: 41.6 s
PyTables, which is based on the HDF5 file format, allows much faster, disk-based I/O than a typical SQL database. For example, writing the same data is much faster compared to SQLite3.
import tables as tb
%%time
h5 = tb.open_file('numbers.h5', 'w')
h5.create_array('/', 'numbs', obj=result, title='numbs')
h5.close()
CPU times: user 3.89 ms, sys: 209 ms, total: 213 ms Wall time: 370 ms
nr = 0.0; result = 0.0 # memory clean-up
Reading speed is equally impressive.
%%time
h5 = tb.open_file('numbers.h5', 'r')
arr = h5.root.numbs.read()
h5.close()
CPU times: user 2.6 ms, sys: 167 ms, total: 170 ms Wall time: 158 ms
arr.nbytes
200000000
arr = 0.0 # memory clean-up
pandas is optimized both for convenience and performance when it comes to analytics and I/O operations.
import pandas as pd
import pandas.io.sql as pds
%time data = pds.read_frame('SELECT * FROM numbs', con)
con.close()
CPU times: user 18.3 s, sys: 1.51 s, total: 19.8 s Wall time: 20.2 s
data.head()
| no1 | no2 | no3 | no4 | no5 | |
|---|---|---|---|---|---|
| 0 | 0.149151 | -0.873864 | 0.736056 | -0.550805 | -1.399441 |
| 1 | 1.298855 | -0.281183 | -1.415695 | 0.995101 | 0.337464 |
| 2 | 0.870333 | -1.056957 | 0.543588 | -1.416220 | -2.613365 |
| 3 | 0.509911 | 0.139642 | 0.152018 | 0.197079 | 0.292907 |
| 4 | -1.048647 | -1.113482 | -0.037894 | 1.216061 | -0.263849 |
The plotting of larger data sets often is a problem. This can easily be accomplished here by only plotting every 500th data row, for instance.
%time data[::500].cumsum().plot(grid=True, legend=True) # every 500th row is plotted
CPU times: user 106 ms, sys: 5.53 ms, total: 111 ms Wall time: 119 ms
<matplotlib.axes.AxesSubplot at 0x107cae990>
There are some reasons why you may want to write the DataFrame to disk:
pandas is tightly integrated with PyTables. It makes data storage and I/O operations with pandas not only convenient but quite fast as well.
%%time
h5 = pd.HDFStore(filename + '.hs', 'w')
h5['data'] = data # writes the DataFrame to HDF5 database
h5.close()
CPU times: user 263 ms, sys: 374 ms, total: 638 ms Wall time: 793 ms
Now, we can read the data from the HDF5 file into another DataFrame object. Performance of such an operation generally is quite high.
%%time
h5 = pd.HDFStore(filename + '.hs', 'r')
data_new = h5['data']
h5.close()
CPU times: user 30 ms, sys: 276 ms, total: 306 ms Wall time: 305 ms
%time data.sum() - data_new.sum() # checking if the DataFrames are the same
CPU times: user 750 ms, sys: 231 ms, total: 981 ms Wall time: 976 ms
no1 0 no2 0 no3 0 no4 0 no5 0 dtype: float64
data = 0.0; data_new = 0.0 # memory clean-up
Some final checks.
ll numb*
-rw-r--r-- 1 yhilpisch staff 272386048 8 Nov 14:26 numbers.db -rw-rw-rw- 1 yhilpisch staff 200002304 8 Nov 14:26 numbers.h5 -rw-rw-rw- 1 yhilpisch staff 240007368 8 Nov 14:27 numbers.hs
try:
os.remove(filename + '.db')
os.remove(filename + '.h5')
os.remove(filename + '.hs')
os.remove(filename + '.csv')
except:
pass
The algorithm we (arbitrarily) choose is the Monte Carlo estimator for a European option value in the Black-Scholes-Merton set-up. In this set-up the underlying follows the stochastic differential equation:
\[dS_t = r S_t dt + \sigma S_t dZ_t\]
The Monte Carlo estimator for a call option value looks like follows:
\(C_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} \max[S_T(i) - K, 0]\)
where \(S_T(i)\) is the \(i\)-th simulated value of the underlying at maturity \(T\).
A dynamic implementation of the BSM Monte Carlo valuation could look like follows.
def BSM_MCS(S0):
''' Dynamic Black-Scholes-Merton MCS Estimator for European Calls. '''
import numpy as np
K = 100.; T = 1.0; r = 0.05; vola = 0.2
M = 50; I = 20000
dt = T / M
rand = np.random.standard_normal((M + 1, I))
S = np.zeros((M + 1, I)); S[0] = S0
for t in range(1, M + 1):
S[t] = S[t-1] * np.exp((r - 0.5 * vola ** 2) * dt + vola * np.sqrt(dt) * rand[t])
BSM_MCS_Value = np.sum(np.maximum(S[-1] - K, 0)) / I
return BSM_MCS_Value
The benchmark case of sequential valuation of \(n\) options.
def seqValue(n):
optionValues = []
for S in linspace(80, 100, n):
optionValues.append(BSM_MCS(S))
return optionValues
First, start a (local) cluster.
# in the shell ...
####
# ipcluster start -n 4
####
# or maybe more cores
Second, enable parallel computing capabilities.
from IPython.parallel import Client
c = Client(profile="default")
view = c.load_balanced_view()
The function for the parallel execution of a number \(n\) of different options valuations.
def parValue(n):
optionValues = []
for S in linspace(80, 100, n):
value = view.apply_async(BSM_MCS, S)
# asynchronously calculate the option values
optionValues.append(value)
c.wait(optionValues)
return optionValues
Let us compare performance of the two different approaches.
n = 20 # number of option valuations
func_list = ['seqValue', 'parValue']
data_list = 2 * ['n']
perf_comp_data(func_list, data_list, rep=3)
function: parValue, av. time sec: 0.75665, relative: 1.0 function: seqValue, av. time sec: 1.22566, relative: 1.6
Performance scales linearly with the number of CPU cores available.
Numba is an open-source, NumPy-aware optimizing compiler for Python sponsored by The Python Quants GmbH It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code especially for use in the NumPy run-time and SciPy modules.
Let us start with a problem that typically lead to performance issues in Python: alogrithms with a nested loops. A sandbox variant can illustrate the problem.
from math import cos, log
def f_py(I, J):
res = 0
for i in range(I):
for j in range (J):
res += int(cos(log(1)))
return res
I, J = 1000, 1000
%time f_py(I, J)
CPU times: user 491 ms, sys: 6.89 ms, total: 498 ms Wall time: 494 ms
1000000
In principle, this can be vectorized with NumPy arrays.
def f_np(I, J):
a = np.ones((I, J), dtype=np.float64)
return int(np.sum(np.cos(np.log(a)))), a
%time res, a = f_np(I, J)
CPU times: user 20.3 ms, sys: 6.44 ms, total: 26.7 ms Wall time: 25.8 ms
This is much faster, but not really memory efficient ...
a.nbytes
8000000
Numba can provide help.
import numba as nb
%time f_nb = nb.autojit(f_py)
CPU times: user 2.47 ms, sys: 657 µs, total: 3.13 ms Wall time: 2.73 ms
It preserves memory efficiency while speeding up execution considerably in this case.
%time f_nb(I, J)
DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.f_py.107dfe578'
@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)
define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
%nsteps2 = alloca i64
%target_temp1 = alloca i64
%nsteps = alloca i64
%target_temp = alloca i64
%return_value = alloca i32
store i64 0, i64* %target_temp, !tbaa !2
%0 = sext i32 %I to i64
store i64 %0, i64* %nsteps, !tbaa !3
br label %"for_condition_4:13"
cleanup_label: ; preds = %"exit_for_4:4", %error_label
%1 = load i32* %return_value
ret i32 %1
error_label: ; No predecessors!
store i32 123456789, i32* %return_value
br label %cleanup_label
"for_condition_4:13": ; preds = %entry, %"exit_for_5:8"
%res_2 = phi i32 [ 0, %entry ], [ %res_3, %"exit_for_5:8" ]
%j_1 = phi i64 [ 123456789, %entry ], [ %j_2, %"exit_for_5:8" ]
%2 = load i64* %target_temp, !tbaa !2
%3 = load i64* %nsteps, !tbaa !3
%4 = icmp slt i64 %2, %3
%5 = icmp ne i1 %4, false
br i1 %5, label %"loop_body_5:8", label %"exit_for_4:4"
"exit_for_4:4": ; preds = %"for_condition_4:13"
store i32 %res_2, i32* %return_value
br label %cleanup_label
"loop_body_5:8": ; preds = %"for_condition_4:13"
%6 = load i64* %target_temp, !tbaa !2
%7 = load i64* %target_temp, !tbaa !2
%8 = add i64 %7, 1
store i64 %8, i64* %target_temp, !tbaa !2
store i64 0, i64* %target_temp1, !tbaa !4
%9 = sext i32 %J to i64
store i64 %9, i64* %nsteps2, !tbaa !5
br label %"for_condition_5:17"
"for_condition_5:17": ; preds = %"loop_body_5:8", %"loop_body_6:19"
%res_3 = phi i32 [ %20, %"loop_body_6:19" ], [ %res_2, %"loop_body_5:8" ]
%j_2 = phi i64 [ %14, %"loop_body_6:19" ], [ %j_1, %"loop_body_5:8" ]
%10 = load i64* %target_temp1, !tbaa !4
%11 = load i64* %nsteps2, !tbaa !5
%12 = icmp slt i64 %10, %11
%13 = icmp ne i1 %12, false
br i1 %13, label %"loop_body_6:19", label %"exit_for_5:8"
"exit_for_5:8": ; preds = %"for_condition_5:17"
br label %"for_condition_4:13"
"loop_body_6:19": ; preds = %"for_condition_5:17"
%14 = load i64* %target_temp1, !tbaa !4
%15 = load i64* %target_temp1, !tbaa !4
%16 = add i64 %15, 1
store i64 %16, i64* %target_temp1, !tbaa !4
%17 = call double @"numba.math.['double'].log"(double 1.000000e+00)
%18 = call double @"numba.math.['double'].cos"(double %17)
%19 = fptosi double %18 to i32
%20 = add i32 %res_3, %19
br label %"for_condition_5:17"
}
declare { i64, i8* }* @Py_BuildValue(i8*, ...)
declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)
declare void @PyErr_Clear()
declare double @"numba.math.['double'].cos"(double)
declare double @"numba.math.['double'].log"(double)
!tbaa = !{!0, !1, !2, !3, !4, !5}
!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'
@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)
define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = bitcast { i64, i8* }* %obj to i64*
%1 = load i64* %0
%2 = add i64 %1, 1
store i64 %2, i64* %0
ret void
}
define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = bitcast { i64, i8* }* %obj to i64*
%1 = load i64* %0
%2 = icmp sgt i64 %1, 1
br i1 %2, label %if.then, label %if.else
if.then: ; preds = %decl
%3 = add i64 %1, -1
store i64 %3, i64* %0
br label %if.end
if.else: ; preds = %decl
call void @Py_DecRef({ i64, i8* }* %obj)
br label %if.end
if.end: ; preds = %if.else, %if.then
ret void
}
declare void @Py_DecRef({ i64, i8* }*)
define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = ptrtoint { i64, i8* }* %obj to i64
%1 = icmp ne i64 %0, 0
br i1 %1, label %if.then, label %if.end
if.then: ; preds = %decl
%2 = bitcast { i64, i8* }* %obj to i64*
%3 = load i64* %2
%4 = add i64 %3, 1
store i64 %4, i64* %2
br label %if.end
if.end: ; preds = %if.then, %decl
ret void
}
define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = ptrtoint { i64, i8* }* %obj to i64
%1 = icmp ne i64 %0, 0
br i1 %1, label %if.then, label %if.end
if.then: ; preds = %decl
call void @Py_DECREF({ i64, i8* }* %obj)
br label %if.end
if.end: ; preds = %if.then, %decl
ret void
}
define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
%data1 = alloca i8*
%in_shape2 = alloca i64*
%in_strides3 = alloca i64*
%src_dim4 = alloca i64
%index5 = alloca i64
%result = alloca i8*
store i8* %data, i8** %data1
store i64* %in_shape, i64** %in_shape2
store i64* %in_strides, i64** %in_strides3
store i64 %src_dim, i64* %src_dim4
store i64 %index, i64* %index5
%0 = load i64** %in_strides3
%1 = load i64* %src_dim4
%2 = getelementptr inbounds i64* %0, i64 %1
%3 = load i64* %2
%4 = mul i64 %3, %index
%5 = load i8** %data1
%6 = getelementptr inbounds i8* %5, i64 %4
store i8* %6, i8** %result
ret i8* %6
}
define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
%out_shape1 = alloca i64*
%out_strides2 = alloca i64*
%dst_dim3 = alloca i32
store i64* %out_shape, i64** %out_shape1
store i64* %out_strides, i64** %out_strides2
store i32 %dst_dim, i32* %dst_dim3
%0 = load i64** %out_shape1
%1 = getelementptr inbounds i64* %0, i32 %dst_dim
store i64 1, i64* %1
%2 = load i64** %out_strides2
%3 = load i32* %dst_dim3
%4 = getelementptr inbounds i64* %2, i32 %3
store i64 0, i64* %4
ret void
}
define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
%dst_shape1 = alloca i64*
%src_shape2 = alloca i64*
%src_strides3 = alloca i64*
%max_ndim4 = alloca i32
%ndim5 = alloca i32
%0 = alloca i32
store i64* %dst_shape, i64** %dst_shape1
store i64* %src_shape, i64** %src_shape2
store i64* %src_strides, i64** %src_strides3
store i32 %max_ndim, i32* %max_ndim4
store i32 %ndim, i32* %ndim5
%1 = load i32* %max_ndim4
%2 = sub i32 %1, %ndim
store i32 0, i32* %0
br label %loop.cond
loop.cond: ; preds = %if.end11, %decl
%3 = load i32* %0
%4 = load i32* %ndim5
%5 = icmp slt i32 %3, %4
br i1 %5, label %loop.body, label %loop.end
loop.body: ; preds = %loop.cond
%6 = load i64** %src_shape2
%7 = getelementptr inbounds i64* %6, i32 %3
%8 = add i32 %3, %2
%9 = load i64** %dst_shape1
%10 = getelementptr inbounds i64* %9, i32 %8
%11 = load i64* %7
%12 = icmp eq i64 %11, 1
br i1 %12, label %if.then, label %if.else
loop.end: ; preds = %if.else7, %loop.cond
%merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
ret i32 %merge
if.then: ; preds = %loop.body
%13 = load i64** %src_strides3
%14 = getelementptr inbounds i64* %13, i32 %3
store i64 0, i64* %14
br label %if.end11
if.else: ; preds = %loop.body
%15 = load i64* %10
%16 = icmp eq i64 %15, 1
br i1 %16, label %if.then6, label %if.else7
if.then6: ; preds = %if.else
store i64 %11, i64* %10
br label %if.end11
if.else7: ; preds = %if.else
%17 = icmp ne i64 %11, %15
br i1 %17, label %loop.end, label %if.end11
if.end11: ; preds = %if.else7, %if.then6, %if.then
%18 = load i32* %0
%19 = add i32 %18, 1
store i32 %19, i32* %0
br label %loop.cond
}
define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
%0 = icmp sgt i32 %I, 0
%1 = icmp sgt i32 %J, 0
%or.cond = and i1 %0, %1
br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"
"loop_body_5:8.lr.ph.split.us": ; preds = %entry
%2 = sext i32 %I to i64
%3 = sext i32 %J to i64
br label %"loop_body_6:19.lr.ph.us"
"loop_body_6:19.us": ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
%lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
%res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
%4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
%5 = tail call double @"numba.math.['double'].cos"(double %4)
%6 = fptosi double %5 to i32
%7 = add i32 %6, %res_35.us
%lsr.iv.next = add i64 %lsr.iv, -1
%exitcond = icmp eq i64 %lsr.iv.next, 0
br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"
"for_condition_4:13.loopexit.us": ; preds = %"loop_body_6:19.us"
%exitcond8 = icmp eq i64 %9, %2
br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"
"loop_body_6:19.lr.ph.us": ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
%res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
%8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
%9 = add i64 %8, 1
br label %"loop_body_6:19.us"
"exit_for_4:4": ; preds = %"for_condition_4:13.loopexit.us", %entry
%res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
ret i32 %res_2.lcssa
}
declare double @"numba.math.['double'].cos"(double)
declare double @"numba.math.['double'].log"(double)
define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
%objtemp = alloca { i64, i8* }*
store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
%0 = alloca { i64, i8* }*
%1 = alloca { i64, i8* }*
%return_value = alloca { i64, i8* }*
%2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
%3 = icmp eq i32 %2, 0
br i1 %3, label %cleanup.if.true, label %cleanup.if.end
cleanup_label: ; preds = %no_error, %error_label
%4 = load { i64, i8* }** %objtemp, !tbaa !6
call void @Py_XDECREF({ i64, i8* }* %4)
%5 = load { i64, i8* }** %return_value
ret { i64, i8* }* %5
error_label: ; preds = %empty7, %empty8, %empty5, %empty2, %cleanup.if.true
store { i64, i8* }* null, { i64, i8* }** %return_value
%6 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %6)
br label %cleanup_label
cleanup.if.true: ; preds = %entry
br label %error_label
cleanup.if.end: ; preds = %entry
%7 = load { i64, i8* }** %1
%8 = load { i64, i8* }** %0
%9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
br label %empty
empty: ; preds = %cleanup.if.end
%10 = call i8* @PyErr_Occurred()
%11 = ptrtoint i8* %10 to i64
%12 = icmp ne i64 %11, 0
br i1 %12, label %empty2, label %empty1
empty1: ; preds = %empty
%13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
br label %empty3
empty2: ; preds = %empty
br label %error_label
empty3: ; preds = %empty1
%14 = call i8* @PyErr_Occurred()
%15 = ptrtoint i8* %14 to i64
%16 = icmp ne i64 %15, 0
br i1 %16, label %empty5, label %empty4
empty4: ; preds = %empty3
%17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
br label %empty6
empty5: ; preds = %empty3
br label %error_label
empty6: ; preds = %empty4
%18 = call i8* @PyErr_Occurred()
%19 = ptrtoint i8* %18 to i64
%20 = icmp ne i64 %19, 0
br i1 %20, label %empty8, label %empty7
empty7: ; preds = %empty6
%21 = sext i32 %17 to i64
%22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
%23 = ptrtoint { i64, i8* }* %22 to i64
%24 = icmp eq i64 %23, 0
br i1 %24, label %error_label, label %no_error
empty8: ; preds = %empty6
br label %error_label
no_error: ; preds = %empty7
%25 = load { i64, i8* }** %objtemp, !tbaa !6
store { i64, i8* }* %25, { i64, i8* }** %return_value
%26 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %26)
br label %cleanup_label
}
declare i8* @PyErr_Occurred()
declare { i64, i8* }* @Py_BuildValue(i8*, ...)
declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)
declare void @PyErr_Clear()
!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6}
!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}
CPU times: user 248 ms, sys: 27.5 ms, total: 275 ms Wall time: 337 ms
1000000
Let us compare the performance of the different alternatives.
func_list = ['f_py', 'f_np', 'f_nb']
data_list = 3 * ['I, J']
perf_comp_data(func_list, data_list, rep=3)
function: f_nb, av. time sec: 0.01615, relative: 1.0 function: f_np, av. time sec: 0.03077, relative: 1.9 function: f_py, av. time sec: 0.51423, relative: 31.8
The simulation of the movement of N bodies in relation to each other, where \(N\) is a large number and a body can be a star/planet or a small particle, is an important problem in Physics.
The basic simulation algorithm involves a number of iterations ("loops") since one has to determine the force that every single body exerts on all the other bodies in the system to be simulated (e.g. a galaxy or a system of atomic particles).
The algorithm lends iteself pretty well for parallization. However, the example code that follows does not use any parallelization techniques. It illustrates rather what speed-ups are achievable by
A description and discussion of the algorithm is found in the recent article Vladimirov, Andrey and Vadim Karpusenko (2013): "Test-Driving Intel Xeon-Phi Coprocessors with a Basic N-Body Simulation." White Paper, Colfax International.
We first use pure Python to implement the algorithm. We start by generating random vectors of particle locations and a zero vector for the speed of each particle.
import random
nParticles = 150
nSteps = 15
p_py = []
for i in range(nParticles):
p_py.append([random.gauss(0.0, 1.0) for j in range(3)])
pv_py = [[0, 0, 0] for x in p_py]
Next, we define the simulation function for the movements of each particle given their gravitational interaction.
def nbody_py(particle, particlev): # lists as input
dt = 0.01
for step in range(1, nSteps + 1, 1):
for i in range(nParticles):
Fx = 0.0; Fy = 0.0; Fz = 0.0
for j in range(nParticles):
if j != i:
dx = particle[j][0] - particle[i][0]
dy = particle[j][1] - particle[i][1]
dz = particle[j][2] - particle[i][2]
drSquared = dx * dx + dy * dy + dz * dz
drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
Fx += dx * drPowerN32
Fy += dy * drPowerN32
Fz += dz * drPowerN32
particlev[i][0] += dt * Fx
particlev[i][1] += dt * Fy
particlev[i][2] += dt * Fz
for i in range(nParticles):
particle[i][0] += particlev[i][0] * dt
particle[i][1] += particlev[i][1] * dt
particle[i][2] += particlev[i][2] * dt
return particle, particlev
We then execute the function to measure execution speed.
%time p_py, pv_py = nbody_py(p_py, pv_py)
CPU times: user 593 ms, sys: 7.4 ms, total: 601 ms Wall time: 596 ms
The basic algorithm allows for a number of vectorizations by using NumPy.
import numpy as np
p_np = np.random.standard_normal((nParticles, 3))
pv_np = np.zeros_like(p_np)
Using NumPy arrays and matplotlib's 3d plotting capabilities, we can easily visualize the intial positions of the particles in the space.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
c='r', marker='.')
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10b3617d0>
The respective NumPy simulation function is considerably more compact.
def nbody_np(particle, particlev): # NumPy arrays as input
nSteps = 15; dt = 0.01
for step in range(1, nSteps + 1, 1):
Fp = np.zeros((nParticles, 3))
for i in range(nParticles):
dp = particle - particle[i]
drSquared = np.sum(dp ** 2, axis=1)
h = drSquared + np.sqrt(drSquared)
drPowerN32 = 1. / np.maximum(h, 1E-10)
Fp += -(dp.T * drPowerN32).T
particlev += dt * Fp
particle += particlev * dt
return particle, particlev
It is also considerably faster.
%time p_np, pv_np = nbody_np(p_np, pv_np)
CPU times: user 106 ms, sys: 1.25 ms, total: 107 ms Wall time: 106 ms
Let's check how the particles moved. The following is a snapshot of the final positions of all particles.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
c='r', marker='.')
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10bac3490>
For this case, we first implement a function which is somehow a hybrid function of the pure Python (loop-heavy) and the NumPy array class-based version.
def nbody_hy(particle, particlev): # NumPy arrays as input
dt = 0.01
for step in range(1, nSteps + 1, 1):
for i in range(nParticles):
Fx = 0.0; Fy = 0.0; Fz = 0.0
for j in range(nParticles):
if j != i:
dx = particle[j,0] - particle[i,0]
dy = particle[j,1] - particle[i,1]
dz = particle[j,2] - particle[i,2]
drSquared = dx * dx + dy * dy + dz * dz
drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
Fx += dx * drPowerN32
Fy += dy * drPowerN32
Fz += dz * drPowerN32
particlev[i, 0] += dt * Fx
particlev[i, 1] += dt * Fy
particlev[i, 2] += dt * Fz
for i in range(nParticles):
particle[i,0] += particlev[i,0] * dt
particle[i,1] += particlev[i,1] * dt
particle[i,2] += particlev[i,2] * dt
return particle, particlev
This function is then compiled with Numba.
import numba as nb
nbody_nb = nb.autojit(nbody_hy)
This new, compiled version of the hybrid function is yet faster.
p_nb = np.random.standard_normal((nParticles, 3))
pv_nb = np.zeros_like(p_nb)
%time p_nb, pv_nb = nbody_nb(p_nb, pv_nb)
DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.nbody_hy.10baa19b0'
@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)
define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
%objtemp = alloca { i64, i8* }*
store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
%nsteps8 = alloca i64
%target_temp7 = alloca i64
%nsteps6 = alloca i64
%target_temp5 = alloca i64
%nsteps4 = alloca i64
%target_temp3 = alloca i64
%nsteps = alloca i64
%target_temp = alloca i64
%particlev2 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
%particle1 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1
%0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
call void @Py_INCREF({ i64, i8* }* %0)
%1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 2
%2 = load i8** %1, !tbaa !2
%3 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 4
%4 = load i64** %3, !tbaa !4
%5 = getelementptr i64* %4, i32 0
%6 = load i64* %5
%7 = getelementptr i64* %4, i32 1
%8 = load i64* %7
%9 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 5
%10 = load i64** %9, !tbaa !5
%11 = getelementptr i64* %10, i32 0
%12 = load i64* %11
%13 = getelementptr i64* %10, i32 1
%14 = load i64* %13
store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2
%15 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
call void @Py_INCREF({ i64, i8* }* %15)
%16 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 2
%17 = load i8** %16, !tbaa !2
%18 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 4
%19 = load i64** %18, !tbaa !4
%20 = getelementptr i64* %19, i32 0
%21 = load i64* %20
%22 = getelementptr i64* %19, i32 1
%23 = load i64* %22
%24 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 5
%25 = load i64** %24, !tbaa !5
%26 = getelementptr i64* %25, i32 0
%27 = load i64* %26
%28 = getelementptr i64* %25, i32 1
%29 = load i64* %28
%return_value = alloca { i64, i8* }*
store i64 0, i64* %target_temp, !tbaa !7
br i1 true, label %ifexp.then, label %ifexp.else
cleanup_label: ; preds = %"no_error_23:21", %error_label
%30 = load { i64, i8* }** %objtemp, !tbaa !6
call void @Py_XDECREF({ i64, i8* }* %30)
%31 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2, !tbaa !6
%32 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %31 to { i64, i8* }*
call void @Py_XDECREF({ i64, i8* }* %32)
%33 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1, !tbaa !6
%34 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %33 to { i64, i8* }*
call void @Py_XDECREF({ i64, i8* }* %34)
%35 = load { i64, i8* }** %return_value
ret { i64, i8* }* %35
error_label: ; preds = %"exit_for_3:4"
store { i64, i8* }* null, { i64, i8* }** %return_value
%36 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %36)
br label %cleanup_label
ifexp.then: ; preds = %entry
br label %ifexp.merge
ifexp.else: ; preds = %entry
br label %ifexp.merge
ifexp.merge: ; preds = %ifexp.else, %ifexp.then
%37 = phi i32 [ 1, %ifexp.then ], [ -1, %ifexp.else ]
%38 = sext i32 %37 to i64
%39 = sub i64 16, %38
%40 = sdiv i64 %39, 1
store i64 %40, i64* %nsteps, !tbaa !8
br label %"for_condition_3:16"
"for_condition_3:16": ; preds = %ifexp.merge, %"exit_for_19:8"
%i_1 = phi i64 [ 123456789, %ifexp.merge ], [ %i_4, %"exit_for_19:8" ]
%Fx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fx_2, %"exit_for_19:8" ]
%drSquared_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drSquared_2, %"exit_for_19:8" ]
%Fy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fy_2, %"exit_for_19:8" ]
%Fz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fz_2, %"exit_for_19:8" ]
%dz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dz_2, %"exit_for_19:8" ]
%j_1 = phi i64 [ 123456789, %ifexp.merge ], [ %j_2, %"exit_for_19:8" ]
%dx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dx_2, %"exit_for_19:8" ]
%dy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dy_2, %"exit_for_19:8" ]
%drPowerN32_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drPowerN32_2, %"exit_for_19:8" ]
%41 = load i64* %target_temp, !tbaa !7
%42 = load i64* %nsteps, !tbaa !8
%43 = icmp slt i64 %41, %42
%44 = icmp ne i1 %43, false
br i1 %44, label %"loop_body_4:8", label %"exit_for_3:4"
"exit_for_3:4": ; preds = %"for_condition_3:16"
%45 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
%46 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
%47 = call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8* }* %45, { i64, i8* }* %46)
store { i64, i8* }* %47, { i64, i8* }** %objtemp, !tbaa !6
%48 = ptrtoint { i64, i8* }* %47 to i64
%49 = icmp eq i64 %48, 0
br i1 %49, label %error_label, label %"no_error_23:21"
"loop_body_4:8": ; preds = %"for_condition_3:16"
%50 = load i64* %target_temp, !tbaa !7
%51 = mul i64 %50, 1
%52 = add i64 1, %51
%53 = load i64* %target_temp, !tbaa !7
%54 = add i64 %53, 1
store i64 %54, i64* %target_temp, !tbaa !7
store i64 0, i64* %target_temp3, !tbaa !9
store i64 150, i64* %nsteps4, !tbaa !10
br label %"for_condition_4:17"
"for_condition_4:17": ; preds = %"loop_body_4:8", %"exit_for_6:12"
%i_2 = phi i64 [ %i_1, %"loop_body_4:8" ], [ %59, %"exit_for_6:12" ]
%Fx_2 = phi double [ %Fx_1, %"loop_body_4:8" ], [ %Fx_4, %"exit_for_6:12" ]
%drSquared_2 = phi double [ %drSquared_1, %"loop_body_4:8" ], [ %drSquared_3, %"exit_for_6:12" ]
%Fy_2 = phi double [ %Fy_1, %"loop_body_4:8" ], [ %Fy_4, %"exit_for_6:12" ]
%Fz_2 = phi double [ %Fz_1, %"loop_body_4:8" ], [ %Fz_4, %"exit_for_6:12" ]
%j_2 = phi i64 [ %j_1, %"loop_body_4:8" ], [ %j_3, %"exit_for_6:12" ]
%dx_2 = phi double [ %dx_1, %"loop_body_4:8" ], [ %dx_3, %"exit_for_6:12" ]
%dy_2 = phi double [ %dy_1, %"loop_body_4:8" ], [ %dy_3, %"exit_for_6:12" ]
%dz_2 = phi double [ %dz_1, %"loop_body_4:8" ], [ %dz_3, %"exit_for_6:12" ]
%drPowerN32_2 = phi double [ %drPowerN32_1, %"loop_body_4:8" ], [ %drPowerN32_3, %"exit_for_6:12" ]
%55 = load i64* %target_temp3, !tbaa !9
%56 = load i64* %nsteps4, !tbaa !10
%57 = icmp slt i64 %55, %56
%58 = icmp ne i1 %57, false
br i1 %58, label %"loop_body_5:12", label %"exit_for_4:8"
"exit_for_4:8": ; preds = %"for_condition_4:17"
store i64 0, i64* %target_temp7, !tbaa !13
store i64 150, i64* %nsteps8, !tbaa !14
br label %"for_condition_19:17"
"loop_body_5:12": ; preds = %"for_condition_4:17"
%59 = load i64* %target_temp3, !tbaa !9
%60 = load i64* %target_temp3, !tbaa !9
%61 = add i64 %60, 1
store i64 %61, i64* %target_temp3, !tbaa !9
store i64 0, i64* %target_temp5, !tbaa !11
store i64 150, i64* %nsteps6, !tbaa !12
br label %"for_condition_6:21"
"for_condition_6:21": ; preds = %"loop_body_5:12", %"exit_if_7:16"
%drSquared_3 = phi double [ %drSquared_2, %"loop_body_5:12" ], [ %drSquared_5, %"exit_if_7:16" ]
%Fx_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
%Fy_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
%Fz_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
%j_3 = phi i64 [ %j_2, %"loop_body_5:12" ], [ %66, %"exit_if_7:16" ]
%dx_3 = phi double [ %dx_2, %"loop_body_5:12" ], [ %dx_5, %"exit_if_7:16" ]
%dy_3 = phi double [ %dy_2, %"loop_body_5:12" ], [ %dy_5, %"exit_if_7:16" ]
%dz_3 = phi double [ %dz_2, %"loop_body_5:12" ], [ %dz_5, %"exit_if_7:16" ]
%drPowerN32_3 = phi double [ %drPowerN32_2, %"loop_body_5:12" ], [ %drPowerN32_5, %"exit_if_7:16" ]
%62 = load i64* %target_temp5, !tbaa !11
%63 = load i64* %nsteps6, !tbaa !12
%64 = icmp slt i64 %62, %63
%65 = icmp ne i1 %64, false
br i1 %65, label %"loop_body_7:16", label %"exit_for_6:12"
"exit_for_6:12": ; preds = %"for_condition_6:21"
br label %"for_condition_4:17"
"loop_body_7:16": ; preds = %"for_condition_6:21"
%66 = load i64* %target_temp5, !tbaa !11
%67 = load i64* %target_temp5, !tbaa !11
%68 = add i64 %67, 1
store i64 %68, i64* %target_temp5, !tbaa !11
br label %"if_cond_7:19"
"if_cond_7:19": ; preds = %"loop_body_7:16"
%69 = icmp ne i64 %66, %59
%70 = icmp ne i1 %69, false
br i1 %70, label %"if_body_8:20", label %"exit_if_7:16"
"exit_if_7:16": ; preds = %"if_cond_7:19", %"if_body_8:20"
%drSquared_5 = phi double [ %165, %"if_body_8:20" ], [ %drSquared_3, %"if_cond_7:19" ]
%Fx_6 = phi double [ %170, %"if_body_8:20" ], [ %Fx_4, %"if_cond_7:19" ]
%Fy_6 = phi double [ %172, %"if_body_8:20" ], [ %Fy_4, %"if_cond_7:19" ]
%Fz_6 = phi double [ %174, %"if_body_8:20" ], [ %Fz_4, %"if_cond_7:19" ]
%dx_5 = phi double [ %130, %"if_body_8:20" ], [ %dx_3, %"if_cond_7:19" ]
%dy_5 = phi double [ %145, %"if_body_8:20" ], [ %dy_3, %"if_cond_7:19" ]
%dz_5 = phi double [ %160, %"if_body_8:20" ], [ %dz_3, %"if_cond_7:19" ]
%drPowerN32_5 = phi double [ %168, %"if_body_8:20" ], [ %drPowerN32_3, %"if_cond_7:19" ]
%71 = mul i64 %59, %27
%72 = add i64 0, %71
%73 = mul i64 0, %29
%74 = add i64 %72, %73
%75 = getelementptr i8* %17, i64 %74
%76 = bitcast i8* %75 to double*
%77 = load double* %76, !tbaa !2
%78 = fmul double 1.000000e-02, %Fx_6
%79 = fadd double %77, %78
%80 = mul i64 %59, %27
%81 = add i64 0, %80
%82 = mul i64 0, %29
%83 = add i64 %81, %82
%84 = getelementptr i8* %17, i64 %83
%85 = bitcast i8* %84 to double*
store double %79, double* %85, !tbaa !2
%86 = mul i64 %59, %27
%87 = add i64 0, %86
%88 = mul i64 1, %29
%89 = add i64 %87, %88
%90 = getelementptr i8* %17, i64 %89
%91 = bitcast i8* %90 to double*
%92 = load double* %91, !tbaa !2
%93 = fmul double 1.000000e-02, %Fy_6
%94 = fadd double %92, %93
%95 = mul i64 %59, %27
%96 = add i64 0, %95
%97 = mul i64 1, %29
%98 = add i64 %96, %97
%99 = getelementptr i8* %17, i64 %98
%100 = bitcast i8* %99 to double*
store double %94, double* %100, !tbaa !2
%101 = mul i64 %59, %27
%102 = add i64 0, %101
%103 = mul i64 2, %29
%104 = add i64 %102, %103
%105 = getelementptr i8* %17, i64 %104
%106 = bitcast i8* %105 to double*
%107 = load double* %106, !tbaa !2
%108 = fmul double 1.000000e-02, %Fz_6
%109 = fadd double %107, %108
%110 = mul i64 %59, %27
%111 = add i64 0, %110
%112 = mul i64 2, %29
%113 = add i64 %111, %112
%114 = getelementptr i8* %17, i64 %113
%115 = bitcast i8* %114 to double*
store double %109, double* %115, !tbaa !2
br label %"for_condition_6:21"
"if_body_8:20": ; preds = %"if_cond_7:19"
%116 = mul i64 %66, %12
%117 = add i64 0, %116
%118 = mul i64 0, %14
%119 = add i64 %117, %118
%120 = getelementptr i8* %2, i64 %119
%121 = bitcast i8* %120 to double*
%122 = load double* %121, !tbaa !2
%123 = mul i64 %59, %12
%124 = add i64 0, %123
%125 = mul i64 0, %14
%126 = add i64 %124, %125
%127 = getelementptr i8* %2, i64 %126
%128 = bitcast i8* %127 to double*
%129 = load double* %128, !tbaa !2
%130 = fsub double %122, %129
%131 = mul i64 %66, %12
%132 = add i64 0, %131
%133 = mul i64 1, %14
%134 = add i64 %132, %133
%135 = getelementptr i8* %2, i64 %134
%136 = bitcast i8* %135 to double*
%137 = load double* %136, !tbaa !2
%138 = mul i64 %59, %12
%139 = add i64 0, %138
%140 = mul i64 1, %14
%141 = add i64 %139, %140
%142 = getelementptr i8* %2, i64 %141
%143 = bitcast i8* %142 to double*
%144 = load double* %143, !tbaa !2
%145 = fsub double %137, %144
%146 = mul i64 %66, %12
%147 = add i64 0, %146
%148 = mul i64 2, %14
%149 = add i64 %147, %148
%150 = getelementptr i8* %2, i64 %149
%151 = bitcast i8* %150 to double*
%152 = load double* %151, !tbaa !2
%153 = mul i64 %59, %12
%154 = add i64 0, %153
%155 = mul i64 2, %14
%156 = add i64 %154, %155
%157 = getelementptr i8* %2, i64 %156
%158 = bitcast i8* %157 to double*
%159 = load double* %158, !tbaa !2
%160 = fsub double %152, %159
%161 = fmul double %130, %130
%162 = fmul double %145, %145
%163 = fadd double %161, %162
%164 = fmul double %160, %160
%165 = fadd double %163, %164
%166 = call double @"numba.math.['double'].sqrt"(double %165)
%167 = fadd double %165, %166
%168 = fdiv double 1.000000e+00, %167
%169 = fmul double %130, %168
%170 = fadd double %Fx_4, %169
%171 = fmul double %145, %168
%172 = fadd double %Fy_4, %171
%173 = fmul double %160, %168
%174 = fadd double %Fz_4, %173
br label %"exit_if_7:16"
"for_condition_19:17": ; preds = %"exit_for_4:8", %"loop_body_20:29"
%i_4 = phi i64 [ %i_2, %"exit_for_4:8" ], [ %179, %"loop_body_20:29" ]
%175 = load i64* %target_temp7, !tbaa !13
%176 = load i64* %nsteps8, !tbaa !14
%177 = icmp slt i64 %175, %176
%178 = icmp ne i1 %177, false
br i1 %178, label %"loop_body_20:29", label %"exit_for_19:8"
"exit_for_19:8": ; preds = %"for_condition_19:17"
br label %"for_condition_3:16"
"loop_body_20:29": ; preds = %"for_condition_19:17"
%179 = load i64* %target_temp7, !tbaa !13
%180 = load i64* %target_temp7, !tbaa !13
%181 = add i64 %180, 1
store i64 %181, i64* %target_temp7, !tbaa !13
%182 = mul i64 %179, %12
%183 = add i64 0, %182
%184 = mul i64 0, %14
%185 = add i64 %183, %184
%186 = getelementptr i8* %2, i64 %185
%187 = bitcast i8* %186 to double*
%188 = load double* %187, !tbaa !2
%189 = mul i64 %179, %27
%190 = add i64 0, %189
%191 = mul i64 0, %29
%192 = add i64 %190, %191
%193 = getelementptr i8* %17, i64 %192
%194 = bitcast i8* %193 to double*
%195 = load double* %194, !tbaa !2
%196 = fmul double %195, 1.000000e-02
%197 = fadd double %188, %196
%198 = mul i64 %179, %12
%199 = add i64 0, %198
%200 = mul i64 0, %14
%201 = add i64 %199, %200
%202 = getelementptr i8* %2, i64 %201
%203 = bitcast i8* %202 to double*
store double %197, double* %203, !tbaa !2
%204 = mul i64 %179, %12
%205 = add i64 0, %204
%206 = mul i64 1, %14
%207 = add i64 %205, %206
%208 = getelementptr i8* %2, i64 %207
%209 = bitcast i8* %208 to double*
%210 = load double* %209, !tbaa !2
%211 = mul i64 %179, %27
%212 = add i64 0, %211
%213 = mul i64 1, %29
%214 = add i64 %212, %213
%215 = getelementptr i8* %17, i64 %214
%216 = bitcast i8* %215 to double*
%217 = load double* %216, !tbaa !2
%218 = fmul double %217, 1.000000e-02
%219 = fadd double %210, %218
%220 = mul i64 %179, %12
%221 = add i64 0, %220
%222 = mul i64 1, %14
%223 = add i64 %221, %222
%224 = getelementptr i8* %2, i64 %223
%225 = bitcast i8* %224 to double*
store double %219, double* %225, !tbaa !2
%226 = mul i64 %179, %12
%227 = add i64 0, %226
%228 = mul i64 2, %14
%229 = add i64 %227, %228
%230 = getelementptr i8* %2, i64 %229
%231 = bitcast i8* %230 to double*
%232 = load double* %231, !tbaa !2
%233 = mul i64 %179, %27
%234 = add i64 0, %233
%235 = mul i64 2, %29
%236 = add i64 %234, %235
%237 = getelementptr i8* %17, i64 %236
%238 = bitcast i8* %237 to double*
%239 = load double* %238, !tbaa !2
%240 = fmul double %239, 1.000000e-02
%241 = fadd double %232, %240
%242 = mul i64 %179, %12
%243 = add i64 0, %242
%244 = mul i64 2, %14
%245 = add i64 %243, %244
%246 = getelementptr i8* %2, i64 %245
%247 = bitcast i8* %246 to double*
store double %241, double* %247, !tbaa !2
br label %"for_condition_19:17"
"no_error_23:21": ; preds = %"exit_for_3:4"
%248 = load { i64, i8* }** %objtemp, !tbaa !6
store { i64, i8* }* %248, { i64, i8* }** %return_value
%249 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %249)
br label %cleanup_label
}
declare { i64, i8* }* @PyTuple_Pack(i64, ...)
declare { i64, i8* }* @Py_BuildValue(i8*, ...)
declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)
declare void @PyErr_Clear()
declare void @Py_INCREF({ i64, i8* }*)
declare void @Py_XINCREF({ i64, i8* }*)
declare double @"numba.math.['double'].sqrt"(double)
declare void @Py_XDECREF({ i64, i8* }*)
!tbaa = !{!0, !1, !2, !3, !4, !5, !6, !7, !8, !9, !10, !11, !12, !13, !14}
!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"float3264 *", metadata !1}
!3 = metadata !{metadata !"npy_intp **", metadata !1}
!4 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !3}
!5 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !3}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"unique0", metadata !1}
!8 = metadata !{metadata !"unique1", metadata !1}
!9 = metadata !{metadata !"unique2", metadata !1}
!10 = metadata !{metadata !"unique3", metadata !1}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}
DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'
@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)
define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = bitcast { i64, i8* }* %obj to i64*
%1 = load i64* %0
%2 = add i64 %1, 1
store i64 %2, i64* %0
ret void
}
define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = bitcast { i64, i8* }* %obj to i64*
%1 = load i64* %0
%2 = icmp sgt i64 %1, 1
br i1 %2, label %if.then, label %if.else
if.then: ; preds = %decl
%3 = bitcast { i64, i8* }* %obj to i64*
%4 = add i64 %1, -1
store i64 %4, i64* %3
br label %if.end
if.else: ; preds = %decl
call void @Py_DecRef({ i64, i8* }* %obj)
br label %if.end
if.end: ; preds = %if.else, %if.then
ret void
}
declare void @Py_DecRef({ i64, i8* }*)
define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = ptrtoint { i64, i8* }* %obj to i64
%1 = icmp ne i64 %0, 0
br i1 %1, label %if.then, label %if.end
if.then: ; preds = %decl
%2 = bitcast { i64, i8* }* %obj to i64*
%3 = load i64* %2
%4 = add i64 %3, 1
store i64 %4, i64* %2
br label %if.end
if.end: ; preds = %if.then, %decl
ret void
}
define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
%obj1 = alloca { i64, i8* }*
store { i64, i8* }* %obj, { i64, i8* }** %obj1
%0 = ptrtoint { i64, i8* }* %obj to i64
%1 = icmp ne i64 %0, 0
br i1 %1, label %if.then, label %if.end
if.then: ; preds = %decl
call void @Py_DECREF({ i64, i8* }* %obj)
br label %if.end
if.end: ; preds = %if.then, %decl
ret void
}
define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
%data1 = alloca i8*
%in_shape2 = alloca i64*
%in_strides3 = alloca i64*
%src_dim4 = alloca i64
%index5 = alloca i64
%result = alloca i8*
store i8* %data, i8** %data1
store i64* %in_shape, i64** %in_shape2
store i64* %in_strides, i64** %in_strides3
store i64 %src_dim, i64* %src_dim4
store i64 %index, i64* %index5
%0 = load i64** %in_strides3
%1 = load i64* %src_dim4
%2 = getelementptr inbounds i64* %0, i64 %1
%3 = load i64* %2
%4 = mul i64 %3, %index
%5 = load i8** %data1
%6 = getelementptr inbounds i8* %5, i64 %4
store i8* %6, i8** %result
ret i8* %6
}
define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
%out_shape1 = alloca i64*
%out_strides2 = alloca i64*
%dst_dim3 = alloca i32
store i64* %out_shape, i64** %out_shape1
store i64* %out_strides, i64** %out_strides2
store i32 %dst_dim, i32* %dst_dim3
%0 = load i64** %out_shape1
%1 = getelementptr inbounds i64* %0, i32 %dst_dim
store i64 1, i64* %1
%2 = load i64** %out_strides2
%3 = load i32* %dst_dim3
%4 = getelementptr inbounds i64* %2, i32 %3
store i64 0, i64* %4
ret void
}
define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
%dst_shape1 = alloca i64*
%src_shape2 = alloca i64*
%src_strides3 = alloca i64*
%max_ndim4 = alloca i32
%ndim5 = alloca i32
%0 = alloca i32
store i64* %dst_shape, i64** %dst_shape1
store i64* %src_shape, i64** %src_shape2
store i64* %src_strides, i64** %src_strides3
store i32 %max_ndim, i32* %max_ndim4
store i32 %ndim, i32* %ndim5
%1 = load i32* %max_ndim4
%2 = sub i32 %1, %ndim
store i32 0, i32* %0
br label %loop.cond
loop.cond: ; preds = %if.end11, %decl
%3 = load i32* %0
%4 = load i32* %ndim5
%5 = icmp slt i32 %3, %4
br i1 %5, label %loop.body, label %loop.end
loop.body: ; preds = %loop.cond
%6 = load i64** %src_shape2
%7 = getelementptr inbounds i64* %6, i32 %3
%8 = add i32 %3, %2
%9 = load i64** %dst_shape1
%10 = getelementptr inbounds i64* %9, i32 %8
%11 = load i64* %7
%12 = icmp eq i64 %11, 1
br i1 %12, label %if.then, label %if.else
loop.end: ; preds = %if.else7, %loop.cond
%merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
ret i32 %merge
if.then: ; preds = %loop.body
%13 = load i64** %src_strides3
%14 = getelementptr inbounds i64* %13, i32 %3
store i64 0, i64* %14
br label %if.end11
if.else: ; preds = %loop.body
%15 = load i64* %10
%16 = icmp eq i64 %15, 1
br i1 %16, label %if.then6, label %if.else7
if.then6: ; preds = %if.else
store i64 %11, i64* %10
br label %if.end11
if.else7: ; preds = %if.else
%17 = icmp ne i64 %11, %15
br i1 %17, label %loop.end, label %if.end11
if.end11: ; preds = %if.else7, %if.then6, %if.then
%18 = load i32* %0
%19 = add i32 %18, 1
store i32 %19, i32* %0
br label %loop.cond
}
define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
%0 = icmp sgt i32 %I, 0
%1 = icmp sgt i32 %J, 0
%or.cond = and i1 %0, %1
br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"
"loop_body_5:8.lr.ph.split.us": ; preds = %entry
%2 = sext i32 %I to i64
%3 = sext i32 %J to i64
br label %"loop_body_6:19.lr.ph.us"
"loop_body_6:19.us": ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
%lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
%res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
%4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
%5 = tail call double @"numba.math.['double'].cos"(double %4)
%6 = fptosi double %5 to i32
%7 = add i32 %6, %res_35.us
%lsr.iv.next = add i64 %lsr.iv, -1
%exitcond = icmp eq i64 %lsr.iv.next, 0
br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"
"for_condition_4:13.loopexit.us": ; preds = %"loop_body_6:19.us"
%exitcond8 = icmp eq i64 %9, %2
br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"
"loop_body_6:19.lr.ph.us": ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
%res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
%8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
%9 = add i64 %8, 1
br label %"loop_body_6:19.us"
"exit_for_4:4": ; preds = %"for_condition_4:13.loopexit.us", %entry
%res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
ret i32 %res_2.lcssa
}
declare double @"numba.math.['double'].cos"(double)
declare double @"numba.math.['double'].log"(double)
define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
%objtemp = alloca { i64, i8* }*
store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
%0 = alloca { i64, i8* }*
%1 = alloca { i64, i8* }*
%return_value = alloca { i64, i8* }*
%2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
%3 = icmp eq i32 %2, 0
br i1 %3, label %error_label, label %empty
cleanup_label: ; preds = %no_error, %error_label
%4 = load { i64, i8* }** %objtemp, !tbaa !6
call void @Py_XDECREF({ i64, i8* }* %4)
%5 = load { i64, i8* }** %return_value
ret { i64, i8* }* %5
error_label: ; preds = %empty6, %empty3, %empty, %entry, %empty7
store { i64, i8* }* null, { i64, i8* }** %return_value
%6 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %6)
br label %cleanup_label
empty: ; preds = %entry
%7 = load { i64, i8* }** %1
%8 = load { i64, i8* }** %0
%9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
%10 = call i8* @PyErr_Occurred()
%11 = ptrtoint i8* %10 to i64
%12 = icmp ne i64 %11, 0
br i1 %12, label %error_label, label %empty3
empty3: ; preds = %empty
%13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
%14 = call i8* @PyErr_Occurred()
%15 = ptrtoint i8* %14 to i64
%16 = icmp ne i64 %15, 0
br i1 %16, label %error_label, label %empty6
empty6: ; preds = %empty3
%17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
%18 = call i8* @PyErr_Occurred()
%19 = ptrtoint i8* %18 to i64
%20 = icmp ne i64 %19, 0
br i1 %20, label %error_label, label %empty7
empty7: ; preds = %empty6
%21 = sext i32 %17 to i64
%22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
%23 = ptrtoint { i64, i8* }* %22 to i64
%24 = icmp eq i64 %23, 0
br i1 %24, label %error_label, label %no_error
no_error: ; preds = %empty7
%25 = load { i64, i8* }** %objtemp, !tbaa !6
store { i64, i8* }* %25, { i64, i8* }** %return_value
%26 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %26)
br label %cleanup_label
}
declare i8* @PyErr_Occurred()
declare { i64, i8* }* @Py_BuildValue(i8*, ...)
declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)
declare void @PyErr_Clear()
define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
%0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
tail call void @Py_INCREF({ i64, i8* }* %0)
%1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 2
%2 = load i8** %1, align 8, !tbaa !7
%3 = ptrtoint i8* %2 to i64
%4 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 5
%5 = load i64** %4, align 8, !tbaa !10
%6 = load i64* %5, align 8
%7 = getelementptr i64* %5, i64 1
%8 = load i64* %7, align 8
%9 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
tail call void @Py_INCREF({ i64, i8* }* %9)
%10 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 2
%11 = load i8** %10, align 8, !tbaa !7
%12 = ptrtoint i8* %11 to i64
%13 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 5
%14 = load i64** %13, align 8, !tbaa !10
%15 = load i64* %14, align 8
%16 = getelementptr i64* %14, i64 1
%17 = load i64* %16, align 8
%18 = shl i64 %17, 1
%19 = shl i64 %8, 1
%20 = sub i64 0, %3
%21 = sub i64 0, %12
br label %"loop_body_4:8"
cleanup_label: ; preds = %"no_error_23:21", %error_label
%return_value.0.load14 = phi { i64, i8* }* [ %25, %"no_error_23:21" ], [ null, %error_label ]
%22 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
%23 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
tail call void @Py_XDECREF({ i64, i8* }* %25)
tail call void @Py_XDECREF({ i64, i8* }* %22)
tail call void @Py_XDECREF({ i64, i8* }* %23)
ret { i64, i8* }* %return_value.0.load14
error_label: ; preds = %"exit_for_3:4"
tail call void @Py_XINCREF({ i64, i8* }* null)
br label %cleanup_label
"for_condition_3:16.loopexit": ; preds = %"loop_body_20:29"
%24 = add i64 %27, 1
%exitcond35 = icmp eq i64 %24, 15
br i1 %exitcond35, label %"exit_for_3:4", label %"loop_body_4:8"
"exit_for_3:4": ; preds = %"for_condition_3:16.loopexit"
%25 = tail call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev)
%26 = icmp eq { i64, i8* }* %25, null
br i1 %26, label %error_label, label %"no_error_23:21"
"loop_body_4:8": ; preds = %"for_condition_3:16.loopexit", %entry
%27 = phi i64 [ 0, %entry ], [ %24, %"for_condition_3:16.loopexit" ]
br label %"loop_body_5:12"
"for_condition_4:17.loopexit": ; preds = %"exit_if_7:16"
%28 = add i64 %29, 1
%exitcond33 = icmp eq i64 %28, 150
br i1 %exitcond33, label %"loop_body_20:29", label %"loop_body_5:12"
"loop_body_5:12": ; preds = %"for_condition_4:17.loopexit", %"loop_body_4:8"
%29 = phi i64 [ 0, %"loop_body_4:8" ], [ %28, %"for_condition_4:17.loopexit" ]
%30 = mul i64 %29, %15
%31 = add i64 %30, %17
%32 = add i64 %30, %18
%33 = mul i64 %29, %6
%34 = add i64 %33, %8
%35 = add i64 %33, %19
br label %"loop_body_7:16"
"loop_body_7:16": ; preds = %"exit_if_7:16", %"loop_body_5:12"
%lsr.iv = phi i64 [ %lsr.iv.next, %"exit_if_7:16" ], [ %20, %"loop_body_5:12" ]
%Fz_418 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
%Fy_417 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
%Fx_416 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
%36 = phi i64 [ 0, %"loop_body_5:12" ], [ %37, %"exit_if_7:16" ]
%37 = add i64 %36, 1
%38 = icmp eq i64 %29, %36
br i1 %38, label %"exit_if_7:16", label %"if_body_8:20"
"exit_if_7:16": ; preds = %"loop_body_7:16", %"if_body_8:20"
%Fx_6 = phi double [ %70, %"if_body_8:20" ], [ %Fx_416, %"loop_body_7:16" ]
%Fy_6 = phi double [ %72, %"if_body_8:20" ], [ %Fy_417, %"loop_body_7:16" ]
%Fz_6 = phi double [ %74, %"if_body_8:20" ], [ %Fz_418, %"loop_body_7:16" ]
%sunkaddr = ptrtoint i8* %11 to i64
%sunkaddr23 = add i64 %sunkaddr, %30
%sunkaddr24 = inttoptr i64 %sunkaddr23 to double*
%39 = load double* %sunkaddr24, align 8, !tbaa !7
%40 = fmul double %Fx_6, 1.000000e-02
%41 = fadd double %40, %39
store double %41, double* %sunkaddr24, align 8, !tbaa !7
%sunkaddr25 = ptrtoint i8* %11 to i64
%sunkaddr26 = add i64 %sunkaddr25, %31
%sunkaddr27 = inttoptr i64 %sunkaddr26 to double*
%42 = load double* %sunkaddr27, align 8, !tbaa !7
%43 = fmul double %Fy_6, 1.000000e-02
%44 = fadd double %43, %42
store double %44, double* %sunkaddr27, align 8, !tbaa !7
%sunkaddr28 = ptrtoint i8* %11 to i64
%sunkaddr29 = add i64 %sunkaddr28, %32
%sunkaddr30 = inttoptr i64 %sunkaddr29 to double*
%45 = load double* %sunkaddr30, align 8, !tbaa !7
%46 = fmul double %Fz_6, 1.000000e-02
%47 = fadd double %46, %45
store double %47, double* %sunkaddr30, align 8, !tbaa !7
%lsr.iv.next = sub i64 %lsr.iv, %6
%exitcond = icmp eq i64 150, %37
br i1 %exitcond, label %"for_condition_4:17.loopexit", label %"loop_body_7:16"
"if_body_8:20": ; preds = %"loop_body_7:16"
%48 = mul i64 %lsr.iv, -1
%49 = inttoptr i64 %48 to double*
%50 = load double* %49, align 8, !tbaa !7
%sunkaddr31 = ptrtoint i8* %2 to i64
%sunkaddr32 = add i64 %sunkaddr31, %33
%sunkaddr33 = inttoptr i64 %sunkaddr32 to double*
%51 = load double* %sunkaddr33, align 8, !tbaa !7
%52 = fsub double %50, %51
%53 = mul i64 %lsr.iv, -1
%sunkaddr34 = add i64 %8, %53
%sunkaddr35 = inttoptr i64 %sunkaddr34 to double*
%54 = load double* %sunkaddr35, align 8, !tbaa !7
%sunkaddr36 = ptrtoint i8* %2 to i64
%sunkaddr37 = add i64 %sunkaddr36, %34
%sunkaddr38 = inttoptr i64 %sunkaddr37 to double*
%55 = load double* %sunkaddr38, align 8, !tbaa !7
%56 = fsub double %54, %55
%57 = mul i64 %lsr.iv, -1
%sunkaddr39 = mul i64 %8, 2
%sunkaddr40 = add i64 %57, %sunkaddr39
%sunkaddr41 = inttoptr i64 %sunkaddr40 to double*
%58 = load double* %sunkaddr41, align 8, !tbaa !7
%sunkaddr42 = ptrtoint i8* %2 to i64
%sunkaddr43 = add i64 %sunkaddr42, %35
%sunkaddr44 = inttoptr i64 %sunkaddr43 to double*
%59 = load double* %sunkaddr44, align 8, !tbaa !7
%60 = fsub double %58, %59
%61 = fmul double %52, %52
%62 = fmul double %56, %56
%63 = fadd double %61, %62
%64 = fmul double %60, %60
%65 = fadd double %63, %64
%66 = tail call double @"numba.math.['double'].sqrt"(double %65)
%67 = fadd double %66, %65
%68 = fdiv double 1.000000e+00, %67
%69 = fmul double %52, %68
%70 = fadd double %Fx_416, %69
%71 = fmul double %56, %68
%72 = fadd double %Fy_417, %71
%73 = fmul double %60, %68
%74 = fadd double %Fz_418, %73
br label %"exit_if_7:16"
"loop_body_20:29": ; preds = %"for_condition_4:17.loopexit", %"loop_body_20:29"
%lsr.iv21 = phi i64 [ %lsr.iv.next22, %"loop_body_20:29" ], [ 150, %"for_condition_4:17.loopexit" ]
%lsr.iv15 = phi i64 [ %lsr.iv.next16, %"loop_body_20:29" ], [ %21, %"for_condition_4:17.loopexit" ]
%lsr.iv4 = phi i8* [ %91, %"loop_body_20:29" ], [ %2, %"for_condition_4:17.loopexit" ]
%lsr.iv414 = bitcast i8* %lsr.iv4 to double*
%lsr.iv45 = bitcast i8* %lsr.iv4 to i1*
%75 = load double* %lsr.iv414, align 8, !tbaa !7
%76 = mul i64 %lsr.iv15, -1
%77 = inttoptr i64 %76 to double*
%78 = load double* %77, align 8, !tbaa !7
%79 = fmul double %78, 1.000000e-02
%80 = fadd double %75, %79
store double %80, double* %lsr.iv414, align 8, !tbaa !7
%scevgep12 = getelementptr i8* %lsr.iv4, i64 %8
%scevgep1213 = bitcast i8* %scevgep12 to double*
%81 = load double* %scevgep1213, align 8, !tbaa !7
%82 = mul i64 %lsr.iv15, -1
%sunkaddr45 = add i64 %17, %82
%sunkaddr46 = inttoptr i64 %sunkaddr45 to double*
%83 = load double* %sunkaddr46, align 8, !tbaa !7
%84 = fmul double %83, 1.000000e-02
%85 = fadd double %81, %84
%scevgep10 = getelementptr i8* %lsr.iv4, i64 %8
%scevgep1011 = bitcast i8* %scevgep10 to double*
store double %85, double* %scevgep1011, align 8, !tbaa !7
%sunkaddr47 = ptrtoint i8* %lsr.iv4 to i64
%sunkaddr48 = mul i64 %8, 2
%sunkaddr49 = add i64 %sunkaddr47, %sunkaddr48
%sunkaddr50 = inttoptr i64 %sunkaddr49 to double*
%86 = load double* %sunkaddr50, align 8, !tbaa !7
%87 = mul i64 %lsr.iv15, -1
%sunkaddr51 = mul i64 %17, 2
%sunkaddr52 = add i64 %87, %sunkaddr51
%sunkaddr53 = inttoptr i64 %sunkaddr52 to double*
%88 = load double* %sunkaddr53, align 8, !tbaa !7
%89 = fmul double %88, 1.000000e-02
%90 = fadd double %86, %89
%sunkaddr54 = ptrtoint i8* %lsr.iv4 to i64
%sunkaddr55 = mul i64 %8, 2
%sunkaddr56 = add i64 %sunkaddr54, %sunkaddr55
%sunkaddr57 = inttoptr i64 %sunkaddr56 to double*
store double %90, double* %sunkaddr57, align 8, !tbaa !7
%scevgep = getelementptr i1* %lsr.iv45, i64 %6
%91 = bitcast i1* %scevgep to i8*
%lsr.iv.next16 = sub i64 %lsr.iv15, %15
%lsr.iv.next22 = add i64 %lsr.iv21, -1
%exitcond34 = icmp eq i64 %lsr.iv.next22, 0
br i1 %exitcond34, label %"for_condition_3:16.loopexit", label %"loop_body_20:29"
"no_error_23:21": ; preds = %"exit_for_3:4"
tail call void @Py_XINCREF({ i64, i8* }* %25)
br label %cleanup_label
}
declare { i64, i8* }* @PyTuple_Pack(i64, ...)
declare double @"numba.math.['double'].sqrt"(double)
define { i64, i8* }* @__numba_specialized_3_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_nbody_hy(i8* %self, { i64, i8* }* %args) {
entry:
%objtemp = alloca { i64, i8* }*
store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
%0 = alloca { i64, i8* }*
%1 = alloca { i64, i8* }*
%return_value = alloca { i64, i8* }*
%2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
%3 = icmp eq i32 %2, 0
br i1 %3, label %cleanup.if.true, label %cleanup.if.end
cleanup_label: ; preds = %no_error, %error_label
%4 = load { i64, i8* }** %objtemp, !tbaa !6
call void @Py_XDECREF({ i64, i8* }* %4)
%5 = load { i64, i8* }** %return_value
ret { i64, i8* }* %5
error_label: ; preds = %cleanup.if.end, %cleanup.if.true
store { i64, i8* }* null, { i64, i8* }** %return_value
%6 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %6)
br label %cleanup_label
cleanup.if.true: ; preds = %entry
br label %error_label
cleanup.if.end: ; preds = %entry
%7 = load { i64, i8* }** %1
%8 = load { i64, i8* }** %0
%9 = bitcast { i64, i8* }* %7 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
%10 = bitcast { i64, i8* }* %8 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
%11 = call { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %9, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %10)
store { i64, i8* }* %11, { i64, i8* }** %objtemp, !tbaa !6
%12 = ptrtoint { i64, i8* }* %11 to i64
%13 = icmp eq i64 %12, 0
br i1 %13, label %error_label, label %no_error
no_error: ; preds = %cleanup.if.end
%14 = load { i64, i8* }** %objtemp, !tbaa !6
store { i64, i8* }* %14, { i64, i8* }** %return_value
%15 = load { i64, i8* }** %return_value, !tbaa !6
call void @Py_XINCREF({ i64, i8* }* %15)
br label %cleanup_label
}
!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6, !0, !1, !7, !8, !9, !10, !6, !2, !3, !4, !5, !11, !12, !13, !14, !0, !1, !6}
!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"float3264 *", metadata !1}
!8 = metadata !{metadata !"npy_intp **", metadata !1}
!9 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !8}
!10 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !8}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}
CPU times: user 424 ms, sys: 35.5 ms, total: 460 ms Wall time: 435 ms
The dynamically compiled version is by far the fastest.
func_list = ['nbody_py', 'nbody_np', 'nbody_nb', 'nbody_hy']
data_list = ['p_py, pv_py', 'p_np, pv_np', 'p_nb, pv_nb', 'p_nb, pv_nb']
perf_comp_data(func_list, data_list, rep=3)
function: nbody_nb, av. time sec: 0.00607, relative: 1.0 function: nbody_np, av. time sec: 0.09028, relative: 14.9 function: nbody_py, av. time sec: 0.54923, relative: 90.5 function: nbody_hy, av. time sec: 5.44442, relative: 896.8
Nowadays, the Python ecosystem provides a number of ways to improve the performance of code.
A major benefit of the Python ecosystem is, that all these approaches generally are easily implementable, meaning that the additional effort included is generally quite low (even for non-experts). In other words, performance improvements often are low hanging fruits.

The Python Quants – the company Web site
Dr. Yves J. Hilpisch – my personal Web site
Derivatives Analytics with Python – my new book
Contact Us