Dr. Yves J. Hilpisch
The Python Quants GmbH
PyData, New York City – 08. November 2013
Python for Performance Computing:
This tutorial focuses on
It does not address such important issues like
A fundamental Python stack for interactive data analytics and visualization should at least contain the following libraries tools:
It is best to use either the Python distribution Anaconda or the Web-based analytics environment Wakari. Both provide almost "complete" Python environments.
When it comes to performance critical operations, a number of additional libraries can be used to speed-up code execution.
All these are also part of Anaconda and Wakari.
We define a convenience function to compare the performance of a set of Python functions executed on the same data or on different data.
def perf_comp_data(func_list, data_list, rep=3, number=1):
from timeit import repeat
res_list = {}
for name in enumerate(func_list):
stmt = name[1] + '(' + data_list[name[0]] + ')'
setup = "from __main__ import " + name[1] + ', ' + data_list[name[0]]
results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
res_list[name[1]] = sum(results) / rep
res_sort = sorted(res_list.iteritems(), key=lambda (k, v): (v, k))
for item in res_sort:
rel = item[1] / res_sort[0][1]
print 'function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1] \
+ 'relative: %6.1f' % rel
A typical compute intensive operation is to evaluate a complex mathematical expression on a large array of data.
First, let us define an example function/numerical expression.
from math import *
def f(x):
return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)
Second, as our benchmark case a pure Python implementation.
I = 200000
a_py = range(I)
def f1(a):
res = []
for x in a:
res.append(f(x))
return res
One can also use different paradigms to implement the same function, like iterators or the eval function.
def f2(a):
return [f(x) for x in a]
def f3(a):
ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
return [eval(ex) for x in a]
Third, a vectorized implementations with NumPy.
import numpy as np
a_np = np.arange(I)
def f4(a):
return (np.abs(np.cos(a)) ** 0.5 +
np.sin(2 + 3 * a))
Fourth, we use numexpr to evaluate the numerical expression.
import numexpr as ne
def f5(a):
ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
ne.set_num_threads(1)
return ne.evaluate(ex)
def f6(a):
ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
ne.set_num_threads(4)
return ne.evaluate(ex)
Now, let's compare the performance of all different function implementations.
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np']
perf_comp_data(func_list, data_list, rep=1)
function: f6, av. time sec: 0.00627, relative: 1.0 function: f5, av. time sec: 0.01631, relative: 2.6 function: f4, av. time sec: 0.01861, relative: 3.0 function: f2, av. time sec: 0.15558, relative: 24.8 function: f1, av. time sec: 0.27338, relative: 43.6 function: f3, av. time sec: 6.27276, relative: 1000.1
Depending on the respective algorithm you implement with NumPy, memory layout can play a non-negligible role.
import numpy as np
np.zeros((3, 3), dtype=np.float64, order='C')
array([[ 0., 0., 0.], [ 0., 0., 0.], [ 0., 0., 0.]])
The way you initialize NumPy array can have a significant influence on the performance of operations on these arrays (given a certain size of the array).
Consider the C-like, i.e. row-wise, storage.
c = np.array([[ 1., 1., 1.],
[ 2., 2., 2.],
[ 3., 3., 3.]], order='C')
In this case, the 1s, the 2s and the 3s are stored next to each other.
By contrast, consider the Fortran-like, i.e. column-wise, storage.
f = np.array([[ 1., 1., 1.],
[ 2., 2., 2.],
[ 3., 3., 3.]], order='F')
Now, the data is stored in a way that 1, 2, 3 are next to each other for each column.
Let's see whether the memory layout makes a difference in some way when the array is large.
x = np.random.standard_normal((3, 1500000))
C = np.array(x, order='C')
F = np.array(x, order='F')
x = 0.0
Let's implement standard operations on the C-like layout array.
First, calculating sums.
%timeit C.sum(axis=0)
100 loops, best of 3: 11.1 ms per loop
%timeit C.sum(axis=1)
100 loops, best of 3: 5.93 ms per loop
Second, standard deviations.
%timeit C.std(axis=0)
10 loops, best of 3: 69.7 ms per loop
%timeit C.std(axis=1)
10 loops, best of 3: 31.6 ms per loop
Now the Fortran-like layout. Sums first.
%timeit F.sum(axis=0)
10 loops, best of 3: 35.6 ms per loop
%timeit F.sum(axis=1)
10 loops, best of 3: 26.4 ms per loop
Standard deviations second.
%timeit F.std(axis=0)
10 loops, best of 3: 117 ms per loop
%timeit F.std(axis=1)
10 loops, best of 3: 70.8 ms per loop
C = 0.0; F = 0.0
Currently, and in the foreseeable future as well, most corporate data is stored in SQL databases. We take this as a starting point and generate a SQL table with dummy data.
filename = 'numbers'
import sqlite3 as sq3 # imports the SQLite library
# Remove files if they exist
import os
try:
os.remove(filename + '.db')
os.remove(filename + '.h5')
os.remove(filename + '.hs')
os.remove(filename + '.csv')
except:
pass
We create a table and populate the table with the dummy data.
query = 'CREATE TABLE numbs (no1 real, no2 real, no3 real, no4 real, no5 real)'
con = sq3.connect(filename + '.db')
con.execute(query)
con.commit()
# Writing Random Numbers to SQLite3 Database
rows = 5000000 # 5,000,000 rows for the SQL table (ca. 250 MB in size)
nr = np.random.standard_normal((rows, 5))
%time con.executemany('INSERT INTO numbs VALUES(?, ?, ?, ?, ?)', nr)
con.commit()
CPU times: user 1min 2s, sys: 962 ms, total: 1min 3s Wall time: 1min 3s
We can now read the data into a NumPy array.
# Reading a couple of rows from SQLite3 database table
array(con.execute('SELECT * FROM numbs').fetchmany(3)).round(3)
array([[ 0.149, -0.874, 0.736, -0.551, -1.399], [ 1.299, -0.281, -1.416, 0.995, 0.337], [ 0.87 , -1.057, 0.544, -1.416, -2.613]])
# Reading the whole table into an NumPy array
%time result = array(con.execute('SELECT * FROM numbs').fetchall())
CPU times: user 39.3 s, sys: 1.83 s, total: 41.1 s Wall time: 41.6 s
PyTables, which is based on the HDF5 file format, allows much faster, disk-based I/O than a typical SQL database. For example, writing the same data is much faster compared to SQLite3.
import tables as tb
%%time
h5 = tb.open_file('numbers.h5', 'w')
h5.create_array('/', 'numbs', obj=result, title='numbs')
h5.close()
CPU times: user 3.89 ms, sys: 209 ms, total: 213 ms Wall time: 370 ms
nr = 0.0; result = 0.0 # memory clean-up
Reading speed is equally impressive.
%%time
h5 = tb.open_file('numbers.h5', 'r')
arr = h5.root.numbs.read()
h5.close()
CPU times: user 2.6 ms, sys: 167 ms, total: 170 ms Wall time: 158 ms
arr.nbytes
200000000
arr = 0.0 # memory clean-up
pandas is optimized both for convenience and performance when it comes to analytics and I/O operations.
import pandas as pd
import pandas.io.sql as pds
%time data = pds.read_frame('SELECT * FROM numbs', con)
con.close()
CPU times: user 18.3 s, sys: 1.51 s, total: 19.8 s Wall time: 20.2 s
data.head()
no1 | no2 | no3 | no4 | no5 | |
---|---|---|---|---|---|
0 | 0.149151 | -0.873864 | 0.736056 | -0.550805 | -1.399441 |
1 | 1.298855 | -0.281183 | -1.415695 | 0.995101 | 0.337464 |
2 | 0.870333 | -1.056957 | 0.543588 | -1.416220 | -2.613365 |
3 | 0.509911 | 0.139642 | 0.152018 | 0.197079 | 0.292907 |
4 | -1.048647 | -1.113482 | -0.037894 | 1.216061 | -0.263849 |
The plotting of larger data sets often is a problem. This can easily be accomplished here by only plotting every 500th data row, for instance.
%time data[::500].cumsum().plot(grid=True, legend=True) # every 500th row is plotted
CPU times: user 106 ms, sys: 5.53 ms, total: 111 ms Wall time: 119 ms
<matplotlib.axes.AxesSubplot at 0x107cae990>
There are some reasons why you may want to write the DataFrame to disk:
pandas is tightly integrated with PyTables. It makes data storage and I/O operations with pandas not only convenient but quite fast as well.
%%time
h5 = pd.HDFStore(filename + '.hs', 'w')
h5['data'] = data # writes the DataFrame to HDF5 database
h5.close()
CPU times: user 263 ms, sys: 374 ms, total: 638 ms Wall time: 793 ms
Now, we can read the data from the HDF5 file into another DataFrame object. Performance of such an operation generally is quite high.
%%time
h5 = pd.HDFStore(filename + '.hs', 'r')
data_new = h5['data']
h5.close()
CPU times: user 30 ms, sys: 276 ms, total: 306 ms Wall time: 305 ms
%time data.sum() - data_new.sum() # checking if the DataFrames are the same
CPU times: user 750 ms, sys: 231 ms, total: 981 ms Wall time: 976 ms
no1 0 no2 0 no3 0 no4 0 no5 0 dtype: float64
data = 0.0; data_new = 0.0 # memory clean-up
Some final checks.
ll numb*
-rw-r--r-- 1 yhilpisch staff 272386048 8 Nov 14:26 numbers.db -rw-rw-rw- 1 yhilpisch staff 200002304 8 Nov 14:26 numbers.h5 -rw-rw-rw- 1 yhilpisch staff 240007368 8 Nov 14:27 numbers.hs
try:
os.remove(filename + '.db')
os.remove(filename + '.h5')
os.remove(filename + '.hs')
os.remove(filename + '.csv')
except:
pass
The algorithm we (arbitrarily) choose is the Monte Carlo estimator for a European option value in the Black-Scholes-Merton set-up. In this set-up the underlying follows the stochastic differential equation:
\[dS_t = r S_t dt + \sigma S_t dZ_t\]
The Monte Carlo estimator for a call option value looks like follows:
\(C_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} \max[S_T(i) - K, 0]\)
where \(S_T(i)\) is the \(i\)-th simulated value of the underlying at maturity \(T\).
A dynamic implementation of the BSM Monte Carlo valuation could look like follows.
def BSM_MCS(S0):
''' Dynamic Black-Scholes-Merton MCS Estimator for European Calls. '''
import numpy as np
K = 100.; T = 1.0; r = 0.05; vola = 0.2
M = 50; I = 20000
dt = T / M
rand = np.random.standard_normal((M + 1, I))
S = np.zeros((M + 1, I)); S[0] = S0
for t in range(1, M + 1):
S[t] = S[t-1] * np.exp((r - 0.5 * vola ** 2) * dt + vola * np.sqrt(dt) * rand[t])
BSM_MCS_Value = np.sum(np.maximum(S[-1] - K, 0)) / I
return BSM_MCS_Value
The benchmark case of sequential valuation of \(n\) options.
def seqValue(n):
optionValues = []
for S in linspace(80, 100, n):
optionValues.append(BSM_MCS(S))
return optionValues
First, start a (local) cluster.
# in the shell ...
####
# ipcluster start -n 4
####
# or maybe more cores
Second, enable parallel computing capabilities.
from IPython.parallel import Client
c = Client(profile="default")
view = c.load_balanced_view()
The function for the parallel execution of a number \(n\) of different options valuations.
def parValue(n):
optionValues = []
for S in linspace(80, 100, n):
value = view.apply_async(BSM_MCS, S)
# asynchronously calculate the option values
optionValues.append(value)
c.wait(optionValues)
return optionValues
Let us compare performance of the two different approaches.
n = 20 # number of option valuations
func_list = ['seqValue', 'parValue']
data_list = 2 * ['n']
perf_comp_data(func_list, data_list, rep=3)
function: parValue, av. time sec: 0.75665, relative: 1.0 function: seqValue, av. time sec: 1.22566, relative: 1.6
Performance scales linearly with the number of CPU cores available.
Numba is an open-source, NumPy-aware optimizing compiler for Python sponsored by The Python Quants GmbH It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code especially for use in the NumPy run-time and SciPy modules.
Let us start with a problem that typically lead to performance issues in Python: alogrithms with a nested loops. A sandbox variant can illustrate the problem.
from math import cos, log
def f_py(I, J):
res = 0
for i in range(I):
for j in range (J):
res += int(cos(log(1)))
return res
I, J = 1000, 1000
%time f_py(I, J)
CPU times: user 491 ms, sys: 6.89 ms, total: 498 ms Wall time: 494 ms
1000000
In principle, this can be vectorized with NumPy arrays.
def f_np(I, J):
a = np.ones((I, J), dtype=np.float64)
return int(np.sum(np.cos(np.log(a)))), a
%time res, a = f_np(I, J)
CPU times: user 20.3 ms, sys: 6.44 ms, total: 26.7 ms Wall time: 25.8 ms
This is much faster, but not really memory efficient ...
a.nbytes
8000000
Numba can provide help.
import numba as nb
%time f_nb = nb.autojit(f_py)
CPU times: user 2.47 ms, sys: 657 µs, total: 3.13 ms Wall time: 2.73 ms
It preserves memory efficiency while speeding up execution considerably in this case.
%time f_nb(I, J)
DEBUG -- translate:361:translate ; ModuleID = 'tmp.module.__main__.f_py.107dfe578' @PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**) define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) { entry: %nsteps2 = alloca i64 %target_temp1 = alloca i64 %nsteps = alloca i64 %target_temp = alloca i64 %return_value = alloca i32 store i64 0, i64* %target_temp, !tbaa !2 %0 = sext i32 %I to i64 store i64 %0, i64* %nsteps, !tbaa !3 br label %"for_condition_4:13" cleanup_label: ; preds = %"exit_for_4:4", %error_label %1 = load i32* %return_value ret i32 %1 error_label: ; No predecessors! store i32 123456789, i32* %return_value br label %cleanup_label "for_condition_4:13": ; preds = %entry, %"exit_for_5:8" %res_2 = phi i32 [ 0, %entry ], [ %res_3, %"exit_for_5:8" ] %j_1 = phi i64 [ 123456789, %entry ], [ %j_2, %"exit_for_5:8" ] %2 = load i64* %target_temp, !tbaa !2 %3 = load i64* %nsteps, !tbaa !3 %4 = icmp slt i64 %2, %3 %5 = icmp ne i1 %4, false br i1 %5, label %"loop_body_5:8", label %"exit_for_4:4" "exit_for_4:4": ; preds = %"for_condition_4:13" store i32 %res_2, i32* %return_value br label %cleanup_label "loop_body_5:8": ; preds = %"for_condition_4:13" %6 = load i64* %target_temp, !tbaa !2 %7 = load i64* %target_temp, !tbaa !2 %8 = add i64 %7, 1 store i64 %8, i64* %target_temp, !tbaa !2 store i64 0, i64* %target_temp1, !tbaa !4 %9 = sext i32 %J to i64 store i64 %9, i64* %nsteps2, !tbaa !5 br label %"for_condition_5:17" "for_condition_5:17": ; preds = %"loop_body_5:8", %"loop_body_6:19" %res_3 = phi i32 [ %20, %"loop_body_6:19" ], [ %res_2, %"loop_body_5:8" ] %j_2 = phi i64 [ %14, %"loop_body_6:19" ], [ %j_1, %"loop_body_5:8" ] %10 = load i64* %target_temp1, !tbaa !4 %11 = load i64* %nsteps2, !tbaa !5 %12 = icmp slt i64 %10, %11 %13 = icmp ne i1 %12, false br i1 %13, label %"loop_body_6:19", label %"exit_for_5:8" "exit_for_5:8": ; preds = %"for_condition_5:17" br label %"for_condition_4:13" "loop_body_6:19": ; preds = %"for_condition_5:17" %14 = load i64* %target_temp1, !tbaa !4 %15 = load i64* %target_temp1, !tbaa !4 %16 = add i64 %15, 1 store i64 %16, i64* %target_temp1, !tbaa !4 %17 = call double @"numba.math.['double'].log"(double 1.000000e+00) %18 = call double @"numba.math.['double'].cos"(double %17) %19 = fptosi double %18 to i32 %20 = add i32 %res_3, %19 br label %"for_condition_5:17" } declare { i64, i8* }* @Py_BuildValue(i8*, ...) declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...) declare void @PyErr_Clear() declare double @"numba.math.['double'].cos"(double) declare double @"numba.math.['double'].log"(double) !tbaa = !{!0, !1, !2, !3, !4, !5} !0 = metadata !{metadata !"root"} !1 = metadata !{metadata !"char *", metadata !0} !2 = metadata !{metadata !"unique0", metadata !1} !3 = metadata !{metadata !"unique1", metadata !1} !4 = metadata !{metadata !"unique2", metadata !1} !5 = metadata !{metadata !"unique3", metadata !1} DEBUG -- translate:361:translate ; ModuleID = 'numba_executable_module' @PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**) define void @Py_INCREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = bitcast { i64, i8* }* %obj to i64* %1 = load i64* %0 %2 = add i64 %1, 1 store i64 %2, i64* %0 ret void } define void @Py_DECREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = bitcast { i64, i8* }* %obj to i64* %1 = load i64* %0 %2 = icmp sgt i64 %1, 1 br i1 %2, label %if.then, label %if.else if.then: ; preds = %decl %3 = add i64 %1, -1 store i64 %3, i64* %0 br label %if.end if.else: ; preds = %decl call void @Py_DecRef({ i64, i8* }* %obj) br label %if.end if.end: ; preds = %if.else, %if.then ret void } declare void @Py_DecRef({ i64, i8* }*) define void @Py_XINCREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = ptrtoint { i64, i8* }* %obj to i64 %1 = icmp ne i64 %0, 0 br i1 %1, label %if.then, label %if.end if.then: ; preds = %decl %2 = bitcast { i64, i8* }* %obj to i64* %3 = load i64* %2 %4 = add i64 %3, 1 store i64 %4, i64* %2 br label %if.end if.end: ; preds = %if.then, %decl ret void } define void @Py_XDECREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = ptrtoint { i64, i8* }* %obj to i64 %1 = icmp ne i64 %0, 0 br i1 %1, label %if.then, label %if.end if.then: ; preds = %decl call void @Py_DECREF({ i64, i8* }* %obj) br label %if.end if.end: ; preds = %if.then, %decl ret void } define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) { decl: %data1 = alloca i8* %in_shape2 = alloca i64* %in_strides3 = alloca i64* %src_dim4 = alloca i64 %index5 = alloca i64 %result = alloca i8* store i8* %data, i8** %data1 store i64* %in_shape, i64** %in_shape2 store i64* %in_strides, i64** %in_strides3 store i64 %src_dim, i64* %src_dim4 store i64 %index, i64* %index5 %0 = load i64** %in_strides3 %1 = load i64* %src_dim4 %2 = getelementptr inbounds i64* %0, i64 %1 %3 = load i64* %2 %4 = mul i64 %3, %index %5 = load i8** %data1 %6 = getelementptr inbounds i8* %5, i64 %4 store i8* %6, i8** %result ret i8* %6 } define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) { decl: %out_shape1 = alloca i64* %out_strides2 = alloca i64* %dst_dim3 = alloca i32 store i64* %out_shape, i64** %out_shape1 store i64* %out_strides, i64** %out_strides2 store i32 %dst_dim, i32* %dst_dim3 %0 = load i64** %out_shape1 %1 = getelementptr inbounds i64* %0, i32 %dst_dim store i64 1, i64* %1 %2 = load i64** %out_strides2 %3 = load i32* %dst_dim3 %4 = getelementptr inbounds i64* %2, i32 %3 store i64 0, i64* %4 ret void } define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) { decl: %dst_shape1 = alloca i64* %src_shape2 = alloca i64* %src_strides3 = alloca i64* %max_ndim4 = alloca i32 %ndim5 = alloca i32 %0 = alloca i32 store i64* %dst_shape, i64** %dst_shape1 store i64* %src_shape, i64** %src_shape2 store i64* %src_strides, i64** %src_strides3 store i32 %max_ndim, i32* %max_ndim4 store i32 %ndim, i32* %ndim5 %1 = load i32* %max_ndim4 %2 = sub i32 %1, %ndim store i32 0, i32* %0 br label %loop.cond loop.cond: ; preds = %if.end11, %decl %3 = load i32* %0 %4 = load i32* %ndim5 %5 = icmp slt i32 %3, %4 br i1 %5, label %loop.body, label %loop.end loop.body: ; preds = %loop.cond %6 = load i64** %src_shape2 %7 = getelementptr inbounds i64* %6, i32 %3 %8 = add i32 %3, %2 %9 = load i64** %dst_shape1 %10 = getelementptr inbounds i64* %9, i32 %8 %11 = load i64* %7 %12 = icmp eq i64 %11, 1 br i1 %12, label %if.then, label %if.else loop.end: ; preds = %if.else7, %loop.cond %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ] ret i32 %merge if.then: ; preds = %loop.body %13 = load i64** %src_strides3 %14 = getelementptr inbounds i64* %13, i32 %3 store i64 0, i64* %14 br label %if.end11 if.else: ; preds = %loop.body %15 = load i64* %10 %16 = icmp eq i64 %15, 1 br i1 %16, label %if.then6, label %if.else7 if.then6: ; preds = %if.else store i64 %11, i64* %10 br label %if.end11 if.else7: ; preds = %if.else %17 = icmp ne i64 %11, %15 br i1 %17, label %loop.end, label %if.end11 if.end11: ; preds = %if.else7, %if.then6, %if.then %18 = load i32* %0 %19 = add i32 %18, 1 store i32 %19, i32* %0 br label %loop.cond } define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) { entry: %0 = icmp sgt i32 %I, 0 %1 = icmp sgt i32 %J, 0 %or.cond = and i1 %0, %1 br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4" "loop_body_5:8.lr.ph.split.us": ; preds = %entry %2 = sext i32 %I to i64 %3 = sext i32 %J to i64 br label %"loop_body_6:19.lr.ph.us" "loop_body_6:19.us": ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us" %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ] %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ] %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00) %5 = tail call double @"numba.math.['double'].cos"(double %4) %6 = fptosi double %5 to i32 %7 = add i32 %6, %res_35.us %lsr.iv.next = add i64 %lsr.iv, -1 %exitcond = icmp eq i64 %lsr.iv.next, 0 br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us" "for_condition_4:13.loopexit.us": ; preds = %"loop_body_6:19.us" %exitcond8 = icmp eq i64 %9, %2 br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us" "loop_body_6:19.lr.ph.us": ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us" %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ] %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ] %9 = add i64 %8, 1 br label %"loop_body_6:19.us" "exit_for_4:4": ; preds = %"for_condition_4:13.loopexit.us", %entry %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ] ret i32 %res_2.lcssa } declare double @"numba.math.['double'].cos"(double) declare double @"numba.math.['double'].log"(double) define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) { entry: %objtemp = alloca { i64, i8* }* store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6 %0 = alloca { i64, i8* }* %1 = alloca { i64, i8* }* %return_value = alloca { i64, i8* }* %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0) %3 = icmp eq i32 %2, 0 br i1 %3, label %cleanup.if.true, label %cleanup.if.end cleanup_label: ; preds = %no_error, %error_label %4 = load { i64, i8* }** %objtemp, !tbaa !6 call void @Py_XDECREF({ i64, i8* }* %4) %5 = load { i64, i8* }** %return_value ret { i64, i8* }* %5 error_label: ; preds = %empty7, %empty8, %empty5, %empty2, %cleanup.if.true store { i64, i8* }* null, { i64, i8* }** %return_value %6 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %6) br label %cleanup_label cleanup.if.true: ; preds = %entry br label %error_label cleanup.if.end: ; preds = %entry %7 = load { i64, i8* }** %1 %8 = load { i64, i8* }** %0 %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7) br label %empty empty: ; preds = %cleanup.if.end %10 = call i8* @PyErr_Occurred() %11 = ptrtoint i8* %10 to i64 %12 = icmp ne i64 %11, 0 br i1 %12, label %empty2, label %empty1 empty1: ; preds = %empty %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8) br label %empty3 empty2: ; preds = %empty br label %error_label empty3: ; preds = %empty1 %14 = call i8* @PyErr_Occurred() %15 = ptrtoint i8* %14 to i64 %16 = icmp ne i64 %15, 0 br i1 %16, label %empty5, label %empty4 empty4: ; preds = %empty3 %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13) br label %empty6 empty5: ; preds = %empty3 br label %error_label empty6: ; preds = %empty4 %18 = call i8* @PyErr_Occurred() %19 = ptrtoint i8* %18 to i64 %20 = icmp ne i64 %19, 0 br i1 %20, label %empty8, label %empty7 empty7: ; preds = %empty6 %21 = sext i32 %17 to i64 %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21) store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6 %23 = ptrtoint { i64, i8* }* %22 to i64 %24 = icmp eq i64 %23, 0 br i1 %24, label %error_label, label %no_error empty8: ; preds = %empty6 br label %error_label no_error: ; preds = %empty7 %25 = load { i64, i8* }** %objtemp, !tbaa !6 store { i64, i8* }* %25, { i64, i8* }** %return_value %26 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %26) br label %cleanup_label } declare i8* @PyErr_Occurred() declare { i64, i8* }* @Py_BuildValue(i8*, ...) declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...) declare void @PyErr_Clear() !tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6} !0 = metadata !{metadata !"root"} !1 = metadata !{metadata !"char *", metadata !0} !2 = metadata !{metadata !"unique0", metadata !1} !3 = metadata !{metadata !"unique1", metadata !1} !4 = metadata !{metadata !"unique2", metadata !1} !5 = metadata !{metadata !"unique3", metadata !1} !6 = metadata !{metadata !"object", metadata !1}
CPU times: user 248 ms, sys: 27.5 ms, total: 275 ms Wall time: 337 ms
1000000
Let us compare the performance of the different alternatives.
func_list = ['f_py', 'f_np', 'f_nb']
data_list = 3 * ['I, J']
perf_comp_data(func_list, data_list, rep=3)
function: f_nb, av. time sec: 0.01615, relative: 1.0 function: f_np, av. time sec: 0.03077, relative: 1.9 function: f_py, av. time sec: 0.51423, relative: 31.8
The simulation of the movement of N bodies in relation to each other, where \(N\) is a large number and a body can be a star/planet or a small particle, is an important problem in Physics.
The basic simulation algorithm involves a number of iterations ("loops") since one has to determine the force that every single body exerts on all the other bodies in the system to be simulated (e.g. a galaxy or a system of atomic particles).
The algorithm lends iteself pretty well for parallization. However, the example code that follows does not use any parallelization techniques. It illustrates rather what speed-ups are achievable by
A description and discussion of the algorithm is found in the recent article Vladimirov, Andrey and Vadim Karpusenko (2013): "Test-Driving Intel Xeon-Phi Coprocessors with a Basic N-Body Simulation." White Paper, Colfax International.
We first use pure Python to implement the algorithm. We start by generating random vectors of particle locations and a zero vector for the speed of each particle.
import random
nParticles = 150
nSteps = 15
p_py = []
for i in range(nParticles):
p_py.append([random.gauss(0.0, 1.0) for j in range(3)])
pv_py = [[0, 0, 0] for x in p_py]
Next, we define the simulation function for the movements of each particle given their gravitational interaction.
def nbody_py(particle, particlev): # lists as input
dt = 0.01
for step in range(1, nSteps + 1, 1):
for i in range(nParticles):
Fx = 0.0; Fy = 0.0; Fz = 0.0
for j in range(nParticles):
if j != i:
dx = particle[j][0] - particle[i][0]
dy = particle[j][1] - particle[i][1]
dz = particle[j][2] - particle[i][2]
drSquared = dx * dx + dy * dy + dz * dz
drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
Fx += dx * drPowerN32
Fy += dy * drPowerN32
Fz += dz * drPowerN32
particlev[i][0] += dt * Fx
particlev[i][1] += dt * Fy
particlev[i][2] += dt * Fz
for i in range(nParticles):
particle[i][0] += particlev[i][0] * dt
particle[i][1] += particlev[i][1] * dt
particle[i][2] += particlev[i][2] * dt
return particle, particlev
We then execute the function to measure execution speed.
%time p_py, pv_py = nbody_py(p_py, pv_py)
CPU times: user 593 ms, sys: 7.4 ms, total: 601 ms Wall time: 596 ms
The basic algorithm allows for a number of vectorizations by using NumPy.
import numpy as np
p_np = np.random.standard_normal((nParticles, 3))
pv_np = np.zeros_like(p_np)
Using NumPy arrays and matplotlib's 3d plotting capabilities, we can easily visualize the intial positions of the particles in the space.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
c='r', marker='.')
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10b3617d0>
The respective NumPy simulation function is considerably more compact.
def nbody_np(particle, particlev): # NumPy arrays as input
nSteps = 15; dt = 0.01
for step in range(1, nSteps + 1, 1):
Fp = np.zeros((nParticles, 3))
for i in range(nParticles):
dp = particle - particle[i]
drSquared = np.sum(dp ** 2, axis=1)
h = drSquared + np.sqrt(drSquared)
drPowerN32 = 1. / np.maximum(h, 1E-10)
Fp += -(dp.T * drPowerN32).T
particlev += dt * Fp
particle += particlev * dt
return particle, particlev
It is also considerably faster.
%time p_np, pv_np = nbody_np(p_np, pv_np)
CPU times: user 106 ms, sys: 1.25 ms, total: 107 ms Wall time: 106 ms
Let's check how the particles moved. The following is a snapshot of the final positions of all particles.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
c='r', marker='.')
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10bac3490>
For this case, we first implement a function which is somehow a hybrid function of the pure Python (loop-heavy) and the NumPy array class-based version.
def nbody_hy(particle, particlev): # NumPy arrays as input
dt = 0.01
for step in range(1, nSteps + 1, 1):
for i in range(nParticles):
Fx = 0.0; Fy = 0.0; Fz = 0.0
for j in range(nParticles):
if j != i:
dx = particle[j,0] - particle[i,0]
dy = particle[j,1] - particle[i,1]
dz = particle[j,2] - particle[i,2]
drSquared = dx * dx + dy * dy + dz * dz
drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
Fx += dx * drPowerN32
Fy += dy * drPowerN32
Fz += dz * drPowerN32
particlev[i, 0] += dt * Fx
particlev[i, 1] += dt * Fy
particlev[i, 2] += dt * Fz
for i in range(nParticles):
particle[i,0] += particlev[i,0] * dt
particle[i,1] += particlev[i,1] * dt
particle[i,2] += particlev[i,2] * dt
return particle, particlev
This function is then compiled with Numba.
import numba as nb
nbody_nb = nb.autojit(nbody_hy)
This new, compiled version of the hybrid function is yet faster.
p_nb = np.random.standard_normal((nParticles, 3))
pv_nb = np.zeros_like(p_nb)
%time p_nb, pv_nb = nbody_nb(p_nb, pv_nb)
DEBUG -- translate:361:translate ; ModuleID = 'tmp.module.__main__.nbody_hy.10baa19b0' @PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**) define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) { entry: %objtemp = alloca { i64, i8* }* store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6 %nsteps8 = alloca i64 %target_temp7 = alloca i64 %nsteps6 = alloca i64 %target_temp5 = alloca i64 %nsteps4 = alloca i64 %target_temp3 = alloca i64 %nsteps = alloca i64 %target_temp = alloca i64 %particlev2 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle1 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1 %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }* call void @Py_INCREF({ i64, i8* }* %0) %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 2 %2 = load i8** %1, !tbaa !2 %3 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 4 %4 = load i64** %3, !tbaa !4 %5 = getelementptr i64* %4, i32 0 %6 = load i64* %5 %7 = getelementptr i64* %4, i32 1 %8 = load i64* %7 %9 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 5 %10 = load i64** %9, !tbaa !5 %11 = getelementptr i64* %10, i32 0 %12 = load i64* %11 %13 = getelementptr i64* %10, i32 1 %14 = load i64* %13 store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2 %15 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }* call void @Py_INCREF({ i64, i8* }* %15) %16 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 2 %17 = load i8** %16, !tbaa !2 %18 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 4 %19 = load i64** %18, !tbaa !4 %20 = getelementptr i64* %19, i32 0 %21 = load i64* %20 %22 = getelementptr i64* %19, i32 1 %23 = load i64* %22 %24 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 5 %25 = load i64** %24, !tbaa !5 %26 = getelementptr i64* %25, i32 0 %27 = load i64* %26 %28 = getelementptr i64* %25, i32 1 %29 = load i64* %28 %return_value = alloca { i64, i8* }* store i64 0, i64* %target_temp, !tbaa !7 br i1 true, label %ifexp.then, label %ifexp.else cleanup_label: ; preds = %"no_error_23:21", %error_label %30 = load { i64, i8* }** %objtemp, !tbaa !6 call void @Py_XDECREF({ i64, i8* }* %30) %31 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2, !tbaa !6 %32 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %31 to { i64, i8* }* call void @Py_XDECREF({ i64, i8* }* %32) %33 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1, !tbaa !6 %34 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %33 to { i64, i8* }* call void @Py_XDECREF({ i64, i8* }* %34) %35 = load { i64, i8* }** %return_value ret { i64, i8* }* %35 error_label: ; preds = %"exit_for_3:4" store { i64, i8* }* null, { i64, i8* }** %return_value %36 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %36) br label %cleanup_label ifexp.then: ; preds = %entry br label %ifexp.merge ifexp.else: ; preds = %entry br label %ifexp.merge ifexp.merge: ; preds = %ifexp.else, %ifexp.then %37 = phi i32 [ 1, %ifexp.then ], [ -1, %ifexp.else ] %38 = sext i32 %37 to i64 %39 = sub i64 16, %38 %40 = sdiv i64 %39, 1 store i64 %40, i64* %nsteps, !tbaa !8 br label %"for_condition_3:16" "for_condition_3:16": ; preds = %ifexp.merge, %"exit_for_19:8" %i_1 = phi i64 [ 123456789, %ifexp.merge ], [ %i_4, %"exit_for_19:8" ] %Fx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fx_2, %"exit_for_19:8" ] %drSquared_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drSquared_2, %"exit_for_19:8" ] %Fy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fy_2, %"exit_for_19:8" ] %Fz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fz_2, %"exit_for_19:8" ] %dz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dz_2, %"exit_for_19:8" ] %j_1 = phi i64 [ 123456789, %ifexp.merge ], [ %j_2, %"exit_for_19:8" ] %dx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dx_2, %"exit_for_19:8" ] %dy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dy_2, %"exit_for_19:8" ] %drPowerN32_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drPowerN32_2, %"exit_for_19:8" ] %41 = load i64* %target_temp, !tbaa !7 %42 = load i64* %nsteps, !tbaa !8 %43 = icmp slt i64 %41, %42 %44 = icmp ne i1 %43, false br i1 %44, label %"loop_body_4:8", label %"exit_for_3:4" "exit_for_3:4": ; preds = %"for_condition_3:16" %45 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }* %46 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }* %47 = call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8* }* %45, { i64, i8* }* %46) store { i64, i8* }* %47, { i64, i8* }** %objtemp, !tbaa !6 %48 = ptrtoint { i64, i8* }* %47 to i64 %49 = icmp eq i64 %48, 0 br i1 %49, label %error_label, label %"no_error_23:21" "loop_body_4:8": ; preds = %"for_condition_3:16" %50 = load i64* %target_temp, !tbaa !7 %51 = mul i64 %50, 1 %52 = add i64 1, %51 %53 = load i64* %target_temp, !tbaa !7 %54 = add i64 %53, 1 store i64 %54, i64* %target_temp, !tbaa !7 store i64 0, i64* %target_temp3, !tbaa !9 store i64 150, i64* %nsteps4, !tbaa !10 br label %"for_condition_4:17" "for_condition_4:17": ; preds = %"loop_body_4:8", %"exit_for_6:12" %i_2 = phi i64 [ %i_1, %"loop_body_4:8" ], [ %59, %"exit_for_6:12" ] %Fx_2 = phi double [ %Fx_1, %"loop_body_4:8" ], [ %Fx_4, %"exit_for_6:12" ] %drSquared_2 = phi double [ %drSquared_1, %"loop_body_4:8" ], [ %drSquared_3, %"exit_for_6:12" ] %Fy_2 = phi double [ %Fy_1, %"loop_body_4:8" ], [ %Fy_4, %"exit_for_6:12" ] %Fz_2 = phi double [ %Fz_1, %"loop_body_4:8" ], [ %Fz_4, %"exit_for_6:12" ] %j_2 = phi i64 [ %j_1, %"loop_body_4:8" ], [ %j_3, %"exit_for_6:12" ] %dx_2 = phi double [ %dx_1, %"loop_body_4:8" ], [ %dx_3, %"exit_for_6:12" ] %dy_2 = phi double [ %dy_1, %"loop_body_4:8" ], [ %dy_3, %"exit_for_6:12" ] %dz_2 = phi double [ %dz_1, %"loop_body_4:8" ], [ %dz_3, %"exit_for_6:12" ] %drPowerN32_2 = phi double [ %drPowerN32_1, %"loop_body_4:8" ], [ %drPowerN32_3, %"exit_for_6:12" ] %55 = load i64* %target_temp3, !tbaa !9 %56 = load i64* %nsteps4, !tbaa !10 %57 = icmp slt i64 %55, %56 %58 = icmp ne i1 %57, false br i1 %58, label %"loop_body_5:12", label %"exit_for_4:8" "exit_for_4:8": ; preds = %"for_condition_4:17" store i64 0, i64* %target_temp7, !tbaa !13 store i64 150, i64* %nsteps8, !tbaa !14 br label %"for_condition_19:17" "loop_body_5:12": ; preds = %"for_condition_4:17" %59 = load i64* %target_temp3, !tbaa !9 %60 = load i64* %target_temp3, !tbaa !9 %61 = add i64 %60, 1 store i64 %61, i64* %target_temp3, !tbaa !9 store i64 0, i64* %target_temp5, !tbaa !11 store i64 150, i64* %nsteps6, !tbaa !12 br label %"for_condition_6:21" "for_condition_6:21": ; preds = %"loop_body_5:12", %"exit_if_7:16" %drSquared_3 = phi double [ %drSquared_2, %"loop_body_5:12" ], [ %drSquared_5, %"exit_if_7:16" ] %Fx_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ] %Fy_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ] %Fz_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ] %j_3 = phi i64 [ %j_2, %"loop_body_5:12" ], [ %66, %"exit_if_7:16" ] %dx_3 = phi double [ %dx_2, %"loop_body_5:12" ], [ %dx_5, %"exit_if_7:16" ] %dy_3 = phi double [ %dy_2, %"loop_body_5:12" ], [ %dy_5, %"exit_if_7:16" ] %dz_3 = phi double [ %dz_2, %"loop_body_5:12" ], [ %dz_5, %"exit_if_7:16" ] %drPowerN32_3 = phi double [ %drPowerN32_2, %"loop_body_5:12" ], [ %drPowerN32_5, %"exit_if_7:16" ] %62 = load i64* %target_temp5, !tbaa !11 %63 = load i64* %nsteps6, !tbaa !12 %64 = icmp slt i64 %62, %63 %65 = icmp ne i1 %64, false br i1 %65, label %"loop_body_7:16", label %"exit_for_6:12" "exit_for_6:12": ; preds = %"for_condition_6:21" br label %"for_condition_4:17" "loop_body_7:16": ; preds = %"for_condition_6:21" %66 = load i64* %target_temp5, !tbaa !11 %67 = load i64* %target_temp5, !tbaa !11 %68 = add i64 %67, 1 store i64 %68, i64* %target_temp5, !tbaa !11 br label %"if_cond_7:19" "if_cond_7:19": ; preds = %"loop_body_7:16" %69 = icmp ne i64 %66, %59 %70 = icmp ne i1 %69, false br i1 %70, label %"if_body_8:20", label %"exit_if_7:16" "exit_if_7:16": ; preds = %"if_cond_7:19", %"if_body_8:20" %drSquared_5 = phi double [ %165, %"if_body_8:20" ], [ %drSquared_3, %"if_cond_7:19" ] %Fx_6 = phi double [ %170, %"if_body_8:20" ], [ %Fx_4, %"if_cond_7:19" ] %Fy_6 = phi double [ %172, %"if_body_8:20" ], [ %Fy_4, %"if_cond_7:19" ] %Fz_6 = phi double [ %174, %"if_body_8:20" ], [ %Fz_4, %"if_cond_7:19" ] %dx_5 = phi double [ %130, %"if_body_8:20" ], [ %dx_3, %"if_cond_7:19" ] %dy_5 = phi double [ %145, %"if_body_8:20" ], [ %dy_3, %"if_cond_7:19" ] %dz_5 = phi double [ %160, %"if_body_8:20" ], [ %dz_3, %"if_cond_7:19" ] %drPowerN32_5 = phi double [ %168, %"if_body_8:20" ], [ %drPowerN32_3, %"if_cond_7:19" ] %71 = mul i64 %59, %27 %72 = add i64 0, %71 %73 = mul i64 0, %29 %74 = add i64 %72, %73 %75 = getelementptr i8* %17, i64 %74 %76 = bitcast i8* %75 to double* %77 = load double* %76, !tbaa !2 %78 = fmul double 1.000000e-02, %Fx_6 %79 = fadd double %77, %78 %80 = mul i64 %59, %27 %81 = add i64 0, %80 %82 = mul i64 0, %29 %83 = add i64 %81, %82 %84 = getelementptr i8* %17, i64 %83 %85 = bitcast i8* %84 to double* store double %79, double* %85, !tbaa !2 %86 = mul i64 %59, %27 %87 = add i64 0, %86 %88 = mul i64 1, %29 %89 = add i64 %87, %88 %90 = getelementptr i8* %17, i64 %89 %91 = bitcast i8* %90 to double* %92 = load double* %91, !tbaa !2 %93 = fmul double 1.000000e-02, %Fy_6 %94 = fadd double %92, %93 %95 = mul i64 %59, %27 %96 = add i64 0, %95 %97 = mul i64 1, %29 %98 = add i64 %96, %97 %99 = getelementptr i8* %17, i64 %98 %100 = bitcast i8* %99 to double* store double %94, double* %100, !tbaa !2 %101 = mul i64 %59, %27 %102 = add i64 0, %101 %103 = mul i64 2, %29 %104 = add i64 %102, %103 %105 = getelementptr i8* %17, i64 %104 %106 = bitcast i8* %105 to double* %107 = load double* %106, !tbaa !2 %108 = fmul double 1.000000e-02, %Fz_6 %109 = fadd double %107, %108 %110 = mul i64 %59, %27 %111 = add i64 0, %110 %112 = mul i64 2, %29 %113 = add i64 %111, %112 %114 = getelementptr i8* %17, i64 %113 %115 = bitcast i8* %114 to double* store double %109, double* %115, !tbaa !2 br label %"for_condition_6:21" "if_body_8:20": ; preds = %"if_cond_7:19" %116 = mul i64 %66, %12 %117 = add i64 0, %116 %118 = mul i64 0, %14 %119 = add i64 %117, %118 %120 = getelementptr i8* %2, i64 %119 %121 = bitcast i8* %120 to double* %122 = load double* %121, !tbaa !2 %123 = mul i64 %59, %12 %124 = add i64 0, %123 %125 = mul i64 0, %14 %126 = add i64 %124, %125 %127 = getelementptr i8* %2, i64 %126 %128 = bitcast i8* %127 to double* %129 = load double* %128, !tbaa !2 %130 = fsub double %122, %129 %131 = mul i64 %66, %12 %132 = add i64 0, %131 %133 = mul i64 1, %14 %134 = add i64 %132, %133 %135 = getelementptr i8* %2, i64 %134 %136 = bitcast i8* %135 to double* %137 = load double* %136, !tbaa !2 %138 = mul i64 %59, %12 %139 = add i64 0, %138 %140 = mul i64 1, %14 %141 = add i64 %139, %140 %142 = getelementptr i8* %2, i64 %141 %143 = bitcast i8* %142 to double* %144 = load double* %143, !tbaa !2 %145 = fsub double %137, %144 %146 = mul i64 %66, %12 %147 = add i64 0, %146 %148 = mul i64 2, %14 %149 = add i64 %147, %148 %150 = getelementptr i8* %2, i64 %149 %151 = bitcast i8* %150 to double* %152 = load double* %151, !tbaa !2 %153 = mul i64 %59, %12 %154 = add i64 0, %153 %155 = mul i64 2, %14 %156 = add i64 %154, %155 %157 = getelementptr i8* %2, i64 %156 %158 = bitcast i8* %157 to double* %159 = load double* %158, !tbaa !2 %160 = fsub double %152, %159 %161 = fmul double %130, %130 %162 = fmul double %145, %145 %163 = fadd double %161, %162 %164 = fmul double %160, %160 %165 = fadd double %163, %164 %166 = call double @"numba.math.['double'].sqrt"(double %165) %167 = fadd double %165, %166 %168 = fdiv double 1.000000e+00, %167 %169 = fmul double %130, %168 %170 = fadd double %Fx_4, %169 %171 = fmul double %145, %168 %172 = fadd double %Fy_4, %171 %173 = fmul double %160, %168 %174 = fadd double %Fz_4, %173 br label %"exit_if_7:16" "for_condition_19:17": ; preds = %"exit_for_4:8", %"loop_body_20:29" %i_4 = phi i64 [ %i_2, %"exit_for_4:8" ], [ %179, %"loop_body_20:29" ] %175 = load i64* %target_temp7, !tbaa !13 %176 = load i64* %nsteps8, !tbaa !14 %177 = icmp slt i64 %175, %176 %178 = icmp ne i1 %177, false br i1 %178, label %"loop_body_20:29", label %"exit_for_19:8" "exit_for_19:8": ; preds = %"for_condition_19:17" br label %"for_condition_3:16" "loop_body_20:29": ; preds = %"for_condition_19:17" %179 = load i64* %target_temp7, !tbaa !13 %180 = load i64* %target_temp7, !tbaa !13 %181 = add i64 %180, 1 store i64 %181, i64* %target_temp7, !tbaa !13 %182 = mul i64 %179, %12 %183 = add i64 0, %182 %184 = mul i64 0, %14 %185 = add i64 %183, %184 %186 = getelementptr i8* %2, i64 %185 %187 = bitcast i8* %186 to double* %188 = load double* %187, !tbaa !2 %189 = mul i64 %179, %27 %190 = add i64 0, %189 %191 = mul i64 0, %29 %192 = add i64 %190, %191 %193 = getelementptr i8* %17, i64 %192 %194 = bitcast i8* %193 to double* %195 = load double* %194, !tbaa !2 %196 = fmul double %195, 1.000000e-02 %197 = fadd double %188, %196 %198 = mul i64 %179, %12 %199 = add i64 0, %198 %200 = mul i64 0, %14 %201 = add i64 %199, %200 %202 = getelementptr i8* %2, i64 %201 %203 = bitcast i8* %202 to double* store double %197, double* %203, !tbaa !2 %204 = mul i64 %179, %12 %205 = add i64 0, %204 %206 = mul i64 1, %14 %207 = add i64 %205, %206 %208 = getelementptr i8* %2, i64 %207 %209 = bitcast i8* %208 to double* %210 = load double* %209, !tbaa !2 %211 = mul i64 %179, %27 %212 = add i64 0, %211 %213 = mul i64 1, %29 %214 = add i64 %212, %213 %215 = getelementptr i8* %17, i64 %214 %216 = bitcast i8* %215 to double* %217 = load double* %216, !tbaa !2 %218 = fmul double %217, 1.000000e-02 %219 = fadd double %210, %218 %220 = mul i64 %179, %12 %221 = add i64 0, %220 %222 = mul i64 1, %14 %223 = add i64 %221, %222 %224 = getelementptr i8* %2, i64 %223 %225 = bitcast i8* %224 to double* store double %219, double* %225, !tbaa !2 %226 = mul i64 %179, %12 %227 = add i64 0, %226 %228 = mul i64 2, %14 %229 = add i64 %227, %228 %230 = getelementptr i8* %2, i64 %229 %231 = bitcast i8* %230 to double* %232 = load double* %231, !tbaa !2 %233 = mul i64 %179, %27 %234 = add i64 0, %233 %235 = mul i64 2, %29 %236 = add i64 %234, %235 %237 = getelementptr i8* %17, i64 %236 %238 = bitcast i8* %237 to double* %239 = load double* %238, !tbaa !2 %240 = fmul double %239, 1.000000e-02 %241 = fadd double %232, %240 %242 = mul i64 %179, %12 %243 = add i64 0, %242 %244 = mul i64 2, %14 %245 = add i64 %243, %244 %246 = getelementptr i8* %2, i64 %245 %247 = bitcast i8* %246 to double* store double %241, double* %247, !tbaa !2 br label %"for_condition_19:17" "no_error_23:21": ; preds = %"exit_for_3:4" %248 = load { i64, i8* }** %objtemp, !tbaa !6 store { i64, i8* }* %248, { i64, i8* }** %return_value %249 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %249) br label %cleanup_label } declare { i64, i8* }* @PyTuple_Pack(i64, ...) declare { i64, i8* }* @Py_BuildValue(i8*, ...) declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...) declare void @PyErr_Clear() declare void @Py_INCREF({ i64, i8* }*) declare void @Py_XINCREF({ i64, i8* }*) declare double @"numba.math.['double'].sqrt"(double) declare void @Py_XDECREF({ i64, i8* }*) !tbaa = !{!0, !1, !2, !3, !4, !5, !6, !7, !8, !9, !10, !11, !12, !13, !14} !0 = metadata !{metadata !"root"} !1 = metadata !{metadata !"char *", metadata !0} !2 = metadata !{metadata !"float3264 *", metadata !1} !3 = metadata !{metadata !"npy_intp **", metadata !1} !4 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !3} !5 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !3} !6 = metadata !{metadata !"object", metadata !1} !7 = metadata !{metadata !"unique0", metadata !1} !8 = metadata !{metadata !"unique1", metadata !1} !9 = metadata !{metadata !"unique2", metadata !1} !10 = metadata !{metadata !"unique3", metadata !1} !11 = metadata !{metadata !"unique4", metadata !1} !12 = metadata !{metadata !"unique5", metadata !1} !13 = metadata !{metadata !"unique6", metadata !1} !14 = metadata !{metadata !"unique7", metadata !1} DEBUG -- translate:361:translate ; ModuleID = 'numba_executable_module' @PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**) define void @Py_INCREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = bitcast { i64, i8* }* %obj to i64* %1 = load i64* %0 %2 = add i64 %1, 1 store i64 %2, i64* %0 ret void } define void @Py_DECREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = bitcast { i64, i8* }* %obj to i64* %1 = load i64* %0 %2 = icmp sgt i64 %1, 1 br i1 %2, label %if.then, label %if.else if.then: ; preds = %decl %3 = bitcast { i64, i8* }* %obj to i64* %4 = add i64 %1, -1 store i64 %4, i64* %3 br label %if.end if.else: ; preds = %decl call void @Py_DecRef({ i64, i8* }* %obj) br label %if.end if.end: ; preds = %if.else, %if.then ret void } declare void @Py_DecRef({ i64, i8* }*) define void @Py_XINCREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = ptrtoint { i64, i8* }* %obj to i64 %1 = icmp ne i64 %0, 0 br i1 %1, label %if.then, label %if.end if.then: ; preds = %decl %2 = bitcast { i64, i8* }* %obj to i64* %3 = load i64* %2 %4 = add i64 %3, 1 store i64 %4, i64* %2 br label %if.end if.end: ; preds = %if.then, %decl ret void } define void @Py_XDECREF({ i64, i8* }* %obj) { decl: %obj1 = alloca { i64, i8* }* store { i64, i8* }* %obj, { i64, i8* }** %obj1 %0 = ptrtoint { i64, i8* }* %obj to i64 %1 = icmp ne i64 %0, 0 br i1 %1, label %if.then, label %if.end if.then: ; preds = %decl call void @Py_DECREF({ i64, i8* }* %obj) br label %if.end if.end: ; preds = %if.then, %decl ret void } define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) { decl: %data1 = alloca i8* %in_shape2 = alloca i64* %in_strides3 = alloca i64* %src_dim4 = alloca i64 %index5 = alloca i64 %result = alloca i8* store i8* %data, i8** %data1 store i64* %in_shape, i64** %in_shape2 store i64* %in_strides, i64** %in_strides3 store i64 %src_dim, i64* %src_dim4 store i64 %index, i64* %index5 %0 = load i64** %in_strides3 %1 = load i64* %src_dim4 %2 = getelementptr inbounds i64* %0, i64 %1 %3 = load i64* %2 %4 = mul i64 %3, %index %5 = load i8** %data1 %6 = getelementptr inbounds i8* %5, i64 %4 store i8* %6, i8** %result ret i8* %6 } define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) { decl: %out_shape1 = alloca i64* %out_strides2 = alloca i64* %dst_dim3 = alloca i32 store i64* %out_shape, i64** %out_shape1 store i64* %out_strides, i64** %out_strides2 store i32 %dst_dim, i32* %dst_dim3 %0 = load i64** %out_shape1 %1 = getelementptr inbounds i64* %0, i32 %dst_dim store i64 1, i64* %1 %2 = load i64** %out_strides2 %3 = load i32* %dst_dim3 %4 = getelementptr inbounds i64* %2, i32 %3 store i64 0, i64* %4 ret void } define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) { decl: %dst_shape1 = alloca i64* %src_shape2 = alloca i64* %src_strides3 = alloca i64* %max_ndim4 = alloca i32 %ndim5 = alloca i32 %0 = alloca i32 store i64* %dst_shape, i64** %dst_shape1 store i64* %src_shape, i64** %src_shape2 store i64* %src_strides, i64** %src_strides3 store i32 %max_ndim, i32* %max_ndim4 store i32 %ndim, i32* %ndim5 %1 = load i32* %max_ndim4 %2 = sub i32 %1, %ndim store i32 0, i32* %0 br label %loop.cond loop.cond: ; preds = %if.end11, %decl %3 = load i32* %0 %4 = load i32* %ndim5 %5 = icmp slt i32 %3, %4 br i1 %5, label %loop.body, label %loop.end loop.body: ; preds = %loop.cond %6 = load i64** %src_shape2 %7 = getelementptr inbounds i64* %6, i32 %3 %8 = add i32 %3, %2 %9 = load i64** %dst_shape1 %10 = getelementptr inbounds i64* %9, i32 %8 %11 = load i64* %7 %12 = icmp eq i64 %11, 1 br i1 %12, label %if.then, label %if.else loop.end: ; preds = %if.else7, %loop.cond %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ] ret i32 %merge if.then: ; preds = %loop.body %13 = load i64** %src_strides3 %14 = getelementptr inbounds i64* %13, i32 %3 store i64 0, i64* %14 br label %if.end11 if.else: ; preds = %loop.body %15 = load i64* %10 %16 = icmp eq i64 %15, 1 br i1 %16, label %if.then6, label %if.else7 if.then6: ; preds = %if.else store i64 %11, i64* %10 br label %if.end11 if.else7: ; preds = %if.else %17 = icmp ne i64 %11, %15 br i1 %17, label %loop.end, label %if.end11 if.end11: ; preds = %if.else7, %if.then6, %if.then %18 = load i32* %0 %19 = add i32 %18, 1 store i32 %19, i32* %0 br label %loop.cond } define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) { entry: %0 = icmp sgt i32 %I, 0 %1 = icmp sgt i32 %J, 0 %or.cond = and i1 %0, %1 br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4" "loop_body_5:8.lr.ph.split.us": ; preds = %entry %2 = sext i32 %I to i64 %3 = sext i32 %J to i64 br label %"loop_body_6:19.lr.ph.us" "loop_body_6:19.us": ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us" %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ] %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ] %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00) %5 = tail call double @"numba.math.['double'].cos"(double %4) %6 = fptosi double %5 to i32 %7 = add i32 %6, %res_35.us %lsr.iv.next = add i64 %lsr.iv, -1 %exitcond = icmp eq i64 %lsr.iv.next, 0 br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us" "for_condition_4:13.loopexit.us": ; preds = %"loop_body_6:19.us" %exitcond8 = icmp eq i64 %9, %2 br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us" "loop_body_6:19.lr.ph.us": ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us" %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ] %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ] %9 = add i64 %8, 1 br label %"loop_body_6:19.us" "exit_for_4:4": ; preds = %"for_condition_4:13.loopexit.us", %entry %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ] ret i32 %res_2.lcssa } declare double @"numba.math.['double'].cos"(double) declare double @"numba.math.['double'].log"(double) define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) { entry: %objtemp = alloca { i64, i8* }* store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6 %0 = alloca { i64, i8* }* %1 = alloca { i64, i8* }* %return_value = alloca { i64, i8* }* %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0) %3 = icmp eq i32 %2, 0 br i1 %3, label %error_label, label %empty cleanup_label: ; preds = %no_error, %error_label %4 = load { i64, i8* }** %objtemp, !tbaa !6 call void @Py_XDECREF({ i64, i8* }* %4) %5 = load { i64, i8* }** %return_value ret { i64, i8* }* %5 error_label: ; preds = %empty6, %empty3, %empty, %entry, %empty7 store { i64, i8* }* null, { i64, i8* }** %return_value %6 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %6) br label %cleanup_label empty: ; preds = %entry %7 = load { i64, i8* }** %1 %8 = load { i64, i8* }** %0 %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7) %10 = call i8* @PyErr_Occurred() %11 = ptrtoint i8* %10 to i64 %12 = icmp ne i64 %11, 0 br i1 %12, label %error_label, label %empty3 empty3: ; preds = %empty %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8) %14 = call i8* @PyErr_Occurred() %15 = ptrtoint i8* %14 to i64 %16 = icmp ne i64 %15, 0 br i1 %16, label %error_label, label %empty6 empty6: ; preds = %empty3 %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13) %18 = call i8* @PyErr_Occurred() %19 = ptrtoint i8* %18 to i64 %20 = icmp ne i64 %19, 0 br i1 %20, label %error_label, label %empty7 empty7: ; preds = %empty6 %21 = sext i32 %17 to i64 %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21) store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6 %23 = ptrtoint { i64, i8* }* %22 to i64 %24 = icmp eq i64 %23, 0 br i1 %24, label %error_label, label %no_error no_error: ; preds = %empty7 %25 = load { i64, i8* }** %objtemp, !tbaa !6 store { i64, i8* }* %25, { i64, i8* }** %return_value %26 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %26) br label %cleanup_label } declare i8* @PyErr_Occurred() declare { i64, i8* }* @Py_BuildValue(i8*, ...) declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...) declare void @PyErr_Clear() define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) { entry: %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }* tail call void @Py_INCREF({ i64, i8* }* %0) %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 2 %2 = load i8** %1, align 8, !tbaa !7 %3 = ptrtoint i8* %2 to i64 %4 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 5 %5 = load i64** %4, align 8, !tbaa !10 %6 = load i64* %5, align 8 %7 = getelementptr i64* %5, i64 1 %8 = load i64* %7, align 8 %9 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }* tail call void @Py_INCREF({ i64, i8* }* %9) %10 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 2 %11 = load i8** %10, align 8, !tbaa !7 %12 = ptrtoint i8* %11 to i64 %13 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 5 %14 = load i64** %13, align 8, !tbaa !10 %15 = load i64* %14, align 8 %16 = getelementptr i64* %14, i64 1 %17 = load i64* %16, align 8 %18 = shl i64 %17, 1 %19 = shl i64 %8, 1 %20 = sub i64 0, %3 %21 = sub i64 0, %12 br label %"loop_body_4:8" cleanup_label: ; preds = %"no_error_23:21", %error_label %return_value.0.load14 = phi { i64, i8* }* [ %25, %"no_error_23:21" ], [ null, %error_label ] %22 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }* %23 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }* tail call void @Py_XDECREF({ i64, i8* }* %25) tail call void @Py_XDECREF({ i64, i8* }* %22) tail call void @Py_XDECREF({ i64, i8* }* %23) ret { i64, i8* }* %return_value.0.load14 error_label: ; preds = %"exit_for_3:4" tail call void @Py_XINCREF({ i64, i8* }* null) br label %cleanup_label "for_condition_3:16.loopexit": ; preds = %"loop_body_20:29" %24 = add i64 %27, 1 %exitcond35 = icmp eq i64 %24, 15 br i1 %exitcond35, label %"exit_for_3:4", label %"loop_body_4:8" "exit_for_3:4": ; preds = %"for_condition_3:16.loopexit" %25 = tail call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) %26 = icmp eq { i64, i8* }* %25, null br i1 %26, label %error_label, label %"no_error_23:21" "loop_body_4:8": ; preds = %"for_condition_3:16.loopexit", %entry %27 = phi i64 [ 0, %entry ], [ %24, %"for_condition_3:16.loopexit" ] br label %"loop_body_5:12" "for_condition_4:17.loopexit": ; preds = %"exit_if_7:16" %28 = add i64 %29, 1 %exitcond33 = icmp eq i64 %28, 150 br i1 %exitcond33, label %"loop_body_20:29", label %"loop_body_5:12" "loop_body_5:12": ; preds = %"for_condition_4:17.loopexit", %"loop_body_4:8" %29 = phi i64 [ 0, %"loop_body_4:8" ], [ %28, %"for_condition_4:17.loopexit" ] %30 = mul i64 %29, %15 %31 = add i64 %30, %17 %32 = add i64 %30, %18 %33 = mul i64 %29, %6 %34 = add i64 %33, %8 %35 = add i64 %33, %19 br label %"loop_body_7:16" "loop_body_7:16": ; preds = %"exit_if_7:16", %"loop_body_5:12" %lsr.iv = phi i64 [ %lsr.iv.next, %"exit_if_7:16" ], [ %20, %"loop_body_5:12" ] %Fz_418 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ] %Fy_417 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ] %Fx_416 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ] %36 = phi i64 [ 0, %"loop_body_5:12" ], [ %37, %"exit_if_7:16" ] %37 = add i64 %36, 1 %38 = icmp eq i64 %29, %36 br i1 %38, label %"exit_if_7:16", label %"if_body_8:20" "exit_if_7:16": ; preds = %"loop_body_7:16", %"if_body_8:20" %Fx_6 = phi double [ %70, %"if_body_8:20" ], [ %Fx_416, %"loop_body_7:16" ] %Fy_6 = phi double [ %72, %"if_body_8:20" ], [ %Fy_417, %"loop_body_7:16" ] %Fz_6 = phi double [ %74, %"if_body_8:20" ], [ %Fz_418, %"loop_body_7:16" ] %sunkaddr = ptrtoint i8* %11 to i64 %sunkaddr23 = add i64 %sunkaddr, %30 %sunkaddr24 = inttoptr i64 %sunkaddr23 to double* %39 = load double* %sunkaddr24, align 8, !tbaa !7 %40 = fmul double %Fx_6, 1.000000e-02 %41 = fadd double %40, %39 store double %41, double* %sunkaddr24, align 8, !tbaa !7 %sunkaddr25 = ptrtoint i8* %11 to i64 %sunkaddr26 = add i64 %sunkaddr25, %31 %sunkaddr27 = inttoptr i64 %sunkaddr26 to double* %42 = load double* %sunkaddr27, align 8, !tbaa !7 %43 = fmul double %Fy_6, 1.000000e-02 %44 = fadd double %43, %42 store double %44, double* %sunkaddr27, align 8, !tbaa !7 %sunkaddr28 = ptrtoint i8* %11 to i64 %sunkaddr29 = add i64 %sunkaddr28, %32 %sunkaddr30 = inttoptr i64 %sunkaddr29 to double* %45 = load double* %sunkaddr30, align 8, !tbaa !7 %46 = fmul double %Fz_6, 1.000000e-02 %47 = fadd double %46, %45 store double %47, double* %sunkaddr30, align 8, !tbaa !7 %lsr.iv.next = sub i64 %lsr.iv, %6 %exitcond = icmp eq i64 150, %37 br i1 %exitcond, label %"for_condition_4:17.loopexit", label %"loop_body_7:16" "if_body_8:20": ; preds = %"loop_body_7:16" %48 = mul i64 %lsr.iv, -1 %49 = inttoptr i64 %48 to double* %50 = load double* %49, align 8, !tbaa !7 %sunkaddr31 = ptrtoint i8* %2 to i64 %sunkaddr32 = add i64 %sunkaddr31, %33 %sunkaddr33 = inttoptr i64 %sunkaddr32 to double* %51 = load double* %sunkaddr33, align 8, !tbaa !7 %52 = fsub double %50, %51 %53 = mul i64 %lsr.iv, -1 %sunkaddr34 = add i64 %8, %53 %sunkaddr35 = inttoptr i64 %sunkaddr34 to double* %54 = load double* %sunkaddr35, align 8, !tbaa !7 %sunkaddr36 = ptrtoint i8* %2 to i64 %sunkaddr37 = add i64 %sunkaddr36, %34 %sunkaddr38 = inttoptr i64 %sunkaddr37 to double* %55 = load double* %sunkaddr38, align 8, !tbaa !7 %56 = fsub double %54, %55 %57 = mul i64 %lsr.iv, -1 %sunkaddr39 = mul i64 %8, 2 %sunkaddr40 = add i64 %57, %sunkaddr39 %sunkaddr41 = inttoptr i64 %sunkaddr40 to double* %58 = load double* %sunkaddr41, align 8, !tbaa !7 %sunkaddr42 = ptrtoint i8* %2 to i64 %sunkaddr43 = add i64 %sunkaddr42, %35 %sunkaddr44 = inttoptr i64 %sunkaddr43 to double* %59 = load double* %sunkaddr44, align 8, !tbaa !7 %60 = fsub double %58, %59 %61 = fmul double %52, %52 %62 = fmul double %56, %56 %63 = fadd double %61, %62 %64 = fmul double %60, %60 %65 = fadd double %63, %64 %66 = tail call double @"numba.math.['double'].sqrt"(double %65) %67 = fadd double %66, %65 %68 = fdiv double 1.000000e+00, %67 %69 = fmul double %52, %68 %70 = fadd double %Fx_416, %69 %71 = fmul double %56, %68 %72 = fadd double %Fy_417, %71 %73 = fmul double %60, %68 %74 = fadd double %Fz_418, %73 br label %"exit_if_7:16" "loop_body_20:29": ; preds = %"for_condition_4:17.loopexit", %"loop_body_20:29" %lsr.iv21 = phi i64 [ %lsr.iv.next22, %"loop_body_20:29" ], [ 150, %"for_condition_4:17.loopexit" ] %lsr.iv15 = phi i64 [ %lsr.iv.next16, %"loop_body_20:29" ], [ %21, %"for_condition_4:17.loopexit" ] %lsr.iv4 = phi i8* [ %91, %"loop_body_20:29" ], [ %2, %"for_condition_4:17.loopexit" ] %lsr.iv414 = bitcast i8* %lsr.iv4 to double* %lsr.iv45 = bitcast i8* %lsr.iv4 to i1* %75 = load double* %lsr.iv414, align 8, !tbaa !7 %76 = mul i64 %lsr.iv15, -1 %77 = inttoptr i64 %76 to double* %78 = load double* %77, align 8, !tbaa !7 %79 = fmul double %78, 1.000000e-02 %80 = fadd double %75, %79 store double %80, double* %lsr.iv414, align 8, !tbaa !7 %scevgep12 = getelementptr i8* %lsr.iv4, i64 %8 %scevgep1213 = bitcast i8* %scevgep12 to double* %81 = load double* %scevgep1213, align 8, !tbaa !7 %82 = mul i64 %lsr.iv15, -1 %sunkaddr45 = add i64 %17, %82 %sunkaddr46 = inttoptr i64 %sunkaddr45 to double* %83 = load double* %sunkaddr46, align 8, !tbaa !7 %84 = fmul double %83, 1.000000e-02 %85 = fadd double %81, %84 %scevgep10 = getelementptr i8* %lsr.iv4, i64 %8 %scevgep1011 = bitcast i8* %scevgep10 to double* store double %85, double* %scevgep1011, align 8, !tbaa !7 %sunkaddr47 = ptrtoint i8* %lsr.iv4 to i64 %sunkaddr48 = mul i64 %8, 2 %sunkaddr49 = add i64 %sunkaddr47, %sunkaddr48 %sunkaddr50 = inttoptr i64 %sunkaddr49 to double* %86 = load double* %sunkaddr50, align 8, !tbaa !7 %87 = mul i64 %lsr.iv15, -1 %sunkaddr51 = mul i64 %17, 2 %sunkaddr52 = add i64 %87, %sunkaddr51 %sunkaddr53 = inttoptr i64 %sunkaddr52 to double* %88 = load double* %sunkaddr53, align 8, !tbaa !7 %89 = fmul double %88, 1.000000e-02 %90 = fadd double %86, %89 %sunkaddr54 = ptrtoint i8* %lsr.iv4 to i64 %sunkaddr55 = mul i64 %8, 2 %sunkaddr56 = add i64 %sunkaddr54, %sunkaddr55 %sunkaddr57 = inttoptr i64 %sunkaddr56 to double* store double %90, double* %sunkaddr57, align 8, !tbaa !7 %scevgep = getelementptr i1* %lsr.iv45, i64 %6 %91 = bitcast i1* %scevgep to i8* %lsr.iv.next16 = sub i64 %lsr.iv15, %15 %lsr.iv.next22 = add i64 %lsr.iv21, -1 %exitcond34 = icmp eq i64 %lsr.iv.next22, 0 br i1 %exitcond34, label %"for_condition_3:16.loopexit", label %"loop_body_20:29" "no_error_23:21": ; preds = %"exit_for_3:4" tail call void @Py_XINCREF({ i64, i8* }* %25) br label %cleanup_label } declare { i64, i8* }* @PyTuple_Pack(i64, ...) declare double @"numba.math.['double'].sqrt"(double) define { i64, i8* }* @__numba_specialized_3_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_nbody_hy(i8* %self, { i64, i8* }* %args) { entry: %objtemp = alloca { i64, i8* }* store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6 %0 = alloca { i64, i8* }* %1 = alloca { i64, i8* }* %return_value = alloca { i64, i8* }* %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0) %3 = icmp eq i32 %2, 0 br i1 %3, label %cleanup.if.true, label %cleanup.if.end cleanup_label: ; preds = %no_error, %error_label %4 = load { i64, i8* }** %objtemp, !tbaa !6 call void @Py_XDECREF({ i64, i8* }* %4) %5 = load { i64, i8* }** %return_value ret { i64, i8* }* %5 error_label: ; preds = %cleanup.if.end, %cleanup.if.true store { i64, i8* }* null, { i64, i8* }** %return_value %6 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %6) br label %cleanup_label cleanup.if.true: ; preds = %entry br label %error_label cleanup.if.end: ; preds = %entry %7 = load { i64, i8* }** %1 %8 = load { i64, i8* }** %0 %9 = bitcast { i64, i8* }* %7 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %10 = bitcast { i64, i8* }* %8 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %11 = call { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %9, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %10) store { i64, i8* }* %11, { i64, i8* }** %objtemp, !tbaa !6 %12 = ptrtoint { i64, i8* }* %11 to i64 %13 = icmp eq i64 %12, 0 br i1 %13, label %error_label, label %no_error no_error: ; preds = %cleanup.if.end %14 = load { i64, i8* }** %objtemp, !tbaa !6 store { i64, i8* }* %14, { i64, i8* }** %return_value %15 = load { i64, i8* }** %return_value, !tbaa !6 call void @Py_XINCREF({ i64, i8* }* %15) br label %cleanup_label } !tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6, !0, !1, !7, !8, !9, !10, !6, !2, !3, !4, !5, !11, !12, !13, !14, !0, !1, !6} !0 = metadata !{metadata !"root"} !1 = metadata !{metadata !"char *", metadata !0} !2 = metadata !{metadata !"unique0", metadata !1} !3 = metadata !{metadata !"unique1", metadata !1} !4 = metadata !{metadata !"unique2", metadata !1} !5 = metadata !{metadata !"unique3", metadata !1} !6 = metadata !{metadata !"object", metadata !1} !7 = metadata !{metadata !"float3264 *", metadata !1} !8 = metadata !{metadata !"npy_intp **", metadata !1} !9 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !8} !10 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !8} !11 = metadata !{metadata !"unique4", metadata !1} !12 = metadata !{metadata !"unique5", metadata !1} !13 = metadata !{metadata !"unique6", metadata !1} !14 = metadata !{metadata !"unique7", metadata !1}
CPU times: user 424 ms, sys: 35.5 ms, total: 460 ms Wall time: 435 ms
The dynamically compiled version is by far the fastest.
func_list = ['nbody_py', 'nbody_np', 'nbody_nb', 'nbody_hy']
data_list = ['p_py, pv_py', 'p_np, pv_np', 'p_nb, pv_nb', 'p_nb, pv_nb']
perf_comp_data(func_list, data_list, rep=3)
function: nbody_nb, av. time sec: 0.00607, relative: 1.0 function: nbody_np, av. time sec: 0.09028, relative: 14.9 function: nbody_py, av. time sec: 0.54923, relative: 90.5 function: nbody_hy, av. time sec: 5.44442, relative: 896.8
Nowadays, the Python ecosystem provides a number of ways to improve the performance of code.
A major benefit of the Python ecosystem is, that all these approaches generally are easily implementable, meaning that the additional effort included is generally quite low (even for non-experts). In other words, performance improvements often are low hanging fruits.
The Python Quants – the company Web site
Dr. Yves J. Hilpisch – my personal Web site
Derivatives Analytics with Python – my new book
Contact Us