The Python Quants



Performance Python

Dr. Yves J. Hilpisch

The Python Quants GmbH

www.pythonquants.com

yves@pythonquants.com

@dyjh

PyData, New York City – 08. November 2013

Performance Python

Python for Performance Computing:

  • measure: getting to results faster
  • pillar 1: agile interactive analytics & prototyping
  • pillar 2: rapid application development & deployment
  • pillar 3: performant execution of algorithms & code

This tutorial focuses on

  • some fundamental Python issues when it comes to performance
  • selected approaches to improve the performance of algorithms & code
  • simple, interactive examples

It does not address such important issues like

  • architectural issues regarding hardware and software
  • development processes, testing, documentation and production
  • real world problem modeling

Some Python Fundamentals

Fundamental Python Libraries

A fundamental Python stack for interactive data analytics and visualization should at least contain the following libraries tools:

  • Python – the Python interpreter itself
  • NumPy – high performance, flexible array structures and operations
  • SciPy – collection of scientific modules and functions (e.g. for regression, optimization, integration)
  • pandas – time series and panel data analysis and I/O
  • PyTables – hierarchical, high performance database (e.g. for out-of-memory analytics)
  • matplotlib – 2d and 3d visualization
  • IPython – interactive data analytics, visualization, publishing

It is best to use either the Python distribution Anaconda or the Web-based analytics environment Wakari. Both provide almost "complete" Python environments.

When it comes to performance critical operations, a number of additional libraries can be used to speed-up code execution.

  • numexpr – for fast numerical operations
  • Numba – for dynamically compiling Python code for the CPU
  • NumbaPro – for dynamically compiling Python code for multi-core CPU and GPUs
  • Cython – for merging Python with C paradigms for static compilation
  • IPython.Parallel – for the parallel execution of code/functions over a cluster
  • ... many others

All these are also part of Anaconda and Wakari.

Comparing Performance of Python Functions

We define a convenience function to compare the performance of a set of Python functions executed on the same data or on different data.

In [1]:
def perf_comp_data(func_list, data_list, rep=3, number=1):
    from timeit import repeat
    res_list = {}
    for name in enumerate(func_list):
        stmt = name[1] + '(' + data_list[name[0]] + ')'
        setup = "from __main__ import " + name[1] + ', ' + data_list[name[0]]
        results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
        res_list[name[1]] = sum(results) / rep
    res_sort = sorted(res_list.iteritems(), key=lambda (k, v): (v, k))
    for item in res_sort:
        rel = item[1] / res_sort[0][1]
        print 'function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1] \
            + 'relative: %6.1f' % rel

Speeding Up Numerical Computations

A typical compute intensive operation is to evaluate a complex mathematical expression on a large array of data.

First, let us define an example function/numerical expression.

In [2]:
from math import *
def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

Second, as our benchmark case a pure Python implementation.

In [3]:
I = 200000
a_py = range(I)
In [4]:
def f1(a):
    res = []
    for x in a:
        res.append(f(x))
    return res

One can also use different paradigms to implement the same function, like iterators or the eval function.

In [5]:
def f2(a):
    return [f(x) for x in a]
In [6]:
def f3(a):
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]

Third, a vectorized implementations with NumPy.

In [7]:
import numpy as np
In [8]:
a_np = np.arange(I)
In [9]:
def f4(a):
    return (np.abs(np.cos(a)) ** 0.5 +
            np.sin(2 + 3 * a))

Fourth, we use numexpr to evaluate the numerical expression.

In [10]:
import numexpr as ne
In [11]:
def f5(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(1)
    return ne.evaluate(ex)
In [12]:
def f6(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(4)
    return ne.evaluate(ex)

Now, let's compare the performance of all different function implementations.

In [13]:
func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np']
In [14]:
perf_comp_data(func_list, data_list, rep=1)
function: f6, av. time sec:   0.00627, relative:    1.0
function: f5, av. time sec:   0.01631, relative:    2.6
function: f4, av. time sec:   0.01861, relative:    3.0
function: f2, av. time sec:   0.15558, relative:   24.8
function: f1, av. time sec:   0.27338, relative:   43.6
function: f3, av. time sec:   6.27276, relative: 1000.1

Speeding Up Code Through Memory Layout

Depending on the respective algorithm you implement with NumPy, memory layout can play a non-negligible role.

In [15]:
import numpy as np
In [16]:
np.zeros((3, 3), dtype=np.float64, order='C')
Out[16]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

The way you initialize NumPy array can have a significant influence on the performance of operations on these arrays (given a certain size of the array).

  • shape: either an int, a sequence of ints or a reference to another numpy.ndarray
  • dtype (optional): a numpy.dtype which are NumPy-specific basic data types for numpy.ndarray objects
  • order (optional): the order of storing elements in memory, 'C' for C-like (i.e. row-wise) or 'F' for Fortran-like (i.e. column-wise)

Consider the C-like, i.e. row-wise, storage.

In [17]:
c = np.array([[ 1.,  1.,  1.],
              [ 2.,  2.,  2.],
              [ 3.,  3.,  3.]], order='C')

In this case, the 1s, the 2s and the 3s are stored next to each other.

By contrast, consider the Fortran-like, i.e. column-wise, storage.

In [18]:
f = np.array([[ 1.,  1.,  1.],
              [ 2.,  2.,  2.],
              [ 3.,  3.,  3.]], order='F')

Now, the data is stored in a way that 1, 2, 3 are next to each other for each column.

Let's see whether the memory layout makes a difference in some way when the array is large.

In [19]:
x = np.random.standard_normal((3, 1500000))
C = np.array(x, order='C')
F = np.array(x, order='F')
x = 0.0

Let's implement standard operations on the C-like layout array.

First, calculating sums.

In [20]:
%timeit C.sum(axis=0)
100 loops, best of 3: 11.1 ms per loop

In [21]:
%timeit C.sum(axis=1)
100 loops, best of 3: 5.93 ms per loop

Second, standard deviations.

In [22]:
%timeit C.std(axis=0)
10 loops, best of 3: 69.7 ms per loop

In [23]:
%timeit C.std(axis=1)
10 loops, best of 3: 31.6 ms per loop

Now the Fortran-like layout. Sums first.

In [24]:
%timeit F.sum(axis=0)
10 loops, best of 3: 35.6 ms per loop

In [25]:
%timeit F.sum(axis=1)
10 loops, best of 3: 26.4 ms per loop

Standard deviations second.

In [26]:
%timeit F.std(axis=0)
10 loops, best of 3: 117 ms per loop

In [27]:
%timeit F.std(axis=1)
10 loops, best of 3: 70.8 ms per loop

In [28]:
C = 0.0; F = 0.0

Speeding Up I/O Operations

SQL Databases

Currently, and in the foreseeable future as well, most corporate data is stored in SQL databases. We take this as a starting point and generate a SQL table with dummy data.

In [29]:
filename = 'numbers'
import sqlite3 as sq3  # imports the SQLite library
# Remove files if they exist
import os
try:
    os.remove(filename + '.db')
    os.remove(filename + '.h5')
    os.remove(filename + '.hs')
    os.remove(filename + '.csv')
except:
    pass

We create a table and populate the table with the dummy data.

In [30]:
query = 'CREATE TABLE numbs (no1 real, no2 real, no3 real, no4 real, no5 real)'
con = sq3.connect(filename + '.db')
con.execute(query)
con.commit()
In [31]:
# Writing Random Numbers to SQLite3 Database
rows = 5000000  # 5,000,000 rows for the SQL table (ca. 250 MB in size)
nr = np.random.standard_normal((rows, 5))
%time con.executemany('INSERT INTO numbs VALUES(?, ?, ?, ?, ?)', nr)
con.commit()
CPU times: user 1min 2s, sys: 962 ms, total: 1min 3s
Wall time: 1min 3s

We can now read the data into a NumPy array.

In [32]:
# Reading a couple of rows from SQLite3 database table
array(con.execute('SELECT * FROM numbs').fetchmany(3)).round(3)
Out[32]:
array([[ 0.149, -0.874,  0.736, -0.551, -1.399],
       [ 1.299, -0.281, -1.416,  0.995,  0.337],
       [ 0.87 , -1.057,  0.544, -1.416, -2.613]])
In [33]:
# Reading the whole table into an NumPy array
%time result = array(con.execute('SELECT * FROM numbs').fetchall())
CPU times: user 39.3 s, sys: 1.83 s, total: 41.1 s
Wall time: 41.6 s

PyTables, which is based on the HDF5 file format, allows much faster, disk-based I/O than a typical SQL database. For example, writing the same data is much faster compared to SQLite3.

In [34]:
import tables as tb
In [35]:
%%time
h5 = tb.open_file('numbers.h5', 'w')
h5.create_array('/', 'numbs', obj=result, title='numbs')
h5.close()
CPU times: user 3.89 ms, sys: 209 ms, total: 213 ms
Wall time: 370 ms

In [36]:
nr = 0.0; result = 0.0  # memory clean-up

Reading speed is equally impressive.

In [37]:
%%time
h5 = tb.open_file('numbers.h5', 'r')
arr = h5.root.numbs.read()
h5.close()
CPU times: user 2.6 ms, sys: 167 ms, total: 170 ms
Wall time: 158 ms

In [38]:
arr.nbytes
Out[38]:
200000000
In [39]:
arr = 0.0  # memory clean-up

Reading and Writing Data with pandas

pandas is optimized both for convenience and performance when it comes to analytics and I/O operations.

In [40]:
import pandas as pd
import pandas.io.sql as pds
In [41]:
%time data = pds.read_frame('SELECT * FROM numbs', con)
con.close()
CPU times: user 18.3 s, sys: 1.51 s, total: 19.8 s
Wall time: 20.2 s

In [42]:
data.head()
Out[42]:
no1 no2 no3 no4 no5
0 0.149151 -0.873864 0.736056 -0.550805 -1.399441
1 1.298855 -0.281183 -1.415695 0.995101 0.337464
2 0.870333 -1.056957 0.543588 -1.416220 -2.613365
3 0.509911 0.139642 0.152018 0.197079 0.292907
4 -1.048647 -1.113482 -0.037894 1.216061 -0.263849

The plotting of larger data sets often is a problem. This can easily be accomplished here by only plotting every 500th data row, for instance.

In [43]:
%time data[::500].cumsum().plot(grid=True, legend=True)  # every 500th row is plotted
CPU times: user 106 ms, sys: 5.53 ms, total: 111 ms
Wall time: 119 ms

Out[43]:
<matplotlib.axes.AxesSubplot at 0x107cae990>

There are some reasons why you may want to write the DataFrame to disk:

  • Database Performance: you do not want to query the SQL database everytime anew
  • IO Speed: you want to increase data/information retrievel
  • Data Structure: maybe you want to store additional data with the original data set or you have generated a DataFrame out of multiple sources

pandas is tightly integrated with PyTables. It makes data storage and I/O operations with pandas not only convenient but quite fast as well.

In [44]:
%%time
h5 = pd.HDFStore(filename + '.hs', 'w')
h5['data'] = data  # writes the DataFrame to HDF5 database
h5.close()
CPU times: user 263 ms, sys: 374 ms, total: 638 ms
Wall time: 793 ms

Now, we can read the data from the HDF5 file into another DataFrame object. Performance of such an operation generally is quite high.

In [45]:
%%time
h5 = pd.HDFStore(filename + '.hs', 'r')
data_new = h5['data']
h5.close()
CPU times: user 30 ms, sys: 276 ms, total: 306 ms
Wall time: 305 ms

In [46]:
%time data.sum() - data_new.sum()   # checking if the DataFrames are the same
CPU times: user 750 ms, sys: 231 ms, total: 981 ms
Wall time: 976 ms

Out[46]:
no1    0
no2    0
no3    0
no4    0
no5    0
dtype: float64
In [47]:
data = 0.0; data_new = 0.0  # memory clean-up

Some final checks.

In [48]:
ll numb*
-rw-r--r--  1 yhilpisch  staff  272386048  8 Nov 14:26 numbers.db
-rw-rw-rw-  1 yhilpisch  staff  200002304  8 Nov 14:26 numbers.h5
-rw-rw-rw-  1 yhilpisch  staff  240007368  8 Nov 14:27 numbers.hs

In [49]:
try:
    os.remove(filename + '.db')
    os.remove(filename + '.h5')
    os.remove(filename + '.hs')
    os.remove(filename + '.csv')
except:
    pass

Speeding Up Code Execution through Distributed Computing

The algorithm we (arbitrarily) choose is the Monte Carlo estimator for a European option value in the Black-Scholes-Merton set-up. In this set-up the underlying follows the stochastic differential equation:

\[dS_t = r S_t dt + \sigma S_t dZ_t\]

The Monte Carlo estimator for a call option value looks like follows:

\(C_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} \max[S_T(i) - K, 0]\)

where \(S_T(i)\) is the \(i\)-th simulated value of the underlying at maturity \(T\).

A dynamic implementation of the BSM Monte Carlo valuation could look like follows.

In [50]:
def BSM_MCS(S0):
    ''' Dynamic Black-Scholes-Merton MCS Estimator for European Calls. '''
    import numpy as np
    K = 100.; T = 1.0; r = 0.05; vola = 0.2
    M = 50; I = 20000
    dt = T / M
    rand = np.random.standard_normal((M + 1, I))
    S = np.zeros((M + 1, I)); S[0] = S0
    for t in range(1, M + 1):
        S[t] = S[t-1] * np.exp((r - 0.5 * vola ** 2) * dt + vola * np.sqrt(dt) * rand[t])
    BSM_MCS_Value = np.sum(np.maximum(S[-1] - K, 0)) / I
    return BSM_MCS_Value

The Sequential Calculation

The benchmark case of sequential valuation of \(n\) options.

In [51]:
def seqValue(n):
    optionValues = []
    for S in linspace(80, 100, n):
        optionValues.append(BSM_MCS(S))
    return optionValues

The Parallel Calculation

First, start a (local) cluster.

In [52]:
# in the shell ...
####
# ipcluster start -n 4
####
# or maybe more cores

Second, enable parallel computing capabilities.

In [53]:
from IPython.parallel import Client
c = Client(profile="default")
view = c.load_balanced_view()

The function for the parallel execution of a number \(n\) of different options valuations.

In [54]:
def parValue(n):
    optionValues = []
    for S in linspace(80, 100, n):
        value = view.apply_async(BSM_MCS, S)
        # asynchronously calculate the option values
        optionValues.append(value)
    c.wait(optionValues)
    return optionValues

Let us compare performance of the two different approaches.

In [55]:
n = 20  # number of option valuations
func_list = ['seqValue', 'parValue']
data_list = 2 * ['n']
In [56]:
perf_comp_data(func_list, data_list, rep=3)
function: parValue, av. time sec:   0.75665, relative:    1.0
function: seqValue, av. time sec:   1.22566, relative:    1.6

Performance scales linearly with the number of CPU cores available.

Speeding Up Code through Dynamic Compiling

Numba is an open-source, NumPy-aware optimizing compiler for Python sponsored by The Python Quants GmbH It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code especially for use in the NumPy run-time and SciPy modules.

Introductory Examples

Let us start with a problem that typically lead to performance issues in Python: alogrithms with a nested loops. A sandbox variant can illustrate the problem.

In [57]:
from math import cos, log
def f_py(I, J):
    res = 0
    for i in range(I):
        for j in range (J):
            res += int(cos(log(1)))
    return res
In [58]:
I, J = 1000, 1000
%time f_py(I, J)
CPU times: user 491 ms, sys: 6.89 ms, total: 498 ms
Wall time: 494 ms

Out[58]:
1000000

In principle, this can be vectorized with NumPy arrays.

In [59]:
def f_np(I, J):
    a = np.ones((I, J), dtype=np.float64)
    return int(np.sum(np.cos(np.log(a)))), a
In [60]:
%time res, a = f_np(I, J)
CPU times: user 20.3 ms, sys: 6.44 ms, total: 26.7 ms
Wall time: 25.8 ms

This is much faster, but not really memory efficient ...

In [61]:
a.nbytes
Out[61]:
8000000

Numba can provide help.

In [62]:
import numba as nb
In [63]:
%time f_nb = nb.autojit(f_py)
CPU times: user 2.47 ms, sys: 657 µs, total: 3.13 ms
Wall time: 2.73 ms

It preserves memory efficiency while speeding up execution considerably in this case.

In [64]:
%time f_nb(I, J)

DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.f_py.107dfe578'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %nsteps2 = alloca i64
  %target_temp1 = alloca i64
  %nsteps = alloca i64
  %target_temp = alloca i64
  %return_value = alloca i32
  store i64 0, i64* %target_temp, !tbaa !2
  %0 = sext i32 %I to i64
  store i64 %0, i64* %nsteps, !tbaa !3
  br label %"for_condition_4:13"

cleanup_label:                                    ; preds = %"exit_for_4:4", %error_label
  %1 = load i32* %return_value
  ret i32 %1

error_label:                                      ; No predecessors!
  store i32 123456789, i32* %return_value
  br label %cleanup_label

"for_condition_4:13":                             ; preds = %entry, %"exit_for_5:8"
  %res_2 = phi i32 [ 0, %entry ], [ %res_3, %"exit_for_5:8" ]
  %j_1 = phi i64 [ 123456789, %entry ], [ %j_2, %"exit_for_5:8" ]
  %2 = load i64* %target_temp, !tbaa !2
  %3 = load i64* %nsteps, !tbaa !3
  %4 = icmp slt i64 %2, %3
  %5 = icmp ne i1 %4, false
  br i1 %5, label %"loop_body_5:8", label %"exit_for_4:4"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13"
  store i32 %res_2, i32* %return_value
  br label %cleanup_label

"loop_body_5:8":                                  ; preds = %"for_condition_4:13"
  %6 = load i64* %target_temp, !tbaa !2
  %7 = load i64* %target_temp, !tbaa !2
  %8 = add i64 %7, 1
  store i64 %8, i64* %target_temp, !tbaa !2
  store i64 0, i64* %target_temp1, !tbaa !4
  %9 = sext i32 %J to i64
  store i64 %9, i64* %nsteps2, !tbaa !5
  br label %"for_condition_5:17"

"for_condition_5:17":                             ; preds = %"loop_body_5:8", %"loop_body_6:19"
  %res_3 = phi i32 [ %20, %"loop_body_6:19" ], [ %res_2, %"loop_body_5:8" ]
  %j_2 = phi i64 [ %14, %"loop_body_6:19" ], [ %j_1, %"loop_body_5:8" ]
  %10 = load i64* %target_temp1, !tbaa !4
  %11 = load i64* %nsteps2, !tbaa !5
  %12 = icmp slt i64 %10, %11
  %13 = icmp ne i1 %12, false
  br i1 %13, label %"loop_body_6:19", label %"exit_for_5:8"

"exit_for_5:8":                                   ; preds = %"for_condition_5:17"
  br label %"for_condition_4:13"

"loop_body_6:19":                                 ; preds = %"for_condition_5:17"
  %14 = load i64* %target_temp1, !tbaa !4
  %15 = load i64* %target_temp1, !tbaa !4
  %16 = add i64 %15, 1
  store i64 %16, i64* %target_temp1, !tbaa !4
  %17 = call double @"numba.math.['double'].log"(double 1.000000e+00)
  %18 = call double @"numba.math.['double'].cos"(double %17)
  %19 = fptosi double %18 to i32
  %20 = add i32 %res_3, %19
  br label %"for_condition_5:17"
}

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

!tbaa = !{!0, !1, !2, !3, !4, !5}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}


DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = add i64 %1, 1
  store i64 %2, i64* %0
  ret void
}

define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = icmp sgt i64 %1, 1
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %decl
  %3 = add i64 %1, -1
  store i64 %3, i64* %0
  br label %if.end

if.else:                                          ; preds = %decl
  call void @Py_DecRef({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret void
}

declare void @Py_DecRef({ i64, i8* }*)

define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  %2 = bitcast { i64, i8* }* %obj to i64*
  %3 = load i64* %2
  %4 = add i64 %3, 1
  store i64 %4, i64* %2
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  call void @Py_DECREF({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
  %data1 = alloca i8*
  %in_shape2 = alloca i64*
  %in_strides3 = alloca i64*
  %src_dim4 = alloca i64
  %index5 = alloca i64
  %result = alloca i8*
  store i8* %data, i8** %data1
  store i64* %in_shape, i64** %in_shape2
  store i64* %in_strides, i64** %in_strides3
  store i64 %src_dim, i64* %src_dim4
  store i64 %index, i64* %index5
  %0 = load i64** %in_strides3
  %1 = load i64* %src_dim4
  %2 = getelementptr inbounds i64* %0, i64 %1
  %3 = load i64* %2
  %4 = mul i64 %3, %index
  %5 = load i8** %data1
  %6 = getelementptr inbounds i8* %5, i64 %4
  store i8* %6, i8** %result
  ret i8* %6
}

define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
  %out_shape1 = alloca i64*
  %out_strides2 = alloca i64*
  %dst_dim3 = alloca i32
  store i64* %out_shape, i64** %out_shape1
  store i64* %out_strides, i64** %out_strides2
  store i32 %dst_dim, i32* %dst_dim3
  %0 = load i64** %out_shape1
  %1 = getelementptr inbounds i64* %0, i32 %dst_dim
  store i64 1, i64* %1
  %2 = load i64** %out_strides2
  %3 = load i32* %dst_dim3
  %4 = getelementptr inbounds i64* %2, i32 %3
  store i64 0, i64* %4
  ret void
}

define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
  %dst_shape1 = alloca i64*
  %src_shape2 = alloca i64*
  %src_strides3 = alloca i64*
  %max_ndim4 = alloca i32
  %ndim5 = alloca i32
  %0 = alloca i32
  store i64* %dst_shape, i64** %dst_shape1
  store i64* %src_shape, i64** %src_shape2
  store i64* %src_strides, i64** %src_strides3
  store i32 %max_ndim, i32* %max_ndim4
  store i32 %ndim, i32* %ndim5
  %1 = load i32* %max_ndim4
  %2 = sub i32 %1, %ndim
  store i32 0, i32* %0
  br label %loop.cond

loop.cond:                                        ; preds = %if.end11, %decl
  %3 = load i32* %0
  %4 = load i32* %ndim5
  %5 = icmp slt i32 %3, %4
  br i1 %5, label %loop.body, label %loop.end

loop.body:                                        ; preds = %loop.cond
  %6 = load i64** %src_shape2
  %7 = getelementptr inbounds i64* %6, i32 %3
  %8 = add i32 %3, %2
  %9 = load i64** %dst_shape1
  %10 = getelementptr inbounds i64* %9, i32 %8
  %11 = load i64* %7
  %12 = icmp eq i64 %11, 1
  br i1 %12, label %if.then, label %if.else

loop.end:                                         ; preds = %if.else7, %loop.cond
  %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
  ret i32 %merge

if.then:                                          ; preds = %loop.body
  %13 = load i64** %src_strides3
  %14 = getelementptr inbounds i64* %13, i32 %3
  store i64 0, i64* %14
  br label %if.end11

if.else:                                          ; preds = %loop.body
  %15 = load i64* %10
  %16 = icmp eq i64 %15, 1
  br i1 %16, label %if.then6, label %if.else7

if.then6:                                         ; preds = %if.else
  store i64 %11, i64* %10
  br label %if.end11

if.else7:                                         ; preds = %if.else
  %17 = icmp ne i64 %11, %15
  br i1 %17, label %loop.end, label %if.end11

if.end11:                                         ; preds = %if.else7, %if.then6, %if.then
  %18 = load i32* %0
  %19 = add i32 %18, 1
  store i32 %19, i32* %0
  br label %loop.cond
}

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %0 = icmp sgt i32 %I, 0
  %1 = icmp sgt i32 %J, 0
  %or.cond = and i1 %0, %1
  br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"

"loop_body_5:8.lr.ph.split.us":                   ; preds = %entry
  %2 = sext i32 %I to i64
  %3 = sext i32 %J to i64
  br label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.us":                              ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
  %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
  %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
  %5 = tail call double @"numba.math.['double'].cos"(double %4)
  %6 = fptosi double %5 to i32
  %7 = add i32 %6, %res_35.us
  %lsr.iv.next = add i64 %lsr.iv, -1
  %exitcond = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"

"for_condition_4:13.loopexit.us":                 ; preds = %"loop_body_6:19.us"
  %exitcond8 = icmp eq i64 %9, %2
  br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.lr.ph.us":                        ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
  %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
  %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
  %9 = add i64 %8, 1
  br label %"loop_body_6:19.us"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13.loopexit.us", %entry
  %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
  ret i32 %res_2.lcssa
}

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %cleanup.if.true, label %cleanup.if.end

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %empty7, %empty8, %empty5, %empty2, %cleanup.if.true
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

cleanup.if.true:                                  ; preds = %entry
  br label %error_label

cleanup.if.end:                                   ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
  br label %empty

empty:                                            ; preds = %cleanup.if.end
  %10 = call i8* @PyErr_Occurred()
  %11 = ptrtoint i8* %10 to i64
  %12 = icmp ne i64 %11, 0
  br i1 %12, label %empty2, label %empty1

empty1:                                           ; preds = %empty
  %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
  br label %empty3

empty2:                                           ; preds = %empty
  br label %error_label

empty3:                                           ; preds = %empty1
  %14 = call i8* @PyErr_Occurred()
  %15 = ptrtoint i8* %14 to i64
  %16 = icmp ne i64 %15, 0
  br i1 %16, label %empty5, label %empty4

empty4:                                           ; preds = %empty3
  %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
  br label %empty6

empty5:                                           ; preds = %empty3
  br label %error_label

empty6:                                           ; preds = %empty4
  %18 = call i8* @PyErr_Occurred()
  %19 = ptrtoint i8* %18 to i64
  %20 = icmp ne i64 %19, 0
  br i1 %20, label %empty8, label %empty7

empty7:                                           ; preds = %empty6
  %21 = sext i32 %17 to i64
  %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
  store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
  %23 = ptrtoint { i64, i8* }* %22 to i64
  %24 = icmp eq i64 %23, 0
  br i1 %24, label %error_label, label %no_error

empty8:                                           ; preds = %empty6
  br label %error_label

no_error:                                         ; preds = %empty7
  %25 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %25, { i64, i8* }** %return_value
  %26 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %26)
  br label %cleanup_label
}

declare i8* @PyErr_Occurred()

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}


CPU times: user 248 ms, sys: 27.5 ms, total: 275 ms
Wall time: 337 ms

Out[64]:
1000000

Let us compare the performance of the different alternatives.

In [65]:
func_list = ['f_py', 'f_np', 'f_nb']
data_list = 3 * ['I, J']
In [66]:
perf_comp_data(func_list, data_list, rep=3)
function: f_nb, av. time sec:   0.01615, relative:    1.0
function: f_np, av. time sec:   0.03077, relative:    1.9
function: f_py, av. time sec:   0.51423, relative:   31.8

N-Body Simulation

The simulation of the movement of N bodies in relation to each other, where \(N\) is a large number and a body can be a star/planet or a small particle, is an important problem in Physics.

The basic simulation algorithm involves a number of iterations ("loops") since one has to determine the force that every single body exerts on all the other bodies in the system to be simulated (e.g. a galaxy or a system of atomic particles).

The algorithm lends iteself pretty well for parallization. However, the example code that follows does not use any parallelization techniques. It illustrates rather what speed-ups are achievable by

  • using vectorization approaches with NumPy and
  • just-in-time compiling capabilities of Numba

A description and discussion of the algorithm is found in the recent article Vladimirov, Andrey and Vadim Karpusenko (2013): "Test-Driving Intel Xeon-Phi Coprocessors with a Basic N-Body Simulation." White Paper, Colfax International.

Pure Python

We first use pure Python to implement the algorithm. We start by generating random vectors of particle locations and a zero vector for the speed of each particle.

In [67]:
import random
In [68]:
nParticles = 150
nSteps = 15
In [69]:
p_py = []
for i in range(nParticles):
    p_py.append([random.gauss(0.0, 1.0) for j in range(3)])
In [70]:
pv_py = [[0, 0, 0] for x in p_py]

Next, we define the simulation function for the movements of each particle given their gravitational interaction.

In [71]:
def nbody_py(particle, particlev):  # lists as input
    dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j][0] - particle[i][0]
                    dy = particle[j][1] - particle[i][1]
                    dz = particle[j][2] - particle[i][2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
                particlev[i][0] += dt * Fx
                particlev[i][1] += dt * Fy
                particlev[i][2] += dt * Fz
        for i in range(nParticles):
            particle[i][0] += particlev[i][0] * dt
            particle[i][1] += particlev[i][1] * dt
            particle[i][2] += particlev[i][2] * dt
    return particle, particlev

We then execute the function to measure execution speed.

In [72]:
%time p_py, pv_py = nbody_py(p_py, pv_py)
CPU times: user 593 ms, sys: 7.4 ms, total: 601 ms
Wall time: 596 ms

NumPy Solution

The basic algorithm allows for a number of vectorizations by using NumPy.

In [73]:
import numpy as np
In [74]:
p_np = np.random.standard_normal((nParticles, 3))
pv_np = np.zeros_like(p_np)

Using NumPy arrays and matplotlib's 3d plotting capabilities, we can easily visualize the intial positions of the particles in the space.

In [75]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
In [76]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
           c='r', marker='.')
Out[76]:
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10b3617d0>

The respective NumPy simulation function is considerably more compact.

In [77]:
def nbody_np(particle, particlev):  # NumPy arrays as input
    nSteps = 15; dt = 0.01
    for step in range(1, nSteps + 1, 1):
        Fp = np.zeros((nParticles, 3))
        for i in range(nParticles):
            dp = particle - particle[i]
            drSquared = np.sum(dp ** 2, axis=1)
            h = drSquared + np.sqrt(drSquared)
            drPowerN32 = 1. / np.maximum(h, 1E-10)
            Fp += -(dp.T * drPowerN32).T
            particlev += dt * Fp
        particle += particlev * dt
    return particle, particlev

It is also considerably faster.

In [78]:
%time p_np, pv_np = nbody_np(p_np, pv_np)
CPU times: user 106 ms, sys: 1.25 ms, total: 107 ms
Wall time: 106 ms

Let's check how the particles moved. The following is a snapshot of the final positions of all particles.

In [79]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
           c='r', marker='.')
Out[79]:
<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10bac3490>

Dynamically Compiled Version

For this case, we first implement a function which is somehow a hybrid function of the pure Python (loop-heavy) and the NumPy array class-based version.

In [80]:
def nbody_hy(particle, particlev):  # NumPy arrays as input
    dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j,0] - particle[i,0]
                    dy = particle[j,1] - particle[i,1]
                    dz = particle[j,2] - particle[i,2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
                particlev[i, 0] += dt * Fx
                particlev[i, 1] += dt * Fy
                particlev[i, 2] += dt * Fz
        for i in range(nParticles):
            particle[i,0] += particlev[i,0] * dt
            particle[i,1] += particlev[i,1] * dt
            particle[i,2] += particlev[i,2] * dt
    return particle, particlev

This function is then compiled with Numba.

In [81]:
import numba as nb
In [82]:
nbody_nb = nb.autojit(nbody_hy)

This new, compiled version of the hybrid function is yet faster.

In [83]:
p_nb = np.random.standard_normal((nParticles, 3))
pv_nb = np.zeros_like(p_nb)
In [84]:
%time p_nb, pv_nb = nbody_nb(p_nb, pv_nb)

DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.nbody_hy.10baa19b0'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %nsteps8 = alloca i64
  %target_temp7 = alloca i64
  %nsteps6 = alloca i64
  %target_temp5 = alloca i64
  %nsteps4 = alloca i64
  %target_temp3 = alloca i64
  %nsteps = alloca i64
  %target_temp = alloca i64
  %particlev2 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %particle1 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1
  %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  call void @Py_INCREF({ i64, i8* }* %0)
  %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 2
  %2 = load i8** %1, !tbaa !2
  %3 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 4
  %4 = load i64** %3, !tbaa !4
  %5 = getelementptr i64* %4, i32 0
  %6 = load i64* %5
  %7 = getelementptr i64* %4, i32 1
  %8 = load i64* %7
  %9 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 5
  %10 = load i64** %9, !tbaa !5
  %11 = getelementptr i64* %10, i32 0
  %12 = load i64* %11
  %13 = getelementptr i64* %10, i32 1
  %14 = load i64* %13
  store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2
  %15 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  call void @Py_INCREF({ i64, i8* }* %15)
  %16 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 2
  %17 = load i8** %16, !tbaa !2
  %18 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 4
  %19 = load i64** %18, !tbaa !4
  %20 = getelementptr i64* %19, i32 0
  %21 = load i64* %20
  %22 = getelementptr i64* %19, i32 1
  %23 = load i64* %22
  %24 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 5
  %25 = load i64** %24, !tbaa !5
  %26 = getelementptr i64* %25, i32 0
  %27 = load i64* %26
  %28 = getelementptr i64* %25, i32 1
  %29 = load i64* %28
  %return_value = alloca { i64, i8* }*
  store i64 0, i64* %target_temp, !tbaa !7
  br i1 true, label %ifexp.then, label %ifexp.else

cleanup_label:                                    ; preds = %"no_error_23:21", %error_label
  %30 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %30)
  %31 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2, !tbaa !6
  %32 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %31 to { i64, i8* }*
  call void @Py_XDECREF({ i64, i8* }* %32)
  %33 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1, !tbaa !6
  %34 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %33 to { i64, i8* }*
  call void @Py_XDECREF({ i64, i8* }* %34)
  %35 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %35

error_label:                                      ; preds = %"exit_for_3:4"
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %36 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %36)
  br label %cleanup_label

ifexp.then:                                       ; preds = %entry
  br label %ifexp.merge

ifexp.else:                                       ; preds = %entry
  br label %ifexp.merge

ifexp.merge:                                      ; preds = %ifexp.else, %ifexp.then
  %37 = phi i32 [ 1, %ifexp.then ], [ -1, %ifexp.else ]
  %38 = sext i32 %37 to i64
  %39 = sub i64 16, %38
  %40 = sdiv i64 %39, 1
  store i64 %40, i64* %nsteps, !tbaa !8
  br label %"for_condition_3:16"

"for_condition_3:16":                             ; preds = %ifexp.merge, %"exit_for_19:8"
  %i_1 = phi i64 [ 123456789, %ifexp.merge ], [ %i_4, %"exit_for_19:8" ]
  %Fx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fx_2, %"exit_for_19:8" ]
  %drSquared_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drSquared_2, %"exit_for_19:8" ]
  %Fy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fy_2, %"exit_for_19:8" ]
  %Fz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fz_2, %"exit_for_19:8" ]
  %dz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dz_2, %"exit_for_19:8" ]
  %j_1 = phi i64 [ 123456789, %ifexp.merge ], [ %j_2, %"exit_for_19:8" ]
  %dx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dx_2, %"exit_for_19:8" ]
  %dy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dy_2, %"exit_for_19:8" ]
  %drPowerN32_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drPowerN32_2, %"exit_for_19:8" ]
  %41 = load i64* %target_temp, !tbaa !7
  %42 = load i64* %nsteps, !tbaa !8
  %43 = icmp slt i64 %41, %42
  %44 = icmp ne i1 %43, false
  br i1 %44, label %"loop_body_4:8", label %"exit_for_3:4"

"exit_for_3:4":                                   ; preds = %"for_condition_3:16"
  %45 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  %46 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  %47 = call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8* }* %45, { i64, i8* }* %46)
  store { i64, i8* }* %47, { i64, i8* }** %objtemp, !tbaa !6
  %48 = ptrtoint { i64, i8* }* %47 to i64
  %49 = icmp eq i64 %48, 0
  br i1 %49, label %error_label, label %"no_error_23:21"

"loop_body_4:8":                                  ; preds = %"for_condition_3:16"
  %50 = load i64* %target_temp, !tbaa !7
  %51 = mul i64 %50, 1
  %52 = add i64 1, %51
  %53 = load i64* %target_temp, !tbaa !7
  %54 = add i64 %53, 1
  store i64 %54, i64* %target_temp, !tbaa !7
  store i64 0, i64* %target_temp3, !tbaa !9
  store i64 150, i64* %nsteps4, !tbaa !10
  br label %"for_condition_4:17"

"for_condition_4:17":                             ; preds = %"loop_body_4:8", %"exit_for_6:12"
  %i_2 = phi i64 [ %i_1, %"loop_body_4:8" ], [ %59, %"exit_for_6:12" ]
  %Fx_2 = phi double [ %Fx_1, %"loop_body_4:8" ], [ %Fx_4, %"exit_for_6:12" ]
  %drSquared_2 = phi double [ %drSquared_1, %"loop_body_4:8" ], [ %drSquared_3, %"exit_for_6:12" ]
  %Fy_2 = phi double [ %Fy_1, %"loop_body_4:8" ], [ %Fy_4, %"exit_for_6:12" ]
  %Fz_2 = phi double [ %Fz_1, %"loop_body_4:8" ], [ %Fz_4, %"exit_for_6:12" ]
  %j_2 = phi i64 [ %j_1, %"loop_body_4:8" ], [ %j_3, %"exit_for_6:12" ]
  %dx_2 = phi double [ %dx_1, %"loop_body_4:8" ], [ %dx_3, %"exit_for_6:12" ]
  %dy_2 = phi double [ %dy_1, %"loop_body_4:8" ], [ %dy_3, %"exit_for_6:12" ]
  %dz_2 = phi double [ %dz_1, %"loop_body_4:8" ], [ %dz_3, %"exit_for_6:12" ]
  %drPowerN32_2 = phi double [ %drPowerN32_1, %"loop_body_4:8" ], [ %drPowerN32_3, %"exit_for_6:12" ]
  %55 = load i64* %target_temp3, !tbaa !9
  %56 = load i64* %nsteps4, !tbaa !10
  %57 = icmp slt i64 %55, %56
  %58 = icmp ne i1 %57, false
  br i1 %58, label %"loop_body_5:12", label %"exit_for_4:8"

"exit_for_4:8":                                   ; preds = %"for_condition_4:17"
  store i64 0, i64* %target_temp7, !tbaa !13
  store i64 150, i64* %nsteps8, !tbaa !14
  br label %"for_condition_19:17"

"loop_body_5:12":                                 ; preds = %"for_condition_4:17"
  %59 = load i64* %target_temp3, !tbaa !9
  %60 = load i64* %target_temp3, !tbaa !9
  %61 = add i64 %60, 1
  store i64 %61, i64* %target_temp3, !tbaa !9
  store i64 0, i64* %target_temp5, !tbaa !11
  store i64 150, i64* %nsteps6, !tbaa !12
  br label %"for_condition_6:21"

"for_condition_6:21":                             ; preds = %"loop_body_5:12", %"exit_if_7:16"
  %drSquared_3 = phi double [ %drSquared_2, %"loop_body_5:12" ], [ %drSquared_5, %"exit_if_7:16" ]
  %Fx_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
  %Fy_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
  %Fz_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
  %j_3 = phi i64 [ %j_2, %"loop_body_5:12" ], [ %66, %"exit_if_7:16" ]
  %dx_3 = phi double [ %dx_2, %"loop_body_5:12" ], [ %dx_5, %"exit_if_7:16" ]
  %dy_3 = phi double [ %dy_2, %"loop_body_5:12" ], [ %dy_5, %"exit_if_7:16" ]
  %dz_3 = phi double [ %dz_2, %"loop_body_5:12" ], [ %dz_5, %"exit_if_7:16" ]
  %drPowerN32_3 = phi double [ %drPowerN32_2, %"loop_body_5:12" ], [ %drPowerN32_5, %"exit_if_7:16" ]
  %62 = load i64* %target_temp5, !tbaa !11
  %63 = load i64* %nsteps6, !tbaa !12
  %64 = icmp slt i64 %62, %63
  %65 = icmp ne i1 %64, false
  br i1 %65, label %"loop_body_7:16", label %"exit_for_6:12"

"exit_for_6:12":                                  ; preds = %"for_condition_6:21"
  br label %"for_condition_4:17"

"loop_body_7:16":                                 ; preds = %"for_condition_6:21"
  %66 = load i64* %target_temp5, !tbaa !11
  %67 = load i64* %target_temp5, !tbaa !11
  %68 = add i64 %67, 1
  store i64 %68, i64* %target_temp5, !tbaa !11
  br label %"if_cond_7:19"

"if_cond_7:19":                                   ; preds = %"loop_body_7:16"
  %69 = icmp ne i64 %66, %59
  %70 = icmp ne i1 %69, false
  br i1 %70, label %"if_body_8:20", label %"exit_if_7:16"

"exit_if_7:16":                                   ; preds = %"if_cond_7:19", %"if_body_8:20"
  %drSquared_5 = phi double [ %165, %"if_body_8:20" ], [ %drSquared_3, %"if_cond_7:19" ]
  %Fx_6 = phi double [ %170, %"if_body_8:20" ], [ %Fx_4, %"if_cond_7:19" ]
  %Fy_6 = phi double [ %172, %"if_body_8:20" ], [ %Fy_4, %"if_cond_7:19" ]
  %Fz_6 = phi double [ %174, %"if_body_8:20" ], [ %Fz_4, %"if_cond_7:19" ]
  %dx_5 = phi double [ %130, %"if_body_8:20" ], [ %dx_3, %"if_cond_7:19" ]
  %dy_5 = phi double [ %145, %"if_body_8:20" ], [ %dy_3, %"if_cond_7:19" ]
  %dz_5 = phi double [ %160, %"if_body_8:20" ], [ %dz_3, %"if_cond_7:19" ]
  %drPowerN32_5 = phi double [ %168, %"if_body_8:20" ], [ %drPowerN32_3, %"if_cond_7:19" ]
  %71 = mul i64 %59, %27
  %72 = add i64 0, %71
  %73 = mul i64 0, %29
  %74 = add i64 %72, %73
  %75 = getelementptr i8* %17, i64 %74
  %76 = bitcast i8* %75 to double*
  %77 = load double* %76, !tbaa !2
  %78 = fmul double 1.000000e-02, %Fx_6
  %79 = fadd double %77, %78
  %80 = mul i64 %59, %27
  %81 = add i64 0, %80
  %82 = mul i64 0, %29
  %83 = add i64 %81, %82
  %84 = getelementptr i8* %17, i64 %83
  %85 = bitcast i8* %84 to double*
  store double %79, double* %85, !tbaa !2
  %86 = mul i64 %59, %27
  %87 = add i64 0, %86
  %88 = mul i64 1, %29
  %89 = add i64 %87, %88
  %90 = getelementptr i8* %17, i64 %89
  %91 = bitcast i8* %90 to double*
  %92 = load double* %91, !tbaa !2
  %93 = fmul double 1.000000e-02, %Fy_6
  %94 = fadd double %92, %93
  %95 = mul i64 %59, %27
  %96 = add i64 0, %95
  %97 = mul i64 1, %29
  %98 = add i64 %96, %97
  %99 = getelementptr i8* %17, i64 %98
  %100 = bitcast i8* %99 to double*
  store double %94, double* %100, !tbaa !2
  %101 = mul i64 %59, %27
  %102 = add i64 0, %101
  %103 = mul i64 2, %29
  %104 = add i64 %102, %103
  %105 = getelementptr i8* %17, i64 %104
  %106 = bitcast i8* %105 to double*
  %107 = load double* %106, !tbaa !2
  %108 = fmul double 1.000000e-02, %Fz_6
  %109 = fadd double %107, %108
  %110 = mul i64 %59, %27
  %111 = add i64 0, %110
  %112 = mul i64 2, %29
  %113 = add i64 %111, %112
  %114 = getelementptr i8* %17, i64 %113
  %115 = bitcast i8* %114 to double*
  store double %109, double* %115, !tbaa !2
  br label %"for_condition_6:21"

"if_body_8:20":                                   ; preds = %"if_cond_7:19"
  %116 = mul i64 %66, %12
  %117 = add i64 0, %116
  %118 = mul i64 0, %14
  %119 = add i64 %117, %118
  %120 = getelementptr i8* %2, i64 %119
  %121 = bitcast i8* %120 to double*
  %122 = load double* %121, !tbaa !2
  %123 = mul i64 %59, %12
  %124 = add i64 0, %123
  %125 = mul i64 0, %14
  %126 = add i64 %124, %125
  %127 = getelementptr i8* %2, i64 %126
  %128 = bitcast i8* %127 to double*
  %129 = load double* %128, !tbaa !2
  %130 = fsub double %122, %129
  %131 = mul i64 %66, %12
  %132 = add i64 0, %131
  %133 = mul i64 1, %14
  %134 = add i64 %132, %133
  %135 = getelementptr i8* %2, i64 %134
  %136 = bitcast i8* %135 to double*
  %137 = load double* %136, !tbaa !2
  %138 = mul i64 %59, %12
  %139 = add i64 0, %138
  %140 = mul i64 1, %14
  %141 = add i64 %139, %140
  %142 = getelementptr i8* %2, i64 %141
  %143 = bitcast i8* %142 to double*
  %144 = load double* %143, !tbaa !2
  %145 = fsub double %137, %144
  %146 = mul i64 %66, %12
  %147 = add i64 0, %146
  %148 = mul i64 2, %14
  %149 = add i64 %147, %148
  %150 = getelementptr i8* %2, i64 %149
  %151 = bitcast i8* %150 to double*
  %152 = load double* %151, !tbaa !2
  %153 = mul i64 %59, %12
  %154 = add i64 0, %153
  %155 = mul i64 2, %14
  %156 = add i64 %154, %155
  %157 = getelementptr i8* %2, i64 %156
  %158 = bitcast i8* %157 to double*
  %159 = load double* %158, !tbaa !2
  %160 = fsub double %152, %159
  %161 = fmul double %130, %130
  %162 = fmul double %145, %145
  %163 = fadd double %161, %162
  %164 = fmul double %160, %160
  %165 = fadd double %163, %164
  %166 = call double @"numba.math.['double'].sqrt"(double %165)
  %167 = fadd double %165, %166
  %168 = fdiv double 1.000000e+00, %167
  %169 = fmul double %130, %168
  %170 = fadd double %Fx_4, %169
  %171 = fmul double %145, %168
  %172 = fadd double %Fy_4, %171
  %173 = fmul double %160, %168
  %174 = fadd double %Fz_4, %173
  br label %"exit_if_7:16"

"for_condition_19:17":                            ; preds = %"exit_for_4:8", %"loop_body_20:29"
  %i_4 = phi i64 [ %i_2, %"exit_for_4:8" ], [ %179, %"loop_body_20:29" ]
  %175 = load i64* %target_temp7, !tbaa !13
  %176 = load i64* %nsteps8, !tbaa !14
  %177 = icmp slt i64 %175, %176
  %178 = icmp ne i1 %177, false
  br i1 %178, label %"loop_body_20:29", label %"exit_for_19:8"

"exit_for_19:8":                                  ; preds = %"for_condition_19:17"
  br label %"for_condition_3:16"

"loop_body_20:29":                                ; preds = %"for_condition_19:17"
  %179 = load i64* %target_temp7, !tbaa !13
  %180 = load i64* %target_temp7, !tbaa !13
  %181 = add i64 %180, 1
  store i64 %181, i64* %target_temp7, !tbaa !13
  %182 = mul i64 %179, %12
  %183 = add i64 0, %182
  %184 = mul i64 0, %14
  %185 = add i64 %183, %184
  %186 = getelementptr i8* %2, i64 %185
  %187 = bitcast i8* %186 to double*
  %188 = load double* %187, !tbaa !2
  %189 = mul i64 %179, %27
  %190 = add i64 0, %189
  %191 = mul i64 0, %29
  %192 = add i64 %190, %191
  %193 = getelementptr i8* %17, i64 %192
  %194 = bitcast i8* %193 to double*
  %195 = load double* %194, !tbaa !2
  %196 = fmul double %195, 1.000000e-02
  %197 = fadd double %188, %196
  %198 = mul i64 %179, %12
  %199 = add i64 0, %198
  %200 = mul i64 0, %14
  %201 = add i64 %199, %200
  %202 = getelementptr i8* %2, i64 %201
  %203 = bitcast i8* %202 to double*
  store double %197, double* %203, !tbaa !2
  %204 = mul i64 %179, %12
  %205 = add i64 0, %204
  %206 = mul i64 1, %14
  %207 = add i64 %205, %206
  %208 = getelementptr i8* %2, i64 %207
  %209 = bitcast i8* %208 to double*
  %210 = load double* %209, !tbaa !2
  %211 = mul i64 %179, %27
  %212 = add i64 0, %211
  %213 = mul i64 1, %29
  %214 = add i64 %212, %213
  %215 = getelementptr i8* %17, i64 %214
  %216 = bitcast i8* %215 to double*
  %217 = load double* %216, !tbaa !2
  %218 = fmul double %217, 1.000000e-02
  %219 = fadd double %210, %218
  %220 = mul i64 %179, %12
  %221 = add i64 0, %220
  %222 = mul i64 1, %14
  %223 = add i64 %221, %222
  %224 = getelementptr i8* %2, i64 %223
  %225 = bitcast i8* %224 to double*
  store double %219, double* %225, !tbaa !2
  %226 = mul i64 %179, %12
  %227 = add i64 0, %226
  %228 = mul i64 2, %14
  %229 = add i64 %227, %228
  %230 = getelementptr i8* %2, i64 %229
  %231 = bitcast i8* %230 to double*
  %232 = load double* %231, !tbaa !2
  %233 = mul i64 %179, %27
  %234 = add i64 0, %233
  %235 = mul i64 2, %29
  %236 = add i64 %234, %235
  %237 = getelementptr i8* %17, i64 %236
  %238 = bitcast i8* %237 to double*
  %239 = load double* %238, !tbaa !2
  %240 = fmul double %239, 1.000000e-02
  %241 = fadd double %232, %240
  %242 = mul i64 %179, %12
  %243 = add i64 0, %242
  %244 = mul i64 2, %14
  %245 = add i64 %243, %244
  %246 = getelementptr i8* %2, i64 %245
  %247 = bitcast i8* %246 to double*
  store double %241, double* %247, !tbaa !2
  br label %"for_condition_19:17"

"no_error_23:21":                                 ; preds = %"exit_for_3:4"
  %248 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %248, { i64, i8* }** %return_value
  %249 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %249)
  br label %cleanup_label
}

declare { i64, i8* }* @PyTuple_Pack(i64, ...)

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

declare void @Py_INCREF({ i64, i8* }*)

declare void @Py_XINCREF({ i64, i8* }*)

declare double @"numba.math.['double'].sqrt"(double)

declare void @Py_XDECREF({ i64, i8* }*)

!tbaa = !{!0, !1, !2, !3, !4, !5, !6, !7, !8, !9, !10, !11, !12, !13, !14}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"float3264 *", metadata !1}
!3 = metadata !{metadata !"npy_intp **", metadata !1}
!4 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !3}
!5 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !3}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"unique0", metadata !1}
!8 = metadata !{metadata !"unique1", metadata !1}
!9 = metadata !{metadata !"unique2", metadata !1}
!10 = metadata !{metadata !"unique3", metadata !1}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}


DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = add i64 %1, 1
  store i64 %2, i64* %0
  ret void
}

define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = icmp sgt i64 %1, 1
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %decl
  %3 = bitcast { i64, i8* }* %obj to i64*
  %4 = add i64 %1, -1
  store i64 %4, i64* %3
  br label %if.end

if.else:                                          ; preds = %decl
  call void @Py_DecRef({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret void
}

declare void @Py_DecRef({ i64, i8* }*)

define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  %2 = bitcast { i64, i8* }* %obj to i64*
  %3 = load i64* %2
  %4 = add i64 %3, 1
  store i64 %4, i64* %2
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  call void @Py_DECREF({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
  %data1 = alloca i8*
  %in_shape2 = alloca i64*
  %in_strides3 = alloca i64*
  %src_dim4 = alloca i64
  %index5 = alloca i64
  %result = alloca i8*
  store i8* %data, i8** %data1
  store i64* %in_shape, i64** %in_shape2
  store i64* %in_strides, i64** %in_strides3
  store i64 %src_dim, i64* %src_dim4
  store i64 %index, i64* %index5
  %0 = load i64** %in_strides3
  %1 = load i64* %src_dim4
  %2 = getelementptr inbounds i64* %0, i64 %1
  %3 = load i64* %2
  %4 = mul i64 %3, %index
  %5 = load i8** %data1
  %6 = getelementptr inbounds i8* %5, i64 %4
  store i8* %6, i8** %result
  ret i8* %6
}

define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
  %out_shape1 = alloca i64*
  %out_strides2 = alloca i64*
  %dst_dim3 = alloca i32
  store i64* %out_shape, i64** %out_shape1
  store i64* %out_strides, i64** %out_strides2
  store i32 %dst_dim, i32* %dst_dim3
  %0 = load i64** %out_shape1
  %1 = getelementptr inbounds i64* %0, i32 %dst_dim
  store i64 1, i64* %1
  %2 = load i64** %out_strides2
  %3 = load i32* %dst_dim3
  %4 = getelementptr inbounds i64* %2, i32 %3
  store i64 0, i64* %4
  ret void
}

define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
  %dst_shape1 = alloca i64*
  %src_shape2 = alloca i64*
  %src_strides3 = alloca i64*
  %max_ndim4 = alloca i32
  %ndim5 = alloca i32
  %0 = alloca i32
  store i64* %dst_shape, i64** %dst_shape1
  store i64* %src_shape, i64** %src_shape2
  store i64* %src_strides, i64** %src_strides3
  store i32 %max_ndim, i32* %max_ndim4
  store i32 %ndim, i32* %ndim5
  %1 = load i32* %max_ndim4
  %2 = sub i32 %1, %ndim
  store i32 0, i32* %0
  br label %loop.cond

loop.cond:                                        ; preds = %if.end11, %decl
  %3 = load i32* %0
  %4 = load i32* %ndim5
  %5 = icmp slt i32 %3, %4
  br i1 %5, label %loop.body, label %loop.end

loop.body:                                        ; preds = %loop.cond
  %6 = load i64** %src_shape2
  %7 = getelementptr inbounds i64* %6, i32 %3
  %8 = add i32 %3, %2
  %9 = load i64** %dst_shape1
  %10 = getelementptr inbounds i64* %9, i32 %8
  %11 = load i64* %7
  %12 = icmp eq i64 %11, 1
  br i1 %12, label %if.then, label %if.else

loop.end:                                         ; preds = %if.else7, %loop.cond
  %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
  ret i32 %merge

if.then:                                          ; preds = %loop.body
  %13 = load i64** %src_strides3
  %14 = getelementptr inbounds i64* %13, i32 %3
  store i64 0, i64* %14
  br label %if.end11

if.else:                                          ; preds = %loop.body
  %15 = load i64* %10
  %16 = icmp eq i64 %15, 1
  br i1 %16, label %if.then6, label %if.else7

if.then6:                                         ; preds = %if.else
  store i64 %11, i64* %10
  br label %if.end11

if.else7:                                         ; preds = %if.else
  %17 = icmp ne i64 %11, %15
  br i1 %17, label %loop.end, label %if.end11

if.end11:                                         ; preds = %if.else7, %if.then6, %if.then
  %18 = load i32* %0
  %19 = add i32 %18, 1
  store i32 %19, i32* %0
  br label %loop.cond
}

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %0 = icmp sgt i32 %I, 0
  %1 = icmp sgt i32 %J, 0
  %or.cond = and i1 %0, %1
  br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"

"loop_body_5:8.lr.ph.split.us":                   ; preds = %entry
  %2 = sext i32 %I to i64
  %3 = sext i32 %J to i64
  br label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.us":                              ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
  %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
  %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
  %5 = tail call double @"numba.math.['double'].cos"(double %4)
  %6 = fptosi double %5 to i32
  %7 = add i32 %6, %res_35.us
  %lsr.iv.next = add i64 %lsr.iv, -1
  %exitcond = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"

"for_condition_4:13.loopexit.us":                 ; preds = %"loop_body_6:19.us"
  %exitcond8 = icmp eq i64 %9, %2
  br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.lr.ph.us":                        ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
  %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
  %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
  %9 = add i64 %8, 1
  br label %"loop_body_6:19.us"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13.loopexit.us", %entry
  %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
  ret i32 %res_2.lcssa
}

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %error_label, label %empty

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %empty6, %empty3, %empty, %entry, %empty7
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

empty:                                            ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
  %10 = call i8* @PyErr_Occurred()
  %11 = ptrtoint i8* %10 to i64
  %12 = icmp ne i64 %11, 0
  br i1 %12, label %error_label, label %empty3

empty3:                                           ; preds = %empty
  %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
  %14 = call i8* @PyErr_Occurred()
  %15 = ptrtoint i8* %14 to i64
  %16 = icmp ne i64 %15, 0
  br i1 %16, label %error_label, label %empty6

empty6:                                           ; preds = %empty3
  %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
  %18 = call i8* @PyErr_Occurred()
  %19 = ptrtoint i8* %18 to i64
  %20 = icmp ne i64 %19, 0
  br i1 %20, label %error_label, label %empty7

empty7:                                           ; preds = %empty6
  %21 = sext i32 %17 to i64
  %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
  store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
  %23 = ptrtoint { i64, i8* }* %22 to i64
  %24 = icmp eq i64 %23, 0
  br i1 %24, label %error_label, label %no_error

no_error:                                         ; preds = %empty7
  %25 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %25, { i64, i8* }** %return_value
  %26 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %26)
  br label %cleanup_label
}

declare i8* @PyErr_Occurred()

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
  %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  tail call void @Py_INCREF({ i64, i8* }* %0)
  %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 2
  %2 = load i8** %1, align 8, !tbaa !7
  %3 = ptrtoint i8* %2 to i64
  %4 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 5
  %5 = load i64** %4, align 8, !tbaa !10
  %6 = load i64* %5, align 8
  %7 = getelementptr i64* %5, i64 1
  %8 = load i64* %7, align 8
  %9 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  tail call void @Py_INCREF({ i64, i8* }* %9)
  %10 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 2
  %11 = load i8** %10, align 8, !tbaa !7
  %12 = ptrtoint i8* %11 to i64
  %13 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 5
  %14 = load i64** %13, align 8, !tbaa !10
  %15 = load i64* %14, align 8
  %16 = getelementptr i64* %14, i64 1
  %17 = load i64* %16, align 8
  %18 = shl i64 %17, 1
  %19 = shl i64 %8, 1
  %20 = sub i64 0, %3
  %21 = sub i64 0, %12
  br label %"loop_body_4:8"

cleanup_label:                                    ; preds = %"no_error_23:21", %error_label
  %return_value.0.load14 = phi { i64, i8* }* [ %25, %"no_error_23:21" ], [ null, %error_label ]
  %22 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  %23 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  tail call void @Py_XDECREF({ i64, i8* }* %25)
  tail call void @Py_XDECREF({ i64, i8* }* %22)
  tail call void @Py_XDECREF({ i64, i8* }* %23)
  ret { i64, i8* }* %return_value.0.load14

error_label:                                      ; preds = %"exit_for_3:4"
  tail call void @Py_XINCREF({ i64, i8* }* null)
  br label %cleanup_label

"for_condition_3:16.loopexit":                    ; preds = %"loop_body_20:29"
  %24 = add i64 %27, 1
  %exitcond35 = icmp eq i64 %24, 15
  br i1 %exitcond35, label %"exit_for_3:4", label %"loop_body_4:8"

"exit_for_3:4":                                   ; preds = %"for_condition_3:16.loopexit"
  %25 = tail call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev)
  %26 = icmp eq { i64, i8* }* %25, null
  br i1 %26, label %error_label, label %"no_error_23:21"

"loop_body_4:8":                                  ; preds = %"for_condition_3:16.loopexit", %entry
  %27 = phi i64 [ 0, %entry ], [ %24, %"for_condition_3:16.loopexit" ]
  br label %"loop_body_5:12"

"for_condition_4:17.loopexit":                    ; preds = %"exit_if_7:16"
  %28 = add i64 %29, 1
  %exitcond33 = icmp eq i64 %28, 150
  br i1 %exitcond33, label %"loop_body_20:29", label %"loop_body_5:12"

"loop_body_5:12":                                 ; preds = %"for_condition_4:17.loopexit", %"loop_body_4:8"
  %29 = phi i64 [ 0, %"loop_body_4:8" ], [ %28, %"for_condition_4:17.loopexit" ]
  %30 = mul i64 %29, %15
  %31 = add i64 %30, %17
  %32 = add i64 %30, %18
  %33 = mul i64 %29, %6
  %34 = add i64 %33, %8
  %35 = add i64 %33, %19
  br label %"loop_body_7:16"

"loop_body_7:16":                                 ; preds = %"exit_if_7:16", %"loop_body_5:12"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"exit_if_7:16" ], [ %20, %"loop_body_5:12" ]
  %Fz_418 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
  %Fy_417 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
  %Fx_416 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
  %36 = phi i64 [ 0, %"loop_body_5:12" ], [ %37, %"exit_if_7:16" ]
  %37 = add i64 %36, 1
  %38 = icmp eq i64 %29, %36
  br i1 %38, label %"exit_if_7:16", label %"if_body_8:20"

"exit_if_7:16":                                   ; preds = %"loop_body_7:16", %"if_body_8:20"
  %Fx_6 = phi double [ %70, %"if_body_8:20" ], [ %Fx_416, %"loop_body_7:16" ]
  %Fy_6 = phi double [ %72, %"if_body_8:20" ], [ %Fy_417, %"loop_body_7:16" ]
  %Fz_6 = phi double [ %74, %"if_body_8:20" ], [ %Fz_418, %"loop_body_7:16" ]
  %sunkaddr = ptrtoint i8* %11 to i64
  %sunkaddr23 = add i64 %sunkaddr, %30
  %sunkaddr24 = inttoptr i64 %sunkaddr23 to double*
  %39 = load double* %sunkaddr24, align 8, !tbaa !7
  %40 = fmul double %Fx_6, 1.000000e-02
  %41 = fadd double %40, %39
  store double %41, double* %sunkaddr24, align 8, !tbaa !7
  %sunkaddr25 = ptrtoint i8* %11 to i64
  %sunkaddr26 = add i64 %sunkaddr25, %31
  %sunkaddr27 = inttoptr i64 %sunkaddr26 to double*
  %42 = load double* %sunkaddr27, align 8, !tbaa !7
  %43 = fmul double %Fy_6, 1.000000e-02
  %44 = fadd double %43, %42
  store double %44, double* %sunkaddr27, align 8, !tbaa !7
  %sunkaddr28 = ptrtoint i8* %11 to i64
  %sunkaddr29 = add i64 %sunkaddr28, %32
  %sunkaddr30 = inttoptr i64 %sunkaddr29 to double*
  %45 = load double* %sunkaddr30, align 8, !tbaa !7
  %46 = fmul double %Fz_6, 1.000000e-02
  %47 = fadd double %46, %45
  store double %47, double* %sunkaddr30, align 8, !tbaa !7
  %lsr.iv.next = sub i64 %lsr.iv, %6
  %exitcond = icmp eq i64 150, %37
  br i1 %exitcond, label %"for_condition_4:17.loopexit", label %"loop_body_7:16"

"if_body_8:20":                                   ; preds = %"loop_body_7:16"
  %48 = mul i64 %lsr.iv, -1
  %49 = inttoptr i64 %48 to double*
  %50 = load double* %49, align 8, !tbaa !7
  %sunkaddr31 = ptrtoint i8* %2 to i64
  %sunkaddr32 = add i64 %sunkaddr31, %33
  %sunkaddr33 = inttoptr i64 %sunkaddr32 to double*
  %51 = load double* %sunkaddr33, align 8, !tbaa !7
  %52 = fsub double %50, %51
  %53 = mul i64 %lsr.iv, -1
  %sunkaddr34 = add i64 %8, %53
  %sunkaddr35 = inttoptr i64 %sunkaddr34 to double*
  %54 = load double* %sunkaddr35, align 8, !tbaa !7
  %sunkaddr36 = ptrtoint i8* %2 to i64
  %sunkaddr37 = add i64 %sunkaddr36, %34
  %sunkaddr38 = inttoptr i64 %sunkaddr37 to double*
  %55 = load double* %sunkaddr38, align 8, !tbaa !7
  %56 = fsub double %54, %55
  %57 = mul i64 %lsr.iv, -1
  %sunkaddr39 = mul i64 %8, 2
  %sunkaddr40 = add i64 %57, %sunkaddr39
  %sunkaddr41 = inttoptr i64 %sunkaddr40 to double*
  %58 = load double* %sunkaddr41, align 8, !tbaa !7
  %sunkaddr42 = ptrtoint i8* %2 to i64
  %sunkaddr43 = add i64 %sunkaddr42, %35
  %sunkaddr44 = inttoptr i64 %sunkaddr43 to double*
  %59 = load double* %sunkaddr44, align 8, !tbaa !7
  %60 = fsub double %58, %59
  %61 = fmul double %52, %52
  %62 = fmul double %56, %56
  %63 = fadd double %61, %62
  %64 = fmul double %60, %60
  %65 = fadd double %63, %64
  %66 = tail call double @"numba.math.['double'].sqrt"(double %65)
  %67 = fadd double %66, %65
  %68 = fdiv double 1.000000e+00, %67
  %69 = fmul double %52, %68
  %70 = fadd double %Fx_416, %69
  %71 = fmul double %56, %68
  %72 = fadd double %Fy_417, %71
  %73 = fmul double %60, %68
  %74 = fadd double %Fz_418, %73
  br label %"exit_if_7:16"

"loop_body_20:29":                                ; preds = %"for_condition_4:17.loopexit", %"loop_body_20:29"
  %lsr.iv21 = phi i64 [ %lsr.iv.next22, %"loop_body_20:29" ], [ 150, %"for_condition_4:17.loopexit" ]
  %lsr.iv15 = phi i64 [ %lsr.iv.next16, %"loop_body_20:29" ], [ %21, %"for_condition_4:17.loopexit" ]
  %lsr.iv4 = phi i8* [ %91, %"loop_body_20:29" ], [ %2, %"for_condition_4:17.loopexit" ]
  %lsr.iv414 = bitcast i8* %lsr.iv4 to double*
  %lsr.iv45 = bitcast i8* %lsr.iv4 to i1*
  %75 = load double* %lsr.iv414, align 8, !tbaa !7
  %76 = mul i64 %lsr.iv15, -1
  %77 = inttoptr i64 %76 to double*
  %78 = load double* %77, align 8, !tbaa !7
  %79 = fmul double %78, 1.000000e-02
  %80 = fadd double %75, %79
  store double %80, double* %lsr.iv414, align 8, !tbaa !7
  %scevgep12 = getelementptr i8* %lsr.iv4, i64 %8
  %scevgep1213 = bitcast i8* %scevgep12 to double*
  %81 = load double* %scevgep1213, align 8, !tbaa !7
  %82 = mul i64 %lsr.iv15, -1
  %sunkaddr45 = add i64 %17, %82
  %sunkaddr46 = inttoptr i64 %sunkaddr45 to double*
  %83 = load double* %sunkaddr46, align 8, !tbaa !7
  %84 = fmul double %83, 1.000000e-02
  %85 = fadd double %81, %84
  %scevgep10 = getelementptr i8* %lsr.iv4, i64 %8
  %scevgep1011 = bitcast i8* %scevgep10 to double*
  store double %85, double* %scevgep1011, align 8, !tbaa !7
  %sunkaddr47 = ptrtoint i8* %lsr.iv4 to i64
  %sunkaddr48 = mul i64 %8, 2
  %sunkaddr49 = add i64 %sunkaddr47, %sunkaddr48
  %sunkaddr50 = inttoptr i64 %sunkaddr49 to double*
  %86 = load double* %sunkaddr50, align 8, !tbaa !7
  %87 = mul i64 %lsr.iv15, -1
  %sunkaddr51 = mul i64 %17, 2
  %sunkaddr52 = add i64 %87, %sunkaddr51
  %sunkaddr53 = inttoptr i64 %sunkaddr52 to double*
  %88 = load double* %sunkaddr53, align 8, !tbaa !7
  %89 = fmul double %88, 1.000000e-02
  %90 = fadd double %86, %89
  %sunkaddr54 = ptrtoint i8* %lsr.iv4 to i64
  %sunkaddr55 = mul i64 %8, 2
  %sunkaddr56 = add i64 %sunkaddr54, %sunkaddr55
  %sunkaddr57 = inttoptr i64 %sunkaddr56 to double*
  store double %90, double* %sunkaddr57, align 8, !tbaa !7
  %scevgep = getelementptr i1* %lsr.iv45, i64 %6
  %91 = bitcast i1* %scevgep to i8*
  %lsr.iv.next16 = sub i64 %lsr.iv15, %15
  %lsr.iv.next22 = add i64 %lsr.iv21, -1
  %exitcond34 = icmp eq i64 %lsr.iv.next22, 0
  br i1 %exitcond34, label %"for_condition_3:16.loopexit", label %"loop_body_20:29"

"no_error_23:21":                                 ; preds = %"exit_for_3:4"
  tail call void @Py_XINCREF({ i64, i8* }* %25)
  br label %cleanup_label
}

declare { i64, i8* }* @PyTuple_Pack(i64, ...)

declare double @"numba.math.['double'].sqrt"(double)

define { i64, i8* }* @__numba_specialized_3_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_nbody_hy(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %cleanup.if.true, label %cleanup.if.end

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %cleanup.if.end, %cleanup.if.true
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

cleanup.if.true:                                  ; preds = %entry
  br label %error_label

cleanup.if.end:                                   ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = bitcast { i64, i8* }* %7 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %10 = bitcast { i64, i8* }* %8 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %11 = call { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %9, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %10)
  store { i64, i8* }* %11, { i64, i8* }** %objtemp, !tbaa !6
  %12 = ptrtoint { i64, i8* }* %11 to i64
  %13 = icmp eq i64 %12, 0
  br i1 %13, label %error_label, label %no_error

no_error:                                         ; preds = %cleanup.if.end
  %14 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %14, { i64, i8* }** %return_value
  %15 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %15)
  br label %cleanup_label
}

!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6, !0, !1, !7, !8, !9, !10, !6, !2, !3, !4, !5, !11, !12, !13, !14, !0, !1, !6}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"float3264 *", metadata !1}
!8 = metadata !{metadata !"npy_intp **", metadata !1}
!9 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !8}
!10 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !8}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}


CPU times: user 424 ms, sys: 35.5 ms, total: 460 ms
Wall time: 435 ms

Speed Comparisons

The dynamically compiled version is by far the fastest.

In [85]:
func_list = ['nbody_py', 'nbody_np', 'nbody_nb', 'nbody_hy']
data_list = ['p_py, pv_py', 'p_np, pv_np', 'p_nb, pv_nb', 'p_nb, pv_nb']
In [86]:
perf_comp_data(func_list, data_list, rep=3)
function: nbody_nb, av. time sec:   0.00607, relative:    1.0
function: nbody_np, av. time sec:   0.09028, relative:   14.9
function: nbody_py, av. time sec:   0.54923, relative:   90.5
function: nbody_hy, av. time sec:   5.44442, relative:  896.8

Performance Python Approaches

Nowadays, the Python ecosystem provides a number of ways to improve the performance of code.

  • paradigms: some Python paradigms might be more performant than others given a specific problem
  • libraries: there is a wealth of libraries available for different types of problems which often lead to much higher performance given a problem that fits into the scope of the library
  • compiling: a number of powerful compiling solutions are available, static and dynamic ones
  • parallelization: some Python libraries have built-in parallellization capabilities, others allow to harness the full power of multiple cores CPUs, whole clusters or GPUs

A major benefit of the Python ecosystem is, that all these approaches generally are easily implementable, meaning that the additional effort included is generally quite low (even for non-experts). In other words, performance improvements often are low hanging fruits.

The Python Quants



The Python Quants GmbH

The Python Quants – the company Web site

www.pythonquants.com

Dr. Yves J. Hilpisch – my personal Web site

www.hilpisch.com

Derivatives Analytics with Python – my new book

Read About and Order the Book

Contact Us

yves@pythonquants.com

@dyjh