The Python Quants

Performance Python¶

Dr. Yves J. Hilpisch

The Python Quants GmbH

www.pythonquants.com

yves@pythonquants.com

@dyjh

PyData, New York City – 08. November 2013

Performance Python¶

Python for Performance Computing:

measure: getting to results faster
pillar 1: agile interactive analytics & prototyping
pillar 2: rapid application development & deployment
pillar 3: performant execution of algorithms & code

This tutorial focuses on

some fundamental Python issues when it comes to performance
selected approaches to improve the performance of algorithms & code
simple, interactive examples

It does not address such important issues like

architectural issues regarding hardware and software
development processes, testing, documentation and production
real world problem modeling

Some Python Fundamentals¶

Fundamental Python Libraries¶

A fundamental Python stack for interactive data analytics and visualization should at least contain the following libraries tools:

Python – the Python interpreter itself
NumPy – high performance, flexible array structures and operations
SciPy – collection of scientific modules and functions (e.g. for regression, optimization, integration)
pandas – time series and panel data analysis and I/O
PyTables – hierarchical, high performance database (e.g. for out-of-memory analytics)
matplotlib – 2d and 3d visualization
IPython – interactive data analytics, visualization, publishing

It is best to use either the Python distribution Anaconda or the Web-based analytics environment Wakari. Both provide almost "complete" Python environments.

When it comes to performance critical operations, a number of additional libraries can be used to speed-up code execution.

numexpr – for fast numerical operations
Numba – for dynamically compiling Python code for the CPU
NumbaPro – for dynamically compiling Python code for multi-core CPU and GPUs
Cython – for merging Python with C paradigms for static compilation
IPython.Parallel – for the parallel execution of code/functions over a cluster
... many others

All these are also part of Anaconda and Wakari.

Comparing Performance of Python Functions¶

We define a convenience function to compare the performance of a set of Python functions executed on the same data or on different data.

In [1]:

def perf_comp_data(func_list, data_list, rep=3, number=1):
    from timeit import repeat
    res_list = {}
    for name in enumerate(func_list):
        stmt = name[1] + '(' + data_list[name[0]] + ')'
        setup = "from __main__ import " + name[1] + ', ' + data_list[name[0]]
        results = repeat(stmt=stmt, setup=setup, repeat=rep, number=number)
        res_list[name[1]] = sum(results) / rep
    res_sort = sorted(res_list.iteritems(), key=lambda (k, v): (v, k))
    for item in res_sort:
        rel = item[1] / res_sort[0][1]
        print 'function: ' + item[0] + ', av. time sec: %9.5f, ' % item[1] \
            + 'relative: %6.1f' % rel

Speeding Up Numerical Computations¶

A typical compute intensive operation is to evaluate a complex mathematical expression on a large array of data.

First, let us define an example function/numerical expression.

In [2]:

from math import *
def f(x):
    return abs(cos(x)) ** 0.5 + sin(2 + 3 * x)

Second, as our benchmark case a pure Python implementation.

In [3]:

I = 200000
a_py = range(I)

In [4]:

def f1(a):
    res = []
    for x in a:
        res.append(f(x))
    return res

One can also use different paradigms to implement the same function, like iterators or the eval function.

In [5]:

def f2(a):
    return [f(x) for x in a]

In [6]:

def f3(a):
    ex = 'abs(cos(x)) ** 0.5 + sin(2 + 3 * x)'
    return [eval(ex) for x in a]

Third, a vectorized implementations with NumPy.

In [7]:

import numpy as np

In [8]:

a_np = np.arange(I)

In [9]:

def f4(a):
    return (np.abs(np.cos(a)) ** 0.5 +
            np.sin(2 + 3 * a))

Fourth, we use numexpr to evaluate the numerical expression.

In [10]:

import numexpr as ne

In [11]:

def f5(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(1)
    return ne.evaluate(ex)

In [12]:

def f6(a):
    ex = 'abs(cos(a)) ** 0.5 + sin(2 + 3 * a)'
    ne.set_num_threads(4)
    return ne.evaluate(ex)

Now, let's compare the performance of all different function implementations.

In [13]:

func_list = ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
data_list = ['a_py', 'a_py', 'a_py', 'a_np', 'a_np', 'a_np']

In [14]:

perf_comp_data(func_list, data_list, rep=1)

function: f6, av. time sec:   0.00627, relative:    1.0
function: f5, av. time sec:   0.01631, relative:    2.6
function: f4, av. time sec:   0.01861, relative:    3.0
function: f2, av. time sec:   0.15558, relative:   24.8
function: f1, av. time sec:   0.27338, relative:   43.6
function: f3, av. time sec:   6.27276, relative: 1000.1

Speeding Up Code Through Memory Layout¶

Depending on the respective algorithm you implement with NumPy, memory layout can play a non-negligible role.

In [15]:

import numpy as np

In [16]:

np.zeros((3, 3), dtype=np.float64, order='C')

Out[16]:

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

The way you initialize NumPy array can have a significant influence on the performance of operations on these arrays (given a certain size of the array).

shape: either an int, a sequence of ints or a reference to another numpy.ndarray
dtype (optional): a numpy.dtype which are NumPy-specific basic data types for numpy.ndarray objects
order (optional): the order of storing elements in memory, 'C' for C-like (i.e. row-wise) or 'F' for Fortran-like (i.e. column-wise)

Consider the C-like, i.e. row-wise, storage.

In [17]:

c = np.array([[ 1.,  1.,  1.],
              [ 2.,  2.,  2.],
              [ 3.,  3.,  3.]], order='C')

In this case, the 1s, the 2s and the 3s are stored next to each other.

By contrast, consider the Fortran-like, i.e. column-wise, storage.

In [18]:

f = np.array([[ 1.,  1.,  1.],
              [ 2.,  2.,  2.],
              [ 3.,  3.,  3.]], order='F')

Now, the data is stored in a way that 1, 2, 3 are next to each other for each column.

Let's see whether the memory layout makes a difference in some way when the array is large.

In [19]:

x = np.random.standard_normal((3, 1500000))
C = np.array(x, order='C')
F = np.array(x, order='F')
x = 0.0

Let's implement standard operations on the C-like layout array.

First, calculating sums.

In [20]:

%timeit C.sum(axis=0)

100 loops, best of 3: 11.1 ms per loop

In [21]:

%timeit C.sum(axis=1)

100 loops, best of 3: 5.93 ms per loop

Second, standard deviations.

In [22]:

%timeit C.std(axis=0)

10 loops, best of 3: 69.7 ms per loop

In [23]:

%timeit C.std(axis=1)

10 loops, best of 3: 31.6 ms per loop

Now the Fortran-like layout. Sums first.

In [24]:

%timeit F.sum(axis=0)

10 loops, best of 3: 35.6 ms per loop

In [25]:

%timeit F.sum(axis=1)

10 loops, best of 3: 26.4 ms per loop

Standard deviations second.

In [26]:

%timeit F.std(axis=0)

10 loops, best of 3: 117 ms per loop

In [27]:

%timeit F.std(axis=1)

10 loops, best of 3: 70.8 ms per loop

Speeding Up I/O Operations¶

SQL Databases¶

Currently, and in the foreseeable future as well, most corporate data is stored in SQL databases. We take this as a starting point and generate a SQL table with dummy data.

In [29]:

filename = 'numbers'
import sqlite3 as sq3  # imports the SQLite library
# Remove files if they exist
import os
try:
    os.remove(filename + '.db')
    os.remove(filename + '.h5')
    os.remove(filename + '.hs')
    os.remove(filename + '.csv')
except:
    pass

We create a table and populate the table with the dummy data.

In [30]:

query = 'CREATE TABLE numbs (no1 real, no2 real, no3 real, no4 real, no5 real)'
con = sq3.connect(filename + '.db')
con.execute(query)
con.commit()

In [31]:

# Writing Random Numbers to SQLite3 Database
rows = 5000000  # 5,000,000 rows for the SQL table (ca. 250 MB in size)
nr = np.random.standard_normal((rows, 5))
%time con.executemany('INSERT INTO numbs VALUES(?, ?, ?, ?, ?)', nr)
con.commit()

CPU times: user 1min 2s, sys: 962 ms, total: 1min 3s
Wall time: 1min 3s

We can now read the data into a NumPy array.

In [32]:

# Reading a couple of rows from SQLite3 database table
array(con.execute('SELECT * FROM numbs').fetchmany(3)).round(3)

Out[32]:

array([[ 0.149, -0.874,  0.736, -0.551, -1.399],
       [ 1.299, -0.281, -1.416,  0.995,  0.337],
       [ 0.87 , -1.057,  0.544, -1.416, -2.613]])

In [33]:

# Reading the whole table into an NumPy array
%time result = array(con.execute('SELECT * FROM numbs').fetchall())

CPU times: user 39.3 s, sys: 1.83 s, total: 41.1 s
Wall time: 41.6 s

PyTables, which is based on the HDF5 file format, allows much faster, disk-based I/O than a typical SQL database. For example, writing the same data is much faster compared to SQLite3.

In [34]:

import tables as tb

In [35]:

%%time
h5 = tb.open_file('numbers.h5', 'w')
h5.create_array('/', 'numbs', obj=result, title='numbs')
h5.close()

CPU times: user 3.89 ms, sys: 209 ms, total: 213 ms
Wall time: 370 ms

Reading speed is equally impressive.

In [37]:

%%time
h5 = tb.open_file('numbers.h5', 'r')
arr = h5.root.numbs.read()
h5.close()

CPU times: user 2.6 ms, sys: 167 ms, total: 170 ms
Wall time: 158 ms

In [38]:

arr.nbytes

Out[38]:

200000000

Reading and Writing Data with pandas¶

pandas is optimized both for convenience and performance when it comes to analytics and I/O operations.

In [40]:

import pandas as pd
import pandas.io.sql as pds

In [41]:

%time data = pds.read_frame('SELECT * FROM numbs', con)
con.close()

CPU times: user 18.3 s, sys: 1.51 s, total: 19.8 s
Wall time: 20.2 s

In [42]:

data.head()

Out[42]:

	no1	no2	no3	no4	no5
0	0.149151	-0.873864	0.736056	-0.550805	-1.399441
1	1.298855	-0.281183	-1.415695	0.995101	0.337464
2	0.870333	-1.056957	0.543588	-1.416220	-2.613365
3	0.509911	0.139642	0.152018	0.197079	0.292907
4	-1.048647	-1.113482	-0.037894	1.216061	-0.263849

The plotting of larger data sets often is a problem. This can easily be accomplished here by only plotting every 500th data row, for instance.

In [43]:

%time data[::500].cumsum().plot(grid=True, legend=True)  # every 500th row is plotted

CPU times: user 106 ms, sys: 5.53 ms, total: 111 ms
Wall time: 119 ms

Out[43]:

<matplotlib.axes.AxesSubplot at 0x107cae990>

There are some reasons why you may want to write the DataFrame to disk:

Database Performance: you do not want to query the SQL database everytime anew
IO Speed: you want to increase data/information retrievel
Data Structure: maybe you want to store additional data with the original data set or you have generated a DataFrame out of multiple sources

pandas is tightly integrated with PyTables. It makes data storage and I/O operations with pandas not only convenient but quite fast as well.

In [44]:

%%time
h5 = pd.HDFStore(filename + '.hs', 'w')
h5['data'] = data  # writes the DataFrame to HDF5 database
h5.close()

CPU times: user 263 ms, sys: 374 ms, total: 638 ms
Wall time: 793 ms

Now, we can read the data from the HDF5 file into another DataFrame object. Performance of such an operation generally is quite high.

In [45]:

%%time
h5 = pd.HDFStore(filename + '.hs', 'r')
data_new = h5['data']
h5.close()

CPU times: user 30 ms, sys: 276 ms, total: 306 ms
Wall time: 305 ms

In [46]:

%time data.sum() - data_new.sum()   # checking if the DataFrames are the same

CPU times: user 750 ms, sys: 231 ms, total: 981 ms
Wall time: 976 ms

Out[46]:

no1    0
no2    0
no3    0
no4    0
no5    0
dtype: float64

Some final checks.

In [48]:

ll numb*

-rw-r--r--  1 yhilpisch  staff  272386048  8 Nov 14:26 numbers.db
-rw-rw-rw-  1 yhilpisch  staff  200002304  8 Nov 14:26 numbers.h5
-rw-rw-rw-  1 yhilpisch  staff  240007368  8 Nov 14:27 numbers.hs

Speeding Up Code Execution through Distributed Computing¶

The algorithm we (arbitrarily) choose is the Monte Carlo estimator for a European option value in the Black-Scholes-Merton set-up. In this set-up the underlying follows the stochastic differential equation:

\[dS_t = r S_t dt + \sigma S_t dZ_t\]

The Monte Carlo estimator for a call option value looks like follows:

\(C_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} \max[S_T(i) - K, 0]\)

where \(S_T(i)\) is the \(i\)-th simulated value of the underlying at maturity \(T\).

A dynamic implementation of the BSM Monte Carlo valuation could look like follows.

In [50]:

def BSM_MCS(S0):
    ''' Dynamic Black-Scholes-Merton MCS Estimator for European Calls. '''
    import numpy as np
    K = 100.; T = 1.0; r = 0.05; vola = 0.2
    M = 50; I = 20000
    dt = T / M
    rand = np.random.standard_normal((M + 1, I))
    S = np.zeros((M + 1, I)); S[0] = S0
    for t in range(1, M + 1):
        S[t] = S[t-1] * np.exp((r - 0.5 * vola ** 2) * dt + vola * np.sqrt(dt) * rand[t])
    BSM_MCS_Value = np.sum(np.maximum(S[-1] - K, 0)) / I
    return BSM_MCS_Value

The Sequential Calculation¶

The benchmark case of sequential valuation of \(n\) options.

In [51]:

def seqValue(n):
    optionValues = []
    for S in linspace(80, 100, n):
        optionValues.append(BSM_MCS(S))
    return optionValues

The Parallel Calculation¶

First, start a (local) cluster.

In [52]:

# in the shell ...
####
# ipcluster start -n 4
####
# or maybe more cores

Second, enable parallel computing capabilities.

In [53]:

from IPython.parallel import Client
c = Client(profile="default")
view = c.load_balanced_view()

The function for the parallel execution of a number \(n\) of different options valuations.

In [54]:

def parValue(n):
    optionValues = []
    for S in linspace(80, 100, n):
        value = view.apply_async(BSM_MCS, S)
        # asynchronously calculate the option values
        optionValues.append(value)
    c.wait(optionValues)
    return optionValues

Let us compare performance of the two different approaches.

In [55]:

n = 20  # number of option valuations
func_list = ['seqValue', 'parValue']
data_list = 2 * ['n']

In [56]:

perf_comp_data(func_list, data_list, rep=3)

function: parValue, av. time sec:   0.75665, relative:    1.0
function: seqValue, av. time sec:   1.22566, relative:    1.6

Performance scales linearly with the number of CPU cores available.

Speeding Up Code through Dynamic Compiling¶

Numba is an open-source, NumPy-aware optimizing compiler for Python sponsored by The Python Quants GmbH It uses the remarkable LLVM compiler infrastructure to compile Python byte-code to machine code especially for use in the NumPy run-time and SciPy modules.

Introductory Examples¶

Let us start with a problem that typically lead to performance issues in Python: alogrithms with a nested loops. A sandbox variant can illustrate the problem.

In [57]:

from math import cos, log
def f_py(I, J):
    res = 0
    for i in range(I):
        for j in range (J):
            res += int(cos(log(1)))
    return res

In [58]:

I, J = 1000, 1000
%time f_py(I, J)

CPU times: user 491 ms, sys: 6.89 ms, total: 498 ms
Wall time: 494 ms

Out[58]:

In principle, this can be vectorized with NumPy arrays.

In [59]:

def f_np(I, J):
    a = np.ones((I, J), dtype=np.float64)
    return int(np.sum(np.cos(np.log(a)))), a

In [60]:

%time res, a = f_np(I, J)

CPU times: user 20.3 ms, sys: 6.44 ms, total: 26.7 ms
Wall time: 25.8 ms

This is much faster, but not really memory efficient ...

In [61]:

a.nbytes

Out[61]:

Numba can provide help.

In [62]:

import numba as nb

In [63]:

%time f_nb = nb.autojit(f_py)

CPU times: user 2.47 ms, sys: 657 µs, total: 3.13 ms
Wall time: 2.73 ms

It preserves memory efficiency while speeding up execution considerably in this case.

In [64]:

%time f_nb(I, J)


DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.f_py.107dfe578'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %nsteps2 = alloca i64
  %target_temp1 = alloca i64
  %nsteps = alloca i64
  %target_temp = alloca i64
  %return_value = alloca i32
  store i64 0, i64* %target_temp, !tbaa !2
  %0 = sext i32 %I to i64
  store i64 %0, i64* %nsteps, !tbaa !3
  br label %"for_condition_4:13"

cleanup_label:                                    ; preds = %"exit_for_4:4", %error_label
  %1 = load i32* %return_value
  ret i32 %1

error_label:                                      ; No predecessors!
  store i32 123456789, i32* %return_value
  br label %cleanup_label

"for_condition_4:13":                             ; preds = %entry, %"exit_for_5:8"
  %res_2 = phi i32 [ 0, %entry ], [ %res_3, %"exit_for_5:8" ]
  %j_1 = phi i64 [ 123456789, %entry ], [ %j_2, %"exit_for_5:8" ]
  %2 = load i64* %target_temp, !tbaa !2
  %3 = load i64* %nsteps, !tbaa !3
  %4 = icmp slt i64 %2, %3
  %5 = icmp ne i1 %4, false
  br i1 %5, label %"loop_body_5:8", label %"exit_for_4:4"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13"
  store i32 %res_2, i32* %return_value
  br label %cleanup_label

"loop_body_5:8":                                  ; preds = %"for_condition_4:13"
  %6 = load i64* %target_temp, !tbaa !2
  %7 = load i64* %target_temp, !tbaa !2
  %8 = add i64 %7, 1
  store i64 %8, i64* %target_temp, !tbaa !2
  store i64 0, i64* %target_temp1, !tbaa !4
  %9 = sext i32 %J to i64
  store i64 %9, i64* %nsteps2, !tbaa !5
  br label %"for_condition_5:17"

"for_condition_5:17":                             ; preds = %"loop_body_5:8", %"loop_body_6:19"
  %res_3 = phi i32 [ %20, %"loop_body_6:19" ], [ %res_2, %"loop_body_5:8" ]
  %j_2 = phi i64 [ %14, %"loop_body_6:19" ], [ %j_1, %"loop_body_5:8" ]
  %10 = load i64* %target_temp1, !tbaa !4
  %11 = load i64* %nsteps2, !tbaa !5
  %12 = icmp slt i64 %10, %11
  %13 = icmp ne i1 %12, false
  br i1 %13, label %"loop_body_6:19", label %"exit_for_5:8"

"exit_for_5:8":                                   ; preds = %"for_condition_5:17"
  br label %"for_condition_4:13"

"loop_body_6:19":                                 ; preds = %"for_condition_5:17"
  %14 = load i64* %target_temp1, !tbaa !4
  %15 = load i64* %target_temp1, !tbaa !4
  %16 = add i64 %15, 1
  store i64 %16, i64* %target_temp1, !tbaa !4
  %17 = call double @"numba.math.['double'].log"(double 1.000000e+00)
  %18 = call double @"numba.math.['double'].cos"(double %17)
  %19 = fptosi double %18 to i32
  %20 = add i32 %res_3, %19
  br label %"for_condition_5:17"
}

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

!tbaa = !{!0, !1, !2, !3, !4, !5}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}


DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = add i64 %1, 1
  store i64 %2, i64* %0
  ret void
}

define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = icmp sgt i64 %1, 1
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %decl
  %3 = add i64 %1, -1
  store i64 %3, i64* %0
  br label %if.end

if.else:                                          ; preds = %decl
  call void @Py_DecRef({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret void
}

declare void @Py_DecRef({ i64, i8* }*)

define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  %2 = bitcast { i64, i8* }* %obj to i64*
  %3 = load i64* %2
  %4 = add i64 %3, 1
  store i64 %4, i64* %2
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  call void @Py_DECREF({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
  %data1 = alloca i8*
  %in_shape2 = alloca i64*
  %in_strides3 = alloca i64*
  %src_dim4 = alloca i64
  %index5 = alloca i64
  %result = alloca i8*
  store i8* %data, i8** %data1
  store i64* %in_shape, i64** %in_shape2
  store i64* %in_strides, i64** %in_strides3
  store i64 %src_dim, i64* %src_dim4
  store i64 %index, i64* %index5
  %0 = load i64** %in_strides3
  %1 = load i64* %src_dim4
  %2 = getelementptr inbounds i64* %0, i64 %1
  %3 = load i64* %2
  %4 = mul i64 %3, %index
  %5 = load i8** %data1
  %6 = getelementptr inbounds i8* %5, i64 %4
  store i8* %6, i8** %result
  ret i8* %6
}

define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
  %out_shape1 = alloca i64*
  %out_strides2 = alloca i64*
  %dst_dim3 = alloca i32
  store i64* %out_shape, i64** %out_shape1
  store i64* %out_strides, i64** %out_strides2
  store i32 %dst_dim, i32* %dst_dim3
  %0 = load i64** %out_shape1
  %1 = getelementptr inbounds i64* %0, i32 %dst_dim
  store i64 1, i64* %1
  %2 = load i64** %out_strides2
  %3 = load i32* %dst_dim3
  %4 = getelementptr inbounds i64* %2, i32 %3
  store i64 0, i64* %4
  ret void
}

define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
  %dst_shape1 = alloca i64*
  %src_shape2 = alloca i64*
  %src_strides3 = alloca i64*
  %max_ndim4 = alloca i32
  %ndim5 = alloca i32
  %0 = alloca i32
  store i64* %dst_shape, i64** %dst_shape1
  store i64* %src_shape, i64** %src_shape2
  store i64* %src_strides, i64** %src_strides3
  store i32 %max_ndim, i32* %max_ndim4
  store i32 %ndim, i32* %ndim5
  %1 = load i32* %max_ndim4
  %2 = sub i32 %1, %ndim
  store i32 0, i32* %0
  br label %loop.cond

loop.cond:                                        ; preds = %if.end11, %decl
  %3 = load i32* %0
  %4 = load i32* %ndim5
  %5 = icmp slt i32 %3, %4
  br i1 %5, label %loop.body, label %loop.end

loop.body:                                        ; preds = %loop.cond
  %6 = load i64** %src_shape2
  %7 = getelementptr inbounds i64* %6, i32 %3
  %8 = add i32 %3, %2
  %9 = load i64** %dst_shape1
  %10 = getelementptr inbounds i64* %9, i32 %8
  %11 = load i64* %7
  %12 = icmp eq i64 %11, 1
  br i1 %12, label %if.then, label %if.else

loop.end:                                         ; preds = %if.else7, %loop.cond
  %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
  ret i32 %merge

if.then:                                          ; preds = %loop.body
  %13 = load i64** %src_strides3
  %14 = getelementptr inbounds i64* %13, i32 %3
  store i64 0, i64* %14
  br label %if.end11

if.else:                                          ; preds = %loop.body
  %15 = load i64* %10
  %16 = icmp eq i64 %15, 1
  br i1 %16, label %if.then6, label %if.else7

if.then6:                                         ; preds = %if.else
  store i64 %11, i64* %10
  br label %if.end11

if.else7:                                         ; preds = %if.else
  %17 = icmp ne i64 %11, %15
  br i1 %17, label %loop.end, label %if.end11

if.end11:                                         ; preds = %if.else7, %if.then6, %if.then
  %18 = load i32* %0
  %19 = add i32 %18, 1
  store i32 %19, i32* %0
  br label %loop.cond
}

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %0 = icmp sgt i32 %I, 0
  %1 = icmp sgt i32 %J, 0
  %or.cond = and i1 %0, %1
  br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"

"loop_body_5:8.lr.ph.split.us":                   ; preds = %entry
  %2 = sext i32 %I to i64
  %3 = sext i32 %J to i64
  br label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.us":                              ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
  %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
  %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
  %5 = tail call double @"numba.math.['double'].cos"(double %4)
  %6 = fptosi double %5 to i32
  %7 = add i32 %6, %res_35.us
  %lsr.iv.next = add i64 %lsr.iv, -1
  %exitcond = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"

"for_condition_4:13.loopexit.us":                 ; preds = %"loop_body_6:19.us"
  %exitcond8 = icmp eq i64 %9, %2
  br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.lr.ph.us":                        ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
  %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
  %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
  %9 = add i64 %8, 1
  br label %"loop_body_6:19.us"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13.loopexit.us", %entry
  %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
  ret i32 %res_2.lcssa
}

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %cleanup.if.true, label %cleanup.if.end

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %empty7, %empty8, %empty5, %empty2, %cleanup.if.true
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

cleanup.if.true:                                  ; preds = %entry
  br label %error_label

cleanup.if.end:                                   ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
  br label %empty

empty:                                            ; preds = %cleanup.if.end
  %10 = call i8* @PyErr_Occurred()
  %11 = ptrtoint i8* %10 to i64
  %12 = icmp ne i64 %11, 0
  br i1 %12, label %empty2, label %empty1

empty1:                                           ; preds = %empty
  %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
  br label %empty3

empty2:                                           ; preds = %empty
  br label %error_label

empty3:                                           ; preds = %empty1
  %14 = call i8* @PyErr_Occurred()
  %15 = ptrtoint i8* %14 to i64
  %16 = icmp ne i64 %15, 0
  br i1 %16, label %empty5, label %empty4

empty4:                                           ; preds = %empty3
  %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
  br label %empty6

empty5:                                           ; preds = %empty3
  br label %error_label

empty6:                                           ; preds = %empty4
  %18 = call i8* @PyErr_Occurred()
  %19 = ptrtoint i8* %18 to i64
  %20 = icmp ne i64 %19, 0
  br i1 %20, label %empty8, label %empty7

empty7:                                           ; preds = %empty6
  %21 = sext i32 %17 to i64
  %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
  store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
  %23 = ptrtoint { i64, i8* }* %22 to i64
  %24 = icmp eq i64 %23, 0
  br i1 %24, label %error_label, label %no_error

empty8:                                           ; preds = %empty6
  br label %error_label

no_error:                                         ; preds = %empty7
  %25 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %25, { i64, i8* }** %return_value
  %26 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %26)
  br label %cleanup_label
}

declare i8* @PyErr_Occurred()

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}

CPU times: user 248 ms, sys: 27.5 ms, total: 275 ms
Wall time: 337 ms

Out[64]:

Let us compare the performance of the different alternatives.

In [65]:

func_list = ['f_py', 'f_np', 'f_nb']
data_list = 3 * ['I, J']

In [66]:

perf_comp_data(func_list, data_list, rep=3)

function: f_nb, av. time sec:   0.01615, relative:    1.0
function: f_np, av. time sec:   0.03077, relative:    1.9
function: f_py, av. time sec:   0.51423, relative:   31.8

N-Body Simulation¶

The simulation of the movement of N bodies in relation to each other, where \(N\) is a large number and a body can be a star/planet or a small particle, is an important problem in Physics.

The basic simulation algorithm involves a number of iterations ("loops") since one has to determine the force that every single body exerts on all the other bodies in the system to be simulated (e.g. a galaxy or a system of atomic particles).

The algorithm lends iteself pretty well for parallization. However, the example code that follows does not use any parallelization techniques. It illustrates rather what speed-ups are achievable by

using vectorization approaches with NumPy and
just-in-time compiling capabilities of Numba

A description and discussion of the algorithm is found in the recent article Vladimirov, Andrey and Vadim Karpusenko (2013): "Test-Driving Intel Xeon-Phi Coprocessors with a Basic N-Body Simulation." White Paper, Colfax International.

Pure Python¶

We first use pure Python to implement the algorithm. We start by generating random vectors of particle locations and a zero vector for the speed of each particle.

In [67]:

import random

In [68]:

nParticles = 150
nSteps = 15

In [69]:

p_py = []
for i in range(nParticles):
    p_py.append([random.gauss(0.0, 1.0) for j in range(3)])

In [70]:

pv_py = [[0, 0, 0] for x in p_py]

Next, we define the simulation function for the movements of each particle given their gravitational interaction.

In [71]:

def nbody_py(particle, particlev):  # lists as input
    dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j][0] - particle[i][0]
                    dy = particle[j][1] - particle[i][1]
                    dz = particle[j][2] - particle[i][2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
                particlev[i][0] += dt * Fx
                particlev[i][1] += dt * Fy
                particlev[i][2] += dt * Fz
        for i in range(nParticles):
            particle[i][0] += particlev[i][0] * dt
            particle[i][1] += particlev[i][1] * dt
            particle[i][2] += particlev[i][2] * dt
    return particle, particlev

We then execute the function to measure execution speed.

In [72]:

%time p_py, pv_py = nbody_py(p_py, pv_py)

CPU times: user 593 ms, sys: 7.4 ms, total: 601 ms
Wall time: 596 ms

NumPy Solution¶

The basic algorithm allows for a number of vectorizations by using NumPy.

In [73]:

import numpy as np

In [74]:

p_np = np.random.standard_normal((nParticles, 3))
pv_np = np.zeros_like(p_np)

Using NumPy arrays and matplotlib's 3d plotting capabilities, we can easily visualize the intial positions of the particles in the space.

In [75]:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [76]:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
           c='r', marker='.')

Out[76]:

<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10b3617d0>

The respective NumPy simulation function is considerably more compact.

In [77]:

def nbody_np(particle, particlev):  # NumPy arrays as input
    nSteps = 15; dt = 0.01
    for step in range(1, nSteps + 1, 1):
        Fp = np.zeros((nParticles, 3))
        for i in range(nParticles):
            dp = particle - particle[i]
            drSquared = np.sum(dp ** 2, axis=1)
            h = drSquared + np.sqrt(drSquared)
            drPowerN32 = 1. / np.maximum(h, 1E-10)
            Fp += -(dp.T * drPowerN32).T
            particlev += dt * Fp
        particle += particlev * dt
    return particle, particlev

It is also considerably faster.

In [78]:

%time p_np, pv_np = nbody_np(p_np, pv_np)

CPU times: user 106 ms, sys: 1.25 ms, total: 107 ms
Wall time: 106 ms

Let's check how the particles moved. The following is a snapshot of the final positions of all particles.

In [79]:

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(p_np[:, 0], p_np[:, 1], p_np[:, 2],
           c='r', marker='.')

Out[79]:

<mpl_toolkits.mplot3d.art3d.Patch3DCollection at 0x10bac3490>

Dynamically Compiled Version¶

For this case, we first implement a function which is somehow a hybrid function of the pure Python (loop-heavy) and the NumPy array class-based version.

In [80]:

def nbody_hy(particle, particlev):  # NumPy arrays as input
    dt = 0.01
    for step in range(1, nSteps + 1, 1):
        for i in range(nParticles):
            Fx = 0.0; Fy = 0.0; Fz = 0.0
            for j in range(nParticles):
                if j != i:
                    dx = particle[j,0] - particle[i,0]
                    dy = particle[j,1] - particle[i,1]
                    dz = particle[j,2] - particle[i,2]
                    drSquared = dx * dx + dy * dy + dz * dz
                    drPowerN32 = 1.0 / (drSquared + sqrt(drSquared))
                    Fx += dx * drPowerN32
                    Fy += dy * drPowerN32
                    Fz += dz * drPowerN32
                particlev[i, 0] += dt * Fx
                particlev[i, 1] += dt * Fy
                particlev[i, 2] += dt * Fz
        for i in range(nParticles):
            particle[i,0] += particlev[i,0] * dt
            particle[i,1] += particlev[i,1] * dt
            particle[i,2] += particlev[i,2] * dt
    return particle, particlev

This function is then compiled with Numba.

In [81]:

import numba as nb

In [82]:

nbody_nb = nb.autojit(nbody_hy)

This new, compiled version of the hybrid function is yet faster.

In [83]:

p_nb = np.random.standard_normal((nParticles, 3))
pv_nb = np.zeros_like(p_nb)

In [84]:

%time p_nb, pv_nb = nbody_nb(p_nb, pv_nb)


DEBUG -- translate:361:translate
; ModuleID = 'tmp.module.__main__.nbody_hy.10baa19b0'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %nsteps8 = alloca i64
  %target_temp7 = alloca i64
  %nsteps6 = alloca i64
  %target_temp5 = alloca i64
  %nsteps4 = alloca i64
  %target_temp3 = alloca i64
  %nsteps = alloca i64
  %target_temp = alloca i64
  %particlev2 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %particle1 = alloca { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1
  %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  call void @Py_INCREF({ i64, i8* }* %0)
  %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 2
  %2 = load i8** %1, !tbaa !2
  %3 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 4
  %4 = load i64** %3, !tbaa !4
  %5 = getelementptr i64* %4, i32 0
  %6 = load i64* %5
  %7 = getelementptr i64* %4, i32 1
  %8 = load i64* %7
  %9 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i32 0, i32 5
  %10 = load i64** %9, !tbaa !5
  %11 = getelementptr i64* %10, i32 0
  %12 = load i64* %11
  %13 = getelementptr i64* %10, i32 1
  %14 = load i64* %13
  store { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2
  %15 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  call void @Py_INCREF({ i64, i8* }* %15)
  %16 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 2
  %17 = load i8** %16, !tbaa !2
  %18 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 4
  %19 = load i64** %18, !tbaa !4
  %20 = getelementptr i64* %19, i32 0
  %21 = load i64* %20
  %22 = getelementptr i64* %19, i32 1
  %23 = load i64* %22
  %24 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i32 0, i32 5
  %25 = load i64** %24, !tbaa !5
  %26 = getelementptr i64* %25, i32 0
  %27 = load i64* %26
  %28 = getelementptr i64* %25, i32 1
  %29 = load i64* %28
  %return_value = alloca { i64, i8* }*
  store i64 0, i64* %target_temp, !tbaa !7
  br i1 true, label %ifexp.then, label %ifexp.else

cleanup_label:                                    ; preds = %"no_error_23:21", %error_label
  %30 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %30)
  %31 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particlev2, !tbaa !6
  %32 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %31 to { i64, i8* }*
  call void @Py_XDECREF({ i64, i8* }* %32)
  %33 = load { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }** %particle1, !tbaa !6
  %34 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %33 to { i64, i8* }*
  call void @Py_XDECREF({ i64, i8* }* %34)
  %35 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %35

error_label:                                      ; preds = %"exit_for_3:4"
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %36 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %36)
  br label %cleanup_label

ifexp.then:                                       ; preds = %entry
  br label %ifexp.merge

ifexp.else:                                       ; preds = %entry
  br label %ifexp.merge

ifexp.merge:                                      ; preds = %ifexp.else, %ifexp.then
  %37 = phi i32 [ 1, %ifexp.then ], [ -1, %ifexp.else ]
  %38 = sext i32 %37 to i64
  %39 = sub i64 16, %38
  %40 = sdiv i64 %39, 1
  store i64 %40, i64* %nsteps, !tbaa !8
  br label %"for_condition_3:16"

"for_condition_3:16":                             ; preds = %ifexp.merge, %"exit_for_19:8"
  %i_1 = phi i64 [ 123456789, %ifexp.merge ], [ %i_4, %"exit_for_19:8" ]
  %Fx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fx_2, %"exit_for_19:8" ]
  %drSquared_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drSquared_2, %"exit_for_19:8" ]
  %Fy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fy_2, %"exit_for_19:8" ]
  %Fz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %Fz_2, %"exit_for_19:8" ]
  %dz_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dz_2, %"exit_for_19:8" ]
  %j_1 = phi i64 [ 123456789, %ifexp.merge ], [ %j_2, %"exit_for_19:8" ]
  %dx_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dx_2, %"exit_for_19:8" ]
  %dy_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %dy_2, %"exit_for_19:8" ]
  %drPowerN32_1 = phi double [ 0x7FF8000000000000, %ifexp.merge ], [ %drPowerN32_2, %"exit_for_19:8" ]
  %41 = load i64* %target_temp, !tbaa !7
  %42 = load i64* %nsteps, !tbaa !8
  %43 = icmp slt i64 %41, %42
  %44 = icmp ne i1 %43, false
  br i1 %44, label %"loop_body_4:8", label %"exit_for_3:4"

"exit_for_3:4":                                   ; preds = %"for_condition_3:16"
  %45 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  %46 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  %47 = call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8* }* %45, { i64, i8* }* %46)
  store { i64, i8* }* %47, { i64, i8* }** %objtemp, !tbaa !6
  %48 = ptrtoint { i64, i8* }* %47 to i64
  %49 = icmp eq i64 %48, 0
  br i1 %49, label %error_label, label %"no_error_23:21"

"loop_body_4:8":                                  ; preds = %"for_condition_3:16"
  %50 = load i64* %target_temp, !tbaa !7
  %51 = mul i64 %50, 1
  %52 = add i64 1, %51
  %53 = load i64* %target_temp, !tbaa !7
  %54 = add i64 %53, 1
  store i64 %54, i64* %target_temp, !tbaa !7
  store i64 0, i64* %target_temp3, !tbaa !9
  store i64 150, i64* %nsteps4, !tbaa !10
  br label %"for_condition_4:17"

"for_condition_4:17":                             ; preds = %"loop_body_4:8", %"exit_for_6:12"
  %i_2 = phi i64 [ %i_1, %"loop_body_4:8" ], [ %59, %"exit_for_6:12" ]
  %Fx_2 = phi double [ %Fx_1, %"loop_body_4:8" ], [ %Fx_4, %"exit_for_6:12" ]
  %drSquared_2 = phi double [ %drSquared_1, %"loop_body_4:8" ], [ %drSquared_3, %"exit_for_6:12" ]
  %Fy_2 = phi double [ %Fy_1, %"loop_body_4:8" ], [ %Fy_4, %"exit_for_6:12" ]
  %Fz_2 = phi double [ %Fz_1, %"loop_body_4:8" ], [ %Fz_4, %"exit_for_6:12" ]
  %j_2 = phi i64 [ %j_1, %"loop_body_4:8" ], [ %j_3, %"exit_for_6:12" ]
  %dx_2 = phi double [ %dx_1, %"loop_body_4:8" ], [ %dx_3, %"exit_for_6:12" ]
  %dy_2 = phi double [ %dy_1, %"loop_body_4:8" ], [ %dy_3, %"exit_for_6:12" ]
  %dz_2 = phi double [ %dz_1, %"loop_body_4:8" ], [ %dz_3, %"exit_for_6:12" ]
  %drPowerN32_2 = phi double [ %drPowerN32_1, %"loop_body_4:8" ], [ %drPowerN32_3, %"exit_for_6:12" ]
  %55 = load i64* %target_temp3, !tbaa !9
  %56 = load i64* %nsteps4, !tbaa !10
  %57 = icmp slt i64 %55, %56
  %58 = icmp ne i1 %57, false
  br i1 %58, label %"loop_body_5:12", label %"exit_for_4:8"

"exit_for_4:8":                                   ; preds = %"for_condition_4:17"
  store i64 0, i64* %target_temp7, !tbaa !13
  store i64 150, i64* %nsteps8, !tbaa !14
  br label %"for_condition_19:17"

"loop_body_5:12":                                 ; preds = %"for_condition_4:17"
  %59 = load i64* %target_temp3, !tbaa !9
  %60 = load i64* %target_temp3, !tbaa !9
  %61 = add i64 %60, 1
  store i64 %61, i64* %target_temp3, !tbaa !9
  store i64 0, i64* %target_temp5, !tbaa !11
  store i64 150, i64* %nsteps6, !tbaa !12
  br label %"for_condition_6:21"

"for_condition_6:21":                             ; preds = %"loop_body_5:12", %"exit_if_7:16"
  %drSquared_3 = phi double [ %drSquared_2, %"loop_body_5:12" ], [ %drSquared_5, %"exit_if_7:16" ]
  %Fx_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
  %Fy_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
  %Fz_4 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
  %j_3 = phi i64 [ %j_2, %"loop_body_5:12" ], [ %66, %"exit_if_7:16" ]
  %dx_3 = phi double [ %dx_2, %"loop_body_5:12" ], [ %dx_5, %"exit_if_7:16" ]
  %dy_3 = phi double [ %dy_2, %"loop_body_5:12" ], [ %dy_5, %"exit_if_7:16" ]
  %dz_3 = phi double [ %dz_2, %"loop_body_5:12" ], [ %dz_5, %"exit_if_7:16" ]
  %drPowerN32_3 = phi double [ %drPowerN32_2, %"loop_body_5:12" ], [ %drPowerN32_5, %"exit_if_7:16" ]
  %62 = load i64* %target_temp5, !tbaa !11
  %63 = load i64* %nsteps6, !tbaa !12
  %64 = icmp slt i64 %62, %63
  %65 = icmp ne i1 %64, false
  br i1 %65, label %"loop_body_7:16", label %"exit_for_6:12"

"exit_for_6:12":                                  ; preds = %"for_condition_6:21"
  br label %"for_condition_4:17"

"loop_body_7:16":                                 ; preds = %"for_condition_6:21"
  %66 = load i64* %target_temp5, !tbaa !11
  %67 = load i64* %target_temp5, !tbaa !11
  %68 = add i64 %67, 1
  store i64 %68, i64* %target_temp5, !tbaa !11
  br label %"if_cond_7:19"

"if_cond_7:19":                                   ; preds = %"loop_body_7:16"
  %69 = icmp ne i64 %66, %59
  %70 = icmp ne i1 %69, false
  br i1 %70, label %"if_body_8:20", label %"exit_if_7:16"

"exit_if_7:16":                                   ; preds = %"if_cond_7:19", %"if_body_8:20"
  %drSquared_5 = phi double [ %165, %"if_body_8:20" ], [ %drSquared_3, %"if_cond_7:19" ]
  %Fx_6 = phi double [ %170, %"if_body_8:20" ], [ %Fx_4, %"if_cond_7:19" ]
  %Fy_6 = phi double [ %172, %"if_body_8:20" ], [ %Fy_4, %"if_cond_7:19" ]
  %Fz_6 = phi double [ %174, %"if_body_8:20" ], [ %Fz_4, %"if_cond_7:19" ]
  %dx_5 = phi double [ %130, %"if_body_8:20" ], [ %dx_3, %"if_cond_7:19" ]
  %dy_5 = phi double [ %145, %"if_body_8:20" ], [ %dy_3, %"if_cond_7:19" ]
  %dz_5 = phi double [ %160, %"if_body_8:20" ], [ %dz_3, %"if_cond_7:19" ]
  %drPowerN32_5 = phi double [ %168, %"if_body_8:20" ], [ %drPowerN32_3, %"if_cond_7:19" ]
  %71 = mul i64 %59, %27
  %72 = add i64 0, %71
  %73 = mul i64 0, %29
  %74 = add i64 %72, %73
  %75 = getelementptr i8* %17, i64 %74
  %76 = bitcast i8* %75 to double*
  %77 = load double* %76, !tbaa !2
  %78 = fmul double 1.000000e-02, %Fx_6
  %79 = fadd double %77, %78
  %80 = mul i64 %59, %27
  %81 = add i64 0, %80
  %82 = mul i64 0, %29
  %83 = add i64 %81, %82
  %84 = getelementptr i8* %17, i64 %83
  %85 = bitcast i8* %84 to double*
  store double %79, double* %85, !tbaa !2
  %86 = mul i64 %59, %27
  %87 = add i64 0, %86
  %88 = mul i64 1, %29
  %89 = add i64 %87, %88
  %90 = getelementptr i8* %17, i64 %89
  %91 = bitcast i8* %90 to double*
  %92 = load double* %91, !tbaa !2
  %93 = fmul double 1.000000e-02, %Fy_6
  %94 = fadd double %92, %93
  %95 = mul i64 %59, %27
  %96 = add i64 0, %95
  %97 = mul i64 1, %29
  %98 = add i64 %96, %97
  %99 = getelementptr i8* %17, i64 %98
  %100 = bitcast i8* %99 to double*
  store double %94, double* %100, !tbaa !2
  %101 = mul i64 %59, %27
  %102 = add i64 0, %101
  %103 = mul i64 2, %29
  %104 = add i64 %102, %103
  %105 = getelementptr i8* %17, i64 %104
  %106 = bitcast i8* %105 to double*
  %107 = load double* %106, !tbaa !2
  %108 = fmul double 1.000000e-02, %Fz_6
  %109 = fadd double %107, %108
  %110 = mul i64 %59, %27
  %111 = add i64 0, %110
  %112 = mul i64 2, %29
  %113 = add i64 %111, %112
  %114 = getelementptr i8* %17, i64 %113
  %115 = bitcast i8* %114 to double*
  store double %109, double* %115, !tbaa !2
  br label %"for_condition_6:21"

"if_body_8:20":                                   ; preds = %"if_cond_7:19"
  %116 = mul i64 %66, %12
  %117 = add i64 0, %116
  %118 = mul i64 0, %14
  %119 = add i64 %117, %118
  %120 = getelementptr i8* %2, i64 %119
  %121 = bitcast i8* %120 to double*
  %122 = load double* %121, !tbaa !2
  %123 = mul i64 %59, %12
  %124 = add i64 0, %123
  %125 = mul i64 0, %14
  %126 = add i64 %124, %125
  %127 = getelementptr i8* %2, i64 %126
  %128 = bitcast i8* %127 to double*
  %129 = load double* %128, !tbaa !2
  %130 = fsub double %122, %129
  %131 = mul i64 %66, %12
  %132 = add i64 0, %131
  %133 = mul i64 1, %14
  %134 = add i64 %132, %133
  %135 = getelementptr i8* %2, i64 %134
  %136 = bitcast i8* %135 to double*
  %137 = load double* %136, !tbaa !2
  %138 = mul i64 %59, %12
  %139 = add i64 0, %138
  %140 = mul i64 1, %14
  %141 = add i64 %139, %140
  %142 = getelementptr i8* %2, i64 %141
  %143 = bitcast i8* %142 to double*
  %144 = load double* %143, !tbaa !2
  %145 = fsub double %137, %144
  %146 = mul i64 %66, %12
  %147 = add i64 0, %146
  %148 = mul i64 2, %14
  %149 = add i64 %147, %148
  %150 = getelementptr i8* %2, i64 %149
  %151 = bitcast i8* %150 to double*
  %152 = load double* %151, !tbaa !2
  %153 = mul i64 %59, %12
  %154 = add i64 0, %153
  %155 = mul i64 2, %14
  %156 = add i64 %154, %155
  %157 = getelementptr i8* %2, i64 %156
  %158 = bitcast i8* %157 to double*
  %159 = load double* %158, !tbaa !2
  %160 = fsub double %152, %159
  %161 = fmul double %130, %130
  %162 = fmul double %145, %145
  %163 = fadd double %161, %162
  %164 = fmul double %160, %160
  %165 = fadd double %163, %164
  %166 = call double @"numba.math.['double'].sqrt"(double %165)
  %167 = fadd double %165, %166
  %168 = fdiv double 1.000000e+00, %167
  %169 = fmul double %130, %168
  %170 = fadd double %Fx_4, %169
  %171 = fmul double %145, %168
  %172 = fadd double %Fy_4, %171
  %173 = fmul double %160, %168
  %174 = fadd double %Fz_4, %173
  br label %"exit_if_7:16"

"for_condition_19:17":                            ; preds = %"exit_for_4:8", %"loop_body_20:29"
  %i_4 = phi i64 [ %i_2, %"exit_for_4:8" ], [ %179, %"loop_body_20:29" ]
  %175 = load i64* %target_temp7, !tbaa !13
  %176 = load i64* %nsteps8, !tbaa !14
  %177 = icmp slt i64 %175, %176
  %178 = icmp ne i1 %177, false
  br i1 %178, label %"loop_body_20:29", label %"exit_for_19:8"

"exit_for_19:8":                                  ; preds = %"for_condition_19:17"
  br label %"for_condition_3:16"

"loop_body_20:29":                                ; preds = %"for_condition_19:17"
  %179 = load i64* %target_temp7, !tbaa !13
  %180 = load i64* %target_temp7, !tbaa !13
  %181 = add i64 %180, 1
  store i64 %181, i64* %target_temp7, !tbaa !13
  %182 = mul i64 %179, %12
  %183 = add i64 0, %182
  %184 = mul i64 0, %14
  %185 = add i64 %183, %184
  %186 = getelementptr i8* %2, i64 %185
  %187 = bitcast i8* %186 to double*
  %188 = load double* %187, !tbaa !2
  %189 = mul i64 %179, %27
  %190 = add i64 0, %189
  %191 = mul i64 0, %29
  %192 = add i64 %190, %191
  %193 = getelementptr i8* %17, i64 %192
  %194 = bitcast i8* %193 to double*
  %195 = load double* %194, !tbaa !2
  %196 = fmul double %195, 1.000000e-02
  %197 = fadd double %188, %196
  %198 = mul i64 %179, %12
  %199 = add i64 0, %198
  %200 = mul i64 0, %14
  %201 = add i64 %199, %200
  %202 = getelementptr i8* %2, i64 %201
  %203 = bitcast i8* %202 to double*
  store double %197, double* %203, !tbaa !2
  %204 = mul i64 %179, %12
  %205 = add i64 0, %204
  %206 = mul i64 1, %14
  %207 = add i64 %205, %206
  %208 = getelementptr i8* %2, i64 %207
  %209 = bitcast i8* %208 to double*
  %210 = load double* %209, !tbaa !2
  %211 = mul i64 %179, %27
  %212 = add i64 0, %211
  %213 = mul i64 1, %29
  %214 = add i64 %212, %213
  %215 = getelementptr i8* %17, i64 %214
  %216 = bitcast i8* %215 to double*
  %217 = load double* %216, !tbaa !2
  %218 = fmul double %217, 1.000000e-02
  %219 = fadd double %210, %218
  %220 = mul i64 %179, %12
  %221 = add i64 0, %220
  %222 = mul i64 1, %14
  %223 = add i64 %221, %222
  %224 = getelementptr i8* %2, i64 %223
  %225 = bitcast i8* %224 to double*
  store double %219, double* %225, !tbaa !2
  %226 = mul i64 %179, %12
  %227 = add i64 0, %226
  %228 = mul i64 2, %14
  %229 = add i64 %227, %228
  %230 = getelementptr i8* %2, i64 %229
  %231 = bitcast i8* %230 to double*
  %232 = load double* %231, !tbaa !2
  %233 = mul i64 %179, %27
  %234 = add i64 0, %233
  %235 = mul i64 2, %29
  %236 = add i64 %234, %235
  %237 = getelementptr i8* %17, i64 %236
  %238 = bitcast i8* %237 to double*
  %239 = load double* %238, !tbaa !2
  %240 = fmul double %239, 1.000000e-02
  %241 = fadd double %232, %240
  %242 = mul i64 %179, %12
  %243 = add i64 0, %242
  %244 = mul i64 2, %14
  %245 = add i64 %243, %244
  %246 = getelementptr i8* %2, i64 %245
  %247 = bitcast i8* %246 to double*
  store double %241, double* %247, !tbaa !2
  br label %"for_condition_19:17"

"no_error_23:21":                                 ; preds = %"exit_for_3:4"
  %248 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %248, { i64, i8* }** %return_value
  %249 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %249)
  br label %cleanup_label
}

declare { i64, i8* }* @PyTuple_Pack(i64, ...)

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

declare void @Py_INCREF({ i64, i8* }*)

declare void @Py_XINCREF({ i64, i8* }*)

declare double @"numba.math.['double'].sqrt"(double)

declare void @Py_XDECREF({ i64, i8* }*)

!tbaa = !{!0, !1, !2, !3, !4, !5, !6, !7, !8, !9, !10, !11, !12, !13, !14}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"float3264 *", metadata !1}
!3 = metadata !{metadata !"npy_intp **", metadata !1}
!4 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !3}
!5 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !3}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"unique0", metadata !1}
!8 = metadata !{metadata !"unique1", metadata !1}
!9 = metadata !{metadata !"unique2", metadata !1}
!10 = metadata !{metadata !"unique3", metadata !1}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}


DEBUG -- translate:361:translate
; ModuleID = 'numba_executable_module'

@PyArray_API = linkonce_odr global i8** inttoptr (i64 4349348544 to i8**)

define void @Py_INCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = add i64 %1, 1
  store i64 %2, i64* %0
  ret void
}

define void @Py_DECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = bitcast { i64, i8* }* %obj to i64*
  %1 = load i64* %0
  %2 = icmp sgt i64 %1, 1
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %decl
  %3 = bitcast { i64, i8* }* %obj to i64*
  %4 = add i64 %1, -1
  store i64 %4, i64* %3
  br label %if.end

if.else:                                          ; preds = %decl
  call void @Py_DecRef({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret void
}

declare void @Py_DecRef({ i64, i8* }*)

define void @Py_XINCREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  %2 = bitcast { i64, i8* }* %obj to i64*
  %3 = load i64* %2
  %4 = add i64 %3, 1
  store i64 %4, i64* %2
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define void @Py_XDECREF({ i64, i8* }* %obj) {
decl:
  %obj1 = alloca { i64, i8* }*
  store { i64, i8* }* %obj, { i64, i8* }** %obj1
  %0 = ptrtoint { i64, i8* }* %obj to i64
  %1 = icmp ne i64 %0, 0
  br i1 %1, label %if.then, label %if.end

if.then:                                          ; preds = %decl
  call void @Py_DECREF({ i64, i8* }* %obj)
  br label %if.end

if.end:                                           ; preds = %if.then, %decl
  ret void
}

define i8* @IndexAxis(i8* %data, i64* %in_shape, i64* %in_strides, i64 %src_dim, i64 %index) {
decl:
  %data1 = alloca i8*
  %in_shape2 = alloca i64*
  %in_strides3 = alloca i64*
  %src_dim4 = alloca i64
  %index5 = alloca i64
  %result = alloca i8*
  store i8* %data, i8** %data1
  store i64* %in_shape, i64** %in_shape2
  store i64* %in_strides, i64** %in_strides3
  store i64 %src_dim, i64* %src_dim4
  store i64 %index, i64* %index5
  %0 = load i64** %in_strides3
  %1 = load i64* %src_dim4
  %2 = getelementptr inbounds i64* %0, i64 %1
  %3 = load i64* %2
  %4 = mul i64 %3, %index
  %5 = load i8** %data1
  %6 = getelementptr inbounds i8* %5, i64 %4
  store i8* %6, i8** %result
  ret i8* %6
}

define void @NewAxis(i64* %out_shape, i64* %out_strides, i32 %dst_dim) {
decl:
  %out_shape1 = alloca i64*
  %out_strides2 = alloca i64*
  %dst_dim3 = alloca i32
  store i64* %out_shape, i64** %out_shape1
  store i64* %out_strides, i64** %out_strides2
  store i32 %dst_dim, i32* %dst_dim3
  %0 = load i64** %out_shape1
  %1 = getelementptr inbounds i64* %0, i32 %dst_dim
  store i64 1, i64* %1
  %2 = load i64** %out_strides2
  %3 = load i32* %dst_dim3
  %4 = getelementptr inbounds i64* %2, i32 %3
  store i64 0, i64* %4
  ret void
}

define i32 @Broadcast(i64* %dst_shape, i64* %src_shape, i64* %src_strides, i32 %max_ndim, i32 %ndim) {
decl:
  %dst_shape1 = alloca i64*
  %src_shape2 = alloca i64*
  %src_strides3 = alloca i64*
  %max_ndim4 = alloca i32
  %ndim5 = alloca i32
  %0 = alloca i32
  store i64* %dst_shape, i64** %dst_shape1
  store i64* %src_shape, i64** %src_shape2
  store i64* %src_strides, i64** %src_strides3
  store i32 %max_ndim, i32* %max_ndim4
  store i32 %ndim, i32* %ndim5
  %1 = load i32* %max_ndim4
  %2 = sub i32 %1, %ndim
  store i32 0, i32* %0
  br label %loop.cond

loop.cond:                                        ; preds = %if.end11, %decl
  %3 = load i32* %0
  %4 = load i32* %ndim5
  %5 = icmp slt i32 %3, %4
  br i1 %5, label %loop.body, label %loop.end

loop.body:                                        ; preds = %loop.cond
  %6 = load i64** %src_shape2
  %7 = getelementptr inbounds i64* %6, i32 %3
  %8 = add i32 %3, %2
  %9 = load i64** %dst_shape1
  %10 = getelementptr inbounds i64* %9, i32 %8
  %11 = load i64* %7
  %12 = icmp eq i64 %11, 1
  br i1 %12, label %if.then, label %if.else

loop.end:                                         ; preds = %if.else7, %loop.cond
  %merge = phi i32 [ 1, %loop.cond ], [ 0, %if.else7 ]
  ret i32 %merge

if.then:                                          ; preds = %loop.body
  %13 = load i64** %src_strides3
  %14 = getelementptr inbounds i64* %13, i32 %3
  store i64 0, i64* %14
  br label %if.end11

if.else:                                          ; preds = %loop.body
  %15 = load i64* %10
  %16 = icmp eq i64 %15, 1
  br i1 %16, label %if.then6, label %if.else7

if.then6:                                         ; preds = %if.else
  store i64 %11, i64* %10
  br label %if.end11

if.else7:                                         ; preds = %if.else
  %17 = icmp ne i64 %11, %15
  br i1 %17, label %loop.end, label %if.end11

if.end11:                                         ; preds = %if.else7, %if.then6, %if.then
  %18 = load i32* %0
  %19 = add i32 %18, 1
  store i32 %19, i32* %0
  br label %loop.cond
}

define i32 @__numba_specialized_0___main___2E_f_py(i32 %I, i32 %J) {
entry:
  %0 = icmp sgt i32 %I, 0
  %1 = icmp sgt i32 %J, 0
  %or.cond = and i1 %0, %1
  br i1 %or.cond, label %"loop_body_5:8.lr.ph.split.us", label %"exit_for_4:4"

"loop_body_5:8.lr.ph.split.us":                   ; preds = %entry
  %2 = sext i32 %I to i64
  %3 = sext i32 %J to i64
  br label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.us":                              ; preds = %"loop_body_6:19.us", %"loop_body_6:19.lr.ph.us"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"loop_body_6:19.us" ], [ %3, %"loop_body_6:19.lr.ph.us" ]
  %res_35.us = phi i32 [ %res_26.us, %"loop_body_6:19.lr.ph.us" ], [ %7, %"loop_body_6:19.us" ]
  %4 = tail call double @"numba.math.['double'].log"(double 1.000000e+00)
  %5 = tail call double @"numba.math.['double'].cos"(double %4)
  %6 = fptosi double %5 to i32
  %7 = add i32 %6, %res_35.us
  %lsr.iv.next = add i64 %lsr.iv, -1
  %exitcond = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond, label %"for_condition_4:13.loopexit.us", label %"loop_body_6:19.us"

"for_condition_4:13.loopexit.us":                 ; preds = %"loop_body_6:19.us"
  %exitcond8 = icmp eq i64 %9, %2
  br i1 %exitcond8, label %"exit_for_4:4", label %"loop_body_6:19.lr.ph.us"

"loop_body_6:19.lr.ph.us":                        ; preds = %"loop_body_5:8.lr.ph.split.us", %"for_condition_4:13.loopexit.us"
  %res_26.us = phi i32 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %7, %"for_condition_4:13.loopexit.us" ]
  %8 = phi i64 [ 0, %"loop_body_5:8.lr.ph.split.us" ], [ %9, %"for_condition_4:13.loopexit.us" ]
  %9 = add i64 %8, 1
  br label %"loop_body_6:19.us"

"exit_for_4:4":                                   ; preds = %"for_condition_4:13.loopexit.us", %entry
  %res_2.lcssa = phi i32 [ 0, %entry ], [ %7, %"for_condition_4:13.loopexit.us" ]
  ret i32 %res_2.lcssa
}

declare double @"numba.math.['double'].cos"(double)

declare double @"numba.math.['double'].log"(double)

define { i64, i8* }* @__numba_specialized_1_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_f_py(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %error_label, label %empty

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %empty6, %empty3, %empty, %entry, %empty7
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

empty:                                            ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %7)
  %10 = call i8* @PyErr_Occurred()
  %11 = ptrtoint i8* %10 to i64
  %12 = icmp ne i64 %11, 0
  br i1 %12, label %error_label, label %empty3

empty3:                                           ; preds = %empty
  %13 = call i32 inttoptr (i64 4433236114 to i32 ({ i64, i8* }*)*)({ i64, i8* }* %8)
  %14 = call i8* @PyErr_Occurred()
  %15 = ptrtoint i8* %14 to i64
  %16 = icmp ne i64 %15, 0
  br i1 %16, label %error_label, label %empty6

empty6:                                           ; preds = %empty3
  %17 = call i32 @__numba_specialized_0___main___2E_f_py(i32 %9, i32 %13)
  %18 = call i8* @PyErr_Occurred()
  %19 = ptrtoint i8* %18 to i64
  %20 = icmp ne i64 %19, 0
  br i1 %20, label %error_label, label %empty7

empty7:                                           ; preds = %empty6
  %21 = sext i32 %17 to i64
  %22 = call { i64, i8* }* inttoptr (i64 4433235252 to { i64, i8* }* (i64)*)(i64 %21)
  store { i64, i8* }* %22, { i64, i8* }** %objtemp, !tbaa !6
  %23 = ptrtoint { i64, i8* }* %22 to i64
  %24 = icmp eq i64 %23, 0
  br i1 %24, label %error_label, label %no_error

no_error:                                         ; preds = %empty7
  %25 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %25, { i64, i8* }** %return_value
  %26 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %26)
  br label %cleanup_label
}

declare i8* @PyErr_Occurred()

declare { i64, i8* }* @Py_BuildValue(i8*, ...)

declare i32 @PyArg_ParseTuple({ i64, i8* }*, i8*, ...)

declare void @PyErr_Clear()

define { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev) {
entry:
  %0 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  tail call void @Py_INCREF({ i64, i8* }* %0)
  %1 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 2
  %2 = load i8** %1, align 8, !tbaa !7
  %3 = ptrtoint i8* %2 to i64
  %4 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, i64 0, i32 5
  %5 = load i64** %4, align 8, !tbaa !10
  %6 = load i64* %5, align 8
  %7 = getelementptr i64* %5, i64 1
  %8 = load i64* %7, align 8
  %9 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  tail call void @Py_INCREF({ i64, i8* }* %9)
  %10 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 2
  %11 = load i8** %10, align 8, !tbaa !7
  %12 = ptrtoint i8* %11 to i64
  %13 = getelementptr { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev, i64 0, i32 5
  %14 = load i64** %13, align 8, !tbaa !10
  %15 = load i64* %14, align 8
  %16 = getelementptr i64* %14, i64 1
  %17 = load i64* %16, align 8
  %18 = shl i64 %17, 1
  %19 = shl i64 %8, 1
  %20 = sub i64 0, %3
  %21 = sub i64 0, %12
  br label %"loop_body_4:8"

cleanup_label:                                    ; preds = %"no_error_23:21", %error_label
  %return_value.0.load14 = phi { i64, i8* }* [ %25, %"no_error_23:21" ], [ null, %error_label ]
  %22 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev to { i64, i8* }*
  %23 = bitcast { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle to { i64, i8* }*
  tail call void @Py_XDECREF({ i64, i8* }* %25)
  tail call void @Py_XDECREF({ i64, i8* }* %22)
  tail call void @Py_XDECREF({ i64, i8* }* %23)
  ret { i64, i8* }* %return_value.0.load14

error_label:                                      ; preds = %"exit_for_3:4"
  tail call void @Py_XINCREF({ i64, i8* }* null)
  br label %cleanup_label

"for_condition_3:16.loopexit":                    ; preds = %"loop_body_20:29"
  %24 = add i64 %27, 1
  %exitcond35 = icmp eq i64 %24, 15
  br i1 %exitcond35, label %"exit_for_3:4", label %"loop_body_4:8"

"exit_for_3:4":                                   ; preds = %"for_condition_3:16.loopexit"
  %25 = tail call { i64, i8* }* (i64, ...)* @PyTuple_Pack(i64 2, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particle, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %particlev)
  %26 = icmp eq { i64, i8* }* %25, null
  br i1 %26, label %error_label, label %"no_error_23:21"

"loop_body_4:8":                                  ; preds = %"for_condition_3:16.loopexit", %entry
  %27 = phi i64 [ 0, %entry ], [ %24, %"for_condition_3:16.loopexit" ]
  br label %"loop_body_5:12"

"for_condition_4:17.loopexit":                    ; preds = %"exit_if_7:16"
  %28 = add i64 %29, 1
  %exitcond33 = icmp eq i64 %28, 150
  br i1 %exitcond33, label %"loop_body_20:29", label %"loop_body_5:12"

"loop_body_5:12":                                 ; preds = %"for_condition_4:17.loopexit", %"loop_body_4:8"
  %29 = phi i64 [ 0, %"loop_body_4:8" ], [ %28, %"for_condition_4:17.loopexit" ]
  %30 = mul i64 %29, %15
  %31 = add i64 %30, %17
  %32 = add i64 %30, %18
  %33 = mul i64 %29, %6
  %34 = add i64 %33, %8
  %35 = add i64 %33, %19
  br label %"loop_body_7:16"

"loop_body_7:16":                                 ; preds = %"exit_if_7:16", %"loop_body_5:12"
  %lsr.iv = phi i64 [ %lsr.iv.next, %"exit_if_7:16" ], [ %20, %"loop_body_5:12" ]
  %Fz_418 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fz_6, %"exit_if_7:16" ]
  %Fy_417 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fy_6, %"exit_if_7:16" ]
  %Fx_416 = phi double [ 0.000000e+00, %"loop_body_5:12" ], [ %Fx_6, %"exit_if_7:16" ]
  %36 = phi i64 [ 0, %"loop_body_5:12" ], [ %37, %"exit_if_7:16" ]
  %37 = add i64 %36, 1
  %38 = icmp eq i64 %29, %36
  br i1 %38, label %"exit_if_7:16", label %"if_body_8:20"

"exit_if_7:16":                                   ; preds = %"loop_body_7:16", %"if_body_8:20"
  %Fx_6 = phi double [ %70, %"if_body_8:20" ], [ %Fx_416, %"loop_body_7:16" ]
  %Fy_6 = phi double [ %72, %"if_body_8:20" ], [ %Fy_417, %"loop_body_7:16" ]
  %Fz_6 = phi double [ %74, %"if_body_8:20" ], [ %Fz_418, %"loop_body_7:16" ]
  %sunkaddr = ptrtoint i8* %11 to i64
  %sunkaddr23 = add i64 %sunkaddr, %30
  %sunkaddr24 = inttoptr i64 %sunkaddr23 to double*
  %39 = load double* %sunkaddr24, align 8, !tbaa !7
  %40 = fmul double %Fx_6, 1.000000e-02
  %41 = fadd double %40, %39
  store double %41, double* %sunkaddr24, align 8, !tbaa !7
  %sunkaddr25 = ptrtoint i8* %11 to i64
  %sunkaddr26 = add i64 %sunkaddr25, %31
  %sunkaddr27 = inttoptr i64 %sunkaddr26 to double*
  %42 = load double* %sunkaddr27, align 8, !tbaa !7
  %43 = fmul double %Fy_6, 1.000000e-02
  %44 = fadd double %43, %42
  store double %44, double* %sunkaddr27, align 8, !tbaa !7
  %sunkaddr28 = ptrtoint i8* %11 to i64
  %sunkaddr29 = add i64 %sunkaddr28, %32
  %sunkaddr30 = inttoptr i64 %sunkaddr29 to double*
  %45 = load double* %sunkaddr30, align 8, !tbaa !7
  %46 = fmul double %Fz_6, 1.000000e-02
  %47 = fadd double %46, %45
  store double %47, double* %sunkaddr30, align 8, !tbaa !7
  %lsr.iv.next = sub i64 %lsr.iv, %6
  %exitcond = icmp eq i64 150, %37
  br i1 %exitcond, label %"for_condition_4:17.loopexit", label %"loop_body_7:16"

"if_body_8:20":                                   ; preds = %"loop_body_7:16"
  %48 = mul i64 %lsr.iv, -1
  %49 = inttoptr i64 %48 to double*
  %50 = load double* %49, align 8, !tbaa !7
  %sunkaddr31 = ptrtoint i8* %2 to i64
  %sunkaddr32 = add i64 %sunkaddr31, %33
  %sunkaddr33 = inttoptr i64 %sunkaddr32 to double*
  %51 = load double* %sunkaddr33, align 8, !tbaa !7
  %52 = fsub double %50, %51
  %53 = mul i64 %lsr.iv, -1
  %sunkaddr34 = add i64 %8, %53
  %sunkaddr35 = inttoptr i64 %sunkaddr34 to double*
  %54 = load double* %sunkaddr35, align 8, !tbaa !7
  %sunkaddr36 = ptrtoint i8* %2 to i64
  %sunkaddr37 = add i64 %sunkaddr36, %34
  %sunkaddr38 = inttoptr i64 %sunkaddr37 to double*
  %55 = load double* %sunkaddr38, align 8, !tbaa !7
  %56 = fsub double %54, %55
  %57 = mul i64 %lsr.iv, -1
  %sunkaddr39 = mul i64 %8, 2
  %sunkaddr40 = add i64 %57, %sunkaddr39
  %sunkaddr41 = inttoptr i64 %sunkaddr40 to double*
  %58 = load double* %sunkaddr41, align 8, !tbaa !7
  %sunkaddr42 = ptrtoint i8* %2 to i64
  %sunkaddr43 = add i64 %sunkaddr42, %35
  %sunkaddr44 = inttoptr i64 %sunkaddr43 to double*
  %59 = load double* %sunkaddr44, align 8, !tbaa !7
  %60 = fsub double %58, %59
  %61 = fmul double %52, %52
  %62 = fmul double %56, %56
  %63 = fadd double %61, %62
  %64 = fmul double %60, %60
  %65 = fadd double %63, %64
  %66 = tail call double @"numba.math.['double'].sqrt"(double %65)
  %67 = fadd double %66, %65
  %68 = fdiv double 1.000000e+00, %67
  %69 = fmul double %52, %68
  %70 = fadd double %Fx_416, %69
  %71 = fmul double %56, %68
  %72 = fadd double %Fy_417, %71
  %73 = fmul double %60, %68
  %74 = fadd double %Fz_418, %73
  br label %"exit_if_7:16"

"loop_body_20:29":                                ; preds = %"for_condition_4:17.loopexit", %"loop_body_20:29"
  %lsr.iv21 = phi i64 [ %lsr.iv.next22, %"loop_body_20:29" ], [ 150, %"for_condition_4:17.loopexit" ]
  %lsr.iv15 = phi i64 [ %lsr.iv.next16, %"loop_body_20:29" ], [ %21, %"for_condition_4:17.loopexit" ]
  %lsr.iv4 = phi i8* [ %91, %"loop_body_20:29" ], [ %2, %"for_condition_4:17.loopexit" ]
  %lsr.iv414 = bitcast i8* %lsr.iv4 to double*
  %lsr.iv45 = bitcast i8* %lsr.iv4 to i1*
  %75 = load double* %lsr.iv414, align 8, !tbaa !7
  %76 = mul i64 %lsr.iv15, -1
  %77 = inttoptr i64 %76 to double*
  %78 = load double* %77, align 8, !tbaa !7
  %79 = fmul double %78, 1.000000e-02
  %80 = fadd double %75, %79
  store double %80, double* %lsr.iv414, align 8, !tbaa !7
  %scevgep12 = getelementptr i8* %lsr.iv4, i64 %8
  %scevgep1213 = bitcast i8* %scevgep12 to double*
  %81 = load double* %scevgep1213, align 8, !tbaa !7
  %82 = mul i64 %lsr.iv15, -1
  %sunkaddr45 = add i64 %17, %82
  %sunkaddr46 = inttoptr i64 %sunkaddr45 to double*
  %83 = load double* %sunkaddr46, align 8, !tbaa !7
  %84 = fmul double %83, 1.000000e-02
  %85 = fadd double %81, %84
  %scevgep10 = getelementptr i8* %lsr.iv4, i64 %8
  %scevgep1011 = bitcast i8* %scevgep10 to double*
  store double %85, double* %scevgep1011, align 8, !tbaa !7
  %sunkaddr47 = ptrtoint i8* %lsr.iv4 to i64
  %sunkaddr48 = mul i64 %8, 2
  %sunkaddr49 = add i64 %sunkaddr47, %sunkaddr48
  %sunkaddr50 = inttoptr i64 %sunkaddr49 to double*
  %86 = load double* %sunkaddr50, align 8, !tbaa !7
  %87 = mul i64 %lsr.iv15, -1
  %sunkaddr51 = mul i64 %17, 2
  %sunkaddr52 = add i64 %87, %sunkaddr51
  %sunkaddr53 = inttoptr i64 %sunkaddr52 to double*
  %88 = load double* %sunkaddr53, align 8, !tbaa !7
  %89 = fmul double %88, 1.000000e-02
  %90 = fadd double %86, %89
  %sunkaddr54 = ptrtoint i8* %lsr.iv4 to i64
  %sunkaddr55 = mul i64 %8, 2
  %sunkaddr56 = add i64 %sunkaddr54, %sunkaddr55
  %sunkaddr57 = inttoptr i64 %sunkaddr56 to double*
  store double %90, double* %sunkaddr57, align 8, !tbaa !7
  %scevgep = getelementptr i1* %lsr.iv45, i64 %6
  %91 = bitcast i1* %scevgep to i8*
  %lsr.iv.next16 = sub i64 %lsr.iv15, %15
  %lsr.iv.next22 = add i64 %lsr.iv21, -1
  %exitcond34 = icmp eq i64 %lsr.iv.next22, 0
  br i1 %exitcond34, label %"for_condition_3:16.loopexit", label %"loop_body_20:29"

"no_error_23:21":                                 ; preds = %"exit_for_3:4"
  tail call void @Py_XINCREF({ i64, i8* }* %25)
  br label %cleanup_label
}

declare { i64, i8* }* @PyTuple_Pack(i64, ...)

declare double @"numba.math.['double'].sqrt"(double)

define { i64, i8* }* @__numba_specialized_3_numba_2E_codegen_2E_llvmwrapper_2E___numba_wrapper_nbody_hy(i8* %self, { i64, i8* }* %args) {
entry:
  %objtemp = alloca { i64, i8* }*
  store { i64, i8* }* null, { i64, i8* }** %objtemp, !tbaa !6
  %0 = alloca { i64, i8* }*
  %1 = alloca { i64, i8* }*
  %return_value = alloca { i64, i8* }*
  %2 = call i32 ({ i64, i8* }*, i8*, ...)* @PyArg_ParseTuple({ i64, i8* }* %args, i8* getelementptr inbounds ([3 x i8]* @__STR_0, i32 0, i32 0), { i64, i8* }** %1, { i64, i8* }** %0)
  %3 = icmp eq i32 %2, 0
  br i1 %3, label %cleanup.if.true, label %cleanup.if.end

cleanup_label:                                    ; preds = %no_error, %error_label
  %4 = load { i64, i8* }** %objtemp, !tbaa !6
  call void @Py_XDECREF({ i64, i8* }* %4)
  %5 = load { i64, i8* }** %return_value
  ret { i64, i8* }* %5

error_label:                                      ; preds = %cleanup.if.end, %cleanup.if.true
  store { i64, i8* }* null, { i64, i8* }** %return_value
  %6 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %6)
  br label %cleanup_label

cleanup.if.true:                                  ; preds = %entry
  br label %error_label

cleanup.if.end:                                   ; preds = %entry
  %7 = load { i64, i8* }** %1
  %8 = load { i64, i8* }** %0
  %9 = bitcast { i64, i8* }* %7 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %10 = bitcast { i64, i8* }* %8 to { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }*
  %11 = call { i64, i8* }* @__numba_specialized_2___main___2E_nbody_hy({ i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %9, { i64, i8*, i8*, i32, i64*, i64*, i8*, i8*, i32, i8* }* %10)
  store { i64, i8* }* %11, { i64, i8* }** %objtemp, !tbaa !6
  %12 = ptrtoint { i64, i8* }* %11 to i64
  %13 = icmp eq i64 %12, 0
  br i1 %13, label %error_label, label %no_error

no_error:                                         ; preds = %cleanup.if.end
  %14 = load { i64, i8* }** %objtemp, !tbaa !6
  store { i64, i8* }* %14, { i64, i8* }** %return_value
  %15 = load { i64, i8* }** %return_value, !tbaa !6
  call void @Py_XINCREF({ i64, i8* }* %15)
  br label %cleanup_label
}

!tbaa = !{!0, !1, !2, !3, !4, !5, !0, !1, !6, !0, !1, !7, !8, !9, !10, !6, !2, !3, !4, !5, !11, !12, !13, !14, !0, !1, !6}

!0 = metadata !{metadata !"root"}
!1 = metadata !{metadata !"char *", metadata !0}
!2 = metadata !{metadata !"unique0", metadata !1}
!3 = metadata !{metadata !"unique1", metadata !1}
!4 = metadata !{metadata !"unique2", metadata !1}
!5 = metadata !{metadata !"unique3", metadata !1}
!6 = metadata !{metadata !"object", metadata !1}
!7 = metadata !{metadata !"float3264 *", metadata !1}
!8 = metadata !{metadata !"npy_intp **", metadata !1}
!9 = metadata !{metadata !"tbaa_type(numpy shape, npy_intp *) *", metadata !8}
!10 = metadata !{metadata !"tbaa_type(numpy strides, npy_intp *) *", metadata !8}
!11 = metadata !{metadata !"unique4", metadata !1}
!12 = metadata !{metadata !"unique5", metadata !1}
!13 = metadata !{metadata !"unique6", metadata !1}
!14 = metadata !{metadata !"unique7", metadata !1}

CPU times: user 424 ms, sys: 35.5 ms, total: 460 ms
Wall time: 435 ms

Speed Comparisons¶

The dynamically compiled version is by far the fastest.

In [85]:

func_list = ['nbody_py', 'nbody_np', 'nbody_nb', 'nbody_hy']
data_list = ['p_py, pv_py', 'p_np, pv_np', 'p_nb, pv_nb', 'p_nb, pv_nb']

In [86]:

perf_comp_data(func_list, data_list, rep=3)

function: nbody_nb, av. time sec:   0.00607, relative:    1.0
function: nbody_np, av. time sec:   0.09028, relative:   14.9
function: nbody_py, av. time sec:   0.54923, relative:   90.5
function: nbody_hy, av. time sec:   5.44442, relative:  896.8

Performance Python Approaches¶

Nowadays, the Python ecosystem provides a number of ways to improve the performance of code.

paradigms: some Python paradigms might be more performant than others given a specific problem
libraries: there is a wealth of libraries available for different types of problems which often lead to much higher performance given a problem that fits into the scope of the library
compiling: a number of powerful compiling solutions are available, static and dynamic ones
parallelization: some Python libraries have built-in parallellization capabilities, others allow to harness the full power of multiple cores CPUs, whole clusters or GPUs

A major benefit of the Python ecosystem is, that all these approaches generally are easily implementable, meaning that the additional effort included is generally quite low (even for non-experts). In other words, performance improvements often are low hanging fruits.

The Python Quants

The Python Quants GmbH ¶

The Python Quants – the company Web site

www.pythonquants.com

Dr. Yves J. Hilpisch – my personal Web site

www.hilpisch.com

Derivatives Analytics with Python – my new book

Read About and Order the Book

Contact Us

yves@pythonquants.com

@dyjh