The Python Quants

New Python Data Analytics

Dr. Yves J. Hilpisch

The Python Quants GmbH

analytics@pythonquants.com

www.pythonquants.com

TsTables – High Performance Time Series Management

TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high-frequency time series data in append-once, retrieve-many-times scenarios (cf. the GitHub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.

In [1]:
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
import random
from time import time
from datetime import datetime 
%matplotlib inline

Sample Time Series Data

Let us generate a decent number of sample data points.

In [2]:
no = 30000000  # number of data rows (time steps)
co = 3  # number of time series (columns)
dt = 1. / (12 * 30 * 24 * 60)  # time step as year fraction (one minute, 12 months x 30 days)
vol = 0.2  # volatility

We generate data at one-second intervals.

In [3]:
dr = pd.date_range('2015-1-1', periods=no, freq='1s')
In [4]:
dr
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01 00:00:00, ..., 2015-12-14 05:19:59]
Length: 30000000, Freq: S, Timezone: None

In-memory generation is quite quick.

In [5]:
%%time
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
        vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100
CPU times: user 5.86 s, sys: 652 ms, total: 6.51 s
Wall time: 6.51 s
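
For reference, the cell above implements the exact discretization of a driftless geometric Brownian motion. In LaTeX notation, with $S_0 = 100$, $\sigma$ = vol and $\Delta t$ = dt, each of the three paths follows

$$ S_{t_k} = S_0 \exp\left( \sum_{i=1}^{k} \left( -\tfrac{1}{2}\sigma^2\,\Delta t + \sigma\sqrt{\Delta t}\,z_i \right) \right), \qquad z_i \sim \mathcal{N}(0, 1). $$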

In [6]:
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
In [7]:
df.count()
Out[7]:
ts1    30000000
ts2    30000000
ts3    30000000
dtype: int64

The starting values of the three time series.

In [8]:
df.head()
Out[8]:
ts1 ts2 ts3
2015-01-01 00:00:00 100.000000 100.000000 100.000000
2015-01-01 00:00:01 100.010104 100.080024 99.941624
2015-01-01 00:00:02 99.995179 100.093401 99.921710
2015-01-01 00:00:03 99.934018 100.112409 99.912793
2015-01-01 00:00:04 99.969965 100.115609 99.954547

And a plot of the time series data (every 100000th point).

In [9]:
df[::100000].plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9806e0fc90>

Storage and Retrieval with TsTables

To store the time series data in a PyTables table, we first define the table structure.

In [10]:
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)  # integer timestamp column required by TsTables
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)

Second, we open a database file and create the table object.

In [11]:
h5 = tb.open_file('ts.h5', 'w')

TsTables adds a new function create_ts to PyTables.

In [12]:
ts = h5.create_ts('/', 'TS', TS)

Third, we append the time series data to the table object.

In [13]:
%time ts.append(df)
CPU times: user 2.18 s, sys: 1.26 s, total: 3.44 s
Wall time: 5.62 s

In [14]:
ls -n *.h5
-rw-r--r-- 1 1141 8 931233072 Mar  8 19:50 ts.h5

TsTables applies a highly structured storage hierarchy, with sub-groups per year, month and day and one table per day (here 86,400 one-second data rows).

In [15]:
a = str(h5)
print a[:508]
ts.h5 (File) u''
Last modif.: 'Sun Mar  8 19:50:57 2015'
Object Tree: 
/ (RootGroup) u''
/TS (Group/Timeseries) u''
/TS/y2015 (Group) u''
/TS/y2015/m01 (Group) ''
/TS/y2015/m02 (Group) ''
/TS/y2015/m03 (Group) ''
/TS/y2015/m04 (Group) ''
/TS/y2015/m05 (Group) ''
/TS/y2015/m06 (Group) ''
/TS/y2015/m07 (Group) ''
/TS/y2015/m08 (Group) ''
/TS/y2015/m09 (Group) ''
/TS/y2015/m10 (Group) ''
/TS/y2015/m11 (Group) ''
/TS/y2015/m12 (Group) ''
/TS/y2015/m12/d01 (Group) ''
/TS/y2015/m12/d01/ts_data (Table(86400,))

The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date, a typical case in finance (e.g. in backtesting strategies or in risk management).

In [16]:
read_start_dt = datetime(2015, 2, 1, 0, 0)
read_end_dt = datetime(2015, 2, 2, 0, 0)

TsTables tries to make such an operation as fast as possible.

In [17]:
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 14 ms, sys: 1e+03 µs, total: 15 ms
Wall time: 14.7 ms

Let us try it with 100 random one-day intervals.

In [18]:
t0 = time()
its = 100
for _ in xrange(its):
    day = random.randint(1, 27)
    read_start_dt = datetime(2015, 2, day, 0, 0)
    read_end_dt = datetime(2015, 2, day + 1, 0, 0)
    rows = ts.read_range(read_start_dt, read_end_dt)
t1 = time()

The results are convincing.

In [19]:
print "time for %d random accesses    %5.3f seconds" %(its, t1 - t0)
print "average time for random access  %5.3f seconds" %((t1 - t0) / its)
time for 100 random accesses    0.764 seconds
average time for random access  0.008 seconds

Conveniently, the returned object is a pandas DataFrame.

In [20]:
rows.count()
Out[20]:
ts1    86401
ts2    86401
ts3    86401
dtype: int64
In [21]:
rows.head()
Out[21]:
ts1 ts2 ts3
2015-02-18 00:00:00 46.976217 158.482381 114.877444
2015-02-18 00:00:01 46.996315 158.501924 114.901572
2015-02-18 00:00:02 46.989175 158.560159 114.896549
2015-02-18 00:00:03 46.960973 158.557845 114.833150
2015-02-18 00:00:04 46.970345 158.505452 114.894082

A plot of a sub-set of the retrieved data (every 500th point).

In [22]:
rows[::500].plot()
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f97f87f5c10>
In [23]:
h5.close()
In [24]:
!rm ts.h5

bcolz – High Performance Columnar Data Store

bcolz is a columnar data store for fast data storage and retrieval with built-in, high-performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.

In [25]:
import bcolz

ctable Example

The first example is based on the ctable class for data in table format. The example data set is about 1 GB in size (uncompressed).

In [26]:
N = 100000 * 1000
print N
100000000

In-Memory Storage

We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected. It takes about 24 sec to generate the ctable object from a generator via the fromiter method.

In [27]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.7 s, sys: 182 ms, total: 25.9 s
Wall time: 24.4 s

The in-memory size is only about 154 MB, which translates into a compression ratio of more than 7 (1.12 GB / 153.51 MB ≈ 7.45).

In [28]:
ct
Out[28]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]

You can now implement fast numerical operations on this data object (note that the output is a carray object).

In [29]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.62 s, sys: 794 ms, total: 5.42 s
Wall time: 968 ms

Out[29]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]
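
Besides eval, filtered iteration over a ctable is available. A minimal sketch, assuming the ctable.where interface (the query expression is evaluated via numexpr and matching rows are yielded lazily as namedtuples):

# hedged sketch: query the in-memory ctable for small f1 values
res = [(row.f0, row.f1) for row in
       ct.where('(f0 > 0) & (f1 < 100)', limit=5)]
print res  # [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]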

Disk-Based Storage

The same tasks can be implemented with disk-based storage; to this end, one only has to specify the rootdir parameter. At about 30 sec, the generation takes a bit longer on disk.

In [30]:
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N, rootdir='ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.4 s, sys: 310 ms, total: 25.7 s
Wall time: 31.1 s

Everything else, especially the object handling, remains almost the same.

In [31]:
ct
Out[31]:
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ct'
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]

The numerical operations work in the same fashion and hardly take any longer, due to native multithreading and optimized caching (see the thread-count sketch after the output below).

In [32]:
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.92 s, sys: 814 ms, total: 5.73 s
Wall time: 1.08 s

Out[32]:
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,   1.37491943e+09
   1.57491943e+09   1.77491942e+09]
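
To see how much the multithreading contributes, the Blosc thread count can be varied. A minimal sketch, assuming bcolz's set_nthreads and detect_number_of_cores helpers:

import bcolz
print bcolz.detect_number_of_cores()  # cores available to Blosc
old = bcolz.set_nthreads(1)  # force single-threaded (de)compression
# re-run ct.eval('f0 ** 2 + sqrt(f1)') here to compare timings
bcolz.set_nthreads(old)  # restore the previous setting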

Let us finally verify system disk usage.

In [33]:
!ls ct
__attrs__  __rootdirs__  f0  f1

In [34]:
!du -h ct
# system disk usage
12K	ct/f1/meta
153M	ct/f1/data
153M	ct/f1
12K	ct/f0/meta
4.5M	ct/f0/data
4.6M	ct/f0
158M	ct

In [35]:
!rm -r ct

carray Example

This example is about mid-size data that, in general, does not fit into memory without compression.

In [36]:
import numpy as np

As a basis, we generate a NumPy ndarray object of 32 MB in size.

In [37]:
n = 2000
a = np.arange(n * n).reshape(n, n) 
a.nbytes
Out[37]:
32000000

In-Memory Storage

Let us first work in memory again. Our carray object ends up containing 4,001 versions of the ndarray object (the initial one plus 4,000 appended copies). The in-memory generation takes about 26 sec.

In [38]:
%%time
it = 4000
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 26.2 s, sys: 213 ms, total: 26.4 s
Wall time: 26.4 s

The carray object stores 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.

In [39]:
ca
Out[39]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.

In [40]:
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 96 ms, sys: 25 ms, total: 121 ms
Wall time: 122 ms

Out[40]:
array([[  0.00000000e+00,   2.00000000e+00,   5.41421356e+00, ...,
          3.98805369e+06,   3.99204870e+06,   3.99604571e+06],
       [  4.00004472e+06,   4.00404573e+06,   4.00804874e+06, ...,
          1.59760722e+07,   1.59840672e+07,   1.59920642e+07],
       [  1.60000632e+07,   1.60080643e+07,   1.60160673e+07, ...,
          3.59640864e+07,   3.59760814e+07,   3.59880785e+07],
       ..., 
       [  3.97603600e+12,   3.97603999e+12,   3.97604398e+12, ...,
          3.98400403e+12,   3.98400802e+12,   3.98401201e+12],
       [  3.98401600e+12,   3.98401999e+12,   3.98402399e+12, ...,
          3.99199201e+12,   3.99199601e+12,   3.99200001e+12],
       [  3.99200400e+12,   3.99200800e+12,   3.99201199e+12, ...,
          3.99998800e+12,   3.99999200e+12,   3.99999600e+12]])

Another approach is to use the eval function of bcolz.

In [41]:
x = ca[:10000]  # 10,000 rows as sub-set
In [42]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
  # output carray object compressed
CPU times: user 659 ms, sys: 38 ms, total: 697 ms
Wall time: 188 ms

Out[42]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Disk-Based Storage

Now for the disk-based storage of multiple versions of the array object: we append the object 4,000 times to a single carray object on disk. It takes only a bit more than a minute to compress and store 120 GB worth of data on disk.

In [43]:
%%time
it = 4000
ca = bcolz.carray(a, rootdir='ca',
                 cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 30.5 s, sys: 4.9 s, total: 35.4 s
Wall time: 1min 15s
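
One caveat, based on bcolz's persistence model (an assumption, not shown in the original run): appended chunks are written as you go, but trailing data may remain in internal buffers, so a final flush is advisable before another process reads the rootdir.

ca.flush()  # write any buffered data of the persistent carray to 'ca' on disk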

The compression ratio in this case is again 130+.

In [44]:
ca
Out[44]:
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ca'
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ..., 
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]

Simple numerical operations are easy to implement.

In [45]:
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 31 ms, sys: 0 ns, total: 31 ms
Wall time: 31.2 ms

Out[45]:
3999998000000

Let us try the previous, mathematically more demanding operation – again with a sub-set of the data.

In [46]:
x = ca[:10000]  # 10,000 rows as sub-set

First, with an in-memory carray results object.

In [47]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 647 ms, sys: 10 ms, total: 657 ms
Wall time: 178 ms

Out[47]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Second, with an on-disk results object. The time difference is not that huge.

In [48]:
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir='out')
CPU times: user 693 ms, sys: 53 ms, total: 746 ms
Wall time: 281 ms

Out[48]:
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'out'
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,   3.98805369e+06
    3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,   1.59760722e+07
    1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,   3.59640864e+07
    3.59760814e+07   3.59880785e+07]
 ..., 
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,   1.59679920e+13
    1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,   1.59839800e+13
    1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,   1.59999760e+13
    1.59999840e+13   1.59999920e+13]]

Finally, we verify system disk usage.

In [49]:
!du -hs ca
# system disk usage
985M	ca

In [50]:
!du -hs out
40M	out

In [51]:
!rm -r ca
!rm -r out

blaze – Data Blending and Analysis

blaze gives Python users a familiar interface for querying data that lives in diverse data storage systems. Cf. http://blaze.pydata.org/.

In [52]:
import blaze as bz

Simple Example

The first example constructs a blaze.Data object from native Python objects.

In [53]:
t = bz.Data([('Henry', 'boy', 8),
              ('Lilli', 'girl', 14)],
            fields=['name', 'gender', 'age'])
In [54]:
t
Out[54]:
name gender age
0 Henry boy 8
1 Lilli girl 14
In [55]:
t[t.age > 10]
Out[55]:
name gender age
0 Lilli girl 14
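
blaze expressions compose further. A short sketch of grouping and sorting on the same table, assuming blaze's by operator and sort method:

bz.by(t.gender, avg_age=t.age.mean())  # average age per gender group
t.sort('age', ascending=False)  # rows sorted by age, oldest first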

Data from NumPy Array

Let us read data from an in-memory NumPy ndarray object.

In [56]:
import numpy as np
In [57]:
a = np.random.standard_normal((1000000, 5))
  # 1mn data rows, 5 columns
In [58]:
df = bz.DataFrame(a, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
  # blaze DataFrame constructor

A look at the data structure.

In [59]:
df.head()
Out[59]:
f0 f1 f2 f3 f4
0 0.319853 0.245459 -0.797091 -0.064819 -0.688033
1 -0.417906 -1.432667 -0.906708 -0.229992 0.218824
2 -0.722823 -0.233840 -0.782198 -2.052541 -0.808443
3 1.124005 1.254103 -0.063634 -0.368991 -0.024721
4 -0.520790 -0.828998 1.081217 -0.033651 0.520602

The data itself is stored as a NumPy ndarray object.

In [60]:
df.values
Out[60]:
array([[ 0.31985319,  0.24545877, -0.79709138, -0.0648189 , -0.68803281],
       [-0.41790586, -1.43266674, -0.90670814, -0.22999152,  0.21882423],
       [-0.72282341, -0.23384003, -0.78219815, -2.05254069, -0.80844349],
       ..., 
       [ 0.85018339, -1.94928256, -0.53374708, -0.22892796,  0.21056478],
       [ 2.24111957, -2.35307037,  0.78848578, -1.33289509,  0.94705762],
       [ 0.74294773,  1.21010213,  0.04568963, -0.33003822, -1.06262355]])

Data from CSV File

We generate first a CSV file using the random data from before.

In [61]:
%time df.to_csv('data.csv', index=False)
CPU times: user 9.05 s, sys: 187 ms, total: 9.24 s
Wall time: 9.28 s

Let us read the data with blaze. Actually, we only generate a view.

In [62]:
%time csv = bz.CSV('data.csv')
CPU times: user 34 µs, sys: 10 µs, total: 44 µs
Wall time: 49.1 µs

In [63]:
%time t1 = bz.Data(csv)
CPU times: user 6.26 ms, sys: 2.35 ms, total: 8.61 ms
Wall time: 7.84 ms

Now, we can work with the data. Note, however, that iterating, slicing, etc. are not (yet) implemented.

In [64]:
%time t1.count()
CPU times: user 800 µs, sys: 129 µs, total: 929 µs
Wall time: 858 µs

Out[64]:
1000000

The backend is a CSV object. And a look at the first rows.

In [65]:
t1.data
Out[65]:
<into.backends.csv.CSV at 0x10ba76d90>
In [66]:
t1
Out[66]:
f0 f1 f2 f3 f4
0 0.319853 0.245459 -0.797091 -0.064819 -0.688033
1 -0.417906 -1.432667 -0.906708 -0.229992 0.218824
2 -0.722823 -0.233840 -0.782198 -2.052541 -0.808443
3 1.124005 1.254103 -0.063634 -0.368991 -0.024721
4 -0.520790 -0.828998 1.081217 -0.033651 0.520602
5 -0.252100 -1.281830 0.333219 -0.108898 -1.654572
6 0.966351 0.189840 -0.121226 -0.465816 -0.427109
7 -0.823019 -0.675048 -0.212049 1.446792 0.432087
8 0.717941 0.334547 1.110829 0.687371 -2.843008
9 0.672862 1.175542 -1.012422 0.040519 0.999462
10 1.820175 0.557146 0.440231 0.815029 -0.239220

Data from SQL

We now generate a SQLite3 table with the dummy data from before.

In [67]:
import sqlite3 as sq3
In [68]:
con = sq3.connect('data.sql')
try:
    con.execute('DROP TABLE numbers')
    # delete in case it exists
except:
    pass

We write the data into an appropriate table.

In [69]:
con.execute(
    'CREATE TABLE numbers (f0 real, f1 real, f2 real, f3 real, f4 real)'
    )
Out[69]:
<sqlite3.Cursor at 0x10ba48f80>
In [70]:
%time con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', a)
CPU times: user 11.3 s, sys: 209 ms, total: 11.5 s
Wall time: 11.5 s

Out[70]:
<sqlite3.Cursor at 0x10ba48dc0>
In [71]:
con.commit()
In [72]:
con.close()

Now reading the data with blaze (i.e. just generating a view).

In [73]:
%time t2 = bz.Data('sqlite:///data.sql::numbers')
CPU times: user 13.3 ms, sys: 5.87 ms, total: 19.2 ms
Wall time: 16.8 ms

The SQL backend and the first rows again.

In [74]:
t2.data
Out[74]:
Table('numbers', MetaData(bind=Engine(sqlite:///data.sql)), Column('f0', REAL(), table=<numbers>), Column('f1', REAL(), table=<numbers>), Column('f2', REAL(), table=<numbers>), Column('f3', REAL(), table=<numbers>), Column('f4', REAL(), table=<numbers>), schema=None)
In [75]:
t2
Out[75]:
f0 f1 f2 f3 f4
0 0.319853 0.245459 -0.797091 -0.064819 -0.688033
1 -0.417906 -1.432667 -0.906708 -0.229992 0.218824
2 -0.722823 -0.233840 -0.782198 -2.052541 -0.808443
3 1.124005 1.254103 -0.063634 -0.368991 -0.024721
4 -0.520790 -0.828998 1.081217 -0.033651 0.520602
5 -0.252100 -1.281830 0.333219 -0.108898 -1.654572
6 0.966351 0.189840 -0.121226 -0.465816 -0.427109
7 -0.823019 -0.675048 -0.212049 1.446792 0.432087
8 0.717941 0.334547 1.110829 0.687371 -2.843008
9 0.672862 1.175542 -1.012422 0.040519 0.999462
10 1.820175 0.557146 0.440231 0.815029 -0.239220

Working with the blaze Objects

blaze provides an abstraction layer for computations and queries.

In [76]:
ts = bz.TableSymbol('ts',
        '{f0: float64, f1: float64, f2: float64, f3: float64, f4: float64}')
  # generic table description -- independent of the target data structure
expr = ts[ts['f0'] + ts['f3'] > 2.5]['f1']
  # generic expression -- independent of the target data structure

The blaze compiler specializes the generic objects to different data structures.

In [77]:
%time np.array(bz.compute(expr, a))  # NumPy ndarray object
CPU times: user 22.2 ms, sys: 6.42 ms, total: 28.6 ms
Wall time: 25.4 ms

Out[77]:
array([ 0.55714636,  0.64643383,  0.02715145, ...,  0.01646136,
       -0.0801466 ,  1.12205609])
In [78]:
%time np.array(bz.compute(expr, df))  # DataFrame object
CPU times: user 38.3 ms, sys: 8.39 ms, total: 46.7 ms
Wall time: 40.4 ms

Out[78]:
array([ 0.55714636,  0.64643383,  0.02715145, ...,  0.01646136,
       -0.0801466 ,  1.12205609])
In [79]:
%time np.array(bz.compute(expr, csv))  # CSV file representation
CPU times: user 1.17 s, sys: 188 ms, total: 1.36 s
Wall time: 1.35 s

Out[79]:
array([ 0.55714636,  0.64643383,  0.02715145, ...,  0.01646136,
       -0.0801466 ,  1.12205609])

In similar fashion, blaze allows unified expression evaluations for different backends (I).

In [80]:
%time t1[t1['f0'] + t1['f3'] > 2.5]['f1'].head()
  # table representation 1
  # from CSV
CPU times: user 4.67 ms, sys: 844 µs, total: 5.51 ms
Wall time: 4.93 ms

Out[80]:
f1
10 0.557146
27 0.646434
35 0.027151
51 0.515017
54 1.021370
85 -0.049430
105 -2.217659
215 0.678503
248 -1.250740
305 0.316044

In similar fashion, blaze allows unified expression evaluations for different backends (II).

In [81]:
%time t2[t2['f0'] + t2['f3'] > 2.5]['f1'].head()
  # table representation 2
  # from SQL database
CPU times: user 4 ms, sys: 840 µs, total: 4.84 ms
Wall time: 4.12 ms

Out[81]:
f1
0 0.557146
1 0.646434
2 0.027151
3 0.515017
4 1.021370
5 -0.049430
6 -2.217659
7 0.678503
8 -1.250740
9 0.316044
10 -0.646563

Typical aggregation operations work as well.

In [82]:
%time t1.f0.sum()
CPU times: user 492 µs, sys: 118 µs, total: 610 µs
Wall time: 513 µs

Out[82]:
-362.30595000518264
In [83]:
%time t2.f3.max()
CPU times: user 1.08 ms, sys: 188 µs, total: 1.27 ms
Wall time: 1.1 ms

Out[83]:
4.600597186631231

Transforming Data Formats

If you work intensively with data sets, it might be beneficial to transform them once into high-performance binary data formats (e.g. bcolz, HDF5).

Using bcolz as Data Store

In [84]:
%time bz.into('data.bcolz', 'data.csv')
  # natively done by blaze
  # cparams=bcolz.cparams(9) could be added
  # no effect here due to random floats
CPU times: user 2.09 s, sys: 618 ms, total: 2.71 s
Wall time: 2.74 s

Out[84]:
ctable((1000000,), [('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
  nbytes: 38.15 MB; cbytes: 38.52 MB; ratio: 0.99
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'data.bcolz'
[ (0.3198531850803145, 0.245458767960108, -0.7970913836666452, -0.06481889703519775, -0.6880328099126439)
 (-0.41790586195877294, -1.432666742562975, -0.9067081412270984, -0.2299915211962229, 0.21882423409883608)
 (-0.7228234108159903, -0.2338400315854928, -0.7821981542503783, -2.0525406913167457, -0.8084434946349215)
 ...,
 (0.8501833916901524, -1.9492825646439451, -0.5337470832345179, -0.22892795562871054, 0.2105647794568058)
 (2.2411195659753136, -2.3530703744335533, 0.7884857765136529, -1.332895091777322, 0.9470576202661248)
 (0.7429477273462958, 1.2101021279999242, 0.045689629263727084, -0.3300382163012745, -1.0626235518611051)]

We can now connect to the bcolz disk-based ctable object.

In [85]:
import bcolz as bc
In [86]:
b = bc.ctable(rootdir='data.bcolz') 

Now the power of bcolz for numerical computations can be brought to bear.

In [87]:
%time nex = b.eval('sqrt(abs(f0)) + log(abs(f1))')
CPU times: user 170 ms, sys: 54.9 ms, total: 225 ms
Wall time: 164 ms

In [88]:
nex
Out[88]:
carray((1000000,), float64)
  nbytes: 7.63 MB; cbytes: 7.45 MB; ratio: 1.02
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[-0.83907065  1.00599395 -0.6029278  ...,  1.58951529  2.35275795
  1.05264891]

Using HDF5

Similarly, we can use PyTables and HDF5 as an efficient binary store.

In [89]:
import pandas as pd
In [90]:
%%time
con = sq3.connect('data.sql')
pd.HDFStore('data.h5')['sql'] = pd.read_sql('SELECT * FROM numbers', con)
  # simultaneously reading whole SQL table and writing it to HDF5 store
con.close()
CPU times: user 5.83 s, sys: 573 ms, total: 6.4 s
Wall time: 6.48 s

Now, data can be efficiently retrieved.

In [91]:
%%time
%matplotlib inline
pd.HDFStore('data.h5')['sql'][::1000].cumsum().plot(figsize=(10, 5))
  # simultaneously reading data from HDF5 store and plotting it
CPU times: user 915 ms, sys: 806 ms, total: 1.72 s
Wall time: 2.37 s
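
The store above uses pandas' default fixed format, which is fast but only supports whole-object reads. A minimal sketch, assuming pandas' table format, to make the store queryable with where clauses; the file name data_tb.h5 and the choice of f0 as data column are illustrative:

store = pd.HDFStore('data_tb.h5')  # hypothetical second store file
store.put('sql', pd.HDFStore('data.h5')['sql'],
          format='table', data_columns=['f0'])  # queryable data column
rows = store.select('sql', where='f0 > 2.5')  # read only matching rows
store.close()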

Cleaning Up

In [92]:
!du -h dat*
9.5M	data.bcolz/f0/data
8.0K	data.bcolz/f0/meta
9.5M	data.bcolz/f0
9.5M	data.bcolz/f1/data
8.0K	data.bcolz/f1/meta
9.5M	data.bcolz/f1
9.5M	data.bcolz/f2/data
8.0K	data.bcolz/f2/meta
9.5M	data.bcolz/f2
9.5M	data.bcolz/f3/data
8.0K	data.bcolz/f3/meta
9.5M	data.bcolz/f3
9.5M	data.bcolz/f4/data
8.0K	data.bcolz/f4/meta
9.5M	data.bcolz/f4
 48M	data.bcolz
 94M	data.csv
 46M	data.h5
 52M	data.sql

In [93]:
!ls -n dat*.*
-rw-r--r--  1 501  20  98161382 Mar  8 19:35 data.csv
-rw-r--r--  1 501  20  48007208 Mar  8 19:36 data.h5
-rw-r--r--  1 501  20  54446080 Mar  8 19:36 data.sql

data.bcolz:
total 16
-rw-r--r--  1 501  20    3 Mar  8 19:36 __attrs__
-rw-r--r--  1 501  20   42 Mar  8 19:36 __rootdirs__
drwxr-xr-x  5 501  20  170 Mar  8 19:36 f0
drwxr-xr-x  5 501  20  170 Mar  8 19:36 f1
drwxr-xr-x  5 501  20  170 Mar  8 19:36 f2
drwxr-xr-x  5 501  20  170 Mar  8 19:36 f3
drwxr-xr-x  5 501  20  170 Mar  8 19:36 f4

In [94]:
# cleaning up
!rm -r dat*.*

Conclusion

High performance (hardware-bound) I/O operations and highly efficient data blending and analytics are among Python's key strengths.