TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high-frequency time series data in "append once, retrieve many times" scenarios (cf. the GitHub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
import random
from time import time
from datetime import datetime
%matplotlib inline
Let us generate a decent number of sample data points.
no = 30000000  # number of rows to generate
co = 3  # number of time series (columns)
dt = 1. / (12 * 30 * 24 * 60)  # time step as year fraction
vol = 0.2  # annualized volatility
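These parameters feed the simulation below, which applies the exact log-Euler discretization of a driftless geometric Brownian motion,

$$S_{t+\Delta t} = S_t \exp\left(-\frac{\sigma^2}{2}\,\Delta t + \sigma \sqrt{\Delta t}\, z\right), \quad z \sim \mathcal{N}(0, 1)$$

with $S_0 = 100$ and $\sigma$ given by vol.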
We generate one-second intervals of data.
dr = pd.date_range('2015-1-1', periods=no, freq='1s')
dr
<class 'pandas.tseries.index.DatetimeIndex'>
[2015-01-01 00:00:00, ..., 2015-12-14 05:19:59]
Length: 30000000, Freq: S, Timezone: None
In-memory generation is quite quick.
%%time
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
                  vol * np.sqrt(dt) * np.random.standard_normal((no, co)),
                  axis=0))
da[0] = 100  # set the initial value of each series to 100
CPU times: user 5.86 s, sys: 652 ms, total: 6.51 s Wall time: 6.51 s
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
df.count()
ts1    30000000
ts2    30000000
ts3    30000000
dtype: int64
The starting values of the three time series.
df.head()
| | ts1 | ts2 | ts3 |
|---|---|---|---|
2015-01-01 00:00:00 | 100.000000 | 100.000000 | 100.000000 |
2015-01-01 00:00:01 | 100.010104 | 100.080024 | 99.941624 |
2015-01-01 00:00:02 | 99.995179 | 100.093401 | 99.921710 |
2015-01-01 00:00:03 | 99.934018 | 100.112409 | 99.912793 |
2015-01-01 00:00:04 | 99.969965 | 100.115609 | 99.954547 |
And a plot of the time series data (every 100,000th point).
df[::100000].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9806e0fc90>
To store the time series data in a PyTables table we first define the table structure.
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=2)
    ts3 = tb.Float64Col(pos=3)
Second, we open a database file and create the table object.
h5 = tb.open_file('ts.h5', 'w')
TsTables adds a new function create_ts to PyTables.
ts = h5.create_ts('/', 'TS', TS)
Third, we append the time series data to the table object.
%time ts.append(df)
CPU times: user 2.18 s, sys: 1.26 s, total: 3.44 s Wall time: 5.62 s
!ls -n *.h5
-rw-r--r-- 1 1141 8 931233072 Mar 8 19:50 ts.h5
The approach of TsTables is to apply a highly structured storage hierarchy. Note also that the file size of roughly 930 MB more or less matches the raw data size (30 million rows × 32 bytes per row), i.e. the data is stored (almost) uncompressed.
a = str(h5)
print a[:508]
ts.h5 (File) u''
Last modif.: 'Sun Mar 8 19:50:57 2015'
Object Tree:
/ (RootGroup) u''
/TS (Group/Timeseries) u''
/TS/y2015 (Group) u''
/TS/y2015/m01 (Group) ''
/TS/y2015/m02 (Group) ''
/TS/y2015/m03 (Group) ''
/TS/y2015/m04 (Group) ''
/TS/y2015/m05 (Group) ''
/TS/y2015/m06 (Group) ''
/TS/y2015/m07 (Group) ''
/TS/y2015/m08 (Group) ''
/TS/y2015/m09 (Group) ''
/TS/y2015/m10 (Group) ''
/TS/y2015/m11 (Group) ''
/TS/y2015/m12 (Group) ''
/TS/y2015/m12/d01 (Group) ''
/TS/y2015/m12/d01/ts_data (Table(86400,))
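Since every day ends up in its own sub-group, individual chunks can also be inspected with plain PyTables means. A minimal sketch (the node path follows the tree above):
node = h5.get_node('/TS/y2015/m02/d01/ts_data')  # table holding one day of data
node.nrows  # 86400 one-second intervals per day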
The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).
read_start_dt = datetime(2015, 2, 1, 0, 0)
read_end_dt = datetime(2015, 2, 2, 0, 0)
TsTables tries to make such an operation as fast as possible.
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 14 ms, sys: 1e+03 µs, total: 15 ms Wall time: 14.7 ms
Let us try it with 100 random one-day intervals.
t0 = time()
its = 100  # number of random retrievals
for _ in xrange(its):
    day = random.randint(1, 27)  # random day in February
    read_start_dt = datetime(2015, 2, day, 0, 0)
    read_end_dt = datetime(2015, 2, day + 1, 0, 0)
    rows = ts.read_range(read_start_dt, read_end_dt)  # retrieve one day
t1 = time()
The results are convincing.
print "time for %d random accesses %5.3f seconds" %(its, t1 - t0)
print "average time for random access %5.3f seconds" %((t1 - t0) / its)
time for 100 random accesses 0.764 seconds
average time for random access 0.008 seconds
Conveniently, the returned object is a pandas DataFrame. It contains 86,401 rows since the end timestamp is included in the range as well.
rows.count()
ts1    86401
ts2    86401
ts3    86401
dtype: int64
rows.head()
| | ts1 | ts2 | ts3 |
|---|---|---|---|
2015-02-18 00:00:00 | 46.976217 | 158.482381 | 114.877444 |
2015-02-18 00:00:01 | 46.996315 | 158.501924 | 114.901572 |
2015-02-18 00:00:02 | 46.989175 | 158.560159 | 114.896549 |
2015-02-18 00:00:03 | 46.960973 | 158.557845 | 114.833150 |
2015-02-18 00:00:04 | 46.970345 | 158.505452 | 114.894082 |
A look at a data sub-set.
rows[::500].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f97f87f5c10>
h5.close()
!rm ts.h5
bcolz is a columnar data store for fast data storage and retrieval with built-in, high-performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.
import bcolz
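As a warm-up, a minimal sketch of the basic building block, the compressed carray object (the names arr and carr are for illustration only; clevel steers the compression level, as in the examples below):
arr = np.arange(1000000)  # a plain NumPy array
carr = bcolz.carray(arr, cparams=bcolz.cparams(clevel=9))  # compressed copy
carr[10:15]  # slices are transparently decompressed on access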
The first example is based on the ctable class for data in table format. The example data set is about 1 GB in size uncompressed (100,000,000 rows × 12 bytes each).
N = 100000 * 1000  # 100 million rows
print N
100000000
We first generate an in-memory object using high compression. Since we work with integers, good compression ratios are to be expected. It takes about 24 sec to generate the ctable object from a generator via the fromiter method.
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N,
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.7 s, sys: 182 ms, total: 25.9 s Wall time: 24.4 s
The in-memory size is only about 150 MB, which translates into a compression ratio of 7+.
ct
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ...,
 (99999997, 9999999400000009) (99999998, 9999999600000004)
 (99999999, 9999999800000001)]
You can now implement fast numerical operations on this data object (note that the output is a carray object).
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.62 s, sys: 794 ms, total: 5.42 s Wall time: 968 ms
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,
   1.37491943e+09   1.57491943e+09   1.77491942e+09]
The same tasks can be implemented with disk-based storage. To this end, you only need to specify the rootdir parameter. At about 31 sec, the generation takes a bit longer on disk.
%%time
ct = bcolz.fromiter(((i, i ** 2) for i in xrange(N)),
                    dtype="i4, i8",
                    count=N, rootdir='ct',
                    cparams=bcolz.cparams(clevel=9))
CPU times: user 25.4 s, sys: 310 ms, total: 25.7 s Wall time: 31.1 s
Everything else (especially the object handling) remains almost the same, however.
ct
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 1.12 GB; cbytes: 153.51 MB; ratio: 7.45
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ct'
[(0, 0) (1, 1) (2, 4) ...,
 (99999997, 9999999400000009) (99999998, 9999999600000004)
 (99999999, 9999999800000001)]
The numerical operations work in the same fashion and take hardly any longer, due to native multithreading and optimized caching.
%time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.92 s, sys: 814 ms, total: 5.73 s Wall time: 1.08 s
carray((100000000,), float64)
  nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[  0.00000000e+00   2.00000000e+00   6.00000000e+00 ...,
   1.37491943e+09   1.57491943e+09   1.77491942e+09]
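The multithreading that bcolz relies on (via the Blosc compressor) can be inspected and tuned. A small sketch, assuming the standard bcolz API:
bcolz.detect_number_of_cores()  # number of cores bcolz detects
bcolz.set_nthreads(4)  # set the thread count; returns the previous setting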
Let us finally verify system disk usage.
!ls ct
__attrs__ __rootdirs__ f0 f1
!du -h ct
# system disk usage
 12K	ct/f1/meta
153M	ct/f1/data
153M	ct/f1
 12K	ct/f0/meta
4.5M	ct/f0/data
4.6M	ct/f0
158M	ct
!rm -r ct
This example is about "mid data", i.e. data sets that do not (in general) fit into memory without compression.
import numpy as np
We generate as basis a NumPy ndarray object of 32 MB in size.
n = 2000
a = np.arange(n * n).reshape(n, n)  # 2,000 x 2,000 int64 values
a.nbytes
32000000
Let us again work in-memory first. Our carray object contains 4,000 appended copies of the ndarray object (plus the original). The in-memory generation takes about 26 sec.
%%time
it = 4000  # number of appends
ca = bcolz.carray(a, cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 26.2 s, sys: 213 ms, total: 26.4 s Wall time: 26.4 s
The carray object stores 120 GB worth of data in less than 1 GB of memory, for a compression ratio of more than 130.
ca
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ...,
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
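The reported ratio can be verified from the object's own attributes, nbytes being the uncompressed and cbytes the compressed size:
ca.nbytes / float(ca.cbytes)  # roughly 133.9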
Let us implement the evaluation of a numerical expression on this data set. The syntax and handling are the same as with NumPy ndarray objects.
%time ca[:5000] ** 2 + np.sqrt(ca[10000:15000])
CPU times: user 96 ms, sys: 25 ms, total: 121 ms Wall time: 122 ms
array([[  0.00000000e+00,   2.00000000e+00,   5.41421356e+00, ...,
          3.98805369e+06,   3.99204870e+06,   3.99604571e+06],
       [  4.00004472e+06,   4.00404573e+06,   4.00804874e+06, ...,
          1.59760722e+07,   1.59840672e+07,   1.59920642e+07],
       [  1.60000632e+07,   1.60080643e+07,   1.60160673e+07, ...,
          3.59640864e+07,   3.59760814e+07,   3.59880785e+07],
       ...,
       [  3.97603600e+12,   3.97603999e+12,   3.97604398e+12, ...,
          3.98400403e+12,   3.98400802e+12,   3.98401201e+12],
       [  3.98401600e+12,   3.98401999e+12,   3.98402399e+12, ...,
          3.99199201e+12,   3.99199601e+12,   3.99200001e+12],
       [  3.99200400e+12,   3.99200800e+12,   3.99201199e+12, ...,
          3.99998800e+12,   3.99999200e+12,   3.99999600e+12]])
Another approach is to use the eval function of bcolz.
x = ca[:10000] # 10,000 rows as sub-set
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(clevel=9))
# output carray object compressed
CPU times: user 659 ms, sys: 38 ms, total: 697 ms Wall time: 188 ms
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,
    3.98805369e+06   3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,
    1.59760722e+07   1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,
    3.59640864e+07   3.59760814e+07   3.59880785e+07]
 ...,
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,
    1.59679920e+13   1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,
    1.59839800e+13   1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,
    1.59999760e+13   1.59999840e+13   1.59999920e+13]]
Disk-based storage of multiple versions of the array object works in the same way: we append the object 4,000 times to a single carray object on disk. It takes only a bit more than a minute to compress and store 120 GB worth of data on disk.
%%time
it = 4000
ca = bcolz.carray(a, rootdir='ca',
                  cparams=bcolz.cparams(clevel=9))
for i in range(it):
    ca.append(a)
CPU times: user 30.5 s, sys: 4.9 s, total: 35.4 s Wall time: 1min 15s
The compression ratio in this case is again 130+.
ca
carray((8002000, 2000), int64)
  nbytes: 119.24 GB; cbytes: 912.17 MB; ratio: 133.86
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'ca'
[[      0       1       2 ...,    1997    1998    1999]
 [   2000    2001    2002 ...,    3997    3998    3999]
 [   4000    4001    4002 ...,    5997    5998    5999]
 ...,
 [3994000 3994001 3994002 ..., 3995997 3995998 3995999]
 [3996000 3996001 3996002 ..., 3997997 3997998 3997999]
 [3998000 3998001 3998002 ..., 3999997 3999998 3999999]]
Simple numerical operations are easy to implement.
%time np.sum(ca[:1000] + ca[4000:5000])
CPU times: user 31 ms, sys: 0 ns, total: 31 ms Wall time: 31.2 ms
3999998000000
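The result can be verified by hand: ca consists of stacked copies of a, so both slices contain the identical values $0, 1, \ldots, 1{,}999{,}999$ and the sum amounts to

$$2 \sum_{i=0}^{1{,}999{,}999} i = 1{,}999{,}999 \cdot 2{,}000{,}000 = 3{,}999{,}998{,}000{,}000.$$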
Let us try the previous, mathematically more demanding operation – again with a sub-set of the data.
x = ca[:10000] # 10,000 rows as sub-set
First, with an in-memory carray results object.
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9))
CPU times: user 647 ms, sys: 10 ms, total: 657 ms Wall time: 178 ms
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,
    3.98805369e+06   3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,
    1.59760722e+07   1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,
    3.59640864e+07   3.59760814e+07   3.59880785e+07]
 ...,
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,
    1.59679920e+13   1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,
    1.59839800e+13   1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,
    1.59999760e+13   1.59999840e+13   1.59999920e+13]]
Second, with an on-disk results object. The time difference is not that huge.
%time bcolz.eval('x ** 2 + sqrt(x)', cparams=bcolz.cparams(9), rootdir='out')
CPU times: user 693 ms, sys: 53 ms, total: 746 ms Wall time: 281 ms
carray((10000, 2000), float64)
  nbytes: 152.59 MB; cbytes: 39.08 MB; ratio: 3.90
  cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
  rootdir := 'out'
[[  0.00000000e+00   2.00000000e+00   5.41421356e+00 ...,
    3.98805369e+06   3.99204870e+06   3.99604571e+06]
 [  4.00004472e+06   4.00404573e+06   4.00804874e+06 ...,
    1.59760722e+07   1.59840672e+07   1.59920642e+07]
 [  1.60000632e+07   1.60080643e+07   1.60160673e+07 ...,
    3.59640864e+07   3.59760814e+07   3.59880785e+07]
 ...,
 [  1.59520360e+13   1.59520440e+13   1.59520520e+13 ...,
    1.59679920e+13   1.59680000e+13   1.59680080e+13]
 [  1.59680160e+13   1.59680240e+13   1.59680320e+13 ...,
    1.59839800e+13   1.59839880e+13   1.59839960e+13]
 [  1.59840040e+13   1.59840120e+13   1.59840200e+13 ...,
    1.59999760e+13   1.59999840e+13   1.59999920e+13]]
Finally, we verify system disk usage.
!du -hs ca
# system disk usage
985M ca
!du -hs out
40M out
!rm -r ca
!rm -r out
blaze allows Python users a familiar interface to query data living in diverse data storage systems. Cf. http://blaze.pydata.org/.
import blaze as bz
The first example constructs a blaze.Data object from native Python objects.
t = bz.Data([('Henry', 'boy', 8),
             ('Lilli', 'girl', 14)],
            fields=['name', 'gender', 'age'])
t
| | name | gender | age |
|---|---|---|---|
0 | Henry | boy | 8 |
1 | Lilli | girl | 14 |
t[t.age > 10]
| | name | gender | age |
|---|---|---|---|
0 | Lilli | girl | 14 |
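Other expressions compose in the same manner; a brief sketch (such aggregations are constructed lazily and only computed on display):
t.age.mean()  # average age over all rows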
Let us read data from an in-memory NumPy ndarray object.
import numpy as np
a = np.random.standard_normal((1000000, 5))
# 1mn data rows, 5 columns
df = bz.DataFrame(a, columns=['f0', 'f1', 'f2', 'f3', 'f4'])
# blaze DataFrame constructor
A look at the data structure.
df.head()
| | f0 | f1 | f2 | f3 | f4 |
|---|---|---|---|---|---|
0 | 0.319853 | 0.245459 | -0.797091 | -0.064819 | -0.688033 |
1 | -0.417906 | -1.432667 | -0.906708 | -0.229992 | 0.218824 |
2 | -0.722823 | -0.233840 | -0.782198 | -2.052541 | -0.808443 |
3 | 1.124005 | 1.254103 | -0.063634 | -0.368991 | -0.024721 |
4 | -0.520790 | -0.828998 | 1.081217 | -0.033651 | 0.520602 |
The data itself is stored as a NumPy ndarray object.
df.values
array([[ 0.31985319,  0.24545877, -0.79709138, -0.0648189 , -0.68803281],
       [-0.41790586, -1.43266674, -0.90670814, -0.22999152,  0.21882423],
       [-0.72282341, -0.23384003, -0.78219815, -2.05254069, -0.80844349],
       ...,
       [ 0.85018339, -1.94928256, -0.53374708, -0.22892796,  0.21056478],
       [ 2.24111957, -2.35307037,  0.78848578, -1.33289509,  0.94705762],
       [ 0.74294773,  1.21010213,  0.04568963, -0.33003822, -1.06262355]])
We first generate a CSV file using the random data from before.
%time df.to_csv('data.csv', index=False)
CPU times: user 9.05 s, sys: 187 ms, total: 9.24 s Wall time: 9.28 s
Let us read the data with blaze. Actually, we only generate a view.
%time csv = bz.CSV('data.csv')
CPU times: user 34 µs, sys: 10 µs, total: 44 µs Wall time: 49.1 µs
%time t1 = bz.Data(csv)
CPU times: user 6.26 ms, sys: 2.35 ms, total: 8.61 ms Wall time: 7.84 ms
Now, we can work with the data. Note, however, that iterating, slicing, etc. are not (yet) implemented.
%time t1.count()
CPU times: user 800 µs, sys: 129 µs, total: 929 µs Wall time: 858 µs
The backend is a CSV object; displaying t1 shows the first rows of the data set.
t1.data
<into.backends.csv.CSV at 0x10ba76d90>
t1
| | f0 | f1 | f2 | f3 | f4 |
|---|---|---|---|---|---|
0 | 0.319853 | 0.245459 | -0.797091 | -0.064819 | -0.688033 |
1 | -0.417906 | -1.432667 | -0.906708 | -0.229992 | 0.218824 |
2 | -0.722823 | -0.233840 | -0.782198 | -2.052541 | -0.808443 |
3 | 1.124005 | 1.254103 | -0.063634 | -0.368991 | -0.024721 |
4 | -0.520790 | -0.828998 | 1.081217 | -0.033651 | 0.520602 |
5 | -0.252100 | -1.281830 | 0.333219 | -0.108898 | -1.654572 |
6 | 0.966351 | 0.189840 | -0.121226 | -0.465816 | -0.427109 |
7 | -0.823019 | -0.675048 | -0.212049 | 1.446792 | 0.432087 |
8 | 0.717941 | 0.334547 | 1.110829 | 0.687371 | -2.843008 |
9 | 0.672862 | 1.175542 | -1.012422 | 0.040519 | 0.999462 |
10 | 1.820175 | 0.557146 | 0.440231 | 0.815029 | -0.239220 |
We now generate a SQLite3 table with the dummy data from before.
import sqlite3 as sq3
con = sq3.connect('data.sql')
try:
    con.execute('DROP TABLE numbers')
    # delete the table if it already exists
except:
    pass
We write the data into an appropriate table.
con.execute(
    'CREATE TABLE numbers (f0 real, f1 real, f2 real, f3 real, f4 real)'
)
<sqlite3.Cursor at 0x10ba48f80>
%time con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', a)
CPU times: user 11.3 s, sys: 209 ms, total: 11.5 s Wall time: 11.5 s
<sqlite3.Cursor at 0x10ba48dc0>
con.commit()
con.close()
Now we read the data with blaze (i.e. just generate a view).
%time t2 = bz.Data('sqlite:///data.sql::numbers')
CPU times: user 13.3 ms, sys: 5.87 ms, total: 19.2 ms Wall time: 16.8 ms
The SQL backend and the first rows again.
t2.data
Table('numbers', MetaData(bind=Engine(sqlite:///data.sql)),
      Column('f0', REAL(), table=<numbers>),
      Column('f1', REAL(), table=<numbers>),
      Column('f2', REAL(), table=<numbers>),
      Column('f3', REAL(), table=<numbers>),
      Column('f4', REAL(), table=<numbers>),
      schema=None)
t2
| | f0 | f1 | f2 | f3 | f4 |
|---|---|---|---|---|---|
0 | 0.319853 | 0.245459 | -0.797091 | -0.064819 | -0.688033 |
1 | -0.417906 | -1.432667 | -0.906708 | -0.229992 | 0.218824 |
2 | -0.722823 | -0.233840 | -0.782198 | -2.052541 | -0.808443 |
3 | 1.124005 | 1.254103 | -0.063634 | -0.368991 | -0.024721 |
4 | -0.520790 | -0.828998 | 1.081217 | -0.033651 | 0.520602 |
5 | -0.252100 | -1.281830 | 0.333219 | -0.108898 | -1.654572 |
6 | 0.966351 | 0.189840 | -0.121226 | -0.465816 | -0.427109 |
7 | -0.823019 | -0.675048 | -0.212049 | 1.446792 | 0.432087 |
8 | 0.717941 | 0.334547 | 1.110829 | 0.687371 | -2.843008 |
9 | 0.672862 | 1.175542 | -1.012422 | 0.040519 | 0.999462 |
10 | 1.820175 | 0.557146 | 0.440231 | 0.815029 | -0.239220 |
blaze provides an abstraction logic for computations/queries.
ts = bz.TableSymbol('ts',
    '{f0: float64, f1: float64, f2: float64, f3: float64, f4: float64}')
# generic table description -- independent of the target data structure
expr = ts[ts['f0'] + ts['f3'] > 2.5]['f1']
# generic expression -- independent of the target data structure
The blaze compiler specializes the generic objects to different data structures.
%time np.array(bz.compute(expr, a)) # NumPy ndarray object
CPU times: user 22.2 ms, sys: 6.42 ms, total: 28.6 ms Wall time: 25.4 ms
array([ 0.55714636, 0.64643383, 0.02715145, ..., 0.01646136, -0.0801466 , 1.12205609])
%time np.array(bz.compute(expr, df)) # DataFrame object
CPU times: user 38.3 ms, sys: 8.39 ms, total: 46.7 ms Wall time: 40.4 ms
array([ 0.55714636, 0.64643383, 0.02715145, ..., 0.01646136, -0.0801466 , 1.12205609])
%time np.array(bz.compute(expr, csv)) # CSV file representation
CPU times: user 1.17 s, sys: 188 ms, total: 1.36 s Wall time: 1.35 s
array([ 0.55714636, 0.64643383, 0.02715145, ..., 0.01646136, -0.0801466 , 1.12205609])
In similar fashion, blaze allows unified expression evaluations for different backends (I).
%time t1[t1['f0'] + t1['f3'] > 2.5]['f1'].head()
# table representation 1
# from CSV
CPU times: user 4.67 ms, sys: 844 µs, total: 5.51 ms Wall time: 4.93 ms
| | f1 |
|---|---|
10 | 0.557146 |
27 | 0.646434 |
35 | 0.027151 |
51 | 0.515017 |
54 | 1.021370 |
85 | -0.049430 |
105 | -2.217659 |
215 | 0.678503 |
248 | -1.250740 |
305 | 0.316044 |
In similar fashion, blaze allows unified expression evaluations for different backends (II).
%time t2[t2['f0'] + t2['f3'] > 2.5]['f1'].head()
# table representation 2
# from SQL database
CPU times: user 4 ms, sys: 840 µs, total: 4.84 ms Wall time: 4.12 ms
| | f1 |
|---|---|
0 | 0.557146 |
1 | 0.646434 |
2 | 0.027151 |
3 | 0.515017 |
4 | 1.021370 |
5 | -0.049430 |
6 | -2.217659 |
7 | 0.678503 |
8 | -1.250740 |
9 | 0.316044 |
10 | -0.646563 |
Typical aggregation operations work as well. Note that the timings mainly reflect the construction of the lazily evaluated expression objects, not the computation itself.
%time t1.f0.sum()
CPU times: user 492 µs, sys: 118 µs, total: 610 µs Wall time: 513 µs
%time t2.f3.max()
CPU times: user 1.08 ms, sys: 188 µs, total: 1.27 ms Wall time: 1.1 ms
If you work intensively with data sets, it might be beneficial to transform them once into high-performance binary formats (e.g. bcolz or HDF5).
%time bz.into('data.bcolz', 'data.csv')
# natively done by blaze
# cparams=bcolz.cparams(9) could be added
# no effect here due to random floats
CPU times: user 2.09 s, sys: 618 ms, total: 2.71 s Wall time: 2.74 s
ctable((1000000,), [('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
  nbytes: 38.15 MB; cbytes: 38.52 MB; ratio: 0.99
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'data.bcolz'
[(0.3198531850803145, 0.245458767960108, -0.7970913836666452, -0.06481889703519775, -0.6880328099126439)
 (-0.41790586195877294, -1.432666742562975, -0.9067081412270984, -0.2299915211962229, 0.21882423409883608)
 (-0.7228234108159903, -0.2338400315854928, -0.7821981542503783, -2.0525406913167457, -0.8084434946349215)
 ...,
 (0.8501833916901524, -1.9492825646439451, -0.5337470832345179, -0.22892795562871054, 0.2105647794568058)
 (2.2411195659753136, -2.3530703744335533, 0.7884857765136529, -1.332895091777322, 0.9470576202661248)
 (0.7429477273462958, 1.2101021279999242, 0.045689629263727084, -0.3300382163012745, -1.0626235518611051)]
We can now connect to the bcolz disk-based ctable object.
import bcolz as bc
b = bc.ctable(rootdir='data.bcolz')
Now, the power of bcolz for numerical computations can be played out.
%time nex = b.eval('sqrt(abs(f0)) + log(abs(f1))')
CPU times: user 170 ms, sys: 54.9 ms, total: 225 ms Wall time: 164 ms
nex
carray((1000000,), float64)
  nbytes: 7.63 MB; cbytes: 7.45 MB; ratio: 1.02
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[-0.83907065  1.00599395 -0.6029278  ...,  1.58951529  2.35275795  1.05264891]
Similarly, we can use PyTables and HDF5 as an efficient binary store.
import pandas as pd
%%time
con = sq3.connect('data.sql')
pd.HDFStore('data.h5')['sql'] = pd.read_sql('SELECT * FROM numbers', con)
# simultaneously reading whole SQL table and writing it to HDF5 store
con.close()
CPU times: user 5.83 s, sys: 573 ms, total: 6.4 s Wall time: 6.48 s
Now, data can be efficiently retrieved.
%%time
%matplotlib inline
pd.HDFStore('data.h5')['sql'][::1000].cumsum().plot(figsize=(10, 5))
# simultaneously reading data from HDF5 store and plotting it
CPU times: user 915 ms, sys: 806 ms, total: 1.72 s Wall time: 2.37 s
!du -h dat*
9.5M	data.bcolz/f0/data
8.0K	data.bcolz/f0/meta
9.5M	data.bcolz/f0
9.5M	data.bcolz/f1/data
8.0K	data.bcolz/f1/meta
9.5M	data.bcolz/f1
9.5M	data.bcolz/f2/data
8.0K	data.bcolz/f2/meta
9.5M	data.bcolz/f2
9.5M	data.bcolz/f3/data
8.0K	data.bcolz/f3/meta
9.5M	data.bcolz/f3
9.5M	data.bcolz/f4/data
8.0K	data.bcolz/f4/meta
9.5M	data.bcolz/f4
48M	data.bcolz
94M	data.csv
46M	data.h5
52M	data.sql
!ls -n dat*.*
-rw-r--r--  1 501  20  98161382 Mar 8 19:35 data.csv
-rw-r--r--  1 501  20  48007208 Mar 8 19:36 data.h5
-rw-r--r--  1 501  20  54446080 Mar 8 19:36 data.sql

data.bcolz:
total 16
-rw-r--r--  1 501  20    3 Mar 8 19:36 __attrs__
-rw-r--r--  1 501  20   42 Mar 8 19:36 __rootdirs__
drwxr-xr-x  5 501  20  170 Mar 8 19:36 f0
drwxr-xr-x  5 501  20  170 Mar 8 19:36 f1
drwxr-xr-x  5 501  20  170 Mar 8 19:36 f2
drwxr-xr-x  5 501  20  170 Mar 8 19:36 f3
drwxr-xr-x  5 501  20  170 Mar 8 19:36 f4
# cleaning up
!rm -r dat*.*
High performance (hardware-bound) I/O operations and highly efficient data blending and analytics are among Python's key strengths.
Python Quant Platform | http://quant-platform.com
Python for Finance | Python for Finance @ O'Reilly
Derivatives Analytics with Python | Derivatives Analytics @ Wiley Finance