Winning in Efficient Markets (II)

Discovering Statistical Inefficiencies via Deep Learning

Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | @dyjh | training@tpq.io

The Agenda

This tutorial shows

  • how to retrieve historical data across asset classes via the Eikon Data API,
  • how to work with such data using pandas, Plotly and Cufflinks and
  • how to discover statistical inefficiencies in financial time series with machine learning.

Importing Required Packages

In [1]:
import time
import eikon as ek  # the Eikon package
import numpy as np  # NumPy
import pandas as pd  # pandas
import cufflinks as cf  # Cufflinks
import configparser as cp

The following Python and package versions are used.

In [2]:
import sys
print(sys.version)
3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
In [3]:
ek.__version__
Out[3]:
'1.0.1'
In [4]:
np.__version__
Out[4]:
'1.17.3'
In [5]:
pd.__version__
Out[5]:
'0.25.1'
In [6]:
cf.__version__
Out[6]:
'0.17.0'

Connecting to Eikon Data API

This code reads the app key from a configuration file; the key is then used to connect to the Eikon Data API proxy, which needs to be running locally.

In [7]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')
Out[7]:
['eikon.cfg']
In [8]:
# ek.set_app_key(cfg['eikon']['app_id'])
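
The configuration file is in plain INI format. Given the keys accessed above, a minimal eikon.cfg might look as follows; the value shown is a placeholder for the actual app key:

[eikon]
app_id = YOUR_APP_KEY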

Retrieving Cross-Asset Data

We first define a small universe of RICs (Reuters Instrument Codes) for which to retrieve data.

In [9]:
rics = [
    'GE',  # General Electric stock
    'AAPL.O',  # Apple stock
    '.SPX',  # S&P 500 stock index
    '.VIX',  # VIX volatility index
    'EUR=',  # EUR/USD exchange rate
    'XAU=',  # Gold price
    'GLD',  # Gold ETF
    'BTC=',  # Bitcoin in USD
]

Second, end-of-day (EOD) data is retrieved.

In [11]:
fn = 'eikon_eod_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2015-01-01',  # start date
                              end_date='2019-09-30')  # end date
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)  # pauses to respect API rate limits
    data.to_csv(fn)
In [12]:
data.head()  # first five rows
Out[12]:
                   GE  AAPL.O     .SPX   .VIX    EUR=     XAU=     GLD    BTC=
Date
2015-01-02  24.088950  109.33  2058.20  17.79  1.2000  1188.68  114.08  315.41
2015-01-05  23.646775  106.25  2020.58  19.92  1.1931  1203.70  115.80  276.80
2015-01-06  23.137312  106.26  2002.61  21.12  1.1888  1218.45  117.12  276.82
2015-01-07  23.146924  107.75  2025.90  19.31  1.1837  1210.05  116.43     NaN
2015-01-08  23.425687  111.89  2062.14  17.01  1.1792  1208.63  115.94  276.82
In [13]:
data.tail()  # final five rows
Out[13]:
              GE  AAPL.O     .SPX   .VIX    EUR=     XAU=     GLD     BTC=
Date
2019-09-24  8.97  217.68  2966.60  17.05  1.1018  1531.80  144.51  8575.50
2019-09-25  9.16  221.03  2984.87  15.96  1.0941  1503.75  141.83  8500.98
2019-09-26  9.02  219.89  2977.62  16.07  1.0921  1505.60  141.79  8088.33
2019-09-27  9.04  218.82  2961.79  17.22  1.0938  1496.56  141.06  8055.49
2019-09-30  8.94  223.97  2976.74  16.24  1.0898  1472.00  138.87  8260.00
In [14]:
data.info()  # DataFrame meta information
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1194 entries, 2015-01-02 to 2019-09-30
Data columns (total 8 columns):
GE        1194 non-null float64
AAPL.O    1194 non-null float64
.SPX      1194 non-null float64
.VIX      1194 non-null float64
EUR=      1194 non-null float64
XAU=      1194 non-null float64
GLD       1194 non-null float64
BTC=      1193 non-null float64
dtypes: float64(8)
memory usage: 84.0 KB

Preparing Features

To discover statistical inefficiencies, we work with different features extracted from the time series data. First, the function add_lags derives, for a given RIC, the log return, a simple moving average, a rolling minimum and maximum, momentum (the rolling mean of log returns), rolling volatility, and the direction of the daily movement, and then adds lagged versions of all these features.

In [15]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    df['r'] = np.log(df / df.shift())  # log returns
    df['sma'] = df[ric].rolling(20).mean()  # simple moving average
    df['min'] = df[ric].rolling(20).min()  # rolling minimum
    df['max'] = df[ric].rolling(20).max()  # rolling maximum
    df['mom'] = df['r'].rolling(20).mean()  # momentum (mean log return)
    df['vol'] = df['r'].rolling(20).std()  # rolling volatility
    df.dropna(inplace=True)
    df['d'] = np.where(df['r'] > 0, 1, 0)  # direction of the movement
    features = [ric, 'r', 'd', 'sma', 'min', 'max', 'mom', 'vol']
    for f in features:
        for lag in range(1, lags + 1):
            col = f'{f}_lag_{lag}'  # defines the column name
            df[col] = df[f].shift(lag)  # creates the lagged data column
            cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols
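
As a minimal sketch, a single call returns the enriched DataFrame object and the list of lagged column names, which follow the f'{f}_lag_{lag}' pattern from the function above:

df, cols = add_lags(data, 'AAPL.O', 3)  # three lags for illustration
cols[:3]  # ['AAPL.O_lag_1', 'AAPL.O_lag_2', 'AAPL.O_lag_3']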

Second, we iterate over all RICs, apply the add_lags function, and store the resulting DataFrame objects and column lists in a dictionary.

In [16]:
lags = 7  # historical lags
In [17]:
dfs = {}
for ric in rics:
    print(ric)
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
GE
AAPL.O
.SPX
.VIX
EUR=
XAU=
GLD
BTC=
In [18]:
dfs.keys()  # the keys of the dictionary
Out[18]:
dict_keys(['GE', 'AAPL.O', '.SPX', '.VIX', 'EUR=', 'XAU=', 'GLD', 'BTC='])
In [19]:
# dfs['AAPL.O'].head(7)

Machine Learning

The matrix consisting of the lagged data columns is used to "predict" the direction of the next day's movement of the RIC via machine learning.

In [20]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
In [21]:
np.set_printoptions(precision=4, suppress=True)
In [22]:
def create_model(algo=4, n_estimators=75):
    base_estimators = [
        GaussianNB(),
        LogisticRegression(C=1),
        DecisionTreeClassifier(random_state=100, max_depth=3,
                               min_samples_leaf=12),
        MLPClassifier(hidden_layer_sizes=2 * [256,],
                      early_stopping=True, shuffle=False,
                      # random_state=100,
                      activation='relu',
                      max_iter=250),
        RandomForestClassifier(n_estimators=5, max_depth=3,
                               min_samples_leaf=8, random_state=100)
    ]
    model = BaggingClassifier(base_estimator=base_estimators[algo],
                              n_estimators=n_estimators,
                              bootstrap=True,
                              oob_score=True,
                              n_jobs=4,
                              # random_state=100
                              )
    return model
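
Note that the algo parameter selects the base estimator for the bagging ensemble; the default of 4 reproduces the random forest used below, while, as a sketch, algo=1 would bag logistic regressions instead. Since oob_score=True, the out-of-bag accuracy estimate is available after fitting:

model = create_model(algo=1)  # bagging ensemble of logistic regressions
# after model.fit(...), the attribute model.oob_score_ holds the OOB accuracy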

First, training and prediction in-sample only.

In [23]:
for ric in rics:
    model = create_model(n_estimators=75)
    df, cols = dfs[ric]  # getting data for the RIC
    model.fit(df[cols], df['d'])  # the fitting step
    pred = model.predict(df[cols])  # the prediction step
    acc = accuracy_score(df['d'], pred)  # prediction accuracy
    print(f'IN-SAMPLE | {ric:7s} | acc={acc:.4f}')
IN-SAMPLE | GE      | acc=0.6401
IN-SAMPLE | AAPL.O  | acc=0.6452
IN-SAMPLE | .SPX    | acc=0.6230
IN-SAMPLE | .VIX    | acc=0.6435
IN-SAMPLE | EUR=    | acc=0.6598
IN-SAMPLE | XAU=    | acc=0.6889
IN-SAMPLE | GLD     | acc=0.6804
IN-SAMPLE | BTC=    | acc=0.6199

Second, training is done in-sample and prediction out-of-sample. The features are normalized with the mean and standard deviation of the training data only, which avoids lookahead bias.

In [24]:
split = int(len(dfs[ric][0]) * 0.7)  # 70/30 train-test split point
In [25]:
for ric in rics:
    model = create_model()
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()  # training data statistics
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization (training statistics)
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.4986
OUT-OF-SAMPLE | AAPL.O  | acc=0.4816
OUT-OF-SAMPLE | .SPX    | acc=0.4816
OUT-OF-SAMPLE | .VIX    | acc=0.5722
OUT-OF-SAMPLE | EUR=    | acc=0.5127
OUT-OF-SAMPLE | XAU=    | acc=0.4986
OUT-OF-SAMPLE | GLD     | acc=0.5071
OUT-OF-SAMPLE | BTC=    | acc=0.4900

Analyzing Intraday Data

Let us quickly check whether the results are similar on an intraday basis.

In [26]:
fn = 'eikon_id_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2019-10-07',  # start date
                              end_date='2019-10-12',  # end date
                              interval='minute')  # one-minute bars
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)  # pauses to respect API rate limits
    data.to_csv(fn)
In [27]:
data.dropna(inplace=True)
In [28]:
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1996 entries, 2019-10-07 13:32:00 to 2019-10-11 20:15:00
Data columns (total 8 columns):
GE        1996 non-null float64
AAPL.O    1996 non-null float64
.SPX      1996 non-null float64
.VIX      1996 non-null float64
EUR=      1996 non-null float64
XAU=      1996 non-null float64
GLD       1996 non-null float64
BTC=      1996 non-null float64
dtypes: float64(8)
memory usage: 140.3 KB
In [29]:
data.tail()
Out[29]:
                       GE  AAPL.O     .SPX   .VIX    EUR=       XAU=     GLD     BTC=
Date
2019-10-11 20:10:00  8.78  236.30  2970.27  15.55  1.1039  1487.8700  140.25  8375.70
2019-10-11 20:11:00  8.77  236.33  2970.27  15.63  1.1038  1487.7975  140.23  8352.60
2019-10-11 20:12:00  8.78  236.40  2970.27  15.56  1.1039  1487.2671  140.20  8369.26
2019-10-11 20:14:00  8.77  236.39  2970.27  15.55  1.1035  1487.2476  140.15  8372.16
2019-10-11 20:15:00  8.78  236.40  2970.27  15.58  1.1037  1486.9347  140.14  8384.40
In [30]:
dfs = {}
for ric in rics:
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
In [31]:
split = int(len(dfs[ric][0]) * 0.8)  # 80/20 train-test split point
In [32]:
for ric in rics:
    model = create_model(n_estimators=75)
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()  # training data statistics
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization (training statistics)
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.6396
OUT-OF-SAMPLE | AAPL.O  | acc=0.4695
OUT-OF-SAMPLE | .SPX    | acc=0.4898
OUT-OF-SAMPLE | .VIX    | acc=0.5660
OUT-OF-SAMPLE | EUR=    | acc=0.6345
OUT-OF-SAMPLE | XAU=    | acc=0.5279
OUT-OF-SAMPLE | GLD     | acc=0.4670
OUT-OF-SAMPLE | BTC=    | acc=0.6142

Eikon Data API Developer Resources