Winning in Efficient Markets (II)

Discovering Statistical Inefficiencies via Deep Learning

Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | @dyjh | training@tpq.io

The Agenda

This tutorial shows

  • how to retrieve historical data across asset classes via the Eikon Data API,
  • how to work with such data using pandas, Plotly and Cufflinks and
  • how to discover statistical inefficiencies in financial time series with machine learning.

Importing Required Packages

In [1]:
import time
import eikon as ek  # the Eikon package
import numpy as np  # NumPy
import pandas as pd  # pandas
import cufflinks as cf  # Cufflinks
import configparser as cp

The following Python and package versions are used.

In [2]:
import sys
print(sys.version)
3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
In [3]:
ek.__version__
Out[3]:
'1.0.1'
In [4]:
np.__version__
Out[4]:
'1.17.3'
In [5]:
pd.__version__
Out[5]:
'0.25.1'
In [6]:
cf.__version__
Out[6]:
'0.17.0'

Connecting to Eikon Data API

This code reads the app key from a configuration file; the key is then used to connect to the Eikon Data API proxy, which needs to be running locally.

In [7]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')
Out[7]:
['eikon.cfg']
In [8]:
# ek.set_app_key(cfg['eikon']['app_id'])
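
The configuration file is in plain INI format. Given the keys accessed above, a minimal eikon.cfg might look as follows; the value shown is a placeholder for the actual app key:

[eikon]
app_id = YOUR_APP_KEY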

Retrieving Cross-Asset Data

We first define a small universe of RICs (Reuters Instrument Codes) for which to retrieve data.

In [9]:
rics = [
    'GE',  # General Electric stock
    'AAPL.O',  # Apple stock
    '.SPX',  # S&P 500 stock index
    '.VIX',  # VIX volatility index
    'EUR=',  # EUR/USD exchange rate
    'XAU=',  # Gold price
    'GLD',  # Gold ETF
    'BTC=',  # Bitcoin in USD
]

Second, end-of-day (EOD) data is retrieved.

In [11]:
fn = 'eikon_eod_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2015-01-01',  # start date
                              end_date='2019-09-30')  # end date
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)  # pauses to respect API rate limits
    data.to_csv(fn)
In [12]:
data.head()  # first five rows
Out[12]:
                   GE  AAPL.O     .SPX   .VIX    EUR=     XAU=     GLD    BTC=
Date
2015-01-02  24.088950  109.33  2058.20  17.79  1.2000  1188.68  114.08  315.41
2015-01-05  23.646775  106.25  2020.58  19.92  1.1931  1203.70  115.80  276.80
2015-01-06  23.137312  106.26  2002.61  21.12  1.1888  1218.45  117.12  276.82
2015-01-07  23.146924  107.75  2025.90  19.31  1.1837  1210.05  116.43     NaN
2015-01-08  23.425687  111.89  2062.14  17.01  1.1792  1208.63  115.94  276.82
In [13]:
data.tail()  # final five rows
Out[13]:
              GE  AAPL.O     .SPX   .VIX    EUR=     XAU=     GLD     BTC=
Date
2019-09-24  8.97  217.68  2966.60  17.05  1.1018  1531.80  144.51  8575.50
2019-09-25  9.16  221.03  2984.87  15.96  1.0941  1503.75  141.83  8500.98
2019-09-26  9.02  219.89  2977.62  16.07  1.0921  1505.60  141.79  8088.33
2019-09-27  9.04  218.82  2961.79  17.22  1.0938  1496.56  141.06  8055.49
2019-09-30  8.94  223.97  2976.74  16.24  1.0898  1472.00  138.87  8260.00
In [14]:
data.info()  # DataFrame meta information
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1194 entries, 2015-01-02 to 2019-09-30
Data columns (total 8 columns):
GE        1194 non-null float64
AAPL.O    1194 non-null float64
.SPX      1194 non-null float64
.VIX      1194 non-null float64
EUR=      1194 non-null float64
XAU=      1194 non-null float64
GLD       1194 non-null float64
BTC=      1193 non-null float64
dtypes: float64(8)
memory usage: 84.0 KB

Preparing Features

To discover statistical inefficiencies, we work with different features extracted from the time series data. First, the function add_lags derives, for a given RIC, the log return, a simple moving average, a rolling minimum and maximum, momentum (the rolling mean of log returns), rolling volatility, and the direction of the daily movement, and then adds lagged versions of all these features.

In [15]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    df['r'] = np.log(df / df.shift())  # log returns
    df['sma'] = df[ric].rolling(20).mean()  # simple moving average
    df['min'] = df[ric].rolling(20).min()  # rolling minimum
    df['max'] = df[ric].rolling(20).max()  # rolling maximum
    df['mom'] = df['r'].rolling(20).mean()  # momentum (mean log return)
    df['vol'] = df['r'].rolling(20).std()  # rolling volatility
    df.dropna(inplace=True)
    df['d'] = np.where(df['r'] > 0, 1, 0)  # direction of the movement
    features = [ric, 'r', 'd', 'sma', 'min', 'max', 'mom', 'vol']
    for f in features:
        for lag in range(1, lags + 1):
            col = f'{f}_lag_{lag}'  # defines the column name
            df[col] = df[f].shift(lag)  # creates the lagged data column
            cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols
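
As a minimal sketch, a single call returns the enriched DataFrame object and the list of lagged column names, which follow the f'{f}_lag_{lag}' pattern from the function above:

df, cols = add_lags(data, 'AAPL.O', 3)  # three lags for illustration
cols[:3]  # ['AAPL.O_lag_1', 'AAPL.O_lag_2', 'AAPL.O_lag_3']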

Second, we iterate over all RICs, apply the add_lags function, and store the resulting DataFrame objects and column lists in a dictionary.

In [16]:
lags = 7  # historical lags
In [17]:
dfs = {}
for ric in rics:
    print(ric)
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
GE
AAPL.O
.SPX
.VIX
EUR=
XAU=
GLD
BTC=
In [18]:
dfs.keys()  # the keys of the dictionary
Out[18]:
dict_keys(['GE', 'AAPL.O', '.SPX', '.VIX', 'EUR=', 'XAU=', 'GLD', 'BTC='])
In [19]:
# dfs['AAPL.O'].head(7)

Machine Learning

The matrix consisting of the lagged data columns is used to "predict" the direction of the next day's movement of the RIC via machine learning.

In [20]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
In [21]:
np.set_printoptions(precision=4, suppress=True)
In [22]:
def create_model(algo=4, n_estimators=75):
    base_estimators = [
        GaussianNB(),
        LogisticRegression(C=1),
        DecisionTreeClassifier(random_state=100, max_depth=3,
                               min_samples_leaf=12),
        MLPClassifier(hidden_layer_sizes=2 * [256,],
                      early_stopping=True, shuffle=False,
                      # random_state=100,
                      activation='relu',
                      max_iter=250),
        RandomForestClassifier(n_estimators=5, max_depth=3,
                               min_samples_leaf=8, random_state=100)
    ]
    model = BaggingClassifier(base_estimator=base_estimators[algo],
                              n_estimators=n_estimators,
                              bootstrap=True,
                              oob_score=True,
                              n_jobs=4,
                              # random_state=100
                              )
    return model
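
Note that the algo parameter selects the base estimator for the bagging ensemble; the default of 4 reproduces the random forest used below, while, as a sketch, algo=1 would bag logistic regressions instead. Since oob_score=True, the out-of-bag accuracy estimate is available after fitting:

model = create_model(algo=1)  # bagging ensemble of logistic regressions
# after model.fit(...), the attribute model.oob_score_ holds the OOB accuracy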

First, training and prediction in-sample only.

In [23]:
for ric in rics:
    model = create_model(n_estimators=75)
    df, cols = dfs[ric]  # getting data for the RIC
    model.fit(df[cols], df['d'])  # the fitting step
    pred = model.predict(df[cols])  # the prediction step
    acc = accuracy_score(df['d'], pred)  # prediction accuracy
    print(f'IN-SAMPLE | {ric:7s} | acc={acc:.4f}')
IN-SAMPLE | GE      | acc=0.6401
IN-SAMPLE | AAPL.O  | acc=0.6452
IN-SAMPLE | .SPX    | acc=0.6230
IN-SAMPLE | .VIX    | acc=0.6435
IN-SAMPLE | EUR=    | acc=0.6598
IN-SAMPLE | XAU=    | acc=0.6889
IN-SAMPLE | GLD     | acc=0.6804
IN-SAMPLE | BTC=    | acc=0.6199

Second, training is done in-sample and prediction out-of-sample. The features are normalized with the mean and standard deviation of the training data only, which avoids lookahead bias.

In [24]:
split = int(len(dfs[ric][0]) * 0.7)  # 70/30 train-test split point
In [25]:
for ric in rics:
    model = create_model()
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()  # training data statistics
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization (training statistics)
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.4986
OUT-OF-SAMPLE | AAPL.O  | acc=0.4816
OUT-OF-SAMPLE | .SPX    | acc=0.4816
OUT-OF-SAMPLE | .VIX    | acc=0.5722
OUT-OF-SAMPLE | EUR=    | acc=0.5127
OUT-OF-SAMPLE | XAU=    | acc=0.4986
OUT-OF-SAMPLE | GLD     | acc=0.5071
OUT-OF-SAMPLE | BTC=    | acc=0.4900

Analyzing Intraday Data

Let us quickly check whether the results are similar on an intraday basis.

In [26]:
fn = 'eikon_id_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2019-10-07',  # start date
                              end_date='2019-10-12',  # end date
                              interval='minute')  # one-minute bars
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)  # pauses to respect API rate limits
    data.to_csv(fn)
In [27]:
data.dropna(inplace=True)
In [28]:
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1996 entries, 2019-10-07 13:32:00 to 2019-10-11 20:15:00
Data columns (total 8 columns):
GE        1996 non-null float64
AAPL.O    1996 non-null float64
.SPX      1996 non-null float64
.VIX      1996 non-null float64
EUR=      1996 non-null float64
XAU=      1996 non-null float64
GLD       1996 non-null float64
BTC=      1996 non-null float64
dtypes: float64(8)
memory usage: 140.3 KB
In [29]:
data.tail()
Out[29]:
                       GE  AAPL.O     .SPX   .VIX    EUR=       XAU=     GLD     BTC=
Date
2019-10-11 20:10:00  8.78  236.30  2970.27  15.55  1.1039  1487.8700  140.25  8375.70
2019-10-11 20:11:00  8.77  236.33  2970.27  15.63  1.1038  1487.7975  140.23  8352.60
2019-10-11 20:12:00  8.78  236.40  2970.27  15.56  1.1039  1487.2671  140.20  8369.26
2019-10-11 20:14:00  8.77  236.39  2970.27  15.55  1.1035  1487.2476  140.15  8372.16
2019-10-11 20:15:00  8.78  236.40  2970.27  15.58  1.1037  1486.9347  140.14  8384.40
In [30]:
dfs = {}
for ric in rics:
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
In [31]:
split = int(len(dfs[ric][0]) * 0.8)  # 80/20 train-test split point
In [32]:
for ric in rics:
    model = create_model(n_estimators=75)
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()  # training data statistics
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization (training statistics)
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.6396
OUT-OF-SAMPLE | AAPL.O  | acc=0.4695
OUT-OF-SAMPLE | .SPX    | acc=0.4898
OUT-OF-SAMPLE | .VIX    | acc=0.5660
OUT-OF-SAMPLE | EUR=    | acc=0.6345
OUT-OF-SAMPLE | XAU=    | acc=0.5279
OUT-OF-SAMPLE | GLD     | acc=0.4670
OUT-OF-SAMPLE | BTC=    | acc=0.6142

Eikon Data API Developer Resources