Eikon Data API

Cross-Asset Financial Analytics — Discovering Statistical Inefficiencies via Deep Learning

Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | @dyjh | training@tpq.io

The Agenda

This tutorial shows

  • how to retrieve historical data across asset classes via the Eikon Data API,
  • how to work with such data using pandas, Plotly and Cufflinks and
  • how to discover statistical inefficiencies in financial time series with machine learning.

Importing Required Packages

In [1]:
import time
import eikon as ek  # the Eikon package
import numpy as np  # NumPy
import pandas as pd  # pandas
import cufflinks as cf  # Cufflinks
import configparser as cp

The following Python and package versions are used.

In [2]:
import sys
print(sys.version)
3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
In [3]:
ek.__version__
Out[3]:
'1.0.1'
In [4]:
np.__version__
Out[4]:
'1.17.2'
In [5]:
pd.__version__
Out[5]:
'0.25.2'
In [6]:
cf.__version__
Out[6]:
'0.17.0'

Connecting to Eikon Data API

This code reads the app_id used to connect to the Eikon Data API Proxy, which needs to be running locally.

In [7]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')
Out[7]:
['eikon.cfg']
In [8]:
# ek.set_app_key(cfg['eikon']['app_id'])
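The eikon.cfg file is assumed to have the following minimal structure, with a placeholder instead of a real application key:

[eikon]
app_id = YOUR_APP_KEY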

Retrieving Cross-Asset Data

We first define a small universe of RICs (Reuters Instrument Codes) for which to retrieve data.

In [9]:
rics = [
    'GE',  # General Electric stock
    'AAPL.O',  # Apple stock
    '.SPX',  # S&P 500 stock index
    '.VIX',  # VIX volatility index
    'EUR=',  # EUR/USD exchange rate
    'XAU=',  # Gold price
    'GLD',  # Gold ETF
    'BTC=',  # Bitcoin in USD
]

Second, end-of-day (EOD) data is retrieved.

In [10]:
fn = 'eikon_eod_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:  # no cached file, retrieve via the API
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2015-01-01',  # start date
                              end_date='2019-09-30')  # end date
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)
    data.to_csv(fn)
In [11]:
data.head()  # first five rows
Out[11]:
GE AAPL.O .SPX .VIX EUR= XAU= GLD BTC=
Date
2015-01-02 24.088950 109.33 2058.20 17.79 1.2000 1188.68 114.08 315.41
2015-01-05 23.646775 106.25 2020.58 19.92 1.1931 1203.70 115.80 276.80
2015-01-06 23.137312 106.26 2002.61 21.12 1.1888 1218.45 117.12 276.82
2015-01-07 23.146924 107.75 2025.90 19.31 1.1837 1210.05 116.43 NaN
2015-01-08 23.425687 111.89 2062.14 17.01 1.1792 1208.63 115.94 276.82
In [12]:
data.tail()  # final five rows
Out[12]:
GE AAPL.O .SPX .VIX EUR= XAU= GLD BTC=
Date
2019-09-24 8.97 217.68 2966.60 17.05 1.1018 1531.80 144.51 8575.50
2019-09-25 9.16 221.03 2984.87 15.96 1.0941 1503.75 141.83 8500.98
2019-09-26 9.02 219.89 2977.62 16.07 1.0921 1505.60 141.79 8088.33
2019-09-27 9.04 218.82 2961.79 17.22 1.0938 1496.56 141.06 8055.49
2019-09-30 8.94 223.97 2976.74 16.24 1.0898 1472.00 138.87 8260.00
In [13]:
data.info()  # DataFrame meta information
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1194 entries, 2015-01-02 to 2019-09-30
Data columns (total 8 columns):
GE        1194 non-null float64
AAPL.O    1194 non-null float64
.SPX      1194 non-null float64
.VIX      1194 non-null float64
EUR=      1194 non-null float64
XAU=      1194 non-null float64
GLD       1194 non-null float64
BTC=      1193 non-null float64
dtypes: float64(8)
memory usage: 84.0 KB
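The agenda mentions visualization with Plotly and Cufflinks. As a minimal sketch, the retrieved series might be plotted as follows, normalized to their first values (assuming a Jupyter environment with Plotly offline mode):

cf.set_config_file(offline=True)  # Plotly offline mode
(data / data.iloc[0]).iplot()  # normalized price series as line chart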

Preparing Features

To discover statistical inefficiencies, we work with different features extracted from the time series data. First, we define a function add_lags() that derives the features and adds lagged versions of all data columns.

In [14]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    df['r'] = np.log(df / df.shift())  # log returns
    df['sma'] = df[ric].rolling(20).mean()  # simple moving average
    df['min'] = df[ric].rolling(20).min()  # rolling minimum
    df['max'] = df[ric].rolling(20).max()  # rolling maximum
    df['mom'] = df['r'].rolling(20).mean()  # momentum (mean log return)
    df['vol'] = df['r'].rolling(20).std()  # rolling volatility
    df.dropna(inplace=True)
    df['d'] = np.where(df['r'] > 0, 1, 0)  # market direction (1 = up, 0 = down)
    features = [ric, 'r', 'd', 'sma', 'min', 'max', 'mom', 'vol']
    for f in features:
        for lag in range(1, lags + 1):
            col = f'{f}_lag_{lag}'  # defines the column name
            df[col] = df[f].shift(lag)  # creates the lagged data column
            cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols
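As a quick illustration of what add_lags() produces, here is a minimal sketch on synthetic (purely hypothetical) data:

idx = pd.date_range('2019-01-01', periods=100, freq='B')  # business days
synth = pd.DataFrame({'SYNTH': 100 + np.random.standard_normal(100).cumsum()},
                     index=idx)  # random walk as stand-in price series
df_, cols_ = add_lags(synth, 'SYNTH', lags=3)
cols_[:5]  # ['SYNTH_lag_1', 'SYNTH_lag_2', 'SYNTH_lag_3', 'r_lag_1', 'r_lag_2']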

Second, we iterate over all RICs, apply the add_lags() function and store the resulting DataFrame objects in a dictionary.

In [15]:
lags = 7  # historical lags
In [16]:
dfs = {}
for ric in rics:
    print(ric)
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
GE
AAPL.O
.SPX
.VIX
EUR=
XAU=
GLD
BTC=
In [17]:
dfs.keys()  # the keys of the dictionary
Out[17]:
dict_keys(['GE', 'AAPL.O', '.SPX', '.VIX', 'EUR=', 'XAU=', 'GLD', 'BTC='])
In [18]:
# dfs['AAPL.O'].head(7)

Machine Learning

The matrix consisting of the lagged data columns is used to "predict" the direction of the next day's movement of the RIC (column d) via machine learning.

In [19]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
In [20]:
np.set_printoptions(precision=4, suppress=True)
In [21]:
def create_model(algo=4, n_estimators=75):
    base_estimators = [
        GaussianNB(),
        LogisticRegression(C=1),
        DecisionTreeClassifier(random_state=100, max_depth=3,
                               min_samples_leaf=12),
        MLPClassifier(hidden_layer_sizes=2 * [128,],
                      early_stopping=True, shuffle=False,
                      random_state=100,
                      activation='relu',
                      max_iter=250),
        RandomForestClassifier(n_estimators=5, max_depth=3,
                               min_samples_leaf=8, random_state=100)
    ]
    model = BaggingClassifier(base_estimator=base_estimators[algo],
                              n_estimators=n_estimators,
                              bootstrap=True,
                              max_features=0.75,
                              max_samples=0.75,
                              n_jobs=4,
                              random_state=100
                              )
    return model
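The algo parameter selects which base estimator gets bagged; the default of 4 reproduces the random forest setup used in what follows. As an example, bagging logistic regressions instead:

model = create_model(algo=1, n_estimators=50)  # bagged logistic regressions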

First, training and prediction in-sample only.

In [22]:
%%time
for ric in rics:
    model = create_model(n_estimators=10)
    df, cols = dfs[ric]  # getting data for the RIC
    model.fit(df[cols], df['d'])  # the fitting step
    pred = model.predict(df[cols])  # the prediction step
    acc = accuracy_score(df['d'], pred)  # prediction accuracy
    print(f'IN-SAMPLE | {ric:7s} | acc={acc:.4f}')
IN-SAMPLE | GE      | acc=0.5835
IN-SAMPLE | AAPL.O  | acc=0.6195
IN-SAMPLE | .SPX    | acc=0.5938
IN-SAMPLE | .VIX    | acc=0.6127
IN-SAMPLE | EUR=    | acc=0.6427
IN-SAMPLE | XAU=    | acc=0.6401
IN-SAMPLE | GLD     | acc=0.6470
IN-SAMPLE | BTC=    | acc=0.6139
CPU times: user 1.25 s, sys: 753 ms, total: 2 s
Wall time: 1min 6s
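To put the in-sample accuracies into perspective, they can be compared against a naive baseline that always predicts the majority direction; a minimal sketch:

for ric in rics:
    df, cols = dfs[ric]
    base = max(df['d'].mean(), 1 - df['d'].mean())  # majority-class hit ratio
    print(f'BASELINE  | {ric:7s} | acc={base:.4f}')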

Second, training in-sample and prediction out-of-sample.

In [23]:
split = int(len(dfs[ric][0]) * 0.7)  # 70% of the rows for training (based on the last RIC processed)
In [24]:
%%time
for ric in rics:
    model = create_model()
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.5156
OUT-OF-SAMPLE | AAPL.O  | acc=0.4958
OUT-OF-SAMPLE | .SPX    | acc=0.5212
OUT-OF-SAMPLE | .VIX    | acc=0.5666
OUT-OF-SAMPLE | EUR=    | acc=0.5156
OUT-OF-SAMPLE | XAU=    | acc=0.4986
OUT-OF-SAMPLE | GLD     | acc=0.5042
OUT-OF-SAMPLE | BTC=    | acc=0.4986
CPU times: user 5.29 s, sys: 855 ms, total: 6.15 s
Wall time: 1min 7s
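Whether an out-of-sample accuracy deviates significantly from random guessing (50%) can be checked with a binomial test; a minimal sketch for the last RIC processed, using SciPy (an additional import not used above):

from scipy.stats import binom_test

n = len(test)  # number of out-of-sample predictions
hits = int(round(acc * n))  # number of correct predictions
binom_test(hits, n, 0.5)  # p-value under the null of random guessing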

Analyzing Intraday Data

Let us quickly check whether the results are similar on an intraday basis.

In [25]:
fn = 'eikon_id_data.csv'
try:
    data = pd.read_csv(fn, index_col=0, parse_dates=True)
except FileNotFoundError:  # no cached file, retrieve via the API
    first = True
    for ric in rics:
        print(ric)
        d = ek.get_timeseries(ric,  # the RIC
                              fields='CLOSE',  # the required fields
                              start_date='2019-10-07',  # start date
                              end_date='2019-10-12',  # end date
                              interval='minute')  # bar interval
        if first:
            data = d
            data.columns = [ric]
            first = False
        else:
            data[ric] = d
        time.sleep(2)
    data.to_csv(fn)
In [26]:
data.dropna(inplace=True)  # removes rows with missing values
In [27]:
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1996 entries, 2019-10-07 13:32:00 to 2019-10-11 20:15:00
Data columns (total 8 columns):
GE        1996 non-null float64
AAPL.O    1996 non-null float64
.SPX      1996 non-null float64
.VIX      1996 non-null float64
EUR=      1996 non-null float64
XAU=      1996 non-null float64
GLD       1996 non-null float64
BTC=      1996 non-null float64
dtypes: float64(8)
memory usage: 140.3 KB
In [28]:
data.tail()
Out[28]:
GE AAPL.O .SPX .VIX EUR= XAU= GLD BTC=
Date
2019-10-11 20:10:00 8.78 236.30 2970.27 15.55 1.1039 1487.8700 140.25 8375.70
2019-10-11 20:11:00 8.77 236.33 2970.27 15.63 1.1038 1487.7975 140.23 8352.60
2019-10-11 20:12:00 8.78 236.40 2970.27 15.56 1.1039 1487.2671 140.20 8369.26
2019-10-11 20:14:00 8.77 236.39 2970.27 15.55 1.1035 1487.2476 140.15 8372.16
2019-10-11 20:15:00 8.78 236.40 2970.27 15.58 1.1037 1486.9347 140.14 8384.40
In [29]:
dfs = {}
for ric in rics:
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df.dropna(), cols
In [30]:
split = int(len(dfs[ric][0]) * 0.8)  # 80% of the rows for training
In [31]:
%%time
for ric in rics:
    model = create_model(n_estimators=75)
    df, cols = dfs[ric]  # getting data for the RIC
    train = df.iloc[:split].copy()  # training data set
    mu, std = train[cols].mean(), train[cols].std()
    train[cols] = (train[cols] - mu) / std  # normalization
    model.fit(train[cols], train['d'])  # the fitting step
    test = df.iloc[split:].copy()  # test data set
    test[cols] = (test[cols] - mu) / std  # normalization
    pred = model.predict(test[cols])  # the prediction step
    acc = accuracy_score(test['d'], pred)  # prediction accuracy
    print(f'OUT-OF-SAMPLE | {ric:7s} | acc={acc:.4f}')
OUT-OF-SAMPLE | GE      | acc=0.6396
OUT-OF-SAMPLE | AAPL.O  | acc=0.4619
OUT-OF-SAMPLE | .SPX    | acc=0.4898
OUT-OF-SAMPLE | .VIX    | acc=0.4442
OUT-OF-SAMPLE | EUR=    | acc=0.6421
OUT-OF-SAMPLE | XAU=    | acc=0.5127
OUT-OF-SAMPLE | GLD     | acc=0.4848
OUT-OF-SAMPLE | BTC=    | acc=0.6117
CPU times: user 5.04 s, sys: 785 ms, total: 5.83 s
Wall time: 1min 8s

Conclusions

Based on this tutorial, we can conclude that

  • it is easy to retrieve historical end-of-day and intraday data across asset classes via the Eikon Data API,
  • Plotly and Cufflinks make financial data visualization convenient and
  • there is some evidence for statistical inefficiencies based on a machine learning-based prediction approach.

Eikon Data API Developer Resources

Data Item Browser Application: Type DIB into the Eikon Search Bar.