Predicting Financial Time Series Movements
Dr. Yves J. Hilpisch | The Python Quants GmbH
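This tutorial shows how to predict the direction of financial time series movements with data from the Eikon Data API, using pandas, Plotly and Cufflinks as well as scikit-learn.
import eikon as ek # the Eikon Python wrapper package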
import numpy as np # NumPy
import pandas as pd # pandas
import cufflinks as cf # Cufflinks
from sklearn.svm import SVC # scikit-learn
from sklearn.model_selection import train_test_split
import configparser as cp
The following Python and package versions are used.
import sys
print(sys.version)
ek.__version__
np.__version__
pd.__version__
cf.__version__
This code sets the app_id to connect to the Eikon Data API Proxy, which needs to be running locally.
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')
ek.set_app_id(cfg['eikon']['app_id'])
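The configuration file itself is not shown in this tutorial. The following is a minimal sketch of what eikon.cfg might look like and how configparser reads it. The section and key names follow the code above; the value YOUR_APP_ID and the helper read_app_id are placeholders introduced here for illustration only.

```python
import configparser as cp
import os
import tempfile

# Hypothetical file content; replace YOUR_APP_ID with your own app ID.
CONFIG_TEXT = """[eikon]
app_id = YOUR_APP_ID
"""


def read_app_id(path):
    """Return the app_id stored in the [eikon] section of the file at path."""
    cfg = cp.ConfigParser()
    if not cfg.read(path):  # read() returns the list of files it could parse
        raise FileNotFoundError(path)
    return cfg['eikon']['app_id']


# Write the sample file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, 'eikon.cfg')
    with open(cfg_path, 'w') as f:
        f.write(CONFIG_TEXT)
    app_id = read_app_id(cfg_path)

print(app_id)
```

Keeping the app ID in a separate file avoids hard-coding it in the notebook.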
We first define a small universe of RICs for which to retrieve data.
rics = [
    'SPY', # S&P 500 ETF
    'AAPL.O', # Apple stock
    'AMZN.O' # Amazon stock
]
Second, intraday data with one-minute bars is retrieved.
data = ek.get_timeseries(rics, # the RICs
                         fields='CLOSE', # the required fields
                         start_date='2018-02-12', # start date
                         end_date='2018-02-28', # end date
                         interval='minute') # bar length
data.info()
data.head() # first five rows
data.tail() # final five rows
Only complete data rows are selected.
data.dropna(inplace=True) # deletes rows with NaN values
data.info() # DataFrame meta information
We next calculate the log returns in vectorized fashion.
rets = np.log(data / data.shift(1)).dropna() # log returns in vectorized fashion
rets.head()
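To make the log return calculation concrete, here is a small sketch on made-up prices (the column name A and the values are purely illustrative). It also shows why log returns are convenient: they add up over time, so their sum equals the log of the total gross return.

```python
import numpy as np
import pandas as pd

# Hypothetical price series, not Eikon data.
prices = pd.DataFrame({'A': [100.0, 101.0, 99.0, 102.0]})

# Same vectorized expression as in the tutorial.
toy_rets = np.log(prices / prices.shift(1)).dropna()

# Log returns are additive over time.
total = toy_rets['A'].sum()
print(toy_rets)
print(np.exp(total))  # gross return from first to last price
```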
Using Cufflinks, we can plot the normalized financial time series as line plots for comparison.
cf.set_config_file(offline=True) # set the plotting mode to offline
data.normalize().iplot(kind='lines')
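Normalization puts series with very different price levels on a common scale. The sketch below uses plain pandas and divides every column by its first value, which is one common convention; it is an assumption for illustration and not necessarily the exact scaling that Cufflinks applies. The column names and prices are made up.

```python
import pandas as pd

# Hypothetical prices on very different levels.
prices = pd.DataFrame({'X': [50.0, 55.0, 60.0],
                       'Y': [200.0, 190.0, 210.0]})

# Divide each column by its first value so that every series starts at 1.
normalized = prices / prices.iloc[0]
print(normalized)
```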
The frequency distributions, i.e. the histograms, of the log returns per RIC are plotted next.
rets.iplot(kind='histogram', subplots=True)
To gain insights into whether the random walk hypothesis holds true, we work with five lags. The code that follows derives the lagged data for every single RIC. First, a function that adds columns with lagged data to a DataFrame object.
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric]) # uses the data passed in (here: log returns)
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag) # defines the column name
        df[col] = df[ric].shift(lag) # creates the lagged data column
        cols.append(col) # stores the column name
    df.dropna(inplace=True) # gets rid of incomplete data rows
    # bins the lagged values: 0 for negative, 1 for non-negative
    df[cols] = np.digitize(df[cols].values, bins=[0])
    return df, cols
Second, the iteration over all RICs, using the add_lags function and storing the resulting DataFrame objects in a dictionary.
lags = 5 # five historical lags
dfs = {}
for ric in rics:
    df, cols = add_lags(rets, ric, lags) # log returns, not prices
    dfs[ric] = df
cols # the column names for the lags
dfs.keys() # the keys of the dictionary
dfs['AAPL.O'].head(7)
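The lag construction can be traced by hand on a tiny example. The sketch below uses made-up returns and two lags instead of five. It shifts first, drops the incomplete rows, and only then bins with np.digitize, because np.digitize assigns NaN to the last bin rather than keeping it as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical log returns, not Eikon data.
toy = pd.DataFrame({'r': [0.01, -0.02, 0.0, 0.03, -0.01]})
lag_cols = ['lag_1', 'lag_2']
toy['lag_1'] = toy['r'].shift(1)  # return one bar earlier
toy['lag_2'] = toy['r'].shift(2)  # return two bars earlier
toy = toy.dropna()                # first two rows have missing lags

# With bins=[0], negative values map to 0 and non-negative values to 1.
toy[lag_cols] = np.digitize(toy[lag_cols].values, bins=[0])
print(toy)

# NaN ends up in the last bin instead of staying missing.
nan_code = np.digitize(np.array([np.nan]), bins=[0])
print(nan_code)
```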
The matrix consisting of the lagged data columns is used to "predict" the direction of the next movement of each RIC via a support vector machine (SVM) algorithm.
for ric in rics:
    model = SVC(C=100) # the ML model
    df = dfs[ric].copy() # getting data for the RIC
    model.fit(df[cols], np.sign(df[ric])) # model fitting
    dfs[ric]['position'] = model.predict(df[cols]) # prediction
for ric in rics:
    print('{:10} | {}'.format(ric, dfs[ric]['position'].values[:12]))
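One simple way to judge such predictions is the hit ratio, i.e. the share of bars where the predicted direction equals the realized one. The sketch below uses made-up arrays; in the tutorial's setting the inputs would be np.sign(dfs[ric][ric]) and dfs[ric]['position']. The hit ratio is an addition for illustration and not part of the original code.

```python
import numpy as np

# Hypothetical realized directions (sign of the log return) and predictions.
actual = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 0.0])
predicted = np.array([1.0, 1.0, 1.0, -1.0, -1.0, 1.0])

# Share of observations where prediction and realization agree.
hit_ratio = np.mean(actual == predicted)
print(hit_ratio)
```

Note that this in-sample figure is computed on the same data the model was fitted on, so it tends to overstate predictive power.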
Let's backtest the performance of the ML-based trading strategies. First, the strategy returns.
for ric in rics:
    dfs[ric]['strategy'] = dfs[ric]['position'] * dfs[ric][ric]
Second, the visualization of the cumulative performance.
for ric in rics:
    dfs[ric][[ric, 'strategy']].cumsum().apply(np.exp).iplot()
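The backtesting arithmetic can be checked on a few made-up numbers. The sketch below multiplies hypothetical positions (+1 for long, -1 for short) with log returns to get strategy returns, then applies the same cumsum-and-exp step to turn cumulative log returns into gross performance. The column names ret and position are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical gross returns per bar, converted to log returns.
toy = pd.DataFrame({'ret': np.log([1.01, 0.98, 1.02]),
                    'position': [1.0, -1.0, 1.0]})

# A short position (-1) earns the negative of the log return.
toy['strategy'] = toy['position'] * toy['ret']

# Cumulative log returns, exponentiated, give the gross performance.
perf = toy[['ret', 'strategy']].cumsum().apply(np.exp)
print(perf)
```

Being short during the bar with the negative return is what lifts the strategy above the passive investment in this toy case.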
Next, to get a more realistic picture of the trading performance to be expected, a random train-test split is used to implement out-of-sample backtesting.
res = {}
for ric in rics:
    model = SVC(C=100) # the ML model
    df = dfs[ric].copy() # getting data for the RIC
    mu = df[ric].mean() # mean log return
    v = df[ric].std() # standard deviation of the log returns
    bins = [mu - v, mu, mu + v]
    # bins = [0]
    # random split into training and test data
    train_x, test_x, train_y, test_y = train_test_split(
        df[cols].apply(lambda x: np.digitize(x, bins=bins)),
        np.sign(df[ric]), test_size=0.33, random_state=111)
    # restores the chronological order within each subset
    train_x.sort_index(inplace=True)
    train_y.sort_index(inplace=True)
    test_x.sort_index(inplace=True)
    test_y.sort_index(inplace=True)
    model.fit(train_x, train_y) # model fitting
    pred = model.predict(test_x) # prediction
    strat = pred * df[ric][test_y.index]
    res[ric] = pd.DataFrame({ric: df[ric][test_y.index],
                             'pred': pred,
                             'strategy': strat})
res['AAPL.O'].head()
for ric in rics:
    res[ric][[ric, 'strategy']].cumsum().apply(np.exp).iplot()
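The bins built from the mean and the standard deviation in the code above define four categories instead of two. The sketch below shows how np.digitize assigns values to such bins; the values of mu and v and the input array are made up, and the array is assumed to hold raw returns for the purpose of illustration.

```python
import numpy as np

# Hypothetical mean and standard deviation of the log returns.
mu, v = 0.0, 0.01
bins = [mu - v, mu, mu + v]

# Hypothetical return values to be binned.
x = np.array([-0.03, -0.005, 0.0, 0.004, 0.02])

# 0: below mu - v, 1: [mu - v, mu), 2: [mu, mu + v), 3: mu + v and above
codes = np.digitize(x, bins=bins)
print(codes)
```

Compared with the single threshold at zero, this distinguishes small from large moves in either direction.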
Based on this tutorial, we can conclude that Plotly and Cufflinks make financial data visualization convenient.
Data Item Browser Application: Type DIB into Eikon Search Bar.