.. _data_analytics:

Raspberry Pi for Data Analytics
-------------------------------

This example is about data analytics with Python (cf. http://python.org) and
IPython (cf. http://ipython.org) on the RPi. As a first step, install the
Python pip installer::

    sudo apt-get install python-pip python-dev build-essential

Installing Data Analytics Libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any serious data analytics effort with Python generally involves, to some
extent, the **pandas** library (cf. http://pandas.pydata.org). To install it,
**upgrade the NumPy** library first (cf. http://scipy.org)::

    sudo pip install numpy --upgrade

This might take quite a while (1h+) since the library is pretty large and the
RPi is not that quick at compiling it. Then **install pandas**::

    sudo pip install pandas

This also takes some time (again 1h+). Also install the **matplotlib**
plotting library (with some updates/dependencies) as follows::

    sudo easy_install -U distribute
    sudo apt-get install libpng-dev libjpeg8-dev libfreetype6-dev
    sudo pip install matplotlib

Unsurprisingly, this also takes quite a while to install and compile.

We might want to install another useful library, namely **PyTables**
(cf. http://pytables.org) for efficient I/O with Python::

    sudo pip install numexpr
    sudo pip install cython
    sudo apt-get install libhdf5-serial-dev
    sudo pip install tables

All this taken together takes a few hours in total. However, your patience
will pay off: your RPi will be equipped with **state-of-the-art Python-based
data analytics libraries** that can then be used for a wide range of data
collection, crunching and storage tasks.

Finally, install the IPython interactive analytics environment::

    sudo pip install ipython

Interactive Data Analytics
~~~~~~~~~~~~~~~~~~~~~~~~~~

Now **start IPython** on the shell via::

    ipython

You should then see something like::

    pi@rpi /home/ftp $ ipython
    Python 2.7.3 (default, Mar 18 2014, 05:13:23)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.3.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]:

Now let's retrieve some **stock quotes for the Apple stock**:

.. ipython::

    In [1]: import pandas.io.data as web

    In [2]: aapl = web.DataReader('AAPL', data_source='yahoo')

    In [3]: aapl.tail()
    Out[3]:
                  Open    High     Low   Close    Volume  Adj Close
    Date
    2014-12-16  106.37  110.16  106.26  106.75  60790700     106.75
    2014-12-17  107.12  109.84  106.82  109.41  53411800     109.41
    2014-12-18  111.87  112.65  110.66  112.65  59006200     112.65
    2014-12-19  112.26  113.24  111.66  111.78  88429800     111.78
    2014-12-22  112.16  113.49  111.97  112.94  44976200     112.94

Next, let us calculate two different **moving averages** (42 days & 252 days):

.. ipython::

    In [4]: import pandas as pd

    In [5]: aapl['42d'] = pd.rolling_mean(aapl['Adj Close'], window=42)

    In [6]: aapl['252d'] = pd.rolling_mean(aapl['Adj Close'], window=252)

Finally, a plot of the adjusted closing values and the moving averages:

.. ipython::

    In [7]: import matplotlib.pyplot as plt

    In [8]: aapl[['Adj Close', '42d', '252d']].plot(title='Apple Inc.'); plt.savefig('source/aapl.png')

The **saved png plot** might then look like the one below.

.. image:: aapl.png

Via the shell (either directly or via ``ssh`` access) such figures cannot be
displayed. However, you could run a Web site on the RPi where the figure is
included and displayed via HTML (see :ref:`web_apps`). You could also send
such a graphical output/result to yourself or someone else, e.g. by email or
FTP transfer.
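The interactive steps above can also be bundled into a small standalone
script, e.g. to be run regularly via ``cron``. The following is a minimal
sketch that mirrors the IPython session; it uses the same (now dated) pandas
calls shown above and assumes the Yahoo data source is reachable. The output
file name is illustrative only::

    # sketch: retrieve quotes, compute moving averages, save a png plot
    # (mirrors the interactive session above; file name is illustrative)
    import pandas as pd
    import pandas.io.data as web
    import matplotlib
    matplotlib.use('Agg')  # render without a display (e.g. over ssh)
    import matplotlib.pyplot as plt

    # retrieve daily quotes for the Apple stock
    aapl = web.DataReader('AAPL', data_source='yahoo')

    # two moving averages: 42 days and 252 days
    aapl['42d'] = pd.rolling_mean(aapl['Adj Close'], window=42)
    aapl['252d'] = pd.rolling_mean(aapl['Adj Close'], window=252)

    # plot adjusted close plus the moving averages and save as png
    aapl[['Adj Close', '42d', '252d']].plot(title='Apple Inc.')
    plt.savefig('aapl.png')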
Fast I/O Operations
~~~~~~~~~~~~~~~~~~~

When using the RPi for data collection purposes, it might be beneficial to
have efficient I/O capabilities available. This is where the PyTables library
comes into play. The following Python script
(:download:`download link<./data_collection.py>`) collects stock data for a
number of symbols and stores the data on disk in HDF5 format
(cf. http://hdfgroup.org).

.. literalinclude:: data_collection.py

Running the script from the shell yields an output like this::

    pi@rpi ~ $ python data_collection.py
    Time needed to collect data in sec. 1.61
    Time needed to store data in sec.   1.40

The data gathered and stored by this Python script is not that large. The
following script (:download:`download link<./large_data_set.py>`) generates a
set of pseudo-random sample data which is **80 MB in size** and writes it to
disk.

.. literalinclude:: large_data_set.py

Running this script yields an output like the following::

    pi@rpi ~ $ python large_data_set.py
    Size of data set in bytes 80000000
    Time needed to generate data in sec. 10.24
    Time needed to store data in sec.     9.39

It takes less than 10 seconds to write 80 MB of data to the SD card (times
here might vary significantly depending on the card type used). You see that
you can even process **larger data sets** (although not "big data") with the
RPi.
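Since the downloadable scripts are not reproduced here, the following is a
minimal, self-contained sketch of such an HDF5 round trip with PyTables and
NumPy. The file name, dataset name and array size are illustrative
assumptions and are not taken from the scripts referenced above::

    # sketch: generate ~80 MB of sample data, write it to HDF5, read it back
    # (file/dataset names and size are illustrative assumptions)
    import time
    import numpy as np
    import tables

    N = 10000000  # 10 million 8-byte floats, i.e. 80 MB

    # generate pseudo-random sample data
    t0 = time.time()
    data = np.random.standard_normal(N)
    print('Time needed to generate data in sec. %5.2f' % (time.time() - t0))

    # write the array to an HDF5 file (e.g. on the SD card)
    t0 = time.time()
    h5 = tables.open_file('data.h5', mode='w')
    h5.create_array('/', 'random_data', data)
    h5.close()
    print('Time needed to store data in sec.    %5.2f' % (time.time() - t0))

    # read the array back for further processing
    h5 = tables.open_file('data.h5', mode='r')
    data_read = h5.root.random_data.read()
    h5.close()
    print('Size of data set read back in bytes %d' % data_read.nbytes)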