Deep Learning's Deep Problem

published: Feb. 2, 2018, 9:12 p.m.

In 2010 I started writing a book called Machine Learning in Action; it went to print in May 2012 and has sold tens of thousands of copies in many countries. It was a great experience for me as the author. The book was based on what I had learned as well as the state of the art in 2010. One piece of feedback I got was: "maybe you should consider adding deep learning." I ultimately decided to leave the topic out, partly because there wasn't much code available at the time, but mostly because I couldn't explain why these models worked.

At the most recent NIPS conference Ali Rahimi received an award and used his time on stage to echo these sentiments (Ali Rahimi's talk at NIPS). Ali stresses the point I confronted personally many years ago: most practitioners can't explain why deep learning works.

Over the past few years I have seen numerous attempts to explain why these models work, but there is so much hype around deep learning that few stop to ask the question, and that leaves us worse off. When things are going right we rarely stop to ask questions; it's only when things go wrong that most people ask why. This is a common pattern with humans, and one of the biases that prevent us from making good decisions.


How to Use Fundamental Data With Zipline

published: Jan. 25, 2018, 6:04 p.m.

Basic Usage with SF1 Dataset

A few months back I wrote some code to access fundamental data in the Zipline Pipeline. What follows are instructions to get this data set up and running on your machine. The process may seem convoluted, but this was necessary to make accessing the data fast, where by "accessing" I mean the typical use case of a Zipline backtest.

Step 0. Make sure you can access Quandl and that you have a Quandl API key. I have set my Quandl API key as an environment variable.

>export QUANDL_API_KEY="thereoncewasamanfromnantuket"  
(that's not my real API key). If you are going to be using the SF1 data (paid), make sure you have registered for the data and can access it.

Step 1. Make sure your Zipline bundle is up to date.

>zipline ingest

Step 2. Clone or download the code from my alphacompiler repo. You also need to change the string called BASE inside alphacompiler/data/load_quandl_sf1.py to some folder on your machine (this line), and fix the path self.data_path in the file sf1_fundamentals.py. (Yes, I do need to fix this step.) Finally, install the code using:

>python setup.py install 
from within the alphacompiler/ directory.

Step 3. Edit the script alphacompiler/data/load_quandl_sf1.py to include the fundamental fields you are interested in using. For example, if you want to use Return on Equity, enter ROE_ART. Here is a list of available fields; pay attention to the suffix, like _ART.

Step 4. Run the script alphacompiler/data/load_quandl_sf1.py. (Even if all goes well this will take some time, as it makes many API calls to Quandl and then saves the data.)

>python load_quandl_sf1.py

Step 5. Now you are ready to use the fundamental data within your Zipline algorithm. This is the easy part. All you have to do is add the import statement:

from alphacompiler.data.sf1_fundamentals import Fundamentals

After that statement has been added, you can access your fundamentals with the exact same names you used in Step 3. Here is a working example Zipline script; reading it will give you an idea of how to use this. The algorithm is simply a modified version of the Zipline base Pipeline demo, and it is meant for demonstration purposes, not for real trading.
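
For example, here is a minimal sketch of a pipeline using the field from Step 3 (ROE_ART is assumed to be one of the fields you loaded; the linked example script is the authoritative reference):

from zipline.pipeline import Pipeline
from alphacompiler.data.sf1_fundamentals import Fundamentals

def make_pipeline():
    fd = Fundamentals()                        # exposes the fields configured in Step 3
    return Pipeline(columns={"ROE": fd.ROE_ART})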

You can use this code for other fundamental datasets

This was written for use with the Quandl SF1 dataset, but it is by no means limited to that dataset. You could copy load_quandl_sf1.py to another file and change this line to get your data from another source. You would then only need to copy sf1_fundamentals.py to a new file and specify the location of your new .npy file.

How this all works

To understand why things are written this way you need to understand fundamental data. The data comes from SEC 10-Q and 10-K filings. Every publicly traded company has to file a report with the SEC every quarter; the quarterly reports are called 10-Qs and the yearly reports are called 10-Ks. So a single ticker has data from these reports four times a year. That's great.

How do we access this data every day? One option would be to keep a big table with every fundamental value, for every ticker, on every day. The problem with this approach is that the data only changes four times a year, so each value would be repeated roughly 60 times (a quarter is about 60 trading days). Also, there are a lot of fundamental values, hundreds of them, so this slows things down quite a bit. Our machines may have plenty of RAM to store all of this, but the levels of the memory hierarchy closer to the CPU are much faster. If we can keep the working set small enough to stay in those faster levels, we can speed things up quite a bit, perhaps 1000x for some loops.

Let's improve upon this idea: can't we just store the values when they change? Yes, but it is not that simple. Q1 ends on March 31st, Q2 ends on June 30th, Q3 on September 30th, and Q4 on December 31st. Can't we just keep an array of four values for each security, pick an index based on the day of the year, and use that index for all securities? We could write some Python for this like:

index = min(int(day_of_year / 91), 3)  # ~91 calendar days per quarter; clamp the last few days of the year

The main problem with this approach is that each company is allowed to have a different definition of Q1, Q2, etc., and they can even change that definition. We can still use the idea above, but instead of computing one index for all securities we have to compute one for each security. This is the main idea behind the SparseDataFactor in the alphacompiler library. The ratchet update speeds things up further by computing the index only once per backtest and then checking (using only a comparison operator) whether the index needs to be updated, since time only goes forward. This code can be used with any sparse data, not just fundamentals.
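
To illustrate the ratchet idea, here is a rough, hypothetical sketch (not the actual SparseDataFactor internals): the cached index only ever moves forward, so on most days a single comparison confirms it is still valid.

class RatchetCursor:
    """Hypothetical sketch: per-security forward-only index into sparse report data."""
    def __init__(self, report_dates, values):
        self.report_dates = report_dates   # sorted dates on which new reports take effect
        self.values = values               # fundamental value in effect from each date onward
        self.idx = 0                       # index of the report currently in effect

    def value_on(self, day):
        # ratchet forward only; on most days this loop body never runs
        while self.idx + 1 < len(self.report_dates) and day >= self.report_dates[self.idx + 1]:
            self.idx += 1
        return self.values[self.idx]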

Another area for speedup was ticker lookup. If you look at the way data is stored in Zipline, it is keyed by SID, an integer assigned to each security; for a given bundle this value is fixed. Now when you get data from another provider you will probably key the new data by ticker. So how do you get this data in the middle of a backtest? You could look up each security's ticker, then use that ticker to look up the relevant data. This process is slow because at each time step you are repeating the exact same two-step lookup you did in the previous step. A better way is to do this lookup one time and then store the external (fundamental) data in the exact same order that Zipline requests it. I call this process aligning the data. What happens if we don't have fundamental data for a ticker in our bundle? We can put in a default value like NaN.
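
Here is a rough sketch of that one-time alignment step (the helper and its arguments are hypothetical, just to show the idea): order the ticker-keyed data by SID, filling NaN where a ticker has no data.

import numpy as np

def align_to_sids(data_by_ticker, ticker_for_sid, num_sids):
    """Hypothetical helper: pack ticker-keyed values into an array indexed by SID."""
    aligned = np.full(num_sids, np.nan)                 # NaN default for missing tickers
    for sid, ticker in ticker_for_sid.items():
        if ticker in data_by_ticker:
            aligned[sid] = data_by_ticker[ticker]
    return aligned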

Let me know if you have any questions

Peter


Fast Cov and Corr

published: Jan. 4, 2018, 5:02 p.m.

Recently I came across Scott Sanderson's post about making the built-in Beta factor in Zipline much faster. That is great, as anyone who has used it before knows it was prohibitively slow. I was never a heavy user of Beta, but I did use Correlation and Covariance a lot and they were always the slowest operators; speeding them up had been on my long list for a while. My previous solution was just horrendous. I don't know what I was thinking, other than that I wanted to get something working to prove my compiler was correct. In my defense, it was not easy to find a vectorized correlation operator, and I think that led to the dirty solution I had.

The discussion on the forum was great, with Burrito Dan proposing that for the sake of speed you could drop demeaning, because the mean is very small compared to the variance; he showed that for most equities this would result in about 2% error. Good point.

Now if you look at the math of Beta, it is very similar to that of (Pearson) correlation; in fact the two are often used interchangeably in the common lexicon: corr(x,y) = cov(x,y) / (std(x) * std(y)), and Beta(y,x) = cov(x,y) / var(x). Covariance is needed for both, so on the way to a faster correlation you need to write a faster covariance.

I have posted those solutions here; they are also now part of the alphacompiler.util.zipline_data_tools module.


import numpy as np
from numpy import nanmean, nanstd


def fast_cov(m0, m1):
    """Improving the speed of cov(): NaN-aware covariance of each column of m1 with m0."""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    #out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    covariances[nanlocs] = nan
    return covariances

def fast_corr(m0, m1):
    """Improving the speed of correlation"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)  # shape: (M,)

    # corr(x,y) = cov(x,y)/std(x)/std(y)
    std_v = nanstd(m0, axis=0)  # std(X)  could reuse ind_residual for possible speedup
    np.divide(covariances, std_v, out=out)
    std_v = nanstd(m1, axis=0)  # std(Y)
    np.divide(out, std_v, out=out)

    # handle NaNs
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    out[nanlocs] = nan
    return out
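
To sanity-check the functions, here is a quick comparison of fast_corr against a plain per-column Pearson correlation on random data (assuming the functions above are in scope):

import numpy as np

N, M = 60, 5                      # 60 observations, 5 assets
x = np.random.randn(N, M)
y = np.random.randn(N, M)

fast = fast_corr(x, y)
slow = np.array([np.corrcoef(x[:, j], y[:, j])[0, 1] for j in range(M)])
print(np.allclose(fast, slow))    # with no NaNs present the two should agree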

These improvements, as well as a few others, really paid off, resulting in a 20x speedup of my compiled code. Enjoy.


Sector codes on Zipline

published: Nov. 6, 2017, 4:08 p.m.

When Quantopian pulled the plug on the free lunch that was hosted trading, some of us turned to Zipline. As I talked about in previous posts, I had already been using Zipline to do things I couldn't do on Quantopian's platform. One thing Quantopian did not open source was their fundamentals code. I spent some time writing what I believe is the greatest implementation of fundamentals from sparse data. The code you have to write is shorter than the Quantopian code, and I believe it is much faster, but since their code is not open source I cannot measure that. I will get into fundamentals in a subsequent post, but as an intro I would like to show code for getting Pipeline data for a single value, with no time component: sector codes for US equities. If you wanted to get sector codes on Quantopian you would need the following code:


from quantopian.pipeline import CustomFactor
from quantopian.pipeline.data import morningstar
.
.

# while setting up your Pipeline:
grouping = Grouping()
pipe.add(grouping, "grouping")
.
.

# finally, a CustomFactor to pass the sector code through
class Grouping(CustomFactor):
    sectors_in = morningstar.asset_classification.morningstar_sector_code.latest
    sectors_in.window_safe = True
    inputs = [sectors_in]
    window_length = 1

    def compute(self, today, assets, out, sectors):
        out[:] = sectors[-1]

Now let me show you how it is done with the version I have written:

from alphacompiler.data.NASDAQ import NASDAQSectorCodes
.
.

# while setting up your Pipeline: 
grouping = NASDAQSectorCodes()
pipe.add(grouping, "grouping")

That's it. You can argue that I moved the class to another file, and that is fair, but take a look at that class:

import numpy as np

class NASDAQSectorCodes(CustomFactor):
    """Returns a value for an SID stored in memory."""
    inputs = []
    window_length = 1

    def __init__(self, *args, **kwargs):
        # load the per-SID sector codes that were packed when the data file was built
        self.data = np.load("/path/to/the/file")

    def compute(self, today, assets, out):
        out[:] = self.data[assets]

All it does is output the data. The reason this works is that the data is stored in an array organized the same way that Zipline feeds assets to the algorithm. No HashMap lookups, no reading from files, no funny business. What goes on behind the scenes is that when the data file is built we already know all the assets that will be used by Zipline, and those assets have SIDs, so we can pack the corresponding sector code for each asset at that index in an array. That makes outputting the extra values quick and easy. This idea is extended with a second dimension of sparse dates to get fundamental data; I will show that in a later post. These ideas can be used for any data, not just fundamentals and sector codes.
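
For a picture of how such a file might get built, here is a rough, hypothetical sketch (the real building happens in the alphacompiler data scripts; the lookup arguments are placeholders): sector codes are packed into an array whose index is the SID, then saved so the factor above can np.load it.

import numpy as np

def pack_and_save(sector_for_ticker, ticker_for_sid, num_sids, out_path):
    """Hypothetical sketch: one sector code per SID, -1 where the sector is unknown."""
    codes = np.full(num_sids, -1, dtype=np.int64)
    for sid, ticker in ticker_for_sid.items():
        codes[sid] = sector_for_ticker.get(ticker, -1)
    np.save(out_path, codes)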

I will probably abstract this further and make a generic Factor for any "aligned" data. What I'm calling aligned is data that has the same order in the array as the assets fed from Zipline.

You can find the code here.


Why I Trade

published: Oct. 24, 2017, 11:40 a.m.

Why do we trade? Why do we spend thousands of hours looking at code and at black-and-white papers full of complicated math symbols? I've heard other quants talk about how beating the market is one of the hardest challenges, involving skills from computer science, finance, statistics, and psychology. It is a big challenge, and doing hard things can be a reward in itself.

It is also fun to prove people wrong. A doctor last year told me I couldn't beat the market; then he misdiagnosed me. Luckily I didn't believe him on either count.

Trading securities on the open market as a business has a certain purity to it that you cannot get from other businesses. I compare it to an individual sport like running or swimming: it is just you and the clock. Sure, there are other competitors, and environmental factors such as the weather, but these play a minor role in your performance compared to team sports. Compare a business trading securities with a business selling a good or service. The latter has to choose the right market, choose the good or service, and then execute and deliver it. That is a lot less pure than trading; it is not just you and the market.

Solving hard problems can be fun, but once you solve them they get old. I played video games as a teenager, and one day while visiting my parents' house with my family my son wanted to see those video games. It was fun teaching him to play, and quite rewarding to see how quickly he picked it up. The next morning the video game console was still out, and I remembered how rewarding it was to beat a game. I thought: I will see if I can beat this game again. However, as I started to play it became boring and monotonous to go through the game, and I saw myself wasting a few hours of my precious early-morning time, so I put the game down.

All three of the reasons I listed are good, but I believe after a while they get old. Recently a number of hurricanes hit the southern part of North America and destroyed homes. My cousin has a tug boat in the Florida Keys; he got his boat to safety, but when the storm was over there were many people, perhaps hundreds, who did not. Their boats were blown into marshes or onto the beach and they needed help getting out. My cousin had pulled out over 20 people for free by the time I heard what he was doing. He had run into one snag: boats with a lot of torque require a lot of gasoline, which costs money. I didn't hesitate to contact him and see if I could help out; I had had a good year trading.

It felt good to be in a position to direct capital to a cause I felt was worthy, but ultimately what felt good was helping other people out. I believe this is a lasting reason to trade.


Welcome to Alpha Compiler

published: Aug. 12, 2017, 2:17 p.m.

TL;DR: Code, strategies, and analysis for quantitative investing.

Hi, I am Peter Harrington.

In the fall of 2016 I wrote a compiler to take arbitrary mathematical expressions and compile them into code that can be run on Quantopian's platform. A number of people were interested in using this compiler, so I threw together a primitive website to share the tool. In the process of testing the compiler I needed more control over the code than was available on Quantopian, so I started to use the open source Zipline and wrote the tools I wanted. Recently I have seen others interested in, or asking how to solve, similar problems, but I felt there wasn't a place to post the material, so I threw together this blog to share code, strategies, and some peripheral material for quantitative investing.

Update: October 2017

Quantopian has shut down live trading support. A few other developers and I spent time getting Zipline-live ready, and now you can trade with that software.

Thanks for reading


Zipline CustomFilter

published: Aug. 11, 2017, 3:45 p.m.

In this post I present a simple CustomFilter I made to check whether an asset is in a given list on a given day. Why would someone want such a filter? Zipline comes with the StaticAssets filter, but that only checks against a single fixed list; I want the list to change by day. Why would I want to do this? Well, I moved my code off of Quantopian and into Zipline for performance and flexibility reasons. However, one of the things Quantopian offers is free data, and some code. "With freedom comes responsibility." So I wanted to evaluate the Quantopian data against other data sets. This may be straightforward with a single ticker, but a Pipeline algo changes tickers with each update. Here is the code for the CustomFilter:


from zipline.pipeline import CustomFilter

class StaticAssetsByDate(CustomFilter):
    """Mostly for debugging: checks whether an asset is in a dict (keyed by day) for a given day."""
    window_length = 1
    inputs = []

    def add_dict(self, asset_d):
        self.asset_d = asset_d

    def compute(self, today, assets, out):
        if today not in self.asset_d:
            raise Exception("holy shit, date not found in asset_d")

        out[:] = assets.isin(self.asset_d[today])
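
Here is a minimal sketch of wiring the filter into a pipeline (the shape of asset_d is assumed: a dict mapping each trading day's Timestamp to the assets allowed on that day):

from zipline.pipeline import Pipeline
from zipline.pipeline.data import USEquityPricing

asset_d = {}   # fill with pandas.Timestamp -> iterable of assets allowed on that day

my_assets = StaticAssetsByDate()
my_assets.add_dict(asset_d)

pipe = Pipeline(
    columns={"close": USEquityPricing.close.latest},
    screen=my_assets,
)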


