Review of the SEP dataset

published: March 9, 2018, 10:09 p.m.

I recently paid for access to the Sharadar Equity Prices (SEP) dataset via Quandl. After spending the previous two days dealing with the nastiness of the CRSP dataset, I thought integrating it into Zipline and writing a bundle loader was going to be easy: the dataset has only nine columns, and splits are already built in.

One problem is that many of the tickers have missing interstitial values. The third ticker, AAAGY, had 432 rows over 659 trading days. The Zipline loader will throw an error, so we have to handle this. How? One way is to simply toss the ticker - fuck it. Since I am interested only in large cap stocks, this is probably the best idea, but how many tickers would we toss? A lot: 1531 out of 8599 were missing some interstitial values. It should be noted that when I ingested only the top 1000 most liquid securities with just the last year's worth of data, only two tickers were missing values.
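As a rough sketch of how you could check for this (my own illustration, not code from the dataset or from my loader), compare each ticker's row count to the number of distinct trading dates between its first and last observation; sep below is a toy stand-in for the SEP price table.

import pandas as pd

# toy stand-in for the SEP table: AAAGY is missing 2018-01-03
sep = pd.DataFrame({
    "ticker": ["AAAGY", "AAAGY", "AAAGY", "AAPL", "AAPL", "AAPL"],
    "date": pd.to_datetime(["2018-01-02", "2018-01-04", "2018-01-05",
                            "2018-01-02", "2018-01-03", "2018-01-04"]),
    "close": [2.0, 2.1, 2.2, 172.3, 174.4, 173.0],
})
sessions = pd.Index(sorted(sep["date"].unique()))  # all trading dates in the file

def has_gaps(group):
    expected = sessions[(sessions >= group["date"].min()) &
                        (sessions <= group["date"].max())]
    return len(group) < len(expected)

flagged = sep.groupby("ticker").apply(has_gaps)
print(flagged.sum(), "of", len(flagged), "tickers have interstitial gaps")  # 1 of 2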

The second way is to fill in missing dates with values from previous dates. This is known as a forward fill and will not introduce any look-ahead bias. To do this you could create a reference DataFrame indexed by the full set of trading days, join the two DataFrames, and use DataFrame.fillna(method='ffill').
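Here is a minimal sketch of that forward fill (my own example, with a made-up calendar and price frame, not code from my bundle loader): reindex a per-ticker frame onto the full set of trading sessions and fill each gap with the last known row.

import pandas as pd

sessions = pd.bdate_range("2018-01-02", "2018-01-10")          # stand-in trading calendar
df = pd.DataFrame({"close": [10.0, 10.5, 11.0]},
                  index=pd.to_datetime(["2018-01-02", "2018-01-05", "2018-01-10"]))
filled = df.reindex(sessions).fillna(method="ffill")           # uses only past values, no look-ahead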

A third method would be to fill in the data with data from another dataset, but if you have another dataset why not just use that?

I use these datasets for multi-factor models trading hundreds of securities, so one missing ticker will not break what I am doing. The real test of a dataset is how multiple securities perform together. Later I will publish performance results from multi-factor models run on different datasets.

WRDS Access

published: March 7, 2018, 3:37 p.m.

Recently I was granted access to WRDS (Wharton Research Data Services). It is run by the Wharton School at the University of Pennsylvania and is a collection of many interesting economic datasets. I wanted access for the CRSP dataset, which is equity price data from the University of Chicago. Once I logged into WRDS it was like being a five-year-old kid handed $100 and sent into a candy store with the directive to buy anything you want. I do have some self control, so I gave myself 30 minutes to look around before getting back to the task at hand.

CRSP claims to be the most complete equity dataset in existence, and after messing with four equity datasets I am starting to believe it. They have data going back to 1925, and they assign a unique ID to every equity, so even if a stock changes its name the ID stays the same. I have had multiple problems with other datasets because they used the ticker as the primary key, and tickers change. CRSP also includes ETFs, REITs, and ADRs. To me ETFs are a big deal, because quants want ETF data for benchmarking and hedging, yet equity datasets often leave out ETFs because they are not true equities.

I have written a Zipline bundle loader for the CRSP data, which I have made public; it is available in my Alphacompiler GitHub repo.

Another useful dataset available on WRDS is the Compustat fundamental data. Compustat contains the usual balance sheet, income statement, and cash flow data that one would want from fundamentals. They provide three datasets: one with restated data, one with preliminary data, and a final one called point-in-time. The point-in-time data provides every data point, including restatements, along with the date on which each value became known. For backtesting that is huge. There is even a merged CRSP/Compustat dataset.

RavenPack's sentiment data is also available on WRDS. If you are not familiar with RavenPack, it is a company that provides news analytics on publicly traded companies, with millisecond-resolution data going back to 2001. I tried a sample of the data and it was huge. I sent a query for all the events for INTC in January. Even on the uneventful days of early January, when everyone is recovering from New Year's hangovers and holiday merriment, RavenPack detected 250 events for boring old INTC. Processing all of these events for all equities would be rough.

RavenPack's venture backer claims that 70% of the world's leading hedge funds use the dataset. That's interesting, because if everyone is using it, wouldn't the trades get crowded? I could go on at length about this subject, but I will save that for another post. Word on the street is that RavenPack's data costs $100k/year and is worth it.

There are a ton of other datasets I don't have time to get into.

Deep Learning's Deep Problem

published: Feb. 2, 2018, 9:12 p.m.

In 2010 I started writing a book called Machine Learning in Action; it went to print in May 2012 and has sold tens of thousands of copies in many countries. It was a great experience for me as the author. The book was based on what I had learned as well as the state of the art in 2010. One piece of feedback I got was: "maybe you should consider adding deep learning." I ultimately decided to leave the topic out, partly because there wasn't much code out there at the time, but mostly because I couldn't explain why these models worked.

At the most recent NIPS conference, Ali Rahimi received an award and used his time on stage to echo these sentiments. In his talk, Ali stresses the point I confronted personally many years ago: most practitioners can't explain why deep learning works.

Over the past few years I have seen numerous attempts to explain why these models work, but there is so much hype around deep learning that few stop to ask the question, and that leaves us worse off. When things are going right we rarely stop to ask questions; it's only when things go wrong that most people ask why. This is a common human pattern, and one of the biases that prevents us from making good decisions.

How to Use Fundamental Data With Zipline

published: Jan. 25, 2018, 6:04 p.m.

Basic Usage with SF1 Dataset

A few months back I wrote some code to access fundamental data in the Zipline Pipeline. What follows are instructions to get this data set up and running on your machine. The process may seem convoluted, but this was necessary to make access to the data fast - fast in the typical use case of a Zipline backtest.

Step 0. Make sure you can access Quandl and that you have a Quandl API key. I have set my Quandl API key as an environment variable:

>export QUANDL_API_KEY="thereoncewasamanfromnantuket"  
(that's not my real API key). If you are going to be using the SF1 data (paid), make sure you have registered for the data and can access it.

Step 1. Make sure your Zipline bundle is up to date.

>zipline ingest

Step 2. Clone or download the code from my alphacompiler repo. You also want to change the string called BASE inside alphacompiler/data/ to some folder on your machine (this line). You also need to fix the path self.data_path in the file (yes, I do need to fix this step). Finally, install the code using:

>python install 
from within the alphacompiler/ directory.

Step 3. Edit the script alphacompiler/data/ to include the fundamental fields you are interested in using. For example, if you want to use Return on Equity, enter ROE_ART. Here is a list of available fields; pay attention to the suffix, like _ART.

Step 4. Run the script alphacompiler/data/ (If all goes well, this will take some time as it makes many API calls to Quandl and saves the data.)


Step 5. Now you are ready to use the fundamental data within your Zipline algorithm. This is the easy part. All you have to do is add the import statement:

from import Fundamentals

After that statement has been added you can access your fundamentals with the exact same names you used in step 3. Here is a working example Zipline script; reading it will give you an idea of how to use this. The algorithm is simply a modified version of the basic Zipline Pipeline demo and is meant for demonstration purposes, not for real trading.
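As a hypothetical sketch of that last step (my own, so treat the exact attribute access as an assumption rather than the package's documented API), the idea is that the Fundamentals factor exposes each field you listed in step 3 by name:

from zipline.pipeline import Pipeline

def make_pipeline():
    fd = Fundamentals()            # imported as in the statement above
    roe = fd.ROE_ART               # assumed: one of the fields chosen in step 3
    return Pipeline(columns={"roe": roe})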

You can use this code for other fundamental datasets

This was written for use with the Quandl SF1 dataset, but it is by no means limited to this dataset. You could copy to another file and change this line to get your data from another source. You would then only need to copy to a new file and simply specify the location of your new .npy file.

How this all works

To understand why things are written this way, you need to understand fundamental data. The data comes from SEC 10-Q and 10-K filings. Every publicly traded company has to file a report with the SEC each quarter; the quarterly reports are called 10-Qs and the annual reports are called 10-Ks. So a single ticker gets new data from these reports four times a year. That's great.

How do we access this data every day? One option would be to keep a big table with every fundamental value for every ticker on every day. The problem with this approach is that the data only changes four times a year, so each value would be repeated roughly 60 times. Also, there are a lot of fundamental values (hundreds), so this slows things down quite a bit. Our machines may have enough RAM to store these values, but the smaller cache levels of the memory hierarchy are much faster. If we can reduce the amount of data moved in and out of those caches we can speed things up quite a bit, perhaps 1000x for some loops.

Let's improve on this idea: can't we just store the values when they change? Yes, but it is not that simple. Q1 ends on March 31st, Q2 on June 30th, Q3 on September 30th, and Q4 on December 31st. Can't we just keep an array of four values for each security, choose an index based on the day of the year, and use that index for all securities? We could write some Python for this like:

index = int(day_of_year/60)  # pick which of the stored values to use based on the day of the year

The main problem with this approach is that each company is allowed to use a different definition of Q1, Q2, etc., and can even change that definition. We can still use the idea above, but instead of computing one index for all securities we have to compute one for each security. This is the main idea behind the SparseDataFactor in the alphacompiler library. The ratchet update speeds things up further by computing the index only once per backtest and then checking, with a single comparison, whether the index needs to be updated (time only goes forward). This code can be used with any sparse data, not just fundamentals.
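To make the ratchet idea concrete, here is a hypothetical sketch (mine, not the library's actual code): each asset keeps a cursor into its own array of report dates, and the cursor only ever moves forward, one cheap comparison at a time.

import numpy as np

def ratchet_lookup(today, report_dates, values, cursor):
    """report_dates/values have shape (max_reports, n_assets); cursor holds
    the index of the latest report known as of the previous call."""
    n_reports, n_assets = report_dates.shape
    for k in range(n_assets):
        # advance while the next report for asset k was published on or before today
        while (cursor[k] + 1 < n_reports and
               report_dates[cursor[k] + 1, k] <= today):
            cursor[k] += 1
    return values[cursor, np.arange(n_assets)]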

Another area for speedup was ticker lookup. If you look at the way data is stored in Zipline, it is keyed by SID, an integer assigned to each security; for a given bundle this value is fixed. Data from another provider will probably be keyed by ticker. So how do you get that data in the middle of a backtest? You could look up each security's ticker, then use the ticker to look up the relevant data. This is slow because at each time step you are repeating the exact same two-step lookup you did in the previous step. A better way is to do this lookup once and store the external (fundamental) data in the exact same order that Zipline requests it. I call this process aligning the data. What happens if we don't have fundamental data for a ticker in our bundle? We can put a default value like NaN in those slots.
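A hypothetical sketch of that aligning step (mine, not the repo's actual code): pack ticker-keyed values into an array indexed by SID so the factor can do one vectorized lookup, with NaN wherever the external data has no entry for a ticker.

import numpy as np

roe_by_ticker = {"AAPL": 0.36, "MSFT": 0.29}           # external data keyed by ticker (made-up values)
sid_to_ticker = {0: "AAPL", 1: "MSFT", 2: "NODATA"}    # from the bundle's asset finder

aligned = np.full(max(sid_to_ticker) + 1, np.nan)
for sid, ticker in sid_to_ticker.items():
    aligned[sid] = roe_by_ticker.get(ticker, np.nan)   # NaN default for missing tickers

np.save("/tmp/aligned_roe.npy", aligned)               # loaded once, indexed by SID at run time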

Let me know if you have any questions.


Fast Cov and Corr

published: Jan. 4, 2018, 5:02 p.m.

Recently I came across Scott Sanderson's post about making the built-in Beta factor in Zipline much faster. That is great, as anyone who has used it knows it was prohibitively slow. I was never a heavy user of Beta, but I did use Correlation and Covariance a lot, and they were always the slowest operators; speeding them up had been on my to-do list for a long time. My previous solution was just horrendous. I don't know what I was thinking, other than that I wanted to get something working to prove my compiler was correct. In my defense, it was not easy to find a vectorized correlation operator, and I think that led to the dirty solution I had.

The discussion on the forum was great, with Burrito Dan proposing that, for the sake of speed, you could drop the demeaning step because the mean is very small compared to the variance; he showed that for most equities this results in about a 2% error. Good point.

Now if you look at the math, Beta is very similar to (Pearson) correlation; in fact the two are often used interchangeably in the common lexicon: corr(x,y) = cov(x,y)/(std(x)*std(y)), and Beta(y,x) = cov(x,y)/var(x). Covariance is needed for both, so on the way to a faster correlation you need to write a faster covariance.

I have posted those solutions here; they are also now part of the alphacompiler.util.zipline_data_tools module.

import numpy as np
from numpy import nanmean, nanstd


def fast_cov(m0, m1):
    """Improving the speed of cov()"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    allowed_missing_count = int(0.25 * N)

    independent = np.where(isnan(m0), nan, m1)                  # shape: (N, M)
    ind_residual = independent - nanmean(independent, axis=0)   # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)            # shape: (M,)

    # mark columns with too many missing observations as NaN
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    covariances[nanlocs] = nan
    return covariances

def fast_corr(m0, m1):
    """Improving the speed of correlation"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    independent = np.where(isnan(m0), nan, m1)        # shape: (N, M)
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)  # shape: (M,)

    # corr(x,y) = cov(x,y)/std(x)/std(y)
    std_v = nanstd(m0, axis=0)  # std(X)  could reuse ind_residual for possible speedup
    np.divide(covariances, std_v, out=out)
    std_v = nanstd(m1, axis=0)  # std(Y)
    np.divide(out, std_v, out=out)

    # handle NaNs
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    out[nanlocs] = nan
    return out
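A quick sanity check I like to run (my addition, not part of the original post): on data with no NaNs, fast_corr should match the column-wise Pearson correlation from np.corrcoef.

rng = np.random.RandomState(0)
m0 = rng.randn(252, 5)     # N days x M assets
m1 = rng.randn(252, 5)
reference = np.array([np.corrcoef(m0[:, j], m1[:, j])[0, 1] for j in range(5)])
print(np.allclose(fast_corr(m0, m1), reference))       # should print True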

These improvements, as well as a few others, really paid off, resulting in a 20x speedup of my compiled code. Enjoy.

Sector codes on Zipline

published: Nov. 6, 2017, 4:08 p.m.

When Quantopian pulled the plug on the free lunch that was hosted trading, some of us turned to Zipline. As I talked about in previous posts, I had already been using Zipline to do things I couldn't do on Quantopian's platform. One thing Quantopian did not open source was their fundamental data code. I spent some time writing what I believe is the greatest implementation of fundamental data from sparse data. The code you have to write is shorter than the Quantopian equivalent, and I believe it is much faster, but since their code is not open source I cannot measure that. I will get into it in a subsequent post, but as an intro I would like to show code for getting Pipeline data for a single value per security, with no time component. This is used to get sector codes for US equities. If you wanted to get sector codes on Quantopian you would need the following code:

from import morningstar

# while setting up your Pipeline: 
grouping = Grouping()
pipe.add(grouping, "grouping")

# finally a class to tie those two together
class Grouping(CustomFactor):
    sectors_in = morningstar.asset_classification.morningstar_sector_code.latest
    sectors_in.window_safe = True
    inputs = [sectors_in]
    window_length = 1

    def compute(self, today, assets, out, sectors):
        out[:] = sectors[-1]

Now let me show you how it is done with the version I have written:

from import NASDAQSectorCodes

# while setting up your Pipeline: 
grouping = NASDAQSectorCodes()
pipe.add(grouping, "grouping")

That's it. You can argue that I moved the class to another file, and that is fair, but take a look at that class:

class NASDAQSectorCodes(CustomFactor):
    """Returns a value for an SID stored in memory."""
    inputs = []
    window_length = 1

    def __init__(self, *args, **kwargs):
        # load the array of sector codes, indexed by SID = np.load("/path/to/the/file")

    def compute(self, today, assets, out):
        out[:] =[assets]

All it does is output the data. The reason is that the data is stored in an array organized in the same order that Zipline feeds assets to the algorithm: no HashMap lookups, no reading from files, no funny business. What goes on behind the scenes is that when the data file is built we already know all the assets that will be used by Zipline, and those assets have SIDs, so we can pack the corresponding sector code for each asset at its SID's index in an array. That makes outputting the extra values quick and easy. This idea is extended with a second dimension of sparse dates to get fundamental data; I will show that in a later post. These ideas can be used for any data, not just fundamentals and sector codes.
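For the curious, here is a hypothetical sketch of how such a sector-code file could be built (mine, not the repo's actual script; the bundle name and the sector_by_ticker mapping are placeholders): look up every asset in the bundle once, write its code at index SID, and save the array for the factor above to load.

import numpy as np
from zipline.data.bundles import load

bundle = load("quantopian-quandl")                     # placeholder bundle name
assets = bundle.asset_finder.retrieve_all(bundle.asset_finder.sids)

sector_by_ticker = {"AAPL": 3, "XOM": 9}               # placeholder codes
codes = np.full(max(bundle.asset_finder.sids) + 1, -1) # -1 for assets with no sector code
for asset in assets:
    codes[asset.sid] = sector_by_ticker.get(asset.symbol, -1)

np.save("/path/to/the/file", codes)                    # the path the factor loads in __init__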

I will probably abstract this further and make a generic Factor for any "aligned" data. What I'm calling aligned is data stored in the same order as the assets fed in from Zipline.

You can find the code here.

Why I Trade

published: Oct. 24, 2017, 11:40 a.m.

Why do we trade? Why do we spend thousands of hours staring at code and at black-and-white papers full of complicated math symbols? I've heard other quants describe beating the market as one of the hardest challenges there is, drawing on skills from computer science, finance, statistics, and psychology. It is a big challenge, and doing hard things can be a reward in itself.

It is also fun to prove people wrong. A doctor told me last year that I couldn't beat the market; then he misdiagnosed me. Luckily I didn't believe him on either count.

Trading securities on the open market as a business has a certain purity to it that you cannot get from other businesses. I compare it to an individual sport like running or swimming: it is just you and the clock. Sure, there are other competitors and environmental factors such as the weather, but these play a minor role in your performance compared to team sports. Compare a business trading securities with a business selling a good or service: the latter has to choose the right market, choose the good or service, and then execute and deliver it. That is a lot less pure than trading; it is not just you and the market.

Solving hard problems can be fun, but once you solve them they get old. I played video games as a teenager, and one day, while visiting my parents' house with my family, my son wanted to see those games. It was fun teaching him to play, and quite rewarding to see how quickly he picked it up. The next morning the console was still out, and I remembered how rewarding it was to beat a game. I thought I would see if I could beat it again, but as I started to play it became boring and monotonous, and seeing myself waste a few hours of my precious early-morning time, I put the game down.

All three of the reasons I listed are good, but I believe after a while they get old. Recently a number of hurricanes hit the southern part of North America and destroyed homes. My cousin has a tugboat in the Florida Keys; he got his boat to safety, but when the storm was over there were many people, perhaps hundreds, who did not. Their boats were blown into marshes or onto the beach and they needed help getting out. By the time I heard what he was doing, my cousin had pulled out over 20 people for free. He had run into one snag: boats with a lot of torque require a lot of gasoline, which costs money. I didn't hesitate to contact him and see if I could help out; I had had a good year trading.

It felt good to be in a position to direct capital to a cause I felt was worthy, but ultimately what felt good was helping other people out. I believe this is a lasting reason to trade.

Welcome to Alpha Compiler

published: Aug. 12, 2017, 2:17 p.m.

TL;DR: Code, strategies, and analysis for quantitative investing

Hi, I am Peter Harrington,

In the fall of 2016 I wrote a compiler to take arbitrary mathematical expressions and compile them into code that can be run on Quantopian's platform. A number of people were interested in using this compiler, so I threw together a primitive website to share the tool. In the process of testing the compiler I needed more control over the code than was available on Quantopian, so I started using the open source Zipline and wrote the tools I wanted. Recently I have seen others interested in solving, or asking how to solve, similar problems, but I felt there wasn't a place to post the material, so I threw together this blog to share code, strategies, and some peripheral material for quantitative investing.

Update: October 2017

Quantopian has shut down live trading support. A few other developers and I spent time getting zipline-live ready, and now you can trade with that software.

Thanks for reading

Hi, this is Peter Harrington's spot for discussing all things related to quantitative finance, mostly focusing on how to build your own system and strategy. I focus on long/short equity and futures, but am open to learning about other assets and strategies.