Review of the SEP dataset
Posted on March 9, 2018, 10:09 p.m.
I recently paid for access to the Sharadar Equity Prices dataset, SEP for short, via Quandl. After spending the previous two days dealing with the nastiness of the CRSP dataset, I thought integrating it would be easy: only nine columns in the dataset, and splits are already built in. By integrating, I mean hooking the dataset up to Zipline by writing a Zipline bundle loader.
One problem is that many of the tickers are missing interstitial values. The third ticker, AAAGY, had only 432 rows over 659 trading days. The Zipline loader will throw an error, so we have to handle this. How? One way is to simply toss the ticker - fuck it. As I am interested only in large cap stocks, this is probably the best idea. How many would we toss? A lot: 1,531 out of 8,599 tickers were missing some interstitial values. It should be noted that when I ingested only the top 1000 most liquid securities, with only the last year's worth of data, just two tickers were missing values.
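For the curious, here is a rough sketch of how you might count which tickers have gaps. The `prices` DataFrame, its `ticker` and `date` column names, and the use of plain business days instead of a real exchange calendar are all my assumptions, not part of the SEP schema:

```python
import pandas as pd

def tickers_with_gaps(prices: pd.DataFrame) -> list:
    """Return tickers with fewer rows than expected trading days.

    Assumes a long-form DataFrame with 'ticker' and 'date' columns.
    """
    gapped = []
    for ticker, grp in prices.groupby("ticker"):
        dates = pd.DatetimeIndex(grp["date"]).sort_values()
        # All business days between this ticker's first and last date.
        # A real check would use the exchange calendar Zipline uses,
        # which also accounts for market holidays.
        expected = pd.bdate_range(dates[0], dates[-1])
        if len(dates) < len(expected):
            gapped.append(ticker)
    return gapped
```

Run against the full SEP file, something like this is what flagged 1,531 of 8,599 tickers for me.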
The second way is to fill in missing dates with the values from the most recent prior date. This is known as a forward fill and will not introduce any look-ahead bias. To do this you could build a reference DataFrame indexed by every trading day, join the two DataFrames, and use DataFrame.fillna(method='ffill').
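A minimal sketch of that approach for a single ticker, using `.ffill()` (the modern equivalent of `fillna(method='ffill')`). The function name and the use of `bdate_range` as a stand-in for the real exchange calendar are my own assumptions:

```python
import pandas as pd

def ffill_to_calendar(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Reindex one ticker's date-indexed prices onto a complete set of
    business days, carrying the last known row forward into any gaps.
    """
    ohlcv = ohlcv.sort_index()
    full_index = pd.bdate_range(ohlcv.index[0], ohlcv.index[-1])
    # Missing days become NaN rows, then each column takes its previous
    # value. Only past data is used, so no look-ahead bias is introduced.
    return ohlcv.reindex(full_index).ffill()
```

Note that forward-filling OHLCV data does fabricate bars (volume in particular should arguably be filled with zero rather than carried forward), so it is a pragmatic fix, not a perfect one.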
A third method would be to fill in the gaps with data from another dataset, but if you have another dataset, why not just use that?
I use these datasets for multi-factor models trading hundreds of securities. A single missing ticker will not break what I am doing. The real test of a dataset is how multiple securities perform together. Later I will publish performance results from multi-factor models run against different datasets.