Posted on March 7, 2018, 3:37 p.m.
Recently I was granted access to WRDS (Wharton Research Data Services). It is run by the Wharton School at the University of Pennsylvania and is a collection of many interesting economic datasets. I was getting access for the CRSP dataset, which is equity price data from the University of Chicago. Once I logged into WRDS, I was like a five-year-old kid being given $100 and sent into a candy store with the directive to buy anything you want. I do have some self-control, so I gave myself 30 minutes to look around before I got back to the task at hand.
The CRSP dataset claims to be the most complete equity dataset in existence. After messing with four equity datasets, I am starting to believe that claim. They have data going back to 1925. They also keep a unique ID for every equity, and even if a stock changes its name the unique ID stays the same. I have had multiple problems with other datasets because they used the ticker as the primary key, and tickers change. Oh, and CRSP also has ETFs, REITs, and ADRs in the dataset. To me ETFs are a big deal, because quants want to see the ETF data for benchmarking and hedging, but equity datasets often leave out ETFs because they are not true equities.
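To make the ticker problem concrete, here is a minimal sketch of what goes wrong when price rows are grouped by ticker instead of by a permanent ID. The rows, tickers, IDs, and the `group_prices` helper are all made up for illustration; this is not CRSP data or CRSP's actual schema.

```python
from collections import defaultdict

# Hypothetical rows: (date, permanent_id, ticker, close). All values are made up.
ROWS = [
    ("2017-01-03", 1001, "OLD", 10.0),
    ("2017-01-04", 1001, "OLD", 10.5),
    ("2017-01-05", 1001, "NEW", 10.7),  # same company, ticker renamed
    ("2017-01-05", 2002, "OLD", 55.0),  # a different company reuses "OLD"
]


def group_prices(rows, key_index):
    """Group (date, close) pairs by the column at key_index."""
    series = defaultdict(list)
    for row in rows:
        series[row[key_index]].append((row[0], row[3]))
    return dict(series)


by_ticker = group_prices(ROWS, key_index=2)  # keyed by ticker
by_id = group_prices(ROWS, key_index=1)      # keyed by permanent ID

# Keyed by ticker, "OLD" silently splices two different companies together,
# and the renamed company's history is split across "OLD" and "NEW".
print(by_ticker["OLD"])
# Keyed by permanent ID, each company keeps one continuous history.
print(by_id[1001])
```

In this toy example the ticker-keyed series for "OLD" jumps from 10.5 to 55.0 because it has picked up another company, which is exactly the kind of thing that quietly ruins a backtest.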
I have written a Zipline bundle loader for the CRSP data and made it public; it is available in my Alphacompiler GitHub repo.
Another useful dataset available on WRDS is the Compustat fundamental data. The Compustat data contains the usual balance sheet, income statement, and cash flow data that one would want in fundamental data. However, they provide three datasets: one with restated data, one with preliminary data, and one final dataset called point-in-time. The point-in-time data provides every data point across restatements, with a date next to each value showing when that value became known. For backtesting that is a huge point, because a backtest should only use values that were actually known on the simulated date. There is even a merged CRSP-Compustat dataset.
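As a rough sketch of why the "known on" date matters, the lookup below returns only the value a backtest could have seen on a given day. The record layout, the numbers, and the `value_as_of` helper are hypothetical and are not Compustat's actual field names or API.

```python
from datetime import date

# Hypothetical point-in-time records: (fiscal_period, known_date, value).
# The numbers are made up for illustration.
RECORDS = [
    ("2016Q4", date(2017, 1, 25), 100.0),  # preliminary figure
    ("2016Q4", date(2017, 2, 20), 98.0),   # final figure
    ("2016Q4", date(2017, 11, 1), 90.0),   # later restatement
]


def value_as_of(records, period, as_of):
    """Return the most recently known value for `period` on `as_of`, or None."""
    known = [
        (known_date, value)
        for rec_period, known_date, value in records
        if rec_period == period and known_date <= as_of
    ]
    if not known:
        return None  # nothing had been published yet
    # Pick the entry with the latest known_date that is not after as_of.
    return max(known, key=lambda pair: pair[0])[1]


# A backtest running on 2017-03-01 sees the final figure, not the restatement.
print(value_as_of(RECORDS, "2016Q4", date(2017, 3, 1)))
```

A restated-only dataset would hand the backtest the November value for every date, which amounts to look-ahead bias.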
RavenPack's sentiment data is also available on WRDS. If you are not familiar with RavenPack, it is a company that provides news analytics on publicly traded companies. They have updates every millisecond and provide millisecond data going back to 2001. I tried a sample of the data and it looked huge. I sent a query for all the events for INTC in January. Even on the uneventful days of early January when everyone is recovering from New Year's hangovers and holiday merriment, RavenPack detected 250 events for boring old INTC. It would be rough processing all these events for all equities.
RavenPack's venture backer claims that 70% of the world's leading hedge funds use the dataset. That's interesting, because if everyone is using it, wouldn't the trades get crowded? I could go on at length about this subject, but I will save that for another post. Word on the street is that RavenPack's data costs $100k/year and is worth it.
There are a ton of other datasets I don't have time to get into.