How to Use Fundamental Data With Zipline
Posted on Jan. 25, 2018, 6:04 p.m.
Basic Usage with SF1 Dataset
I few months back I wrote some code to access the fundamental data in the Zipline Pipeline. What follows are instructions to get this data set up and running on your machine. The process may seem convoluted, this was necessary to make accesses to the data fast. By accesses I mean in the typical use case of a Zipline backtest.
Step 0. Make sure you can access Quandl, and you have a Quandl api key. I have set my Quandl api key as an environment variable.
(that's not my real api key). If you are going to be using the SF1 data (paid) make sure you have registered for the data and can access it.
Step 1. Make sure your Zipline bundle is up to date.
Step 2. Clone or download the code from my alphacompiler repo, you also want to change the string called BASE inside alphacompiler/data/load_quandl_sf1.py to some folder on your machine. this line. Also you need to fix the path self.data_path in the file sf1_fundamentals.py. (Yes I do need to fix this step.) AND finally install the code using:
from within the alphacompiler/ directory.
>python setup.py install
Step 3. Edit the script alphacompiler/data/load_quandl_sf1.py for to include the fundamental fields you are interested in using. For example, if you want to use Return on Equity enter ROE_ART. Here is a list of available fields pay attention to the suffix like _ART.
Step 4. Run the script alphacompiler/data/load_quandl_sf1.py. (If all goes well, this will take some time as it makes many API calls to Quandl and saves the data.)
Step 5. Now you are ready to use the fundamental data within your Zipline algorithm. This is the easy part. All you have to do is add the include statement:
from alphacompiler.data.sf1_fundamentals import Fundamentals
After that statement has been added you can access your fundamentals with the exact same names you used in step 3. Here is a working example Zipline script, reading that will give you an idea how to use this. This algorithm is simply a modified version of the Zipline base Pipeline demo, it is meant for demonstration purposes not for real trading.
You can use this code for other fundamental datasetsThis was written for usage with the Quandl SF1 dataset, but it is by no means limited to this dataset. You could copy load_quandl_sf1.py to another file and change this line to get your data from another source. You would then only need to copy sf1_fundamentals.py to a new file and simply specifiy the location of your new .npy file.
How this all works
To understand why things are written this way you need to understand fundamental data. The data comes from SEC 10-Q, and 10-K filings. Every publicly traded company has to file a report every quarter with the SEC, the quarterly reports are called 10-Qs and the yearly reports are called 10-Ks. So a single ticker has data from these reports four times a year. That's great.
How do we access this data every day? One option could be to keep a big table with every fundamental value every day for every ticker. The problem with this approach is that the data only changes four times a year, so you your data will be repeated 60 times. Also there are a lot of fundamental values, -hundreds so this slows things down quite a bit. Our machines may have a lot of RAM to store these values, but lower levels of the memory hierarchy are much faster. If we can reduce the amount of data moved in and out of the lower levels we can speed things up quite a bit, perhaps 1000x for some loops.
Let's improve upon this idea, can't we just store the values when they change? Yes, but it is not simple. Q1 ends on March 31st, Q2 ends on June 30th, Q3 September 30th, and Q4 December 31st. Can't we just keep an array of four values for each security and based on the day of the year choose an index, and use this for all securities? We could write some Python for this like:
index = int(day_of_year/60)
The main problem with this approach is that each company is allowed to have a different definition of Q1, Q2, etc, and they can even change this definition. We can use the above idea, but instead of computing an index for all securities we will have to compute one for each security. This is the main idea behind the SparseDataFactor in the alphacomplier library. The ratchet update speeds things up further by only computing the index one time per backtest, and then checking (using only a comparison operator) if the index needs to be updated. (Time only goes forward.) This code can be used with any sparse data not just fundamentals.
Another area for speedup was ticker lookup. If you look at the way data is stored in Zipline, it is stored by an SID which is an integer for each security. For a given bundle this value is fixed. Now when you get some data from another provider you will probably key the new data by ticker. So how do you get this data in the middle of a backtest? You could look up each security's ticker, then use that ticker to look up the relevant data. This process is slow because at each time step you doing the exact same two step lookup you did in the previous step. A better way is do this lookup one time, and then store the external (fundamental) data in the exact same order that Zipline requests. I call this process aligning the data. What happens if we don't have a fundamental data for a ticker in our bundle? We can put a default value like NaN for these.
Let me know if you have any questions