Review of AFML
published: May 17, 2018, 6:48 p.m.
Here is the review I have written for "Advances in Financial Machine Learning" by Marcos Lopez de Prado.
TLDR: the book is awesome, it really is on another level, and you will be stuck in the past if you don't ingest this book.
Title of review: "the book that's on every quant's desk right now"
If you are not in the target audience, I think you will find this book hard to digest. I have read some chapters twice and worked through the code samples, so I believe I offer a perspective that other readers may lack. Marcos has given a number of lectures titled "The 7 Reasons Most Machine Learning Funds Fail"; you can find the lecture slides online. The seven core ideas in that lecture are covered in chapters 2-8, with the other chapters offering supporting details or going further in depth. If you have limited time to process the book, you would be better served by taking a deep dive on chapters 2-8 rather than skimming the whole thing.
The ideas in this book work, and you would be doing yourself a disservice by not reading it. They range from the common-sense (backtesting is not a research tool; feature importance is) to the heretical ("for decades most financial research has been based on over-differentiated (memory-less) series, leading to spurious forecasts and overfitting."). [That quote is from his 7 Reasons presentation at Quantcon 2018, not the book.] He offers compelling arguments and solutions, backed by peer-reviewed publications, for all his points.
The book would be a highly valuable reference even without the code snippets, but he provides functional code and even tools to make it work on large datasets. Once again, this code is not for the faint of heart; his use of Pandas will send even a seasoned financial developer off to RTFM.
There are some flaws, which I can overlook. Strict software engineers will be irked at the code violating PEP8, but it is hard to fit code samples into a book, and things like multiple statements per line greatly compact the code while keeping it readable on the page. In chapter 20 he uses threads and processes interchangeably, although they are two distinct tools. Chapter 22 felt a little out of place, but it seems compulsory for financial authors to include a "just for fun" final chapter. There was a quick discussion at the end of chapter 14 on performance attribution that felt rushed, and I feel it would be hard for the non-financial portion of the target audience to follow. These are minor items. I found at least three errors in the code, which I hear have been corrected in the second printing.
The ideas in this book could arguably be extended to any asset class. If I had to guess, I would say they were most often applied to trading futures, although bonds, equities, and equity options are briefly mentioned.
Getting Started With Risk
published: April 6, 2018, 2:56 p.m.
The term risk can be confusing. If you sit and think about the word risk and what it means, you will come up with an answer along the lines of assigning a probability that a given event will happen. There are risks all around us, and consciously or not, we evaluate the risk of an event and take steps to mitigate it, or decide that the risk (probability) is too small to be worth mitigating. There is a risk we could die traveling to work or school, yet we decide that going to work or school is worth it. The same concept exists in money management, and it applies whether you are managing $10K of your own money or $10B for a pension fund. How do you not lose all your money? Diversify. ("Don't put all your eggs in one basket.") Yet when things are going well in one investment you will wish you had put all your money in it. Andrew Carnegie's famous quote: "The way to become rich is to put all your eggs in one basket and then watch that basket." So there exists a tradeoff, and balancing that tradeoff is the subject of much work.
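The diversification tradeoff can be made concrete with a little arithmetic: for uncorrelated assets of equal volatility, an equal-weight basket's volatility shrinks like the square root of the number of assets. A quick sketch (the numbers here are simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated daily returns for 10 uncorrelated assets, each with ~1% daily vol.
returns = rng.normal(0.0005, 0.01, size=(10_000, 10))

single_vol = returns[:, 0].std()        # volatility of one asset
port_vol = returns.mean(axis=1).std()   # volatility of the equal-weight basket

print(single_vol / port_vol)  # close to sqrt(10) ~ 3.16
```

Real assets are correlated, of course, which is exactly why the risk managers below care so much about co-movement.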
Risk Models I think this term is confusing to people new to finance and new to institutional-size money management. Here is an actual conversation I had with a friend (let's call him Mr. E) who worked as a quant at a few large hedge funds:
Peter "FundX had a shitty year and they fired the CIO."
Mr. E "why did they have a bad year?"
Peter "I heard that most of their algorithms were making money off of Short Term Reversion, and they weren't tracking it. Then Short Term Reversion stopped working and they had a drawdown."
Mr. E "I find that hard to believe, when that CIO and I worked at FundY, all the PMs had to report their risk exposures to Gordon Gekko (the founder of the fund with an enormous ego)."
So why did I bring up this conversation? To illustrate what happens in larger funds: there are PMs managing chunks of money, and there are risk managers whose job it is to make sure the fund does not have all of its eggs in one basket. The way they do this is to have all the PMs report their "risk exposures" in a standardized format. These PMs are making money, but if an event happens, or a way of making money reverses, what is it going to mean for the whole firm?
Risk management may seem like jumping through hoops for some worrywart asshole, but usually it's a good idea. The first long/short multifactor strategy I ever traded with my own money was a bit of a wild ride (I will save that story for another time), but after I applied some risk control it cut the drawdown in half with only marginal impact on the returns. That is the tradeoff you should strive for.
Traditional Risk Models I think this term is rather confusing, at least to someone not familiar with portfolio management. When people talk about risk models they are talking about using Markowitz mean-variance optimization to adjust return for risk. If you have n assets you want to hold in your portfolio, you will want to maximize μ'w - γw'Σw, or some similar version, perhaps with added penalties and constraints. Σ is the asset covariance matrix, which captures how the assets move together. The question is: "what are the risks?", so people spend a lot of time on this matrix. I will go into detail on this in another post, with some Python code showing you how to build your own. In Carol Alexander's book "Market Models" she discusses how a firm should keep a library of covariance matrices to stress-test portfolios against different events. The key insight here is that co-movement is not stationary: when markets are calm things remain uncorrelated, but when shit goes south everything becomes correlated. Her comment also gives us another insight as to why everyone uses a model introduced in 1952: it is a common framework and has become part of the language. It would be hard for people to understand what we were doing if we used some other model.
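To make the objective μ'w - γw'Σw concrete, here is a minimal sketch. With no constraints, setting the gradient to zero gives a closed-form maximizer w = Σ⁻¹μ/(2γ); the inputs below are made-up toy numbers, and a real setup would add constraints and use a convex solver:

```python
import numpy as np

# Toy inputs (made up): expected returns and covariance for three assets.
mu = np.array([0.08, 0.05, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.03, 0.01],
                  [0.00, 0.01, 0.02]])
gamma = 2.0  # risk-aversion parameter

# Unconstrained maximizer of mu'w - gamma * w'Sigma w:
# gradient mu - 2*gamma*Sigma*w = 0  =>  solve (2*gamma*Sigma) w = mu
w = np.linalg.solve(2 * gamma * Sigma, mu)

# In practice you would add constraints (long-only, leverage caps, sector
# exposures) and hand the problem to a convex solver instead of solving directly.
```

Note how everything about "risk" is packed into Σ: change the covariance matrix and the optimal weights change with it.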
Non-Traditional Risk Models As I mentioned in the last paragraph, everyone uses Markowitz mean-variance with an off-the-shelf covariance matrix or their own. They spend time building the matrix, then plug it into a convex solver. Could you maximize risk-adjusted return while meeting constraints another way? Absolutely. In the book "Inside the Black Box" Rishi Narang discusses portfolio construction techniques: "Some quants use machine learning techniques such as supervised learning, or genetic algorithms to help with the problem of optimization. The argument in favor of machine learning techniques in portfolio construction is that mean variance optimization is a form of data mining in that it involves searching many possible portfolios and attempting to find the ones that exhibited the best characteristics, as specified by the objective function of the optimizer. But the field of machine learning aims to do much the same thing, and it is a field that has received more rigorous scientific attention in a wide variety of disciplines than portfolio optimization, which is almost exclusively a financial topic. As such, there may be good arguments for considering machine learning approaches to finding the optimal portfolio, especially due to the quality of those algorithms relative to the mean variance optimization technique"
Hedge funds are the startups of the financial world. One of the reasons startups can innovate is that they are small companies and do not have the constraints of larger companies. Perhaps there is much to be gained from using non-traditional risk models?
Thanks for reading
published: March 21, 2018, 2:56 p.m.
I feel GLORIOUS. After reading books about quantitative trading for nine years I feel like I finally found the right books. Perhaps they were there all along and I learned through other methods.
Edit 2018/4/6 updated praise for QEPM.
Do a search on Amazon for books on quantitative finance and you will get many fine books. However, most of them discuss pricing derivatives. I first read Paul Wilmott's "Paul Wilmott Introduces Quantitative Finance." I read the whole book cover to cover on a two-week trip overseas. At 728 pages, I think I would have been better off reading the first few chapters and looking for another book, not because there is anything wrong with it, but because the focus is entirely on pricing derivatives. I feel the reason for this is that prior to 2008 "quant" (mostly) meant someone who built models to price derivatives. In the preface to "The Handbook of Equity Market Anomalies: Translating Market Inefficiencies into Effective Investment Strategies" Len Zacks explains: "In the aftermath of the global financial meltdown of 2008, the accuracy of the quant models of Collateralized Debt Obligations (CDOs) was called into question and many of the quants who created these models and worked for the major banks were downsized. At the same time, another type of quant model, the multifactor equity model, and its creators were thriving." Perhaps there are tons of people using exciting models to price derivatives; it's just not my cup of tea. I am interested in multifactor models, and currently focusing on equities.
Here is the start of my list; as with all posts here, it is a work in progress. Some books are targeted more at beginners and some are more academic. A final warning: I love books more than most people and have a garage full of them, but no book is a replacement for trying things on your own. The authors of these books (this applies to academic research as well) came to their conclusions under certain circumstances. Your circumstances may not be the same, so please find out for yourself what works and what doesn't. Books, blog posts, academic papers, etc. are good sources for ideas and starting points, but they shouldn't be taken as fact, especially in a field as dynamic as trading.
- "Inside the Black Box: A Simple Guide to Quantitative and High Frequency Trading" by Rishi K. Narang I wish I would have read this book a long time ago, it would have saved me from reinventing the wheel. Things like risk model and execution model I built on my own without knowing what to call them. I certainly got the concepts for the risk model from elsewhere, however plugging them into a live trading strategy I had to do on my own, and it would have been easier having read his book. Chapters 3, 5 and 6 are worth the price of this book alone. I would recommend this to everyone interested in quantitative finance.
- "Quantitative Trading: How to Build Your Own Algorithmic Trading Business" by Ernie Chan Somewhere between an academic and a practical guide, I really like this book. Dr. Ernie has a PhD and his style is very academic, so this is a bit of an academic introduces how to build your own system at home. I would also recommend his other two books, they are somewhat sequels to this book with new strategies and benefits to working on your own. The code is all in Matlab, but is not hard to port to other languages.
- "Active Portfolio Management: A Quantitative Approach for Producing Superior Returns and Controlling Risk" by Grinold and Kahn. This is a very academic book, laying foundations for CAPM. This book is considered a classic, and will be on your shelf next to Shakespeare.
- "Building Winning Algorithmic Trading Systems" by Kevin J. Davey This book is 100% for beginners wanting to do systematic trading. On the homebrew to academic spectrum this book lands very close to the pure homebrew end. Kevin walks you through his journey from trading his own money and looking everywhere for help, sometimes from charlatans. He also has a good introduction to Monte Carlo methods. I think some of his strategies appear to be placing all of one's capital in a single instrument. All of his examples are in Excel and can be downloaded. I would read this book for a good introduction and then move on to more elaborate systems.
- "Following the Trend: Diversified Managed Futures Trading" by Andreas Clenow I love this book, it is a logical extension to Kevin Davey's book, where Clenow is showing you how to do trend following in a diversified systematic way. He also shows that if you adjust for risk, the historical returns of most managed futures funds are nearly identical. There is as much valuable information in this book on managed futures as other topics. He does not hold anything back and even spells out his motives for writing a book like this (advertisement). Clenow makes an argument that long equities are a horrible investment if you look at the risk/reward characteristics.
- "Hedge Fund Market Wizards" by Jack Schwager This book ranks pretty low on the heavy math scale, in fact I'm not sure if there is even one equation in it. However it is very useful to hear interviews with many successful hedge fund managers. Of course I thought the quantitate managers were the most interesting, and you can get some good ideas from discretionary traders as well. The risk/reward discussion in the Appendix is also very helpful and will make you question the use of Sharpe's ratio over other ratios (Sortino).
- "Quantitative Equity Portfolio Management" by Qian, Hua, and Sorensen This is the most academic book on this list, ironically the authors all seem to work in industry rather than academia at the time of publishing. However I think any technical person involved with long/short equity should get this book. As I stated at the start of this post there are tons of books on derivative pricing but few on long/short equity. This is the goto reference for quantitative equity strategies. When I started moving to larger amounts of capital I ran into a lot of problems and didn't have a lot of places to turn, this book was there. Chapters 2, 3, 8 were particularly helpful with risk and transaction costs. After I started reading this book I noticed terminology from this book all around me. It is a little dated at 10 years, some of the chapter such as the ones on alpha models felt dated, but I read there is a 2nd edition in the works. There is a lot of math in this book, it actually it just linear algebra a a little bit of expectations. The linear algebra keeps things organized when you are dealing with 500 securities at once.
Review of the SEP dataset
published: March 9, 2018, 10:09 p.m.
Recently I got access to the Sharadar Equity Prices (SEP) dataset via Quandl (I paid for it; that's how I got access). After spending the previous two days dealing with the nastiness of the CRSP dataset, I thought integrating it was going to be easy: only nine columns in the dataset, and splits are already applied. Of course, by integrating I mean hooking the dataset up to Zipline by writing a Zipline bundle loader.
One problem is that many of the tickers have missing interstitial values. The third ticker, AAAGY, had 432 rows over 659 trading days. The Zipline loader will throw an error, so we have to handle this problem. How? One way is to simply toss the ticker - fuck it. As I am interested only in large-cap stocks, this is probably the best idea, but how many would we toss? A lot: 1531 out of 8599 tickers were missing some interstitial values. It should be noted that when I ingested only the top 1000 most liquid securities, with only the last year's worth of data, just two tickers were missing values.
The second way is to fill in missing dates with the values from previous dates. This is known as a forward fill, and it will not introduce any look-ahead bias. To do this you could build a reference DataFrame indexed by the full trading calendar, join the two DataFrames, and use DataFrame.fillna(method='ffill').
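A minimal sketch of the forward-fill approach with pandas; the dates and prices here are made up, and `bdate_range` stands in for a real trading calendar:

```python
import pandas as pd

# Hypothetical: daily closes for a ticker with missing interstitial sessions.
sessions = pd.bdate_range("2017-01-02", "2017-01-13")  # stand-in trading calendar
prices = pd.Series(
    [10.0, 10.2, 10.1],
    index=pd.to_datetime(["2017-01-02", "2017-01-04", "2017-01-11"]),
)

# Reindex onto the full calendar, then forward-fill. Only past values are
# propagated forward, so no look-ahead bias is introduced.
filled = prices.reindex(sessions).ffill()
```

Note that a ticker missing its very first sessions will still have leading NaNs after a forward fill, since there is nothing earlier to propagate.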
A third method would be to fill in the data with data from another dataset, but if you have another dataset why not just use that?
I use these datasets for multi-factor models trading hundreds of securities. One ticker missing will not break what I am doing. The real test of the dataset lies in how multiple securities perform together. Later I will publish some performance with different datasets using multi-factor models.
published: March 7, 2018, 3:37 p.m.
Recently I was granted access to WRDS (Wharton Research Data Services). This is run by the Wharton School at the University of Pennsylvania and is a collection of many interesting economic datasets. I was getting access for the CRSP dataset, which is equity price data from the University of Chicago. Once I logged into WRDS it was like being a five-year-old kid given $100 and sent into a candy store with the directive to buy anything you want. I do have some self-control, so I gave myself 30 minutes to look around before I got back to the task at hand.
The CRSP dataset claims to be the most complete equity dataset in existence. After messing with four equity datasets, I am starting to believe that claim. They have data going back to 1925; additionally, they keep a unique ID for every equity, and even if a stock changes its name the unique ID stays the same. I have had multiple problems with other datasets because they used the ticker as the primary key, and tickers change. CRSP also has ETFs, REITs, and ADRs in the dataset. To me the ETFs are a big deal, because quants want ETF data for benchmarking and hedging, but equity datasets often leave out ETFs because they are not true equities.
I have written a Zipline bundle loader for the CRSP data, which I have made public, it is available in my Alphacompiler Github repo.
Another useful dataset available on WRDS is the Compustat fundamental data. The Compustat data contains the usual balance sheet, income statement, and cash flow data that one would want in fundamental data. However, they provide three datasets: one with restated data, one with preliminary data, and a final dataset called point-in-time. The point-in-time data provides all data points for restatements along with, next to each value, the date when that value became known. For backtesting that is a huge deal. There is even a merged CRSP/Compustat dataset.
RavenPack's sentiment data is also available on WRDS. If you are not familiar with RavenPack, it is a company that provides news analytics on publicly traded companies. They have updates every millisecond and provide millisecond data going back to 2001. I tried a sample of the data and it was huge. I sent a query for all the events for INTC in January. Even in the uneventful days of early January, when everyone is recovering from New Year's hangovers and holiday merriment, RavenPack detected 250 events for boring old INTC. It would be rough processing all these events for all equities.
RavenPack's venture backer claims that 70% of the world's leading hedge funds use the dataset. That's interesting, because if everyone is using it, wouldn't the trades get crowded? I could go on at length about this subject, but I will save that for another post. Word on the street is that RavenPack's data costs $100k/year and is worth it.
There are a ton of other datasets I don't have time to get into.
Deep Learning's Deep Problem
published: Feb. 2, 2018, 9:12 p.m.
In 2010 I started writing a book called Machine Learning in Action; it went to print in May 2012 and has sold tens of thousands of copies in many countries. It was a great experience for me as the author. The book was based on what I had learned as well as the state of the art in 2010. One piece of feedback I got was: "maybe you should consider adding deep learning." I ultimately decided to leave the topic out, partly because there wasn't much code out there at the time, and mostly because I couldn't explain why deep networks worked.
This past summer Ali Rahimi received an award at the NIPS conference and used his time on stage to echo these sentiments. In his talk, Ali stresses the point I confronted personally many years ago: most practitioners can't explain why deep learning works.
Over the past few years I have seen numerous attempts to explain why deep networks work, but there is so much hype around deep learning that few stop to ask the question, and that leaves us worse off. When things are going right we seldom stop to ask questions; only when things go wrong do most people ask why. This is a common human pattern, and one of the biases that prevent us from making good decisions.
How to Use Fundamental Data With Zipline
published: Jan. 25, 2018, 6:04 p.m.
Basic Usage with SF1 Dataset
A few months back I wrote some code to access fundamental data in the Zipline Pipeline. What follows are instructions to get this data set up and running on your machine. The process may seem convoluted, but this was necessary to make accesses to the data fast, where by accesses I mean the typical use case of a Zipline backtest.
Step 0. Make sure you can access Quandl and that you have a Quandl API key. I have set my Quandl API key as an environment variable.
(That's not my real API key.) If you are going to be using the SF1 data (paid), make sure you have registered for the data and can access it.
Step 1. Make sure your Zipline bundle is up to date.
Step 2. Clone or download the code from my alphacompiler repo. You will also want to change the string called BASE inside alphacompiler/data/load_quandl_sf1.py to some folder on your machine. You also need to fix the path self.data_path in the file sf1_fundamentals.py. (Yes, I do need to fix this step.) Finally, install the code using:
from within the alphacompiler/ directory:
>python setup.py install
Step 3. Edit the script alphacompiler/data/load_quandl_sf1.py to include the fundamental fields you are interested in using. For example, if you want to use Return on Equity, enter ROE_ART. Here is a list of available fields; pay attention to the suffix, like _ART.
Step 4. Run the script alphacompiler/data/load_quandl_sf1.py. (If all goes well, this will take some time as it makes many API calls to Quandl and saves the data.)
Step 5. Now you are ready to use the fundamental data within your Zipline algorithm. This is the easy part. All you have to do is add the import statement:
from alphacompiler.data.sf1_fundamentals import Fundamentals
After that statement has been added, you can access your fundamentals with the exact same names you used in Step 3. Here is a working example Zipline script; reading it will give you an idea of how to use this. The algorithm is simply a modified version of the Zipline basic Pipeline demo, and it is meant for demonstration purposes, not for real trading.
You can use this code for other fundamental datasets. This was written for use with the Quandl SF1 dataset, but it is by no means limited to that dataset. You could copy load_quandl_sf1.py to another file and change this line to get your data from another source. You would then only need to copy sf1_fundamentals.py to a new file and simply specify the location of your new .npy file.
How this all works
To understand why things are written this way you need to understand fundamental data. The data comes from SEC 10-Q and 10-K filings. Every publicly traded company has to file a report with the SEC every quarter; the quarterly reports are called 10-Qs and the annual reports are called 10-Ks. So a single ticker gets new data from these reports four times a year. That's great.
How do we access this data every day? One option would be to keep a big table with every fundamental value for every ticker on every day. The problem with this approach is that the data only changes four times a year, so your data would be repeated roughly 60 times. Also there are a lot of fundamental values (hundreds), so this slows things down quite a bit. Our machines may have a lot of RAM to store these values, but the lower levels of the memory hierarchy are much faster. If we can reduce the amount of data moved in and out of the lower levels, we can speed things up quite a bit, perhaps 1000x for some loops.
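Some rough arithmetic shows the blow-up; the ticker and field counts below are round numbers for illustration, not the dataset's exact dimensions:

```python
# Daily table: every field, every ticker, every trading day of a year.
tickers, fields, trading_days = 8000, 200, 252
daily_cells = tickers * fields * trading_days   # values stored per year

# Sparse storage: values change only when a new filing arrives (~4x/year).
sparse_cells = tickers * fields * 4

print(daily_cells // sparse_cells)  # 63x less data to move around
```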
Let's improve upon this idea: can't we just store the values when they change? Yes, but it is not simple. Q1 ends on March 31st, Q2 on June 30th, Q3 on September 30th, and Q4 on December 31st. Can't we just keep an array of four values for each security, choose an index based on the day of the year, and use that index for all securities? We could write some Python for this like:
index = min(day_of_year // 91, 3)  # four ~91-day quarters, clamped to the last slot
The main problem with this approach is that each company is allowed to have a different definition of Q1, Q2, etc., and they can even change that definition. We can still use the above idea, but instead of computing one index for all securities we have to compute one for each security. This is the main idea behind the SparseDataFactor in the alphacompiler library. The ratchet update speeds things up further by only computing the index once per backtest, and then checking (using only a comparison operator) whether the index needs to be updated. (Time only goes forward.) This code can be used with any sparse data, not just fundamentals.
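The ratchet idea can be sketched as follows. This is not the actual SparseDataFactor code, and all the names are mine: store each security's report dates and values, keep a cursor per security, and only bump the cursor forward when the backtest clock passes the next report date.

```python
import numpy as np

class SparseLookup:
    """Per-security, forward-only cursor into sparsely updated data (sketch)."""

    def __init__(self, report_dates, values):
        # report_dates[i] and values[i]: arrays for security i, sorted by date
        self.report_dates = report_dates
        self.values = values
        self.cursor = np.zeros(len(values), dtype=int)

    def get(self, i, today):
        # Ratchet: advance only while the next report is already known.
        # A single comparison per step; the cursor never moves backward.
        dates = self.report_dates[i]
        while self.cursor[i] + 1 < len(dates) and dates[self.cursor[i] + 1] <= today:
            self.cursor[i] += 1  # time only goes forward
        return self.values[i][self.cursor[i]]
```

In the common case the `while` condition fails immediately, so each daily access costs one comparison and one array read rather than a date search.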
Another area for speedup was ticker lookup. If you look at the way data is stored in Zipline, it is keyed by SID, an integer assigned to each security. For a given bundle this value is fixed. Now, when you get data from another provider, that data will probably be keyed by ticker. So how do you get at it in the middle of a backtest? You could look up each security's ticker, then use that ticker to look up the relevant data. This process is slow because at each time step you are doing the exact same two-step lookup you did in the previous step. A better way is to do the lookup once and store the external (fundamental) data in the exact same order that Zipline requests it. I call this process aligning the data. What happens if we don't have fundamental data for a ticker in our bundle? We can put in a default value like NaN.
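Aligning can be sketched as a one-time reorder; the names and values below are hypothetical, not the actual alphacompiler internals:

```python
import numpy as np

# Zipline asks for data in SID order; the vendor keys rows by ticker.
sid_to_ticker = {0: "AAPL", 1: "MSFT", 2: "NOFUND"}  # bundle's fixed ordering
vendor = {"MSFT": 1.9, "AAPL": 2.4}                  # fundamental value per ticker

# Do the two-step lookup ONCE, up front, with NaN for tickers the vendor lacks...
aligned = np.array([vendor.get(sid_to_ticker[sid], np.nan)
                    for sid in sorted(sid_to_ticker)])

# ...then every step of the backtest is a plain array index: aligned[sid].
```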
Let me know if you have any questions.
Fast Cov and Corr
published: Jan. 4, 2018, 5:02 p.m.
Recently I came across Scott Sanderson's post about making the built-in Beta factor in Zipline much faster. That is great, as anyone who has used it knows it was prohibitively slow. I was never a heavy user of Beta, but I did use Correlation and Covariance a lot, and they were always the slowest operators; speeding them up had long been on my list. My previous solution was just horrendous. I don't know what I was thinking, other than that I wanted to get something working to prove my compiler was correct. In my defense, it was not easy to find a vectorized correlation operator, and I think that led to the dirty solution I had.
The discussion on the forum was great, with Burrito Dan proposing that for the sake of speed you could drop the demeaning, because the mean is very small compared to the variance; he showed that for most equities this would result in about a 2% error. Good point.
Now if you look at the math of Beta, it is very similar to that of (Pearson) correlation; in fact the two are often used interchangeably in the common lexicon. corr(x,y) = cov(x,y)/(std(x)*std(y)), and Beta(y,x) = cov(x,y)/var(x). The covariance is needed for both, so on the way to a faster correlation you need to write a faster covariance.
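You can sanity-check the relationship between the three quantities numerically; the data here is simulated noise, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)  # true slope (beta) of 0.5

cov = np.cov(x, y)[0, 1]
corr = cov / (x.std(ddof=1) * y.std(ddof=1))  # corr = cov/(std(x)*std(y))
beta = cov / x.var(ddof=1)                    # beta = cov/var(x)

# corr should match np.corrcoef; beta is the regression slope of y on x.
```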
I have posted those solutions here; they are now also part of the alphacompiler.util.zipline_data_tools module.
These improvements as well as a few others really paid off resulting in a 20x speedup of my compiled code. Enjoy.
import numpy as np
from numpy import nanmean, nanstd


def fast_cov(m0, m1):
    """Improving the speed of cov()"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    allowed_missing_count = int(0.25 * N)

    independent = np.where(isnan(m0), nan, m1)                 # shape: (N, M)
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    covariances[nanlocs] = nan
    return covariances


def fast_corr(m0, m1):
    """Improving the speed of correlation"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    independent = np.where(isnan(m0), nan, m1)                 # shape: (N, M)
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    # corr(x,y) = cov(x,y)/std(x)/std(y)
    std_v = nanstd(m0, axis=0)  # std(X); could reuse ind_residual for a possible speedup
    np.divide(covariances, std_v, out=out)
    std_v = nanstd(m1, axis=0)  # std(Y)
    np.divide(out, std_v, out=out)

    # handle NaNs
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    out[nanlocs] = nan
    return out