Fast Cov and Corr
Posted on Jan. 4, 2018, 5:02 p.m.
Recently I came across Scott Sanderson's post about making the built-in Beta factor in Zipline much faster. That is great news, as anyone who has used it knows it was prohibitively slow. I was never a heavy user of Beta, but I did use Correlation and Covariance a lot, and they were always the slowest operators; speeding them up had long been on my to-do list. My previous solution was just horrendous. I don't know what I was thinking, other than that I wanted something working to prove my compiler was correct. In my defense, it was not easy to find a vectorized correlation operator, and I think that is what led to the dirty solution I had.
The discussion on the forum was great, with Burrito Dan proposing that, for the sake of speed, you could drop the demeaning step, because the mean is very small compared to the variance; he showed that for most equities this results in about a 2% error. Good point.
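To see why that approximation holds, here is a quick sketch of the idea (the synthetic returns below are mine, chosen to have a mean much smaller than the variance, not data from the discussion):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily returns: mean ~0.05% per day, volatility ~1% per day.
x = rng.normal(loc=0.0005, scale=0.01, size=2000)
y = 0.5 * x + rng.normal(loc=0.0005, scale=0.01, size=2000)

exact = np.mean((x - x.mean()) * (y - y.mean()))  # demeaned covariance
approx = np.mean(x * y)                           # demeaning dropped
print(abs(approx - exact) / abs(exact))           # small relative error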
Now if you look at the math, Beta is very similar to (Pearson) correlation; in fact the two are often used interchangeably in the common lexicon: corr(x, y) = cov(x, y) / (std(x) * std(y)), and Beta(y, x) = cov(x, y) / var(x). Covariance is needed for both, so on the way to a faster correlation you first need to write a faster covariance.
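As a quick numpy sanity check of those two identities (with made-up data):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # true slope (Beta) of 2

cov_xy = np.cov(x, y, ddof=0)[0, 1]
corr = cov_xy / (x.std() * y.std())   # corr(x, y) = cov(x, y) / (std(x) * std(y))
beta = cov_xy / x.var()               # Beta(y, x) = cov(x, y) / var(x), close to 2 here

print(np.isclose(corr, np.corrcoef(x, y)[0, 1]))  # True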
I have posted those solutions here; they are also now part of the alphacompiler.util.zipline_data_tools module.
import numpy as np
from numpy import nanmean, nanstd

def fast_cov(m0, m1):
    """Improving the speed of cov()"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    allowed_missing_count = int(0.25 * N)

    # project NaNs in m0 onto m1 so both series share missing locations
    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    # demeaning one variable is enough: E[(x - E[x]) * y] = cov(x, y)
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    # columns with too many missing values get NaN
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    covariances[nanlocs] = nan
    return covariances
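Here is a quick usage sketch (the spot-check against numpy is my addition, not part of the module): m0 and m1 are (N, M) arrays, e.g. N days of returns for M assets, and fast_cov returns one covariance per column.

rng = np.random.default_rng(2)
m0 = rng.normal(size=(250, 500))            # e.g. 250 days x 500 assets
m1 = 0.5 * m0 + rng.normal(size=(250, 500))

covs = fast_cov(m0, m1)                     # shape: (500,)

# spot-check one column against numpy's population covariance
print(np.isclose(covs[0], np.cov(m0[:, 0], m1[:, 0], ddof=0)[0, 1]))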
def fast_corr(m0, m1):
    """Improving the speed of correlation"""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    # project NaNs in m0 onto m1 so both series share missing locations
    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    # corr(x, y) = cov(x, y) / (std(x) * std(y))
    std_v = nanstd(m0, axis=0)  # std(x); could reuse ind_residual for a possible speedup
    np.divide(covariances, std_v, out=out)
    std_v = nanstd(m1, axis=0)  # std(y)
    np.divide(out, std_v, out=out)

    # columns with too many missing values get NaN
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    out[nanlocs] = nan
    return out
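And the same kind of sketch for fast_corr, spot-checked against np.corrcoef (again my addition):

rng = np.random.default_rng(3)
m0 = rng.normal(size=(250, 500))
m1 = 0.5 * m0 + rng.normal(size=(250, 500))

corrs = fast_corr(m0, m1)                   # shape: (500,)

# spot-check one column against numpy's corrcoef
print(np.isclose(corrs[0], np.corrcoef(m0[:, 0], m1[:, 0])[0, 1]))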
These improvements, along with a few others, really paid off, resulting in a 20x speedup of my compiled code.
Enjoy.