Fast Cov and Corr

by Peter Harrington


Posted on Jan. 4, 2018, 5:02 p.m.


Recently I came across Scott Sanderson's post about making the built-in Beta factor in Zipline much faster. That is great news, as anyone who has used it knows it was prohibitively slow. I was never a heavy user of Beta, but I did use Correlation and Covariance a lot, and they were always the slowest operators; speeding them up had been on my long to-do list. My previous solution was just horrendous. I don't know what I was thinking, other than that I wanted to get something working to prove my compiler was correct. In my defense, it was not easy to find a vectorized correlation operator, and I think that is what led to the dirty solution I had.

The discussion on the forum was great, with Burrito Dan proposing that, for the sake of speed, you could drop the demeaning step because the mean is very small compared to the variance; he showed that for most equities this results in about a 2% error. Good point.
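To get a feel for that approximation, here is a quick sketch (my own illustration, not from the forum thread) comparing covariance computed with and without demeaning on simulated daily returns; the exact error depends on how large the means are relative to the volatility:

import numpy as np

# Simulated daily returns: means are tiny relative to the volatility,
# which is the typical situation for equities.
rng = np.random.default_rng(0)
x = rng.normal(0.0005, 0.02, 500)
y = 0.5 * x + rng.normal(0.0005, 0.02, 500)

full = np.mean((x - x.mean()) * (y - y.mean()))  # demeaned covariance
quick = np.mean(x * y)                           # demeaning dropped
print(abs(quick - full) / abs(full))             # relative error of the shortcut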

Now if you look at the math of Beta, it is very similar to that of (Pearson) correlation; in fact, the two are often used interchangeably in the common lexicon. corr(x,y) = cov(x,y)/std(x)/std(y), and Beta(y,x) = cov(x,y)/var(x). Covariance is needed for both of them, so on the way to a faster correlation you need to write a faster covariance.
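As a quick sanity check of those identities (my own snippet, using arbitrary simulated data), both reduce to a covariance plus a normalization:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)

cov = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov / x.std() / y.std()  # corr(x,y) = cov(x,y)/std(x)/std(y)
beta = cov / x.var()            # Beta(y,x) = cov(x,y)/var(x)

print(np.isclose(corr, np.corrcoef(x, y)[0, 1]))  # True
print(np.isclose(beta, np.polyfit(x, y, 1)[0]))   # True: Beta is the OLS slope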

I have posted those solutions here; they are also now part of the alphacompiler.util.zipline_data_tools module.


import numpy as np
from numpy import nanmean, nanstd

def fast_cov(m0, m1):
    """Improving the speed of cov(): vectorized covariance of each
    column of m0 with the matching column of m1."""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    # Columns missing more than 25% of their rows get a NaN result.
    allowed_missing_count = int(0.25 * N)

    # Mask the independent variable wherever the dependent one is NaN,
    # so both series are evaluated over the same rows.
    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    # Demeaning one side is enough, since E[(X - E[X]) * Y] = cov(X, Y).
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)           # shape: (M,)

    # Write NaN into columns with too much missing data.
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    covariances[nanlocs] = nan
    return covariances
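As a sanity check (my own snippet, not part of the module), fast_cov() matches a per-column population covariance on clean data:

import numpy as np

rng = np.random.default_rng(3)
m0 = rng.normal(size=(252, 100))            # e.g. a year of daily returns
m1 = 0.5 * m0 + rng.normal(size=(252, 100))

fast = fast_cov(m0, m1)
slow = np.array([np.cov(m0[:, j], m1[:, j], bias=True)[0, 1]
                 for j in range(100)])
print(np.allclose(fast, slow))  # True (no NaNs in this example)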

def fast_corr(m0, m1):
    """Improving the speed of correlation: vectorized Pearson correlation
    of each column of m0 with the matching column of m1."""
    nan = np.nan
    isnan = np.isnan
    N, M = m0.shape
    out = np.full(M, nan)
    allowed_missing_count = int(0.25 * N)

    # Same masking and covariance computation as fast_cov().
    independent = np.where(  # shape: (N, M)
        isnan(m0),
        nan,
        m1,
    )
    ind_residual = independent - nanmean(independent, axis=0)  # shape: (N, M)
    covariances = nanmean(ind_residual * m0, axis=0)  # shape: (M,)

    # corr(x,y) = cov(x,y)/std(x)/std(y)
    std_v = nanstd(m0, axis=0)  # std(X); could reuse ind_residual for a possible speedup
    np.divide(covariances, std_v, out=out)
    std_v = nanstd(m1, axis=0)  # std(Y)
    np.divide(out, std_v, out=out)

    # Write NaN into columns with too much missing data.
    nanlocs = isnan(independent).sum(axis=0) > allowed_missing_count
    out[nanlocs] = nan
    return out
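And a quick check of fast_corr() (again my own snippet, with arbitrary sizes), including the cutoff for columns with too much missing data:

import numpy as np

rng = np.random.default_rng(2)
N, M = 252, 4
m0 = rng.normal(size=(N, M))             # dependent variable
m1 = 0.7 * m0 + rng.normal(size=(N, M))  # independent variable
m0[:100, 0] = np.nan                     # knock out >25% of the first column

result = fast_corr(m0, m1)
print(result[0])  # nan: too much missing data in that column
print(np.isclose(result[1], np.corrcoef(m0[:, 1], m1[:, 1])[0, 1]))  # True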

These improvements, along with a few others, really paid off, resulting in a 20x speedup of my compiled code. Enjoy.

