Financial Mathematics Text

Wednesday, November 27, 2013

Thoughts On Backtesting

So I have a few methodological thoughts on backtesting strategies. The lack of sound methodology in these studies is, in my opinion, a real problem. If anyone has some insight into this, I'd appreciate it.

I'm going to lay out what I consider three common problems in empirical research into backtesting investing strategies. Not all studies suffer from all three problems but I think a lot do. And I think attempting to address these is a good start.

1) The Problem of Induction


One key problem is related to the problem of induction (which I briefly touched on in that link). Just because something has occurred in the past doesn't mean it will continue to do so in the future. Or, to put it more generally: just because something applies to one data set does not mean it will apply to another.

Empirical studies often look at only one data set. In fact, in many cases the strategy being tested was optimized using the very same data set that is being used to test it.

But the relevant question is not whether or not it did well on that data set. The relevant question is whether or not it will apply to other data sets. For example, will it apply to future data sets? Because isn't that the goal? Aren't we really interested in whether or not this strategy will "work" in the future?

So what's the solution? In general, there's no perfect fix for the problem of induction; it isn't going away. But there are things we can do to improve our situation. For example, we can test the strategy on multiple data sets. This can be achieved by dividing up historical data into sections. We can also test strategies on different markets. Some of these techniques are already used.
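To make the "divide up the data" idea concrete, here's a minimal sketch in Python. The return series and the "strategy" function are just placeholders; the point is simply that whatever gets tuned on one slice of history is then evaluated on slices it never saw.

```python
import numpy as np

# Hypothetical daily return series (in practice, load real historical data).
rng = np.random.default_rng(0)
returns = rng.normal(0.0003, 0.01, size=5000)

# Split history into consecutive, non-overlapping segments.
n_segments = 5
segments = np.array_split(returns, n_segments)

# Placeholder "strategy": here just the buy-and-hold cumulative return of a segment.
def strategy_performance(segment):
    return np.prod(1 + segment) - 1

# Tune/optimize on the first segment only...
in_sample = segments[0]
print("In-sample performance:", strategy_performance(in_sample))

# ...then check whether the result holds up on segments the strategy never saw.
for i, out_of_sample in enumerate(segments[1:], start=1):
    print(f"Out-of-sample segment {i}:", strategy_performance(out_of_sample))
```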

2) Risk-adjusted Returns


One thing that is sometimes looked at and sometimes not is "risk-adjusted returns". The idea is that there is a relationship between risk and returns.

The idea is nice: you ought to earn more returns by taking on more risk. But there are problems with actually applying it. How do you measure risks (or different types of risk) and how does that translate into actual returns?

One initial approach was CAPM. CAPM just stands for capital asset pricing model, which is actually a pretty generic description, but what most people mean by CAPM is a very specific capital asset pricing model that involves a linear regression between asset returns and benchmark returns (typically the "market" or the S&P 500). The slope of the line, Beta, was treated as a measure of risk, while the intercept, Alpha, was treated as the "risk-adjusted" return.
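To make that concrete, here's a rough sketch of the regression in Python. The return series are simulated stand-ins for an asset and a benchmark, and I've left out the subtraction of a risk-free rate that the textbook version uses; Beta is just the slope of the fitted line and Alpha the intercept.

```python
import numpy as np

# Simulated monthly returns standing in for a benchmark ("the market") and an asset.
rng = np.random.default_rng(1)
market_returns = rng.normal(0.008, 0.04, size=120)
asset_returns = 0.002 + 1.2 * market_returns + rng.normal(0, 0.02, size=120)

# CAPM-style regression of asset returns on benchmark returns.
# Slope = Beta (the risk measure), intercept = Alpha ("risk-adjusted" return).
beta, alpha = np.polyfit(market_returns, asset_returns, 1)

print(f"Beta (slope):      {beta:.3f}")
print(f"Alpha (intercept): {alpha:.4f} per period")
```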

Since at least Fama and French (1992), CAPM has sort of been abandoned (the finance profession is slow I think). But it's not entirely clear what, if anything, should replace it.

A common approach (and one taken by Fama and French) is to make the following assumptions (these are necessary for their technique to be valid):
  1. Assume markets get the risk-return relationship correct. (People who deviate from the market's position are simply mistaken.)
  2. Assume that markets are informationally efficient. 
  3. Assume that the market correctly assesses the risk of all assets.
  4. Assume that the market correctly predicts future returns. 
If you make those assumptions, then you can look at two classes of assets, see that one earns, say, 5% and the other 10%, and conclude that the 10% asset must be "riskier" than the 5% asset.

If the above assumptions don't all hold... well, then the above conclusion wouldn't be warranted.

And that's the sad state of "risk-adjusted" returns.

My beef with all of this (which I began spelling out here) is that I'm not entirely convinced there exists a correct relationship between risk and return. People have differing preferences in how they view different assets. No doubt some may assess the situation incorrectly, but even when the situation is correctly assessed, there may be differences of opinion on how "risk" should translate into "returns".

I believe the idea that there is a risk-return tradeoff is predominantly a normative claim. And I don't believe there is any human-independent standard that settles what that tradeoff should be. As a result, we're left with our individual preferences.

Nonetheless, I do think the additional risk of a strategy should be factored in somehow. But I don't see how that will be resolved (I'm not convinced there is an "answer" to discover).

3) Statistical Significance


Maybe someone can point me in the right direction, but I have yet to see a study that attempted to determine whether the difference between the returns of two strategies was statistically significant. Who is to say that a strategy that allegedly outperforms didn't do so just by chance?

Now there are some problems with doing so. Typically, when you compare means (in this case the arithmetic mean, not the geometric mean, which is the more important one for investment returns), you use a Student's t-test. And that requires a normality assumption. But there aren't too many things that are normally distributed in finance. Stock market returns are definitely not.

And maybe that's why it's never been used (I doubt it). But I still think it's an important question to ask.

In addition, there is the Behrens-Fisher Problem. Given two populations that are normally distributed with unequal variances, what's the test statistic and what's the corresponding distribution?

So to take some initiative, I applied a Student's t-test comparing annual S&P 500 returns to 10Y Treasury returns (using Damodaran's data). I used Excel's t-test with unequal variances, which I believe is Welch's t-test.

The result was a p-value of 0.013 (1.3%). That's not too shabby.
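For anyone who wants to redo that kind of calculation outside of Excel, here's a rough sketch using SciPy's Welch's t-test (the unequal-variances version). The two arrays are placeholders; in practice you'd load the actual annual S&P 500 and 10Y Treasury series from Damodaran's data.

```python
import numpy as np
from scipy import stats

# Placeholder annual return series -- in practice, load the ~85 years of
# S&P 500 and 10Y Treasury returns from Damodaran's dataset.
rng = np.random.default_rng(2)
stock_returns = rng.normal(0.10, 0.20, size=85)
bond_returns = rng.normal(0.05, 0.08, size=85)

# Welch's t-test: compares the two means without assuming equal variances
# (this is what Excel's "t-test with unequal variances" does as well).
t_stat, p_value = stats.ttest_ind(stock_returns, bond_returns, equal_var=False)

print(f"t statistic: {t_stat:.3f}")
print(f"p-value:     {p_value:.4f}")
```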

But that was also for 85 years of data! What if we only looked at 20 years of data and compared two stock strategies (the S&P 500 versus some alternative)? Would we get statistical significance there?

Now I don't want to dwell on this too much because the non-normality of financial assets is a problem and I'm not sure if there is a solution. We'd have to justify using the t-test or find some alternative test that would be more appropriate. I'm not sure what solution to offer here.
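That said, one thing worth exploring as an alternative is a permutation test, which doesn't lean on normality at all: shuffle the returns between the two strategies many times and see how often a difference in means as large as the observed one shows up by chance. A minimal sketch, with placeholder data:

```python
import numpy as np

def permutation_p_value(returns_a, returns_b, n_permutations=10_000, seed=3):
    """Two-sided permutation test on the difference in mean returns.

    Makes no normality assumption: it asks how often a label-shuffled
    difference in means is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.mean(returns_a) - np.mean(returns_b))
    pooled = np.concatenate([returns_a, returns_b])
    n_a = len(returns_a)

    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(np.mean(pooled[:n_a]) - np.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_permutations

# Placeholder series standing in for two strategies' annual returns.
rng = np.random.default_rng(4)
strategy_a = rng.normal(0.10, 0.20, size=20)
strategy_b = rng.normal(0.08, 0.20, size=20)
print("Permutation p-value:", permutation_p_value(strategy_a, strategy_b))
```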

But it is possible that the outperformance of a particular strategy is merely due to chance, and something should be done to assess whether or not that's the case.

Concluding Remarks


I think these are some genuine problems with respect to backtesting of historical financial returns. I'm not sure if they can be entirely overcome but I do think they warrant further investigation and due consideration. Ultimately we want to know three things:

1) Will the strategy work in the future? (the problem of induction).

2) Am I really getting extra returns or am I just taking on more risk? (risk-adjusted returns).

3) Are the historical results valid or are they just due to chance? (statistical significance).

All three are important considerations for future investing decisions.




