
Financial Backtesting: A Cautionary Tale - kawera
http://www.philosophicaleconomics.com/2015/12/backtesting/
======
chollida1
The biggest problem with back-testing in finance that I've come across is that
markets change.

Nowadays markets change every 5-10 years (this is my own personal belief),
which means you can't do many meaningful comparisons over longer time
periods. Couple that with the fact that most bull markets last 5-10 years,
and you can't really tell how your active portfolio will perform.

Or put another way, strategies that work on a macro level tend not to work
very well. Anything of the flavor of "if the market does X for the past 5
days, then buy Y and hold until Z happens" tends not to be reproducible or
testable in a meaningful way.

What does work, again from personal experience, is arbitrage. Arbitrage is
everywhere in the financial markets.

\- 5 year bonds that are 2 years old look just like 3 year bonds that were
just issued, assuming all terms are the same.

\- The A and B shares of a single company tend to follow a constant pattern

\- 2 different ETFs that track the same index (say, two S&P 500 trackers like
SPY) tend to move together.

\- exchanges that cross at the midpoint (most dark pools) and price via the
SIP tend to offer half-penny latency arbitrage opportunities.

\- when two companies merge, the price of the target will converge on the deal
price as the closing approaches.

\- low-latency arb where a stock's price follows some commodity's price: banks
vs currency rates, oil producers vs WTI, etc.

These are all examples of things that you can meaningfully back-test and
model.
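
A minimal sketch of what back-testing one of these relationships could look like, using purely synthetic prices for two hypothetical ETFs tracking the same index (every ticker-free series, number, and threshold below is invented for illustration; real tracking error, fees, and latency would dominate):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: two hypothetical ETFs tracking the same index.
    n_days = 2000
    index = 100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, n_days)))
    etf_a = index * (1 + rng.normal(0, 0.0005, n_days))   # small tracking noise
    etf_b = index * (1 + rng.normal(0, 0.0005, n_days))

    # Spread between the two ETFs, normalized by a rolling estimate of its scale.
    spread = np.log(etf_a) - np.log(etf_b)
    window = 60
    z = np.full(n_days, np.nan)
    for t in range(window, n_days):
        hist = spread[t - window:t]
        z[t] = (spread[t] - hist.mean()) / hist.std()

    # Toy rule: when the spread is stretched, short the rich ETF, buy the cheap
    # one, and collect the convergence of the spread over the next day.
    position = np.where(z > 2, -1, np.where(z < -2, 1, 0))  # position in the spread
    pnl = position[:-1] * np.diff(spread)                    # next-day spread change
    print("toy annualized Sharpe:", pnl[window:].mean() / pnl[window:].std() * np.sqrt(252))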

Jim Simons of Renaissance Technologies once remarked that there is no signal
in the market data at a macro level; I tend to agree with him and the author.

~~~
p4wnc6
The way I understand this is that a lot of it comes down to articulating
useful prior distributions over some of the outcomes to be predicted.

The kind of backtesting that this post explores, which is basically the bread
and butter of "quantamental" asset managers (lolz CRSP Fama/French data), is
almost always performed in a frequentist setup where you plan to regress
outcomes (portfolio returns) on a simple set of "easy to understand"
predictors, like price momentum over some trailing time period.

You might have to do a few cross-sectional or time-series adjustments, like
the Fama/MacBeth regression and some poor man's time series smoothing of
coefficients, but it's basically just a really large, really crappy OLS model,
where the plan is to just torture the vendor data sets until you get out
whatever kind of statistical significance you need for a marketing whitepaper
and/or pitchbook entry to shop around to unsophisticated board members in
client organizations.
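
For concreteness, here is a minimal sketch of the kind of regression being described -- next-period returns regressed on a trailing momentum predictor -- run on synthetic data, assuming statsmodels is available (the variables and effect size are invented; this is not the actual CRSP/Fama-French workflow):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)

    # Synthetic cross-section: 500 stocks, trailing momentum vs next-month return.
    # In the real workflow this would be vendor data (CRSP, Compustat, etc.).
    n_stocks = 500
    momentum = rng.normal(0, 1, n_stocks)            # trailing return, standardized
    noise = rng.normal(0, 0.08, n_stocks)
    next_return = 0.002 * momentum + noise           # tiny "effect" buried in noise

    # The "really large, really crappy OLS model": regress outcomes on the
    # predictor and read off the t-statistic.
    model = sm.OLS(next_return, sm.add_constant(momentum)).fit()
    print(model.params)    # [intercept, momentum coefficient]
    print(model.tvalues)   # the t-stat everyone stops at

The Fama/MacBeth variant just repeats this cross-sectional regression each period and averages the coefficients over time.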

This is the whole "which 1000 tests did I do before I showed you the results"
thing from this article, and it just amounts to simple p-hacking (or t-stat
hacking) which is a well-known Bad Thing To Do.
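
A tiny simulation of the "which 1000 tests did I do" problem: even when none of the candidate signals has any real relationship to returns, a naive |t| > 2 cutoff will flag dozens of them (everything here is synthetic):

    import numpy as np

    rng = np.random.default_rng(2)

    # 1000 candidate "signals", none of which actually predicts the returns.
    n_obs, n_signals = 250, 1000
    returns = rng.normal(0, 0.01, n_obs)
    signals = rng.normal(0, 1, (n_signals, n_obs))

    # t-statistic of the correlation between each useless signal and the returns.
    r = np.array([np.corrcoef(s, returns)[0, 1] for s in signals])
    t = r * np.sqrt((n_obs - 2) / (1 - r**2))

    print("signals with |t| > 2:", np.sum(np.abs(t) > 2))   # dozens of false 'discoveries'
    # Bonferroni-correcting 1000 tests at the 5% level asks for roughly |t| > 4 instead.
    print("signals with |t| > 4:", np.sum(np.abs(t) > 4))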

Clients are already very sensitive to things like transaction costs, so the
clients probably would have demanded that the analysis be carried out net of
transaction costs in the first place. But transaction costs are easy for a
paying client to get their head around... for non-quantitative client board
members, getting their head around nuances of p-values and regression models
is actually quite hard, even if these are statistics 101 for the finance
professionals working with them.

But the amazing thing is that while everyone's busy worrying about p-hacking
(and _never_ Bonferroni-correcting reported results, if they even know how
many researcher degrees of freedom [1] are involved in the first place), no
one is talking about the fact that all of this is in the service of a
frequentist-style analysis of _Prob(Data | market hypothesis)_ , rather than a
Bayesian-style _policy_ question that seeks to analyze _Prob(market hypothesis
| Data)_ and necessitates some use of expert-derived priors about the
probability of some market hypothesis in general.
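
To make the distinction concrete, here is a minimal sketch of the Bayesian-style update, applied to a strategy's average monthly edge with a deliberately skeptical prior (all of the numbers are invented for the example):

    import numpy as np

    # Hypothetical backtest output: 120 months of strategy excess returns.
    rng = np.random.default_rng(3)
    observed = rng.normal(0.003, 0.03, 120)          # synthetic, for illustration

    # Skeptical prior on the true mean monthly edge: centered at zero and tight,
    # encoding the expert belief that most proposed strategies have no edge.
    prior_mean, prior_sd = 0.0, 0.002
    like_sd = observed.std(ddof=1) / np.sqrt(len(observed))  # std error of the sample mean

    # Conjugate normal-normal update: posterior over the mean edge given the data.
    post_var = 1 / (1 / prior_sd**2 + 1 / like_sd**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + observed.mean() / like_sd**2)

    print(f"sample mean edge:    {observed.mean():.4f}")
    print(f"posterior mean edge: {post_mean:.4f} +/- {np.sqrt(post_var):.4f}")

The frequentist backtest stops at the t-stat on the sample mean; how far the posterior moves away from zero depends heavily on the prior, which is exactly the part being called hard below.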

In the case of a momentum strategy like the one studied, the market hypothesis
would need to be about modern market conditions, the causal underpinnings of
price momentum as a predictive factor, and whether or not its efficacy will
improve or decrease when you condition on the _modern_ market situation.

And crucially, _this_ is the part that active managers are (like everyone
else) very bad at. They can't articulate accurate or informative priors about
the current status of something like the causal underpinnings of price
momentum any better than anyone else.

If you are seeking to pay someone for a _policy_ analysis, e.g. what should I
think about the posterior probability _Prob(market hypothesis | Data)_ , then
you shouldn't be paying them for the _easy_ part of that analysis -- the part
where you write down some silly likelihood function due to heroic OLS
assumptions and optimize it for parameters and update a portfolio position
based on those parameters. That part is borderline _trivial_ if you have
decent software engineers to make sure the code is not excessively buggy (yet
most quantamental shops don't, and they're still letting portfolio managers do
this shit directly with SLOPE in Excel and then making Ivy MBAs translate it
into poor Python or MATLAB code for "production") -- the _hard_ part is
actually articulating informative priors.

Overall, I think it highlights a need for client organizations to become a bit
more sophisticated about the sorts of forecasting models that quantamental
managers use. Then they can begin applying market pressure by firing managers
that do bang-simple things like OLS in a crappy dev environment, and hopefully
create competitive pressure to make quant shops shift to more rigorous
Bayesian methods, or _at least_ make the rampant researcher degrees of freedom
and p-hacking problems transparent so they can be mitigated.

Of course, a lot of this is moot because a very large amount of the business
done with quantamental asset managers is based on nepotism and cover-your-ass
blame insurance. For example, suppose you are a non-technical board member of
the entity charged with managing the pension plans for some state's retired
firefighters. You are paid money not _just_ based on whether the pension
plan's assets improve, but also on whether you did due diligence when hiring
an active manager (read: you hired a fancy consulting firm who told you what
you wanted to hear), and whether you are in a position to fire firms if they
seem to do a bad job, or create pockets of plausible deniability by crafting
"data journalism" accounts about a market downturn that "no one could have
avoided."

Because these board members are the ones with most of the interaction with
quantamental shops, most quantamental shops end up turning into shoddy data
journalism mills -- need a story to explain yesterday's bad returns to the
rest of your board? We're on it. Whadya want? Something about oil & natural
gas? Something data mined about some random earnings reports? Just let us
know. We're your one stop shop to rationalizing away yesterday's returns. Just
don't ask us what will happen tomorrow.

In this sense, clients rarely punish quantamental shops. In fact, many of them
are _handsomely rewarded_ for doing p-hacking and shoddy data journalism as
long as it covers the ass of some client board member who doesn't want to lose
his bonus because of a downturn in the S&P 500 and needs to sell a story to
his bosses fast.

If that kind of stuff doesn't go away, there will be no actual prediction-
based incentive for the asset managers themselves to ever improve. Why work
hard to write good code and do proper statistics when you can make bank by
selling shoddy data journalism?

[1] [http://andrewgelman.com/2012/11/01/researcher-degrees-of-freedom/](http://andrewgelman.com/2012/11/01/researcher-degrees-of-freedom/)

~~~
xixi77
You make it sound like Bayesian-type analysis is somehow fundamentally
impervious to p-hacking and is therefore clearly superior to frequentist
analysis for this purpose; do you have any actual references in support of
this? Because I am personally having serious doubts about that, in particular
given how bad people are at coming up with priors (and when they do, wouldn't
these priors usually be backward-looking anyway?)

~~~
p4wnc6
No statistical method is impervious to malicious attempts to subvert it. That
said, reasoning between two different models on the basis of _magnitude of
t-stat_ is _inherently_ flawed. That is, such an inference process doesn't
possess the properties necessary to lead to the conclusions people try to draw
from it, _even if no malicious intent is present_. When you add this to a
situation where the temptation is high for things like p-hacking, it _actively
encourages_ poor statistical hygiene.

Whereas, with Bayesian statistics, you'd have to _go out of your way and do
something that is glaringly fishy_ to do the same thing. Hiding it from people
is generally a lot harder than hiding what you're doing with generic
p-hacking.

One of the main reasons for this is that the model's reliance upon assumptions
is usually far more transparent in a Bayesian model. You are forced to
publicly declare your dependence on assumptions encoded in the prior. In the
frequentist case, many people _don't even consider_ that there are prior
assumptions that may not hold about the data, and indeed they often report
results which are incorrect because they did not account for basic things,
like how their data is encoded, or non-ignorable experimental designs.

One of my favorite papers about this sort of thing is "Let's Put the Garbage-
Can Regression and Garbage-Can Probits Where They Belong" by Christopher Achen
[1].

Of course, the same thing can happen in a Bayesian setting. Someone can use an
off-the-shelf Bayesian model fitting tool without understanding what it is
doing, and they could use standard prior distributions for analytical
convenience or ease of coding, and then later try to package their result as
some "fancy Bayesian analysis" when it really suffers from low stats quality
just like the dumb t-stat comparison would have.

But it's _harder_ to do that in a Bayesian setting without it raising a red
flag (I guess, unless you just tell bald-faced lies about what code you wrote,
but that's a different story).

The bigger point is that you don't trust just one thing. You don't look at a
single description of model fit or comparison, like two t-stats, and draw a
conclusion. You should plot the residuals, make calibration plots, ROC curves,
histograms of where your observations fall based on predicted outcome
variables, etc. There is _never_ any sort of cookie cutter approach that will
tell you whether some backtest should be believed.
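
As a small illustration of looking at more than one number, a toy sketch of residual diagnostics on a deliberately misspecified linear fit (synthetic data; matplotlib assumed available):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)

    # Toy model: fit a line, then look at more than one summary of the fit.
    x = rng.normal(0, 1, 300)
    y = 0.5 * x + 0.2 * x**2 + rng.normal(0, 1, 300)   # true relation is not linear
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)

    fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
    axes[0].scatter(x, resid, s=8)            # residuals vs predictor:
    axes[0].set_title("residuals vs x")       # curvature means misspecification
    axes[1].hist(resid, bins=30)              # distribution of the errors
    axes[1].set_title("residual histogram")
    plt.tight_layout()
    plt.show()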

As for the usefulness of managing this stuff in the form of articulating
priors, one quote I always liked was in response to a criticism of Bayesian
methods' reliance on a prior, where some academic said something like "So and
so is standing on the front porch with a shotgun ready to shoot down any
assumption that comes over the hill -- meanwhile, the back door is wide open
for any assumption to run in."

If we set aside the fact that _any_ statistics procedure can be misused, or
can be reported upon without proper skepticism and due diligence, then I still
_do_ further believe that frequentism, as an entire epistemological half of
statistics, is simply a failed project, especially for any type of
problem that is a _policy_ problem, which inherently requires access to a
posterior distribution conditioned on some information.

But, I don't want to derail this thread and turn it into yet another
philosophical debate about whether frequentism can be used as a basis for
probability theory that corresponds with reality or not.

[1]
[http://www.columbia.edu/~gjw10/achen04.pdf](http://www.columbia.edu/~gjw10/achen04.pdf)

------
Nomentatus
Gotta agree with Jerf - what happened seems obvious. Someone else got there
first - and we know pretty much when - and by now their algorithm includes
sophisticated filters and refinements. They're pulling in only the best
low-hanging fruit now and competing for it, too, by buying near the end of the
day, say. Margins are much lower now and the stated crude algorithm can no longer
compete.

My first paper-and-pencil investment strategy in stocks was precisely this
momentum one, when I was about 14, around 1970. So, this wasn't an opportunity
that was going to persist forever. (What made it non-viable in the past was
the high cost of broker fees.)

------
lordnacho
Well written article. I kinda fear for some people I used to work with.
They've got a "model" that turns over the portfolio 1.5 times a year,
backtested since around 2000. Sometimes it's hard to get people to understand
that what they're doing is totally crazy.

What the article is hinting at is a very specific part of the scientific
method: mechanism.

When you hypothesise about something, you are also guessing at how exactly the
effect comes to pass. So, for instance, you surmise that incident radiation
induces a current in conductors, lessening the amount of radiation beyond the
point of incidence. This mechanism would have some interesting experimental
results, namely that non-conductors would not shield anything, whereas
conductors could be used to shield, say, a phone from EM radiation. It also
means that if your shield were conducting but very thin, it might not work.

In the financial sphere, you rarely have the luxury of being able to test
mechanisms. There's a huge amount of data, and it's quite easy to find
structure, but the apparatus itself is rarely exposed.

This leaves you with two areas of research: arbitrage, as in what Chollida1
says, and presumed mechanisms.

Since the first area has been explored by him, I'll look at the second.

So let's say you used to work in a large pension fund, and that pension fund
always did its trades starting at 3pm, and filled whatever wasn't done in the
auction. You meet some colleagues, who do the same. Since you now have a
glimpse into a causal mechanism, it's pretty reasonable for you to look at a
model where things that have traded a lot since 3pm continue in the way they
were going.

There could be lots of similar mechanisms in play, coming on and off as the
market evolved.
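
A rough sketch of what a signal built on that presumed mechanism might look like, on hypothetical intraday bars, assuming pandas is available (column names, session times, and thresholds are all invented for illustration):

    import pandas as pd
    import numpy as np

    # Hypothetical intraday bars for one stock; in practice this would come from
    # a market data feed.
    rng = np.random.default_rng(5)
    idx = pd.date_range("2015-06-01 09:30", "2015-06-01 16:00", freq="5min")
    bars = pd.DataFrame({
        "close": 100 + np.cumsum(rng.normal(0, 0.05, len(idx))),
        "volume": rng.integers(1_000, 20_000, len(idx)),
    }, index=idx)

    # Presumed mechanism: large institutional orders start working at 3pm and
    # keep pushing the price in the same direction into the close.
    after_3pm = bars.between_time("15:00", "15:45")
    avg_volume = bars["volume"].mean()
    move = after_3pm["close"].iloc[-1] - after_3pm["close"].iloc[0]
    heavy_flow = after_3pm["volume"].mean() > 1.5 * avg_volume

    if heavy_flow and abs(move) > 0.10:
        signal = "buy into the close" if move > 0 else "sell into the close"
    else:
        signal = "no trade"
    print(signal)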

------
mcguire
Does anyone have current performance data for the _Dogs of the Dow_ or Motley
Fool Foolish Four (?) strategies?

------
canttestthis
If you were an investor in 1985 and you did this analysis and ran out-of-
sample tests, you would have had 15 years of profitability before the
algorithm started generating losses. After a few months of losses you would
have discarded the algorithm and searched for a new one. What am I missing?

~~~
lintiness
Why would you imagine a few months of losses indicated the strategy didn't
work anymore?

~~~
p4wnc6
Exactly. This is a form of counterfactual bias. It's easy to imagine how you
were "almost right" and give yourself credit for that, but it's much harder to
imagine when you were "almost wrong" and correctly penalize your thinking in
proportion to the almost-wrong-ness.

It's easy to imagine you would have turned off the strategy at a convenient
time, but it's hard to imagine that you would maybe rationalize that it's just
a short slump or something and keep the strategy going.

And for any hard rule you make, like a priori committing to turning off some
strategy if there are X consecutive months of bad returns (for reasonably
small X), you can construct some counter-example strategy where, if you had
just had the foresight to hold on until month X+1, you would have made a ton of
money and there was some after-the-fact obvious reason why months 1 through X
had the poor return, and "smart" investors would have understood this and...
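
A small simulation makes the point: even a strategy with a genuine edge will regularly hit X consecutive losing months, so the hard rule throws away good strategies too (all parameters invented for illustration):

    import numpy as np

    rng = np.random.default_rng(6)

    # A strategy with a genuine but small edge: mean monthly return 0.3%, vol 3%.
    n_sims, n_months, x = 10_000, 180, 4
    returns = rng.normal(0.003, 0.03, (n_sims, n_months))

    # Hard rule: abandon the strategy after X consecutive losing months.
    abandoned = 0
    for path in returns:
        streak = 0
        for r in path:
            streak = streak + 1 if r < 0 else 0
            if streak >= x:
                abandoned += 1
                break

    print(f"abandoned a genuinely profitable strategy in "
          f"{abandoned / n_sims:.0%} of simulations")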

------
1812Overture
"Markets can remain irrational longer than you can remain solvent." \- John
Maynard Keynes

