
Machine learning for financial prediction - Matetricks
http://robotwealth.com/machine-learning-financial-prediction-david-aronson/
======
mathgenius
It's just so ridiculously easy to overfit these models, and so so many ways to
shoot yourself in the foot as a result.

For example, "I split the data set into 5 random segments and then trained a
model on 4 of the 5 segments and then tested it on 5th." Such data is serially
correlated (it's not good old iid) so already it looks like you have poisoned
the test set with information from the training set.

The hard part is not "feature engineering" or "ensemble methods", the hard
part is controlling the entropy that you feed these things because they are
voracious monsters and will absolutely eat all of it.

~~~
lpage
> _Such data is serially correlated (it's not good old iid) so already it
> looks like you have poisoned the test set with information from the training
> set._

Kind of. If it was that simple making money off of an autoregressive model
would be trivial -> everyone would do it -> serial correlation would
disappear.

I agree with your observation that figuring out what to feed the beast is one
of the bigger challenges though. Case in point: train a mean reversion model
on the last seven years of S&P data to buy dips and train a momentum model to
buy higher highs. That equity curve would look very encouraging. Do it on a
fifteen year basis, and not so much. Now the question becomes: how long of a
lookback do you use when training your models? Chopping up data at random will
mux out useful correlations. Subsetting into periods leads to poorly
generalized models. Not fun.

------
joegreen
If anyone else is getting errors when loading the page, here's the google
cached version
[http://webcache.googleusercontent.com/search?q=cache:-ciyXfS...](http://webcache.googleusercontent.com/search?q=cache:-ciyXfSG2XoJ:robotwealth.com/machine-learning-financial-prediction-david-aronson/+&cd=1&hl=en&ct=clnk&gl=us)

~~~
meeper16
More tools for ML financial prediction

[http://52.11.211.67/recommend/app/hidden_connections?query=h...](http://52.11.211.67/recommend/app/hidden_connections?query=helium&db=all)

[http://52.11.211.67/recommend/historical-trends/index-contra...](http://52.11.211.67/recommend/historical-trends/index-contracts.html)

------
dpweb
There are a few problems with turning your laptop into a money machine using
data analysis.

Remember the maxim, past performance is not a guarantee of future results. You
can develop strategies based on past data that will beat the market, but, the
nature of markets is to adapt to kill your edge. Markets adapt constantly and
your edge stops working at an unknown point in time. It's unknowable when that
WILL happen because past data can't show that.

The other reason is transaction costs. In gambling this is called the vig.
Let's say I'm betting NFL games. NFL home teams win 51% of games. Even a
flipped coin, I've read, comes up heads 50.1% of the time. Before costs, these
are profitable systems. But you're paying the bookie 10% on each loss. You
could find someone to bet you on coin tosses and bet heads each time. You have
a positive expected return, although you need a huge number of flips to make
money!
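
To put rough numbers on the vig, here is a minimal sketch. It assumes an even-money bet of one unit where a loss costs 1.10 units (one reading of "10% on each loss"); the function names are made up for illustration.

```python
def expected_value_per_bet(win_prob: float, vig: float = 0.10) -> float:
    """Expected profit per 1-unit bet: win 1 unit, lose (1 + vig) units."""
    return win_prob * 1.0 - (1.0 - win_prob) * (1.0 + vig)


def break_even_win_prob(vig: float = 0.10) -> float:
    """Win probability at which the expected value is exactly zero."""
    # Solve p = (1 - p) * (1 + vig) for p.
    return (1.0 + vig) / (2.0 + vig)


if __name__ == "__main__":
    for p in (0.501, 0.51):
        ev = expected_value_per_bet(p)
        print(f"win rate {p:.1%}: EV per unit bet = {ev:+.4f}")
    print(f"break-even win rate at 10% vig: {break_even_win_prob():.2%}")
```

Under these assumptions both the 50.1% and the 51% edge have a negative expected value once the vig is paid; the win rate has to clear roughly 52.4% before the bet pays.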

In trading, of course, the cost is commissions. Why do you think there was a rise
in HFT? The strategies are consistently profitable. (Besides the
flashing/manipulation tactics) It is ONLY profitable because of extremely low
commission costs that are not available to the retail (or even semi-
professional) trader.

Systems that can pull $0.0001 out of every share traded overall on high volume
can be (pretty easily) created, but you can't trade them profitably. In fact,
you will find commissions (semi-pros who pay about $3 per 1000 shares) priced
right at the point of an edge you could be expected to develop.
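
The same arithmetic with the figures quoted above, as a small sketch. It assumes the commission is charged once per share traded and ignores spread and slippage; the function name is hypothetical.

```python
def net_edge_per_share(gross_edge: float, commission_per_1000: float) -> float:
    """Edge left per share, in dollars, after paying commission."""
    return gross_edge - commission_per_1000 / 1000.0


if __name__ == "__main__":
    gross_edge = 0.0001        # dollars per share, figure from the comment
    semi_pro_commission = 3.0  # dollars per 1000 shares, figure from the comment
    net = net_edge_per_share(gross_edge, semi_pro_commission)
    print(f"net edge per share: {net:+.4f} dollars")
    # Commission at which this edge exactly breaks even, per 1000 shares.
    print(f"break-even commission: {gross_edge * 1000:.2f} dollars per 1000 shares")
```

With these numbers the commission is about thirty times the gross edge, so the system only works for someone paying something closer to ten cents per 1000 shares.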

~~~
melling
"nature of markets is to adapt to kill your edge"

If you are a low volume, small time trader, the market isn't going to move as
quickly to adapt to you. If you have $100,000, for example, and return 30% a
year, you aren't on anyone's radar.

~~~
jzwinck
30% on 100k is 30k. You'd be better off getting a regular job unless you can
sustain that for more than 10 years. Which you can't predict.

~~~
branchless
Exactly. Plus let's remember you are adding nothing to life here. All you are
doing is collecting 30k off other people with a slightly less optimal
"strategy" than you.

And no I don't believe these people are "adding liquidity and assisting price
discovery".

In answer to the reply, since I can't post as HN censors detractors of big finance:

I don't believe the benefits of liquidity added by HFT are worth the enormous
costs firms sink into it.

~~~
dsacco
_> > And no I don't believe these people are "adding liquidity and assisting
price discovery"._

The nice thing about reality is that it remains even when your belief persists
against it.

In large-cap stocks during rising markets, high frequency trading does improve
liquidity[1]. While the effect may not be as prevalent during a downturn and
it may not impact smaller stocks as much, I'd like to see your evidence that
it actively harms the market or that the practice is vapid and produces
nothing of value.

Or is that just a statement you made because it nicely aligns with your
political conception of Wall St?

EDIT: It helps to point out that "algorithmic trading" and "high frequency
trading" are not at all the same thing, especially as these terms are usually
conflated on HN. An algorithmic trading system does not necessarily need to
trade at high frequency. Some algorithmic trading systems make trades in
intervals of days or weeks, not seconds or milliseconds. The paper cited here
describes the market-making activities of what is traditionally called high
frequency trading and the benefits it has over human brokers of the past, but
it uses the umbrella term "algorithmic trading."

EDIT 2: The parent comment responded to this one by editing his original one,
because _"HN censors detractors of big finance."_ You also claimed you don't
believe that the liquidity provided by HFT is worth the capital that large
firms dump into it.

In 2013 the entire HFT industry made about $1B, down from $5B in 2009[2]. HFT
is not a large industry. It is eating much of Wall Street's traditional
market-making inefficiencies, which is why it is widely disliked, but it is
not "big finance." Big Finance is generally opposed to HFT.

You still haven't provided evidence or numbers to prove or even quantify what
you're claiming. Are you saying HFT is not worth the investment to firms, or
are you saying it isn't providing some vague "value to society" relative to
alternative uses of investor capital?

The first case is obviously nonsensical, as many firms generate profit using
high frequency trading strategies. The second case is like saying we shouldn't
do anything if it doesn't save impoverished children in Africa. The added
liquidity has a material and beneficial impact on trading outcomes for buy-
and-hold retail investors, which is shown in my first citation here. You have
yet to satisfactorily refute this.

[1]:
[http://faculty.haas.berkeley.edu/hender/Algo.pdf](http://faculty.haas.berkeley.edu/hender/Algo.pdf)

[2]: [http://www.bloomberg.com/news/articles/2013-06-06/how-the-ro...](http://www.bloomberg.com/news/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall)

~~~
branchless
Generating a profit does not equate to generating wealth.

------
mcbrown
Former professional investment manager here...

The biggest problem with things like this, which almost nobody talks about in
the context of investing, is publication bias.

100 people try to develop a profitable trading algorithm. 1 comes up with one
that looks great on back-tests at a 1% confidence (in other words, exactly
what you'd expect from random chance alone over 100 trials).

That person writes an article/pitch/business plan based on their algorithm.
You never see results from the 99 who failed.

Going forward, the successful algorithm is no more likely to work than the
failed 99, but from the perspective of the general public it sure looks like a
winner!
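
A quick way to see the size of this effect, as a sketch: it assumes the 100 back-tests are independent and that none of the strategies has any real edge, so each one passes a test at level alpha purely by luck. Function names are invented for illustration.

```python
import random


def prob_at_least_one_false_positive(n_trials: int, alpha: float) -> float:
    """Chance that at least one of n independent no-edge strategies passes."""
    return 1.0 - (1.0 - alpha) ** n_trials


def simulate_lucky_winners(n_trials: int, alpha: float,
                           n_worlds: int, seed: int = 0) -> float:
    """Fraction of simulated worlds where at least one no-edge strategy passes.

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    strategy passes its back-test with probability alpha by chance alone.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_worlds):
        if any(rng.random() < alpha for _ in range(n_trials)):
            hits += 1
    return hits / n_worlds


if __name__ == "__main__":
    print(prob_at_least_one_false_positive(100, 0.01))
    print(simulate_lucky_winners(100, 0.01, n_worlds=2000))
```

With 100 attempts at a 1% threshold, the chance that at least one worthless strategy looks significant is about 63%, and the expected number of such "winners" is one.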

~~~
mathgenius
> 100 people try to develop a profitable trading algorithm....

It's much worse than this with machine learning approaches. Imagine a million
people trying to find a profitable algo, all on your laptop, and you are
choosing the best one out of all of those.

If you are used to pen-and-paper trading strategies, or even excel
spreadsheets, machine learning is just a completely different level to this.
And probably how it works will be unintelligible to anyone. I don't even see
how someone can write a business plan based on this.

~~~
selectron
The type of approach used has limited effect on survivorship bias, what
matters is the number of people employing different approaches and the size of
the effect. So if machine learning approaches can produce real results, the
data will show this. Survivorship bias is real, but it is not the full story.

------
hendzen
If you can actually reliably generate alpha from a model like this there is no
point of running the strategy yourself. There are any number of hedge funds
that will sign you on, let you keep all of the IP you develop, and give you
10-12% of any returns you generate. That sounds small, but it's mitigated by
the fact that you will have access to potentially billions of dollars in
capital to trade if your strategy has the capacity for it. So you get 10% of a
much bigger pie, with way less downside risk. Plus you get access to all their
internal trading systems, execution services, data feeds, etc, which are
usually orders of magnitude better than what an individual has access to.

~~~
malux85
Who do I contact? I have a deep learning startup that is trading forex right
now, I would like to make some contacts and see if I can integrate

~~~
wocram
Why FX? More direct access to exchanges?

~~~
vegabook
it's the most liquid market on earth, it's 24 hour for the majors, its depth
is enormous, and it is less prone to event risk than individual equities (as
long as you stick to unpegged currencies) so you can get tons of leverage on
it.

------
ChuckMcM
I think financial prediction via machine learning will be a useful crucible
for defining AI from non-AI. So far, so many companies that have applied
machine learning to prediction have ended up on the wrong side of the order
book at the wrong time. I don't know if this is because other algorithms
figure out what they are doing and rapidly develop a counter algorithm to
fleece them, or if it's just savvy traders' intuition about what the algorithm
is keying on and manipulating it. Sort of like good RTS game players that
figure out how the opponent AI is playing and start playing against its
programming rather than some strategy from first principles.

------
xivzgrev
Anyone know where he got all the raw data to feed his algo? Clearly he used a
lot of data and the two main sources of free info i know of are google finance
and yahoo finance. At least with google finance i run into issues with their
api if you execute too many calls simultaneously, a bunch end up not returning
any data

~~~
TDL
Not sure where he got his data, but you might want to try
[https://www.quandl.com/](https://www.quandl.com/)

They have a free, community, curated data set of ~3200 stocks.

~~~
xivzgrev
Wow i have not heard of that site before - thanks!

------
lordnacho
Interesting article. I do something related, and here's my take:

Data mining is useful because it gives you things that are predictive that you
might not have considered at first, but make sense after. This is mainly due
to combinatorial explosion in the potential number of formulas.

You generally have a vague idea of what might be predictive, eg cheapness vs
earnings and cash flow, but there's a huge number of ways that might show up
in the data, and there's a huge number of ways it might hide in the data.

So for instance an old school analyst might do a ranking of price/earnings as
well as cash flow, or whatever bespoke formula desired.

A data mining approach could take all the fundamentals and generate formulas
mixing the variables, yielding a number that seem to be effective. Out of
those, you'd look at them and decide that they capture some thesis (low P/E,
upward trend in earnings). Then you'd look at whether the formula is sensitive
to small tweaks. For instance, if you regressed the last 6 earnings and it had
phenomenal performance, but with 5 or 7 it wasn't, you probably conclude it's
some sort of random result.
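
The "sensitive to small tweaks" check could be sketched like this. The scores are hypothetical back-test results keyed by the tweaked parameter (say, the number of earnings periods regressed), and the 50% tolerance is an arbitrary assumption, not a standard threshold.

```python
from __future__ import annotations


def is_robust(scores: dict[int, float], best_param: int,
              max_drop: float = 0.5) -> bool:
    """Return True if neighbouring parameter values keep most of the best score.

    scores maps a parameter value to a back-test score (higher is better).
    max_drop is the largest fractional fall from the best score that the
    immediate neighbours are allowed to show.
    """
    best = scores[best_param]
    neighbours = [scores[p] for p in (best_param - 1, best_param + 1)
                  if p in scores]
    if not neighbours or best <= 0:
        return False
    return all(s >= best * (1.0 - max_drop) for s in neighbours)


if __name__ == "__main__":
    spiky = {5: 0.1, 6: 1.8, 7: 0.2}    # hypothetical: only 6 works
    smooth = {5: 1.5, 6: 1.8, 7: 1.6}   # hypothetical: 5 and 7 also work
    print(is_robust(spiky, 6), is_robust(smooth, 6))
```

A result that collapses as soon as the parameter moves by one step is treated as noise; one that degrades gently is at least worth a closer look.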

There's funds that take the mass approach to an extreme. They have huge
databases, with a genetic algorithm that generates expression trees, and a
battery of stats (incl backtests) to decide what works. They end up with many
thousands of strategies that are a great deal more effective than your
standard one-trick pony fund.

~~~
Wonnk13
very interesting. Can you recommend any resources for someone with a fairly
strong stats / programming background but no real substantive finance
experience?

~~~
lordnacho
Igor Tulchinsky has a fund that does this. He also writes books and papers
about how he does it, with everything you need to do it yourself.

------
dreamdu5t
There's a hedge fund built by anonymous data scientists -
[https://numer.ai](https://numer.ai)

You can use ML to make money on encrypted stock data for free. Think Kaggle
but the winning models are used to trade.

~~~
chillacy
It looks like the feature set is fixed on numer.ai? If so everyone's probably
developing mega ensembles (this is what netflix's competition ended up as,
with teams merging because their models did better together). Compared to
Quantopian, where you're responsible for feature engineering too (though
numer.ai is probably easier to get started, since model selection is imo the
fun part).

------
aj7
Really successful traders spend their obtaining insider information, not
massaging public data. It stands to reason that an ensemble of technical
trading methods would regress towards the mean.

~~~
egwor
Using insider information is illegal. If you mean insider information then
this is total nonsense. Sure, there are a few that do illegal things (and
inevitably get caught since there is so much monitoring going on).

~~~
falsestprophet
Exactly, when I heard drug use was a problem in some communities I too knew
that was impossible because selling and using drugs is illegal.

Sure, there are a few that do illegal things (and inevitably get caught since
there is so much monitoring going on).

~~~
dsacco
Your middlebrow dismissal doesn't work here because the parent didn't say it's
not a problem or that it doesn't happen. The parent is refuting a point that
successful traders exclusively become that way by trading on illegally
obtained information.

~~~
zepto
The point being refuted is a straw man - nobody said anything about
'exclusively'.

~~~
dsacco
No it isn't. The original comment might not have said "all successful traders
spend their time...insider information" but the implication is there as it was
stated. Given what the grandparent comment replied with, it appears I wasn't
the only one who inferred that message.

~~~
zepto
Just because more than one person infers something doesn't mean it's really
there.

------
sovande
I'll invoke the Black Swan
([https://en.wikipedia.org/wiki/The_Black_Swan_%28Taleb_book%2...](https://en.wikipedia.org/wiki/The_Black_Swan_%28Taleb_book%29))
since it hasn't been done yet in this thread.

------
aj7
...spend their time and resources...

------
robotwealth
Hello

I'm Kris, the guy who wrote the article that started this thread. Thanks to
all who have read my article and taken the time to comment. In the context of
my motivation for starting my blog, it means a lot. I'm an engineer who became
interested in quantitative finance and machine learning a few years ago. I
learned how to code and apply my maths and stats knowledge to finance
independently - no formal training whatsoever. This meant that for a long time
I was conducting research and developing trading systems in a vacuum; I had no
one to bounce ideas off or learn from. So I started writing about what I was
doing in the hopes of getting some feedback. So thank you all for providing
some. The insights were immensely valuable and I learned a lot.

I thought it would be useful to respond to some of the comments.

mathgenius brought up the extremely valid point that regular k-fold cross
validation in a time series context doesn't make sense since the data is
autocorrelated, not iid. I no longer use this approach for time series data,
instead favoring Rob Hyndman's time series cross validation approach, also
known as forward chaining. I believe this approach is the best representation
of a real trading environment. The issue becomes deciding how large the
rolling window of training data should be - older data may be obsolete, but
excluding too much history can lead to not enough training instances.
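
For readers who haven't seen forward chaining, here is a minimal sketch of the splitting logic. It is a generic illustration, not necessarily the exact procedure used in the article; the function name and the `max_train` parameter are assumptions made for the example.

```python
from typing import Iterator, List, Optional, Tuple


def forward_chaining_splits(
    n_samples: int,
    initial_train: int,
    test_size: int,
    max_train: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with each test block after its training data.

    max_train=None gives an expanding window; an integer caps the training
    window so the oldest observations roll off.
    """
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_start = 0 if max_train is None else max(0, train_end - max_train)
        train = list(range(train_start, train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += test_size  # step forward by one test block


if __name__ == "__main__":
    for train, test in forward_chaining_splits(10, initial_train=4, test_size=2):
        print("train", train, "test", test)
```

Every test index is strictly later than every training index, so nothing from the future leaks into training. Setting `max_train` is where the rolling-window size question shows up. scikit-learn's `TimeSeriesSplit` provides a similar splitter if you'd rather not roll your own.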

dpweb raises a good point too, namely that just because your model performed
well on past data, even if that data was out of sample, there is no guarantee
that the future will be sufficiently like the past, meaning that your model
may well become useless at some point in time (possibly very quickly). This is
a valid point, but no reason to abandon the markets. It does however require
that any algorithm's live performance be objectively monitored such that the
level of deviation from expected performance can be statistically quantified.
Once a pre-determined confidence level in the model's obsolescence is reached,
it should be removed from the portfolio.
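
One simple way to quantify that deviation, as a sketch only: a one-sided test of the live mean return against the mean the back-test led you to expect. It assumes live returns are roughly independent and uses a normal approximation, both of which are questionable for real return series; the function names and the 5% threshold are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence


def underperformance_p_value(live_returns: Sequence[float],
                             expected_mean: float) -> float:
    """One-sided p-value for live returns this low if the true mean were expected_mean.

    Small values suggest the model is no longer performing as the
    back-test implied.
    """
    n = len(live_returns)
    if n < 2:
        raise ValueError("need at least two live observations")
    standard_error = stdev(live_returns) / sqrt(n)
    if standard_error == 0:
        return 0.0 if mean(live_returns) < expected_mean else 1.0
    z = (mean(live_returns) - expected_mean) / standard_error
    return NormalDist().cdf(z)


def should_retire(live_returns: Sequence[float], expected_mean: float,
                  alpha: float = 0.05) -> bool:
    """Flag the model once underperformance is significant at level alpha."""
    return underperformance_p_value(live_returns, expected_mean) < alpha


if __name__ == "__main__":
    live = [-0.01, -0.02, 0.0, -0.015, -0.005, -0.01, -0.02, -0.01]  # made-up data
    print(should_retire(live, expected_mean=0.01))
```

The threshold has to be fixed before going live; choosing it after seeing the results reintroduces the selection problem discussed elsewhere in this thread.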

mcbrown's comment about publication bias is a good one too. Even worse, I've
personally developed hundreds of trading systems that I haven't published.
Other bloggers and publishers have most likely also done the same. This form
of selection bias is very likely rampant, and is especially applicable to
models 'discovered' using machine learning techniques that may not be rooted
in traditional economic or financial principles. The moral: absent some form
of robust accounting for selection bias, view all of these types of systems
with a healthy dose of skepticism, and the published performance as a
theoretical upper limit to what could be achieved in practice.

hendzen's point about partnering with a fund or proprietary trading company
rather than running your reliable, alpha generating strategy yourself is also
a valid one. I have happily found this out for myself recently.

Also, lordnacho is spot on regarding his take on the utility of data mining in
finance.

Thanks again for all the comments!

------
nxzero
Never understood why anyone would spend time creating any trading method given
even if it did work (possible, but unlikely) the SEC would audit you and then
leak how you were making the outperforming returns.

Welcome any thoughts, in part because legally beating the market is possible,
just don't get the SEC & OPSEC aspect.

~~~
lordnacho
Why would the SEC audit you? Just by random chance there are many people
outperforming the market. They can't audit you just for outperforming.

If they do audit you, how will they discover how you are generating your
trading decisions? Their remit is to make sure you aren't doing something
illegal. There's no reason they would understand what you were doing in
anything other than a superficial way.

Also, something can be profitable, and obviously so, without being easily
reproducible. For instance there are firms that do simple footrace arbitrage
on the same security between different exchanges. Not hard to understand, but
you still can't do it. There's a whole spectrum of strategies that are on a
frontier on the map of easy-to-understand vs easy-to-implement.

Besides all that, I think even if you were to learn about a way to beat the
market, the way you found out might lead you to be very skeptical of whatever
was proposed. If a guy is selling it on a website, you will probably not
believe him, right? And if he showed you backtests that worked, you would
suspect they were generated from a random generator of some sort. And if he
then shows you the math, you would almost certainly find fault with it. Why
did he do this or that transformation on the data? Must be random...

~~~
nxzero
A few years back, the SEC started being very aggressive about finding entities making
above average returns; my understanding is that if over a set amount of
transactions you're making over 30% that you will get "knocked" and the
auditors have zero reason not to leak the information. Best example I know is
the Walmart parking lot satellite imagery analysis; happy to dig up a link.

~~~
Bromskloss
I'd love a link as well as additional examples, if you can think of any.

~~~
partisan
Googling "walmart parking lot analysis" yielded the following as the first
result.

[http://www.cnbc.com/id/38722872](http://www.cnbc.com/id/38722872)

~~~
Bromskloss
Thanks. It has no information about the technique being leaked by the SEC,
though.

~~~
eggy
Not sure about the SEC leaking anything, but the idea is that investors use
satellite imagery and image processing to see things like how many cars are
parked at several big branches, make assumptions about how that correlates
with spending, and then act on the information before Walmart releases its
quarterly statement, sort of.

Number of container ships docked or leaving port around China can forecast
trends in China's exports, again before any official numbers are made
available. I don't see this as insider trading. You pay for the satellite
time, you gamble on your data analysis, and you either win or lose. If it were
certain, others would do the same, and the edge would quickly be lost to
market adjustments.

