Fitting to Noise or Nothing at All: Machine Learning in Markets (zacharydavid.com)
295 points by bilifuduo on Aug 6, 2017 | 69 comments

Most academic CS literature is complete BS. The vast majority of papers fit a simple formula: "We apply method X to problem Y, and outperform competing approaches similar to method X."

Meanwhile, no one uses method X or any of its cousins, because in the real world the problem is solved very differently, with a combination of principled algorithms and heuristics derived from real-world datasets.

The typical paper also fails to give any theoretical reason or mathematical insight as to why its version of X is better.

Thus, it doesn't actually solve a real world problem OR advance scientific understanding.

I think this is a reasonable point, but I would just add that a lot of people in CS academia are well aware of this. The problem is that we all serve multiple masters, and one of the things we have to do is publish frequently. I think you'll find that many CS academics try to strike a balance between publishing for the sake of publishing and actually working towards a larger scientific goal. Personally speaking, I'm definitely guilty of publishing work that looks good on my CV but does not advance my deeper scientific agenda. That said, the science is always at the front of my mind, even if it only makes up 10-20% of my actual publications.

As for solutions, it would be great if we focused less on the frequency at which we publish, and if editors were more willing to publish work with novel ideas even when it doesn't (yet) have state-of-the-art performance. Like any job, there will always be parts that are tedious, political, and, yes, even counterproductive. At some point, as an individual, you just have to play the game while still thinking about, and trying to advance, the bigger picture.

>Personally speaking I'm definitely guilty of publishing work that looks good on my CV, but does not advance my deeper scientific agenda.

This sounds analogous to resume-driven development. And I completely understand it. I have pragmatically chosen the best tech for the job (factoring in the learning curve for new tech and what I already know) and haven't learned much new tech for the last couple of years. Now I get the feeling my CV is looking a bit dated. Does it make me worse at developing software? I would say not. I have mastery in a few areas rather than shallow knowledge in a lot. Will it make it harder for me to get a job? Quite possibly, if I keep going this way.

This sort of comment does very little to advance the conversation. I made several attempts to write a response that talks about heuristics, algorithms, what industry does and what academia does, but in the end I found your portrayal of academia and industry so hyperbolic and one-sided that I felt it wasn't worth getting into the discussion.

Let's try to have respect for each other on both sides of the "academia/industry" fence and not resort to unrealistic black and white views of the world.

If you look at an article about finance and machine learning, success depends on whether you earn real money or not. It is cynical, but that is how algorithms are valued in that space. Real money returns are the real test.

> Most academic CS literature is complete BS

That's complete BS. Certain journals are complete BS, and certain subfields of CS are superfluous and often lack substance in their publications. But CS literature from top-tier journals has vast merits.

English is not my native language, so I don't get offended at all by the coarse language (coming from "saltier" Spanish, for me it's actually quaint that you just hint at the word with 'BS'), but please: I'm sure that with a little more effort you both can provide an adjective that carries more information. Just tagging things as BS seems juvenile, and on HN we try to keep conversations more rational and less emotional.

Depends. If the method it proposes does have an advantage over other methods, and generalizes well, it is a good paper nevertheless.

I think the assertion is that they never do. This is widely taught in ML classes as the advantage of data being much more important than the algorithms.

I'd be curious to know how accurate the assertion is.

Most of them are. But my problem with this claim is that good papers could be classified under this formula as well. In a field like image recognition, almost everything after ResNet is an incremental improvement on it; most of it is BS, but there are still some interesting new ideas, even if their gains are marginal.

I think the difference is that many of the new algos aren't even showing reproducible incremental improvements. Rather, they just benefit from the massive amounts of data available.

Though, I fully grant that the problem is as much that they are optimizing different things. Sometimes the goal was just faster training, not better accuracy. (Though I'm curious how fungible those are... Something that trains faster would ostensibly train to better accuracy in the same time, no?)

Could an enterprising person find such papers, cleverly select a dataset such that X is way worse, and finally refute the original paper?

This seems like something an amateur could do. Wouldn't that tear down a bunch of this junk?

Sure. But for approximately the same level of effort, you could generate a new paper in the same template. It would likely be easier to get that "novel result" published than the refutation.

And that's generally the truth in every scientific field.

Right now there is a "reproduction renaissance" in psychology and cancer research, where you can get published, and get acknowledged, for tearing down a high-profile paper. But in general, unless you aim at refuting work at the highest level, it is hard to get published and it advances your career nowhere, so people have no incentive to do it.

A non-replication does not "tear down" a paper... It is not any kind of "attack"; your attitude is one of the sources of the problem.


When I did medical research, I wished people would try to replicate my findings so I could see how consistent the results were and how well my model fit future data... In fact, it was really annoying that it was unlikely anyone would ever do a direct replication.

My attitude? What attitude is that?

A single non-replication does not. Consistent failure to replicate, after communicating with the original author, does.

See e.g.



Or just google "reproducibility crisis", the two most common subjects that come up are cancer research and psychology research.

That is the problem: why would no one try to do a replication?

Because the incentives are to provide "new research", not to confirm or deny a known result.

It seems to me that, most probably, no one has ever gotten tenure by reproducing or disproving results.

I can't upvote this enough.

I've spent an unreasonable amount of time reading research related to finance for my web app Piglet:


Long story short, all "research" is pretty much B.S.

I used to assume this was because people want to make money, so they keep the good analysis secret. However, after working in the industry a few years, I think it's mostly because they just don't know how to apply the algorithms, or whether it's even possible.

I think my favorite example is the seminal paper on using Twitter sentiment to predict stock movement [1]. They don't use a large enough dataset, and more importantly they use Granger causality to identify "causality"[2] between sentiment and stock value. They then claim they found a specific range with a p-value indicating the two are correlated... Of course you'll find a correlation when you look at two normalized signals and try to match them up.

Now, if they had not used the DJIA (Dow Jones Industrial Average) but instead 500 individual stocks, and had found that Twitter sentiment correlated with stock values 90% of the time at offsets between 5 and 10 days, I'd argue they probably had something.

However, because their method is literally a BFS over only two signals in an attempt to find a correlation, they must correct the p-values for multiple comparisons, i.e. "look and you shall find"[3].

This is just one of a hundred issues I've found, but it really sheds light on how bad that industry is.

[1] https://scholar.google.com/scholar?hl=en&q=twitter+stock+sen...

[2] Granger causality offsets two signals in time and tests for correlation at each offset, i.e. it finds "causality" by finding correlation at an offset in time.

[3] https://stats.stackexchange.com/questions/5750/look-and-you-...
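To make the "look and you shall find" point concrete, here is a small Python sketch with entirely hypothetical data (`sentiment` and `price` are just seeded random walks, not real series): scanning many lead/lag offsets between two unrelated series inflates the chance of a "significant" correlation unless the p-values are corrected for the number of tests performed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 250  # roughly one year of daily observations

# Two independent random walks: by construction there is no causal link.
sentiment = np.cumsum(rng.standard_normal(n))
price = np.cumsum(rng.standard_normal(n))

# Scan many lead/lag offsets, as a naive search for "causality" does.
pvals = []
for lag in range(1, 11):
    r, p = stats.pearsonr(sentiment[:-lag], price[lag:])
    pvals.append(p)

# Uncorrected: any p < 0.05 looks like a discovery.
raw_hits = sum(p < 0.05 for p in pvals)
# Bonferroni correction: multiply each p-value by the number of tests.
corrected_hits = sum(min(p * len(pvals), 1.0) < 0.05 for p in pvals)
```

Random walks are also non-stationary, which by itself produces spurious correlation even at a single lag; differencing the series before testing is the usual remedy.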

> Long story short, all "research" is pretty much B.S.

Easy there! When I start feeling cynical about things like almost every sentiment analysis paper, I go back and read some of my favorite papers, especially ones from the 70s and 80s (e.g., [1]).

Another useful heuristic is to read the best-paper award winners from conferences in a particular area. It's not a perfect metric, but the signal-to-noise ratio is better.

[1] the flajolet-martin algorithm: http://algo.inria.fr/flajolet/Publications/FlMa85.pdf

> Long story short, all "research" is pretty much B.S.

I'm curious about this as it relates to Piglet. Are you monetizing on the idea that most people won't realize this and will pay for Piglet in the hope that the signals they get there are somehow positively correlated with what's happening in the real world for a particular stock/idea/company/politics, when in reality you don't think it really is?

For Piglet, I'm actually going back and conducting research to find valuable signals, and releasing everything as clearly as possible.

Importantly, I'm providing access to signals first, with explanations of what each signal is and how it's useful today.

I also have an investment model currently achieving 50%+ YoY on average (backtested over 8 years, live for 4 years).

From there, I'm building further machine learning models on top of the signals. However (as mentioned above regarding the sentiment analysis paper), I'm doing much more robust research: for example, comparing all stocks as opposed to just the DJIA.

What are your thoughts on Renaissance Technologies?

I got into quant finance 12 years ago with the mistaken idea that I was going to successfully use all these cool machine learning techniques (genetic programming! SVMs! neural networks!) to run great statistical arbitrage books.

Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.

Noise dominates everything you will find in statistical arbitrage. An R^2 of 1% is something to write home about. With this amount of noise, it's generally hard to do much better than a linear regression. Any model complexity has to come from integrating over latent parameters or manual feature engineering; the rest will overfit.
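A toy illustration of that point, with made-up data in which the true signal explains roughly 1% of the variance: an over-flexible model has little to fit but the noise, so its out-of-sample R^2 tends to collapse while a plain linear fit stays near the tiny true R^2.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
x = rng.standard_normal(n)
# The true signal explains roughly 1% of the variance of y.
y = 0.1 * x + rng.standard_normal(n)

train, test = slice(0, 1000), slice(1000, None)

def r2(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A plain linear fit versus a deliberately over-flexible polynomial.
lin = np.polyfit(x[train], y[train], 1)
poly = np.polyfit(x[train], y[train], 15)

r2_lin_in = r2(y[train], np.polyval(lin, x[train]))
r2_poly_in = r2(y[train], np.polyval(poly, x[train]))
r2_lin_out = r2(y[test], np.polyval(lin, x[test]))
r2_poly_out = r2(y[test], np.polyval(poly, x[test]))
```

The in-sample R^2 of the polynomial is necessarily at least as high as the linear fit's, which is exactly why in-sample fit is so misleading at this noise level.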

I think Geoffrey Hinton said that statistics and machine learning are really the same thing, but since we have two different names, we might as well use "machine learning" for everything that focuses on problems with a complex structure and low noise, and "statistics" for everything that focuses on problems with a large amount of noise. I like this distinction, and I did end up picking up a lot of statistics working in this field.

I'll regularly get emails from friends who tried some machine learning technique on some dataset and found promising results. As the article points out, these generally don't hold up. Accounting for every source of bias in a backtest is an art. The most common mistake is to assume that you can observe the relative price of two stocks at the close, and trade at that price. Many pairs trading strategies appear to work if you make this assumption (which tends to be the case if all you have are daily bars), but they really do not. Others include: assuming transaction costs will be the same on average (they won't; your strategy likely detects opportunities at times when the spread is very large and prices are bad), assuming index memberships don't change (they do, and that creates selection bias), assuming you can short anything (stocks can be hard to short or have high borrowing costs), etc.
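The close-price mistake can be sketched in a few lines. Everything here is hypothetical (made-up prices, a made-up cost number, and a toy mean-reversion rule, not a real strategy): the biased backtest pretends to trade at the very close the signal was computed from, while the honest one waits for the next open and pays a cost.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical daily closes, and the opens that follow them.
close = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal(n)))
next_open = close * (1.0 + 0.005 * rng.standard_normal(n))

# Toy mean-reversion signal computed from the close itself.
signal = -np.sign(np.diff(close, prepend=close[0]))

# Biased backtest: pretends we trade at the very close used to
# compute the signal, with no costs.
biased_pnl = signal[:-1] * (close[1:] - close[:-1])

# More honest: fills happen at the next open, after the signal is
# observable, and each trade pays a fixed cost (made-up number).
cost = 0.02
honest_pnl = signal[:-1] * (close[1:] - next_open[:-1]) - cost
```

The honest version gives up the overnight move between signal and fill and pays costs on every trade, which is where many daily-bar pairs strategies quietly lose their apparent edge.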

In general, statistical arbitrage isn't machine-learning-bound (1), and it is not a data-mining endeavor. Understanding the latent market dynamics you are trying to capitalize on, finding new data feeds that provide valuable information, carefully building out a model to test your hypothesis, and deriving a sound trading strategy from that model: that is how it works.

(1: this isn't always true. For instance, analyzing news with NLP, or using computer vision to estimate crop outputs from satellite imagery can make use of machine learning techniques to yield useful, tradeable signals. My comment mostly focuses on machine learning applied to price information. )

A few years ago I graduated with a PhD in statistics with lots of ML inspiration. Since then I have always dreamed of applying my knowledge and skills in this domain. However, despite believing I was 'probably' in a decent position to do so, I consistently read about how impossible it was. I have a boring 'normal' person's job, but posts like this are somewhat reassuring that I made a reasonable decision in abandoning a life of fruitless data mining and overfitting.

I am keen to second this. With a PhD in probability and loads of experience in data analytics, my experience has taught me that we are too ignorant, and sometimes too ambitious, when we try to predict the outcome of a stochastic process (e.g. a financial time series): the amount of information required to make a sound prediction is far beyond what we have. Unless there's a very clear dominating signal among thousands of information sources, very often we are trading on noise.

Although I don't necessarily agree with all the points in this article, it just reminds me what Poincaré said:

"You ask me to predict for you the phenomena about to happen. If, unluckily, I knew the laws of these phenomena I could make the prediction only by inextricable calculations and would have to renounce attempting to answer you; but as I have the good fortune not to know them, I will answer you at once. And what is most surprising, my answer will be right." -- Poincaré, H. (1913) The Foundations of Science. New York, The Science Press. p. 396.

I don't think the message here is "don't do it," but "have domain knowledge." The crux of the paper was scientists applying ML to a bunch of data without really understanding trading.

You can actually have scientists find signals in data they have no domain experience in. In a typical hedge fund the quantitative researchers will be a different group from the quantitative developers and traders. There are fuzzy lines between those depending on culture, but those three groups are broadly the front office. You really need domain experience for execution and risk management, but pure insights can be derived without necessarily needing any domain experience.

That said, quant researchers typically understand how the market works. They are just able to quickly excel without a background in it.

It's easy to forget that this is a highly competitive field.

You're used to seeing the techniques you work with capture signal, because there usually isn't an army of PhDs in math, physics, and computer science working around the clock to trade every last signal out of that data.

In the end, it doesn't even matter if you're the best statistician in the world: whatever signal you detect may simply not be worth the effort you put into detecting it.

> Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.

Could it be that by looking only at the price time series we are not looking at the actual information, but only at the output of an irreversible function, and that to effectively predict prices we need a model that captures what actually happens in the real world?

It's hard for me to follow exactly what you're getting at here, but (to make an analogy to cryptography) it seems like you're saying it's hard to find a signal because we only have the apparently random output of a function, not the seed itself.

It's fairly true that (at least these days) you're not going to identify a signal just by looking at a timeseries of prices, no matter how granular your dataset is (up to and including tick data). There are pockets of repeating patterns but those are vanishingly small and fleeting; the prices themselves may as well be stochastic.

Essentially all funds are empowered with significant amounts of data, and the prices themselves are just used for backtesting and sanity checking. Price is the source of truth, but it's not where new insights are identified. The signal comes from other types of data that are far more reversible.

Thanks, it makes sense.

The article cited in the OP says:

""" Our historical dataset contains 5 minute mid-prices for 43 CME listed commodity and FX futures from March 31st 1991 to September 30th, 2014. We use the most recent fifteen years of data because the previous period is less liquid for some of the symbols, resulting in long sections of 5 minute candles with no price movement. Each feature is normalized by subtracting the mean and dividing by the standard deviation. The training set consists of 25,000 consecutive observations and the test set consists of the next 12,500 observations. """

(excerpt from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2756331)

That sounds to me as if the only data they trained the model on is a bunch of prices.

Am I reading it correctly?
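For what it's worth, the normalization the excerpt describes is a plain z-score; a minimal sketch with made-up numbers:

```python
import numpy as np

def zscore(feature):
    """Normalize a feature by subtracting its mean and dividing by its
    standard deviation, as the excerpt describes."""
    feature = np.asarray(feature, dtype=float)
    return (feature - feature.mean()) / feature.std()

# Hypothetical 5-minute mid-prices.
prices = np.array([101.0, 102.5, 99.8, 100.7, 103.1])
z = zscore(prices)
```

One thing the excerpt doesn't say is which window the mean and standard deviation come from; computing them over the full sample would leak test-set statistics into the training features, itself a subtle form of look-ahead bias.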

Could you elaborate on news & NLP in regards to stocks?

We tried sentiment analysis in uni a few years ago and had no good results:

The idea was essentially: news says: 'stock A is great' -> it goes up shortly thereafter

We tested our algorithms by classifying Amazon reviews and tweets by sentiment. Those are filled with sentiment, and it's easy to detect whether a review is 5 stars or 1 star. The news articles we parsed all had near-neutral sentiment. We ended up building a classifier that could detect the news category of an article quite easily instead.

My initial idea was sparked by the Gulf spill and the subsequent dip in BP; I wanted to detect and capitalise on big events like that, but the news sources we parsed always seemed to lag significantly behind the stock movement, too.

I just did a project on this, but for Bitcoin instead of stocks where I examined news, Reddit, forum (Bitcointalk.org) and IRC sentiment using some simple ML algos. The goal was to determine whether this data has any predictive causality.

I scraped the above sources over a full year (2015) and then had the data annotated on positive, negative and neutral sentiment.

The problem with labeling sentiment data is that there might not be a single 'true' label due to varying interpretations and ambiguity. So at best you'll get to 80-85% accuracy there. The less formal (News > Reddit/Forum > IRC), the lower your accuracy due to lack of context.

I then matched the annotated sentiment to market data and did some causality analysis. Interestingly, what I found is that you can't just say positive news = price/volume goes up. It is far more fine-grained than that. For example, negative Reddit sentiment leads price movements, but price movements lead positive sentiment. For news it's the reverse.

All in all I didn't incorporate this into any trading strategies, but found it interesting to see the differences between online sentiment channels.

NPR's Planet Money is doing that by following Trump's tweets: http://www.npr.org/sections/money/2017/04/07/522897876/meet-...

I recall reading a paper (sorry, can't find the link now) where the volume of news mentioning the company was what was relevant.

You need to measure this relative to the average volume of news for all tracked market participants, and relative to the normal volume of news for that stock.

That way, if the return has recently gone up and there is a substantial increase in news volume, it's probably because of positive news.
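The per-stock half of that idea can be sketched as a rolling z-score of daily article counts (hypothetical function name, window length, and numbers):

```python
import numpy as np

def abnormal_news_volume(daily_counts, window=20):
    """Rolling z-score of daily article counts for one stock: how unusual
    is today's coverage relative to the stock's own recent baseline?"""
    counts = np.asarray(daily_counts, dtype=float)
    scores = np.full(counts.shape, np.nan)
    for t in range(window, len(counts)):
        hist = counts[t - window:t]
        sd = hist.std()
        if sd > 0:  # undefined when the recent baseline is constant
            scores[t] = (counts[t] - hist.mean()) / sd
    return scores

# A coverage spike on day 25 against a quiet, slightly noisy baseline.
volume = np.array([4, 5, 6] * 10, dtype=float)
volume[25] = 50.0
scores = abnormal_news_volume(volume)
```

A cross-sectional version, normalizing each stock's count against all tracked names on the same day, would capture the "relative to all market participants" part.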

That distinction from Hinton is quite interesting. I often work with "machine learning" models, but for the most part they are regression and/or tree classification models that are easier to understand conceptually. I have yet to implement any complex machine learning techniques, because I fear that whether or not they work, I won't be able to interpret them in terms of their statistics.

What's your view on cases like Renaissance Technologies? There's an interview with James Simons online (https://www.youtube.com/watch?v=QNznD9hMEh0) in which he explicitly talks about using math models to detect anomalies, e.g. trends. It's also known that they've used Hidden Markov Models, at least in the early days.

Rentec uses machine learning, but more importantly the firm curates massive amounts of high-signal data. The most significant part of their work lies in process automation and the rapid testing of hypotheses, which empowers the optimal use of mind-numbing amounts of data that most other firms simply can't take advantage of. Very early on their success was due (in part) to the willingness of Simons et al to use correlations in disparate datasets which could be proven but which didn't really make sense, and which wouldn't be explained by anything intuitive.

In other words, Rentec is not just pointing machine learning models at data, they're investing in a very robust data processing pipeline. Everything before the analysis is just as important as the analysis at funds like theirs.

Just want to second this comment, their data processes are the key strength of Medallion. Grandparent comment by murbard2 also talks about the importance of this component to quant work (in the last paragraph: "finding new data feeds that provide valuable information")

While Jim Simons is a mathematician and Rentec has clearly hired many brilliant people with PhDs, it's maybe worth mentioning that the actual mathematics used in their work isn't impossibly difficult or secretive. Many of the PhDs working there don't have a PhD in math but rather in something like physics, so I would say if you are familiar with graduate-level math courses, you can understand the math needed for this type of work. Math isn't where their edge comes from. Medallion is also 30 years old; their early work in the mid-80s was done on computers with less processing power than your phone, so "machine learning" as the term is used lately, or access to supercomputing hardware no one else knows about, is also not where their edge came from.

Well said. Where most funds have the same problem 'chollida1 describes here[1], Rentec (and other similar firms) moved past that by establishing the right culture and investing in the right technology from the outset.

They need smart people, but hiring the smartest people and having the most sophisticated models won't do you any good if you can't acquire high signal data, can't clean that data properly and can't rapidly backtest. And if you can't do any of that, adding more data is just going to add more noise.


1. https://news.ycombinator.com/item?id=13139638#13140352

Know all about it; I was a data lackey B* at an HF (not Rentec) when I first graduated from school.

By the way for anyone reading this, Rentec is hiring, need to know C/C++ and a "basic knowledge of statistics" is preferred but not required.

Given your background, I'd be interested in picking your brain a bit for a few projects I'm working on. If you're looking to remain anonymous would you mind sending me an email (in my profile), or throwing an email up in yours?

But I feel like they're looking for PhDs and Math Olympiad types.

Is there really that much high-signal data that is not already being used? Is their advantage in finding signals in data that other people overlook, or finding new data sources?

> Is there really that much high-signal data that is not already being used?


> Is their advantage in finding signals in data that other people overlook, or finding new data sources?


I won't go into any particular detail, but there is a lot of signal in the market for those who are imaginative. The obvious and low hanging fruit is long gone, but there are still many places that offer an edge.

I wish there were some sort of full intro tutorial on finding strategies, i.e. an example of a former signal (now traded away), the thought process, the data sourcing, the statistical analysis, the trading/signal strategy, etc.

The thing is that no one is really motivated to make a complete tutorial on finding strategies because it's economically irrational. You're either giving away specific sources of alpha or you're empowering potential competitors. This is why it's virtually guaranteed that anyone selling courses that teach trading or related skills is a fraud - they have essentially no incentive to just ramp up their own trading capital instead.

The industry is also extremely secretive (necessarily so). You'll hardly ever find a good treatise on finding novel signals, but there are tutorials on algo trading in general with examples of production strategies that used to work which have been, as you say, traded away. For that purpose I'd recommend you start here: http://www.decal.org/file/2945

Also, is this high-signal data publicly available, or only for purchase from vendors?

It's public (technically it has to be in order to be strictly legal for use). But for the most part it is unintuitive, unclean (needs to be heavily normalized) and not easily accessible. There are a variety of vendors that source it, clean it and analyze it to make it salable to firms. Quantitative firms also have teams devoted to doing all of that internally.

how does one go about getting a quant position? (PhD, bachelors, masters?) I feel like there's a bias towards PhDs...

I like the responses to this already. But I'll add that there's a difference between what I lovingly call throwing poop at the wall, and using machine learning to estimate non-linear functions of structural models or to combine signals that already have alpha.

ML can be very useful if you have some signal or if you have a model.

Special thanks to Nickolas Younker (at LiquidWeb) for saving my behind and getting this all set up.

I was happy to help. Migrations are very complicated, especially when you have to move from one server to another and keep the site live in the face of massive traffic. I apologize that the DNS transition wasn't seamless, but I managed a full migration in under 30 minutes.

Doesn't load for me, mate. Also, if you put your email in your profile, I can send you an email instead of posting a comment here.

Same here. Unfortunate...

Sorry you two are still having issues. I had to open a new incognito window to get to it.

Sorry guys. Traffic killed the site. Booting up a new server

It's up; propagation should take a minute.

Do you have a cached copy?

En route

Does anyone know of any paper that describes a reproducible method of generating above-normal returns in mature western financial markets? Nope. Me neither.

Two obvious ones:

[1] Fama and French, 1992. The Cross-Section of Expected Stock Returns. http://faculty.som.yale.edu/zhiwuchen/Investments/Fama-92.pd...

[2] Jegadeesh and Titman, 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency . https://www.jstor.org/stable/2328882

These don't use machine learning but they do answer your question.

Do you consider cryptocurrency mature?

Nope. But even for those types of markets, I doubt there exists an academic paper describing a reproducible trading strategy/method with above-normal returns.

https://pdfs.semanticscholar.org/db15/1836543d8a70db1dabef3d... "Bayesian regression and Bitcoin"

> The strategy is able to nearly double the investment in less than 60 day period when run against real data trace.

http://cs229.stanford.edu/proj2015/029_report.pdf "Algorithmic Trading of Cryptocurrency Based on Twitter Sentiment Analysis"

http://journals.plos.org/plosone/article?id=10.1371/journal.... "When Bitcoin encounters information in an online forum: Using text mining to analyse user opinions and predict value fluctuation"

https://pdfs.semanticscholar.org/e065/3631b4a476abf5276a264f... "Automated Bitcoin Trading via Machine Learning Algorithms"

"Based on this price prediction method, we devise a simple strategy for trading Bitcoin. The strategy is able to nearly double the investment in less than 60 day period when run against real data trace."

From the paper "Bayesian regression and Bitcoin".

I think a more relevant question is: this is clearly not the case in the real world, so why does it appear to be like that?

But thanks for the links; I will enjoy reading through them.

Do you mean by "the real world" that Bitcoin price prediction does not work in the real world? Or that stock market prediction/strategies do not work in the real world?

Would be nice to see a standard academic platform for backtesting. Then a paper could say: "We submitted our implementation of this strategy to Backtest (which includes transaction costs and slippage)."

I think the pymc3 people have made something available
