Meanwhile, no one uses method X or any of its cousins, because in the real world the problem is solved very differently with a combination of both principled algorithms and heuristics derived from real-world datasets.
The paper also fails to give any theoretical reason or mathematical insight as to why their version of X is better.
Thus, it doesn't actually solve a real world problem OR advance scientific understanding.
As for solutions, it would be great if we focused less on the frequency at which we publish, and if editors were more willing to publish work with novel ideas even if it doesn't have state-of-the-art performance (yet). Like any job, there will always be parts that are tedious, political, and yes, even counterproductive. At some point, as an individual, you just have to play the game while still thinking about and trying to advance the bigger picture.
This sounds analogous to resume-driven development. And I completely understand it. I have pragmatically chosen the best tech for the job (factoring in the learning curve for new tech and what I already know) and haven't learned much new tech in the last couple of years. Now I get the feeling my CV is looking a bit dated. Does that make me worse at developing software? I would say not. I have mastery in a few areas rather than shallow knowledge in a lot. Will it make it harder for me to get a job? Quite possibly, if I keep going this way.
Let's try to have respect for each other on both sides of the "academia/industry" fence and not resort to unrealistic black and white views of the world.
That's complete BS. Certain journals are complete BS, and certain subfields of CS are superfluous and their publications often lack substance. But CS literature from top-tier journals has real merit.
I'd be curious to know how accurate the assertion is.
Though, I fully grant that the problem is as much that they are optimizing different things. Sometimes the goal was just faster training, not better accuracy. (Though I'm curious how fungible those are... Something that trains faster could ostensibly train for better accuracy in the same time, no?)
This seems like something an amateur could do. Wouldn't that tear down a bunch of this junk?
Right now you have a "reproduction renaissance" in psychology and cancer research, where you can get published, and get acknowledged, if you tear down a high-profile paper. But in general, unless you aim at refuting the highest-profile work, it is hard to get published and it advances your career nowhere, so people have no incentive to do it.
When I did medical research I wished people would try to replicate my findings so I could see how consistent the results were and how well my model fit future data... In fact it was really annoying that it was unlikely anyone would ever try to do a direct replication.
A single non-replication does not. Consistent failure to replicate, after communicating with the original author, does.
Or just google "reproducibility crisis", the two most common subjects that come up are cancer research and psychology research.
Because the incentives are to provide "new research", not to confirm or deny a known result.
It seems to me that, most probably, no one has ever gotten tenure by reproducing or disproving results.
I've spent an unreasonable amount of time reading research related to finance for my web app Piglet:
Long story short, all "research" is pretty much B.S.
I used to assume this is because people want to make money, so they keep the good analysis secret. However, after working in the industry a few years, I've found it's mostly because they just don't know how to apply the algorithms, or whether it's even possible.
I think my favorite example is the seminal paper on using Twitter sentiment to predict stock movement. They don't use a large enough data set, and more importantly they use Granger causality to identify "causality" between sentiment and stock value. They then claim they found a specific offset range with a p-value indicating the two are correlated... Of course you'll find a correlation when you look at two normalized signals and try to match them up.
Now, if they had not used the DJIA (Dow Jones Industrial Average) and had instead used 500 individual stocks, and found that Twitter sentiment correlated with stock values 90% of the time at offsets between 5 and 10 days, I'd argue they probably had something.
However, because their method is essentially a brute-force search over offsets between just two signals, they must correct the p-values for multiple comparisons, i.e. "look and you shall find."
This is just one of a hundred issues I've found, but it really sheds light on how bad that industry is.
Granger causality shifts two signals against each other to look for a correlation at some time offset, i.e. it finds "causality" by finding correlation at an offset in time.
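A minimal numpy sketch of that offset-scan on synthetic data (not the paper's actual method): slide one series against the other, record the correlation at each lag, and note how many lags were tested, since every single-lag p-value has to be corrected for that search (e.g. Bonferroni) before claiming anything.

```python
import numpy as np

rng = np.random.default_rng(0)
n, max_lag = 500, 10

# Two independent noise series: any "best lag" found below is spurious.
sentiment = rng.standard_normal(n)
returns = rng.standard_normal(n)

corrs = []
for lag in range(1, max_lag + 1):
    # Correlate sentiment at time t with returns at time t + lag.
    c = np.corrcoef(sentiment[:-lag], returns[lag:])[0, 1]
    corrs.append(c)

best_lag = 1 + int(np.argmax(np.abs(corrs)))
# Having scanned 10 lags, any single-lag p-value must be multiplied
# by 10 (Bonferroni) before declaring "causality" at that offset.
print(best_lag, max(np.abs(corrs)))
```

Even on pure noise, some offset always looks "best", which is exactly the "look and you shall find" trap.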
Easy there! When I start feeling cynical about things like almost every sentiment-analysis paper, I go back and read some of my favorite papers, especially ones from the 70s and 80s (e.g., the one linked below).
Another useful heuristic is to read the best paper award winners from conferences in a particular area. It's not a perfect metric, but the signal/noise ratio is better.
The Flajolet-Martin algorithm: http://algo.inria.fr/flajolet/Publications/FlMa85.pdf
I'm curious about this as it relates to Piglet. Are you monetizing on the idea that most people won't realize this and will pay for Piglet in the hope that the signals they get there are somehow positively correlated with what's happening in the real world for a particular stock/idea/company/political event, when in reality you don't think they really are?
Importantly, I'm providing access to signals first with explanation(s) of what each signal is / how it's useful today.
I also have an investment model currently averaging 50%+ YoY returns (backtested over 8 years, live for 4 years).
From there, I'm building further machine learning models on top of the signals. However (as mentioned above regarding the sentiment analysis paper), I'm doing much more robust research. For example, comparing all stocks as opposed to just the DJIA.
Most machine learning techniques focus on problems where the signal is very strong, but the structure is very complex. For instance, take the problem of recognizing whether a picture is a picture of a bird. A human will do well on this task, which shows that there is very little intrinsic noise. However, the correlation of any given pixel with the class of the image is essentially 0. The "noise" is in discovering the unknown relationship between pixels and class, not in the actual output.
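A toy illustration of that point, on synthetic data: make the label the XOR of two binary "pixels". Each pixel on its own is essentially uncorrelated with the label, yet together the two pixels determine it exactly; the difficulty lies entirely in the relationship, not in the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p1 = rng.integers(0, 2, n)
p2 = rng.integers(0, 2, n)
label = p1 ^ p2  # fully determined by the two pixels together

# Marginal correlation of each individual pixel with the label is ~0...
c1 = np.corrcoef(p1, label)[0, 1]
c2 = np.corrcoef(p2, label)[0, 1]

# ...but a model that sees both pixels jointly predicts perfectly.
pred = p1 ^ p2
accuracy = (pred == label).mean()
print(c1, c2, accuracy)
```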
Noise dominates everything you will find in statistical arbitrage. R^2 of 1% are something to write home about. With this amount of noise, it's generally hard to do much better than a linear regression. Any model complexity has to come from integrating over latent parameters or manual feature engineering, the rest will overfit.
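To make the "R^2 of 1%" point concrete, here's a synthetic example: a linear signal calibrated to explain about 1% of the variance still leaves residuals that dwarf the fit, which is why added model complexity mostly fits that residual noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
signal = rng.standard_normal(n)

# beta chosen so the signal explains ~1% of total variance:
# R^2 = beta^2 / (beta^2 + 1) ~= 0.01 for beta = 0.1
beta = 0.1
returns = beta * signal + rng.standard_normal(n)

r2 = np.corrcoef(signal, returns)[0, 1] ** 2
print(r2)  # ~0.01: a "good" stat-arb R^2, i.e. 99% of the variance is noise
```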
I think Geoffrey Hinton said that statistics and machine learning are really the same thing, but since we have two different names for it, we might as well call machine learning everything that focuses on dealing with problems with a complex structure and low noise, and statistics everything that focuses on dealing with problems with a large amount of noise. I like this distinction, and I did end up picking up a lot of statistics working in this field.
I'll regularly get emails from friends who tried some machine learning technique on some dataset and found promising results. As the article points out, these generally don't hold up. Accounting for every source of bias in a backtest is an art. The most common mistake is to assume that you can observe the relative price of two stocks at the close, and trade at that price. Many pairs trading strategies appear to work if you make this assumption (which tends to be the case if all you have are daily bars), but they really do not. Others include: assuming transaction costs will be the same on average (they won't, your strategy likely detects opportunities at time were the spread is very large and prices are bad), assuming index memberships don't change (they do and that creates selection bias), assuming you can short anything (stocks can be hard to short or have high borrowing costs), etc.
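The close-price mistake can be demonstrated in a few lines on a synthetic random walk: a signal computed from today's close looks wonderful if you pretend you could trade at that same close, and evaporates once you can only trade the next bar.

```python
import numpy as np

rng = np.random.default_rng(7)
rets = rng.standard_normal(2_000) * 0.01  # daily returns of a random walk

# "Signal": the sign of the most recent return (pure noise here).
signal = np.sign(rets)

# Lookahead backtest: apply the signal to the same bar that produced it.
pnl_cheat = signal * rets             # profitable by construction

# Honest backtest: the earliest you can act on the signal is the next bar.
pnl_real = signal[:-1] * rets[1:]     # ~zero edge, as expected

print(pnl_cheat.mean(), pnl_real.mean())
```

The same shift-by-one discipline applies to every other bias in the list: the backtest may only use information that was observable before the trade.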
In general, statistical arbitrage isn't machine learning bound(1), and it is not a data mining endeavor. Understanding the latent market dynamics you are trying to capitalize on, finding new data feeds that provide valuable information, carefully building out a model to test your hypothesis, and deriving a sound trading strategy from that model: that is how it works.
(1: this isn't always true. For instance, analyzing news with NLP, or using computer vision to estimate crop outputs from satellite imagery can make use of machine learning techniques to yield useful, tradeable signals. My comment mostly focuses on machine learning applied to price information. )
Although I don't necessarily agree with all the points in this article, it reminds me of what Poincaré said:
"You ask me to predict for you the phenomena about to happen. If, unluckily, I knew the laws of these phenomena I could make the prediction only by inextricable calculations and would have to renounce attempting to answer you; but as I have the good fortune not to know them, I will answer you at once. And what is most surprising, my answer will be right."
-- Poincaré, H. (1913) The Foundations of Science. New York, The Science Press. p. 396.
That said, quant researchers typically do come to understand how the market works; they're just able to get up to speed quickly without a prior background in it.
You're used to seeing the techniques you work with capture signal because there isn't an army of PhDs in math, physics, and computer science working around the clock to trade every last signal out of your data.
In the end, it doesn't even matter if you're the best statistician in the world: whatever signal you detect may simply not be worth the effort you put into detecting it.
Could it be that by looking only at the price timeseries we are not looking at the actual information but only at the output of an irreversible function, and that to effectively predict prices we need a model that captures what actually happens in the real world?
It's fairly true that (at least these days) you're not going to identify a signal just by looking at a timeseries of prices, no matter how granular your dataset is (up to and including tick data). There are pockets of repeating patterns but those are vanishingly small and fleeting; the prices themselves may as well be stochastic.
Essentially all funds are empowered with significant amounts of data, and the prices themselves are just used for backtesting and sanity checking. They're the source of truth, but not the way in which new insights are identified. The signal comes from other types of data that are far more reversible.
The article cited in the OP says:
> Our historical dataset contains 5 minute mid-prices for 43 CME listed commodity and FX futures from March 31st 1991 to September 30th, 2014. We use the most recent fifteen years of data because the previous period is less liquid for some of the symbols, resulting in long sections of 5 minute candles with no price movement. Each feature is normalized by subtracting the mean and dividing by the standard deviation. The training set consists of 25,000 consecutive observations and the test set consists of the next 12,500 observations.
(excerpt from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2756331)
That sounds to me as if the only data they trained the model on is a bunch of prices.
Am I reading it correctly?
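For what it's worth, the per-feature normalization the excerpt describes is a plain z-score. A sketch, using training-window statistics only (so nothing from the test period leaks backwards):

```python
import numpy as np

def zscore(train, test):
    """Normalize both sets using statistics from the training set only."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (test - mu) / sigma

# Stand-in for a series of 5-minute mid-prices (37,500 = 25,000 + 12,500).
prices = np.linspace(100.0, 110.0, 37_500)
train_z, test_z = zscore(prices[:25_000], prices[25_000:])
print(round(train_z.mean(), 6), round(train_z.std(), 6))  # ~0 and ~1
```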
We tried sentiment analysis in uni a few years ago and had no good results:
The idea was essentially: news says: 'stock A is great' -> it goes up shortly thereafter
We tested our algorithms against classifying Amazon reviews & Tweets by sentiment. Those are filled with sentiment and easy to detect if it is a 5 star review or a 1 star review. The news articles we parsed all had near neutral sentiment. We ended up building a classifier that could detect the news category of an article quite easily instead.
My initial idea was sparked by the Gulf spill and the subsequent dip in BP; I wanted to detect and capitalise on big events like that, but the news sources we parsed always seemed to significantly lag behind the stock movement, too.
I scraped the above sources over a full year (2015) and then had the data annotated as positive, negative, or neutral sentiment.
The problem with labeling sentiment data is that there might not be a single 'true' label, due to varying interpretations and ambiguity. So at best you'll get to 80-85% accuracy there. The less formal the source (News > Reddit/Forum > IRC), the lower your accuracy, due to lack of context.
I then matched the annotated sentiment to market data and did some causality analysis. Interestingly, what I found is that you can't just say positive news = price/volume goes up. It is far more fine-grained than that. For example, negative Reddit sentiment leads price movements, but price movements lead positive sentiment. For news it's the reverse.
All in all I didn't incorporate this into any trading strategies, but found it interesting to see the differences between online sentiment channels.
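That lead/lag asymmetry can be checked with a simple cross-correlation at a one-step offset in each direction. A sketch on synthetic data, constructed so that sentiment genuinely leads price by one day:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
sentiment = rng.standard_normal(n)
noise = rng.standard_normal(n)

# Construct returns so that *yesterday's* sentiment moves today's price.
returns = np.empty(n)
returns[0] = noise[0]
returns[1:] = 0.5 * sentiment[:-1] + noise[1:]

# Sentiment leading returns: corr(sent_t, ret_{t+1}) -- strong here.
lead = np.corrcoef(sentiment[:-1], returns[1:])[0, 1]
# Returns leading sentiment: corr(ret_t, sent_{t+1}) -- ~zero here.
lag = np.corrcoef(returns[:-1], sentiment[1:])[0, 1]
print(lead, lag)
```

Comparing the two directions per channel (Reddit vs. news) is essentially the analysis described above.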
You'd need to measure this in terms of the average volume of news across all tracked market participants, and relative to the normal volume of news for that particular stock.
That way, if the return has recently gone up and there is a substantial volume increase, it's probably because of positive news.
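One way to sketch that "relative to normal volume" idea (with hypothetical counts): flag a stock only when today's news count sits several standard deviations above its own trailing baseline.

```python
import numpy as np

def news_volume_zscore(counts, window=20):
    """Z-score of today's news count vs. the trailing window's baseline."""
    counts = np.asarray(counts, dtype=float)
    baseline = counts[-window - 1:-1]  # trailing window, excluding today
    return (counts[-1] - baseline.mean()) / (baseline.std() + 1e-9)

# 20 quiet days of ~5 stories for this stock, then a 40-story spike today.
history = [5, 6, 4, 5, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 5, 6, 4, 5, 5, 6, 40]
z = news_volume_zscore(history)
print(z)  # large positive: unusually heavy news volume for this stock
```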
In other words, Rentec is not just pointing machine learning models at data, they're investing in a very robust data processing pipeline. Everything before the analysis is just as important as the analysis at funds like theirs.
While Jim Simons is a mathematician and Rentec has clearly hired many brilliant people with PhDs, it's worth mentioning that the actual mathematics used in their work isn't impossibly difficult or secret. Many of the PhDs working there aren't in math but in fields like physics, so if you are familiar with graduate-level math courses, you can understand the math needed for this type of work. Math isn't where their edge comes from. Medallion is also 30 years old; their early work in the mid-80s was done on computers with less processing power than your phone, so "machine learning" as the term is used lately, or access to supercomputing hardware no one else has, is also not where their edge came from.
They need smart people, but hiring the smartest people and having the most sophisticated models won't do you any good if you can't acquire high signal data, can't clean that data properly and can't rapidly backtest. And if you can't do any of that, adding more data is just going to add more noise.
By the way, for anyone reading this, Rentec is hiring; you need to know C/C++, and a "basic knowledge of statistics" is preferred but not required.
> Is their advantage in finding signals in data that other people overlook, or finding new data sources?
I won't go into any particular detail, but there is a lot of signal in the market for those who are imaginative. The obvious and low hanging fruit is long gone, but there are still many places that offer an edge.
The industry is also extremely secretive (necessarily so). You'll hardly ever find a good treatise on finding novel signals, but there are tutorials on algo trading in general with examples of production strategies that used to work which have been, as you say, traded away. For that purpose I'd recommend you start here: http://www.decal.org/file/2945
ML can be very useful if you have some signal or if you have a model.
Fama and French, 1992. The Cross-Section of Expected Stock Returns. http://faculty.som.yale.edu/zhiwuchen/Investments/Fama-92.pd...
Jegadeesh and Titman, 1993. Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency. https://www.jstor.org/stable/2328882
These don't use machine learning but they do answer your question.
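The Jegadeesh-Titman result boils down to a simple cross-sectional recipe, sketched here on synthetic data (not their actual dataset): rank stocks by trailing return, then go long the top decile and short the bottom decile, dollar-neutral.

```python
import numpy as np

rng = np.random.default_rng(3)
n_stocks = 500
trailing_ret = rng.standard_normal(n_stocks)  # stand-in for 12-month returns

ranks = np.argsort(trailing_ret)              # ascending by trailing return
decile = n_stocks // 10
losers, winners = ranks[:decile], ranks[-decile:]

# Dollar-neutral momentum portfolio: +1/decile in winners, -1/decile in losers.
weights = np.zeros(n_stocks)
weights[winners] = 1.0 / decile
weights[losers] = -1.0 / decile
print(weights.sum(), np.abs(weights).sum())  # net ~0, gross leverage 2
```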
> The strategy is able to nearly double the investment in less than 60 day period when run against real data trace.
http://cs229.stanford.edu/proj2015/029_report.pdf "Algorithmic Trading of Cryptocurrency Based on Twitter Sentiment Analysis"
http://journals.plos.org/plosone/article?id=10.1371/journal.... "When Bitcoin encounters information in an online forum: Using text mining to analyse user opinions and predict value fluctuation"
https://pdfs.semanticscholar.org/e065/3631b4a476abf5276a264f... "Automated Bitcoin Trading via Machine Learning Algorithms"
From the paper "Bayesian regression and Bitcoin".
I think a more relevant question is: this is clearly not the case in the real world, so why does it appear to be?
But thanks for the links; I will enjoy reading through them.