I find it interesting that Computer Scientists are basically rediscovering statistics.
Now when predicting time series, an issue is that most models (like ARIMA, GARCH, etc.) are short-memory processes. When you look at the full-series predictions of LSTMs, you observe the same thing.
So in terms of Time Series, Machine Learning is currently in the mid to late 80's compared to Financial Econometrics.
So if you come from CS, you should probably take a look at fractional GARCH models and incorporate that into the LSTM logic. If the statistical issues are the same, this may give you that hot new paper.
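If you want to poke at this, here's a minimal sketch with the Python `arch` package. The GARCH(1,1) baseline is standard; whether FIGARCH is exposed depends on the package version, hence the guard:

    # Sketch: short-memory GARCH(1,1) vs. a long-memory (fractionally integrated) FIGARCH fit.
    # Assumes the `arch` package (pip install arch); the data here are just placeholders.
    import numpy as np
    from arch import arch_model

    rng = np.random.default_rng(0)
    returns = 100 * rng.standard_normal(2000)  # stand-in for daily % returns

    garch = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
    print(garch.params)

    try:
        figarch = arch_model(returns, vol="FIGARCH", p=1, q=1).fit(disp="off")
        print(figarch.params)  # the fractional-integration parameter d is the long-memory part
    except ValueError:
        print("this arch version doesn't expose FIGARCH")

The point of the comparison is the d parameter: it lets volatility shocks decay hyperbolically rather than geometrically, which is exactly the long-memory behaviour the short-memory models miss.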
It's been amazing to watch CS (really the Python community, save statsmodels and patsy) discover statistics. For a while I thought perhaps it was me and statistics that were "behind." Over time I realized that it was mostly re-invention of old ideas: one-hot encoding = dummy variables, neural networks approximating polynomial regression, etc. I decided to double down on statistics and it's really paid off. NNs / random forests and the other stats-founded but CS-led approaches are very general models. That leaves statisticians a big opening, because a more specific model can be chosen to obtain more accurate predictions. These days I'm positioning myself to clean up the messes / save broken ML models. Turns out [stats] theory is very practical. :-)
Because saying "relevant username" is frowned upon I'll just point out that R A Fisher is "a genius who almost single-handedly created the foundations for modern statistical science"[0]
It’s funny to me, as a professional statistician, because most methods popularized by Fisher et al. in the early 1900s are wildly inappropriate for practical problems, especially policy decision science or causal inference.
All the theory behind t-testing, Wald testing, using the derivatives of the log likelihood near the MLE point estimate to estimate standard errors when no analytical solution exists, ANOVA, instrumental variables, etc.
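For concreteness, that Hessian recipe is roughly the sketch below (made-up data, mean and scale of a normal): maximize the log likelihood numerically and read the standard errors off the inverse Hessian at the optimum.

    # Sketch of Hessian-based ("Wald") standard errors at the MLE. Toy data only.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=3.0, size=500)

    def neg_log_lik(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)
        return (0.5 * np.sum(((x - mu) / sigma) ** 2)
                + x.size * log_sigma
                + 0.5 * x.size * np.log(2 * np.pi))

    res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), method="BFGS")
    se = np.sqrt(np.diag(res.hess_inv))  # BFGS's approximate inverse Hessian -> Wald SEs
    print("MLE (mu, log sigma):", res.x, "approx SE:", se)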
It is in no sense exaggerated or incendiary to say that whole collection of stuff is truly garbage statistics: it is insanely rife with counter-intuitive results, there are common situations where minor violations of the assumptions can easily lead to statistically significant results of the wrong sign, and common practical needs (like model selection without doing a bunch of pairwise or subset-selection calculations, or correcting for multicollinearity in large regressions where calculating something like variance inflation factors is totally intractable) are difficult or impossible.
Modern Bayesian approaches fully and entirely subsume these techniques, and not just for large data (in fact, using Bayesian methods is more critical for small data), and not because of modern computing frameworks, but because, from the very first principles of null-hypothesis significance testing, that whole field of stats/econometrics is fundamentally incapable of giving evidence or estimates that could address the very questions the whole field is based on.
NHST basically solves a type of inference problem that nobody can ever actually have in reality, and which is almost never even approximately close enough to the real problem to avoid being misleading.
NHST is like the stats analogue of Javascript: a horrible historical accident that gained market traction despite being utterly and unequivocally a bad choice for the very problem domain it’s intended to be used for. The historical accident of adoption and momentum in Javascript sets back professional computer science by decades until it’s eventually wholesale replaced with something whose first principles are actually appropriate.
That same reckoning is under way in many fields of statistics, as the fundamental unreliability of NHST estimation becomes better understood and drop-in Bayesian replacements become more available.
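By "drop-in" I mean things like the sketch below: a Bayesian two-group comparison in PyMC instead of a two-sample t-test. The priors are illustrative placeholders, not recommendations.

    # Sketch: Bayesian alternative to a two-sample t-test using PyMC. Toy data.
    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(2)
    a = rng.normal(1.0, 2.0, size=40)
    b = rng.normal(1.5, 2.0, size=40)

    with pm.Model():
        mu_a = pm.Normal("mu_a", 0, 10)       # placeholder priors
        mu_b = pm.Normal("mu_b", 0, 10)
        sigma = pm.HalfNormal("sigma", 5)
        pm.Normal("obs_a", mu_a, sigma, observed=a)
        pm.Normal("obs_b", mu_b, sigma, observed=b)
        pm.Deterministic("diff", mu_b - mu_a)
        idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

    # A posterior for the difference in means, instead of a p-value:
    print(idata.posterior["diff"].mean().item())
    print(idata.posterior["diff"].quantile([0.025, 0.975]).values)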
I don't disagree with anything you've written. The only thing I'd take issue with is placing NHST at the feet of statisticians. Scientists deserve a fair share as well. :-p
... and someone who would be very difficult to out-asshole or to outdo in male chauvinism.
Those are criticisms of personality; on the technical side, it took the community a long time to undo the damage of promoting non-robust parametric statistics. But this much is certain: he pulled statistics into the realm of math -- no mean feat.
I know a handful of Econ PhDs working in data science, and Google, FB, etc. have hired top economists as well.
The Phineas Gage of applied quantitative Econ is demand estimation. You typically want to know the elasticity of quantities sold with respect to price in order to inform pricing policies. But the problem is that causality is cloudy -- low prices cause a decrease in supply -- so you never know what you're looking at.
People with a decent training in econometrics know how to treat this problem.
I'm pretty sure orgs like Amazon were trying to do naive demand estimation, fell flat on their faces, and copped to having to hire people who had thought about the underlying conceptual issues before.
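If anyone wants to see what that treatment looks like in code, the textbook move is two-stage least squares with a supply-side instrument (e.g., a cost shifter). A minimal sketch on simulated data, assuming the `linearmodels` package; the instrument and coefficients here are made up:

    # Sketch: naive OLS vs. IV/2SLS demand estimation on simulated data.
    # `cost` shifts supply only, so it can serve as an instrument for (endogenous) price.
    import numpy as np
    import pandas as pd
    from linearmodels.iv import IV2SLS

    rng = np.random.default_rng(3)
    n = 5000
    cost = rng.normal(size=n)            # supply-side cost shock (the instrument)
    demand_shock = rng.normal(size=n)    # unobserved demand shifter
    price = 1.0 + 0.8 * cost + 0.5 * demand_shock + rng.normal(size=n)
    quantity = 5.0 - 1.5 * price + 2.0 * demand_shock + rng.normal(size=n)

    df = pd.DataFrame({"q": quantity, "p": price, "cost": cost, "const": 1.0})

    ols = IV2SLS(df["q"], df[["const", "p"]], None, None).fit()   # biased: price is endogenous
    iv = IV2SLS(df["q"], df["const"], df["p"], df["cost"]).fit()  # instrumented
    print("OLS slope:", ols.params["p"], " 2SLS slope:", iv.params["p"])  # 2SLS should be near -1.5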
I'm curious what resources you found useful to learn stats modelling and what sorts of approaches have been useful.
On one hand, it's almost a tautology that specific models should be better than general models, but I worked on some 2D time series classification with a statistician and afterwards, for kicks, I replaced the entire thing with a CNN+LSTM and it worked just as well as the whole complicated model he had come up with.
On the other hand, the "more ignorant CS approach" has produced impressive achievements in language tasks (e.g., translation), visual tasks (e.g., image generation), game playing tasks (e.g., Go), agent-in-virtual-world tasks (e.g., DOTA), and robot-in-real-world tasks (e.g., self-driving cars).
Academic statistics departments often seem to be "20 years behind" on all those fronts...
I don't think it's entirely fair to say "Computer Scientists are basically rediscovering statistics". LSTMs are used beyond just time series prediction. They are also quite common in language modelling, which is also a sequence modelling task, and where they work quite well. I'm not familiar at all with using GARCH/ARIMA for something like this.
Also, with neural networks it's very easy and natural to build complex models where different "layers" perform different tasks. So an LSTM can very easily be extended to work bi-directionally (reading the sequence from both the beginning and the end), adding things like attention, using word vectors before the recurrent network, or just using a character model.
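For instance, a minimal Keras sketch of that kind of stack (word embeddings feeding a bidirectional LSTM for sequence classification); the vocabulary size, sequence length, and layer widths are arbitrary placeholders:

    # Sketch: embedding -> bidirectional LSTM -> classifier in Keras.
    from tensorflow.keras import layers, models

    vocab_size, seq_len = 20000, 100

    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 128),          # word vectors
        layers.Bidirectional(layers.LSTM(64)),      # reads the sequence from both ends
        layers.Dense(1, activation="sigmoid"),      # e.g. binary sentiment
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()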
What are the statistical equivalents for this? Because most of the papers on this topic seem to come from Computer Science. Take a look at the epilogue of [1] for a thorough discussion on where statistical theory needs to catch up.
[1] Computer Age Statistical Inference - Efron, Hastie.
No, it wouldn't. Firstly, nonparametrics in general can be a little misleading. The most common instantiations place function ("process") priors on modeling decisions that are otherwise found through trial and error. Those process priors do have their own parameters though. But more importantly, LSTMs and neural networks are very much parametric - their success comes from the advances in computing and optimization that have enabled estimating these parameters in very complicated model structures.
Your CS term for parametric is not quite one-to-one with the statistics usage of parametric.
Also, what you're describing is very similar to Bayesian statistics.
> But more importantly, LSTMs and neural networks are very much parametric - their success comes from the advances in computing and optimization that have enabled estimating these parameters in very complicated model structures.
Which, for a statistician, is basically a black box and nonparametric, since you have no idea what the distribution is and there is no assumption of a distribution. Hence nonparametric statistics, which is the answer to the question you asked.
Why are you talking about priors? Nonparametric vs parametric is an axis completely orthogonal to Bayesian vs frequentist.
We weren't talking about the "success" though, I was responding to the question "where in the body of stats literature would a neural net model lie".
I argue that would be non-parametric stats. In parametric stats the limit of (#params/#data) goes to 0. For models where this is not the case, statisticians and probabilists call them non-parametric (and in certain cases semi-parametric) models. Neural nets, especially the deep kind (and certainly not the single-layer kind), have the property that #params/#data is finite and large.
I agree with your sentiments, but there is a contribution that the CS departments made that the statistics, math, and Econ (as in econometrics) departments seem to have overlooked. I remember going to each of these departments in 2002 and asking them why don't we split the data sets to train and update the coefficients and automate the process. The answer was always the same: "that's trivial and adds nothing to the field".
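For what it's worth, the thing I was asking about back then is now a few lines of scikit-learn. A rough sketch, with a placeholder model:

    # Sketch: split, fit, evaluate; "automate the process" is just re-running this on new data.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(4)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=1000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = Ridge(alpha=1.0).fit(X_train, y_train)        # "train and update the coefficients"
    print("held-out R^2:", model.score(X_test, y_test))   # out-of-sample check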
> why don’t we split the data sets to train and update the coefficients and automate the process.
What you just stated is just a pipeline. You can split the data, train it, and automate it with tree ensembles that aren't boosting -- that is, if you're talking about doing it in parallel.
If you're just saying split and process in batches over different time intervals, you can do that with nonparametric Bayesian methods.
The CS contribution of creating deep learning and making it the most accurate algorithm for certain data domains is pretty nice. But again, stats cares about a lot more than prediction.
I think that ML is very useful, but remember that forecasting is really not the main objective of econometric models.
Basically, forecasting implies you have a good handle on all properties of the relevant distributions, which in my opinion is a lost cause in social sciences (think external validity).
Instead, econometrics is nowadays mainly concerned with the identification of causal effects using non-parametric or semi-parametric approaches. Basically, you can believably estimate the directionality of some mechanism, but you probably never have the data or model to make a good out-of-sample prediction. You can, but it's basically implied that approaches that consistently estimate some marginal of a conditional expectation will NOT be that useful for predicting a whole stochastic process.
Also, using training and test sets kind of presupposes that your process is very stable. Otherwise the "test" set is not really a good test, is it? Again, in social sciences these things are hard to argue.
You usually wanna generalize some mechanism from this industry to that industry, not find a good predictor in the same industry.
Test datasets still run on the same data!
ML is successful because in practice we DO care about prediction. This allows us to do all the cool things. Because econometrics/stats is so conservative and comes from a causal standpoint, people are just really shy to develop a model for prediction (not everywhere true, but that's the gist).
For ML, the primary question is basically how well the thing predicts. When I first tried scikit-learn way back, I was so confused that it didn't offer standard errors or some other statistical measure. But then I saw how ingrained the in-sample/out-of-sample process is and I thought, well - that's really useful.
tl;dr: Stats and ML have different objectives, but there is a lot to learn in stats for ML
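A tiny illustration of that difference in emphasis: the same toy regression seen through a stats lens (standard errors, p-values) and an ML lens (held-out predictive score). Neither snippet is meant as the "right" workflow.

    # Sketch: one regression, two lenses. Toy data.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 3))
    y = 2.0 + X @ np.array([1.0, 0.0, -1.5]) + rng.normal(size=500)

    # Stats view: inference on coefficients (standard errors, p-values)
    print(sm.OLS(y, sm.add_constant(X)).fit().summary())

    # ML view: out-of-sample prediction quality
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    print("test R^2:", LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))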
You shouldn't listen to N. Taleb on technical matters. He's been a crank of the classic mold for the last decade or so when it comes to anything serious, relegated instead to writing fluffy books on whatever he thinks is important.
GARCH, like I said, is a short-memory process and is inherently inadequate for (longer) out-of-sample predictions. Doing this is possible, but not really correct. Taleb is basically right; of course, what he says is probably inflammatory and half wrong, as usual.
Don't forget that most econometrics models are also concerned with identification and causality, less with prediction.
Why does everyone naively try to predict price? No ‘traders’ are interested in predicting it - what traders do is identify good locations to enter or exit the market.
I.e., places with defined risk, where you know you're wrong if it goes against you by x%, while you expect a y% gain if you're right, AND the y > x payoff is worth more than the number of times you're wrong.
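In other words, the arithmetic is just per-trade expectancy; a toy sketch, where the win rate p is whatever your edge actually delivers:

    # Sketch: expectancy of a defined-risk setup. Trade only if this is > 0 (before costs).
    def expectancy(p_win: float, gain_pct: float, loss_pct: float) -> float:
        return p_win * gain_pct - (1 - p_win) * loss_pct

    # e.g. risking 1% to make 3%, right only 40% of the time:
    print(expectancy(0.40, 3.0, 1.0))  # 0.6% expected per trade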
The types of Algos that work well for this are edge identification ones - I know this because I am (not as well as I’d like) successfully doing it.
LSTMs haven’t performed so well for me in this task but non-NN algos have. CNNs however were promising but didn’t match what I’d come up with - still searching for the holy grail that’ll make me rich!
.. because you buy at the price, and sell at the price (spread and fees ignored for now).
Which means, regardless of your philosophy, you are predicting a price change - a long signal is a prediction for positive price change; a short signal is a prediction for a negative price change. If that wasn’t true, your system would not be able to profit.
Predicting price change and predicting price are semantically equivalent, although a specific algorithm might be better at one than the other.
Predicting price means you're predicting one variable with no idea of how likely you are to be wrong or how wrong you're likely to be, and it says nothing of where you expect the price to go after.
It is semantically different to say: if price goes to Y then you have odds that it will then go to Target 1 and then slightly lower odds it goes to Target 2.
Do you long now, or short now, or neither? If you go long, you’ve predicted price goes higher. Regardless of what you think about the entire path. Mathematically nothing else makes sense.
Replying to myself instead of all three replies so far:
People, prediction is a general term. Many predictors come with accuracy estimates (and outside of finance, often prediction bounds). But even if it was only one number - if you have a good prediction of the expected price change, that could be sufficient to trade, as it encompasses, by definition, the sum of the probabilities of different outcomes times their magnitudes.
Either E[price] or E[log price] is a single predicted value you can successfully trade with as long as you are far from your margins, and depending of course on your utility functions.
But as I mentioned, in most fields, when you talk of a “predictor”, that’s not a single number but also accuracy estimates or even a full fledged probability distribution of future events.
Right, but that's not what this article (and several others I've seen) is doing - they are measuring the performance of some AI by putting some unnecessary (and quite likely impossible) constraints on what it is supposed to be outputting.
All I'm pointing out is that measuring any stock trading algo by treating it as a regression problem for the exact next time step is a naive approach - that's not the same thing as what human traders are doing.
Anyway, if you are trading then I wish you lots of success.
> Why does everyone naively try to predict price? No ‘traders’ are interested in predicting it - what traders do is identify good locations to enter or exit the market.
I agree.
I've built many systems in this area, but it wasn't until I started working in the Indian market (>10 yrs ago) that it became abundantly clear that trying to calculate the long/shorts signals using historical (/time series) data was a waste of time. (And yet my primary role was to provide tools that did exactly that).
Back then, in the Indian market, you could see that most of the stocks, although skyrocketing upwards, all followed the slow vs. fast moving averages to buy and sell! Back then, they weren't looking at RSI, stochastics, support lines, etc. It was crazily predictable... but over time it was really interesting to see it become more haphazard and like Western stocks. That is, the fundamentals came into play and, as you say, the traders began to use other metrics to buy and sell.
> I've built many systems in this area, but it wasn't until I started working in the Indian market (>10 yrs ago) that it became abundantly clear that trying to calculate the long/shorts signals using historical (/time series) data was a waste of time. (And yet my primary role was to provide tools that did exactly that).
I'm currently working on building similar tools in my area of work for the Indian market and would really appreciate if you could shed some more light into the things you learned from your experience in working in this domain.
Better, why do people think they can predict the long-term dynamics of what is basically a chaotic system? I think finding some small local dynamics is fine, but applying a neural network to try to say something about global long-term dynamics is complete garbage - i.e., long-term weather simulation, stock markets, etc.
Chaotic systems can be deterministic, just that you will never be able to accurately measure all the variables to make a long term prediction accurately.
In the weather example, people know the equations that approximate how it works. What value does a neural network bring? Knowing the equations is better understanding.
There was a Quanta Magazine article talking about predicting the evolution of a flame-front using ML. The ML algorithm remained accurate for eight Lyapunov intervals; eight times longer than the previous SOTA.
The only real argument I think you can make is that it might be more efficient to have a neural network quickly spit out an approximate solution instead of solving the actual equations.
But if you have the time, having the actual equations is more valuable?
For anyone considering this, LSTM only starts to pay off if you have many many time series. For a single time series like this one you’re better off using classical time series approaches like ARIMA or other Gaussian state space models.
I've built quite a few of these kinds of models. The real trick is to compare it against other methods AND to properly split A LOT of data. In many cases, (depending on the input data) a random walk does roughly as well as "predicting". This is because signal data (such as stock data) often just follow a random (or seemingly random) trend.
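A minimal sketch of that kind of baseline check, assuming statsmodels: a naive random-walk ("last value carried forward") forecast next to an ARIMA fit on a held-out tail. The order and the toy series are placeholders.

    # Sketch: is the model actually beating a random-walk forecast on held-out data?
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(6)
    series = np.cumsum(rng.normal(size=1200))   # a random walk, on purpose

    train, test = series[:1000], series[1000:]

    naive = np.full_like(test, train[-1])                            # last value carried forward
    arima = ARIMA(train, order=(1, 1, 1)).fit().forecast(steps=len(test))

    print("naive MSE:", np.mean((test - naive) ** 2))
    print("ARIMA MSE:", np.mean((test - arima) ** 2))   # on a true random walk, roughly a tie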
Yep, it is dangerous. If you're not quantifying uncertainty, you can't make safe predictions. I think this is the reason for the obsession with "data cleaning" in the ML community: "outliers", a.k.a. rare observations, sink general models.
I see here that the original poster (OP) tried to use many-to-one LSTMs instead of many-to-many LSTMs. I could tell that first by looking at the charts. Then I saw the method named "predict_point_by_point" with the comment "Predict each timestep given the last sequence of true data, in effect only predicting 1 step ahead each time" in his code here: https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Se...
Well, glad to see that work similar to mine can get this much traction on HN. I would have loved to get this much traction when I did my post, too. Anyway, I would suggest OP take a look at seq2seq, as it objectively performs better (and without the "laggy drift" visual effect observed in OP's figure named "S&P500 multi-sequence prediction").
In other words, using many-to-one architectures creates a kind of feedback loop that doesn't happen with seq2seq, since seq2seq doesn't build on its own accumulated error. It has a decoder with different weights than the encoder, and can be deep (stacked).
The aim of this post is to explain why sequence-to-sequence models appear to perform better than many-to-one RNNs on signal prediction problems. It also describes an implementation of a sequence-to-sequence model using the Keras API.
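Not the post's exact code, just a minimal sketch of the encoder-decoder shape in Keras, with placeholder window lengths and widths:

    # Sketch: many-to-many (seq2seq) forecaster. The encoder LSTM summarizes the input window,
    # a separate decoder LSTM emits the whole output window at once, so errors aren't fed back
    # step by step the way they are in many-to-one prediction loops.
    from tensorflow.keras import layers, models

    in_len, out_len, n_features = 50, 10, 1

    model = models.Sequential([
        layers.Input(shape=(in_len, n_features)),
        layers.LSTM(64),                                   # encoder
        layers.RepeatVector(out_len),                      # hand the summary to each decoder step
        layers.LSTM(64, return_sequences=True),            # decoder (its own weights)
        layers.TimeDistributed(layers.Dense(n_features)),  # one prediction per output step
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()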
I'm currently learning machine learning at the most basic level; this is the sort of stuff I want to work towards, though.
I deal with time series data a lot at work. I work in broadcasting/media, and 99% of the time the data is fairly "predictable" and follows a regular daily pattern, peppered with the odd spikes during big, unpredictable news events.
A year ago, the original blog post [1] (it was just recently updated, which is now the one linked here on HN) helped me on a semester thesis, where I quite successfully used LSTM for short-term electricity load forecasting, which also has very strong daily, weekly and seasonal patterns. I used multiple features/variables such as calendar and weather data and found the LSTM models to easily beat ARIMA/TBATS forecasts.
You can find the code repo on my Github link [2], but please bear with the code quality. I only have an economics background, so my coding experience is fairly limited :)
Well, I don't want to be pedantic, but don't you rather mean "most TSA MODELS require data to be stationary"? My experience has been that practical TSA often actually involves how to deal (testing, differencing, smoothing...) with non-stationarity, which is often not a trivial task...
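For example, the usual test-then-difference step, sketched with statsmodels on a toy series:

    # Sketch: ADF unit-root test before and after differencing.
    # ADF null hypothesis: the series has a unit root (is non-stationary).
    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(7)
    series = np.cumsum(rng.normal(size=500))   # non-stationary by construction

    stat, pvalue, *_ = adfuller(series)
    print("ADF p-value (levels):", pvalue)             # large -> can't reject a unit root

    stat, pvalue, *_ = adfuller(np.diff(series))
    print("ADF p-value (first difference):", pvalue)   # small -> differencing helped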
Exactly. Judea Pearl's The Book of Why opened my eyes to the fact that most of what happens in machine learning is really just curve fitting.
It connected with what I've heard Chomsky say about trying to develop laws of physics by filming what's happening outside the window. We need to do experiments and interventions to learn the dynamics of a system.
"What do you think the role is, if any, of other uses of so-called big data? [...]
NOAM CHOMSKY: It’s more complicated than that. Let’s go back to the early days of modern physics: Galileo, Newton, and so on. They did not organize data. If they had, they could never have reached the laws of nature. You couldn’t establish the law of falling bodies, what we all learn in high school, by simply accumulating data from videotapes of what’s happening outside the window. What they did was study highly idealized situations, such as balls rolling down frictionless planes. Much of what they did were actually thought experiments.
Now let’s go to linguistics. Among the interesting questions that we ask are, for example, what’s the nature of ECP violations? You can look at 10 billion articles from the Wall Street Journal, and you won’t find any examples of ECP violations. It’s an interesting theory-determined question that tells you something about the nature of language, just as rolling a ball down an inclined plane is something that tells you about the laws of nature. Scientists use data, of course. But theory-driven experimental investigation has been the nature of the sciences for the last 500 years.
In linguistics we all know that the kind of phenomena that we inquire about are often exotic. They are phenomena that almost never occur. In fact, those are the most interesting phenomena, because they lead you directly to fundamental principles. You could look at data forever, and you’d never figure out the laws, the rules, that are structure dependent. Let alone figure out why. And somehow that’s missed by the Silicon Valley approach of just studying masses of data and hoping something will come out. It doesn’t work in the sciences, and it doesn’t work here."
It is actually a really interesting subject, marketing people doing a/b tests for ads/features seem at least a little closer to the experimental ideal, not just fitting curves to data
For further reading, I'd recommend the epilogue of Causality (Pearl, 2000); it's from a 1996 lecture at UCLA:
There are a lot of subtle points to make here. There is a tendency to throw data at models that don't capture parts of a distribution, and it is definitely true that many of the tail events in a challenging domain will not occur again no matter how long we observe the domain. Successful machine learning systems are able to predict these outcomes without having seen the data previously because they have captured the theory that creates them. Unfortunately it is very difficult to determine when a model is capturing the domain theory and when it is just modelling a distribution - often the only way is to "know" that it's a bit fishy. In many domains this difference doesn't matter, vision in animals seems to work in this way - it's all approximations and sameasis, and we and machines get tricked by optical illusions and so on. Other domains (many in physics) are modelled by observing data and inferring a higher level theory. Early days physics didn't work this way - Chomsky is right, but the method of Galileo is not the only method. Modern scientists do organise data and do look for exceptions and regularities which then drives the search for explanatory systems with predictive power.
I really violently oppose this characterization of ML as "just" curve fitting, as if curve fitting is some simple solved problem. It seems like there is an ignorance about issues relating to model selection, which is an essential part of curve fitting. What complexity of model does the data support? Can you keep a distribution over structures that allows uncertain parts of the model to be interrogated? These are the parts of the fitting equation that allow something like "experiments" to be automatically generated as part of the curve fitting.
Not the same kind of experiment. An experiment in the scientific sense tweaks the process that generates the data, not the interpretation of the data. There is an inspiration / hypothesis creation step between old data and new experiment.
Main differences: A hypothesis is sorta kinda like your model's coefficients, but more generally applicable. And you have no feedback loop between model coefficients and input data.
So yeah, you are doing very sophisticated curve fitting. It is useful alright, it's just not very much like science.
What Chomsky is saying is that the control variables don't exist until you create them because the most telling things don't happen until you have a specific hypothesis and make them happen to test the hypothesis.
I disagree. What he is saying is that there is a special rule for languages that he doesn't think you would get at without an enormous amount of data. So a passive learning algorithm wouldn't uncover this structure in a reasonable amount of time or data (I guess it is poor sample efficiency he is worried about). A learning algorithm that has a distribution over its own internal model of language would be able to ask questions that minimize the uncertainty of the model.
But what you describe is still curve fitting. I say this in spite of having some expertise in ML myself. There are some parts of ML that do not fall in the curve-fitting family, but they are still a small part -- for example Markov logic networks and some parts of reinforcement learning.
What you are saying is curve fitting with good predictive ability is not trivial, and that is indeed true.
Markov Logic Networks are still about finding coefficients for a probability distribution over some process. My opinion is that there is only curve fitting. There is data and a minimum complexity model that can reproduce the data with minimum error. So do you really believe that there are physical processes where this approach will fail?
Nobody is interested in having a machine discover the theory behind parabolic trajectories. That was solved science 400 years ago.
What is interesting, is having a machine that can estimate a parabolic trajectory, not deductively, but inductively, based only on visual observation, for a variety of different shaped and sized objects. The way a human does.
Galileo was a great scientist, and discovered many natural laws relating to motion, but that wouldn’t have made him a great dodgeball player.
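In code, the "inductive" version is basically a regression on observed positions; a toy 1-D sketch, nothing like what a real vision system or a human actually does:

    # Sketch: estimate a parabolic trajectory from noisy observed positions,
    # instead of deriving it from the physics.
    import numpy as np

    g = 9.81
    t = np.linspace(0, 2, 30)
    y_true = 10 * t - 0.5 * g * t**2                               # the "deductive" answer
    y_obs = y_true + np.random.default_rng(8).normal(scale=0.3, size=t.size)

    a, b, c = np.polyfit(t, y_obs, deg=2)    # fit y = a*t^2 + b*t + c from observations alone
    print("estimated a, b, c:", a, b, c)     # a should land near -g/2, b near 10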
Which leads me to another point: Many of these books cost $100+. If you don't have those kind of resources, try Library Genesis. It's been very helpful for getting started.