
Time Series Prediction Using LSTM Deep Neural Networks - shivinski
https://www.altumintelligence.com/articles/a/Time-Series-Prediction-Using-LSTM-Deep-Neural-Networks
======
zwaps
I find it interesting that Computer Scientists are basically rediscovering
statistics.

Now when predicting time series, an issue is that most models (like ARIMA,
GARCH, etc.) are short-memory processes. When you look at the full-series
prediction of LSTMs, you observe the same thing.

So in terms of Time Series, Machine Learning is currently in the mid to late
80's compared to Financial Econometrics.

So if you are in CS, you should probably take a look at fractional GARCH
models now and incorporate that logic into LSTMs. If the statistical issues
are the same, this may give you that hot new paper.
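
For the CS-minded reader, a minimal sketch of the long-memory idea behind fractionally integrated models (ARFIMA / fractional GARCH): fractional differencing using the binomial weights of (1 - B)^d. This is only an illustration with made-up data, not code from the article.

    import numpy as np

    def frac_diff_weights(d, n_weights):
        """Weights of the binomial expansion of (1 - B)^d for fractional d."""
        w = [1.0]
        for k in range(1, n_weights):
            w.append(-w[-1] * (d - k + 1) / k)
        return np.array(w)

    def frac_diff(series, d, n_weights=100):
        """Fractionally difference a 1-D series with a truncated weight window."""
        w = frac_diff_weights(d, n_weights)
        out = np.full(len(series), np.nan)
        for t in range(n_weights - 1, len(series)):
            window = series[t - n_weights + 1 : t + 1][::-1]  # newest value first
            out[t] = np.dot(w, window)
        return out

    prices = np.cumsum(np.random.randn(500))  # toy random-walk "price" series
    x = frac_diff(prices, d=0.4)              # 0 < d < 1 keeps more memory than d = 1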

~~~
RA_Fisher
It's been amazing to watch CS (really the Python community, save statsmodels
and patsy) discover statistics. For a while I thought perhaps it was me and
statistics that were "behind." Over time I realized that it was mostly re-
invention of old ideas: one-hot encoding = dummy variables, neural networks
approximating polynomial regression, etc. I decided to double down on
statistics and it's really paid off. NNs / random forests and the other stats-
founded but CS-led approaches are very general models. That leaves
statisticians a big opening, because a more specific model can be chosen to
obtain more accurate predictions. These days I'm positioning myself to clean
up the messes / save broken ML models. Turns out [stats] theory is very
practical. :-)
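
For concreteness, the "one-hot encoding = dummy variables" point in two lines: pandas uses the statistics name, scikit-learn the ML name, and both produce the same indicator columns. Toy data, purely illustrative.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    dummies = pd.get_dummies(df["color"], dtype=int)                 # "dummy variables"
    onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()  # "one-hot encoding"

    print(dummies.values)  # same indicator matrix as `onehot`, dtypes aside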

~~~
curiousgal
Because saying "relevant username" is frowned upon, I'll just point out that
R. A. Fisher was "a genius who almost single-handedly created the foundations
for modern statistical science"[0]

0. [https://en.m.wikipedia.org/wiki/Ronald_Fisher](https://en.m.wikipedia.org/wiki/Ronald_Fisher)

~~~
mlthoughts2018
It’s funny to me, as a professional statistician, because most methods
popularized by Fisher et al. in the early 1900s are wildly inappropriate for
practical problems, especially policy decision science or causal inference.

All the theory behind t-testing, Wald testing, using the derivatives of the
log likelihood near the MLE point estimate to also estimate standard errors
when no analytical solution exists, ANOVA, instrumental variables, etc.

It is in no sense exaggerative or incendiary to say that whole collection of
stuff is truly garbage statistics: it is insanely rife with counter-intuitive
results and common situations where minor violations of the assumptions can
easily lead to statistically significant results _of the wrong sign_, while
common practical needs (like model selection _without_ doing a bunch of
pairwise or subset selection calculations, or correcting for multicollinearity
in large regressions where calculating something like variance inflation
factors is totally intractable) are difficult or impossible.

Modern Bayesian approaches fully and entirely subsume these techniques, and
not just for large data (in fact, using Bayesian methods is even more critical
for small data), and not because of modern computing frameworks, but because,
from the very first principles of null-hypothesis significance testing, that
whole field of stats/econometrics is fundamentally incapable of giving
evidence or estimates that address the very questions the field is based on.

NHST basically solves a type of inference problem that nobody ever actually
has in reality, and which is almost never even approximately close enough to
be non-misleading.

NHST is like the stats analogue of Javascript: a horrible historical accident
that gained market traction despite being utterly and unequivocally a bad
choice for the very problem domain it’s intended to be used for. The
historical accident of adoption and momentum in Javascript will set back
professional computer science by decades until it’s eventually wholesale
replaced with something whose first principles are actually appropriate.

That same reckoning is underway in many fields of statistics, as the
fundamental unreliability of NHST estimation becomes better understood and
drop-in Bayesian replacements become more available.

~~~
RA_Fisher
I don't disagree with anything you've written. The only thing I'd take issue
with is placing NHST at the feet of statisticians. Scientists deserve a fair
share as well. :-p

------
jarym
Why does everyone naively try to predict price? No ‘traders’ are interested in
predicting it - what traders do is identify good locations to enter or exit
the market.

I.e. places with defined risk: you know you’re wrong if the market goes
against you by x%, you expect a y% gain if you’re right, and y>x by enough
that it’s worth more than the number of times you’re wrong.
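
A minimal sketch of that arithmetic (the numbers are made up): the setup is worth taking when the win-rate-weighted gain outweighs the loss.

    def expectancy(win_rate, gain_pct, loss_pct):
        """Expected % return per trade for a defined-risk setup."""
        return win_rate * gain_pct - (1 - win_rate) * loss_pct

    # e.g. risk x = 1% to make y = 3%, right only 40% of the time
    print(expectancy(win_rate=0.40, gain_pct=3.0, loss_pct=1.0))  # +0.6% per trade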

The types of algos that work well for this are edge-identification ones - I
know this because I am (not as well as I’d like) successfully doing it.

LSTMs haven’t performed so well for me in this task, but non-NN algos have.
CNNs, however, were promising but didn’t match what I’d come up with - still
searching for the holy grail that’ll make me rich!

~~~
beagle3
... because you buy at the price and sell at the price (spread and fees
ignored for now).

Which means, regardless of your philosophy, you are predicting a price change
- a long signal is a prediction for positive price change; a short signal is a
prediction for a negative price change. If that wasn’t true, your system would
not be able to profit.

Predicting price change and predicting price are semantically equivalent,
although a specific algorithm might be better at one than the other.
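
The equivalence is just a change of variables; an illustrative one-liner with made-up numbers:

    last_price = 100.0                 # toy value
    predicted_return = 0.012           # a "long signal": +1.2% expected

    predicted_price = last_price * (1 + predicted_return)  # 101.2
    implied_return = predicted_price / last_price - 1      # back to 0.012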

~~~
qeternity
Given less than 100% certainty, traders don't want to predict price, they want
to predict future distribution of price over some time period.

Source: hedge fund trader
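
A tiny illustration of "predict the distribution, not the point": simulate many return paths and read off quantiles of the terminal price. The drift and volatility numbers here are invented, not anyone's actual model.

    import numpy as np

    rng = np.random.default_rng(0)
    last_price, horizon, n_paths = 100.0, 20, 10_000
    daily_returns = rng.normal(loc=0.0005, scale=0.01, size=(n_paths, horizon))
    terminal_prices = last_price * np.prod(1 + daily_returns, axis=1)

    print(np.percentile(terminal_prices, [5, 50, 95]))  # a distributional view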

~~~
beagle3
True. That’s still considered a prediction of price among my trading
colleagues.

------
cshenton
For anyone considering this, LSTM only starts to pay off if you have many,
many time series. For a single time series like this one, you’re better off
using classical time series approaches like ARIMA or other Gaussian state
space models.
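
For reference, a minimal classical baseline of the kind suggested: an ARIMA fit with statsmodels. The order (5, 1, 0) and the toy series are arbitrary choices for illustration, not recommendations from the comment.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    y = np.cumsum(np.random.randn(300))       # toy non-stationary series

    model = ARIMA(y, order=(5, 1, 0)).fit()   # AR(5) on first differences
    print(model.forecast(steps=10))           # 10-step-ahead point forecasts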

------
lettergram
I've built quite a few of these kinds of models. The real trick is to compare
it against other methods AND to properly split A LOT of data. In many cases
(depending on the input data), a random walk does roughly as well as the
"prediction". This is because signal data (such as stock data) often just
follows a random (or seemingly random) trend.
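
A sketch of that baseline check (toy data, one-step-ahead setting assumed): if a model doesn't clearly beat the naive random-walk forecast (tomorrow = today), its "predictions" are likely just lag.

    import numpy as np

    def rmse(pred, actual):
        return np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2))

    y = np.cumsum(np.random.randn(1000))   # toy random-walk series
    test = y[800:]
    naive_pred = y[799:-1]                 # forecast y[t] with y[t-1]

    print("random-walk RMSE:", rmse(naive_pred, test))
    # compare your model's RMSE on the same held-out split before trusting it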

------
glial
Seems to me that this is almost dangerous unless the uncertainty (and
therefore confidence) of the prediction can be quantified.

~~~
RA_Fisher
Yep, it is dangerous. If you're not quantifying uncertainty, you can't make
safe predictions. I think this is the reason for the obsession with "data
cleaning" in the ML community: "outliers", aka rare observations, sink general
models.
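
One crude way to quantify that uncertainty, as a hedged sketch: a prediction interval built from the spread of held-out residuals, assuming roughly Gaussian i.i.d. errors (an assumption real series often violate).

    import numpy as np

    def prediction_interval(point_forecast, residuals, z=1.96):
        """~95% interval assuming i.i.d. Gaussian forecast errors."""
        sigma = np.std(residuals)
        return point_forecast - z * sigma, point_forecast + z * sigma

    residuals = np.random.randn(200) * 0.5          # toy: actual - predicted
    low, high = prediction_interval(101.2, residuals)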

------
GChevalier
I see here that the original poster (OP) tried to use many-to-one LSTMs
instead of many-to-many LSTMs. I could tell first by looking at the charts;
then I saw the method named "predict_point_by_point" with the comment
"Predict each timestep given the last sequence of true data, in effect only
predicting 1 step ahead each time" in his code here:
[https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Series-Prediction/blob/6aa5c5124ebfa405bf38bb6674871ab59d458b5c/core/model.py#L89](https://github.com/jaungiers/LSTM-Neural-Network-for-Time-Series-Prediction/blob/6aa5c5124ebfa405bf38bb6674871ab59d458b5c/core/model.py#L89)

I strongly think the system would perform better doing many predictions at
once instead, using seq2seq neural networks. The problem is properly explained
at the beginning of this other post:
[https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction](https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction)
That other post is, in turn, derived from my original project here doing
seq2seq predictions with TensorFlow:
[https://github.com/guillaume-chevalier/seq2seq-signal-prediction](https://github.com/guillaume-chevalier/seq2seq-signal-prediction)

OP also forgot to cite the image I made:
[https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png](https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:The_LSTM_cell.png)

Well, glad to see that work similar to mine can get this much traction on HN.
I would have loved to get this much traction when I did my post, too. Anyway,
I would suggest OP take a look at seq2seq, as it objectively performs better
(and without the "laggy drift" visual effect observed in OP's figure named
"S&P500 multi-sequence prediction").

In other words, using many-to-one architectures creates a kind of feedback
loop that doesn't happen with seq2seq, because seq2seq doesn't build on its
own accumulated error. It has a decoder with different weights than the
encoder, and it can be deep (stacked).
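
For readers who want to try the many-to-many route, a minimal Keras-style sketch of an encoder-decoder (seq2seq-like) model: the encoder summarizes the input window, and a decoder with its own weights emits the whole output sequence at once, so nothing is fed back from the model's own predictions. Shapes, sizes and layer choices are illustrative, not taken from OP's repository.

    from tensorflow import keras
    from tensorflow.keras import layers

    input_len, output_len, n_features = 50, 10, 1

    model = keras.Sequential([
        keras.Input(shape=(input_len, n_features)),
        layers.LSTM(64),                          # encoder: summary of the window
        layers.RepeatVector(output_len),          # seed the decoder at each step
        layers.LSTM(64, return_sequences=True),   # decoder: separate weights
        layers.TimeDistributed(layers.Dense(1)),  # one value per output timestep
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X, Y) with X shaped (n, 50, 1) and Y shaped (n, 10, 1)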

~~~
luanton
[https://news.ycombinator.com/item?id=17902967](https://news.ycombinator.com/item?id=17902967)

The aim of this post is to explain why sequence-to-sequence models appear to
perform better than "many to one" RNNs on signal prediction problems. It also
describes an implementation of a sequence-to-sequence model using the Keras
API.

------
djhworld
I'm currently learning machine learning at the most basic level; this is the
sort of stuff I want to work towards, though.

I deal with time series data a lot at work. I work in broadcasting/media, and
99% of the time the data is fairly "predictable" and follows a regular daily
pattern, peppered with the odd spike during big, unpredictable news events.

~~~
md2be
Time series analysis requires the data to be stationary.

~~~
dafrie
Well, I don't want to be pedantic, but don't you rather mean "most TSA MODELS
require the data to be stationary"? My experience has been that practical TSA
often actually revolves around how to deal with non-stationarity (testing,
differencing, smoothing...), which is often not a trivial task...
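
A small sketch of that workflow (toy series, 0.05 cutoff chosen arbitrarily): test for a unit root with the augmented Dickey-Fuller test, difference, and re-test.

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    y = np.cumsum(np.random.randn(500))   # toy non-stationary series

    p_value = adfuller(y)[1]              # ADF: H0 = unit root (non-stationary)
    if p_value > 0.05:                    # can't reject a unit root
        y = np.diff(y)                    # first-difference and test again
        p_value = adfuller(y)[1]
    print("ADF p-value after differencing:", p_value)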

------
daviddumenil
Could this approach be applied to a metric monitoring framework to give
earlier/more accurate notifications of when a threshold will be crossed?

Typically these are triggered when, e.g., 90% of a threshold has been reached.
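
A hedged sketch of the idea: forecast the metric a few steps ahead with whatever model you trust and alert on the first step where the forecast crosses the threshold, instead of waiting for the raw value to reach 90% of it. The forecast values and threshold here are placeholders.

    import numpy as np

    def steps_until_breach(forecast, threshold):
        """1-indexed steps ahead of the first forecasted breach, or None."""
        breaches = np.where(np.asarray(forecast) >= threshold)[0]
        return int(breaches[0]) + 1 if breaches.size else None

    forecast = [70, 78, 85, 93, 101]      # e.g. output of an ARIMA/LSTM forecaster
    lead = steps_until_breach(forecast, threshold=100)
    if lead is not None:
        print(f"threshold predicted to be crossed in {lead} steps")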

------
fooker
So, curve fitting?

~~~
f00_
Exactly. Judea Pearl's The Book of Why opened my eyes to the fact that most of
what happens in machine learning is really just curve fitting.

It connected with what I've heard Chomsky say about trying to develop the laws
of physics by filming what's happening outside the window. We need to do
experiments and interventions to learn the dynamics of a system.

"What do you think the role is, if any, of other uses of so-called big data?
[...]

NOAM CHOMSKY: It’s more complicated than that. Let’s go back to the early days
of modern physics: Galileo, Newton, and so on. They did not organize data. If
they had, they could never have reached the laws of nature. You couldn’t
establish the law of falling bodies, what we all learn in high school, by
simply accumulating data from videotapes of what’s happening outside the
window. What they did was study highly idealized situations, such as balls
rolling down frictionless planes. Much of what they did were actually thought
experiments.

Now let’s go to linguistics. Among the interesting questions that we ask are,
for example, what’s the nature of ECP violations? You can look at 10 billion
articles from the Wall Street Journal, and you won’t find any examples of ECP
violations. It’s an interesting theory-determined question that tells you
something about the nature of language, just as rolling a ball down an
inclined plane is something that tells you about the laws of nature.
Scientists use data, of course. But theory-driven experimental investigation
has been the nature of the sciences for the last 500 years.

In linguistics we all know that the kind of phenomena that we inquire about
are often exotic. They are phenomena that almost never occur. In fact, those
are the most interesting phenomena, because they lead you directly to
fundamental principles. You could look at data forever, and you’d never figure
out the laws, the rules, that are structure dependent. Let alone figure out
why. And somehow that’s missed by the Silicon Valley approach of just studying
masses of data and hoping something will come out. It doesn’t work in the
sciences, and it doesn’t work here."

\- [https://www.rochester.edu/newscenter/conversations-on-linguistics-and-politics-with-noam-chomsky-152592/](https://www.rochester.edu/newscenter/conversations-on-linguistics-and-politics-with-noam-chomsky-152592/)

It is actually a really interesting subject; marketing people doing A/B tests
for ads/features seem at least a little closer to the experimental ideal, not
just fitting curves to data.

For further reading, I'd recommend the epilogue of Causality (Pearl 2000);
it's from a 1996 lecture at UCLA:

\-
[http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf](http://bayes.cs.ucla.edu/BOOK-2K/causality2-epilogue.pdf)

~~~
salty_biscuits
I really violently oppose this characterization of ML as "just" curve fitting,
as if curve fitting were some simple, solved problem. It seems like there is
some ignorance about issues relating to model selection, which is an essential
part of curve fitting. What complexity of model does the data support? Can you
keep a distribution over structures that allows uncertain parts of the model
to be interrogated? These are the parts of the fitting problem that allow
something like "experiments" to be automatically generated as part of the
curve fitting.

~~~
ahartmetz
Not the same kind of experiment. An experiment in the scientific sense tweaks
the process that generates the data, not the interpretation of the data. There
is an inspiration / hypothesis creation step between old data and new
experiment.

Main differences: A hypothesis is sorta kinda like your model's coefficients,
but more generally applicable. And you have no feedback loop between model
coefficients and input data.

So yeah, you are doing very sophisticated curve fitting. It is useful alright,
it's just not very much like science.

~~~
salty_biscuits
No, it's the same. It is just about having access to control variables.

~~~
ahartmetz
What Chomsky is saying is that the control variables don't exist until you
create them because the most telling things don't happen until you have a
specific hypothesis and _make them happen_ to test the hypothesis.

~~~
salty_biscuits
I disagree. What he is saying is that there is a special rule for language
that he doesn't think you would get at without an enormous amount of data. So
a passive learning algorithm wouldn't uncover this structure in a reasonable
amount of time or data (I guess it is poor sample efficiency he is worried
about). A learning algorithm that has a distribution over its own internal
model of language would be able to ask questions that minimize the uncertainty
of the model.

------
shawn
If anyone is looking to get into machine learning, I've found "Introduction to
Data Mining" very useful:

[https://news.ycombinator.com/item?id=17808349](https://news.ycombinator.com/item?id=17808349)

First edition:
[http://www.uokufa.edu.iq/staff/ehsanali/Tan.pdf](http://www.uokufa.edu.iq/staff/ehsanali/Tan.pdf)

Also see "Mining of Massive Datasets", usually available at this link, but it
seems to be down:
[http://infolab.stanford.edu/~ullman/mmds/book.pdf](http://infolab.stanford.edu/~ullman/mmds/book.pdf)

Which leads me to another point: many of these books cost $100+. If you don't
have those kinds of resources, try Library Genesis. It's been very helpful for
getting started.

