Exponential Smoothing: faster and more accurate than NeuralProphet (github.com/nixtla)
161 points by LevoMX on Aug 17, 2022 | 42 comments
We benchmarked on more than 55K series and show that ETS improves MAPE and sMAPE forecast accuracy by 32% and 19%, respectively, while requiring 104x less computation time than NeuralProphet.
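
For reference, the two accuracy metrics can be computed like this (a minimal sketch in numpy; the benchmark's exact per-series aggregation may differ):

    import numpy as np

    def mape(y, y_hat):
        # Mean Absolute Percentage Error: average |error| relative to |actual|
        return np.mean(np.abs(y - y_hat) / np.abs(y)) * 100

    def smape(y, y_hat):
        # Symmetric MAPE: |error| scaled by the mean of |actual| and |forecast|
        return np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))) * 100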

We hope this exercise helps the forecast community avoid adopting yet another overpromising and unproven forecasting method.



This wouldn't pass peer review if it were a paper. Major issues:

- No fair hyperparameter tuning for NeuralProphet. They mention multiple times that they used default hyperparameters or ad-hoc example hyperparameters.

- ETS outperforming on 3 of 4 benchmark datasets (on one of which they didn't finish training) is not strong evidence of all-round robustness. Benchmarks like SuperGLUE for NLP combine 10 completely different tasks, with further subtasks, to assess language model performance. And even SuperGLUE is not uncontroversial.


This comparison was not intended as an academic paper. However, we are confident that predicting and evaluating performance over more than 55k series from the international M forecasting competitions (standard benchmarks in the field) is a good start. After corresponding with the author, we included additional datasets from the electricity domain, like ERCOT and ETTm2. The dataset that did not finish training shows that NeuralProphet simply does not scale: we let the training run for over 73 hours on 96 CPUs (the NeuralProphet implementation is restricted to CPU) and canceled it after investing 288 USD. Suggestions to strengthen our experiments (without spending hundreds of dollars) are highly welcome.
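
For anyone who wants to reproduce the ETS side cheaply, the setup is roughly the following (a sketch against a recent statsforecast API; older releases took the dataframe in the constructor, and the CSV path here is hypothetical):

    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import AutoETS

    # long-format dataframe with columns: unique_id (series id), ds (timestamp), y (value)
    df = pd.read_csv('m4-hourly.csv')

    # fit an automatic ETS model per series in parallel, forecast 48 steps ahead
    sf = StatsForecast(models=[AutoETS(season_length=24)], freq='H', n_jobs=-1)
    forecasts = sf.forecast(df=df, h=48)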

Regarding the hyperparameter selection, we went beyond the original NeuralProphet paper, tuning it and actively trying to help improve its performance. We even made a PR fixing NeuralProphet bugs in the process.


This is ironic, because the NeuralProphet paper wasn't peer-reviewed either, and it didn't benchmark against anything other than Prophet.


While the results don't prove superiority convincingly, it does seem that ETS is a good candidate as a first go-to in practical applications. "In practice, practice and theory are the same. In theory, they are not."


In theory practice and theory are the same. In practice they are not.


HN pedantry ruins the fun of wordplay yet again.


In theory, his version is right. In practice, yours.


The difference between theory and practice is greater in theory than in practice.


In theory there is no difference between theory and practice. In practice there is.

- Yogi Berra


"Sometimes malpractice is just malpractice"

- Sigmund Freud


In the original Prophet paper (https://peerj.com/preprints/3190.pdf) they claim that Prophet outperforms ETS (see Figure 7, for example). And in the NeuralProphet paper, they claim that it outperforms Prophet (but do not, as far as I can see, compare directly to ETS). Here we see ETS outperforms NeuralProphet.

Presumably this apparent non-transitivity is because of differences in each evaluation. If we fix the evaluation to the method used here, is it still the case that NeuralProphet outperforms Prophet (and therefore the claim that Prophet outperforms ETS is not correct)? Or is it that NeuralProphet does not outperform Prophet, but Prophet does outperform ETS?


Figure 7 of the mentioned paper evaluates FB-Prophet in an extremely convenient environment of long horizons h in {30, 60, 90, 120, 150, 180}. It is known that ETS and ARIMA models compound errors and degrade in performance at longer forecasting horizons. We have explored and offered solutions to these issues with the N-HiTS model, which specializes in long horizons (https://arxiv.org/abs/2201.12886).
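
(For intuition on the error-compounding point: ETS and ARIMA produce multi-step forecasts recursively, feeding each prediction back in as if it were an observation. A minimal sketch, with a hypothetical predict_one_step method:)

    def recursive_forecast(model, history, h):
        # each iteration treats the previous forecast as data, so
        # one-step errors accumulate as the horizon h grows
        predictions = []
        for _ in range(h):
            y_hat = model.predict_one_step(history)  # hypothetical one-step API
            predictions.append(y_hat)
            history = history + [y_hat]  # the forecast becomes tomorrow's "input"
        return predictions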

In recent years, FB-Prophet has gained a reputation for the poor quality of its predictions in many practical scenarios (short/medium-term horizons) and for its slow performance on bigger datasets; our ARIMA/ETS work and NeuralProphet confirmed those suspicions. The 92 percent improvements mentioned in the paper are restricted to h in {1, 3, 15, 60} (https://arxiv.org/pdf/2111.15397.pdf).

This post's results are instead for short-horizon tasks, the same as the NeuralProphet experiments. But we are confident that specialized tools like N-HiTS would outperform Prophet in long-horizon settings.


Prophet's original use case was forecasting extremely long time series fairly automatically. If that's not your use case, it may not be the right fit.


I think the problem arises from the datasets used to evaluate the models' performance. In the case of Prophet's paper, only one time series is used (the number of events created on Facebook). From the results comparing AutoARIMA vs. Prophet (https://github.com/Nixtla/statsforecast/tree/main/experiment..., using the same datasets as the ETS vs. NeuralProphet experiment), we can conclude that ETS is also better than Prophet. Regarding NeuralProphet vs. Prophet, the results are not conclusive on these datasets.


As usual in ML, the appropriate solution depends on the problem and context.

ML (particularly DL) tends to outperform "classical" statistical time series forecasting when the data is (strongly) nonlinear, high-dimensional, and large. The opposite holds as well.

It is also important to note that accuracy is not the only relevant metric in practical applications. Explainability is of particular interest in time series forecasting: it is good to know whether your sales are going to increase or decrease, but it is even more valuable to know which input variables are likely to account for that change. Hence, a "simple" model with inferior forecasting accuracy might be preferred to a stronger estimator if it can give insight into not only what will happen, but also why.


A larger problem is that time series modeling is particularly resistant to black-box approaches, since a lot of information is encoded in the model itself.

Take even a simple moving-average model on daily observations. Consider stock ticker data (where there are no weekend observations) and web traffic data (where there is an observation every day). The stock ticker data should be smoothed with a 5-day window and the web traffic with a 7-day window to help reduce the impact of weekly effects (which probably shouldn't exist in the stock market anyway).

It's possible in either of these cases that you might find a moving average that performs better on some chosen metric, say 4 or 8 days. However, neither of these alternatives makes any sense as a window if we're trying to remove the day-of-week effect, and unless you can come up with a justifiable explanation, smoothing over arbitrary windows should be avoided.

If you let a black box optimize even a simple moving average you would be avoiding some very essential introspection into what your model is actually claiming.

Not to mention that we can often do more than just prediction with these intentional model tunings (for example, the day-of-week effect can be explicitly differenced out of the data to measure exactly how much sales should increase on a Saturday).


Do you know of any good resources that explain why smoothing should be used?

My intuition is that in your given example (stock prices) smoothing would probably be doing yourself a disservice as it would hide the optimal hour or day of the week to make purchases/sales.

Is it mostly related to the timeframe of your analysis and needing to trade off near term precision for longer term precision?


Smoothing is used to negate seasonal effects.

Suppose you run a bar, and your busiest days in order are Friday, Saturday, Thursday, Sunday, Wednesday, Tuesday and Monday.

Now you are the owner, and you want to look at your foot traffic every day to monitor the health of your business. However, given the ordering I've presented, this will almost never work trivially. Monday traffic will always be less than Sunday's; does this mean that every Monday you should be concerned about the business? Of course not.

However, by averaging the last 7 days and looking at that each day, you are canceling out these seasonal effects, because every single day of the week is accounted for in your measurement. If the 7-day moving average on Monday is less than Sunday's, you should be concerned, because the average calculated on Sunday included the previous Monday.

For your example, you use the smoothed data and a history of the original data to come up with an exact explanation of which days are the best.

For example, if you are a bar owner and you don't know which day is the best, you can take a 7-day moving average and subtract it from each day's actual observation. Then average those differences grouped by day of week, and you get an estimate of the day-of-week effect (you can also calculate the standard deviation).
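
Concretely, that procedure is only a few lines of pandas (a sketch, assuming a daily series with a DatetimeIndex and a hypothetical 'traffic' column):

    import pandas as pd

    # a 7-day window covers each weekday exactly once, canceling the weekly cycle
    df['smooth'] = df['traffic'].rolling(7).mean()
    df['resid'] = df['traffic'] - df['smooth']  # what each day adds on top of the local level

    # average residual per weekday ~= day-of-week effect (0 = Monday)
    dow_effect = df['resid'].groupby(df.index.dayofweek).agg(['mean', 'std'])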


> ML (particularly DL) tends to outperform "classical" statistical time series forecasting when the data is (strongly) nonlinear, highly dimensional and large.

This claim about forecasting with DL comes up a lot, but I’ve seen little evidence to back it up.

Personally, I’ve never managed to have the same success others apparently have with DL time series forecasting.


It's true simply because large ANNs have a higher capacity, which is great for large, nonlinear data but less so for small datasets or simple functions.

In any case, Transformers are eating ML right now, and I'm actually surprised there's no "GPT-3 for time series" yet. It's technically the same problem as language modeling (that is, multi-step prediction over sequences of numbers); however, there is comparatively little human-generated data for self-supervised pre-training of a time series forecasting model. Another reason might be that the expected applications of such a pre-trained model aren't as glamorous as generating language.


> It's technically the same problem as language modeling

You're thinking of modeling event sequences, which is not, strictly speaking, the same as time series modeling.

Plenty of people do use LSTMs to model event sequences, using the hidden state of the model as a vector representation of a process's current location walking a graph (i.e., a user's journey through a mobile app, or navigation by following links on the web).

Time series are different because the ticks of timed events come at consistent intervals and are themselves part of the problem being modeled. In general, time series models have often been distinct from sequence models.

The reason there's no GPT-3 for general sequences is the lack of data. Typically the vocabulary of events is much smaller than a natural language's, and the corpus of sequences is much smaller too.


There's a deeper issue. All language (and code, and the other things in the GPT-style corpora) seems to have something in common: hierarchical, short- and long-range structure.

In contrast, there is nothing that all time series have in common. There's no way to learn generic time series knowledge that will reliably generalise to new unseen time series.


Like I said, still not seen any evidence.


Then look at some of the past time series related Kaggle challenges, plenty of evidence there in the winning solutions.


So true; causal inference is much more valuable than prediction, which is a relatively cheap commodity.


Can someone explain this? I don't know what the context is for this Show HN.


NeuralProphet is the successor and extension of Prophet; it aims to augment Prophet with neural network and autoregressive terms. The paper can be found here (https://arxiv.org/abs/2111.15397).

We noted that the paper only compares NeuralProphet against Prophet and does not include standard time series datasets (such as M-competitions). So we decided to test the model against simpler models (ETS in this case) using the StatsForecast library (https://github.com/Nixtla/statsforecast/).
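
For context, default NeuralProphet usage looks roughly like this (a sketch; df follows Prophet's ds/y column convention, and the 30-day horizon is illustrative):

    from neuralprophet import NeuralProphet

    # df has columns ds (timestamp) and y (value)
    m = NeuralProphet()  # default hyperparameters
    m.fit(df, freq='D')
    future = m.make_future_dataframe(df, periods=30)
    forecast = m.predict(future)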


They're time series prediction methods. E.g. they mention electricity usage forecasting - given historical data, what will the usage be in 1 hour?

Facebook's Prophet is quite popular in the space, I understand. No idea about the other two.


A minor language error: "this model does not outperform classical statistical methods neither in accuracy nor speed." should say "either" and "or".


The "Not-Neither-Nor" sequence is typical, even with regards to American English, versus British English (the Queen's English.) In either case, both are technically-correct.


As part of a double negative?


English is nothing if not inconsistent.



Nothing there seems to contradict me. The problem in the linked page is that "neither ... nor" is used after "not", which makes it a double negative.


Hmm, wow. When I saw the headline, I assumed they used like one dataset or something similarly limiting.

I'd need to dig out the original paper, but I would be surprised if the original didn't compare to basic benchmark methods. But from memory, I never saw such a comparison (until now).


Be surprised. Here is the original NeuralProphet paper: https://arxiv.org/pdf/2111.15397.pdf. It only compares itself to Prophet.


What's the consensus on machine learning vs. more classical methods for time series forecasting? I know in 2018 a hybrid model won the M4 competition; obviously in this case classical still beats AI/ML.

https://en.wikipedia.org/wiki/Makridakis_Competitions


I think it depends massively on what you mean by "time series". If it is really an ARMA model you're looking at, then ML can only bring noise to the problem. If it is a complex, large system that happens to be indexed by time, ML may well be better.

AFAIK Prophet had a more modest scope than "be-all and end-all of TS modelling"; rather, a decent model for everything. It might indeed be excellent at that...


In the M5 competition[1], most winning solutions used LightGBM. So ML beat classical.

Just a couple of the winning solutions used DL.

[1] https://www.sciencedirect.com/science/article/pii/S016920702...


I would like to see the results of this ETS on the M5 Competition dataset, and see how fast it is compared to the ETS that was used as a benchmark. It goes without saying that accuracy is important, but reducing the total execution time is also pretty valuable.


I wish the introduction was targeted to a more general audience. It was not very clear what the application was.


This makes me reconsider my opposition to a death penalty.





