In general, the M-Competitions (https://forecasters.org/resources/time-series-data/), the Olympics of time series forecasting, have proven frustrating for ML methods... linear models do shockingly well, and the ML models that have won generally seem to be variants of older tree-based methods (e.g., LightGBM is a favorite).
Will be interesting to see whether the Transformer architecture ends up making real progress here.
They are comparing a non-ensembled transformer model with an ensemble of simple linear models. It's not surprising that an ensemble of linear time series models does well, since ensembles optimize the bias-variance trade-off.
Transformer/ML models by themselves have a tendency to overfit past patterns. They pick up more signal in the patterns, but they also pick up spurious patterns. They're low bias but high variance.
It would be more interesting to compare an ensemble of transformer models with an ensemble of linear models to see which is more accurate.
(That said, it's pretty impressive that an ensemble of simple linear models can beat a large-scale transformer model -- this tells me the domain being forecast has a high degree of variance, which transformer models by themselves don't handle well.)
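To make the ensemble point concrete, here's a minimal bagging-style sketch (not the setup from the paper; the data and lag features are placeholders I made up): fit several linear models on bootstrap resamples and average their forecasts, which cuts variance at little cost in bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Placeholder data: lagged values of a noisy random-walk series as features.
series = np.cumsum(rng.normal(size=500))
X = np.stack([series[i:i + 5] for i in range(len(series) - 5)])
y = series[5:]

# Bag of linear models: each one is fit on a bootstrap resample of the rows.
models = []
for _ in range(25):
    rows = rng.integers(0, len(X), size=len(X))
    models.append(LinearRegression().fit(X[rows], y[rows]))

# Averaging the members' predictions lowers variance vs. any single member.
ensemble_pred = np.mean([m.predict(X[-1:]) for m in models], axis=0)
print(ensemble_pred)
```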
Dropout is different from ensembles. It is a regularization method.
It might look like an ensemble because you’re selecting different subsets of the network, but ensembles combine different independent models rather than just sub-networks of one model.
That said, random forests are an internal ensemble, so I guess that could work.
In my mind an ensemble is like a committee. For it to be effective, each member should be independent (able to pick up different signals) and have a greater than random chance of being correct.
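A toy simulation of that committee intuition (the numbers are made up): if each member is independently right 60% of the time, a majority vote of 11 such members is right noticeably more often.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members, n_trials, p_correct = 11, 100_000, 0.6

# Each committee member is independently correct with probability 0.6.
votes = rng.random((n_trials, n_members)) < p_correct

member_acc = votes[:, 0].mean()                               # ~0.6
committee_acc = (votes.sum(axis=1) > n_members // 2).mean()   # noticeably higher
print(member_acc, committee_acc)
```

If the members are correlated (picking up the same signals), the gain shrinks, which is why independence matters.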
Are these models high risk because of their lack of interpretability? Specialized models like temporal fusion transformers attempt to solve this, but in practice I'm seeing folks torn apart when defending transformers in front of model risk committees within organizations that are mature enough to have them.
Interpretability is just one pillar to satisfy in AI governance. You have to build submodels to assist with interpreting black-box main prediction models.
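One common version of that "submodel" idea is a global surrogate: fit a simple, interpretable model to the black-box model's own predictions and inspect the surrogate instead. A rough sketch (the black-box predictor here is just a placeholder function):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))

# Placeholder for the opaque main model's predictions on the same inputs.
def black_box_predict(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

# Surrogate: a shallow tree trained to mimic the black box, not the raw labels.
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, black_box_predict(X))
print(export_text(surrogate))  # human-readable rules approximating the black box
```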
Is there a way to directly train transformer models to output embeddings that could help tree-based models downstream? For tabular data, tree-based models seem to be the best, but I feel like foundational models could help them in some way.
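One way this is sometimes done (a rough sketch; `encode` is a stand-in for whatever pretrained encoder you have, not a real Chronos API): pool the encoder's output into a fixed-length vector per series and feed it to a gradient-boosted tree alongside the tabular features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def encode(series: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained encoder returning a fixed-length embedding.

    In practice this would be pooled hidden states from a pretrained
    time series model, not this toy featurization.
    """
    return np.array([series.mean(), series.std(), series[-1] - series[0]])

rng = np.random.default_rng(0)
series_batch = rng.normal(size=(200, 64))   # 200 series, 64 timesteps each
tabular = rng.normal(size=(200, 5))         # existing tabular features
target = rng.normal(size=200)               # placeholder target

# Concatenate the (here: toy) embeddings with tabular features for the tree model.
embeddings = np.stack([encode(s) for s in series_batch])
X = np.hstack([tabular, embeddings])
model = GradientBoostingRegressor().fit(X, target)
```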
As a practitioner, the most impactful library for time series has been brms, which basically gives you syntactic sugar for creating statistical models in Stan. It checks all the boxes, including probabilistic forecasts and multiple likelihood families: Wiener, gamma, Gaussian, Student-t, binomial, zero-inflated and hurdle models. It also has auto-regressive terms and ordinal predictors, and you actually learn something from your data.
I find a lot of these ML and DL libraries harder to troubleshoot beyond blind hyperparameter tuning, whereas with stats I can tweak the model, modify the likelihood, etc. There are also a lot of high-value problems that have few data points; these libraries tend to want at least daily data.
I guess I just mean I’m a data scientist—someone who uses models like these in practice as opposed to someone who develops them.
I’m not sure what to even make of a term like “foundational time series”. Does that just mean it’s widely used and known? You have to earn a role like that; you can’t just declare yourself one.
Maybe I'm missing something obvious, but what is the idea behind quantizing and tokenizing time series? We tokenize text because text isn't numbers. In the case of time series, we're... turning numbers into less precise numbers? The benefit of scaling and centering is trivial, and I guess all time series ML does it, but I don't see why we need a token after that.
I'm building upon insights from this paper (https://arxiv.org/pdf/2403.03950.pdf) and believe that classification can sometimes outperform regression, even when dealing with continuous output values. This is particularly true in scenarios where the output is noisy and may assume various values (multimodal). By treating the problem as classification over discrete bins, we can obtain an approximate distribution over these bins, rather than settling for a single, averaged value as regression would yield. This approach not only facilitates sampling but may also lead to more favorable loss landscapes. The linked paper provides more details on this idea.
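A minimal sketch of the binning idea (toy bimodal data, not the paper's setup): discretize the continuous target into bins, train a classifier, and read off a predicted distribution over bins instead of a single point estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
# Bimodal, noisy target: a plain regression would average the two modes away.
y = np.where(rng.random(2000) < 0.5, -1.0, 1.0) + 0.1 * X[:, 0] \
    + rng.normal(scale=0.2, size=2000)

# Discretize the target into equal-width bins; the bin index is the class label.
bins = np.linspace(y.min(), y.max(), 21)          # 20 bins
labels = np.digitize(y, bins[1:-1])

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Predicted distribution over bins for a new input, instead of one averaged value.
probs = clf.predict_proba(X[:1])[0]
bin_centers = (bins[:-1] + bins[1:]) / 2
# Map through clf.classes_ since empty bins are dropped by the classifier.
print(bin_centers[clf.classes_[probs.argmax()]], probs.max().round(3))
```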
Isn't it a given that classification would "outperform" regression, assuming n_classes < n_possible_continuous_labels?
Turning a regression problem into a classification problem bins the data, which gives you more examples per label and simplifies the problem, with a trade-off in the granularity you can predict.
(It depends on what you mean by "outperform" since metrics for classification and regression aren't always comparable, but I think I'm following the meaning of your comment overall)
Tokenisation turns a continuous signal into a normalized discrete vocabulary: stock "went up a lot", "went up a little", "stayed flat". This smooths out noise and simplifies matching up similar but not identical signals.
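A toy version of that idea (not Chronos's actual scheme, which mean-scales and quantizes into a much larger vocabulary): scale each series by its typical step size, then bucket the changes into a handful of named tokens.

```python
import numpy as np

def tokenize(series: np.ndarray) -> list[str]:
    """Toy tokenizer: scale, then map each step's change to a tiny vocabulary."""
    changes = np.diff(series)
    scale = np.mean(np.abs(changes)) or 1.0
    # Bucket edges chosen arbitrarily for illustration.
    edges = np.array([-1.0, -0.25, 0.25, 1.0]) * scale
    vocab = ["down a lot", "down a little", "flat", "up a little", "up a lot"]
    return [vocab[i] for i in np.digitize(changes, edges)]

prices = np.array([100.0, 101.5, 101.6, 99.0, 99.1, 102.0])
print(tokenize(prices))  # ['up a lot', 'flat', 'down a lot', 'flat', 'up a lot']
```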
> We tokenize text because text isn't numbers.
Text is actually numbers. People tried inputting UTF8 directly into transformers, but it doesn't work that well. Karpathy explains why:
Text can be represented by numbers but they aren't the same datatype. They don't support the same operations (addition, subtraction, multiplication, etc).
Interesting. Can you explain how this is superior and/or different from traditional DSP filters or other non-tokenization tricks in the signal processing field?
Traditional DSP filters still output a continuous signal. And it's a well-explored domain, hard to imagine any low-hanging fruit there.
My intuition is the following: transformers work really well for text, so we could try turning a time series into a "story" (limited vocabulary) and see what happens.
I think it could also have a connection with symbolic AI: The discrete tokens could be the symbols that many believe is useful or necessary for reasoning.
It is also useful for compression, reducing memory requirements by the quantization and small integer representations.
My primitive understanding is that we approximate a Markovian approach and indirectly model the transition probabilities just by working through tokens.
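To spell out what "transition probabilities over tokens" means in the simplest case, here's a first-order toy sketch (the real model conditions on far more context than the previous token, so this is only an analogy):

```python
import numpy as np

# A series already quantized into a small token vocabulary.
tokens = ["flat", "up", "up", "down", "flat", "up", "down", "down", "up"]
vocab = sorted(set(tokens))
idx = {t: i for i, t in enumerate(vocab)}

# Count first-order transitions token[t] -> token[t+1].
counts = np.zeros((len(vocab), len(vocab)))
for a, b in zip(tokens, tokens[1:]):
    counts[idx[a], idx[b]] += 1

# Row-normalize to get an empirical transition matrix P(next | current).
transition = counts / counts.sum(axis=1, keepdims=True)
print(vocab)
print(transition)
```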
Chronos is probably overkill for what I am looking to do with time series data. I just did an Ask HN on time series[0] but unfortunately didn't get the replies I was hoping for. Maybe this thread can get the bump I need:
I inherited a large time series JSON dataset in 2024. I've been successful in using the Observable Framework[1] by writing a Rust (rust-script) data loader[2] to parse and plot simple line charts[3] to visually see the data. There are hundreds of graphs over years of data so I would like to identify what graphs I should be paying attention to. My initial thought is to calculate metrics on each graph such as:
- Variability: how "spread out" are the data points from one another?
- Trend: direction of data path, up or down?
- Slope: are the data points increasing or decreasing?
- Level: where are the data points on the vertical axis?
What libraries, AI, databases, etc... would you recommend that would allow me to calculate these values? I am no data scientist and don't need forecasting but overall, I just want a dashboard that shows the most "important" graphs.
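For concreteness, something like this rough numpy sketch is roughly what I have in mind per graph (the metric definitions are just my guesses):

```python
import numpy as np

def summarize(values: np.ndarray) -> dict:
    """Rough per-graph summary: my guesses at the four metrics above."""
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)    # least-squares line
    return {
        "variability": float(np.std(values)),  # spread of the points
        "trend": "up" if slope > 0 else "down",  # direction of the fitted line
        "slope": float(slope),                 # rate of increase/decrease
        "level": float(np.mean(values)),       # where the series sits on the y-axis
    }

print(summarize(np.array([3.0, 3.4, 3.1, 3.9, 4.2, 4.0])))
```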
I always worked in R for time series analysis. This cookbook has everything you would need to plan an analysis of a time series [0], and this book provides a strong base and understanding while being focused on forecasting [1]. Have fun!
When you ask what data you should be paying attention to, that depends on your objective. Do you want to predict something? Identify anomalies? In the end, what matters is understanding the meaning and relations of these data, rather than throwing them into some ML framework and hoping to get something out.
Prediction and anomaly detection are not my objectives, but of the 4 metrics listed, I would say the primary objective is identifying a trend in the data, i.e. knowing whether the data is moving in a specific direction, increasing or decreasing in value.
I already added linear regression marks that draw linear regression lines with confidence bands [1] to my Observable plots, but they do not give me a “value”, so I need to manually look at the graphs and read the red line.
Because these approaches are likely derived from papers published 3-5 years ago. At this point neither TimesFM nor Chronos is particularly novel. I've had similar models in production for complex time series for 18 months now.
Coming from finance, I always wonder how and if these large pre-trained models are usable on any financial time series. I see the appeal of pre-trained models in areas where there is clearly a stationary pattern, even if it's very hidden (e.g. industrial or biological metrics). But given the inherently low signal-to-noise ratio and how extremely non-stationary or chaotic financial data processes tend to be, I struggle to see the use of pre-trained foundation models.
I played around with the TimeGPT beta, predicting next-day S&P 500 index performance (univariate only, as I couldn't figure out how to get multivariate time series set up), and trying to use the confidence intervals it generated to buy options was useless at best.
I can see Chronos working a bit better, as it tries to convert trends and pieces of time series into tokens, like GPT does for phrases.
E.g., a stock goes down terribly, then dead-cat bounces. This is common.
A stock goes up, hits resistance due to existing sell orders, comes down.
A stock is on a stable upward trend, continues the upward trend.
If I can verbalize these usual patterns, it's likely Chronos can also pick up on them.
Once again, quality of data trumps all for LLMs, so performance might vary. If you read the paper, they point out a few situations where the LLM is unable to learn a trend, e.g. when the time series in the prompt isn't long enough.
Amazon's older time series forecasting system, DeepAR, has supported using external regressors since 2018 [1]. In this new Chronos paper, I didn't find any mention of external regressors.
They do mention covariates in section 6.1 - specifically, how this method doesn’t support them, along with ideas on how it could in the future, such as via stacking:
> In this work, we have focused on univariate time series forecasting since it constitutes the most common of real-world time series use-cases. Nevertheless, practical forecasting tasks often involve additional information that must be taken into account. One example involves covariates, that can be either time-independent (e.g., color of the product) or time-varying (e.g., on which days the product is on sale). Another closely related problem is multivariate forecasting, where historic values of one time series (e.g., interest rates) can influence the forecast for another time series (e.g., housing prices). The number of covariates or multivariate dimensions can vary greatly across tasks, which makes it challenging to train a single model that can handle all possible combinations. A possible solution may involve training task-specific adaptors that inject the covariates into the pretrained forecasting model (Rahman et al., 2020). As another option, we can build stacking ensembles (Ting & Witten, 1997) of Chronos and other light-weight models that excel at handling covariates such as LightGBM (Ke et al., 2017).
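A back-of-the-envelope sketch of the stacking idea mentioned there (entirely my own toy version, with placeholder data and a stand-in for the pretrained model's forecast): use the univariate point forecast as one feature alongside the covariates, and let a light-weight tree model learn the combination.

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
n = 500

# Stand-in for a pretrained univariate forecaster's point forecasts.
base_forecast = rng.normal(size=n)
# Covariates the univariate model cannot see (e.g. promotion / day-of-week flags).
covariates = rng.integers(0, 2, size=(n, 3)).astype(float)
actuals = base_forecast + covariates @ np.array([0.5, -0.3, 0.8]) \
    + rng.normal(scale=0.1, size=n)

# Stacked model: learns how to adjust the base forecast using the covariates.
X = np.column_stack([base_forecast, covariates])
stacker = LGBMRegressor(n_estimators=200).fit(X[:400], actuals[:400])
adjusted = stacker.predict(X[400:])
```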
Ah. Thank you. The same concept goes under different names, so one needs to search for all of "exogenous variables", "external regressors", "external factors" and "covariates".
It may not be known yet, and this project seems to be targeted at Gaussian distributions, but wouldn't the simplicity bias reduce sensitivity? I mean, attention in transformers works so well in part because OOD inputs are typically close enough.
Probably just my own bias because it seems everything I deal with is at least MArP and anomalies are important to my use case.
I can see where this is useful for others, even Amazon suggests ARIMA or ETS if you don't have hundreds of related streams.
Is this more targeted at people who want more smoothing?
It's great to see research in this field; I know there is opportunity here, and I hope to someday benefit from progress. But I skimmed the paper, and it doesn't appear to solve a problem that I have. From a practical standpoint, what I want from a time series tool includes:
1) a small set of simple levers that I can review and tune
2) short training time for any input set of size O(10k) to O(100k) (this covers seconds/day, minutes/week, hours/year)
3) train + forecast running fine on CPUs, not GPUs, with low memory overhead
4) decent out-of-the-box performance that basically passes the sniff test
5) a simple way to include regressors
I've enough experience to have learned to be wary of fully automated tuning, benchmark performance metrics, elaborate models, etc.
You make money if you have useful data others don't have, or if you have better algorithms than others are using.
When these become publicly known and used, your system doesn't work any more because the prices now include whatever signal you had for yourself before.
It's a bit more subtle than that, because there are feedback loops in the system. When a signal or factor spreads, it does so at multiple time horizons.
e.g. If I have a good signal at predicting horizon 1 day, then it is in my interest to have many people trading it at horizon > 1 day, as they will push the price in my direction.
I doubt the differences in performance between all the "neural" models are statistically significant. It strikes me as odd that a model like TFT can be the worst of the "neural" models in one benchmark and at the same time the best in another. Also, what is the point of Benchmark I? "It comprises 15 datasets that were also part of the training data of Chronos models." That is not forecasting; that is just remembering/overfitting these time series.