Hacker News
MIT researchers use large language models to flag problems in complex systems (news.mit.edu)
95 points by fluxify on Aug 15, 2024 | 32 comments


Sadly, the paper benchmarks on datasets that:
- are known to be pretty useless
- contain mistakes
- can be misleading with a naive F1-score measure

(To be fair, they write "we looked at the F1-Score, under which both partial and full anomaly detection are considered correct identification", so this may be mitigated, but it's not clear.)

See https://kdd-milets.github.io/milets2021/slides/Irrational%20... So it's hard to take any benchmark from the paper seriously. The paper also ignores any recent work (after 2018) on univariate time series anomaly detection in the matrix profile space (e.g. MADRID).
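To make the F1 caveat above concrete, here is a minimal sketch of how much the scoring rule alone can move the number. It assumes one plausible reading of "partial and full detection are considered correct" (any overlap with a labelled anomaly segment counts as a hit); the paper's exact scoring may differ, and the function names and toy labels are made up for illustration.

```python
def pointwise_f1(truth, predicted):
    """F1 where every timestamp is scored independently."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def segments(labels):
    """Turn a 0/1 label sequence into (start, end) pairs, end exclusive."""
    out, start = [], None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out


def overlap_f1(truth, predicted):
    """F1 where any overlap between a true and a predicted segment is a hit."""
    true_segs, pred_segs = segments(truth), segments(predicted)

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(1 for t in true_segs if any(overlaps(t, p) for p in pred_segs))
    fn = len(true_segs) - tp
    fp = sum(1 for p in pred_segs if not any(overlaps(t, p) for t in true_segs))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy labels (hypothetical): one 4-point anomaly, detector hits only its last point.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(pointwise_f1(truth, predicted))  # 0.4
print(overlap_f1(truth, predicted))    # 1.0
```

Same predictions, very different scores, which is why it matters which variant a benchmark table reports.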

The "practicality of usage" and conclusion sections are pretty correct though: it's expensive, slow, and zero-shot is worthless if some other methods can train and infer in orders of magnitude less time.

It would have been interesting to see how the DETECTOR method performs when the LLM forecasting is replaced with some standard forecasting (e.g. some auto ETS, ideally robust to anomalies in the training data). It looks like the natural follow-up to this article is to remove the LLM altogether.
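A rough sketch of what that swap could look like, assuming plain simple exponential smoothing (the simplest member of the ETS family) as the forecaster rather than a proper auto-ETS search. This is not the paper's DETECTOR code; the function name, parameters, thresholds and data are all hypothetical.

```python
import statistics


def ses_residual_detector(series, alpha=0.3, k=4.0, warmup=5):
    """Flag points whose one-step-ahead forecast error is unusually large.

    The forecast is simple exponential smoothing; the error scale is a
    median/MAD estimate over past non-anomalous residuals.
    """
    level = series[0]
    residuals = []
    flagged = []
    for i in range(1, len(series)):
        error = series[i] - level  # SES forecast for step i is the current level
        is_anomaly = False
        if len(residuals) >= warmup:
            med = statistics.median(residuals)
            mad = statistics.median([abs(r - med) for r in residuals])
            # 1.4826 * MAD approximates the standard deviation for Gaussian data
            scale = max(1.4826 * mad, 1e-9)
            is_anomaly = abs(error - med) > k * scale
        if is_anomaly:
            # Skip the update so the anomaly does not contaminate the forecaster
            flagged.append(i)
        else:
            residuals.append(error)
            level = alpha * series[i] + (1 - alpha) * level
    return flagged


# Made-up sensor readings with one obvious spike at index 9
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.2, 10.1, 9.9, 25.0, 10.0, 10.1]
print(ses_residual_detector(readings))
```

Skipping the level update on flagged points is one cheap way to get the "robust to anomalies" property; a real comparison would also need trend and seasonality handling.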


> For the second approach, called Detector, they use the LLM as a forecaster to predict the next value from a time series. The researchers compare the predicted value to the actual value. A large discrepancy suggests that the real value is likely an anomaly.

Unless there's more to it in the actual paper, this is how just about every anomaly detection technique already works. You fit a model of the distribution of data under normal-enough circumstances, and an observation is an "anomaly" if it seems very improbable (based on your model), or is otherwise extreme if your model isn't explicitly probabilistic.

So yes, this technique would be great if you removed the LLM: it's already the industry standard framework.
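For anyone who hasn't seen that framework spelled out, a bare-bones sketch: fit a distribution to data you believe is normal, then flag observations that are very improbable under it. A single Gaussian is assumed here purely for illustration; the function names, threshold and numbers are invented.

```python
import math
import statistics


def fit_gaussian(reference):
    """Estimate mean and standard deviation from 'normal enough' data."""
    return statistics.fmean(reference), statistics.stdev(reference)


def tail_probability(x, mu, sigma):
    """Two-sided probability of seeing a value at least this far from mu."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2))


def flag_anomalies(reference, observations, p_threshold=1e-3):
    """Return the observations that are very improbable under the fitted model."""
    mu, sigma = fit_gaussian(reference)
    return [x for x in observations if tail_probability(x, mu, sigma) < p_threshold]


# Hypothetical temperature readings under normal operation, then new data
reference = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
observations = [20.0, 20.4, 23.5]
print(flag_anomalies(reference, observations))
```

Swap the Gaussian for a forecaster, an autoencoder or anything else and the structure stays the same: model of normal, score, threshold.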

There's nothing wrong conceptually with trying to plug in a transformer model here. The problem is the presumption that a "large" pre-trained transformer model can actually work effectively on arbitrary time series.


> It looks like the natural follow up of this article is to remove the LLM altogether.

Welcome to scientific papers in 2024…


Some people, when confronted with a complex system, think "I know, I'll use a large language model." Now they have two complex systems interacting.


The human body is orders of magnitude more complex than LLMs and we can already use LLMs to improve anomaly detection rates for cancer.


> improve anomaly detection rates for cancer.

What do you mean by that?


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11003452/

“We propose and demonstrate the use of these models to assist in the manual curation process which results in higher accuracy and F1 score with lesser time and cost, thus improving efforts of cancer research.”


How does this relate to anomaly detection?


Anomalies are markers to keep track of when diagnosing.


That paper doesn't even mention the word anomaly or detecting them.


Cancer is anomalous cell behavior.


> Now they have two complex systems interacting.

As opposed to sending crowds of people to interact with the system? We are even more complex and less benchmarked/tested than LLMs. At least the LLM is a known thing: we know its limitations, and we can evaluate and compensate for its shortcomings.


Some people trivialise things in bad faith to score internet points on Hacker News, I guess. The status quo, per the article, was already using “deep learning” models to perform anomaly detection. We are already talking about complex, black-box systems.

The stated advantage was that they didn’t require deployment-specific training. You can just throw a pre-trained LLM at it. There is also a stated benefit: early-stage detection, without needing to pay for, or wait for, training a custom ML model. The article is quite open about the fact that the LLM approach doesn’t beat the state of the art in terms of accuracy.

It’s like you didn’t click the link.


>early-stage detection, without needing to pay for, or wait for, training a custom ML model

This doesn't make much sense, though. If you're in a hurry to get something working and don't have the data to train a custom model, you'd use the knowledge you have of the system to set up a typical statistical anomaly detector. That way you'd have the insight into the system's behavior that you'd want early on, you'd be able to enforce all sorts of guarantees, and you wouldn't put the weight and complexity of an LLM on the system's resources. That beats worrying about an LLM that has nowhere near the same information about the system.

Using an LLM for this would be like hiring a random person off the street to keep your network infrastructure working because you're in such a rush that you don't even want to document how your infra is set up, let alone wait to find someone who would know how to decipher it.


The metrics OP suggests increase with the transition from DL to LLMs (and OP had no immediate duty to discuss other metrics).


Colour me old-fashioned, but I am really sceptical that a zero-shot, general-purpose LLM-based anomaly detector can outperform, in entirely new domains, a trivial, not-obviously-wrong ML/statistical model calibrated on a small subset of the data, which makes training extremely fast. And in anomaly detection, false negatives (and even false positives) can be very expensive.

The most promising role for LLMs in time series prediction is in extracting covariates from unstructured data: in the wind example, weather reports and geographically proximal social media posts.


You are skeptical about something that is explained quite thoroughly in the article. Did you read the article?

Edit: I meant they never claim to outperform SOTA domain-specific approaches. So, I don't know whose claim you are refuting.


> While LLMs could not beat state-of-the-art deep learning models at anomaly detection, they did perform as well as some other AI approaches.

I think one should also be sceptical about the current hype train


Funny how one of the few comments actually discussing the content of the article gets downvoted while generic memes that have nothing to do with it get upvoted.


I once had to build a complex NN-based system for this exact task. We also pitted this system against a consultant building a basic XGBoost classifier working with a series of sliding windows over sensor data. XGBoost won. It was also faster to run and faster to train.
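For readers unfamiliar with the setup, sliding windows over sensor data usually means turning each window into one row of summary features that a tabular classifier can consume. The sketch below is a guess at the general shape, not the consultant's actual pipeline; the feature choice and the function name are assumptions.

```python
import statistics


def window_features(readings, width, step=1):
    """Summarise each sliding window of sensor readings as one feature row."""
    if width < 2:
        raise ValueError("width must be at least 2")
    rows = []
    for start in range(0, len(readings) - width + 1, step):
        window = readings[start:start + width]
        rows.append({
            "start": start,
            "mean": statistics.fmean(window),
            "std": statistics.pstdev(window),
            "min": min(window),
            "max": max(window),
            # Average change per step across the window
            "slope": (window[-1] - window[0]) / (width - 1),
        })
    return rows


# Made-up readings; each row could be paired with a label and fed to a
# gradient-boosted classifier.
rows = window_features([1.0, 2.0, 3.0, 4.0, 5.0], width=3)
for row in rows:
    print(row)
```

Most of the work in this kind of baseline is picking the window width and features, which is also where domain knowledge gets in.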

I'd still love to see how a VLM would do with sensor data, as anomalies are often quite obvious to humans visually. Especially if you're allowed to do comparison overlays.


This exercise seems to highlight the need for a cheap (to use), all-purpose framework for an efficient function approximator (as bare ML is too costly).

They are trying LLMs for this purpose, but maybe the structure of an optimal architecture should be studied.


Is this saying we can use an LLM to understand non-language signals? Sounds fun!


Why not?

Whether through analogy or an actual underlying isomorphism between the mechanisms underpinning language and other domains, I don’t see a reason LLMs can’t occasionally have insights into non-language problems.

Is it better than other methods? No. Is it efficient? Absolutely not.

I don’t work with LLMs, but I think a lot of HN users are prematurely skeptical of the potential low-hanging fruit across many domains that can be explored with these new, convenient but invariably suboptimal tools.


Because they're not a magic box that spits out intelligence; they're a reflection of their training data, which will not contain these signals.


"LLMs" are really general-purpose sequence models. No reason it can't work, but so far results have been mixed in practice.


Yes, those systems can understand a bit of math. As long as they have memory and can compare values, it should work.

But it's weird that they would seriously try that given https://github.com/NX-AI/xlstm already exists. I mean, it's cool to know that it works, but I don't get why they would invest any more time trying to improve the results.


Graph Reasoning exists.


> converts time-series data into text-based inputs an LLM can process.

What? Which the model then tokenizes? I am struggling to make sense of this.


I read this as basically finding a language representation of timeseries data that would be understood by LLMs better than feeding raw records. I'm guessing it tokenizes very similarly to any other LLM. Perhaps I misread, though.
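As a purely illustrative guess at what such a conversion might look like (the article only says the data is converted to text, so the shifting, rounding and comma separator here are all assumptions, as are the function names):

```python
def series_to_text(values, decimals=1, sep=","):
    """Encode a numeric series as a compact string of non-negative integers.

    Values are shifted so the minimum becomes 0, then scaled by 10**decimals
    and rounded, which keeps the string short and free of signs and dots.
    """
    offset = min(values)
    scaled = [round((v - offset) * 10 ** decimals) for v in values]
    return sep.join(str(n) for n in scaled), offset


def text_to_series(text, offset, decimals=1, sep=","):
    """Invert series_to_text, up to the rounding error."""
    return [int(token) / 10 ** decimals + offset for token in text.split(sep)]


# Hypothetical readings
text, offset = series_to_text([20.5, 20.7, 21.0, 35.2])
print(text)  # 0,2,5,147
print(text_to_series(text, offset))
```

The resulting string is then tokenized like any other prompt, so how digits and separators split into tokens depends on the model's tokenizer.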


This is the modern-day equivalent of "every signal is an image" after the initial success of deep learning algorithms in image classification tasks. It is just academia chasing the hype train...


Obligatory link from a sceptic: https://arxiv.org/pdf/2009.13807

> However, they wanted to develop a technique that avoids fine-tuning, a process in which engineers retrain a general-purpose LLM on a small amount of task-specific data to make it an expert at one task.

This method does not avoid fine-tuning. It just offloads the task to somebody else (i.e., to the LLM).

I'll buy the promise of the approach when the authors can show that they can vastly outperform an AR time series model or the simple techniques mentioned in the linked article.


Finally, a use for LLMs?



