No offense, but isn't the NLP field effectively solved with the creation of LLMs, or at least for the majority of the tasks you would expect from an NLP application?
I am sure you can find some special areas or niches where traditional NLP approaches would outcompete a black box like LLMs. But with LLMs becoming much more efficient now after quantization, to the point where you can run them locally, I think there is a good argument for saying simple NLP is basically solved.
> I think there is a good argument for saying simple NLP is basically solved
In my experience LLMs can get about 70-80% accuracy on a bunch of NER and text classification tasks if you give them a reasonable prompt. That's not nothing and it's something that you can get started with super quickly. But you'll have slow responses and typically a 3rd party running the inference.
Annotating data yourself to about 2000-3000 examples, on the datasets that I ran my benchmarks on, may get you closer to 80-90%. You'll typically also get fast inference that you can run on your own hardware no problem. By annotating the data myself I also like to think that I understand the problem much better as a consequence.
Don't get me wrong. LLMs are cool and interesting ... but they don't seem to replace old-school methods just yet.
Indeed, one of my last big projects I worked on before I retired was using an LSTM model for named entity recognition. It was many times faster than anything you can do with an LLM.
Manually correcting the wrong 20% may be a reasonable amount of work, but you must examine all 100% to find the 20% that need fixing. And that is most likely not a reasonable amount of work.
I’ve done image annotation for ML / computer vision. I found it extremely useful to use my first annotations to train a poor model, then use results from that to annotate new data. You get feedback on model quality as you go, and looking at 100 images with multiple objects is way less work than annotating 100 images.
If it’s worth it? I guess that depends on the project you’re working on.
LLMs are useful for things like predicting/generating text, and summarizing text. They are not as useful for other NLP tasks, such as:
1. Identifying (and highlighting/extracting) spans of text that are in a different language than the surrounding text (e.g. a French phrase inside English text).
2. Text search and highlighting, where you need to perform word stemming or lemmatization on the search input, find matching documents, and highlight spans.
Also, LLMs have a different tokenization model, so you have tokens like "don"/"'t!" -- this makes them difficult to work with in downstream NLP applications where having words, numbers, punctuation, and symbols as specific tokens is more useful (a rough spaCy sketch of the tokenization/lemmatization side is at the end of this comment).
NLP is not a solved task, as things like part of speech classification (identifying nouns, adjectives, etc.) are not 100% accurate, and tend to have lower sentence-level accuracy than word-level accuracy.
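For comparison, here's a minimal sketch of the word-level tokenization and lemmatization side with spaCy (assuming the stock en_core_web_sm pipeline is installed; the sentence is just an example):

    import spacy

    # assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The suspects were arrested while fleeing the scene.")

    # word-level tokens, not subword pieces like "don"/"'t"
    print([t.text for t in doc])
    # lemmas you could index for search-time matching
    print([(t.text, t.lemma_) for t in doc])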
1. Prompt:
"""
The following English text contains several French phrases. List all of them.
Text: Gabonese President Ali Bongo Odimba was deposed in a coup d'etat spearheaded by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo had won the 2023 election.
List of French phrases from the text:
-
"""
Response:
"""
coup d'etat
- aide-de-camp
"""
2. Prompt:
"""
Turn all words in the following text into their lemma form.
Text: Gabonese President Ali Bongo Odimba was deposed in a coup d'etat spearheaded by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo had won the 2023 election.
Lemmatized text:
"""
Response:
"""
Gabonese President Ali Bongo Odimba be depose in a coup d'etat spearhead by his father's former aide-de-camp Brice Oligui Nguema, shortly after the announcement that Bongo have win the 2023 election.
"""
Maybe too slow or too inaccurate or too expensive compared to bespoke solutions, but certainly not completely useless.
If you want a different tokenization, just turn the LLM response back into text and retokenize it.
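Something like this, for instance (just a sketch; the GPT-2 tokenizer stands in for whatever LLM you're actually using):

    import spacy
    from transformers import AutoTokenizer

    llm_tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in LLM tokenizer
    nlp = spacy.blank("en")                           # rule-based word tokenizer

    ids = llm_tok("don't!")["input_ids"]
    text = llm_tok.decode(ids)                # back to plain text: "don't!"
    print([t.text for t in nlp(text)])        # word-level tokens: ["do", "n't", "!"]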
There's nothing about LLMs that is inherently nondeterministic. Sure, if you're using some API you have no control over, anything could happen. But if you run it on your own hardware, you can make it as deterministic as any classic NLP approach.
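For example, with a locally hosted model you can just use greedy decoding (a rough sketch with Hugging Face transformers; "gpt2" is only a stand-in for whatever model you run, and bit-exact repeatability also assumes the same hardware and library versions):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"   # stand-in; swap for the model you actually host
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tok("List the French phrases in the text:", return_tensors="pt")
    # greedy decoding: no sampling, so identical input gives identical output
    out = model.generate(**inputs, do_sample=False, max_new_tokens=50)
    print(tok.decode(out[0], skip_special_tokens=True))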
And whether you always have to check everything is a separate question from nondeterminism. You could have a deterministic heuristic that is often wrong in a domain where mistakes are fatal, or you could have a nondeterministic model that is almost always correct for a task where errors cost next to nothing.
That's what I meant by "if you're using some API you have no control over, anything could happen." OpenAI could offer access to a fully deterministic version at a higher price point, but they choose not to and there's nothing you can do about that.
If you'd read the source you linked you'd see that no, the system is inherently non-deterministic. They cannot sell you a deterministic version of GPT-4. There is no such thing. They could sell you a deterministic version of some other inferior model.
That link is dead.
Either way, you can't interact with GPT-4 without OpenAI's API, so you don't know what they're doing in between your request and inference.
But LLMs are deterministic when all the parameters are the same between inferences
I agree LLMs alone aren't good at search but their embeddings replace the need for stemming, manual synonym lists, etc in most cases. LLMs can also be used for query understanding which can improve the keywords submitted to the engine and extracting the best snippet for a highlight. LLMs + search are better than either alone. However LLMs still have an inference performance/cost issue which may make them unsuitable for some search use cases.
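A rough sketch of what I mean by embeddings standing in for stemming and synonym lists (assuming sentence-transformers and a small off-the-shelf model):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model
    docs = ["The car wouldn't start this morning.",
            "Parliament passed the new budget bill."]
    query = "automobile trouble"                      # no shared stem with doc 0

    doc_emb = model.encode(docs, convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    print(util.cos_sim(q_emb, doc_emb))   # doc 0 scores higher despite zero keyword overlap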
This is a common opinion, but when I speak to small companies that want to use NLP (e.g. in the medical domain) and I give an account of the advantages and disadvantages of "classic" NLP vs. generative LLMs for information extraction applications, they tend to prefer classic NLP more often than not.
The possibility of models making things up, the lack of explainability, and the high costs (or the need to use third-party services and upload sensitive data who knows where) are red flags for many.
This may change in a few years if the weaknesses of generative LLMs are successfully addressed, but for the moment I think "classic" NLP still has its place.
1) It is extremely rare for a field to ever 'be solved'. There is still active research into how to multiply 2 numbers together. NLP is not anywhere close to solved.
2) LLMs have different trade-offs to fundamental techniques. Linear regression still gets lots of use despite there usually being a theoretically better method for any specific application. There will be parallels to that in NLP.
3) Isn't the article talking about training things like LLMs? It is right there - "Chapter 4: Training a neural network model".
Funny thing, you can get LLM+Code Interpreter to do your linear regression with sklearn. So LLM+toys "can do" linear regression, or any other library algorithm.
And I think I read in some paper that pure LLMs can do regression tasks in few shot mode. Maybe not as good as sklearn, but a decent result.
I'm unimpressed. I'd guess that every modelling technique can do linear regression. It is literally drawing a line through data. If you are lucky enough, even a one parameter model can get pretty close to what a linear regression does. But doing linear regression with LLMs is wasteful. The computational power alone makes it a party trick and not much more.
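To underline how small the classic version is, the whole thing is a couple of sklearn lines (standard library usage, nothing exotic):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    reg = LinearRegression().fit(X, y)   # fits the line near-instantly
    print(reg.coef_, reg.intercept_)     # slope and intercept
    print(reg.predict([[5.0]]))          # extrapolate to a new point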
The #1 thing for me right now is determinism and traceability. I am in a "serious" business domain and non-determinism is a big no-no. We have to be able to justify everything in a traditional sense at the end of the day. Explaining to a regulator that we declined a customer as a consequence of a vast, unthinkable sea of weights and biases is not going to fly.
For each predicted output token, I want to know exactly which source document(s) were utilized including indices from those documents and relevant statistics.
> For each predicted output token, I want to know exactly which source document(s) were utilized including indices from those documents and relevant statistics.
You don't do this for any other kind of decision or tool; why do you need it for LLMs?
I think, in truth, you need a source that convinces you (or the regulator) that your choice is acceptable, so that you can pass off the responsibility.
If the LLM were to give you an answer backed by relevant (cited) source documents of regulations and a good explanation, it would make no difference compared to a human worker doing the same -> this is already possible
I don’t understand what it means to be “solved”. It’s like saying that “architecture is now solved”, “physics is solved”, or “programming is solved”.
It’s a field of science and/or engineering; it’s not like we will ever run out of things to try/build/investigate. LLMs work… to a certain extent, with limitations and tradeoffs, and for some things. Would you spend days, money and CO2 to split a huge text corpus into sentences with an LLM, when a simpler program can do it just as well, there’s no need to find “the perfect prompt” (and hope that the LLM doesn’t forget a sentence, add something in between, etc.), and it gets done in three hours?
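For the sentence-splitting example, the "simpler program" is a few lines (a spaCy sketch; NLTK's sent_tokenize or even a regex would also do):

    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")   # fast rule-based sentence boundaries

    doc = nlp("First sentence. Second one! And a third?")
    print([s.text for s in doc.sents])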
I take it to mean that there is an effective generally accepted solution or methodology for problems in the field. Bridge building has been largely solved by methods of mathematical and computational structural analysis, manufacturing, and government regulation. We know how to build a bridge. Before the solution was known, designers would just go by intuition and we wouldn’t have any actual assurance that the bridge would hold. We can probably never solve larger domains like physics, programming, or architecture.
But your bridge building analogy doesn't define what "solved" means still. When it comes to bridge building, "solved" means the bridge won't collapse under expected conditions. All you did was bring up an area where "solved" does have a definition, but that does nothing to define "solved" in the field we're discussing.
I think NLP is closer to architecture than to bridge building (and even there, we’re probably still researching stuff to know more about how to calculate the stresses and whatnot)
Okay so, first some terminology. LLMs can mean a bunch of different things; people call models the size of BERT LLMs sometimes. So let's talk specifically about in-context learning (ICL) with either zero or a few examples. So we'll say LLM ICL, and contrast that with techniques where you annotate enough data to train with, which might only be something like 10-40 hours of annotation. What you then do with that data is probably training a task-specific classification model initialised with weights from a language modelling objective. This is sometimes called "fine-tuning", but fine-tuning can also mean taking an LLM and adapting its ICL. So we'll just call it "training", and the fact you use transfer learning like BERT or even word vectors is just tactics.
Okay. So, here's something that might surprise you: ICL actually sucks at most predictive tasks currently. Let's take NER. Performance on NER on some datasets is _below 2003_. Here's some recent discussion: https://twitter.com/mayhewsw/status/1700139745769046409
The discussion focusses on how bad the CoNLL 2003 dataset is, and indeed it's a crap dataset. But experiments have also been done on other datasets, e.g. check out the comparison of ICL and training in this paper from Microsoft: https://universal-ner.github.io/ . When GPT4 is used this one paper reports it slightly better on some tasks: https://arxiv.org/abs/2308.10092 . Frustratingly they don't do enough GPT4 experiments. This other paper also does a huge number of experiments, but not with GPT4: https://arxiv.org/pdf/2303.10420.pdf
The findings across the literature are really clear. ICL is generally much worse than training a model in accuracy, and you generally don't need much training data to surpass ICL in accuracy.
For tasks like text classification, ICL sometimes does okay. But you need to pick the problem characteristics carefully. Most text classification tasks people actually want to do have something like 20 labels, the texts are kind of long, and the labels don't capture the category especially well. Applying ICL to such tasks is very non-trivial. Your prompt balloons up if you have lots of classes to predict between, and providing the examples is hard if your texts are even a few hundred words.
Let's say you want to do something ultra simple: classify articles into categories for some news site or blog. This is the type of problem text classifiers have been eating for breakfast for 20 years. This is not a difficult problem -- a unigram bag of words does fine, and the method of estimating the weights can be almost anything, like just averaged perceptron will be totally okay.
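A sketch of the kind of thing I mean (unigram counts plus any simple linear estimator; the toy texts and labels here are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # toy training data; in practice a few thousand labelled articles
    texts = ["Stocks rallied after the central bank decision.",
             "The striker scored twice in the final.",
             "New GPU benchmarks leaked ahead of launch."]
    labels = ["business", "sport", "tech"]

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 1)),   # unigram bag of words
                        LogisticRegression(max_iter=1000))     # estimator hardly matters
    clf.fit(texts, labels)
    print(clf.predict(["The midfielder was transferred for a record fee."]))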
But how will an LLM be able to do this? Probably your topic categories include several different types of article under them. If you know what those types of article are you can separate them out and make sure they're represented in the prompt. But now we're back at needing a tonne of domain knowledge about your problem -- that's like having to write custom features to make your model work. We all switched to deep learning so we wouldn't have to do that.
LLMs build a much more sophisticated representation of the meaning of the data to classify. But then you give them very few examples of the problem. So they can only build a shallow function from this deep representation. If you have lots of examples, you can learn a complex function from shallower features. For a great many classification tasks, this is better. The rest of your system usually needs the classification module to have some sort of consistent behaviours anyway. To do that, you basically have to make an annotation manual, and then you want to annotate evaluation documents. Once you're there the effort to make training data and train a model is minimal.
The other elephant in the room is the expense of the LLM solutions. The papers are missing results on GPT4 not because they're lazy, but because it's so expensive to use GPT4 as a classification solution that they want to get the rest of their results out the door.
The world cannot migrate all its current NLP models for text classification and NER to ICL. There are nowhere near enough GPUs in the world for that to happen. And I don't know about you, but I expect the number of text classification and NER models to grow, not shrink. So, the idea that we'll stop training small models for these tasks is just implausible. The OpenAI models that support batching are almost viable for prediction, but models like GPT4 don't support it (perhaps due to the mixture of experts?), so it's super slow.
The other thing is, many aspects of language that are useful as annotations are consistent linguistic features. The English language codes for proper names and numeric entities. They behave differently in the grammar. So some sort of named entity annotation can be done once, and then the model trained and reused. This is what spaCy does. We do this for a variety of other useful annotations across languages. We actually need to do much more: we need to collect new annotations for these models to keep them up to date, and we need to do this for more tasks, such as semantic role labelling. But it's definitely a good way to reuse work. We can do this annotation once, train the models, and users can reuse the models.
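That reuse looks something like this in practice (assuming one of the stock English pipelines is installed; the sentence is just an example):

    import spacy

    nlp = spacy.load("en_core_web_sm")   # annotated and trained once, reused by everyone
    doc = nlp("Brice Oligui Nguema seized power in Gabon in August 2023.")

    for ent in doc.ents:
        print(ent.text, ent.label_)      # e.g. PERSON, GPE, DATE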
The strength of ICL is that you can get started very easily, without doing the work of annotation and training. There's lots of research on making ICL few-shot learning less bad on arbitrary text classification, NER and other tasks. We're working hard to take these results from the literature and build best-practice prompts and parsers you can use as a drop-in annotation module in spaCy: https://github.com/explosion/spacy-llm . Our annotation tool Prodigy also supports initializing the annotations from an LLM, and just correcting the output: https://prodigy.ai . The idea is to let you start with an LLM, and then transition to a model you train yourself, which can be run much faster.
"If you know what those types of article are you can separate them out and make sure they're represented in the prompt. But now we're back at needing a tonne of domain knowledge about your problem -- that's like having to write custom features to make your model work"
I think this is where the different perspectives come into play.
If you're an NLP practitioner you are thinking, oh no! I need to know a lot about audience intention and how the articles are represented navigationally and the kind of variety people are looking for and how articles might fit multiple categories, etc, etc. And you have to think about these things on a meta level ("prompt engineering"), because you have to instruct the model on how to act in an abstract way.
If you're someone who wants to run a news site then you already are thinking a ton about these things, and probably have a dozen things you'd like to change and adjust, new ways of presenting content, etc. You _wish_ you could be thinking about these domain-specific topics.
What feels like a bug to the NLP practitioner – needing a deep understanding of the domain – is a feature to... everyone else. It's a feature to the people who care most about the results.
The other big perspective difference here, I believe, is how you think about goals. How many tasks ARE categorization? My intuition is that it's a quite small number. There are many tasks that can be implemented with one step as categorization, but that is seldom the task. To the NLP practitioner categorization might seem very prominent – that's when someone calls you up or hands over the work. But with an LLM you might be able to do a much larger portion of the real task, with or without categorization.
Even with a categorization task, when I'm working with an LLM I usually don't produce just a "category", but produce other information at the same time, often using natural language as a first-class data type because it can be fed back into an LLM. In my experience the results are often (usually?) much higher quality because I'm not breaking things down into steps where the inaccuracies propagate between steps, but instead going right for the result, and using a model that can basically "check" itself against general knowledge, scrubbing out nonsensical results during inference. (As a result the remaining inaccuracies often appear plausible and are labeled "hallucinations"... it can make things more challenging, but what we don't see are the multitude of obvious inaccuracies that a more traditional NLP system would create, and which in a sense exist momentarily during the LLM inference.)
I know what you mean about the domain knowledge, and it's a thing that's a bit different from the previous situation with the feature engineering. The problem with feature engineering for linear models was you really had to understand the domain _and_ the ML.
I do think there's a similarity in how creative you need to be though. It means that applying LLMs to new problems isn't as straightforward as people make it seem at first, and isn't necessarily reliable. In contrast, labelling data is something that has a much smoother effort-to-reward curve for most problems. The experience of labelling the data, training a model and getting it hosted isn't as seamless as it could be -- we're working on that.
I do think classification is pretty fundamental though. The way I see it is, model outputs can be either human-facing, machine facing, or both. If you're going to feed the output into another system, that system wants the data to obey some limited schema, so that you can run logic based on it. For instance, let's say you want to trigger some alert when a particular kind of article is published or a particular kind of message is sent. Triggering the alert is a boolean thing, so that has to be a classification task. You might want to also attach text in the alert, so that's a human-facing part.
I agree that there's lots of ways that LLMs can be used iteratively, allowing more trade-off of computational cost for accuracy. I just think in a lot of cases, the best way to exploit that is to trade towards as much accuracy as you can get, and use that to create training samples. You can then manually correct the training samples as well --- if they're mostly correct, reviewing them is pretty quick. You can then train and evaluate a smaller model.
"For instance, let's say you want to trigger some alert when a particular kind of article is published or a particular kind of message is sent. Triggering the alert is a boolean thing, so that has to be a classification task."
I think this is a good example of how decomposed tasks can feel very different from goal-oriented task definitions.
We can imagine a goal like "I want to know when one of my competitors shows up in the news." Now you have a bunch of tasks: entity extraction, determining the subject of an article, maybe categorizing articles. Then you can define a pipeline and conditional to trigger a notification. And you might get great accuracy on each of these.
But the goal is really about getting actionable information. In practice the approach above creates a ton of alerts, and the person receiving them will filter through them, ignore a bunch, have to figure out what is really new, etc. An LLM could do things like accumulate a running set of background knowledge, identifying what information is truly "new" (and in a granular way, not just detecting duplicate articles). You can tell the LLM all kinds of details about what you are interested in, "categories" that are completely inaccessible to traditional NLP because they are described by higher-level concepts or have to be combined with history or user-provided context (something that happens naturally in a prompt).
Traditional NLP feels very industrial to me. Factories can be very productive and high volume, but they redefine the tasks to satisfy their processes. Individuals don't interface with factories, and factories don't serve individuals.
I think the 'industrial' or assembly-line analogy is probably good, and I see what you mean about the alternative system design. Thanks for explaining the other approach.
I'll put it this way. If you want to integrate ML into a product, or even a system with lots of internal users, you end up increasingly towards the 'factory' approach.
>ICL is generally much worse than training a model in accuracy, and you generally don't need much training data to surpass ICL in accuracy.
"For the same model" is a huge asterisk you seem to be missing here. Fine-tuned GPT-4 is better than ICL GPT-4 and so on, but there's no guarantee that fine-tuned GPT-3 will be better than ICL GPT-4 -- see how GPT-4 beats the fine-tuned Med-PaLM on medical domain tests.
I'm making the stronger claim that fine-tuned BERT is almost always better than ICL GPT4. That is, you beat ICL GPT4 with a model of less than 1b parameters. These are the results in the literature.
Fine-tuning GPT3 is actually a strange middle-ground: you're constrained to solve the tasks via prompting, rather than attaching output layers and doing gradient descent over a loss function that directly matches the task. So fine-tuning LLMs for these classification tasks is indeed often worse than just using the larger model.
The other thing is, as you scale up the model the ratio of output weights to the whole model weights gets quite out of balance. This introduces some technical challenges, and so it's kind of tough to make it all work with medium size LLMs of around 7b parameters or so. In contrast the process of fine-tuning BERT-sized stuff is really well explored, and it's easy to find more or less push-button solutions.
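By "push-button" I mean roughly this shape of code (a sketch with Hugging Face transformers and datasets; the toy data and hyperparameters are placeholders):

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)   # classification head on top of BERT

    # toy data; in practice a few thousand annotated examples
    data = Dataset.from_dict({
        "text": ["great product, works fine", "broke after two days"] * 8,
        "label": [1, 0] * 8,
    })
    data = data.map(lambda b: tok(b["text"], truncation=True, padding="max_length"),
                    batched=True)

    args = TrainingArguments(output_dir="bert-clf", num_train_epochs=1,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=data).train()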
LLMs are not even somewhat fast when measured against CPU models in spaCy that are specialized for a task. They are multiple orders of magnitude slower for e.g. token classification/NER, text classification, etc.
That's not to mention the plumbing that spaCy provides, which is mostly bindings to C code for tokenization, lemmatization, etc. -- things that are more algorithms problems than machine learning problems.
I think for a lot of "simple NLP" stuff like constituency parsing, dependency parsing, POS tagging, etc. there have been fast and accurate models/algorithms for quite some time, and throwing even a pruned, quantized, hyper-optimized LLM at them would feel like shooting a shotgun at a mosquito.
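E.g. POS tags and a dependency parse are one call away with the small spaCy pipeline, and it runs fine on a CPU:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)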
Quite honestly, getting LLMs to generate regex solutions seems most feasible to me. A regex can run through a character vector with millions of entries in impressive time. Plugging all of those strings into an LLM and awaiting responses that take 20 seconds each might be impractical for most use cases. However, asking an LLM to pack the regex statement full of synonyms and | operators might well be a better solution, especially if you can give the LLM some samples of what you're looking for.
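Something along these lines, say (the pattern is just an illustration of what you'd ask the LLM to draft):

    import re

    # pattern an LLM might produce: the target term plus synonyms, joined with |
    pattern = re.compile(r"\b(invoice|bill|receipt|statement)\b", re.IGNORECASE)

    corpus = ["Please find the attached bill.",
              "Meeting notes from Tuesday.",
              "Your receipt is enclosed."]
    print([text for text in corpus if pattern.search(text)])   # fast over millions of strings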
"NLP is basically solved" --- somewhat but not the entirely yet. I work on a variety of specialized NLP use-cases for clients and there are different strengths these approaches have. The big thing with LLMs is the ability to deploy something quickly if (a) if you can craft an appropriate prompt, (b) put up enough guardrails to stop surfacing hallucinated responses to users. For assistant or co-pilot kind of systems (b) is somewhat easy to deal with, since the user is expected to curate or edit the LLM's responses, hence their proliferation in such use-cases. Note though: if the LLM sounds authoritative enough it might bias the user into believing it is right - not a big problem when the user is relying on it for generating good prose, but this is a problem when presenting facts (esp. based on a specialized knowledge base).
The downsides are (a) latency, (b) cost, and (c) the need for specialized training. In some applications you require near real-time responses, and smaller models are still better here. The cost angle is a little tricky - it depends both on the volume of calls you want to make and on whether this cost translates into revenue for the possibly incremental benefit you derive from an LLM. As an example, let's say you have a chatbot that does NER/slot-filling using spaCy or stanza today, and let's also assume ChatGPT can do better - does the incremental accuracy, which comes at a cost since you're paying OpenAI, translate into incremental revenue (or profit)? I am not sure what the answer here is - it's probably a no right now, but there is a positive deferred benefit: as your chatbot solution becomes better in many small and large ways, it might sell better in the future. The specialized training part is when you can gather a use-case-specific dataset that gets fairly good accuracy (comparable to or greater than an LLM's), esp. considering (a) and (b). Note that these concerns are largely true even for self-hosted LLMs like llama - just that the precise breakeven point changes.
At least for those of us unfamiliar with the field, LLMs are an easy way of getting the task done. The only thing worth noting I suppose is that the most effective ones are behind paywalls.
In some cases though you may want the NLP task to run locally - you want it to be free, and it should not require excessive resources - and for those cases libraries like spaCy and NLTK make sense. Yes, there are projects like llama.cpp and friends, but it's a fast-moving field, they aren't suited for production applications yet, and even then they require high-end hardware setups which not everyone will necessarily be running.
>At least for those of us unfamiliar with the field, LLMs are an easy way of getting the task done.
This is actually an excellent point. You don't really need to know, or even give a damn, how LLMs work in order to make use of them. Find me a C++ library where I can be 100% clueless as to what it does while also integrating it into my code.
Regression is unsolved, for a start. Re LLMs, they are expensive to finetune and their inference times & memory footprints are poor compared to smaller models.
Is anyone actually using spaCy as part of a production API / queue processing? Like 100 RPS or more? I found it to be unstable for serving API requests unless you dedicate an unreasonably overpowered VM instance to it.
How do people solve sizing issues for Python-based APIs that process multiple simultaneous requests?