LLMs can label data as well as human annotators, but 20 times faster (refuel.ai)
55 points by nihit-desai on June 18, 2023 | 31 comments



AHAHAHAHAHA! There is approximately a 0% chance that the big companies paying for data annotation will be far-sighted enough to avoid LLM-automated labeling of their data, for several reasons:

1) it will work well, at first, and only become low-quality after they (and their budgets) have become accustomed to paying 1/20th as much for the service

2) even if they pay for "human" labeling, they will go for the low cost bid, in a far-away country, which will subcontract to an LLM service without telling them

3) "hey, we should pay more for this input, in order to avoid not-yet-seen quality problems in the future", has practically never won an argument in any large corporation ever. I won't say absolutely 0 times, but pretty close.

Long story short, the use of LLMs by Big Tech may be doomed. Much like how "SEO optimization" quickly turns into clickbait and link farms without high-urgency, high-priority efforts to combat it, LLMs (and other trendy forms of AI that require lots of labeled input) will quickly turn sour and produce even less impressive results than they already do.

The current wave of "AI" hype looks set to succeed about as well as IBM Watson.


Have you worked with labeling services before? The quality is always terrible. At least with an LLM I can get consistently terrible output, quickly.


A few years ago, I started spoiling answers to captchas. I make a game of getting as many false positives and negatives as I can get away with.


"Consistently terrible output, quickly!"(TM)


At work we were facing this dilemma. Our team is working on a model to detect fraud/scam messages; in production it needs to label ~500k messages a day at low cost. We wanted to train a basic GBT/BERT model to run locally, but we considered using GPT-4 as a label source instead of our usual human labelers.

For us, human labeling is surprisingly cheap; the main advantage of GPT-4 would be that it is much faster. Since scams are always changing, we could generate new labels regularly and continuously retrain our model.

In the end we didn't go down that route; there were several problems:

- GPT-4's accuracy wasn't as good as our human labelers'. I believe this is because scam messages are intentionally tricky and require a much more general understanding of the world than the datasets used in this article, which feature simpler labeling problems. Also, I don't trust that there was no funny business going on in generating the results for this blog, since there is a clear conflict of interest with the business that owns it.

- GPT-4 would be consistently fooled by certain types of scams, whereas human annotators work off a consensus procedure. This could probably be solved in the future when there's a larger pool of high-quality LLMs available and we can pool them for consensus (a rough sketch of that idea follows this list).

- Concern that some PII would accidentally get sent to OpenAI; of course, nobody trusts that those guys will treat our customers' data with any appropriate level of ethics.
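
A rough sketch of that consensus idea (not our actual pipeline; ask_llm, the model names, and the label set are placeholders):

    # Sketch: majority vote across several LLM "annotators".
    from collections import Counter

    MODELS = ["llm-a", "llm-b", "llm-c"]   # hypothetical model pool
    ALLOWED = {"scam", "not_scam"}

    def ask_llm(model, message):
        """Placeholder: call the model of your choice, return a label string."""
        raise NotImplementedError

    def consensus_label(message, min_agreement=2):
        votes = [v for v in (ask_llm(m, message) for m in MODELS) if v in ALLOWED]
        if not votes:
            return None  # nothing usable -> route to a human
        label, count = Counter(votes).most_common(1)[0]
        return label if count >= min_agreement else None  # None -> human review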


I wonder if the LLM could at least reliably label something as "smells funny," and then you could have human labelers only work on that smaller, refined batch. But like you said, PII is a concern. In any case, at the rate it's going, does anyone really doubt that LLMs one or two years out will have the same problem?
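
A minimal sketch of that two-stage idea (the smells_funny call is a stand-in for whatever LLM prompt you'd use, not a real API):

    # Sketch: cheap LLM first pass, humans only review the flagged subset.
    def smells_funny(message):
        """Placeholder for an LLM call answering 'does this look suspicious?'"""
        raise NotImplementedError

    def triage(messages):
        flagged, clean = [], []
        for m in messages:
            (flagged if smells_funny(m) else clean).append(m)
        return flagged, clean  # only `flagged` goes to the human labeling queue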


>> don't trust that there was no funny business going on in generating the results for this blog

All the datasets and labeling configs used for these experiments are available in our GitHub repo (https://github.com/refuel-ai/autolabel), as mentioned in the report. Hope these are useful!


Thank you, I appreciate your transparency with this work.


Did you consider fine-tuning your own copy of GPT-4 that can handle scam messages better? I'm doing something similar with Azure OpenAI Services and a custom vector database to handle ham/spam labeling for some of my customer feedback APIs.
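
In outline, that kind of setup can be as simple as nearest-neighbour lookup over embeddings (heavily simplified sketch; the embed helper is a placeholder for whatever embedding endpoint is used):

    # Sketch: label new feedback by its nearest labeled neighbours in embedding space.
    import numpy as np

    def embed(text):
        """Placeholder for an embedding call (any hosted or local embedding model)."""
        raise NotImplementedError

    def knn_label(text, labeled, k=5):
        # labeled: list of (embedding_vector, "ham" | "spam") pairs
        q = embed(text)
        sims = [(float(np.dot(q, v)) / (np.linalg.norm(q) * np.linalg.norm(v)), y)
                for v, y in labeled]
        top = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
        spam_votes = sum(1 for _, y in top if y == "spam")
        return "spam" if spam_votes > k // 2 else "ham"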


This will probably work as long as the material being annotated is similar to the material the LLM was trained on. When it encounters novel data (value) it will likely perform poorly.


Partially agree, but it's a continuous value rather than a boolean. We've seen LLM performance largely follow this story: https://twitter.com/karpathy/status/1655994367033884672/phot...

From benchmarking, we've been positively surprised by how effective few-shot learning and PEFT are at closing the domain gap.

"When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)


> "When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)

Some humans have intelligence and reasoning abilities. No LLMs do :)


I always regret making that assumption


I don't have experience with text/nlp problems, but some degree of automation/assistance in labeling is a fairly common practice in computer vision. If you have a certain task where the ML model gets you 90% there, then you can use that as a starting point and have a human fix the remaining 10%. (Of course, this should be done in a way that the overall effort is lower than labeling from scratch, which is partially a UI problem). If your model is so good that it completely outperforms humans (at least for now, before data drift kicks in) then that's a good problem to have, assuming your model evaluation is sane.
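
For the CV case, the pre-labeling step can be as simple as exporting model predictions as draft annotations for the labeling tool to import (generic sketch; predict and the JSON shape are made up):

    # Sketch: export model predictions as drafts for a human to correct.
    import json

    def predict(image_path):
        """Placeholder: run your detector, return [(label, confidence, bbox), ...]."""
        raise NotImplementedError

    def export_drafts(image_paths, out_path, min_conf=0.5):
        drafts = []
        for path in image_paths:
            boxes = [{"label": l, "confidence": c, "bbox": b}
                     for l, c, b in predict(path) if c >= min_conf]
            drafts.append({"image": path, "annotations": boxes})
        with open(out_path, "w") as f:
            json.dump(drafts, f, indent=2)  # most labeling UIs can import something like this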


If an LLM labels it, does that have the same value? Isn’t it just fancy regurgitation of knowns?


Even humans disagree about labels. Especially humans willing to do this work.

And with the topical depth that, say, GPT-4 has, I would think these labels have more value, although, just as with humans, some validation and verification steps are required.


Good question - one follow-up question there: value for whom? If it is to train the LLM that is doing the labeling, then I agree. If it is to train a smaller downstream model (e.g. finetune a pretrained BERT model), then the value is as good as if the labels came from any human annotator, and is only a function of label quality.


Why retrain that smaller model from scratch tho? Just do a little transfer learning, or get creative and see if you can prune down to a smaller model algorithmically instead of doing the whole label and train rigamarole from scratch on what is effectively regurgitation.

I’m not sold this has directional value.


Hmm, I'm not suggesting training a smaller model from scratch - in most cases you'd want to finetune a pretrained model (aka transfer learning) for your specific use case/problem domain.

The need for labeled data for any kind of training is a constant though :)
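
For concreteness, that finetune-on-LLM-labels step is the standard transfer-learning recipe; a minimal sketch with toy data standing in for the LLM's output:

    # Sketch: finetune a small pretrained model on labels produced by an LLM.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    texts = ["free crypto, click here now", "your invoice for May is attached"]
    llm_labels = [1, 0]  # toy stand-ins for LLM-generated labels (1 = scam)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    ds = Dataset.from_dict({"text": texts, "label": llm_labels})
    ds = ds.map(lambda x: tok(x["text"], truncation=True, padding="max_length"),
                batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=ds,
    )
    trainer.train()  # downstream quality is bounded by the LLM label quality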


It has some value. If you let the AI label the data and feed it back to it, you are reaffirming its guesses. If you independently verify that its guesses are as correct as human ones, you are teaching the AI to be more sure about the correct thing.


LLMs have emergent abilities [0], which could provide additional value to any output or label.

[0] https://www.jasonwei.net/blog/emergence


Not sold they do.


How was ground truth obtained if not via human annotation?


Some data is self-annotating. You can count occurrences of features in context, and then you know with increasing confidence that a feature occurs in a given context. You can also build up meaning with more observation of events in context. Sounds circular, but notice that no human is required in this process. Sure, in other cases or for other steps humans can be useful, but they aren't always needed for ground truth.


Hi, one of the authors here. Good question! For this benchmarking, we evaluated performance on popular open source text datasets across a few different NLP tasks (details in the report).

For each of these datasets, we specify task guidelines/prompts for the LLM and human annotators, and compare each one's performance against ground-truth labels.
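
Mechanically, "compare against ground truth" boils down to agreement with the reference labels (the report uses task-appropriate metrics; plain accuracy shown here just for illustration):

    # Agreement of an annotator (human or LLM) with the reference labels.
    def accuracy(predicted, reference):
        assert len(predicted) == len(reference)
        return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

    # e.g. accuracy(llm_labels, ground_truth) vs. accuracy(human_labels, ground_truth)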


You didn't answer the question at all, although to be fair the answer is both obvious and completely undermines your claim so I can see why you wouldn't.


>compare each of their performance against supposed ground truth labels.

Fixed it for you.


I mean, sure. For ground truth, we are using the labels that are part of the original datasets:

- https://huggingface.co/datasets/banking77
- https://huggingface.co/datasets/lex_glue/viewer/ledgar/train
- https://huggingface.co/datasets/squad_v2

... (exhaustive set of links at the end of the report).

Is there some noise in these labels? Sure! But the relative performance with respect to these is still a valid evaluation.
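
For anyone who wants to poke at the labels themselves, these load directly with the datasets library:

    from datasets import load_dataset

    banking = load_dataset("banking77")          # banking intent classification
    ledgar = load_dataset("lex_glue", "ledgar")  # contract provision labels
    squad = load_dataset("squad_v2")             # extractive question answering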


Agreed, thanks for highlighting these links!


Only 20?


The report states humans take ~60s per answer and GPT-4 takes ~3s, so roughly 20x faster / 7x cheaper is the baseline for the highest-quality results available. Faster models are 2-3 times faster still, but 3s is already enough even for real-time applications.

The cost, however, goes down to ~1/1000 for the cheaper models (and they can still hit 100% on some of the datasets), meaning that for the same price as one human annotator you could have 1,000 parallel real-time LLM annotators.
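
Back-of-the-envelope with those numbers:

    # Rough arithmetic from the figures above (all approximate).
    human_sec, gpt4_sec = 60, 3
    print(human_sec / gpt4_sec)   # ~20x faster per label
    print(3600 // human_sec)      # ~60 human labels per hour
    print(3600 // gpt4_sec)       # ~1200 GPT-4 labels per hour
    # If a cheap model's label costs ~1/1000 of a human label, the same budget
    # buys on the order of 1000 LLM "annotators" running in parallel.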




