LLMs can label data as well as human annotators, but 20 times faster (refuel.ai)
55 points by nihit-desai on June 18, 2023 | 31 comments



AHAHAHAHAHA! There is approximately a 0% chance that the big companies paying for data annotation will be far-sighted enough to avoid LLM-automated labeling of their data, for several reasons:

1) it will work well, at first, and only become low-quality after they (and their budgets) have become accustomed to paying 1/20th as much for the service

2) even if they pay for "human" labeling, they will go for the low cost bid, in a far-away country, which will subcontract to an LLM service without telling them

3) "hey, we should pay more for this input, in order to avoid not-yet-seen quality problems in the future", has practically never won an argument in any large corporation ever. I won't say absolutely 0 times, but pretty close.

Long story short, the use of LLMs by Big Tech may be doomed. Much like how "SEO optimization" quickly turns into clickbait and link farms without high-urgency, high-priority efforts to combat it, LLMs (and other trendy forms of AI that require lots of labeled input) will quickly turn sour and produce even less impressive results than they already do.

The current wave of "AI" hype looks set to succeed about as well as IBM Watson.


Have you worked with labeling services before? The quality is always terrible. At least with an LLM I can get consistently terrible output, quickly.


A few years ago, I started spoiling answers to captchas. I make a game of getting as many false positives and negatives as I can get away with.


"Consistently terrible output, quickly!"(TM)


At work we were facing this dilemma. Our team is working on a model to detect fraud/scam messages; in production it needs to label ~500k messages a day at low cost. We wanted to train a basic GBT/BERT model to run locally, but we considered using GPT-4 as a label source instead of our usual human labelers.

For us, human labeling is surprisingly cheap; the main advantage of GPT-4 would be that it is much faster. Since scams are always changing, we could generate new labels regularly and continuously retrain our model.

In the end we didn't go down that route; there were several problems:

- GPT-4's accuracy wasn't as good as our human labelers'. I believe this is because scam messages are intentionally tricky and require a much more general understanding of the world than the datasets used in this article, which feature simpler labeling problems. Also, I don't trust that there was no funny business going on in generating the results for this blog, since there is a clear conflict of interest with the business that owns it.

- GPT-4 would be consistently fooled by certain types of scams, whereas human annotators work off a consensus procedure. This could probably be solved in the future when there's a larger pool of high-quality LLMs available and we can pool them for consensus (a rough sketch of that idea follows this list).

- Concern that some PII would accidentally get sent to OpenAI; of course, nobody trusts that those guys will treat our customers' data with any appropriate level of ethics.
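
A rough sketch of that consensus idea (not our actual pipeline; ask_llm, the model names, and the label set are placeholders):

    # Sketch: majority vote across several LLM "annotators".
    from collections import Counter

    MODELS = ["llm-a", "llm-b", "llm-c"]   # hypothetical model pool
    ALLOWED = {"scam", "not_scam"}

    def ask_llm(model, message):
        """Placeholder: call the model of your choice, return a label string."""
        raise NotImplementedError

    def consensus_label(message, min_agreement=2):
        votes = [v for v in (ask_llm(m, message) for m in MODELS) if v in ALLOWED]
        if not votes:
            return None  # nothing usable -> route to a human
        label, count = Counter(votes).most_common(1)[0]
        return label if count >= min_agreement else None  # None -> human review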


I wonder if the LLM could at least reliably label something as "smells funny," and then you could have human labelers only work on that smaller, refined batch. But like you said, PII is a concern. In any case, at the rate it's going, does anyone really doubt that LLMs one or two years out will have the same problem?
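
A minimal sketch of that two-stage idea (the smells_funny call is a stand-in for whatever LLM prompt you'd use, not a real API):

    # Sketch: cheap LLM first pass, humans only review the flagged subset.
    def smells_funny(message):
        """Placeholder for an LLM call answering 'does this look suspicious?'"""
        raise NotImplementedError

    def triage(messages):
        flagged, clean = [], []
        for m in messages:
            (flagged if smells_funny(m) else clean).append(m)
        return flagged, clean  # only `flagged` goes to the human labeling queue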


>> don't trust that there was no funny business going on in generating the results for this blog

All the datasets and labeling configs used for these experiments are available in our GitHub repo (https://github.com/refuel-ai/autolabel), as mentioned in the report. Hope these are useful!


Thank you, I appreciate your transparency with this work.


Did you consider fine-tuning your own copy of GPT-4 that can handle scam messages better? I'm doing something similar with Azure OpenAI Services and a custom vector database to handle ham/spam labeling for some of my customer feedback APIs.
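
In outline, that kind of setup can be as simple as nearest-neighbour lookup over embeddings (heavily simplified sketch; the embed helper is a placeholder for whatever embedding endpoint is used):

    # Sketch: label new feedback by its nearest labeled neighbours in embedding space.
    import numpy as np

    def embed(text):
        """Placeholder for an embedding call (any hosted or local embedding model)."""
        raise NotImplementedError

    def knn_label(text, labeled, k=5):
        # labeled: list of (embedding_vector, "ham" | "spam") pairs
        q = embed(text)
        sims = [(float(np.dot(q, v)) / (np.linalg.norm(q) * np.linalg.norm(v)), y)
                for v, y in labeled]
        top = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
        spam_votes = sum(1 for _, y in top if y == "spam")
        return "spam" if spam_votes > k // 2 else "ham"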


This will probably work as long as the material being annotated is similar to the material the LLM was trained on. When it encounters novel data (value) it will likely perform poorly.


Partially agree, but it's a continuous value rather than a boolean. We've seen LLM performance largely follow this story: https://twitter.com/karpathy/status/1655994367033884672/phot...

From benchmarking, we've been positively surprised by how effective few-shot learning and PEFT are at closing the domain gap.

"When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)


> "When it encounters novel data (value) it will likely perform poorly" -- is that not true of human annotators too? :)

Some humans have intelligence and reasoning abilities. No LLMs do :)


I always regret making that assumption


I don't have experience with text/nlp problems, but some degree of automation/assistance in labeling is a fairly common practice in computer vision. If you have a certain task where the ML model gets you 90% there, then you can use that as a starting point and have a human fix the remaining 10%. (Of course, this should be done in a way that the overall effort is lower than labeling from scratch, which is partially a UI problem). If your model is so good that it completely outperforms humans (at least for now, before data drift kicks in) then that's a good problem to have, assuming your model evaluation is sane.
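
For the CV case, the pre-labeling step can be as simple as exporting model predictions as draft annotations for the labeling tool to import (generic sketch; predict and the JSON shape are made up):

    # Sketch: export model predictions as drafts for a human to correct.
    import json

    def predict(image_path):
        """Placeholder: run your detector, return [(label, confidence, bbox), ...]."""
        raise NotImplementedError

    def export_drafts(image_paths, out_path, min_conf=0.5):
        drafts = []
        for path in image_paths:
            boxes = [{"label": l, "confidence": c, "bbox": b}
                     for l, c, b in predict(path) if c >= min_conf]
            drafts.append({"image": path, "annotations": boxes})
        with open(out_path, "w") as f:
            json.dump(drafts, f, indent=2)  # most labeling UIs can import something like this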


If an LLM labels it, does that have the same value? Isn’t it just fancy regurgitation of knowns?


Even humans disagree about labels. Especially humans willing to do this work.

And with the topical depth that, say, GPT-4 has, I would think these labels have more value, although, just as with humans, some validation and verification steps are required.


Good question - one follow-up question there: value for whom? If it is to train the LLM that is doing the labeling, then I agree. If it is to train a smaller downstream model (e.g. finetune a pretrained BERT model), then the value is as good as if the labels came from any human annotator, and is only a function of label quality.


Why retrain that smaller model from scratch tho? Just do a little transfer learning, or get creative and see if you can prune down to a smaller model algorithmically instead of doing the whole label and train rigamarole from scratch on what is effectively regurgitation.

I’m not sold this has directional value.


Hmm, I'm not suggesting training a smaller model from scratch - in most cases you'd want to finetune a pretrained model (aka transfer learning) for your specific use case/problem domain.

The need for labeled data for any kind of training is a constant though :)
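
For concreteness, that finetune-on-LLM-labels step is the standard transfer-learning recipe; a minimal sketch with toy data standing in for the LLM's output:

    # Sketch: finetune a small pretrained model on labels produced by an LLM.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    texts = ["free crypto, click here now", "your invoice for May is attached"]
    llm_labels = [1, 0]  # toy stand-ins for LLM-generated labels (1 = scam)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    ds = Dataset.from_dict({"text": texts, "label": llm_labels})
    ds = ds.map(lambda x: tok(x["text"], truncation=True, padding="max_length"),
                batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=ds,
    )
    trainer.train()  # downstream quality is bounded by the LLM label quality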


It has some value. If you let the AI label the data and feed it back to it, you are reaffirming its guesses. If you independently verify that its guesses are as correct as human ones, you are teaching the AI to be more sure about the correct thing.


LLMs have emergent abilities [0], which could provide additional value to any output or label.

[0] https://www.jasonwei.net/blog/emergence


Not sold they do.


How was ground truth obtained if not via human annotation?


Some data is self-annotating. You can count occurrences of features in context, and then you know with increasing confidence that a feature occurs in a given context. You can also build up meaning with more observation of events in context. Sounds circular, but notice that no human is required in this process. Sure, in other cases or for other steps humans can be useful, but they aren't always needed for ground truth.


Hi, one of the authors here. Good question! For this benchmarking, we evaluated performance on popular open source text datasets across a few different NLP tasks (details in the report).

For each of these datasets, we specify task guidelines/prompts for the LLM and human annotators, and compare each one's performance against ground-truth labels.
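
Mechanically, "compare against ground truth" boils down to agreement with the reference labels (the report uses task-appropriate metrics; plain accuracy shown here just for illustration):

    # Agreement of an annotator (human or LLM) with the reference labels.
    def accuracy(predicted, reference):
        assert len(predicted) == len(reference)
        return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

    # e.g. accuracy(llm_labels, ground_truth) vs. accuracy(human_labels, ground_truth)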


You didn't answer the question at all, although to be fair the answer is both obvious and completely undermines your claim so I can see why you wouldn't.


>compare each of their performance against supposed ground truth labels.

Fixed it for you.


I mean, sure. For ground truth, we are using the labels that are part of the original datasets:

- https://huggingface.co/datasets/banking77
- https://huggingface.co/datasets/lex_glue/viewer/ledgar/train
- https://huggingface.co/datasets/squad_v2

... (exhaustive set of links at the end of the report).

Is there some noise in these labels? Sure! But the relative performance with respect to these is still a valid evaluation.
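
For anyone who wants to poke at the labels themselves, these load directly with the datasets library:

    from datasets import load_dataset

    banking = load_dataset("banking77")          # banking intent classification
    ledgar = load_dataset("lex_glue", "ledgar")  # contract provision labels
    squad = load_dataset("squad_v2")             # extractive question answering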


Agreed, thanks for highlighting these links!


Only 20?


The report states humans take ~60s per answer and GPT-4 takes ~3s, so roughly 20x faster / 7x cheaper is the baseline for the highest-quality results available. Faster models are 2-3 times faster still, but 3s is already enough even for real-time applications.

The cost, however, goes down to ~1/1000 for the cheaper models (and they can still hit 100% on some of the datasets), meaning that for the same price as one human annotator you could have 1,000 parallel real-time LLM annotators.
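
Back-of-the-envelope with those numbers:

    # Rough arithmetic from the figures above (all approximate).
    human_sec, gpt4_sec = 60, 3
    print(human_sec / gpt4_sec)   # ~20x faster per label
    print(3600 // human_sec)      # ~60 human labels per hour
    print(3600 // gpt4_sec)       # ~1200 GPT-4 labels per hour
    # If a cheap model's label costs ~1/1000 of a human label, the same budget
    # buys on the order of 1000 LLM "annotators" running in parallel.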




