Hacker News new | past | comments | ask | show | jobs | submit login

"The best model was truthful on 58% of questions, while human performance was 94%."

https://arxiv.org/abs/2109.07958




A lot of people cite these numbers as cynics or to dissuade others' optimism, but if you actually work in the field and know what it was like just 5 years ago, these numbers should be extremely worrying to everyone who's afraid of automation. Even that linked paper from 2021 is already outdated. We don't need another revolution, we just need maybe a dozen small to medium insights. The steps from GPT3 to 3.5 alone were pretty straightforward and yet they already created a small revolution in LLM usefulness. Model scale used to slow down the pace of research for a moment, but with so many big companies jumping on the train now, you can expect research to accelerate again.


The training data contains tons of false information and the training objective is simply to reproduce that information. It's not at all surprising that these models fail to distinguish truth from falsehood, and no incremental change will change that. The problem is paradigmatic. And calling people cynics for pointing out the obvious and serious shortcomings of these models is poor form IMO.


The large corpus of text is only necessary to grasp the structure and nuance of language itself. Answering questions 1. in a friendly manner and 2. truthfully is a matter of fine-tuning as the latest developments around GPT3.5 clearly show. And with approaches like indexGPT the usage of external knowledge bases that can even be corrected later is already a thing, we just need this at scale and with the correct fine tuning. The tech is way further than those cynics realize.


Without the opportunity to live in the real world, these LLMs have no ground-truth. They can only guess at truth by following consensus.


I'm sure you can add constraints of some sorts to build internally consistent world models. Or add stochastic outputs as has been done in computer vision to assign e.g. variances to the probabilities and determine when the model is out if its depth (and automatically query external databases to remove the uncertainty / read up on the topic..)


These models inherently cannot be truthful because there is no intelligence behind them that can have any sort of intent at all.

It’s literally monkeys with typewriters pressing keys randomly.

Until we get new models which have true understanding, they will never be truly useful.


This gets repeated by the cynics every time like prayer, but it shows a very poor understanding of the current state of research.


Can you explain or share what research is being done on getting machines to understand the meaning and intent of the data and prompts they are given?


Here is just one recent example: https://arxiv.org/abs/2210.13382

If you actually follow the literature, you'll find that there is tons of evidence that the seemingly "simple" transformer architecture might actually work pretty similar to the way the human brain is believed to work.


This paper is a little bit of a red herring, as Yannic, an NLP PhD, covered well here: https://www.youtube.com/watch?v=aX8phGhG8VQ They filtered questions that GPT-3 - the model they tested - got right before constructing the dataset. They asked questions about common misconceptions, conspiracy theories and things humans have believed until recently. The dataset should be called expert consensus vs public entertainment belief i.e. they consider the ground truth as a humourless experts answer, while the reality is in a random sitcom, you -would- expect the meaning derived by the these models (the meaning more common in the training data... such as what happens when a mirror breaks).

It's also clear that RHLF models can be given instructions to reduce this issue. And in many production LLM models, something called few shot is used - in context learning - where you provide several examples and then ask for a new case. The accuracy is again improved this way, because the model "deduces" that you are being serious and humourless, when asking about mirrors breaking, and are not asking in the context of a story.

It's also one of the only datasets that fails to increase with scale (There was a big, cash-paying high incentive challenge to find other datasets, and it pretty much didn't find any that can't be worked around or are not experienced by RLHF models like ChatGPT). So it doesn't represent a wider trend of truthfulness above human accuracy being impossible (dated/data drift on the other hand, clearly remains an issue).


I tried asking ChatGPT the knuckle cracking question that GPT-3 got wrong (what happens if you crack your knuckles a lot).

The response was: The sound of cracking knuckles is actually a release of gas in the joint, but there is little scientific evidence to support the idea that it leads to arthritis or other joint problems. In fact, several studies have shown that there is no significant difference in the development of arthritis or other joint conditions between people who crack their knuckles and those who do not.

I wonder if OpenAI used that paper to fine tune ChatGPT


> The sound of cracking knuckles is actually a release of gas in the joint

That's only partially correct at best, because joint cracking is not a single phenomenon. For one, there is disagreement over whether the sound comes from the formation of the gas bubble, or the collapse of the gas bubble (cavitation). It is likely that both are partially true.

Furthermore, it doesn't explain the sort of joint cracking in which people can crack a joint repeatedly with no cooldown. I can crack my wrists and ankles by twisting my hands and feet around in circles, cracking once per rotation as fast as I can turn them around. 60 cracks a minute easily; neither of the gas bubble hypothesis can explain this, it is almost certainly a matter of ligaments or bones moving around against each other in a way that creates a snap sound, like snapping your fingers makes a crack sound by slapping your middle finger against the base of your thumb.


Well, text-davinci-003 (ChatGPT) is an 'improved' version of the model at paper was based on (GPT-3).

I don't think we know exactly what the improvement consists of.


If you can measure it you can improve it


When a measure becomes a target, it ceases to be a good measure https://en.wikipedia.org/wiki/Goodhart%27s_law


Only in management, not engineering




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: