Lies. Damned Lies. P-value thresholds (newyorker.com)
59 points by fghorow 3 days ago | 23 comments

This piece is basically set up to fail. It's historical, based on math, and doesn't venture towards any drama or stakes. I can easily imagine it being a farticle on a third-rate SEO webzine.

So why does it work okay as a New Yorker piece? How is their writing consistently good?

I think their secret sauce is the (implied) perspective. The impression that the author has a unique, complete, accurate take on the subject, and is letting you in on it piece by piece, in a meandering way.


You're exactly right: what you refer to as "perspective" is easily faked.

Astroturf on HN. Never thought I'd see the day.

Yes, I'm a shill for big "New Yorker magazine". You are so smart, how did you figure it out?

Previous discussion under the original title: https://news.ycombinator.com/item?id=20978134

If thresholding of P-values is the issue, E-values -- a recent, much more elegant, easier to work with, and more robust alternative to P-values -- solve this.

https://arxiv.org/abs/2312.08040
https://arxiv.org/abs/2205.00901
https://arxiv.org/abs/2210.01948
https://arxiv.org/abs/2410.23614
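
For the unfamiliar: the core guarantee behind e-values is easy to demo. An e-value is a nonnegative statistic whose expectation is at most 1 under the null, so by Markov's inequality, rejecting when E >= 1/alpha controls type-I error at level alpha. A toy simulation of that property (my own sketch, not from the linked papers):

    # Toy e-value demo: the likelihood ratio of N(1,1) vs N(0,1) on one
    # observation. Under H_0 (x ~ N(0,1)) its expectation is exactly 1, so
    # it is an e-value, and Pr(E >= 1/alpha) <= alpha by Markov's inequality.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.05
    x = rng.normal(0.0, 1.0, size=1_000_000)  # data drawn under the null
    e = np.exp(x - 0.5)                       # likelihood ratio N(1,1)/N(0,1)
    print(e.mean())                           # ~1.0
    print((e >= 1 / alpha).mean())            # false-rejection rate, well under 0.05

Unlike a p-value threshold, the guarantee survives multiplying e-values from independent studies together (expectations of independent factors multiply), which is one reason the papers above argue they compose better.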


Thank you, excellent reading.

Harold Shipman, a GP in the NHS, killed about one patient per month for 30 years, and the response was to ask whether doctors could be monitored to catch this earlier. However, the systems they conceived "eventually cast suspicion on the innocent". Twenty-five years later the NHS is still struggling to answer these questions.


(2019)

P-values are why LLMs hallucinate

I have no idea how this could possibly be true

I can tell by your response that you are burdened by understanding at least one of: a) what a p-value is, b) what an "LLM hallucination" means, or c) what actually causes LLMs to hallucinate.

If you set yourself free from the meaning of all the nouns in the sentence then you can get there.


I recommend setting yourself free of all nouns in general.

Okay, if I forget what p-value means and just take it to mean probability, I guess I can see the point? It's still wrong, though.

Yes, it is still wrong even if you just think "probability", not "p-value".

For people who don't believe me, spin up your LLM of choice using the API in your favourite language[1] and make some query with the temperature set to zero. You will find that if you repeat the query multiple times you always get the same response. That is because it always gives you the highest-weighted result in the transformer output, whereas if you set a non-zero temperature (or use the default chat frontend) it does a weighted random sample.
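
A minimal sketch of that experiment, assuming the OpenAI Python client (any provider's API with a temperature parameter works the same way; the model name here is just an example):

    # Repeat the same query at temperature 0 and compare the outputs.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str, temperature: float) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return resp.choices[0].message.content

    answers = {ask("Name a prime between 100 and 110.", 0.0) for _ in range(5)}
    print(len(answers))  # expect 1: greedy decoding repeats the same text

(Strictly speaking, some hosted models still show occasional nondeterminism at temperature 0 due to batching and floating-point effects, but greedy decoding is the point here.)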

So there is no probabilistic variance between responses with temperature set to zero for a given model, but you will nonetheless find that you can get the LLM to hallucinate. One way I've found to get LLMs to frequently hallucinate is to ask the difference between two concepts that are actually the same (e.g. Gemini gave me a very convincing-looking but totally wrong "explanation" of the difference between a linear map and a linear transformation in linear algebra [2]).

Therefore the probabilistic nature of a normal LLM response cannot be the reason for hallucination, because when we turn that off we find we still get hallucinations.

The real reason that LLMs hallucinate is more mundane and yet more fundamental: hallucinating (in the normal sense of the word) is actually all that LLMs do. This is what Karpathy is talking about when he says that LLMs "dream documents". We just specifically call it "hallucination" when the results are somehow undesirable, typically because they don't correspond with some particular facts we would like the model's output to be grounded in.

But LLMs don't have any sort of model of the world; they have weights which are a lossy compression of their raw training data. So in response to some prompt they give the response that, per their instruction fine-tuning, minimizes whatever loss function was used for that fine-tuning process. That's all. When we use words like "hallucination" we are in danger of anthropomorphising the model and using our own reasoning process to try to back into how the model actually works.

[1] You need to use the programming API rather than the usual web frontend to set the temperature parameter.

[2] For the curious, it more or less said that for one of them (I forget which) you could move the origin, which turned it into an affine transformation, but it mangled the underlying maths further. The evidence has fallen out of my Gemini history so I can't share it, but that sort of approach has been fruitful in the past. Neither ChatGPT nor Claude falls for that specific example, fwiw.


While I like to call out people for using "hallucinate" for this kind of behavior too (for language models, at least; it might actually be appropriate for visual models?) --

> One way I've found to get LLMs to frequently hallucinate is to ask the difference between two concepts that are actually the same

-- this only confirms my belief that "bullshitting" is an appropriate term for this behavior: doesn't exactly the same thing happen with (not savvy enough) human students?

You call it "anthropomorphizing" and "not having a model of the world", but isn't it more like forcing a model of the world onto the student / language model by the way that you frame the question?

(Interestingly, there might be a parallel here with the article : with the language model not being a real student, but a statistical average over all students, including being "with one breast and one testicle".)


Yes actually I think you're right.

LLMs are essentially predicting what token could plausibly come next. They sort the possible next tokens by probability, often throw out the ones with very low probabilities (p-value too low), and then use a weighted random number generator to choose one from the rest.

Sometimes that means you exclude a good next token, or include a bad one, so a bad one gets chosen. And then once it has, the thing is going to pick whatever is most likely to come after that, which will be some malarkey because it has already emitted a nonsense token and is now using that as context. But whatever it is, it will still sound plausible because it's still choosing from the most likely things to follow what has already been emitted.
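
A toy version of that sampling step (illustrative only; real decoders work over logits for a vocabulary of ~100k tokens and use cutoffs like top-k, top-p, or min-p):

    # Toy next-token sampler: softmax with temperature, drop low-probability
    # tokens, then take a weighted random draw from what's left.
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_next(tokens, logits, temperature=1.0, min_p=0.1):
        if temperature == 0:
            return tokens[int(np.argmax(logits))]  # greedy: always the top token
        z = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(z - z.max())
        probs /= probs.sum()
        keep = probs >= min_p                      # throw out the long tail
        kept_probs = probs[keep] / probs[keep].sum()
        kept_tokens = [t for t, k in zip(tokens, keep) if k]
        return rng.choice(kept_tokens, p=kept_probs)

    tokens = ["Paris", "Lyon", "London", "Berlin"]
    logits = [2.0, 1.5, 1.2, 0.5]
    print(sample_next(tokens, logits, temperature=0.8))  # usually "Paris", sometimes not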


A p-value isn't just any old probability: it has a specific meaning related to hypothesis testing[1]. A p-value is the conditional probability of seeing a result at least as extreme as some observation under the null hypothesis.

Yes, LLMs generate tokens using a stochastic process, so generation is probabilistic. Everyone knows that, but in the normal process of generating text LLMs aren't doing a hypothesis test, so by definition p-values are completely irrelevant to how LLMs hallucinate.
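
In symbols (my paraphrase, not from the linked page):

    p = Pr( T >= t_obs | H_0 )

where T is the test statistic and t_obs is the value actually observed ("at least as extreme" flips direction or becomes two-sided depending on the alternative hypothesis).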

[1] https://math.libretexts.org/Courses/Queens_College/Introduct...


This is such a common thing to misunderstand that I'm going to respond to my own message to give an explanation, because many of the links I've found make sense once you know the lingo etc but might not before then.

Say you go into a bar and just by chance there is a football[1] match on television between Denmark and France. You see a bunch of fans of each country and you think "Hey, the Danes look taller than the French". You want to find out whether this is true in general, so to test this hypothesis you persuade them during a lull in the match to line up and get measured. As luck would have it there are exactly n people from each country.

H_0 (the null hypothesis) is that the two population means are the same. That is, that Danish people have the same average height as French people.

H_1 (the alternative hypothesis) is that Danish people are taller on average than French people (ie the population mean is larger).

So you take the average height and see that the Danes in this bar are say 5cm taller on average than the French people in this bar.

The p-value is the probability that, if the true average heights of the two underlying populations (all Danes and all French people) were the same, a random sample of n people from each population would show the sampled Danes at least 5cm taller on average than the sampled French.

How you use a p-value: typically, if it is smaller than some threshold (the significance level, often 0.05), you "reject the null hypothesis" at that significance level. So in this case, if the p-value were small enough, you would conclude that the population means are unlikely to be the same.

Actually calculating the p-value is going to depend a bit on the distribution etc., but that's what a p-value is. As you can see, it's not just any old probability.
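
Here's the bar example as a minimal sketch with simulated heights (scipy's one-sided two-sample t-test; all the numbers are made up):

    # Simulate n Danes and n French fans, then test whether the Danish
    # mean height is greater (one-sided two-sample t-test).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n = 20
    danes = rng.normal(loc=181, scale=7, size=n)   # made-up population parameters
    french = rng.normal(loc=176, scale=7, size=n)

    # H_0: equal population means. H_1: the Danish mean is greater.
    res = stats.ttest_ind(danes, french, alternative="greater")
    print(f"observed difference: {danes.mean() - french.mean():.1f} cm")
    print(f"p-value: {res.pvalue:.4f}")  # small p => reject H_0 at, e.g., the 0.05 level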

[1] soccer, if you're from the US


Would you elaborate?

If I take it at face value, they seem to be blaming LLM hallucinations on questionable training data from badly done published science that fixated on p-values.


