
If a model is not making use of the whole context window - shouldn't that be very noticeable when the prompt is code?

For example when querying a model to refactor a piece of code - would that really work if it forgets about one part of the code while it refactors another part?

I concatenate a lot of code files into a single prompt multiple times a day and ask LLMs to refactor them, implement features or review the code.

So far, I have never had the impression that filling the context window with a lot of code causes problems.

I also use very long lists of instructions on code style on top of my prompts. And the LLMs seem to be able to follow all of them just fine.
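The concatenation workflow described above is easy to script. A minimal sketch (the file-header format and closing instruction are my own invention, not any particular tool's convention):

```python
from pathlib import Path

def build_prompt(files, instructions):
    """Concatenate code files into one prompt, each section
    prefixed with its file path so the model can tell them apart."""
    parts = [instructions]
    for f in files:
        path = Path(f)
        parts.append(f"### File: {path}\n{path.read_text()}")
    parts.append("Refactor the code above, following every instruction.")
    return "\n\n".join(parts)
```

Putting the style instructions first and the task last keeps both near the positions where models tend to attend best.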



I don't think there are any up-to-date leaderboards, but models absolutely degrade in performance the more context they're dealing with.

https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...

>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k.

And Llama 4 Scout claimed a 10 million token context window but in practice its performance on query tasks drops below 20% accuracy by 32k tokens.
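Evals like RULER largely come down to burying a fact in filler text and asking for it back at increasing context lengths. A rough sketch of such a probe (the filler sentence and the "needle" fact are made up, and word count is only a crude proxy for tokens):

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is bright. "

def needle_prompt(approx_tokens, needle="The access code is 7342.", seed=0):
    """Bury one retrievable fact at a random position inside filler
    text of roughly approx_tokens tokens (approximated as words)."""
    rng = random.Random(seed)
    n_words = len(FILLER.split())
    chunks = [FILLER] * max(1, approx_tokens // n_words)
    chunks.insert(rng.randrange(len(chunks) + 1), needle + " ")
    haystack = "".join(chunks)
    return haystack + "\nWhat is the access code mentioned above?"
```

Sweeping `approx_tokens` from 4k to 128k and scoring whether the model returns the code gives a degradation curve like the one quoted above.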


That makes me wonder if we could simply test this by letting the LLM add or multiply a long list of numbers?

Here is an experiment:

https://www.gnod.com/search/#q=%23%20Calcuate%20the%20below%...

The correct answer:

    Correct:    20,192,642.460942328
Here is what I got from different models on the first try:

    ChatGPT:    20,384,918.24
    Perplexity: 20,000,000
    Google:     25,167,098.4
    Mistral:    200,000,000
    Grok:       Timed out after 300s of thinking
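For a test like this, the reference answer can be computed without float rounding using Python's `decimal` module. The numbers below are a made-up stand-in for whatever list the linked prompt contains:

```python
from decimal import Decimal, getcontext

def exact_product(numbers):
    """Multiply decimal strings exactly, avoiding binary float error."""
    getcontext().prec = 50  # plenty of digits for a short list
    result = Decimal(1)
    for n in numbers:
        result *= Decimal(n)
    return result
```

Comparing each model's answer against an exact reference like this makes the "close but wrong" failures (off in the 8th significant digit) easy to spot.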


> Do not use a calculator. Do it in your head.

You wouldn't ask a human to do that, why would you ask an LLM to? I guess it's a way to test them, but it feels like the world record for backwards running: interesting, maybe, but not a good way to measure, like, anything about the individual involved.


I’m starting to find it unreasonably funny how people always want language models to multiply numbers for some reason. Every god damn time. In every single HN thread. I think my sanity might be giving out.


A model, no, but an agent with a calculator tool?

Then there's the question of why not just build the calculator tool into the model?
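A minimal sketch of what the calculator tool itself could look like, leaving the agent loop aside: a safe arithmetic evaluator the model calls instead of multiplying in its weights. This is my own illustration, not any particular framework's API:

```python
import ast
import operator

# Map AST operator nodes to their arithmetic functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.USub: operator.neg}

def calculator(expr: str) -> float:
    """Evaluate a plain arithmetic expression without eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))
```

The model only has to emit the expression string; the exactness comes from the tool, which is the whole point of tool use here.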


Since grok 4 fast got this answer correct so quickly, I decided to test more.

Tested this on the new hidden model of ChatGPT called Polaris Alpha: Answer: 20,192,642.460942336

Current gpt-5 medium reasoning says: After confirming my calculations, the final product (P) should be (20,192,642.460942336)

Claude Sonnet 4.5 says: “29,596,175.95 or roughly 29.6 million”

Claude haiku 4.5 says: ≈20,185,903

GLM 4.6 says: 20,171,523.725593136

I’m going to try out Grok 4 fast on some coding tasks at this point to see if it can create functions properly. Design help is still best on GPT-5 at this exact moment.


Isn't it that LLMs are just not designed to do calculations?


They are not LMMs, after all…


Neither are humans.


But humans can still do it.



