Hacker News
AI leaderboards are no longer useful. It's time to switch to Pareto curves (aisnakeoil.com)
40 points by jobbagy 9 months ago | 14 comments



This is the most applicable part of the article:

Strategies to improve LLM accuracy:

Retry: We repeatedly invoke a model with the temperature set to zero, up to five times, if it fails the test cases provided with the problem description. Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

Warming: This is the same as the retry strategy, but we gradually increase the temperature of the underlying model with each run, from 0 to 0.5. This increases the stochasticity of the model and, we hope, increases the likelihood that at least one of the retries will succeed.

Escalation: We start with a cheap model (Llama-3 8B) and escalate to more expensive models (GPT-3.5, Llama-3 70B, GPT-4) if we encounter a test case failure.


These strategies seem immediately practical. If you want to go beyond zero-shot for LLM coding, you may not need a complicated agent architecture - just start with escalation, retry, and warming.
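
A rough sketch of combining the three (not the paper's actual harness; generate and passes_tests are placeholders for your model call and the problem's test runner):

    MODELS = ["llama-3-8b", "gpt-3.5-turbo", "llama-3-70b", "gpt-4"]  # cheap -> expensive
    MAX_RETRIES = 5

    def solve(prompt, tests, generate, passes_tests):
        for model in MODELS:                              # escalation: start cheap
            for attempt in range(MAX_RETRIES):            # retry: up to five attempts
                temp = 0.5 * attempt / (MAX_RETRIES - 1)  # warming: temperature 0.0 -> 0.5
                code = generate(model, prompt, temperature=temp)
                if passes_tests(code, tests):
                    return code
        return None                                       # everything failed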


> Retrying makes sense because LLMs aren’t deterministic even at temperature zero.

This is news to me. I'm trying to think where non-determinism would come in at temperature zero, but coming up with nothing. What am I missing?


It can happen for a number of reasons, but in the case of GPT-4 it's probably because of their MoE implementation:

https://152334h.github.io/blog/non-determinism-in-gpt-4/


It's because floating-point arithmetic isn't deterministic, which becomes salient when (speaking loosely) the difference between likelihood of two different tokens is less than the precision of the FPU.

I am not sure to what extent this effect has been quantified.


Having played with this stuff, it's definitely the spots in the expert buffers (the other comment in the thread has the link to the explanation) and not the extremely small differences in floating-point arithmetic. The floating-point effect is much, much smaller than any change in quantization, i.e. almost impossible to see in the outputs.


I guess the root cause of my claim is that OpenAI won't tell us whether or not GPT-3.5 is an MoE model, and I assumed it wasn't. Since GPT-3.5 is clearly nondeterministic at temp=0, I believed the nondeterminism was due to FPU stuff, and this effect was amplified with GPT-4's MoE. But if GPT-3.5 is also MoE then that's just wrong.

What makes this especially tricky is that small models are truly 100% deterministic at temp=0 because the relative likelihoods are too coarse for FPU issues to be a factor. I had thought 3.5 was big enough that some of its token probabilities were too fine-grained for the FPU. But that's probably wrong.

On the other hand, it's not just GPT: there are currently floating-point issues in vLLM which significantly affect the determinism of any model run on it (https://github.com/vllm-project/vllm/issues/966). Note that a suggested fix is upcasting to float32. So it's possible that GPT-3.5 is using an especially low-precision float to save on compute costs, and that this introduces the nondeterminism.
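
For intuition, a toy illustration (numpy, not vLLM code): a logit gap that survives at float32 can collapse to an exact tie at float16, at which point run-to-run ordering effects get to decide the argmax.

    import numpy as np

    gap32 = np.float32(10.0001) - np.float32(10.0002)  # tiny but nonzero
    gap16 = np.float16(10.0001) - np.float16(10.0002)  # both round to 10.0: exact tie
    print(gap32, gap16)  # roughly -1e-4, then 0.0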

Sadly I do not have the money[1] to actually run a test to falsify any of this. It seems like this would be a good little research project.

[1] Or the time, or the motivation :) But this stuff is expensive.


I’m so glad to see LLMs spark these conversations lately. It’s been a huge gripe of mine that we don’t question the underlying precision in other areas of AI/ML.


The last couple of years have been a steady journey of us discovering that in most neural networks precision only matters in a couple key places, and everything else can get away with astonishingly little.

We started out training everything in full (f32) or double precision (f64), then around 2020 everyone switched to half precision (f16) with some stuff in full precision, now we are starting to move to quarter precision, and the newest Nvidia card even supports f4 (eighth precision?). And then of course there's the 1.58bit LLM paper.

So there has been a steady stream of people questioning the underlying precision, and most of the time the answer they came back with was: there's more precision than we need; a larger network with less precision is faster and better than a smaller network with more precision.
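
To see roughly how much each step throws away (illustrative only):

    import numpy as np

    x = 0.123456789123456789
    for dtype in (np.float64, np.float32, np.float16):
        print(dtype.__name__, dtype(x))  # ~16, ~7, and ~3 significant decimal digits survive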


To be clear there’s a distinction between the quality of the results and the determinism of the results. If a low-precision LLM is wildly stochastic but the variation is mostly linguistic rather than factual or deductive (e.g. coin tosses on synonyms or presenting independent facts in a different order), then there’s not really a contradiction.

AFAIK the determinism side of floating-point precision hasn’t been well-addressed, but it’s been a while since I skimmed those papers.


They can be made to be deterministic on CPU, but not on GPU (unless you want to give up the speedup). With floating point, operations like addition are not associative: a + (b + c) is not the same as (a + b) + c. So on CPU, you can make sure the order is always the same and the result is deterministic. On GPU, the order is not guaranteed, and thus the output is not deterministic.
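
A two-line demonstration with ordinary float64 (nothing LLM-specific):

    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0: the 1.0 is absorbed by -1e16 before the cancellation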



I had a very similar comment last month - albeit more ignorant and less helpful: https://news.ycombinator.com/item?id=39957153

Basically, none of these agentic / MoE / etc. papers have actually compared their results to the naive baseline: since these are nondeterministic programs, Randomized Algorithms 101 tells you that if the probability of success is sufficiently high, you can improve performance simply by running the algorithm multiple times and taking the majority/plurality result.
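
That baseline fits in a few lines; solve_once below stands in for whatever stochastic model call is being benchmarked:

    from collections import Counter

    def majority_vote(solve_once, prompt, k=5):
        answers = [solve_once(prompt) for _ in range(k)]
        return Counter(answers).most_common(1)[0][0]  # plurality answer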

So are MoE or agents actually more effective than doing it the dumb way? AI Snake Oil says "no." Truly bizarre that dozens of researchers didn't even ask! It made me feel like I was missing something.


I have advocated and used Pareto fronts as a model-selection method for ML for a long while. It's really useful to construct two tests - hard but important, and run-of-the-mill - and plot model performance against each one, then draw a Pareto front so you can see which of your models are off the edge. In fact, if you were to look at Figure 8.3 in "Managing Machine Learning Projects" you would see this kind of thing!
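
The bookkeeping is trivial; something like this (with made-up numbers) gives you the cost/accuracy front the article wants:

    def pareto_front(models):
        # models: (name, cost, accuracy); lower cost and higher accuracy are better
        return sorted(
            (m for m in models
             if not any(c <= m[1] and a >= m[2] and (c < m[1] or a > m[2])
                        for _, c, a in models)),
            key=lambda m: m[1])

    models = [("llama-3-8b", 0.2, 0.55), ("gpt-3.5", 1.0, 0.50),
              ("llama-3-70b", 2.9, 0.72), ("gpt-4", 30.0, 0.80)]
    print(pareto_front(models))  # gpt-3.5 is dominated here and drops out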

But, I'm just an old bot shilling for my product.


Alternative title: AI leaderboards would be useful if they didn't blindly believe the author's benchmarks, included good baselines, and factored in the real cost to run the model (parameter count can be misleading). Pareto curves are a good tool for deciding which model is best at a given price/performance tradeoff, and they should be used more.

But that's not quite as catchy. Great article.



