"I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago" looks like you're using llama-cpp in that repo. This is about vllm serving many requests at once, at long sequence lengths.
> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.
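That lack of batch invariance is easy to see in isolation. A minimal sketch (assuming PyTorch and, ideally, a CUDA device; the matrix sizes are arbitrary): the same row multiplied by the same matrix, alone versus inside the full batch, can come back with slightly different floats because the kernel and its reduction order can change with the batch shape.

```python
import torch

torch.manual_seed(0)
# The effect is most visible with CUDA kernels; CPU results may happen to match.
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(2048, 2048, device=device)
B = torch.randn(2048, 2048, device=device)

# Mathematically identical results, different batch sizes fed to the matmul kernel:
row_alone   = torch.mm(A[:1], B)   # the first row, computed as a batch of one
row_in_full = torch.mm(A, B)[:1]   # the same row, computed inside the full batch

print(torch.equal(row_alone, row_in_full))           # typically False on GPU
print((row_alone - row_in_full).abs().max().item())  # small but nonzero difference
```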
I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago.
Run two of these with the same prompts and same seed and you get the same results.
Obviously, in GPU clusters with different hardware, things get more complicated.
https://git.distrust.co/public/llmshell
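For what it's worth, the single-request case described above is straightforward to reproduce. A minimal sketch with llama-cpp-python (not code from the linked repo; the model path and prompt are placeholders): one request at a time, greedy sampling, fixed seed.

```python
from llama_cpp import Llama

# Placeholder model path; any local GGUF model behaves the same way.
llm = Llama(model_path="./model.gguf", n_ctx=2048, seed=42, verbose=False)

# Greedy decoding (temperature 0) with a fixed seed, one request at a time:
# run this script twice on the same machine and the text comes back identical.
out = llm("Explain batch invariance in one sentence.",
          max_tokens=64, temperature=0.0)
print(out["choices"][0]["text"])
```

The contrast with the article is that a serving engine batches your request with whatever else is in flight, so the batch shape of your forward pass is not under your control.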