Non-determinism in GPT-4 is caused by Sparse MoE (152334h.github.io)
397 points by 152334H on Aug 4, 2023

Floating point inaccuracies are generally deterministic - running the same calculations twice ought to yield the same results, down to the bit.

You only get divergent results if there is some other source of state or entropy: not zeroing buffers correctly, race conditions, not setting rounding mode flags consistently, etc…

From the quality of the code I’ve seen being cobbled together in the AI/ML ecosystem I would assume all three of those issues are going on, and maybe more.

No, this is not true for GPUs. https://www.twosigma.com/articles/a-workaround-for-non-deter...

(In this particular case, the order in which the numbers are summed up is non-deterministic due to GPU parallelism, which may change the result slightly.)

I would generally refrain from insulting other people's code if you don't know much about the system it's written on.


Editing here since all the replies to this are mostly saying the same thing: Yes, CPUs can also be parallel and it can happen there as well, but unlike a CPU where most instructions on their own are deterministic, CUDA provides primitives that aren't. This is very much by design (as they're faster than their deterministic counterparts), and I mostly just take issue with how parent phrased this as a bug caused by bad code.

GPUs are deterministic machines, even for floating point.

The behavior in the linked article has to do with the use of atomic adds to reduce sums in parallel. Floating point addition is not associative, so the order in which addition occurs matters. When using atomic adds this way, you get slightly different results depending on the order in which threads arrive at the atomic add call. It's a simple race condition, although one which is usually deemed acceptable.

I just edited my comment while you were writing your comment to add an explanation. The point here is that some primitives in e.g. cuDNN are non-deterministic. Whether you classify that as a race condition or not is a different question; but it's intended behaviour.

Right but that's not an inherent GPU determinism issue. It's a software issue.

https://github.com/tensorflow/tensorflow/issues/3103#issueco... is correct that it's not necessary, it's a choice.

Your line of reasoning appears to be "GPUs are inherently non-deterministic don't be quick to judge someone's code" which as far as I can tell is dead wrong.

Admittedly there are some cases and instructions that may result in non-determinism but they are inherently necessary. The author should think carefully before introducing non-determinism. There are many scenarios where it is irrelevant, but ultimately the issue we are discussing here isn't the GPU's fault.

What I'm saying is "there are non-deterministic primitives", not "there are no deterministic primitives".

Yes, and `gettimeofday` is a non-deterministic primitive. There is nothing special about GPUs here. If you write tests that fail sometimes because you used non-deterministic primitives like gettimeofday and someone files a bug we don't throw up our hands and say "this is not a bug but due to how CPUs work." We remove the non-deterministic bit.

There's no difference here. This isn't a GPU problem.

Except the issue is inextricably linked to GPUs. All of the work in practical DNNs exists because of the extreme parallel performance available from GPUs, and that performance is only possible with non-deterministic threading. You can't get reasonable training and inference time on existing hardware without it.

1000 threads can run in parallel. That doesn't prevent us from summing their results deterministically:

    results = ThreadPool(processes=1000).imap_unordered(calc, inputs)
    total = math.fsum(results)

Due to the magic of the fsum algorithm, the result is deterministic whatever order we get the results in. https://docs.python.org/3/library/math.html#math.fsum
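
For anyone who wants to try the fsum claim, here is a minimal self-contained sketch. The `calc` function, the inputs and the pool size are made up for illustration; the only thing it relies on is that `math.fsum` returns the correctly rounded sum, which does not depend on the order the values arrive in.

```python
import math
import random
from multiprocessing.pool import ThreadPool


def calc(x):
    # Hypothetical per-thread work: mix very large and very small values,
    # which is where summation order normally matters most.
    return x * 1e16 if x % 3 == 0 else x * 0.1


inputs = list(range(1, 200))


def parallel_fsum(seed):
    shuffled = inputs[:]
    random.Random(seed).shuffle(shuffled)  # simulate a different arrival order
    with ThreadPool(processes=8) as pool:
        # imap_unordered yields results in whatever order threads finish.
        return math.fsum(pool.imap_unordered(calc, shuffled))


# Collect the totals from several differently ordered runs.
totals = {parallel_fsum(seed) for seed in range(5)}
```

If fsum behaves as documented, `totals` ends up holding a single value. This says nothing about speed, which is what the replies below argue about.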

That's not the operation being performed on GPUs that is the problem. The issue is that fundamentally GPUs allow for high performance operations using atomics, but this comes at the cost of nondeterministic results. You can get deterministic results but doing so comes with a significant performance cost.

Using atomics is easier than warp operations (using warp shuffle for example), but warp shuffle is quite fast.

I guess if determinism is so important implementations can be changed, it is just maybe not that high priority.

That summation is slow and would not be used in practice.

You could use just one thread on your 10000 thread GPU too and it would be deterministic, sure. Completely beside the point.

In my experience cuBLAS is deterministic, since matmul is the most intensive part I don’t see other reasons for non-determinism other than sloppiness (at least when just a single GPU is involved)

Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use ops that are typically non-deterministic to be fast (like scatter-add).

One non-determinism we see with a temperature of 0 is that once you have quantized weights, many predicted pieces will have the same probability, including multiple pieces with the highest probability. And then the sampler (if you are not using a greedy decoder) will sample from those pieces. So, generation is non-deterministic with a temperature of 0.

In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0).
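
A toy illustration of the tie case, as a sketch only: the probabilities and the `sample_top` sampler are invented here and are not taken from curated-transformers or any OpenAI code. A greedy decoder always picks the same index, while a sampler that chooses among the tied top tokens does not.

```python
import random


def greedy(probs):
    # max() returns the first index holding the maximum probability.
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_top(probs, rng):
    # Hypothetical "temperature 0" sampler: chooses uniformly among all
    # tokens tied for the highest probability.
    best = max(probs)
    tied = [i for i, p in enumerate(probs) if p == best]
    return rng.choice(tied)


# Assume quantization left three pieces with exactly the same top probability.
probs = [0.25, 0.25, 0.25, 0.125, 0.125]

greedy_picks = {greedy(probs) for _ in range(100)}
sampled_picks = {sample_top(probs, random.Random(seed)) for seed in range(100)}
```

Here `greedy_picks` only ever contains index 0, while `sampled_picks` spreads over the tied indices 0, 1 and 2.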

[1] https://github.com/explosion/curated-transformers

If the hardware is deterministic, so are the results. You can't generate random numbers purely in software with deterministic hardware.

The behaviour of atomic operations is definitely not deterministic. E.g. if you have a lot of atomic adds, every time you run the code you'll get a different result without a random number generator.

Read the article you linked.

It literally says that the GPU is deterministic, the NVIDIA libraries on top are deterministic, but it is Tensorflow that introduces variability (errors!) for “performance”.

My argument is that it is the AI/ML code that is introducing non-determinism, usually by sacrificing repeatability to gain performance.

That's precisely what's happening here. Tensorflow introduced a "harmless"[1] data race to improve performance by not having to use a deterministic but slower algorithm.

The individual floating point computations are deterministic, it's the multi-threaded design on top that's introducing the variability in the output.

[1] Used to be harmless, but cutting corners like this will make it nigh impossible to repeatably validate the safety of future models like GPT5. That seems pretty dangerous...

As the article says, cuBLAS is deterministic, but other CUDA primitives (e.g. some of those in cuDNN) are not.

Yes, the non-determinism is being introduced somewhere, but that is splitting hairs. The point is that the primitives that you work with on GPUs are non-deterministic by design.

I mostly take issue with you phrasing it as a bug and using it to insult the authors.

How is that splitting hairs?

> The point is that the primitives that you work with on GPUs are non-deterministic by design.

This is just blatantly wrong. There are _some_ operations that can be non-deterministic in some scenarios but they are not necessary.

GPUs are deterministic. If you ask them to add a million floats in order, you get the same result every time. If you ask them to add a million floats in some arbitrary order, then you may get different results every time. The distinction is that someone had to ask the GPU to do that. It's a choice.

> I mostly take issue with you phrasing it as a bug and using it to insult the authors.

It's a bug, whether it insults the authors or not is irrelevant. It's most definitely a bug.

Basically any parallel map-reduce operation using non-commutative reduce operators[0] is non-deterministic unless you specifically sort after/during the gather, or block on the gather (and gather to a thread-determined memory location). Sorting and blocking takes time. If you remove the sort/block, you will get a non-deterministic answer when operating on floats for a wide variety of reduce operations, but it will be faster. This is true of any parallel map-reduce, done anywhere (MPI, cuda kernels, openMP, spark, etc.), and is not unique to gpus/cuda.

> If you ask them to add a million floats in order, you get the same result every time.

There are a bunch of ways to add a million floats in order on a gpu, but they will all get you different results:

* split the million floats into ‘n’ chunks, each chunk is summed, then you sum the ‘n’ results.
* if you sum results as they are gathered (you don’t need to block) you will get a non-deterministic result, as the threads finishing (outside of a warp) is non-deterministic in order.
* if you change ‘n’, your result will change.
* if you sort after gathering, your result will change.

TLDR: parallel race-conditions are nondeterministic. Map-reduce has an underlying race-condition that you can prevent, but it costs time/performance. Sometimes you don’t care about the non-determinism enough to pay the performance penalty to fix it.
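
The "if you change n, your result will change" point can be reproduced on a CPU with four numbers. This is only a sketch: `chunked_sum` and `naive_sum` are hypothetical helpers, and the values are picked so that rounding in IEEE 754 doubles swallows a 1.0 next to 1e16. The explicit loop is deliberate, since the built-in `sum()` uses compensated summation on newer Python versions and would hide the effect.

```python
def naive_sum(values):
    # Plain left-to-right accumulation, like a single thread would do.
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n):
    """Sum `values` in `n` contiguous chunks, then add the partial sums in order."""
    size = -(-len(values) // n)  # ceiling division
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


values = [1.0, 1e16, -1e16, 1.0]

one_chunk = chunked_sum(values, 1)   # ((1.0 + 1e16) - 1e16) + 1.0
two_chunks = chunked_sum(values, 2)  # (1.0 + 1e16) + (-1e16 + 1.0)
```

With round-to-nearest-even doubles, `one_chunk` comes out as 1.0 and `two_chunks` as 0.0: same inputs, same order, different chunking.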

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

Your comment, along with cpgxiii and n2d4’s are all really good. I have a question: suppose training and inference of an LLM were made to be deterministic at the cost of performance.

Would the cost be “everything will take twice as long” or would it be more like “inference will take a week and training will take a couple lifetimes”?

If it’s the latter, then it seems disingenuous to call this a “bug.” It’s like saying F1 cars could be horse drawn, and they only use internal combustion for “performance reasons.” If its the former, then maybe there is a more interesting discussion to be had about the potential benefits of determinism? (That said, I agree with n2d4 that it’s stupid to insult the authors. Talk is cheap and building is hard.)

> That said, I agree with n2d4 that it’s stupid to insult the authors. Talk is cheap and building is hard.

If your code offers an expectation of determinism then it's sloppy to not distinguish where there isn't determinism. There's nothing difficult about writing a comment to the effect of "this function is non-deterministic. For deterministic results, use X".

The code is sloppy if the developers didn't consider determinism and offer nothing to consumers, or if the consumers writing software cannot know where non-determinism is introduced.

If that's somehow insulting then I'd say someone has very thin skin.

There are flags[1] for that indeed. It feels like half of the people commenting here don't know all that much about the topic they're commenting upon

1: https://pytorch.org/docs/stable/generated/torch.use_determin...

Nothing I said conflicts with this, though?

Yes, if you eschew determinism for the sake of raw performance then the result will be non-deterministic. But you don't have to do this, nor is it inherently untenable to solve these problems in a deterministic way.

Sure it may require some performance overhead, and increase development time, but it's no different than writing deterministic code elsewhere. It's disingenuous to hand-wave away the solution because of some opaque cost or overhead we're unwilling to entertain. None of the parent posts ever mention performance tradeoffs.

In particular there is no indication that the problem being discussed couldn't be solved with determinism in an equivalent amount of time. You're making my point: GPUs are deterministic, software may decide not to be.

FWIW, I took “GPUs are deterministic” to mean they are deterministic in all possible intended use cases. This is not strictly true, since the whole point of using them is massive parallelism, which brings along non-determinism, for reasons that others have noted. Of course it’s possible to choose to forego that, but what is the point of a GPU in that case?

This is a false dichotomy. You can have massive parallelism and determinism.

You can trade determinism for convenience, but that doesn't make things easier: now you have to deal with the non-determinism.

But to suggest that massive parallelism somehow implies non-determinism is quite disingenuous from my perspective.

We have mutexes and lock-free ring buffers and stable sorts and all sorts of bells and whistles to make parallelism safe elsewhere. We also already have tools to solve this for GPUs.

I think whether it’s a bug or not depends on the software requirements and expectations. If the code has some expected bounds on runtime, switching the GPU code to sequential processing (for the sake of exact reproducibility) would break that expectation and could be considered a bug as well. If we expect performant code and exact reproducibility, that just might not be possible…

It's hard to call it a bug given that any concurrent float sum or product will be different in regards to changing the amount of concurrency. Even if you order the final value per thread before reducing, the result will differ if you use a different number of threads to split the problem.

Because in floating point arithmetic 1 + 2 + 3 + 4 is different than (1+2) + (3+4).

The PyTorch documentation has an entire section about how to make your code deterministic. In my experience, the performance difference is negligible.
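
For reference, a rough sketch of what those settings tend to look like. The `enable_determinism` helper is hypothetical; the individual calls (`torch.manual_seed`, `torch.use_deterministic_algorithms`, `torch.backends.cudnn.benchmark`) and the `CUBLAS_WORKSPACE_CONFIG` variable are the ones described in PyTorch's reproducibility notes. The import is guarded only so the snippet also loads on a machine without PyTorch.

```python
import os

# cuBLAS wants this set before CUDA is initialized (relevant on CUDA 10.2+).
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

try:
    import torch
except ImportError:  # assumption: PyTorch may not be installed here
    torch = None


def enable_determinism(seed=0):
    """Hypothetical helper: switch on PyTorch's deterministic mode if available."""
    if torch is None:
        return False
    torch.manual_seed(seed)                   # fixed RNG seed
    torch.use_deterministic_algorithms(True)  # error out on non-deterministic ops
    torch.backends.cudnn.benchmark = False    # no run-to-run kernel autotuning
    return True


enabled = enable_determinism(0)
```

Ops without a deterministic implementation raise an error in this mode rather than silently varying, which also makes it a quick way to find out where the non-determinism comes from.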


Unfortunately, determinism across devices or even driver versions is not that easy. You'd have to write your own BLAS kernels using only basic operations, which are guaranteed to follow IEEE 754 semantics.


One gotcha is fused multiply-adds, which the compiler may or may not introduce, so you have to wrap all your floating point operations with __fma* intrinsics to make sure the compiler does not interpret them differently.

As far as I can tell this article doesn't explain why this happens on the GPU (for example, why Tensorflow's reduce_sum is non-deterministic). My hypothesis is that this is entirely due to concurrency: if the same code can be run in two or more different interleavings, they can produce different results. This is corroborated by the first answer here [0].

If so, this exact same issue happens in CPU code as well: have two or more threads, run the program many times, observe different interleavings that expose race conditions which (depending on the algorithm) may or may not produce different results. This can happen even if you don't use floating point, and has nothing to do with floating point non-determinism itself. For example, have a thread print "Hello" and another thread print "World"; even without tearing, you may see either Hello World or World Hello on the screen.

Now, proper floating point non-determinism happens in two cases. One is that when you run the same code on two different architectures you can get different answers (because of rounding modes, because some architecture doesn't support subnormal numbers or signaling NaNs, because transcendental functions like sine are implemented with different accuracy, etc). In this case the code is deterministic when run repeatedly on the same machine, but may give different results on another machine with a different architecture.

The other case is that some "optimizations" actually break your code if applied carelessly (you enable those broken optimizations with -ffast-math in C for example). Among other things, this may break numerical stability of algorithms like Kahan summation. And, if you let the compiler decide which exact optimizations will be applied and in what order, you get non-determinism between different compilers. So in this case it's deterministic when compiled with the same compiler, but may run differently with another compiler.

[0] https://stackoverflow.com/questions/50744565/how-to-handle-n...

To nitpick in addition to the already existing comments: this has nothing to do with GPUs per se. You would see the same issue in multithreaded code on a CPU. Even on a single core CPU this can happen with a multithreaded program depending on how the OS schedules and interrupts the threads. It just happens to be an implementation choice in a GPU library/API.

> I would generally refrain from insulting other people's code if you don't know much about the system it's written on.

Well, the general state of how utterly shoddy most of the code in the AI/ML ecosystem is is observable to anyone trying to follow a guide on how to set up Stable Diffusion on AWS. It's a fucking mess of trying various combinations of driver versions, Ubuntu kernel versions, Python versions, and the fact that Python requirements.txt (similar to NodeJS) doesn't pin versions of transitive dependencies doesn't make it easier because it makes for very brittle and not reproducible builds/guides. Oh, and at least some of that stuff won't work without root.

Yeah I'll keep AI shit cordoned off in its own subnet.

Years before ChatGPT I made the joke that AI would want to take over the world like a computer virus, but it’s written in Python, so it can’t figure out how to install itself on other computers.

I think the joke was on Twitter, RIP.

I 'member that joke. Think it must have gone around while OpenStack transitioned from Python 2 to 3. What a fucking mess.

There isn't much of a culture around code quality in ML / AI / DS.

It's not a code quality issue, there are ways to ensure determinism (sometimes you just need to set a flag), however, they are intentionally explicitly not used in order to gain performance.

I don’t know about how insulting it is, I don’t like rushing things out but we’ve all had to.

People are rushing like crazy to get there first with X for AI all over the place, it would be pretty shocking if there weren’t wires sticking out everywhere.

I don’t think that says anything positive or negative about the hackers involved.

it’s basically always reasonable to insult someone’s code because we are computer programmers and we know what we have done

So you can generate true random numbers using just the GPU parallelism? Consider me impressed!

You've moved the goal posts. You're conflating CUDA with GPUs. From Wikipedia:

> CUDA (or Compute Unified Device Architecture) is a proprietary and closed source parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

Is the issue we're discussing because of the GPU or is it because of choices made in software libraries?

The parent is right, there is a deterministic, reproducible way to solve these problems, so if determinism is a desired or expected property, then this is a bug. It's not an inherent problem like you make it out to be. The fact that "workarounds" are given in what you link proves this.

What you said can be violated when parallelism is involved. For example, we know some floating point operations such as addition and multiplication are non-commutative, so the result of a reduction depends on the order of execution. In a parallel setting, some implementations make the order of reduction non-deterministic (for performance reasons), and hence the final result is non-deterministic too.

Minor nit but commutative is the wrong term. Floats always obey a+b == b+a, but not associativity: (a+b)+c != a+(b+c).
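
Easy to check in any Python REPL; the three values are just the usual textbook example, assuming IEEE 754 doubles:

```python
a, b, c = 0.1, 0.2, 0.3

# Swapping the operands never changes the result of a single addition.
commutes = (a + b) == (b + a)

# Regrouping does: the intermediate sums round differently.
left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
```

`commutes` is True while `left` and `right` differ in the last bit, which is exactly the property a reordered parallel reduction trips over.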


It's still deterministic even if the results appear not to be. If you have memory, CPU cache, CPU registers in the same state, you will get the very same results. You need a source of entropy for the results to be non deterministic.

Actually, clock domain crossing for asynchronous clocks (as is AFAIK typical for granular dynamic frequency scaling, like running CPU cores at individual frequencies instead of all at the same, because it quite softly smoothes over to any new target frequency to prevent glitches) implicitly includes thermal noise in the raw transistors that determine which of the two involved clock edges happened earlier (a decision that eventually ends up truly random when they are at (almost) exactly the same time). And this is involved in even L3 hit latency.

Sure, but they will never be in the same state, which can even be used as a source of entropy: https://link.springer.com/article/10.1007/s11071-015-2287-7

Mathematically, computation is deterministic. The author dismisses or ignores the many ways that the physical apparatus driving the computation can force the result of a software application to be a function of time.

Calling GetTimeOfDay() could do it.

Clock frequency drift between multiple processors could do it.

Quantum computer is under the category of computers.

Quantum computation relies on Quantum mechanics.

Quantum mechanics are not deterministic.

So, Quantum computers are not deterministic.

Therefore, unless P=NP, not all computations are deterministic.

When theory fails to consult reality.

hmm, now I wonder if Alhazen’s Circular Billiard Problem[1] results for n steps in simulation will be the same for multiple runs.

[1] https://forumgeom.fau.edu/FG2012volume12/FG201216.pdf

On a large scale, not having memory with good ECC is enough to have entropy.

Small nit. You mean errors due to floating point math

Not sure I understand the excerpt from the referenced paper.

Is it saying that part of its more-efficient inferencing relies on mixing tokens from completely-separate inputs – eg, from other users? And then, depending on what other inputs chance into the same grouping, the relative assignment-to-'experts' varies, and thus the eventual completions?

If so, I'd see that as not just introducing non-determinism, but also potentially making the quality of your responses dependent on how-many-concurrent-requests are fighting for the same expert-allocations.

(For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours-of-the-day in India, when the most concurrent inputs are competing for that same competence.)

Perhaps also, this is another possible explanation for perceived quality-degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests aren't as reliably winning as much of the relevant 'experts' effort.

This may also suggest a bit of a quagmire: on whatever domains some sub-experts seem impressively good, initially, even more proportionate use will be attracted. But such new congestion means all the copycat use no longer gets the same expert allocations – and thus the initially-impressive performance degrades.

(And if the effect is strong, & known-but-undisclosed-by-OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially-uncongested Mixture-of-Experts system, but then offer them the lower-quality results from a more-congested system.)

The results are showing essentially 12 unique responses from 30 tries… not what you would expect from mixing tokens.

I think it groups the batch up differently, so if I have a batch of 10, and it groups it up into 2 groups of 5, if my prompt makes it to the second group or 1st group I get a different answer. But if I’m in the same location in the batch, then I get the same answer.

The whole batch is deterministic given the same batch (sequences and ordering), but if you shuffle the batch then you lose that determinism.
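
A toy model of that batch dependence, with everything invented for illustration (the router scores, the capacity of 1, the overflow rule); this is not how GPT-4 or any particular MoE implementation is known to route. Each token goes to its highest-scoring expert that still has room, and tokens are served in batch order.

```python
def route(batch, capacity):
    """Assign each token to its best-scoring expert that still has a free slot.

    `batch` is a list of (name, scores) pairs, where scores[i] is a made-up
    router affinity for expert i. A token is left unassigned (dropped) if
    every expert is already full.
    """
    load = {}
    assignment = {}
    for name, scores in batch:
        # Experts ranked from most to least preferred for this token.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for expert in ranked:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignment[name] = expert
                break
    return assignment


mine = ("mine", [0.9, 0.1])     # strongly prefers expert 0
other_a = ("a", [0.8, 0.2])     # also prefers expert 0
other_b = ("b", [0.3, 0.7])     # prefers expert 1

quiet = route([other_b, mine], capacity=1)  # no competition for expert 0
busy = route([other_a, mine], capacity=1)   # "a" takes expert 0 first
```

The token `mine` is identical in both calls, yet it lands on expert 0 in the quiet batch and expert 1 in the busy one. Repeating either call with the same batch in the same order gives the same assignment every time.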

this seems like a plausible outcome, and if true could spell disaster for OpenAI models relative to the competition and open source models. Currently, reliability is one of the core obstacles preventing widespread adoption of LLMs in many business critical workflows. And if these rumors, that GPT-4 is inherently un-deterministic and unreliable, are true then most enterprises are better off finetuning open source LLMs—which are just as capable—for their specific domains. they stand to gain better performance that way anyways, as domain-specific models will always outperform generalist ones

> And if these rumors, that GPT-4 is inherently un-deterministic and unreliable, are true then most enterprises are better off finetuning open source LLMs—which are just as capable

Wait, am I misunderstanding you? I feel like I've had a head injury or something, because I've never heard of an open source LLM that's as capable as GPT-4 (in most scenarios).

Only on specific domains, these models don't become generalists like GPT-4, they can become task experts for a single task.

Fine-tuned MedPalm is worse than GPT-4 on most Medical Challenge Tests. Fine-tuned Minerva is much worse on arithmetic benchmarks.

The LLM space is just different. There's no guarantee a fine-tuned model will beat a bigger generalist one.

> domain-specific models will always outperform generalist ones

That's only true assuming you have enough data to train a domain-specific model / expertise to train it and test it correctly.

I've encountered cases where an image recognition task could be accomplished well with a very general model like CLIP, but people still fine-tuned another model on their own small data set because that's considered better.

A domain specific model might be more likely to fail on weird outliers not present in the small domain specific training data.

> could spell disaster for OpenAI

Nah I don't think so. They are not all in on one specific model architecture. If the current architecture is found to have serious unfixable flaws then they'll just change architecture.

>as domain-specific models will always outperform generalist ones

This is not even close to true for Language models.

Fine-tuned MedPalm is worse than GPT-4 on most Medical Challenge Tests. Fine-tuned Minerva is much worse on arithmetic benchmarks.

The LLM space is just different. There's no guarantee a fine-tuned model will beat a bigger generalist one.

_If_ 3.5 is a MoE model, doesn't that give a lot of hope to open source movements? Once a good open source MoE model comes out, maybe even some type of variation of the decoder models available (I don't know whether MoE models have to be trained from scratch), that implies a lot more can be done with a lot less.

I agree, and really hope that Meta is doing something in that vein. Reducing the FLOPs:Memory ratio (as in Soft MoE) could also open the door to CPU (or at least Apple Silicon) inference becoming more relevant.

It would be bad for single-consumer-GPU inference setups.

Not an expert (no pun intended), but MoE where each expert is actually just a LoRA adaptor on top of the base model gets me pretty excited. Since LoRA adaptors can be swapped in and out at runtime, it might be possible to get decent performance without a lot of extra memory pressure.

While MoE-LoRAs are exciting in themselves, they are a very different pitch from full on MoEs. If the idea behind MoEs is that you want completely separate layers to handle different parts of the input/computation, then it is unlikely that you can get away with low-rank tweaks to an existing linear layer.

Could this work well with distributed solutions like petals?


I don't understand how petals can work though. I thought LLMs were typically quite monolithic.

Petals does a layerwise split I think. You could probably run separate experts on each system. I don't think this sort of tech is very promising so I haven't looked.

It could be good if the relevant expert(s) can be loaded on demand after reading the prompt? If the MoE is, say, 8x8B params, then you could get good speed out of a 12GB GPU, despite the model being 64B params in size. Or am I misunderstanding how this all works?

I feel like this introduces the potential for weird and hard-to-implement side channel attacks, if the sequences in a batch can affect the routing of others.

I think you’re right. Would be very hard to exploit I imagine though.

Hard like building a virtual machine in an image decoder? If there’s a way there’s a will.

the tools available to imagine such things are limited today.

the language models in our heads have not caught up to the ones in our browsers.

as the similarities and associations crystallize a bit better, it won’t look so hard.

bookmark this if you think it bullshit. eight months.

I don't expect LLMs to be good enough at engineering to trivialize this kind of thing for a while - possibly never, if something else comes along and outcompetes them.

not models.


Same thing was said about Spectre-like bugs

This is _excellent_ work, I've been adamantly against MoE for a set of reasons, this is the first compelling evidence I've seen that hasn't been on Substack or a bare repeating of rumor.

I had absolutely no idea GPT4 was nondeterministic and I use it about 2 hours a day. I can see why a cursory look wasn't cutting it, they "feel" the same in your memory, a lot of similar vocab usage, but are formatted entirely differently, and have sort of a synonym-phrase thing going where some of the key words are the same.

Thanks. I'm really no expert (:P) on MoE research; I just noticed what was written in the Soft MoE paper and felt a need to check.

The non-deterministic outputs are really similar, yeah, if you check the gist examples I linked https://gist.github.com/152334H/047827ad3740627f4d37826c867a.... This part is at least no surprise, since the randomness should be bounded.

I suspect OpenAI will figure out some way to reduce the randomness at some point, though, given their public commitment to eventually adding logprobs back to ChatCompletions.

I don't think this commitment had any plausibility. Token "probabilities" only have a straightforward probabilistic interpretation for base models. In fine-tuned models, they no longer represent the probability of the next token given the prompt, but rather how well the next token fulfills the ... tendencies induced by SL and RL tuning. Which is presumably pretty useless information. OpenAI has no intention to provide access to the GPT-4 base model, and they in fact removed API access to the GPT-3.5 base model.

Topic laundering, the probabilities are the probabilities, you don't suddenly get wrong probabilities with more training on more data

You do, because it’s not just more training it’s PPO updates instead of MLE. It’s no longer trying to estimate the token distribution of the training corpus, it’s trying to shift logprobs into tokens that maximize expected reward from the RM. The GPT-4 technical report has a figure showing that logprobs become less well calibrated as confidence scores in the RLHF vs pre-train model.

Fascinating, ty

GPT4 web chat for two hours a day? I buy that. Using the API repeatedly for the same inputs, eg developing a program, and the non-determinism is hard to miss.

I would imagine that most people use nonzero temperature, so they won't need to look for any explanation for non-determinism.

Literally the first thing I did when I had llama.cpp working was set the temperature to 0 and repeat queries.

(but that's mainly because I'm a weird old scientist with lots of experience with nondeterminism in software).

I did too, Kmeans broke me a couple years ago: but, never temperature at 0 with long length, and trusted my instinct instead of actual diffs. This was the first time I actually diffed

Yeah, it's one of the first things you notice when trying to do some kind of "feed GPT some data and get it to produce a novel answer to a question" task with the API.

No, because if you wanted a novel answer, why would you set 0 temperature? ;)

> I've been adamantly against MoE for a set of reasons

Such as?

It was completely unsubstantiated, based on rumours from a blog, but everyone repeated it as fact.

I think it is pretty compelling that almost all of the people doing research into switch transformers at Google were hired into OAI. I am not sure if that is publicly reported, but once Ghotz leaked those details about the models, I went to check where the authors of those papers are now and.... yep

What do you use it for? Are you using many plugins? Curious what sort of insights someone using the tool this much might have, perhaps even through the batch of features released this week.

Mixture of Experts

Thanks. I assumed it was Margin of Error. The article doesn't expand the acronym until midway through the post, where it appears almost accidentally. Perhaps the intended audience is a mixture of experts, of which I'm not a part.

I suspect the article is written primarily to be clear to people sufficiently immersed in the relevant areas to be able to have a concrete opinion on the theory.

Also I strongly suspect that at least in the case of -me-, an article that was easier for me to understand wouldn't make the underlying theory any easier for me to judge.

(on the upside, at least I -did- understand and appreciate your self deprecating pun :)

Thank you! I knew it couldn't mean "Merger of Equals"... but then again, if those experts are equals, then maybe that acronym also works ;-)

The GPT-3.0 "davinci-instruct-beta" models have been returning non-deterministic logprobs since at least early 2021. This is speculation. CUDA itself often has nondeterminism bugs.

text-davinci-001 and text-davinci-002 were trained through FeedMe and SFT, while text-davinci-003 was RLHF; the models themselves have more variance at high temperature.

What about the foundation models, i.e. davinci and code-davinci-002?

"these tokens often compete against each other for available spots in expert buffers. " So is this also why ChatGPT is often just writing placeholders in place of functions when I ask him for some long code?

> these tokens often compete against each other for available spots in expert buffers.

Hold up, does this mean that under heavy load the results change? Does this explain why it sometimes feels like the output quality changes?

MoE: Mixture of Experts

There’s a comment that’s 3 hours older than yours that clarifies this.

I searched for MoE in the comments and didn't see it. ah, you must mean this one https://news.ycombinator.com/item?id=37006549, which doesn't include "MoE", so that's why I didn't find it. Still, my comment's upvotes show it was helpful to some - maybe they searched for "MoE" too, instead of "mixture of experts".

I asked GPT to explain this:

>In the MoE approach, different "experts" or portions of the model are selected for different parts of the input data. The selection of which experts to use can be influenced by several factors, including the specific content of the input data, the order in which data is processed in a batch, and possibly even minor variations in the internal state of the model.

>This "expert selection" process introduces a level of stochasticity, or randomness, into the model's operation. For example, if you process the same input data twice in slightly different contexts (e.g., as part of different batches), you might end up consulting slightly different sets of experts, leading to slightly different outputs.
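
To make the batch dependence concrete, here is a toy sketch of capacity-limited routing. It is not OpenAI's implementation or any library's; `route_with_capacity`, the first-come-first-served order, and the "overflow to the next expert" rule are all assumptions for illustration (real systems may instead drop overflowing tokens). It only shows how the same token can land on a different expert depending on what else is in the batch.

```python
def route_with_capacity(batch, num_experts=2, capacity=1):
    """Toy router: each token goes to its preferred expert if there is room.

    `batch` is a list of (token, preferred_expert) pairs, processed in order.
    A token whose preferred expert buffer is full overflows to the next
    expert with room (an assumed policy), or gets None if all are full.
    """
    load = [0] * num_experts
    assignment = {}
    for token, preferred in batch:
        chosen = None
        for offset in range(num_experts):
            expert = (preferred + offset) % num_experts
            if load[expert] < capacity:
                load[expert] += 1
                chosen = expert
                break
        assignment[token] = chosen
    return assignment

# Same token, same preference, different batch neighbours.
alone = route_with_capacity([("my_token", 0)])
crowded = route_with_capacity([("other_token", 0), ("my_token", 0)])
print(alone["my_token"], crowded["my_token"])  # 0 1
```

In the crowded batch the other token takes the only slot on expert 0 first, so "my_token" is processed by a different expert even though nothing about it changed.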

> It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0

Interestingly, on another discussion there was a claim that setting the temperature to 0.0 made gpt-4 deterministic: https://news.ycombinator.com/item?id=36503146

This guy probably never did anything nontrivial with the API - you notice almost instantly that the chat models (both 3.5 and 4) are nondeterministic at 0 temperature. Source - built a documentation search bot and had it crap out on me on copy pasted prompts when I was demoing it.

Apparently, and I haven't tested this, just from what I read, the simpler GPT-2 models are deterministic at 0 temperature.

If you want to make it deterministic, just cache the responses keyed by queries.
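
A minimal sketch of that caching idea, assuming an in-memory dict is enough. `cached_completion` and `call_model` are hypothetical names: `call_model` stands in for whatever function actually hits the API. Note this only makes repeated identical requests return the same stored answer; it does nothing about the model's underlying behaviour.

```python
import hashlib
import json

_cache = {}

def cached_completion(call_model, prompt, **params):
    """Return a stored response for (prompt, params), calling the model once."""
    # Sort keys so the same parameters always produce the same cache key.
    key_material = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt, **params)
    return _cache[key]

# Usage with a stand-in model that answers differently on every call.
calls = []
def fake_model(prompt, **params):
    calls.append(prompt)
    return f"reply {len(calls)}"

first = cached_completion(fake_model, "hi", temperature=0)
second = cached_completion(fake_model, "hi", temperature=0)
```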

How interesting. I was just discussing this last night with our analysts after I experimentally noticed that temp=0.0 (and all penalties/top_p set accordingly) still showed non-deterministic behavior. Wasn't sure why this was, and now this article comes about.

The explanation makes quite a bit of sense.

This is a plausible hypothesis. I’m curious whether OpenAI has considered this already and examined it. I feel like an average senior eng could eval this in under two focused days, but maybe OpenAI has less unit-testing than I expect.

Well, a colleague of mine managed to build a non deterministic GET REST API endpoint. :D

this hypothesis makes a lot of sense. if indeed gpt-4 is a sparse MoE, which i believe it is, then OpenAI must have tested and proved their initial idea of a large capacity MoE LLM by first training/building a smaller one. this smaller test model might be gpt-3.5-turbo.

I see in the comments what seems to be a huge misunderstanding between two uses of “non-deterministic”: 1) from normal English: cannot be determined beforehand (results may vary) 2) from theory of computation: loosely “parallel computation” (unknown path to the solution)

For floating point math, there's no distinction, as "parallel computation with unknown path to the solution" inherently implies "results will vary", since in general (a+b)+c != a+(b+c).
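
A quick illustration in plain Python with ordinary 64-bit floats (nothing GPU-specific; the variable names are just for the example). Regrouping or reordering the same additions changes the result, which is all a parallel reduction with an unspecified order needs in order to produce different bits from run to run.

```python
# Regrouping three ordinary values changes the last bit.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# Reordering a sum with large cancellation changes the answer entirely.
in_order = (1e16 + 1.0) + -1e16   # the 1.0 is lost to rounding -> 0.0
reordered = (1e16 + -1e16) + 1.0  # cancellation happens first -> 1.0
print(in_order, reordered)        # 0.0 1.0
```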

I wonder if there’s a side channel attack in there waiting to happen..

Determinism should always be an option in any system.

can somebody make some quantum AI, that's super deterministic.

