Extremely doubtful that it boils down to quadratic scaling of attention. That whole issue is a leftover from the days of small BERT models with very few parameters.

For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops

Compute per token = 2(P + L × W × D)

P: total parameters
L: number of layers
W: context size
D: embedding dimension

For Llama 8B, the attention (context) term starts dominating the compute cost per token only at around 61k tokens of context.
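A quick back-of-the-envelope check of that crossover point; the config values below are an assumption (Llama 3 8B's published numbers: 32 layers, hidden dimension 4096, roughly 8.0B parameters), but the shape of the calculation is what matters:

    # Rough FLOPs-per-token crossover: when does the attention (context) term
    # overtake the parameter term? Uses compute_per_token = 2 * (P + L * W * D).
    P = 8.03e9   # total parameters (assumed Llama 3 8B)
    L = 32       # transformer layers
    D = 4096     # embedding / hidden dimension

    # The attention term 2*L*W*D equals the parameter term 2*P when W = P / (L * D)
    crossover_ctx = P / (L * D)
    print(f"context length where attention FLOPs match parameter FLOPs: {crossover_ctx:,.0f} tokens")
    # -> roughly 61k tokens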


Under a cross-entropy loss the output activations do absolutely represent a probability distribution, since that is what we're modeling.
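A minimal sketch of what that means, assuming the usual softmax-over-logits setup (the toy numbers here are purely illustrative):

    import numpy as np

    # Logits (output activations before softmax) for a toy 5-token vocabulary.
    logits = np.array([2.0, -1.0, 0.5, 0.1, -3.0])

    # Softmax turns them into a valid probability distribution over the vocab.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(probs, probs.sum())  # non-negative, sums to 1.0

    # Cross-entropy loss is computed against exactly this distribution:
    target = 2                     # index of the observed next token
    loss = -np.log(probs[target])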


The output distribution is deterministic; the output token is sampled from that distribution and is therefore not deterministic. Temperature modulates the output distribution, but setting it to 0 (i.e. argmax sampling) is not the norm.
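A small sketch of that distinction, using a toy distribution (the logits here are illustrative, not from any real model):

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.5, 0.3, -1.0])  # deterministic output of the model

    def sample(logits, temperature):
        # Temperature 0 degenerates to argmax (greedy) decoding.
        if temperature == 0:
            return int(np.argmax(logits))
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    print(sample(logits, temperature=0))                        # always the same token
    print([sample(logits, temperature=1.0) for _ in range(5)])  # varies run to run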


Running at temperature zero / greedy sampling (what you call "argmax sampling") is EXTREMELY common.

LLMs are basically "deterministic" when using greedy sampling, except for either MoE-related shenanigans (what historically prevented determinism in ChatGPT) or floating-point issues (GPU-related). In practice, LLMs are in fact basically "deterministic" except for the sampling/temperature stuff that we add at the very end.


> except for either MoE related shenanigans (what historically prevented determinism in ChatGPT)

The original ChatGPT was based on GPT-3.5, which did not use MoE.


This paper is doing exactly that though: handwaving with a couple of floats. The paper is just a collection of observations about what their implementation of Shapley value analysis gives for a few variations of a prompt.


You have an excellent point. Bear with me.

I realized when writing this up that saying "SAE isn't helpful but this is" comes across as perhaps playing devil's advocate. But I arrived at this in a stream of consciousness while writing, so I had to take a step back and think it through before editing it out.

Here is that thinking:

If I had a model completely mapped using SAE, at most I can say "we believe altering this neuron will make it 'think' about the Golden Gate Bridge more when it talks" -- that's really cool for mutating behavior, don't get me wrong; it's what my mind is drawn to as an engineer.

However, as a developer of LLMs, through writing the comment, I realized SAE isn't helpful for qualifying my outputs.

For context's sake, I've been laboring on an LLM client for a year with a doctor cofounder. I'm picking these examples because they feel natural, not to make them sound fancy or important.

Anyways, let's say he texts me one day with "I noticed something weird...every time I say 'the patient presents with these symptoms:' it writes more accurate analyses"

With this technique, I can quantify that observation. I can pull 20 USMLE questions and see how it changes under the two prompts.
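A rough sketch of that kind of A/B check; everything here is hypothetical (the ask() call, the grading, the question set are stand-ins for whatever client and data you actually have):

    # Hypothetical A/B check: does the prefix change accuracy on a fixed question set?
    def ask(prompt: str) -> str:
        raise NotImplementedError  # call your LLM client here

    questions = [
        # (question stem, correct answer) pairs, e.g. 20 USMLE-style items
    ]
    prefixes = ["", "The patient presents with these symptoms: "]

    for prefix in prefixes:
        correct = sum(ask(prefix + stem).strip() == answer for stem, answer in questions)
        print(repr(prefix), f"{correct}/{len(questions)}")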

With SAE, I don't really have anything at all.

There's a trivial interpretation of that: e.g., professionals are using paid LLMs, and we can't get SAE maps.

But there's a stronger interpretation too: if I waved a magic wand and my cofounder was running Llama-7-2000B on their phone, and I had a complete SAE map of the model, I still wouldn't be able to make any particular statement at all about the system under test, other than "that phrase seems to activate these neurons" -- which would sound useless / off-topic / engineer masturbatory to my cofounder.

But to my engineering mind, SAE is more appealing because it seems to reveal how the model works fundamentally. However, I am overlooking that it still doesn't say how it works, just an unquantifiable correlation between words in a prompt and which floats get used. To my users, the output is how it works.


Cosine similarity is very much about similarity, but it's quite fickle and indirect.

Given a function f(l, r) that measures, say, the log-probability of observing both l and r, and that takes the form f(l, r) = <L(l), R(r)>, i.e. the dot product between embeddings of l and r: the cosine similarity of x and y (the normalized dot product of L(x) and L(y)) is very closely related to the correlation of f(x, Z) and f(y, Z) when we let Z vary.
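A numerical illustration of that relationship -- a sketch under the assumption that the right-hand embeddings R(Z) are roughly zero-mean and isotropic, with random vectors standing in for the real embeddings:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64

    # Stand-ins for L(x), L(y): two fixed left embeddings.
    Lx, Ly = rng.normal(size=d), rng.normal(size=d)

    # Stand-in for R(Z) as Z varies: many zero-mean, isotropic right embeddings.
    Rz = rng.normal(size=(10000, d))

    fx, fy = Rz @ Lx, Rz @ Ly  # f(x, Z) and f(y, Z) as dot products
    cos = Lx @ Ly / (np.linalg.norm(Lx) * np.linalg.norm(Ly))
    corr = np.corrcoef(fx, fy)[0, 1]

    print(cos, corr)  # the two numbers land close to each other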

