> This approach works by randomly polling participating devices for whether they’ve seen a particular fragment, and devices respond anonymously with a noisy signal. By noisy, we mean that devices may provide the true signal of whether a fragment was seen or a randomly selected signal for an alternative fragment or no matches at all. By calibrating how often devices send randomly selected responses, we ensure that hundreds of people using the same term are needed before the word can be discoverable. As a result, Apple only sees commonly used prompts, cannot see the signal associated with any particular device, and does not recover any unique prompts. Furthermore, the signal Apple receives from the device is not associated with an IP address or any ID that could be linked to an Apple Account. This prevents Apple from being able to associate the signal to any particular device.
The way I read this, there's no discovery mechanism here, so Apple has to guess a priori which prompts will be popular. How do they know what queries to send?
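The quoted mechanism reads like classic randomized response. As a toy sketch of one poll-and-tally round (the fragment list, the probabilities, and the tally math below are all made up for illustration, not Apple's actual protocol):

```python
import random
from collections import Counter

def device_response(true_fragment, candidate_fragments, p_truth=0.05):
    # toy randomized response: usually send noise, occasionally the true signal
    if true_fragment is not None and random.random() < p_truth:
        return true_fragment
    return random.choice(candidate_fragments + ["no-match"])

def server_tally(responses, candidate_fragments, p_truth=0.05):
    # every fragment accumulates roughly the same noise floor, so only fragments
    # genuinely seen by many devices rise above it
    counts = Counter(responses)
    noise_floor = len(responses) * (1 - p_truth) / (len(candidate_fragments) + 1)
    return {f: counts[f] - noise_floor for f in candidate_fragments}
```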
Later in the article, for a different (but similar) feature:
> To curate a representative set of synthetic emails, we start by creating a large set of synthetic messages on a variety of topics... We then derive a representation, called an embedding, of each synthetic message that captures some of the key dimensions of the message like language, topic, and length. These embeddings are then sent to a small number of user devices that have opted in to Device Analytics.
It's crazy to think Apple is constantly asking my iPhone if I ever write emails similar to emails about tennis lessons (their example). This feels like the least efficient way to understand users in this context. Especially considering they host an email server!
yeah, the linked paper [1] has more detail--basically they seem to start with a seed set of "class labels" and subcategories (e.g. "restaurant review" + "steak house"). They ask an LLM to generate lots of random texts incorporating those labels. They make a differentially private histogram of embedding similarities from those texts with the private data, then use that histogram to resample the texts, which become the seeds for the next iteration, sort of like a Particle Filter.
I'm still unclear on how you create that initial set of class labels used to generate the random seed texts, and how sensitive the method is to that initial corpus.
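A very rough sketch of that iterate-and-resample loop as I read it from the paper; the embedding function, the LLM generation step, the noise model, and every name below are stand-ins rather than the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text):
    # stand-in for a real sentence-embedding model
    return np.random.default_rng(abs(hash(text)) % 2**32).standard_normal(16)

def generate_candidates(seed_texts, n=200):
    # stand-in for asking an LLM to write new texts riffing on the current seeds/labels
    return [f"{rng.choice(seed_texts)} / variant {i}" for i in range(n)]

def noisy_device_histogram(private_texts, candidate_embs, epsilon=1.0):
    # each opted-in device votes for the candidate nearest its private message;
    # the server only ever sees the noisy aggregate, not individual matches
    hist = np.zeros(len(candidate_embs))
    for text in private_texts:
        hist[np.argmax(candidate_embs @ embed(text))] += 1
    return hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)

def refine(seed_texts, private_texts, rounds=3):
    for _ in range(rounds):
        candidates = generate_candidates(seed_texts)
        cand_embs = np.stack([embed(c) for c in candidates])
        hist = noisy_device_histogram(private_texts, cand_embs)
        probs = np.clip(hist, 0, None)
        probs /= probs.sum()
        # resample candidates in proportion to the noisy counts (the particle-filter-ish step)
        seed_texts = list(rng.choice(candidates, size=len(seed_texts), p=probs))
    return seed_texts
```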
You could brute force it by querying about all 500k English words. With 1.3+ billion iPhone users, that means about 2600 users will see any given word, which may be enough to observe trends.
No, I think it's fairly well guaranteed that devices are encrypting and then submitting prompts. Homomorphic encryption allows them to do honest-to-god work without decrypting the data. The "fragments" the polled devices are sent are probably some sub-sequence of the encrypted prompt.
I think the main advantage is that you can compute the extra parameters (the PRNG seeds) from the network weights alone, whereas most other quantization methods require simulating the quantization procedure at training time (Quantization-Aware Training) or setting them from a calibration dataset (Post-Training Quantization).
> What makes this technique particular to LLM weights
This is my understanding as a non-expert.
LLM activations tend to be relatively sparse with large outliers. With linear quantization, this means you either have to clip off the outliers or you have to stretch your range to include the outliers, which wastes precious bits. Neither of these works well, so essentially all LLM quantization research is using various heuristics to get around these outliers. For example, you can do linear quantization but split the activations up into smaller blocks to make it less likely that any given block contains an outlier.
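As a concrete illustration of the block trick (not any particular paper's scheme), symmetric per-block int8 quantization looks roughly like this; the block size and bit width here are arbitrary:

```python
import numpy as np

def quantize_blockwise(x, block_size=64, bits=8):
    # symmetric linear quantization with a separate scale per block, so one
    # outlier only wastes dynamic range inside its own block
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)            # assumes len(x) % block_size == 0
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

x = np.random.randn(1024).astype(np.float32)
x[10] = 40.0                                      # a single large outlier
q, s = quantize_blockwise(x)
print(np.abs(dequantize_blockwise(q, s) - x).mean())
```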
Another trick people have discovered (predates LLMs) is applying a random rotation/projection to the embeddings. This has the effect of making sure no one dimension in the vector dominates the others (which again hurts quantization). This works because in order for a single dimension to dominate, all the others have to "conspire" to be near zero. When you have 10,000+ dimensions, that's very unlikely.
This paper applies the latter trick. Instead of pre-generating the random projection matrices, they generate them on the fly on the accelerator from a seed that is fixed for each block. The seed is chosen from an offline brute-force search that needs only the weights of the network. This separates it from a lot of other quantization methods that either require calibration data or have to be simulated at training time so the network learns the quantization parameters itself.
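My mental model of the seeded-rotation idea, sketched with a plain orthogonal matrix regenerated from a seed. The actual paper may use a different structured transform and a smarter search; everything below is a simplification:

```python
import numpy as np

def rotation_from_seed(dim, seed):
    # regenerate the same orthogonal matrix on the fly from a cheap PRNG seed,
    # instead of storing the matrix alongside the weights
    g = np.random.default_rng(seed).standard_normal((dim, dim))
    q, _ = np.linalg.qr(g)
    return q

def quantize_block(w_block, seed, bits=4):
    r = rotation_from_seed(w_block.shape[-1], seed)
    rotated = w_block @ r                         # spreads outlier energy across dimensions
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(rotated).max() / qmax, 1e-8)
    return np.clip(np.round(rotated / scale), -qmax, qmax), scale

def dequantize_block(q, scale, seed):
    r = rotation_from_seed(q.shape[-1], seed)
    return (q * scale) @ r.T                      # rotation is orthogonal, so this undoes it

def best_seed(w_block, candidate_seeds, bits=4):
    # offline brute-force search over seeds, using nothing but the weights themselves
    def err(s):
        q, sc = quantize_block(w_block, s, bits)
        return np.abs(dequantize_block(q, sc, s) - w_block).mean()
    return min(candidate_seeds, key=err)
```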
You might think this is wasteful/might hurt performance, but it turns out that LLM inference is heavily memory-bound as it involves streaming a very large neural network into the accelerator (GPU/TPU/NPU/whatever) to operate on a relatively small amount of data, so there are lots of "free cycles" to generate these random numbers. Of course, if you care about power usage that might not be a great idea...
This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency isn’t necessarily related to adjacent input tokens.
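A toy illustration of that mixing, ignoring learned query/key/value projections entirely:

```python
import numpy as np

def causal_mix(x):
    # toy self-attention with no learned projections: each output row is a
    # softmax-weighted average of the current and all earlier rows of x,
    # so "position i" after this op already blends many input tokens
    n = x.shape[0]
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), x @ x.T, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x
```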
A borderline tautological answer might be “because the network learns that putting related things next to each other increases the usefulness of the convolutions”
Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only comparison I see is they split the temporal and depthwise transformers on the zeroth RVQ codebook, whereas Moshi has a special zeroth level vector quantizer distilled from a larger audio model, with the intent to preserve semantic information.
EDIT: also Moshi started with a pretrained traditional text LLM
> To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:
> Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
> Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.
This is a post-training step to align an existing pretrained LLM. The state space is the set of all possible contexts, and the action space is the set of tokens in the vocabulary. The training data is a set of math/programming questions with unambiguous and easily verifiable right and wrong answers. RL is used to tweak the model's output logits to pick tokens that are likely to lead to a correctly formatted right answer.
(Not an expert, this is my understanding from reading the paper.)
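A toy version of what such a rule-based reward might look like; the exact tags, answer format, and weights are guesses on my part, not DeepSeek's actual implementation:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    reward = 0.0

    # format reward: the thinking process must be wrapped in <think> ... </think>
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # accuracy reward: compare the final \boxed{...} answer against the known result
    match = re.search(r"\\boxed\{(.+?)\}", response)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward
```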
Has there been any serious study of exactly how LLMs store and retrieve memorized sequences? There are so many interesting basic questions here.
Does verbatim completion of a bible passage look different from generation of a novel sequence in interesting ways? How many sequences of this length do they memorize? Do the memorized ones roughly correspond to things humans would find important enough to memorize, or do LLMs memorize just as much SEO garbage as they do bible passages?
LLMs do not store and retrieve sequences. LLMs are not databases. LLMs are not predictable state machines. Understand how these things work.
They take the input context and generate the next token, then feed that whole thing back in as context and predict the next token, and repeat until the most likely next token is their stop word.
If they produce anything like a retrieved sequence, that's because they just happened to pick that set of tokens based on their training data. Regenerating the output from exactly the same input has a non-zero chance of generating different output.
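In code, the loop being described is roughly this (the `model` callable is a stand-in for the actual network); with temperature 0 it reduces to greedy argmax, otherwise the next token is sampled:

```python
import numpy as np

def generate(model, tokens, stop_id, max_new=256, temperature=1.0, rng=None):
    # `model` is a stand-in: any callable mapping a token sequence to a vector
    # of next-token logits over the vocabulary
    rng = rng or np.random.default_rng()
    for _ in range(max_new):
        logits = model(tokens)
        if temperature == 0.0:
            nxt = int(np.argmax(logits))              # greedy: same context, same token
        else:
            z = logits / temperature
            p = np.exp(z - z.max())
            p /= p.sum()
            nxt = int(rng.choice(len(p), p=p))        # sampled: can differ run to run
        tokens = tokens + [nxt]                       # feed the whole thing back in
        if nxt == stop_id:
            break
    return tokens
```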
Sure, and human brains aren’t databases either, but it’s sometimes reasonable to say that we “store” and “retrieve” knowledge. All models are wrong but some are useful.
The question I’m asking is, how is this working in an LLM? How exactly do their weights encode (seemingly) the entire bible such that they can recreate long passages verbatim from a prompt that likely doesn’t appear anywhere in the training data (e.g. some vague description of a particular passage).
It should have a zero chance of generating different output if the temperature is set to zero as in TFA. LLMs are not stochastic algorithms unless you add entropy yourself. Of course most people just use ChatGPT with its default settings and know nothing about the specifics.
The point is, though – somehow the model has memorized these passages, in a way that allows reliable reproduction. No doubt in a super amorphous and diffuse way, as minute adjustments to the nth sigbits of myriads of floating-point numbers, but it cannot be denied that it absolutely has encoded the strings in some manner. Or otherwise you have to accept that humans can't memorize things either. Indeed given how much our memory works by association, and how it's considerably more difficult to recount some memorized sequence from an arbitrary starting point, it's easy to argue that in some relevant way human brains are next-token predictors too.
The model has taken the input passages from its training data and tokenised them into weights. Don't humanise it by saying it has "remembered" anything. It does not and cannot remember sequences.
Yes, if you reduce temperature to zero and set the same random seed, you should get the same output tokens for a given set of input tokens.
However, there is no guarantee the output for a given seed will be the correct expected output.
For example, there logically must be a model and seed where providing the lord's prayer as input for completion produces a Metallica song as output, because that's a viable set of input tokens: https://genius.com/Metallica-enter-sandman-lyrics
That seed is no more or less valid than any other seed which completes the actual lord's prayer or which provides something completely different. All those seeds are just predicting their next token.
If people want that sort of exact reliable retrieval of sequences, and for the sequences to be "correct", then an LLM is the wrong tool for the job.
I imagine Bible passages, at least the more widely quoted and discussed ones, appear many, many times in the various available translations, in inspirational, devotional, and scholarly articles, in sermon transcripts, etc. This surely reinforces almost word-for-word recall. SEO garbage is a bit different each time, so common SEO-reinforced themes might be recalled in LLM output, but not word for word.