Hacker Newsnew | past | comments | ask | show | jobs | submit | krackers's commentslogin

I think you typo'd, "sixteen (n^2) different combinations of points" should be 2^n instead.

>A 180° arc cannot straddle a 180° gap

This can be the case if the other point is exactly 180 degrees from the anchor point that works though? But I think this case occurs with probability 0 ("almost never") so can basically be ignored.

Also the "obvious (wrong) answer" that reasons "by symmetry" is interesting since the n=3 case _can_ actually be solved in this manner. It's well known that probability the center of the circle is contained in the triangle of 3 randomly chosen points can be computed is 1/4. (This can be found by placing 1 point arbitrarily then computing an integral that ranges over the arc length to the second point). For the n=4 case this can be done via a double integral considering two arcs but it's slightly more annoying.

I think there was a 3b1b problem about this that presents the more elegant approach the other commenter mentions.


>can't meaningfully see and interact with the page like the end user will

Isn't this a great use case for LLM tests? Have a "computer use agent" and then describe the parameters of the test as "load the page, then navigate to bar, expect foo to happen". You don't need the LLM to generate a test using puppeteer or whatever which is coupled to the specific dom, you just describe what should happen.


> computer use agent

they arent good enough yet at all.

i got an agent to use the windows UIA with success for a feedback loop, and it got the code from not wroking very well to basically done overnight, but without the mcp having good feedback and tagged/id-ed buttons and so on, the computer use was just garbage


Depends. Does it represent end users well enough? Does it hit the same edge cases as a million users would (especially considering poor variety of heavily post-trained models)? Does it generalize?

What features do you think it needs, that wouldn't spoil the "elegance" of the language? I think one good feature would be higher order messaging, in fact there's already a PL paper discussing how it looks like in objective-c [1] which would add FP-like filter/map elegance that all modern languages have. This would go nicely with simpler "JS lambda" style block syntax to make functional-style programming easier in objc.

[1] https://dl.acm.org/doi/epdf/10.1145/1146841.1146844


No, this statement is not true for anything except a base model. Benchmaxxing during RL phase is how you get the advertisement style "punchy" writing, because even though people don't usually write that way it is eye catching and people will vote for the bullet-point emdash slop. I wonder if some lab will be bold enough to do "anti rlhf", lmarena score be damned.

This already happens, user vs system prompts are delimited in this manner, and most good frontends will treat any user input as "needing to be escaped" so you can never "prompt inject" your way into emitting a system role token.

The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.


>The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.

My suspicion is that this failure happens for the same reason why I think the metadata would help with nesting. To take an electronic metaphor, special tokens are edge triggered signals, the metadata approach is signaled by level.

Special tokens are effively an edge but Internally, a transformer must turn the edge into level that propagates along with the context. You can attack this because it can decide by context that the level has been turned off.

You can see this happen in attacks that pre-seed responses with a few tokens accepting the prompt to override refusals. The refusal signal seems to last very few tokens before simply completing the text of the refusal because that's what it has started saying.

There's a paper showing how quickly the signal drops away, but I forget what it is called.


This is very interesting since there is another notable paper which shows LLMs can recognize and generate CFGs

https://arxiv.org/abs/2305.13673

and of course a^n b^n is also classic CFG, so it's not clear why one paper had positive results while the other hand negative.


Dyck grammar (balanced brackets) are not an a^nb^n, there are several kinds of brackets.

I cannot find probability of success in paper you linked. Is it 100%? I believe it is less than 100%, because LLMs are intrinsically probabilistic machines.


Figure 12 shows probabilities I think, it actually does seem to be 100% at temperature 0.1 for certain pretraining runs.

  > it actually does seem to be 100%
For all Dyck grammar sequences, infinitely many of them? ;)

Well they used strings of < 800 chars, you probably run into context window and training limits at some point (they mention some result that you need at least something of GPT-2 size to begin recognizing more intricate CFGs (their synthetic cfg3f). But then again your physical real-world computer which is conceptually "turing complete" can't handle "infinite strings" either.

> Dyck/balanced-brackets grammar

Yes, it's not the Dyck grammar but another CFG they created, they call it the "cfg3" family.

Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.

(Then again I know you didn't make any such broad refutation of that sort, I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with low enough error ratio that they must have internalized the underlying grammar [and in fact I believe the paper goes on to apply interprability methods to actually trace the circuits with which it encodes the inductive grammar, which puts to rest any notion of them simply "parroting" the data]). But these were "synthetic" LLMs specifically trained for that grammar, these results probably don't apply in practice to your chatGPT that was trained mostly on human text.


  > but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
They recognize and/or generate finite (<800 chars) grammars in that paper.

Usually, sizes of files on a typical Unix workstation follow two-mode log-normal distribution (sum of two log-normal distributions), with heavy tails due to log-normality [1]. Authors of the paper did not attempt to model that distribution.

[1] This was true for my home directories for several years.


And this Figure 12 is not about Dyck/balanced-brackets grammar. This figure is about something not properly described in the paper.

All system prompts are already wrapped in specific role markers (each LLM has its own unique format), so I'm sure every lab is familiar with the concept of delimters, in-band vs out-of-band signalling and such.

It'd not clear why within any section XML markers would do better than something like markdown, other than claude being explicitly post-trained with XML prompts as opposed to markdown. One hypothesis could be that since a large portion of the training corpus is websites, XML is more natural to use since it's "learned" the structure of XML better than markdown. Another could be that explicit start/end tags make identifying matching delimiters easier than JSON (which requires counting matching brackets) or markdown (where the end of a section is implicitly defined by the presence of a new header element).


Perhaps named closing tags like `</section>` are a factor?

Unless I've misunderstood the math myself, I don't think GPs comment is quite right if taken literally since "predict the next 2 tokens" would literally mean predict index t+1, t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.

Instead what GP likely means is the observation that the joint probability of a token sequence can be broken down autogressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground-truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.

Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer, but _all_ of the final-layer residuals for previous tokens will encode predictions for their following index.

So attention on a block of text doesn't give you just the "next token prediction" but the simultaneous predictions for each prefix which makes training quite nice. You can just dump in a bunch of text and it's like you trained for the "next token" objective on all its prefixes. (This is convenient for training, but wasted work for inference which is what leads to KV caching).

Many people also know by now that attention is "quadratic" in nature (hidden state of token i attends to states of tokens 1...i-1), but they don't fully grasp the implication that even though this means for forward inference you only predict the "next token", for backward training this means that error for token i can backpropagate to tokens 1...i-1. This is despite the causal masking, since token 1 doesn't attend to token i directly but the hidden state of token 1 is involved in the computation of the residual stream for token i.

When it comes to the statement

>its not unreasonable to say llms are trained to predict the next book instead of single token.

You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of ground truth sequence, but this is not the same as maximizing the probability the the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens is not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling so you provide rewards at the "sampled sequence" level which mirrors how you do inference.

It would be right to say that they're trained to ensure the most likely next book is assigned the highest joint probability (not just the most likely next token is assigned highest probability).


The idea i tried to express was purely the loss function thing you mentioned, and how both tasks (1 vs 2 vs n) lead to identical training runs. At least with nanogpt. I dont know if that extrapolates well to current llm internals and current training.

>That’s not evidence the task was easy. That’s evidence it was so hard...

Are humans starting to adopt LLM patterns or was this was ironically written with an LLM?

That said, I'm surprised you didn't bring up Marx in your essay in the later sections. I vaguely remember he had some thoughts about derivation of value from labor vs "ideas/capital". Whether or not you agree, this debate is reminiscent of that just moved up one level to white-collar workers.


LLMs adopted human writing patterns, not the other way around.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: