Hacker News | D-Machine's comments

This was sort of my reading as well: I took "clumping" to mean "bump-shaped".

Right, I think (hope) the OP meant to emphasize not the "search" in the sentence, but the "reputable source". Of course a Google search is now much worse than an AI search.

And it is ultimately the reputable source that matters, and whether the person actually read it and checked that the details matched the summary (be it a human-written abstract, LLM-generated, or otherwise).


Yup. And in general more heavy-tailed bumps are in fact better models (assuming normality tends to lead to over-confidence). I really think the universality is strictly mathematical, and actually rare in nature.

No, but when you get into the nitty-gritty of most things sometimes being influenced by extremely rare events, and the fact that the convergence rate of the central limit theorem is not at all universal, then much of the utility (and apparent universality) of the CLT starts to evaporate.

In practice, when modeling, you are almost always better off not assuming normality, and you want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods, or Bayesian methods that don't assume Gaussian priors, are almost always better models. But this of course calls into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).
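To make the convergence-rate point concrete, here is a small pure-Python sketch (the distributions, sample size, and seeds are my own illustrative choices, not from the comment): at n = 30, sample means of a light-tailed exponential are already nearly symmetric, while sample means of a heavy-tailed Pareto are still strongly skewed, i.e. far from normal.

```python
import random
import statistics

def skewness(xs):
    # simple moment-based sample skewness
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

def sample_means(draw, n=30, reps=10000, seed=0):
    # simulate the sampling distribution of the mean of n i.i.d. draws
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n))
            for _ in range(reps)]

# light-tailed: exponential(1); heavy-tailed: Pareto with alpha = 3.5,
# which still has a finite third moment but a much fatter right tail
expo_means = sample_means(lambda r: r.expovariate(1.0))
pareto_means = sample_means(lambda r: r.paretovariate(3.5))

s_expo = skewness(expo_means)
s_pareto = skewness(pareto_means)
print(s_expo, s_pareto)  # Pareto means are far more skewed at n = 30
```

Both cases are covered by the CLT "in the limit", but at the same finite n one is a decent normal approximation and the other is not, which is exactly the practical gap.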


This is also right, I believe: normal distributions are not really ubiquitous, just approximately ubiquitous (and only if you ignore rare outliers, and also close your eyes to all the things we don't actually understand at all).

The point on convergence rates re: the central limit theorem is also a major point that otherwise clever people tend to miss, and it comes up in a lot of modeling contexts. Many things that make sense "in the limit" likely make no sense in real-world practical contexts, because the divergence from the infinite limit at real-world sizes is often huge.

EDIT: Also, from a modeling standpoint, say e.g. Bayesian, I often care about finding something like the "range" of possible results for (1) a near-uniform prior, (2) a couple of skewed distributions, with the tail in either direction (e.g. some beta distributions), and (3) a symmetric heavy-tailed distribution (e.g. Cauchy). If you have these, anything assuming normality is usually going to be "within" the range of these assumptions, and so is generally not anything I would care about.

Basically, in practical contexts, you care about tails, so assuming they don't meaningfully exist is a non-starter. Looking at non-robust stats of any kind today, without also checking some robust models or stats, just strikes me as crazy.
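A minimal sketch of the prior-range idea above, using a hypothetical toy example of my own (a coin's bias after 8 heads in 10 flips) and a simple grid approximation; the specific priors chosen are assumptions for illustration:

```python
# grid approximation of the posterior mean of a coin's bias under
# several priors: near-uniform, skewed betas, and a heavy-tailed
# (truncated) Cauchy, mirroring the "range of priors" idea above
N = 2001
grid = [i / (N - 1) for i in range(N)]
heads, flips = 8, 10

def likelihood(p):
    # binomial likelihood up to a constant
    return p ** heads * (1 - p) ** (flips - heads)

def posterior_mean(prior):
    w = [prior(p) * likelihood(p) for p in grid]
    z = sum(w)
    return sum(p * wi for p, wi in zip(grid, w)) / z

means = {
    "uniform":          posterior_mean(lambda p: 1.0),
    "beta(2,5)":        posterior_mean(lambda p: p * (1 - p) ** 4),
    "beta(5,2)":        posterior_mean(lambda p: p ** 4 * (1 - p)),
    "cauchy(0.5,0.1)":  posterior_mean(
        lambda p: 1.0 / (1.0 + ((p - 0.5) / 0.1) ** 2)),
}
for name, m in means.items():
    print(name, round(m, 3))
```

The uniform case reproduces the conjugate Beta(9, 3) answer (mean 0.75), and the skewed priors bracket it from either side; a normal prior's answer would typically land inside that bracket, which is the point made above.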


Came here basically looking for this explanation. The normal distribution is [approximately] common when summing lots of things we don't understand; otherwise, it isn't really.

These are some really great explicit examples and links, much appreciated.

I'd tend to agree, the only good points I've seen were made by @hedgehog [1] here in this thread:

    I'm not sure about the rest but a significant problem with high frequency tool calling (especially in training) is that it breaks batching.
and then later by @ACCount37 [2]:

    I'm less interested in turning programs into transformers and more interested in turning programs into subnetworks within large language models.
In theory, if you can create a very efficient sub-net to replicate certain tool calls (even if the weights are frozen during any training steps, and manually compiled), this might help with making inference much more efficient at scale. No idea why in general you would want to do this through the clunky transformer architecture though. Just implement a non-trainable, GPU-accelerated layer to do the compute and avoid the tool-call.

[1] https://news.ycombinator.com/item?id=47367986

[2] https://news.ycombinator.com/item?id=47363909


Except their process isn't actually differentiable, as they admit near the end of the post; they just sort of hand-wavily suggest that approximately differentiable methods "should" work. There is also no mention at all of what the training data would be, where it would come from, or how a loss function could be constructed to continuously score "partially correct" programs (or what that would even mean, or whether that idea is even coherent).

What was a good point, mentioned by @hedgehog in this thread (https://news.ycombinator.com/item?id=47367986), is that tool-calls break batching a lot, so there could be huge efficiency gains at scale if you can just pass through a computation sub-network (even if that sub-net is frozen and can't be updated, and is programmed in manually rather than trained in).

Why on Earth you'd want that sub-net to be a clunky transformer rather than just an efficient, GPU-accelerated custom non-trainable layer, though, is unclear to me.
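A conceptual pure-Python sketch of that alternative (all names and shapes here are hypothetical, my own construction): a fixed, parameter-free computation embedded in the forward pass, standing in for a compiled, non-trainable GPU layer, sandwiched between trainable layers so no tool call is ever needed for it.

```python
def trainable_linear(x, w, b):
    # an ordinary parameterized layer (these weights would be trained)
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def frozen_compute(x):
    # hand-written, parameter-free computation done in-network instead
    # of via a tool call; training never touches it (here: exact squaring,
    # a stand-in for whatever compiled kernel you'd actually want)
    return [xi * xi for xi in x]

w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # trainable
w2, b2 = [[0.5, 0.5]], [0.0]                    # trainable

h = trainable_linear([3.0, 4.0], w1, b1)  # -> [3.0, 4.0]
h = frozen_compute(h)                      # -> [9.0, 16.0], exact, frozen
y = trainable_linear(h, w2, b2)
print(y)  # [12.5]
```

The point of the sketch is only the structure: the middle step is exact and has nothing to learn, so there is no obvious reason it needs to be expressed as transformer weights rather than a custom layer.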


If you read the section "Richer attention mechanisms", you can see that, no, the mechanism is not generally usable (it requires significant modification to become differentiable). They later speculate:

    While we do not yet know whether exact softmax attention
    can be maintained with the same efficiency, it is easy to
    approximate it with k-sparse softmax attention: retrieve
    the top-k keys and perform the softmax only over those
but if you have played around with training models that use top-k or other hard thresholding operations in e.g. PyTorch (or just think about how many gradients become zero under such an operation), you know that these tend to work only in extremely limited / specific cases, and make training even more finicky than it already is.
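A tiny pure-Python illustration of why hard top-k starves gradients (my own construction, not from the article): a finite-difference check shows the "gradient" of a top-k sum is exactly zero for every entry outside the top k, so no learning signal ever reaches those entries.

```python
def topk_sum(x, k=3):
    # hard top-k: keep only the k largest entries, discard the rest
    return sum(sorted(x, reverse=True)[:k])

def finite_diff_grad(f, x, eps=1e-6):
    # numerical gradient, one coordinate at a time
    base = f(x)
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        g.append((f(xp) - base) / eps)
    return g

x = [5.0, 4.0, 3.0, 2.0, 1.0]
g = finite_diff_grad(topk_sum, x)
print(g)  # ~1.0 for the top-3 entries, exactly 0.0 for the rest
```

In an autodiff framework the picture is the same: the selected entries get gradient 1, everything else gets exactly 0, which is the "gradient destruction" referred to above.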

I saw that, but the image included nearby made it look like it might be plausible to replace the 1D line around their points with a pretty narrow 2D area. This could still be a somewhat effective filter, right?

The problem is they are talking about tricks for compiling VMs into transformer weights, which is basically unrelated to actually training transformers on data via gradient descent. Once you get into this messy practical reality, you have non-trivial options like sparsemax and the Gumbel-Softmax trick, which get you some desirable improvements to things like the softmax without all the gradient destruction of top-k approaches, but usually at pretty serious other costs. (Most approaches using Gumbel-Softmax that I have read essentially create a bi-level optimization problem that is claimed to be "solved" by some hand-wavey annealing, but which is clearly highly unstable and hard to tune. I don't know if things have improved here since I last read about it.)

So the issue isn't whether there are ways to effectively approximate their approach from a strictly numerical standpoint; it is that other factors matter much more in optimization when training on actual data.
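For reference, the Gumbel-Softmax trick mentioned above is easy to sketch in pure Python (the logits and temperature here are arbitrary illustrative values): add Gumbel noise to the logits and take a temperature-scaled softmax, giving a "soft" discrete sample that stays differentiable, unlike a hard top-k or argmax.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    # sample Gumbel(0, 1) noise via inverse transform, add it to the
    # logits, then take a temperature-scaled softmax; as tau -> 0 the
    # output approaches a one-hot sample, but the map stays smooth
    # (so autodiff gradients are nonzero everywhere)
    noise = [-math.log(-math.log(rng.random())) for _ in logits]
    z = [(l + g) / tau for l, g in zip(logits, noise)]
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = gumbel_softmax([2.0, 1.0, 0.1], tau=0.5, rng=random.Random(0))
print(probs)  # a valid probability vector, peaked but never exactly one-hot
```

The instability complained about above comes from the annealing schedule on tau: low tau gives near-discrete samples but high-variance gradients, high tau gives smooth gradients but very "un-discrete" samples, and tuning the path between them is the hard part.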

