You're treating each list as unique, but all the lists have a distribution of digits in common... I'm at a loss to even understand what you're saying here, really -- this is why you need to actually state, formally, what you think the "LLMs are just stats" hypothesis amounts to.
It seems you think it amounts to saying LLMs sample from a combinatorial space, naively construed -- but that isn't the claim?
The claim is rather, they sample from a statistical distribution of tokens.
Take each position in the input vector, 1...127. The model needs to "learn":
P(x_1 | y, x_2...x_127), P(x_2 | y, x_3...x_127), P(x_3 | y, x_4...x_127), etc.
That is a family of 127 conditional distributions, and it seems trivial to learn.
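To spell out what I mean (a toy sketch in Python; I'm assuming y is the unsorted input list, which wasn't stated above): once you condition on y, each of those distributions collapses to a point mass on the i-th smallest element -- the extra conditioning on the later positions adds nothing -- which is why the family looks so easy to learn.

    # Toy illustration only: for sorting, the conditional at each output
    # position, given the unsorted input y, is a point mass on the
    # i-th order statistic of y.
    def conditional(y, i, alphabet=range(10)):
        """Return P(x_i = v | y) for each value v in the alphabet."""
        target = sorted(y)[i]
        return {v: (1.0 if v == target else 0.0) for v in alphabet}

    print(conditional([3, 1, 4, 1, 5], i=0))  # all mass on 1
    print(conditional([3, 1, 4, 1, 5], i=4))  # all mass on 5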
I really don't know why you think the size of a combinatorial space is relevant here?
All the sorted lists share basically the same tiny family of conditional distributions, { P(x_i | x_{i+1}...x_127) }.
I agree a neural network can certainly learn the conditional distributions that let it make that choice correctly every time. Once it has done so, then do you not have a sorting algorithm?
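Concretely, that's all "having a sorting algorithm" amounts to here: take the argmax of each learned conditional in turn. A minimal sketch, reusing the toy conditional() above and assuming the network has matched those conditionals exactly:

    # Greedy decoding from the learned conditionals.  If the model matches
    # the true conditionals, the argmax at each position reproduces
    # sorted(y) -- the learned distribution just *is* a sorting procedure.
    def decode(y, alphabet=range(10)):
        out = []
        for i in range(len(y)):
            dist = conditional(y, i, alphabet)   # stand-in for a learned P(x_i | y, ...)
            out.append(max(dist, key=dist.get))  # argmax over the alphabet
        return out

    assert decode([9, 2, 7, 2, 0]) == [0, 2, 2, 7, 9]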
So this is what I thought you would say, and it's the origin of the issues here: to say that LLMs are "statistical parrots" is just to say they learn conditional distributions of text tokens.
So you aren't replying to the "only stats" claim: that is the claim!
The issue is that language-use isn't a matter of distributions of text tokens: when I say, "the sky is clear today!", it is caused by there being a blue sky. When I then say, "therefore I'd like to go out!", it is caused by my preferences, etc.
So if we had a generative causal model of language it would be something like this: Agent + Environment + Representations ---SymbolicTranslation---> Language.
All LLMs do is model the data generated by this process; they don't model the process itself (i.e., the agents, environments, representations, etc.).
An LLM says "it is a nice day" only because those tokens match some statistical distribution over historical texts, not because it has judged the day to be nice.
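To make the contrast vivid, here's a toy version of that causal picture (everything in it is invented for illustration): the utterance is produced by a world state plus the agent's judgement of it, while the token-level model only ever reproduces the marginal frequencies of the utterances.

    import random

    # Toy causal model: Environment -> Agent's judgement -> Language.
    def causal_speaker():
        sky = random.choice(["clear", "overcast"])   # environment
        judged_nice = (sky == "clear")               # the agent's representation/judgement
        return "it is a nice day" if judged_nice else "the weather is grim"

    # Token-level model: fit only to the distribution of past utterances,
    # with no access to the sky or to any judgement of it.
    corpus = [causal_speaker() for _ in range(10_000)]

    def token_model():
        return random.choice(corpus)   # reproduces the marginal distribution of utterances

    # At the level of text statistics the two are indistinguishable; causally,
    # only the first ever says "it is a nice day" *because* the sky is clear.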
To model language is not to provide an indistinguishable language-like distribution of text tokens, but rather for an agent to use language to express ideas caused by its internal states + the world.
In the case of sorting numbers, the tokens themselves have the property (i.e., mathematical properties such as ranking are properties of the ranked tokens themselves). So learning the distribution is learning the property of interest.
This is why no paper that demonstrates NNs "have representations", etc., by appealing to formal properties the data itself has is even relevant to the discussion. Yet all this "world model, algorithm, blah blah" talk about NNs is only ever demonstrated on data whose "unsupervised model" constitutes the property of interest.
Statistical models of the distributions of tokens are not models of the data generating process which produces those tokens (unless that process is just the distribution of those tokens). This is obvious from the outset.
This is why 10^80 random lists get reduced to only 10^36 sorted lists. However, 10^36 is still very large with respect to the size of the model.
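(For what it's worth, the collapse itself is just multiset counting: there are k^n arbitrary length-n lists over a k-symbol alphabet, but only C(n+k-1, n) sorted, i.e. non-decreasing, ones. The particular 10^80 / 10^36 figures depend on whatever n and k are being assumed; the parameters in this sketch are purely illustrative.)

    from math import comb

    def count_lists(n, k):
        """All length-n lists over a k-symbol alphabet vs. the sorted (non-decreasing) ones."""
        return k ** n, comb(n + k - 1, n)

    print(count_lists(10, 10))   # (10000000000, 92378): 10^10 lists, only ~9e4 sorted ones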