I think responses like this are a good reminder that technology is most leveraged when it's useful to both technical and nontechnical people. For the first time in the 70+ year history of software, the cost profile of "integration software" is approaching cents per app. A child can magic up an NES emulator from the ether; semi- and non-technical users can instantly create something personalized to their use, which is useful to their specific team and their specific company.
I'm not going to be the one to shut them out of this change. We will work to help both of these audiences.
Moment is distinctive because (1) it's natively programmable, and (2) it has native, high-performance, live collaborative editing.
For (1), programmability is why our templates[1] are generally so rich. See, e.g., our NES emulator[2] or our SQLite Explorer[3], both of which would be vastly harder to accomplish in Notion. It's even much harder in Obsidian, which is Markdown-first! Both templates took ~30 minutes of work with `claude`, and doing something similar with their respective extension APIs would have taken orders of magnitude more time, especially to publish.
For (2), actually-working live collaborative editing is pretty hard to come by on Markdown-based docs editors. If you don't think you need this, the offering will be less compelling. My opinion is that many people who think they don't need this in a team setting end up being surprised how big a barrier this is when they try to use Obsidian as the central knowledge base for a team setting. Notion is extremely buggy and sometimes very slow, but in almost all cases I've seen, that ends up being worth the trade-off of not having to write code or get everyone to use the exact same extension set. Just my 2 cents though.
You can make a NES emulator, that's neat! I'm probably not your target audience it seems. I don't know why a large amount of programmability is beneficial for a knowledge document system. Then again, I don't generally like my notes to be much fancier than text when possible so I can read it on any device or platform.
Author here, speaking only for myself. Sometimes you're right, and I do want to create a new Next.js app, deploy it to our internal Kubernetes cluster behind our corporate intranet with all the credentials and so on.
And sometimes I don't! Sometimes I just want to add a graph of customer churn from a ClickHouse query directly into a PRD, and just say who has access to run it, so the proxy can enforce it. Or I want to put a release button into the document that documents the release process. Sometimes I want a collaboratively-editable document that just happens to be enriched with a little bit of personalized UI.
Not 100% sure I understand, but, if you opt into sharing a doc, we do spin up a collaboration server on your behalf, and editors do not have to set anything up to use it. The bulk of the work we've had to do is to make this seamless and good.
For the other points: yes, we aspire to do all of those things. :)
Our approach, which is behind a feature flag right now, is to allow users to attach "assets" to jj change IDs (these are sort of like stable git commit SHAs). This is how we will power inline comments, and it's how we'll power Notion-style SQLite-based databases. We already have, checked in, an implementation of IVM built on top of SQLite, specifically for this purpose. I don't see how we could be a Notion competitor without them.
> Unfortunately, when it sold out to Microsoft, the clock started ticking. “Please just give me 5 years before everything goes to shit,” I thought to myself. And here we are, 7 years later, living on borrowed time.
Man, sometimes I feel like I live on a different planet. I have been using GitHub since 2010 and—while I really wish I had a nicer way of putting this—I cannot remember a time when all of the flagship products were not uniformly either worst-in-class or close to it. Code review/PRs, issues, code search, CI, a real enterprise offering, and now AI features: all of these offerings had gaps serious enough to instigate real, threatening upstarts, and some of those upstarts were themselves big enough to become public companies. Seriously. A viable path to IPO from 2013 to (say) 2019 was literally "make a version of a GitHub feature that simply does not suck."
I loved GitHub in 2010. I also remember those years, 2013 to 2019, being essentially totally lost, with no meaningful product movement at all. Am I truly alone in this? What is Andrew talking about here?
I'm not going to defend the Microsoft acquisition, but at least—excruciatingly slowly—things like code review and issues are finally starting to receive features. It's crazy to say it out loud but that is what I see.
I just can't help but think the product "enshittification" narrative here is an ex post justification of the author's own feelings.
I like this but based on what I am seeing here and the THRML readme, I would describe this as "an ML stack that is fully prepared for the Bayesian revolution of 2003-2015." A kind of AI equivalent of, like, post-9/11 airport security. I mean this in a value-neutral way, as personally I think that era of models was very beautiful.
The core idea of THRML, as I understand it, is to present a nice programming interface to hardware where coin-flipping is vanishingly cheap. This is moderately useful to deep learning, but the artisanally hand-crafted models of the mid-2000s did essentially nothing at all except flip coins, and it would have been enormously helpful to have something like this in the wild at that time.
The core "trick" of the era was to make certain very useful but intractable distributions built on something called "infinitely exchangeable sequences" merely almost intractable. The trick, roughly, was that conditioning on some measure space makes those sequences plain-old iid, which (via a small amount of graduate-level math) implies that a collection of "outcomes" can be thought of as a random sample of the underlying distribution. And that, in turn, meant that the model training regimens of the time did a lot of sampling, or coin-flipping, as we have said here.
Peruse the THRML README[1] and you'll see the who's who of techniques and modeling procedures of the time: "Gibbs sampling", "probabilistic graphical models", "energy-based models", and so on. All of these are weaponized coin flipping.
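To make the "coin flipping" concrete, here is a minimal sketch of a Gibbs sampler for a tiny Ising chain (a textbook energy-based model). This is plain Python and not the THRML API; the function name `gibbs_ising_chain`, its parameters, and the chain topology are all assumptions chosen for illustration. The only point is that each update is a single biased coin flip whose bias depends on the neighbors.

```python
import math
import random


def gibbs_ising_chain(n_spins, coupling, sweeps, rng, init=None):
    """Gibbs-sample a 1D Ising chain with energy E = -coupling * sum(s_i * s_{i+1}).

    Hypothetical helper for illustration only; not part of any library.
    Spins take values in {-1, +1}. Returns the final spin configuration.
    """
    spins = list(init) if init is not None else [rng.choice((-1, 1)) for _ in range(n_spins)]
    for _ in range(sweeps):
        for i in range(n_spins):
            # Sum of the neighboring spins (chain ends have a single neighbor).
            field = 0
            if i > 0:
                field += spins[i - 1]
            if i < n_spins - 1:
                field += spins[i + 1]
            # Conditional probability that spin i is +1 given its neighbors.
            p_up = 1.0 / (1.0 + math.exp(-2.0 * coupling * field))
            # The entire update is one biased coin flip.
            spins[i] = 1 if rng.random() < p_up else -1
    return spins


sample = gibbs_ising_chain(n_spins=16, coupling=0.5, sweeps=100, rng=random.Random(0))
```

Every line of "inference" here is bookkeeping around `rng.random() < p_up`, which is why hardware that makes that one comparison nearly free is interesting for this family of models.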
I imagine the terminus of this school of thought is basically a natively-probabilistic programming environment. Garden variety deterministic computing is essentially probabilistic computing where every statement returns a value with probability 1. So in that sense, probabilistic computing is a full generalization of deterministic computing, since an `if` might return a value with some probability other than 1. There was an entire genre of languages like this, e.g., Church. And now, 22 years later, we have our own hardware for it. (Incidentally this line of inquiry is also how we know that conditional joint distributions are Turing complete.)
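As a rough illustration of the idea (a sketch in Python rather than Church, with hypothetical helper names `flip`, `two_coins`, and `rejection_query`): a probabilistic program is ordinary code whose branches draw from a coin, and conditioning can be done, inefficiently, by throwing away the runs that violate the condition.

```python
import random


def flip(rng, p=0.5):
    """A probabilistic primitive: True with probability p."""
    return rng.random() < p


def two_coins(rng):
    """A tiny generative model: two independent fair coin flips."""
    return flip(rng), flip(rng)


def rejection_query(model, condition, query, n_samples, rng):
    """Estimate P(query | condition) by running the model and discarding
    every run in which the condition does not hold (rejection sampling)."""
    accepted = []
    for _ in range(n_samples):
        outcome = model(rng)
        if condition(outcome):
            accepted.append(query(outcome))
    return sum(accepted) / len(accepted)


rng = random.Random(0)
# P(both heads | at least one head); the exact answer is 1/3.
estimate = rejection_query(
    two_coins,
    condition=lambda s: s[0] or s[1],
    query=lambda s: s[0] and s[1],
    n_samples=20_000,
    rng=rng,
)
```

Note that `flip(rng, 1.0)` always returns `True`, which is the "deterministic computing is the probability-1 special case" observation in code form.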
Tragically, I think, this may have arrived too late. This is not nearly as helpful in the world of deep learning, with its large, ugly, and relatively sample-free models. Everyone hates to hear that you're cheering from the sidelines, but this time I really am. I think it's a great idea, just too late.
Really informative insight, thanks. I'm not too familiar with those models, is there any chance that this hardware could lead to a renaissance of sample-based methods? Given efficient hardware, would they scale to LLM size, and/or would they allow ML to answer some types of currently unanswerable questions?
Any time something costs trillionths of a cent to do, there is an enormous economic incentive to turn literally everything you can into that thing. Since the 50s “that thing” has been arithmetic, and as a result, we’ve just spent 70 years trying to turn everything from HR records to images into arithmetic.
Whether “that thing” is about to be sampling is not for me to say. The probability is certainly not 0 though.
I do actually really like this, but it is a little ironic that the website advocates for straightforward, unpretentious UI (e.g., one should make a button "look like button" not, say, a kitschy bird), but this idea is expressed not through plain-spoken words, but a kitschy caveperson gimmick.
I think this kind of undermines the point, and goes a long way to showing that sometimes the best way to communicate actually is in a way that is unique.
If I give you a biased coin can you simulate a truly random coin flip with it? The answer turns out to be yes. Flip the biased coin twice: HT = heads, TH = tails, and HH/TT = flip twice again.
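A minimal sketch of that procedure in Python, assuming the flips are independent and the bias does not change between flips. The names `von_neumann_bit` and `biased_coin` are made up for this example.

```python
import random


def von_neumann_bit(flip):
    """Turn a biased coin into a fair bit.

    `flip` is any zero-argument callable returning "H" or "T".
    Flip twice: HT -> 1 (heads), TH -> 0 (tails), HH/TT -> flip twice again.
    """
    while True:
        first, second = flip(), flip()
        if first != second:
            return 1 if first == "H" else 0


def biased_coin(p_heads, rng):
    """Build a coin that lands heads with probability p_heads."""
    return lambda: "H" if rng.random() < p_heads else "T"


rng = random.Random(0)
coin = biased_coin(0.9, rng)  # a heavily biased coin
bits = [von_neumann_bit(coin) for _ in range(10_000)]
fraction_heads = sum(bits) / len(bits)  # should land close to 0.5
```

It works because HT and TH each occur with probability P(H)·P(T), so whichever one shows up first is equally likely to be either. Note the `while True`: nothing in the code bounds the number of rounds.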
The general study of such things is called “randomness extractors”. The new Gödel prize goes to this paper which shows how to extract a nearly perfect random bit out of two sources with low min-entropy.
Yes, but - you need to replace "twice" there with "an unbounded number of times". If you apply this in an environment where the biased coin is coming from an external source, your system becomes susceptible to DoS attacks.
While I obviously think randomness extractors over adversarial sources are very interesting, I think talking about them specifically in this example complicates the point I'm trying to make, which is that it's incredible it can be done at all.
Note that adversarial is kind of a red herring, not sure why they mentioned that. The number of flips is unbounded regardless. Which is why it's not really incredible that it can be done: it can't, not as the problem was originally stated. What can be done is solving a different (but useful) problem than the one originally posed.
I realize this sounds like a minor detail to someone who finds this cool (and so do I), but I don't think it is. It's kind of frustrating to be told that your intuition is wrong by someone smarter than you, when your intuition is actually correct and the experts are moving the goalposts. IMO, it makes people lose respect for experts/authority.
So, the problem in its original framing is: can we simulate a fair coin flip with an unfair coin? As stated, I do actually think the von Neumann answer ("this is actually technically possible") is fair, in that if I wanted a solution in O(1), I think I should have to say so ahead of time.
I suppose we'll have to disagree about whether this is incredible. The response shows that (1) this can be done at all, and (2) that the answer is exponentially likely as time goes on, not asymptotically, but for finite n. Incredible! You don't see finite-decay bounds very often! If you don't think that's incredible I invite you to ask a room full of people, even with the qualifications you deem appropriate, e.g., "solution does not need to be constant-time", or whatever.
There is no N such that your algorithm is guaranteed to terminate before N flips — even for a single bit. The complexity class in the worst case (100% T) is infinite; it’s only the average cases that have something reasonable.
To borrow your phrasing:
If you only have a stochastic algorithm, I think you should have to say so.
Kind of, kind of not. I think it's fair to say a lot of theory people would argue that it is effectively guaranteed to terminate, for any reasonable definition of "guaranteed."
As I mention elsewhere, for a coin where P(H)=0.75, with 600 flips, the probability of not having received an answer is less than 1 divided by the number of atoms in the universe. While this is not 0, neither is the probability of the atoms of the coin lining up just right that it falls through the table, producing neither H nor T. Theoretically and practically we do generally consider these kinds of things immaterial to the problem at hand, perhaps papering over them with terms-of-art like "with high/extremely high/arbitrarily high probability". They all mean roughly the same thing: this algorithm is "essentially guaranteed" to terminate after n flips.
If I asked someone gambling, does it matter if this coin is biased 0.0001%, they will probably say no.
And just like that, the algorithm is now guaranteed to terminate.
It's a bit unfair to talk about the case (100% T), because in that case you don't have a source of randomness anymore: we've dropped the assumption that you have one, and as expected you can't make one from nothing.
A random source can produce the same value indefinitely, but it cannot produce the same value forever. It is impossible in the sense that the probability that it will happen is zero. That is, even the unbiased version of the algorithm will terminate with 100% probability.
> My point is that you have stochastic termination, but not guaranteed termination. Those are different things.
You missed their suggestion about that.
If you rephrase the algorithm as one that almost almost almost almost almost completely eliminates bias, you can guarantee termination by giving up after a certain number of repetitions.
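Here is a sketch of that trade-off computed exactly, under assumptions the comment does not spell out: flips are independent with a fixed P(H), and after `max_rounds` failed rounds the algorithm gives up and returns a fixed `fallback` side. `capped_output_distribution` is a hypothetical name.

```python
from fractions import Fraction


def capped_output_distribution(p_heads, max_rounds, fallback="H"):
    """Exact output distribution of the 'flip twice' trick when it gives up
    after max_rounds rounds and returns `fallback` instead.

    Returns (P(output is heads), P(gave up)).
    """
    p = Fraction(p_heads)
    q = 1 - p
    # A round yields no answer on HH or TT; rounds are independent.
    gave_up = (p * p + q * q) ** max_rounds
    # Given that some round answered, heads and tails are equally likely.
    p_out_heads = (1 - gave_up) / 2
    if fallback == "H":
        p_out_heads += gave_up
    return p_out_heads, gave_up


p_out, gave_up = capped_output_distribution(Fraction(3, 4), max_rounds=100)
residual_bias = p_out - Fraction(1, 2)  # equals gave_up / 2
```

So the capped version always terminates within `2 * max_rounds` flips, and the price is a residual bias of exactly half the give-up probability, which shrinks geometrically in `max_rounds` but never reaches zero.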
Sure — but now we’re back in the case the person above me was calling out and I was emphasizing by pointing out the stochastic nature: you moved the goal posts.
From their comment:
> Which is why it's not really incredible that it can be done: it can't, not as the problem was originally stated. What can be done is solving a different (but useful) problem than the one originally posed.
You can have an arbitrarily small bias, at the cost of increased runtime — but you can’t have a zero bias algorithm that always terminates. You have to move the goalposts (allowing some bias or exceedingly rare cases to not terminate).
I’m not sure why people have such a hard time admitting that.
I don't think the termination thing is moving the goalposts.
If you have infinite time, then you can wait for it to take more than a few hundred iterations.
If you don't have infinite time, it won't take more than a few hundred iterations.
Also "constant time" wasn't even part of the original promise! If a goalpost was moved, it was by the person that decided on that requirement after the fact.
> If you have infinite time, then you can wait for it to take more than a few hundred iterations.
To be clear, you're making this argument while also arguing "this wouldn't happen in the real world". [1] You can't have it both ways.
> Also "constant time" wasn't even part of the original promise!
We're not even asking for "constant time". We're literally only asking for "will finish in any bounded amount of time". Even an exponential time bound would've been better than unbounded!
A simulator that can't even promise to get to step #2 of the simulation in any bounded amount of time very much needs a giant proactive asterisk on its labeling, because that's not what people understand to be a simulator. Again, if you sold someone that in such a manner that obscured this fact, they would absolutely be justified in wanting their money back.
I addressed that in my next sentence! The whole point of making it two sentences was to split up the real world and not real world cases. Come on.
> We're not even asking for "constant time". We're literally only asking for "will finish in any bounded amount of time". Even an exponential time bound would've been better than this!
N is 1. Every bound is a constant bound.
> if you sold someone that in such a manner that obscured this fact, they would absolutely be justified in wanting their money back.
They would not be entitled to a penny back because it's impossible for them to hit the failure case.
Even if they could hit it, nobody is getting a refund for "it crashes once per googolplex runs".
> I addressed that in my next sentence! The whole point of making it two sentences was to split up the real world and not real world cases. Come on.
Your next sentence was just obviously flat-out wrong though. Not just because who says that's the case (maybe I don't have that much time?) but because it's trivial to find P(H) that makes it false for any duration of time. "A few hundred iterations" literally doesn't guarantee anything unless you make unstated assumptions about the biases your device works for.
I don't get why we're going in circles here. It seems we've hashed everything out.
> N is 1. Every bound is a constant bound.
Kind of a meaningless statement when you don't even say what your N is. I can imagine lots of N where that's not the case. But whatever you want to call it, my point entirely stands.
> They would not be entitled to a penny back because it's impossible for them to hit the failure case.
No, it is very possible. All they need to be given is a coin whose bias they don't know beforehand, whose bias is unfortunate. Or a coin whose bias they do know to be much worse than whatever you imagined a few hundred tosses would be enough for.
> because it's trivial to find P(H) that makes it false
As I already said in the comments you read, it's proportional to the bias. A few hundred multiplied by the ratio between heads and tails or vice versa will never be reached.
> when you don't even say what your N is
Number of outputs?
Look, if you want to invoke concepts like exponential time then you tell me what N is. Exponential of what?
> All they need to be given is a coin whose bias they don't know beforehand
See first answer.
If someone expects the bias to not matter, even a one in a million coin, they're the one with the problem, not the algorithm.
If they accept the bias affects speed, things are fine.
All infinite sequences are physically impossible, but they’re the basis of asymptotics; arbitrary failure and non-production of entropy are still physically possible.
Note that the solution dataflow objects to already operates in constant time on an average-case basis. There isn't room to make a complexity improvement.
What do you mean by "exponentially, not asymptotically, but for finite n"? Exponential is by definition asymptotic and continues infinitely, no?
And to be clear, I'm not disagreeing (or agreeing) with the result being inherently incredible. I'm just saying it's not an incredible example of simulating a fair coin, because it just... isn't doing that. As an analogy: communicating with the Voyager spacecraft might be incredible, but it's not an incredible example of infinitely fast communication... because it just isn't. Telling me to go ask a room full of people whether they find either of these incredible is missing the point.
> What do you mean by "exponentially, not asymptotically, but for finite n"? Exponential is by definition asymptotic and continues infinitely, no?
In statistics, we generally separate asymptotic bounds—which usually make guarantees only asymptotically—from finite-decay bounds like Chernoff, which decay exponentially not only in the limit, but also at each particular finite n. The second is much rarer and much more powerful.
This is an important distinction here: we are not talking about an asymptotic limit of coin flips. Each and every round of coin flips reduces the probability of not having an answer exponentially.
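For this particular algorithm the finite-n statement can be written down directly. This is a sketch assuming independent flips with fixed P(H) = p and q = 1 - p, where n counts rounds of two flips; it is the plain geometric tail rather than a Chernoff bound proper, but it has the same "holds at every finite n" character:

```latex
\Pr[\text{no answer after } n \text{ rounds}]
  = \left(p^2 + q^2\right)^n
  = \left(1 - 2pq\right)^n
  \le e^{-2pq\,n}
  \qquad \text{for every } n \ge 1,
\qquad
\mathbb{E}[\text{rounds until an answer}] = \frac{1}{2pq}.
```

The inequality uses only 1 - x ≤ e^{-x}, so there is no "for sufficiently large n" hiding anywhere.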
> I'm just saying it's not an incredible example of simulating a fair coin, because it just... isn't doing that. As an analogy: communicating with the Voyager spacecraft might be incredible, but it's not an incredible example of infinitely fast communication... because it just isn't.
Ok, no problem, it's up to you to decide if you want to use your own definition here.
But, FYI, in computability theory, it is definitely fair game and very common to "simulate" some computation with something that is either much more memory-intensive or compute-intensive, in either the deterministic or probabilistic case. For example, it's pretty common to add or remove memory and see whether the new "simulated" routine takes more or less compute power than the "standard" algorithm, and that is kind of what people are up to with the PSPACE stuff that is going on right now.
Using that lens (and this is entirely off the cuff, but after 10 seconds of thinking, I believe probably mostly correct), this algorithm "simulates" a fair coin toss by using 1 extra bit of memory and O(p) compute, where p is the reciprocal of |0.5-<coin's bias>|. You can choose p to be infinite but you can do that for a sorting algorithm too (and that is why both sorting and this algorithm have DoS implications). Is this the best we can do? Well, that's the problem we're studying with randomness extractors: given this imperfect source of randomness, can we extract perfect randomness out of it at all?
> In statistics, we generally separate asymptotic bounds—which usually make guarantees only asymptotically—from finite-decay bounds like Chernoff, which decay exponentially not only in the limit, but also at each particular finite n. The second is much rarer and much more powerful.
I don't know if you're using different terminology than I understand here, but I'm reading "exponential in the limit" to mean "eventually exponential", or in other words "there is some finite n_0 where for n > n_0 the decay becomes exponential"? In which case it basically means that the first n_0 tosses would have to be discarded? It's nice that that's not the case, I guess. Somehow I don't find myself blown away by this, but perhaps I'm misunderstanding what it means to be "exponential but only in the limit but not for finite n".
> in computability theory, it is definitely fair game and very common to "simulate" some computation with something that is either much more memory-intensive or compute-intensive, in either the deterministic or probabilistic case.
"Much" more is one thing, "arbitrarily more if you're unlucky" is another. I'm not an expert in computability theory by any means, but whenever I've heard of simulation in that context, it's been with a hard bound (whether constant, polynomial, or whatever). I've never heard it called simulation in a context where the simulator can take arbitrarily long to simulate step #2. Even if this is normal to those in the field, don't think the average person thinks of it this way -- I very much think most people would want their money back if they were sold a "simulator" that took arbitrarily long to get to step #2.
What is our goal here? I'm willing to continue this discussion but after answering these questions and seeing your responses elsewhere it does not seem like your understanding is improving (e.g., still misunderstanding that the algorithm does not take "arbitrary" time), so I'm not sure you're getting the value here... If you want to continue maybe we should try you asking specific questions instead.
I have the exact opposite reaction, that if someone told me the answer is "no" because it requires an unbounded number of coin flips that they were the ones trying to bullshit me. In antic's formulation, nothing is said about requiring a bounded number of flips.
"Simulate a truly random coin" implies it IMO. You're not simulating a truly random coin if you need unbounded time for a single flip. The truly random coin definitely doesn't need that. It just feels like a scam if someone sold me such a machine with that description - I'd want my money back. I don't expect everyone would feel the same, but I think a lot of people would.
I'm not sure we're on the same page about what this result practically means, so let me re-state it a few different ways, so that people can draw their own conclusions:
* The von Neumann approach will "appear to be O(1) (constant-time)" for any particular biased coin, but that constant might be big if the coin is very, VERY biased in one direction.
* How can this be true? Every flip reduces the probability you do not have an answer exponentially. The "concentration" around the mean is very sharp: e.g., at 275 coin tosses for P(H)=0.5 (the fair case), the probability of not having an answer is smaller than 1 divided by the number of atoms in the known universe. It is technically possible, but I think most people would say that it's "effectively constant time" in the sense that we'd expect the coin to phase-shift through the desk before flipping it 275 times in a row and not getting an answer. So it takes 275 flips, it's "constant" time! Interpret it how you like though.
* As you make the coin more and more biased, that "horizon" increases linearly, in that 0.99^1,000 is approximately the same thing as 0.999^10,000. So, an order of magnitude increase in probability requires roughly an order of magnitude increase in the number of flips. This is why it's not useful for the adversarial case, and why adversarial extractors are held apart from normal extractors.
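The numbers in the bullets above can be sanity-checked with a few lines of Python. Assumptions, stated loudly: flips are independent with a fixed P(H), each "toss" in the figures is read as one round of two flips, and "atoms in the universe" is taken as roughly 10^80. `rounds_needed` is a hypothetical helper.

```python
import math


def rounds_needed(p_heads, target):
    """Smallest number of rounds n such that P(no answer after n rounds) <= target.

    One round = two flips; a round fails (HH or TT) with probability
    p_heads**2 + (1 - p_heads)**2.
    """
    if not 0.0 < p_heads < 1.0:
        raise ValueError("a coin that always lands the same way never yields an answer")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a probability strictly between 0 and 1")
    round_fails = p_heads**2 + (1.0 - p_heads) ** 2
    return math.ceil(math.log(target) / math.log(round_fails))


ATOMS = 1e80  # rough order of magnitude, an assumption

fair = rounds_needed(0.5, 1 / ATOMS)     # 266, under the 275 quoted above
skewed = rounds_needed(0.75, 1 / ATOMS)  # 392

# The "horizon" grows roughly linearly as the coin gets more lopsided:
# 10x rarer tails needs about 10x the rounds for the same guarantee.
ratio = rounds_needed(0.999, 1 / ATOMS) / rounds_needed(0.99, 1 / ATOMS)

# And the arithmetic from the last bullet: both are about e**-10.
a, b = 0.99**1_000, 0.999**10_000
```

For any fixed bias the horizon is a constant; the constant is just not uniform across all possible coins, which is the part an adversary gets to choose.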
Whether this is a "give me my money back" type thing is for you to decide. I think for most people the claim that you can simulate a fair coin from a biased coin in, effectively, O(1), and that the constant increases in O(n) in the bias, is plainly incredible. :)
You don't need unbounded time for a single flip, that's all in your imagination. The worst-case time is unbounded, but you can't achieve the worst case.
It's very easy to achieve if someone hands you a "coin" that is made such that it never lands on tails.
Sorry for still being in the adversarial mindset, but this means that you essentially have to hardcode a maximum number of same-side flips after which you stop trusting the coin.
> You don't need unbounded time for a single flip, that's all in your imagination. The worst-case time is unbounded, but you can't achieve the worst case.
There literally isn't a bound, it can be arbitrarily large. This isn't just in my head, it's a fact. If it's bounded in your mind then what is the bound?
It's not a fact of the real world, at least. "You can't achieve it" is true. A pretty small number of failures and you're looking at a trillion years to make it happen. And if you buffer some flips that number gets even smaller.
> A pretty small number of failures and you're looking at a trillion years to make it happen.
This depends on the bias of the original coin. P(H) can be arbitrarily large, making P(HH) the likeliest possibility even for a trillion years. "This wouldn't happen in the real world" would be a sorry excuse for the deliberate refusal to clearly state the problem assumptions upfront.
IMO, if you really want to pleasantly surprise people, you need to be forthcoming and honest with them at the beginning about all your assumptions. There's really no good excuse to obfuscate the question and then move the goalposts when they (very predictably) fall into your trap.
> This depends on the bias of the original coin. P(H) can be arbitrarily large
> There's really no good excuse to obfuscate the question and then move the goalposts when they (very predictably) fall into your trap.
Interesting. Because I see the guy pulling out the one-in-a-million coin and expecting it to run at a similar speed to be doing a gotcha on purpose, not falling into a trap and having the goalposts moved.
And I think "well if it's a million times less likely to give me a heads, then it takes a million times as many flips, but it's just as reliable" is an answer that preserves the impressiveness and the goalposts.
It's fast relative to the bias. Which seems like plenty to me when the original claim never even said it was fast.
(And if the coin never gives you a heads then I'd say it no longer qualifies as randomly flipping a coin.)
i read "flip twice" as recursion, so, given we're talking randomness, yes, that could go on forever. but i don't think you really need to replace "twice."
Speaking from complete ignorance, with apologies to those whom that will annoy:
I'm sure it's possible to make a coin with what one might term "complex bias" where the bias extends over two events (thick gloop inside the coin, possibly heat activated or non-Newtonian).
This method sounds like the bias needs to be fixed ("simple bias" if you like)?
I guess that's just out of scope here.
Aside: There's a strong 'vibe' that HHHH HHHH with a single coin is somehow less random than HTHTHTHT HTHTHTHT with two coins when discarding HH and TT. I wonder what the origin of that feeling is - maybe just HT doesn't seem like it's a constant result simply due to nomenclature? If we call HT "type-A", then we just have HHHH HHHH vs. AAAA AAAA; and I _feel_ happier about the result!
I suspect this depends on where you draw the upper bound, since a really really complex biased coin is one that spies on your thesis and data and is committed to making you suffer.
Could the vibe be due to the fact that HHHH… seems like the coin could not just be biased, but completely broken, coming up heads every time? There are two distinct possibilities here:
1) the coin is broken and is always H
2) the coin is random or possibly biased, and you got a string of H by chance
And observing the string of H… increases the probability of 1) in the Bayesian sense.
With the two coins you eliminate this possibility altogether - a broken coin can never produce HT or TH.
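A small sketch of that Bayesian update, with made-up numbers: the 1% prior on "broken" and the 0.5 heads probability for a working coin are assumptions, and `posterior_broken` is a hypothetical helper.

```python
def posterior_broken(flips, prior_broken=0.01, p_heads_if_working=0.5):
    """P(coin is broken and always lands H | observed flips), via Bayes' rule.

    `flips` is a string of "H"/"T". A broken coin can only produce heads,
    so a single tail drives the posterior to zero.
    """
    n_heads = flips.count("H")
    n_tails = len(flips) - n_heads
    like_broken = 1.0 if n_tails == 0 else 0.0
    like_working = p_heads_if_working**n_heads * (1.0 - p_heads_if_working) ** n_tails
    evidence = prior_broken * like_broken + (1.0 - prior_broken) * like_working
    return prior_broken * like_broken / evidence


after_ten_heads = posterior_broken("H" * 10)  # a run of heads makes "broken" likely
after_one_pair = posterior_broken("HT")       # one HT rules "broken" out entirely
```

With these numbers, ten heads in a row moves the "broken" hypothesis from 1% to over 90%, while observing any HT or TH sets it to exactly zero, which matches the intuition that AAAA AAAA feels fine and HHHH HHHH feels suspicious.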
The legacy of the electric motor is not textile factories that are 30% more efficient because we point-replaced steam engines. It's the assembly line. The workforce "lost" the skills to operate the textile factories, but in turn, the assembly line made the workflow of goods production vastly more efficient. Industrial abstraction has been so successful that today, a small number of factories (e.g., TSMC) have become nearly-existential bottlenecks.
That is the aspiration of AI software tools, too. They are not coming to make us 30% more efficient, they are coming to completely change how the software engineering production line operates. If they are successful, we will write fewer tests, we will understand less about our stack, and we will develop tools and workflows to manage that complexity and risk.
Maybe AI succeeds at this stated objective, and maybe it does not. But let's at least not kid ourselves: this is how it has always been. We are in the business of abstracting things away so that we have to understand less to get things done. We grumbled when we switched from assembly to high-level languages. We grumbled when we switched from high-level languages to managed languages. We grumbled when we started programming enormous piles of JavaScript and traveled farther from the OS and the hardware. Now we're grumbling about AI, and you can be sure that we're going to grumble about whatever is next, too.
I understand this is going to ruffle a lot of feathers, but I don't think Thomas and the Fly team actually have missed any of the points discussed in this article. I think they fully understand that software production is going to change and expect that we will build systems to cope with abstracting more, and understanding less. And, honestly, I think they are probably right.
I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
> There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.
Clearly an LLM is not "omniscient". It doesn't require a post to refute that; OP obviously doesn't mean that literally. It's an analogy describing two semi (fairly?) independent axes. One is breadth of knowledge; the other is something more similar to intelligence and being able to "reason" from smaller components of knowledge, the opposite of which is dim-witted.
So at one extreme you'd have something completely unable to generalize or synthesize new results. Only able to correctly respond if it identically matches prior things it has seen, but has seen and stored a ton. At the other extreme would be something that only knows a very small set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.
It's a great analogy and way to think about the problem. And it took me multiple paragraphs to write what OP expressed in two sentences via a great analogy.
LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.
> What we do have are very good pattern matchers and probabilistic data generators
This is an unhelpful description. An object is more than the sum of its parts, and higher-level behaviors emerge. This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.
Language matters. Using language that accurately describes concepts and processes is important. It might not matter to a language model which only sees patterns, but it matters to humans.
So when we label the technical processes and algorithms these tools use as something that implies a far greater level of capability, we're only doing a disservice to ourselves. Maybe not to those of us who are getting rich on the market hype that these labels fuel, but certainly to the general population that doesn't understand how the technology works. If we claim that these tools have super-human intelligence, yet they fail basic tasks, how do we explain this? More importantly, if we collectively establish a false sense of security and these tools are adopted in critical processes that human lives depend on, who is blamed when they fail?
> This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.
No, because we have descriptive language to describe a collection of gates and wires by what it enables us to do: perform arbitrary computations, hence a "computer". These were the same tasks that humans used to do before machines took over, so the collection of gates and wires is just an implementation detail.
Pattern matching, prediction, data generation, etc. are the tasks that modern AI systems allow us to do, yet you want us to refer to this as "intelligence" for some reason? That makes no sense to me. Maybe we need new higher level language to describe these systems, but "intelligence", "thinking", "reasoning" and "wit" shouldn't be part of it.
>There's nothing "omniscient" or "dim-witted" about these tools
I disagree in that that seems quite a good way of describing them. All language is a bit inexact.
Also I don't buy that we are no closer to AI than ten years ago - there seems to be lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - I mean look at AlphaEvolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
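The quoted result is about cutting the number of scalar multiplications needed for a matrix product. The scheme from the linked article is not reproduced here; as a minimal sketch of the general idea, this is Strassen's classic 2x2 construction, which uses 7 multiplications where the schoolbook method uses 8. The function names are just illustrative.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using 7 scalar multiplications instead of 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # The seven products; everything else is additions and subtractions.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(A, B):
    """Schoolbook product: 8 scalar multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B) == naive_2x2(A, B))
```

Applied recursively to blocks, saving one multiplication per level is what lowers the asymptotic cost, which is why shaving even a single multiplication off a small case counts as a result.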
> I figure it's hard to argue that that is not at least somewhat intelligent?
The fact that this technology can be very useful doesn't imply that it's intelligent. My argument is about the language used to describe it, not about its abilities.
The breakthroughs we've had are because there is a lot of utility in finding patterns in data, which humans aren't very good at. Many of our problems can be boiled down to this task. So when we have vast amounts of data and compute at our disposal, we can be easily impressed by results that seem impossible for humans.
But this is not intelligence. The machine has no semantic understanding of what the data represents. The algorithm is optimized for generating specific permutations of tokens that match something it previously saw and was rewarded for. Again, very useful, but there's no thinking or reasoning there. The model doesn't have an understanding of why the wolf can't be close to the goat, or how a cabbage tastes. It's trained on enough data and algorithmic tricks that its responses can fool us into thinking it does, but this is just an illusion of intelligence. This is why we need to constantly feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry", or that it doesn't generate racially diverse but historically inaccurate images.
>The machine has no semantic understanding of what the data represents.
How do you define "semantic understanding" in a way that doesn't ultimately boil down to saying they don't have phenomenal consciousness? Any functional concept of semantic understanding is captured to some degree by LLMs.
Typically when we attribute understanding to some entity, we recognize some substantial abilities in the entity in relation to that which is being understood. Specifically, the subject recognizes relevant entities and their relationships, various causal dependences, and so on. This ability goes beyond rote memorization, it has a counterfactual quality in that the subject can infer facts or descriptions in different but related cases beyond the subject's explicit knowledge. But LLMs excel at this.
>feed it more tricks so that it doesn't fumble with basic questions like how many "R"s are in "strawberry"
This failure mode has nothing to do with LLMs lacking intelligence and everything to do with how tokens are represented. They do not see individual characters, but sub-word chunks. It's like expecting a human to count the pixels in an image it sees on a computer screen. While not impossible, it's unnatural to how we process images and therefore error-prone.
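A minimal sketch of the tokenization point. The vocabulary and the greedy longest-match splitter below are hypothetical, made up for illustration; real tokenizers learn their vocabulary from data and may well split "strawberry" differently.

```python
# Toy vocabulary: hypothetical, not taken from any real model.
VOCAB = {"str": 0, "aw": 1, "berry": 2, "s": 3, "t": 4, "r": 5,
         "a": 6, "w": 7, "b": 8, "e": 9, "y": 10}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into sub-word token IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"
token_ids = tokenize(word)
print(token_ids)        # what the model is given: a few opaque IDs
print(word.count("r"))  # the character-level question being asked
```

Nothing in the ID sequence says how many "r"s each chunk contains; that has to be learned indirectly, which is the pixel-counting situation described above.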
You don't need phenomenal consciousness. You need consistency.
LLMs are not consistent. This is unarguable. They will produce a string of text that says they have solved a problem and/or done a thing when neither is true.
And sometimes they will do it over and over, even when corrected.
Your last paragraph admits this.
Tokenisation on its own simply cannot represent reality accurately and reliably. It can be tweaked so that specific problems can appear solved, but true AI would be based on a reliable general strategy which solves entire classes of problems without needing this kind of tweaking.
This is a common category of error people commit when talking about LLMs.
"True, LLMs can't do X, but a lot of people don't do X well either!"
The problem is, when you say humans have trouble with X, what you mean is that human brains are fully capable of X, but sometimes they do, indeed, make mistakes. Or that some humans haven't trained their faculties for X very well, or whatever.
But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.
These things are not comparable.
So, to your specific point: When an LLM is inconsistent, it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data. When a human is inconsistent, it is because they got distracted, didn't learn enough about this particular subject, or otherwise made a mistake that they can, if their attention is drawn to it, recognize and correct.
LLMs cannot. They can only be told they made a mistake, which prompts them to try again (because that's the pattern that has been trained into them for what happens when told they made a mistake). But their next try won't have any better odds of being correct than their previous one.
>But LLMs are fundamentally, completely, incapable of X. It is not something that can be a result of their processes.
This is the very point of contention. You don't get to just assume it.
> it is because it is, at its root, a statistical engine generating plausible next tokens, with no semantic understanding of the underlying data.
Another highly contentious point you are just outright assuming. LLMs are modelling the world, not just "predicting the next token". Some examples here[1][2][3]. Anyone claiming otherwise at this point is not arguing in good faith. It's interesting how the people with the strongest opinions about LLMs don't seem to understand them.
OK, sure; there is some evidence potentially showing that LLMs are constructing a world model of some sort.
This is, however, a distraction from the point, which is that you were trying to make claims that the described lack of consistency in LLMs shouldn't be considered a problem because "humans aren't very consistent either."
Humans are perfectly capable of being consistent when they choose to be. Human variability and fallibility cannot be used to handwave away lack of fundamental ability in LLMs. Especially when that lack of fundamental ability is on empirical display.
I still hold that LLMs cannot be consistent, just as TheOtherHobbes describes, and you have done nothing to refute that.
Address the actual point, or it becomes clear that you are the one arguing in bad faith.
You are misrepresenting the point of contention. The question is whether LLMs' lack of consistency undermines the claim that they "understand" in some relevant sense. But arguing that lack of consistency is a defeater for understanding is itself undermined by noting that humans are inconsistent but do in fact understand things. It's as simple as that.
If you want to alter the argument by saying humans can engage in focused effort to reach some requisite level of consistency for understanding, you have to actually make that argument. It's not at all obvious that focused effort is required for understanding or that a lack of focused effort undermines understanding.
You also need to contend with the fact that LLMs aren't really a single entity, but are a collection of personas, and what you get and its capabilities depend to a large degree on how you prompt it. Even if the entity as a whole is inconsistent between prompts, the right subset might very well be reliably consistent. There's also the fact of the temperature setting that artificially injects randomness into the LLM's output. An LLM itself is entirely deterministic. It's not at all obvious how consistency relates to LLM understanding.
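A minimal sketch of the temperature point, assuming the model's forward pass yields fixed scores (logits) for a given input. The logits and the `sample_token` helper are hypothetical; real inference stacks add other sampling controls on top.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from logits; temperature 0 means greedy (argmax)."""
    if temperature == 0:
        # Greedy decoding: the same logits always give the same token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over logits scaled by 1/temperature (max subtracted for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3]  # hypothetical scores for three candidate tokens
rng = random.Random(0)
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
print(greedy)   # a single index: the highest-scoring token every time
print(sampled)  # the set of indices drawn when sampling at temperature 1.0
```

The randomness lives entirely in the sampling step; the scores being sampled from are the same on every call.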
Feel free to do some conceptual work to make an argument; I'm happy to engage with it. What I'm tired of are these half-assed claims and incredulity that people don't take them as obviously true.
I imagine if you asked the LLM why the wolf can't be close to the goat it would give a reasonable answer. I realise it does it by using permutation of tokens but I think you have to judge intelligence by the results rather than the mechanism otherwise you could argue humans can't be intelligent because they are just a bunch of neurons that find patterns.
Actually I think the Chinese room fits my idea. It's a silly thought experiment that would never work in practice. If you tried to make one you would judge it unintelligent because it wouldn't work. Or at least in the way Searle implied - he basically proposed a look up table.
We have had programs that can give good answers to some hard questions for a very long time now. Watson won Jeopardy! back in 2011, but it still wasn't very good at replacing humans.
So that isn't a good way to judge intelligence. Computers are so fast and have so much data that you can make programs to answer just about anything pretty well; an LLM is able to do that, but more automatically. But it still doesn't automate the logical parts yet, just the lookup of knowledge. We don't know how to train large logic models, just large language models.
LLMs are not the only model type though? There's a plethora of architectures and combinations being researched.. And even transformers start to be able to do cool sh1t on knowledge graphs, also interesting is progress on autoregressive physics PDE (partial differential equations) models.. and can't be too long until some providers of actual biological neural nets show up on openrouter (probably a lot less energy and capital intensive to scale up brain goo in tanks compared to gigawatt GPU clusters).. combine that zoo of "AI" specimens using M2M, MCP etc. and the line between mock and "true" intelligence will blur, escalating our feeble species into ASI territory.. good luck to us.
> There's a plethora of architectures and combinations being researched
There was a plethora of architectures and combinations being researched before LLMs, and it still took a very long time to find the LLM architecture.
> the line between mock and "true" intelligence will blur
Yes, I think this will happen at some point. The question is how long it will take, not if it will happen.
The only thing that can stop this is if intermediate AI is good enough to give every human a comfortable life but still isn't good enough to think on its own.
It's easy to imagine such an AI being developed: imagine a model that can learn to mimic humans at any task, but still cannot update itself without losing those skills and becoming worse. Such an AI could be trained to perform every job on earth as long as we don't care about progress.
If such an AI is developed, and we don't quickly solve the remaining problems to get an AI to be able to progress science on its own, it's likely our progress entirely stalls there, as humans will no longer have a reason to go to school to advance science.
This approach to defining “true” intelligence seems flawed to me because of examples in biology where semantic understanding is in no way relevant to function. A slime mold solving a maze doesn’t even have a brain, yet it solves a problem to get food. There’s no knowing that it does that, no complex signal processing, no self-perception of purpose, but nevertheless it gets the food it needs. My response to that isn’t to say the slime mold has no intelligence, it’s to widen the definition of intelligence to include the mold. In other words, intelligence is something one does rather than has; it’s not the form but the function of the thing. Certainly LLMs lack anything in any way resembling human intelligence, they even lack brains, but they demonstrate a capacity to solve problems that I don’t think it is unreasonable to label intelligent behavior. You can put them in some mazes and LLMs will happen to solve them.
>LLMs lack anything in any way resembling human intelligence
I think intelligence has many aspects from moulds solving mazes to chess etc. I find LLMs resemble very much human rapid language responses where you say something without thinking about it first. They are not very good at thinking though. And hopeless if you were to say hook one to a robot and tell it to fix your plumbing.
While it's debatable whether slime molds showcase intelligence, there's a substantial difference between their behavior and that of modern AI systems. The organism was never trained to traverse a maze. It simply behaves in the same way as it would in its natural habitat, seeking out food in this case, which we interpret as "solving" a human-made problem. In order to get an AI system to do the same we would have to "train" it on large amounts of data that specifically included maze solving. This training wouldn't carry over to any other type of problem, for which we would also need to train it specifically.
When you consider how humans and other animals learn, knowledge is carried over. I.e. if we learn how to solve a maze on paper, we can carry this knowledge over to solve a hedge maze. It's a contrived example, but you get the idea. When we learn, we build out a web of ideas in our minds which we can later use while thinking to solve other types of problems, or the same problems in different ways. This is a sign of intelligence that modern AI systems simply don't have. They're showing an illusion of intelligence, which as I've said before, can still be very useful.
My alternative definition would be something like this. Intelligence is the capacity to solve problems, where a problem is defined contextually. This means that what is and is not intelligence is negotiable in situations where the problem itself is negotiable. If you have water solve a maze, then yes the water could be said to have intelligence, though that would be a silly way to put it. It’s more that intelligence is a material phenomenon, and things which seem like they should be incredibly stupid can demonstrate surprisingly intelligent behavior.
LLMs are leagues ahead of viruses or proteins or water. If you put an LLM into a code editor with access to error messages, it can solve a problem you create for it, much like water flowing through a maze. Does it learn or change? No, everything is already there in the structure of the LLM. Does it have agency? No, it’s a transparently deterministic mapping from input to output. Can it demonstrate intelligent behavior? Yes.
That's an interesting way of looking at it, though I do disagree. Mainly because, as you mention, it would be silly to claim that water is intelligent if it can be used to solve a problem. That would imply that any human-made tool is intelligent, which is borderline absurd.
This is why I think it's important that if we're going to call these tools intelligent, then they must follow the processes that humans do to showcase that intelligence. Scoring high on a benchmark is not a good indicator of this, in the same way that a human scoring high on a test isn't. It's just one convenient way we have of judging this, and a very flawed one at that.
I keep on trying this wolf cabbage goat problem with various permutations, let’s say just a wolf and a cabbage, no goat mentioned. At some step the goat materializes in the answer. I tell it there is no goat and yet it answers again and the goat is there.
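For contrast, a minimal sketch of what solving the puzzle by explicit search looks like. The `solve` function and its rule encoding are made up for illustration; the point is only that a solver working from the stated rules cannot produce an item it was never given.

```python
from collections import deque


def solve(items, forbidden):
    """Breadth-first search for the shortest sequence of crossings.

    items: things the farmer must ferry from the left bank to the right.
    forbidden: pairs that may not be left together without the farmer.
    Returns a list of moves: the item carried, or None for an empty trip.
    """
    items = frozenset(items)

    def safe(bank):
        return not any(a in bank and b in bank for a, b in forbidden)

    start = (items, "left")  # (items on the left bank, farmer's side)
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == "left" else items - left
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer == "left":
                    new_left.discard(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            new_farmer = "right" if farmer == "left" else "left"
            # The bank the farmer just left is the one that is unattended.
            unattended = new_left if new_farmer == "right" else items - new_left
            state = (new_left, new_farmer)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo]))
    return None


rules = [("wolf", "goat"), ("goat", "cabbage")]
print(len(solve(["wolf", "goat", "cabbage"], rules)))  # classic puzzle
print(solve(["wolf", "cabbage"], rules))               # the no-goat variant
```

With the goat removed, no rule is ever triggered and the search just ferries the two items across one at a time.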
I am not sure we are on the same page: the point of my response is that this paper is not enough to prevent exactly the argument you just made.
In any event, if you want to take umbrage with this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why it's better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.
Output orientation - Is the output similar to what a human would create if they were to think?
Process orientation - Is the machine actually thinking, when we say it's thinking?
I met someone who once drew a circuit diagram from memory. However, they didn’t draw it from inputs, operations, to outputs. They started drawing from the upper left corner, and continued drawing to the lower right, adding lines, triangles and rectangles as need be.
Rote learning can help you pass exams. At some point, it’s a meaningless difference between the utility of “knowing” how engineering works, and being able to apply methods and provide a result.
This is very much the confusion at play here, so both points are true.
1) These tools do not “Think”, in any way that counts as human thinking
2) The output is often the same as what a thinking human would create.
IF you are concerned with only the product, then what’s the difference? If you care about the process, then this isn’t thought.
To put it in a different context. If you are a consumer, do you care if the output was hand crafted by an artisan, or do you just need something that works.
If you are a producer in competition with others, you care if your competition is selling knock-offs at a lower price.
> IF you are concerned with only the product, then what’s the difference?
The difference is substantial. If the machine was actually thinking and it understood the meaning of its training data, it would be able to generate correct output based on logic, deduction, and association. We wouldn't need to feed it endless permutations of tokens so that it doesn't trip up when the input data changes slightly. This is the difference between a system with _actual_ knowledge, and a pattern matching system.
The same can somewhat be applied to humans as well. We can all either memorize the answers to specific questions so that we pass an exam, or we can actually do the hard work, study, build out the complex semantic web of ideas in our mind, and acquire actual knowledge. Passing the exam is simply a test of a particular permutation of that knowledge, but the real test is when we apply our thought process to that knowledge and generate results in the real world.
Modern machine learning optimizes for this memorization-like approach, simply because it's relatively easy to implement, and we now have the technical capability where vast amounts of data and compute can produce remarkable results that can fool us into thinking we're dealing with artificial intelligence. We still don't know how to model semantic knowledge that doesn't require extraordinary amounts of resources. I believe classical AI research in the 20th century leaned more towards this direction (knowledge-based / expert systems, etc.), but I'm not well versed in the history.
But if you need a submarine that can swim as agilely as a fish then we still aren't there yet; fish are far superior to submarines in many ways. So submarines might be faster than fish, but there are so many maneuvers that fish can do that the submarine can't. It's the same here with thinking.
So just like computers are better than humans at multiplying numbers, there are still many things we need human intelligence for, even in today's era of LLMs.
The point here (which is from a quote by Dijkstra) is that if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to.
So if an LLM generates working code, correct translations, valid points relating to complex matters and so on it doesn't matter if it does so by thinking or by some other mechanism.
> if the desired result is achieved (movement through water) it doesn't matter if it happens in a different way than we are used to
But the point is that the desired result isn't achieved, we still need humans to think.
So we still need a word for what humans do that is different from what LLMs do. If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?
Submarines and swimming is a great metaphor for this, since submarines clearly don't swim and thus have very different abilities in water: they're way better in some ways but way worse in others. So using that metaphor it's clear that LLM "thinking" cannot be described with the same words as human thinking, since it's so different.
>If you are saying there is no difference then how do you explain the vast difference in capability between humans and LLM models?
No I completely agree that they are different, like swimming and propulsion by propellers - my point is that the difference may be irrelevant in many cases.
Humans haven't been able to beat computers in chess since the 90s, long before LLMs became a thing. Chess engines from the 90s were not at all "thinking" in any sense of the word.
It turns out "thinking" is not required in order to win chess games. Whatever mechanism a chess engine uses gets better results than a thinking human does, so if you want to win a chess game, you bring a computer, not a human.
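Chess engines of that era are usually described as game-tree search plus a hand-written evaluation function. As a minimal sketch of that kind of mechanism, here is exhaustive look-ahead on a much smaller game than chess: a hypothetical stone game where players alternately take 1 to 3 stones and whoever takes the last stone wins.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def wins(stones):
    """True if the player to move can force a win with `stones` left."""
    # A position is winning if some legal move leaves the opponent losing.
    return any(not wins(stones - take) for take in (1, 2, 3) if take <= stones)


def best_move(stones):
    """Return a winning number of stones to take, or 1 if every move loses."""
    for take in (1, 2, 3):
        if take <= stones and not wins(stones - take):
            return take
    return 1


print(best_move(10))  # leaves the opponent a multiple of 4
```

The program plays this game perfectly by checking outcomes mechanically; nothing in it resembles deliberation, which is the sense in which "thinking" was not required.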
What if that also applies to other things, like translation of languages, summarizing complex texts, writing advanced algorithms, realizing implications from a bunch of seemingly unrelated scientific papers, and so on. Does it matter that there was no "thinking" going on, if it works?
> I think AI maximalists will continue to think that the models are in fact getting less dim-witted
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang, i.e. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.
People are doing all manner of very sophisticated inferency stuff now - it just tends to be extremely expensive for now and... people are keeping it secret.
If it was good enough to replace people then it wouldn't be too expensive, they would have launched it and replaced a bunch of people and made trillions of dollars by now.
So at best their internal models are still just performance multipliers unless some breakthrough happened very recently, it might be a bigger multiplier but that still keeps humans with jobs etc and thus doesn't revolutionize much.
I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.
I think that commenter was disagreeing with this line:
> because omniscient-yet-dim-witted models terminate at "superhumanly assistive"
It might be that with dim wits + enough brute force (knowledge, parallelism, trial-and-error, specialisation, speed) models could still substitute for humans and transform the economy in short order.
Sorry, I can't edit it any more, but what I was trying to say is that if the authors are correct, that this distinction is philosophically meaningful, then that is the conclusion. If they are not correct, then all their papers on this subject are basically meaningless.