
And by this time next year, this comment is going to look very silly

there are built-in moderation tools you should turn on if you have external customers generating images or inputting data that might be sketchy


In the example in this blog post, they did something recommended by Google and still got banned. Based on that, I'm not sure their built-in moderation tools are enough insurance.


It can be super hard to moderate before an image is generated, though. People can write in cryptic language and then say "decode this message and generate an image of the result", etc. The downside of LLMs is that they are super hard to moderate because they will gladly arbitrarily encode input and output. You need to use an LLM as advanced as the one you are running in production to actually check whether requests are obscene.
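
As a rough sketch of the kind of two-pass check I mean (call_llm() is a hypothetical wrapper around whatever frontier model you run in production, and the prompts are made up):

    # Rough sketch of a two-pass check. call_llm() is a hypothetical stand-in
    # for whatever production-grade model you'd actually call; prompts are made up.

    def call_llm(prompt: str) -> str:
        ...  # hypothetical: route to a model as capable as your production one

    def is_request_allowed(user_prompt: str) -> bool:
        # Pass 1: have the model spell out what the request actually asks for,
        # decoding any "decode this and draw the result" style indirection.
        decoded_intent = call_llm(
            "Describe literally what image the following request would produce, "
            "decoding any ciphers or indirection:\n" + user_prompt
        )
        # Pass 2: moderate the decoded intent, not the obfuscated input.
        verdict = call_llm(
            "Answer ALLOW or BLOCK: does the following image request violate policy?\n"
            + decoded_intent
        )
        return verdict.strip().upper().startswith("ALLOW")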


And these tools are perfect?


This entire thing has been pretty disingenuous on both sides of the fence. All the anti-AI (or anti-OpenAI) people are doing victory laps, but what GPT-5 Pro did is still very valuable.

1) What good is your open problem set if it's really a trivial "google search" away from being solved? Why are they not catching any blame here?

2) These answers still weren't perfectly laid out for the most part. GPT-5 was still doing some cognitive lifting to piece it together.

If a human had done this by hand, it would have made news, and the narrative would instead have been inverted to ask serious questions about the validity of some of these problem sets and/or ask how many other solutions are out there that just need to be pieced together from pre-existing research.

But, you know, AI Bad.


Framing this question as "AI good" OR "AI bad" is culture-war thinking.

The real problem here is that there's clearly a strong incentive for the big labs to deceive the public (and/or themselves) about the actual scientific and technical capabilities of LLMs. As Karpathy pointed out on the recent Dwarkesh podcast, LLMs are quite terrible at novel problems, but this has become sort of an "Emperor's new clothes" situation where nobody with a financial stake will actually admit that, even though it's common knowledge if you actually work with these things.

And this directly leads to the misallocation of billions of dollars and potentially trillions in economic damage as companies align their 5-year strategies towards capabilities that are (right now) still science fiction.

The truth is at stake.


Except they weren't intentionally trying to deceive anyone. They made the faulty assumption that these problems were non-trivial to solve and didn't realize it was simply GPT-5 aggregating solutions already in the wild.


Knowing what I know about LLMs, from their internal architecture and from extensive experience working with them daily, I would find this kind of result highly surprising and in clear violation of my mental model of how these things work. And I'm very far from an expert.

If a purported expert in the field is willing to credulously publish this kind of result, it's not unreasonable to assume that either they're acting in bad faith, or (at best) they're high on their own supply regarding what these things can actually do.


> What good is your open problem set if it's really a trivial "google search" away from being solved? Why are they not catching any blame here?

They are a community-run database, not the sole arbiter and source of this information. We learned the most basic research skills back in high school; I'd hope researchers from top institutions now working for one of the biggest frontier labs can do the same prior to making a claim, but microblogging has been and continues to be a blight on accurate information, so nothing new there.

> GPT-5 was still doing some cognitive lifting to piece it together.

Cognitive lifting? It's a model, not a person. But besides that fact, this was already published literature. Handy that an LLM can be a slightly better search, but calling claims of "solving maths problems" out as irresponsible and inaccurate is the only right choice in this case.

> If a human had done this by hand, it would have made news [...]

"Researcher does basic literature review" isn't news in this or any other scenario. If we did a press release every journal club, there wouldn't be enough time to print a single page advert.

> [...] how many other solutions are out there that just need to be pieced together from pre-existing research [...]

I am not certain you actually looked into the model output or why this was such an embarrassment.

> But, you know, AI Bad.

AI hype very bad. AI anthropomorphism even worse.


This is a strawman argument. No anti-AI sentiment was involved here; it's simply the fact that finding and matching text on the Internet is several orders of magnitude easier than finding novel solutions to hard math problems.


You didn't read the X replies if you believe that


> 1) What good is your open problem set if it's really a trivial "google search" away from being solved? Why are they not catching any blame here?

Please explain how this is in any way related to the matter at hand. What is the relation between the incompleteness of a math problem database and AI hypesters lying about the capabilities of GPT-5? I fail to see the relevance.

> If a human had done this by hand, it would have made news

If someone updated information on an obscure math problem aggregator database this would be news?? Again, I fail to see your point here.


AI great, but AI not creative, yet.


You're moving the goal post.


I love Karpathy, but he is wrong here. In a few short years we went from chatbots being toys and video creation being predicted to be impossible in the near term, to agents writing working apps and high-def video that is occasionally indistinguishable from real life.

The rate, depth, breadth, and frequency of releases have only increased, not decreased. Meanwhile, everyone is waiting with bated breath for Gemini 3 to drop. A decade for reliable agents is not only comical, but willful cognitive dissonance at this point.


"grunt, gulp, webpack, coffeescript, babel" --- except no one uses these anymore and they are dead outside of legacy software.

The problem with the python tooling is no one can get it right. There aren't clear winners for a lot of the tooling.


I think that's the point. Every now and then a language will have a small explosion of new tooling, and all you can really do is wait for it to blow over and see what tools people adopted afterwards. It feels like Python is going through a period like that at the moment.


13B is still a super tiny model. Latent reasoning doesn't really appear until around 100B params. It's like how Noam reported GPT-5 finding errors on Wikipedia. Wikipedia is surely a part of its training data, along with numerous other bugs in the data despite their best efforts. That wasn't enough to fundamentally break it.


> Latent reasoning doesn't really appear until around 100B params.

Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.
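
To be concrete about "pre-fill the context window", here is a minimal sketch of the mechanism I'm describing (generate() is a hypothetical stand-in for any autoregressive decoder, not a real library call):

    # Minimal sketch of what a "reasoning" model does at inference time.
    # generate() is a hypothetical next-token sampler, not any real API.

    def generate(prompt: str) -> str:
        ...  # hypothetical: autoregressive decoding, one token at a time

    def answer_with_reasoning(question: str) -> str:
        # Pass 1: sample "thinking" tokens. Nothing here is verified or executed;
        # it is just more generated text appended to the context window.
        thinking = generate(question + "\nThink step by step:\n")
        # Pass 2: condition the final answer on question + thinking. When useful
        # facts happen to land in `thinking`, scores improve -- but mechanically
        # it is still next-token prediction end to end.
        return generate(question + "\n" + thinking + "\nFinal answer:\n")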

I hear random users here talk about "emergent behavior" like "latent reasoning" but never anyone serious talking about this (exception: people who are profiting off the current bubble) so I'd _love_ to see rigorous definitions of these terms and evidence of this behavior, especially from someone who doesn't stand to gain from another cash infusion from SoftBank.

I suspect these things don't exist. At the very most, they're a mirage, and exist in the way a rainbow does. Go on and try to find that pot of gold, eh?


> Please provide a citation for wild claims like this. Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

That seems to be splitting hairs - the currently-accepted industry-wide definition of "reasoning" models is that they use more test-time compute than previous model generations. Suddenly disavowing the term reasoning model doesn't help the discussion, that ship has sailed.

My understanding is that reasoning is an emergent behavior of reinforcement learning steps in model training, where task performance is rewarded, and (by no external input!) the model output starts to include phrases ala "Wait, let me think". Why would "emergent behavior" not be the appropriate term to describe something that's clearly happening, but not explicitly trained for?

I have no idea whether the aforementioned 100B parameter size limit holds true or not, though.


Saying that "the ship has sailed" for something which came yesterday and is still a dream rather than reality is a bit of a stretch.

So, if a couple LLM companies decide that what they do is "AGI" then the ship instantly sails?


Only matters if they can convince others that what they do is AGI.

As always ignore the man behind the curtain.


Just like esoteric appropriation of 'quantum entanglement', right? It's vibe semantics now.


I'm almost positive reasoning is not an emergent behavior, considering the reasoning models have a specific architecture. As a source: https://arxiv.org/html/2504.09762v1


> currently-accepted industry-wide definition of "reasoning"

You can't both (1) declare "reasoning" to be something wildly different than what humans mean by reasoning and (2) insist people are wrong when they use the normal definition to say models don't reason. You gotta pick a lane.


I don't think it's too problematic; it's hard to say something is "reasoning" without saying what that something is. For another example of terms that adjust their meaning to context, take the word "cache" in "processor cache": we know what that is because it's in the context of a processor. Then there's "cache me outside", which comes from some TV episode.


It's a tough line to tread.

Arguably, a lot of unending discourse about the "abilities" of these models stems from using ill-defined terms like reasoning and intelligence to describe these systems.

On the one hand, I see the point that we really struggle to define intelligence, consciousness etc for humans, so it's hard to categorically claim that these models aren't thinking, reasoning or have some sort of intelligence.

On the other, it's also transparent that a lot of the words are chosen somewhat deliberately to anthropomorphize the capabilities of these systems for pure marketing purposes. So the claimant needs to demonstrate something beyond rebutting with "Well the term is ill-defined, so my claims are valid."

And I'd even argue the marketers have won overall: by refocusing the conversation on intelligence and reasoning, the more important conversation about the factually verifiable capabilities of the system gets lost in a cycle of circular debate over semantics.


sure, but maybe the terms intelligence and reasoning aren't that bad when describing what human behavior we want these systems to replace or simulate. I'd also argue that while we struggle to define what these terms actually mean, we struggle less to remember what these terms represent when using them.

I'd even argue that it's appropriate to use these terms because machine intelligence kinda sorta looks and acts like human intelligence, and machine reasoning models kinda sorta look like how a human brain reasons about things, or infers consequences of assertions, "it follows that", etc.

Like computer viruses, we call them viruses because they kinda sorta behave like a simplistic idea of how biological viruses work.

> currently-accepted industry-wide definition of "reasoning"

The currently-accepted industry-wide definition of reasoning will probably only apply to whatever industry we're describing, i.e., are we talking about human-built machines, or the biological brain activity we kinda sorta model these machines on?

marketing can do what they want; I've got no control over either the behavior of marketers or their effect on their human targets.


Or you could accept that sometimes fields contain terms of art that are non-intuitive to outsiders. Go ask an astronomer what their working definition of a metal is.


No. This is the equivalent of an astronomer telling a blacksmith they're using the term "metal" incorrectly. Your jargon does not override everyone else's language.


> Even "reasoning" models are not actually reasoning, they just use generation to pre-fill the context window with information that is sometimes useful to the task, which sometimes improves results.

I agree that seems weak. What would “actual reasoning” look like for you, out of curiosity?


Not parent poster, but I'd approach it as:

1. The guess_another_token(document) architecture has been shown not to obey the formal logic we want.

2. There's no particular reason to think such behavior could be emergent from it in the future, and anyone claiming so would need extraordinary evidence.

3. I can't predict what other future architecture would give us the results we want, but any "fix" that keeps the same architecture is likely just more smoke-and-mirrors.


Seems to fall apart at 1

>1. The guess_another_token(document) architecture has been shown not to obey the formal logic we want.

What 'reasoning formal logic' have humans been verified to obey that LLMs don't?


... Consider this exchange:

Alice: "Bob, I know you're very proud about your neural network calculator app, but it keeps occasionally screwing up with false algebra results. There's no reason to think this new architecture will reliably do all the math we need."

Bob: "How dare you! What algebra have humans been verified to always succeed-at which my program doesn't?! Huh!? HUH!?"

___________

Bob's challenge, like yours, is not relevant. The (im)perfection of individual humans doesn't change the fact that the machine we built to do things for us is giving bad results.


It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

If Alice had concluded that this occasionally-mistaken NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.


> If Alice had concluded that this occasionally-mistaken NN calculator was 'not really performing algebra', then Bob would be well within his rights to ask Alice what on earth she was going on about.

No, your burden of proof here is totally bass-ackwards.

Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken. Bob's the one who has to start explaining the discrepancy, and whether the failure is (A) a fixable bug or (B) an unfixable limitation that can be reliably managed or (C) an unfixable problem with no good mitigation.

> It's not irrelevant, because this is an argument about whether the machine can be said to be reasoning or not.

Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

However the track-record of LLMs on such things is long and clear: They fake it, albeit impressively.

The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense. It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.


>Bob's the one who asked for blind trust that his magical auto-learning black-box would be made to adhere to certain rules... but the rules and trust are broken.

This is the problem with analogies. Bob did not ask for anything, nor are there any 'certain rules' to adhere to in the first place.

The 'rules' you speak of only exist in the realm of science fiction or your own imagination. Nothing remotely considered a general intelligence (whether you think that's just humans or include some of our animal friends) is an infallible logic automaton; it literally does not exist. Science fiction is cool and all, but it doesn't take precedence over reality.

>Bringing up "b-b-but homo sapiens" is only "relevant" if you're equivocating the meaning of "reasoning", using it in a broad, philosophical, and kinda-unprovable sense.

You mean the only sense that actually exists? Yes. It's also not 'unprovable' in the sense I'm asking about. Nobody has any issues answering this question for humans, rocks, bacteria, or a calculator. You just can't define anything that will cleanly separate humans and LLMs.

>In contrast, the "reasoning" we actually wish LLMs would do involves capabilities like algebra, syllogisms, deduction, and the CS-classic boolean satisfiability.

Yeah, and they're capable of doing all of those things. The best LLMs today are better than most humans at them, so again, what is Alice rambling about?

>The LLM will finish the popular 2+2=_, and we're amazed, but when we twiddle the operands too far, it gives nonsense.

Query GPT-5 medium thinking on the API with up to 13-digit multiplication of any random numbers you wish (I didn't bother testing higher). Then watch it get it exactly right.
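
If you want to reproduce this, something along these lines works (assuming the OpenAI Python SDK's Responses API; the exact model name and reasoning parameters are whatever is current when you run it):

    # Sketch: test long multiplication with no tool use. Assumes the OpenAI
    # Python SDK's Responses API; model name / reasoning params may differ.
    import random
    from openai import OpenAI

    client = OpenAI()
    a, b = random.randrange(10**12, 10**13), random.randrange(10**12, 10**13)

    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "medium"},
        input=f"Compute {a} * {b}. Reply with only the digits of the product.",
    )

    print("model:", resp.output_text.strip())
    print("truth:", a * b)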

Weeks ago, I got Gemini 2.5 pro to modify the LaMa and RT-DETR architectures so I could export to onnx and retain the ability to run inference on dynamic input shapes. This was not a trivial exercise.

>It answers "All men are mortal. Socrates is a man. Therefore, Socrates is ______", but reword the situation enough and it breaks again.

Do you actually have an example of a rewording SOTA models fail at?


> Query GPT-5 medium thinking on the API with up to 13-digit multiplication of any random numbers you wish (I didn't bother testing higher). Then watch it get it exactly right.

I'm not sure if "on the API" here means "the LLM and nothing else." This is important because it's easy to overestimate the algorithm when you give it credit for work it didn't actually do.

In general, human developers have taken steps to make the LLM transcribe the text you entered into a classically-made program, such as a calculator app, python, or Wolfram Alpha. Without that, the LLM would have to use its (admittedly strong) powers of probabilistic fakery [0].

Why does it matter? Suppose I claimed I had taught a chicken to do square roots. Suspicious, you peer behind the curtain, and find that the chicken was trained to see symbols on a big screen and peck the matching keys on pocket calculator. Wouldn't you call me a fraud for that?

_____________

Returning to the core argument:

1. "Reasoning" that includes algebra, syllogisms, deduction, etc. involves certain processes for reaching an answer. Getting a "good" answer through another route (like an informed guess) is not equivalent.

2. If an algorithm cannot do the algebra process, it is highly unlikely that it can do the others.

3. If an algorithm has been caught faking the algebra process through other means, any "good" results for other forms of logic should be considered inherently suspect.

4. LLMs are one of the algorithms in points 2 and 3.

_____________

[0] https://www.mindprison.cc/p/why-llms-dont-ask-for-calculator...


>I'm not sure if "on the API" here means "the LLM and nothing else." This is important because it's easy to overestimate the algorithm when you give it credit for work it didn't actually do.

That's what I mean, yes. There is no tool use for what I mentioned.

>1. "Reasoning" that includes algebra, syllogisms, deduction, etc. involves certain processes for reaching an answer. Getting a "good" answer through another route (like an informed guess) is not equivalent.

Again, if you cannot confirm that these 'certain processes' are present when humans do it but not when LLMs do it, then your 'processes' might as well be made up.

And unless you concede humans are also not performing 'true algebra' or 'true reasoning', then your position is not even logically consistent. You can't eat your cake and have it.


No. I see AI people use this reasoning all the time and it's deeply misleading.

"You can't explain how humans do it, therefore you can't prove my statistical model doesn't do it" is kinda just the god of the gaps fallacy.

It abuses the fact that we don't understand how human cognition works, and therefore it's impossible to come up with a precise technical description. Of course you're going to win the argument, if you insist the other party do something currently impossible before you will accept their idea.

It's perfectly fine to use a heuristic for reasoning, as the other person did. LLMs don't reason by any reasonable heuristic.


>No. I see AI people use this reasoning all the time and it's deeply misleading. "You can't explain how humans do it, therefore you can't prove my statistical model doesn't do it" is kinda just the god of the gaps fallacy.

No, this is 'stop making claims you cannot actually support'.

>It abuses the fact that we don't understand how human cognition works, and therefore it's impossible to come up with a precise technical description.

Are you hearing yourself? If you don't understand how human cognition works, then any claims about what is and isn't cognition should be taken with less than a grain of salt. You're in no position to be making such strong claims.

If you go ahead and make such claims, then you can be hardly surprised if people refuse to listen to you.

And by the way, we don't understand the internals of Large Neural Networks much better than human cognition.

>It's perfectly fine to use a heuristic for reasoning

You can use whatever heuristic you want and I can rightly tell you it holds no more weight than fiction.


It's the same bitching every time an LLM post can be responded to. ITS NOT THINKING!!! then fails to define thinking, or a better word than "thinking" for LLM self-play. I consider these posts to be on par for quality with "FRIST!!!!!!" posts.


Idk I think saying it’s “computing” is more precise because “thinking” applies to meatbags. It’s emulating thinking.

Really I just think that anthropomorphizing LLMs is a dangerous road in many ways and really it’s mostly marketing BS anyway.

I haven’t seen anything that shows evidence of LLMs being anything beyond a very sophisticated computer system.


Do submarines swim? Thinking is something that doesn’t happen inside a machine. Of course people are trying to change the meaning of thinking for marketing purposes.


Ironically, in the UUV space, they use the term “flying” when talking about controlling UUVs.


It doesn't feel like the wikipedia thing is a good counterpoint. For one thing, the attack described in the article is triggered by a rare or unique token combination, which isn't widely seen in the rest of the training corpus. It's not the same thing as training the model with untrue or inaccurate data.

Equally importantly though, if (according to the article) it takes "just" 150 poisoned articles to poison an LLM, then one article from Wikipedia shouldn't be enough to replicate the effect. Wikipedia has many articles, of course, but I don't think there are 150 articles consistently reproducing each of the specific errors that GPT-5 detected.

edit: correction, 250 articles, not 150


> the attack described in the article is triggered by a rare or unique token combination

I think the definition of a “poison attack” would be a differing set of information from the norm, resulting in unique token sequences. No?

Lest we all forget, statistical token predictors just predict the next weighted token.


Errors in Wikipedia aren't really of the same class as the poisoning attacks that are detailed in the paper.


Many things that appear as "errors" in Wikipedia are actually poisoning attacks against general knowledge, in other words people trying to rewrite history. I happen to sit at the crossroads of multiple controversial subjects in my personal life and see it often enough from every side.


Fnord


yeah, I'm still hoping that Wikipedia remains valuable and vigilant against attacks by the radical right, but it's obvious that Trump and Congress could easily shut down Wikipedia if they set their mind to it.


you're ignoring that both sides are doing poisoning attacks on Wikipedia, trying to control the narrative. it's not just the "radical right"


Not to mention that there is a subset of people who are on neither side and just want to watch the world burn for the sake of enjoying the flames.


I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.


> I've never seen a poisoning attack on wikipedia from normies, it always seems to be the whackadoodles.

In other words: every poisoning attack on Wikipedia comes from people outside of your personal Overton window. [1] :-)

[1] https://en.wikipedia.org/wiki/Overton_window


very true. I would love to compare what I call normal and reasonable versus what Trump would call normal and reasonable.


s/latent reasoning/next token prediction with guardrails


that's not a general substitution, since you omit the latent qualifier.

consider for example an image+text->image model: the image model could have a bottleneck layer (such that training on a dataset forces the model to both compress redundant information towards lossless and omit less relevant information, as the dataset is assumed representative).

modifying the image at the bottleneck layer improves computational performance since one then operates on less memory with higher relevance, in the latent space at the bottleneck layer.
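
to make the bottleneck idea concrete, a toy sketch (PyTorch-style, dimensions made up, purely schematic):

    # toy bottleneck autoencoder, purely schematic: z is small and dense,
    # so editing z (the latent) is cheaper than editing the raw image x.
    import torch.nn as nn

    class BottleneckAE(nn.Module):
        def __init__(self, in_dim=3 * 64 * 64, latent_dim=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))   # compress
            self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))       # reconstruct

        def forward(self, x):
            z = self.enc(x)         # latent code at the bottleneck
            return self.dec(z), z   # operate on z to work "in latent space"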

I understand and somewhat sympathize that you mostly intend to substitute the word "reasoning", but even from the agnostic perspective, the meaning of words in a natural language is determined by how the group of users uses them. I don't see you complain about overloading meanings for 99.99% of other words in our dictionaries; open any and you'll see many.

It's neither proven nor disproven whether machines can think, reason, experience, ... it's an open question, and it will remain open; nobody will ever prove or disprove it, which from a descriptive perspective is not of relevance: even if it could someday be proven or disproven, that does not guarantee the human population at large understands the (dis)proof, and even if they understand the (dis)proof there is no guarantee they will believe it (think of global warming as an example). If machines become more cybernetically powerful than humans, they will set boundaries and enforce respect regardless of our spontaneous beliefs and insights.

It's less a question of humans being able to convince other humans of such and such, and more a question of rates, of what happens first: machines setting boundaries (to live next to humans, in war or in peace) versus some vague "consensus" by "humanity" (by which representation metric? the beliefs of tech leaders? of the media owners? of politicians?).


You don't want your agents to ask questions. You are thinking too short term. It's not ideal now, but agents that have to ask frequent questions are useless when it comes to the vision of totally autonomous coding.

Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try and master an internal system I rarely use; I should instead ask someone who maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.


If you look at a piece of architecture, you might be able to infer the intentions of the architect. However, there are many interpretations possible. So if you were to add an addendum to the building it makes sense that you might want to ask about the intentions.

I do not believe that AI will magically overcome the Chesterton Fence problem in a 100% autonomous way.


AI won't, but humans will, to unencumber AI.


This has been the fundamental issue with the 2.5 line of models. It seems to forget parts of its system prompt and not understand where it's "located".


I assume its tool calling and structured output are way better, but this model isn't in Studio unless it's being silently subbed in.


Just tried it in an existing coding agent and it rejected the requests because computer tools weren't defined.


We can definitely make the docs more clear here but the model requires using the computer_use tool. If you have custom tools, you'll need to exclude predefined tools if they clash with our action space.

See this section: https://googledevai.devsite.corp.google.com/gemini-api/docs/...

And the repo has a sample setup for using the default computer use tool: https://github.com/google/computer-use-preview


So many of these concepts only make sense under the assumption that AI will not get better and humans will continue to pour over code by hand.

They won't. In a year or two these will be articles that get linked back to, similar to the "Is the internet just a fad?" articles of the late 90s.


I disagree. Not every technological advance "improves" at an exponential rate.

The issue is that LLMs don't "understand." They merely copy without contributing original thought or critical thinking. This is why LLMs can't handle complicated concepts in codebases.

What I think we'll see in the long run is:

(Short term) Newer programming models that target LLMs: i.e., describe what you want the computer to do in plain English, and then the LLM will allow users to interact with the program in a more conversational manner. Edit: These will work in "high tolerance" situations where small amounts of error are okay. (Think analog vs digital, where analog systems tend to tolerate error more gracefully than digital systems.)

(Long term) Newer forms of AI that "understand." These will be able to handle complicated programs that LLMs can't handle today, because they have critical thinking and original thought.


A couple of those articles, in case anyone is interested:

“The Internet? Bah! Hype alert: Why cyberspace isn't, and will never be, nirvana” by Clifford Stoll (1995)

Excerpt: “How about electronic publishing? Try reading a book on disc. At best, it's an unpleasant chore: the myopic glow of a clunky computer replaces the friendly pages of a book. And you can't tote that laptop to the beach. Yet Nicholas Negroponte, director of the MIT Media Lab, predicts that we'll soon buy books and newspapers straight over the Internet. Uh, sure.”

https://www.nysaflt.org/workshops/colt/2010/The%20Internet.p...

“Why most economists' predictions are wrong” by Paul Krugman (1998)

Excerpt: “By 2005 or so, it will become clear that the Internet's impact on the economy has been no greater than the fax machine's.”

https://web.archive.org/web/19980610100009/http://www.redher...


Krugman's quotes are even worse in full:

The growth of the Internet will slow drastically, as the flaw in "Metcalfe's law"--which states that the number of potential connections in a network is proportional to the square of the number of participants--becomes apparent: most people have nothing to say to each other! By 2005 or so, it will become clear that the Internet's impact on the economy has been no greater than the fax machine's.

As the rate of technological change in computing slows, the number of jobs for IT specialists will decelerate, then actually turn down; ten years from now, the phrase information economy will sound silly.


If you assume Krugman was talking about a positive impact, it makes sense to make fun of him.


Or it could also be like blockchain and NFTs...


I have been programming as a hobby for almost 20 years. At least for me, there is huge value in using LLMs for code. I don't need anyone else's permission, nor anyone else to participate, for the LLMs to work for me. You absolutely cannot say that about blockchain, NFTs, or crypto in general.


Nah, that comparison doesn't make sense.

There is certainly real market penetration with LLMs. However, there is a huge gap between fantasy and reality - as in what is being promised vs what is being delivered - and the effects on the economy are yet to play out.


This requires an assumption that LLM capability growth will continue on an exponential curve, when there are already signs that in reality the curve is logistic


Two more weeks and "AI" will finally be intelligent. Trust the plan.


I know this is sarcasm, but if you've been using LLMs for more than two weeks you've probably noticed significant improvements in both the models and the tooling.

Less than a year ago I was generating somewhat silly and broken unit tests with Copilot. Now I'm generating entire feature sets while doing loads of laundry.


That's all true, yet the problem of hallucinations is as stark today as it was three years ago when GPT-3.5 was all the rage. Until that is solved, I don't think there's any amount of "smartness" of the models that can truly compensate for it.


Exactly so. From the article:

> But those of us who’ve experimented a lot with using LLMs for code generation and modification know that there will be times when the tool just won’t be able to do it.

The pace of change here--the new normal pace--has the potential to make this look outdated in mere months, and finding that the curve topped out exactly in late 2025, such that this remains the state of development for many years, seems intuitively very unlikely.


Just like how in a year or two we will have fully self-driving cars, right?

The last percentage points to get something just right are the hardest. Why are you so sure that the flaws in LLMs will be gone in such a short time frame?


We do have fully self-driving cars. You can go to a number of American cities and take a nap in the backseat of one while it drives you around safely.


But it's not fully self-driving. SF Waymo can't bring you to the airport. You missed OP's point, which was that the last few percentage points are the hardest.


They recently got approval for the airport, and the issue was legal / regulatory, not technical. They could have been doing rides from the airport years ago.


pore*

