
Every single task that was easy and economical to offload to a single purpose robot arm bolted down to the floor was already offloaded to a single purpose robot arm bolted down to the floor.

What remains is: all those quirky little one-off processes that aren't very amenable to "robot arm" automation, aren't worth the process design effort to make them amenable to it, and are currently solved by human labor.

Thus, you design new solutions to target that open niche.

Humans aren't perfect at anything, but they are passable at everything. Universal worker robots attempt to replicate that.

"A drop-in replacement for simple human labor" is a very lucrative thing, assuming one could pull it off. And that favors humanoid hulls.

Not that the form is the real bottleneck. The problem of universal robots is fundamentally an AI problem. Today, we could build a humanoid body mechanically capable of performing over 90% of the industrial tasks humans do, but not the AI that would actually drive it.


My impression is that a big part of the reason for the sudden boom in humanoid robots is that they lend themselves particularly well to RL-based training on human-made footage captured via VR. It’s much easier to have a robot broadly copy human actions if the robot looks like a human, instead of having to first translate each human action into your robot arm's equivalent.

The big part is the rise of modern AI in general.

The success of large multipurpose AI models trained on web-scale data pushed a lot of people towards "cracking general purpose robot AI might be possible within a decade".

Whether transfer learning from human VR/teleop data is the best way to do it remains uncertain - there are many approaches towards training and data collection. Although transfer learning from web-scale data, teleoperation and "RL IRL" are common - usually on different ends of the training pipeline.

Tesla got the memo earlier than most, because Musk is a mad bleeding edge technology demon, but many others followed shortly before or during the public 2022 AI boom.


That is certainly a factor, but you also have to take into account that all these tasks in the factories are now centered around the human form because humans are doing them.

This framing clarifies something people get wrong about humanoid robots. The competition isn't "humanoid vs. better robot" — it's "humanoid vs. hiring another person."

And that reframes the economics entirely. You don't need the robot to be better than a human at any given task. You need the total cost of ownership to be lower than salary, benefits, turnover, and training. That's a much easier bar to clear once the AI catches up to the body.

The interesting question is whether the AI problem gets solved generally (one model that can do everything) or whether we end up with task-specific AI in a general-purpose body — basically the robot arm paradigm wearing a humanoid suit.


Em-dashes aside, I favor "one model that can do everything" in principle because scaling laws and distillation exist, and in practice because "one model that you can point at any problem" is a massive operational advantage.

If you can get 5 specialist models that can use the same robot body, you can also get 1 generalist model with more capacity and fold the specialists into it. If you have the in-house training pipeline that made those specialists, apply it to the generalist instead, the way we give general-purpose AIs coding-specific training. If you don't, take the specialists as is and distill from them.

If you do it right, transfer learning might even give you a model that generalizes better and beats the specialists at their own game. Your "special" tasks have partial subtask overlap, which the stronger training covers, and they add diversity of environments. Robotics AI is training-data starved as a rule.

Same kind of lesson we learned with LLM specialists - invest into a specialist model and watch the next gen generalists with better data and training crush it.
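A rough sketch of what "distill from them" means in practice: the standard knowledge-distillation loss matches the student's temperature-softened output distribution to the teacher's. This is a toy single-example version in pure Python; all logit values are illustrative, and real pipelines do this over batches with gradients.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over one example's logits."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.

    Zero when the student already matches the teacher; positive otherwise.
    Minimizing this is how a specialist gets "folded into" a generalist.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because KL divergence is non-negative and zero only at a perfect match, the loss directly measures how far the generalist still is from the specialist's behavior.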


Yes, that's pretty much it. Some people from Boston Dynamics were talking on a podcast, and they said that they sat down with Toyota and figured out they could automate all the tasks in a factory, but it would take 10,000 man-years or something, and Toyota makes new trims every six months, so you'd need about 10,000 man-years every six months or so.

It's the flexibility and adaptability with minimal training that's required.


I think this is the podcast you mentioned:

https://youtu.be/SRZ9E48B6aM?si=K_wwvu97agBZpFTa


> Every single task that was easy and economical to offload to a single purpose robot arm bolted down to the floor was already offloaded to a single purpose robot arm bolted down to the floor.

What about doing dishes? That could be done with one arm. Maybe not easy and economical yet, but could be.

There is plenty that has not been seen through.

Laundry folding machines are not in wide distribution.

Robots to put away laundry?

Etc. lots of mundane tasks.


Yeah, I think there's plenty of room for more bolted-arm robots; it's just that, like the humanoids, they need better AI. There's also room for more optimisation of the entire system design around more specialized robots. I think some industries work really well for that kind of revamp, and have already begun doing so. Others are waiting for the cost curve to fall for it to be worth the investment.

A lot of that seems to be the usual "you're training them wrong".

Sonnet 3.5 is old hat, and today's Sonnet 4.6 ships with an extra long 1M context window. And performs better on long context tasks while at it.

There are also attempts to address long context attention performance on the architectural side - streaming, learned KV dropout, differential attention. All of which can allow LLMs to sustain longer sessions and leverage longer contexts better.

If we're comparing to wet meat, then the closest thing humans have to context is working memory. Which humans also get a limited amount of - but can use to do complex work by loading things in and out of it. Which LLMs can also be trained to do. Today's tools like file search and context compression are crude versions of that.


I know Sonnet 4.6 has a 1M context window. I use it every day. But in my experience with Claude Code and Cursor, performance clearly drops between 20k and 200k context. External memory is where the real fix is, not bigger windows.

The only real way to unfuck your foreign language is to use it. Which does mean accepting that you won't be perfect doing it.

Given the mechanistic interpretability findings? I'm not sure how people still say shit like "no real world model" seriously.

People just overstate their understanding and knowledge, the usual human stuff. The same user has a comment in this thread that contains:

'If you actually know what models are doing under the hood to product output that...'

Anyone who tells you they know 'what models are doing under the hood' simply has no idea what they're talking about, and it's amazing how common this is.


Fair, I should define what I mean by under the hood. By “under the hood” I mean that models are still just being fed a stream of text (or other tokens, in the case of video and audio models), being asked to predict the next token, and then doing that again. There is no technique that anyone has discovered that is different from that, at least not in production. If you think there is, and people are just keeping it secret, well, you clearly don’t know how these places work.

The elaborations that make this more interesting than the original GPT/attention stuff are: 1) there is more than one model in the mix now, even though you may only be told you’re interacting with “GPT 5.4”; 2) there’s a significant amount of fine-tuning with RLHF in specific domains that each lab feels is important to be good at because of benchmarks, strategy, or just conviction (DeepMind, we see you). There’s also a lot of work being put into speeding up inference and making it cheaper to operate. I probably shouldn’t forget tool use, either, since that’s the only reason they can count the r’s in strawberry these days.

None of that changes the concept that a model is just fundamentally very good at predicting what the next element in the stream should be, modulo injected randomness in the form of a temperature. Why does that actually end up looking like intelligence? Well, because we see the model’s ability to be plausibly correct over a wide range of topics and we get excited.

Btw, don’t take this reductionist approach as being synonymous with thinking these models aren’t incredibly useful and transformative for multiple industries. They’re a very big deal. But OpenAI shouldn’t give up because Opus 4.whatever is doing better on a bunch of benchmarks that are either saturated or in the training data, or have been RLHF’d to hell and back. This is not AGI.
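The "predict the next token, modulo injected randomness in the form of a temperature" loop described above can be sketched in a few lines. This is only the sampling step; the logits are toy values standing in for a real model's output.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick a token id from raw model scores (logits).

    Temperature < 1 sharpens the distribution (more deterministic);
    temperature > 1 flattens it (more random). The logits here are
    illustrative - a real model emits one score per vocabulary entry.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Inverse-CDF sampling from the softmax distribution.
    r = random.random()
    cum = 0.0
    for token_id, p in enumerate(probs):
        cum += p
        if r < cum:
            return token_id
    return len(probs) - 1
```

At very low temperature the argmax token is chosen essentially every time, which is why "temperature 0" is shorthand for greedy decoding.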


Everybody says "but they just predict tokens" as if that's not just "I hope you won't think too much about this" sleight of hand.

Why does predicting the next token mean that they aren't AGI? Please clarify the exact logical steps there, because I make a similar argument that human brains are merely electrical signals propagating, and not real intelligence, but I never really seem to convince people.


Or take an episode like Loops from Radiolab, where a person's memory resets back to a specific set of inputs/state and she responds pretty much the same way over and over again - very much like predicting the next token. Almost all human interaction is reflexive, not thoughtful. Even now, as you read this and process it, there's not a lot of thought - but a whole lot of prediction and pattern matching going on.

"Predict next token" describes an interface. That tells you very little of what actually goes on inside the thing.

You can "predict next token" using a human, an LLM, or a Markov chain.
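To make that concrete, here is a complete "next token predictor" that is just a bigram Markov chain: the same interface as an LLM's decode loop, with none of the internal machinery. The corpus is made up for illustration.

```python
import random
from collections import defaultdict

# Toy corpus - a real Markov chain would be built from far more text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Bigram transition table: word -> list of observed successors.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def predict_next(token):
    """Return a plausible next token, or None if the token was never
    seen as a predecessor. Repeated successors are naturally weighted
    by how often they occurred."""
    options = transitions.get(token)
    return random.choice(options) if options else None
```

Both this and a frontier LLM "predict the next token"; the interface tells you nothing about what happens in between.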


Because there are some really fundamental things they cannot do with next-token prediction. For instance, their memory is akin to someone who reads the phone book and memorizes the entire thing, but can't tell you what a phone number is for. Moreover, they can mimic semantic knowledge, because they have been trained on that knowledge, but take them out of their training distribution and they get into a "creative story-telling" mode very quickly.

They can quote you all the rules of chess, but when it comes to actually making a chess move they break those rules with abandon, simply because they never actually understood them. Chess is instructive in another way, too: you can get them to play a pretty solid opening game, maybe 10 or 15 moves in, but then they start forgetting pieces, creating board positions that are impossible to reach, etc. They have memorized the forms of a board and know the names of the pieces, but they have no true understanding of what a chess game is.

Coding is similar: they're fine when you give them Python or Bash shell scripts to write, since they've been heavily trained on those, but ask them to deal with a system that has a non-standard stack and they will go haywire if you let their context get even medium-sized.

Something else they lack is any kind of learning efficiency as you or I would understand the concept. By this I mean the entire Internet is not sufficient to train today's models; the labs have to synthesize new data for models to train on to get sufficient coverage of a given area they want the model to be knowledgeable about. Continuous learning is a well-known issue as well: they simply don't do it. The labs have created memory, which is just more context engineering, but it's not the same as updating as you interact with them. I could go on.

At the end of the day, next-token prediction is a sleight of hand. It produces amazingly powerful effects, I agree. You can turn this one magic trick into the illusion of reasoning, but what it's doing is more of a "one thing after another" style of story-telling that is fine for a lot of things, but doesn't get to the heart of what intelligence means. If you want to call them intelligent because they can do this stuff, fine, but it's an alien kind of intelligence that is incredibly limited. A dog or a cat actually demonstrates more ability to learn, to contextualize, and to make meaning.


You didn't actually give an example of what the issue with next token prediction is. You just mentioned current constraints (ie generalization and learning are difficult, needs mountains of data to train, can't play chess very well) that are not fundamental problems. You can trivially train a transformer to play chess above the level any human can play at, and they would still be doing "next token prediction". I wouldn't be surprised if every single thing you list as a challenge is solved in a few years, either through improvement at a basic level (ie better architectures) or harnessing.

We don't know how human brains produce intelligence. At a fundamental level, they might also be doing next token prediction or something similarly "dumb". Just because we know the basic mechanism of how LLMs work doesn't mean we can explain how they work and what they do, in a similar way that we might know everything we need to know about neurons and we still cannot fully grasp sentience.


I use the chess example because it's especially instructive. It would NOT be trivial to train an LLM to play chess; next-token prediction breaks down when you have so many positions to remember and you can't adequately assign value to intermediate positions. Chess bots work by being trained to assign value to a position, something fundamentally different from what an LLM is doing.

A simpler example — without tool use, the standard BPE tokenization method made it impossible for state of the art LLMs to tell you how many ‘r’s are in strawberry. This is because they are thinking in tokens, not letters and not words. Can you think of anything in our intelligence where the way we encode experience makes it impossible for us to reason about it? The closest thing I can come to is how some cultures/languages have different ways of describing color and as a result cannot distinguish between colors that we think are quite distinct. And yet I can explain that, think about it, etc. We can reason abstractly and we don’t have to resort to a literal deus ex machina to do so.

Not being able to explain our brain to you doesn’t mean I can’t notice things that LLMs can’t do, and that we can, and draw some conclusions.


There are chess engines based on transformers; DeepMind even released one [1]. It achieved ~2900 Elo. It does have peculiarities, for example in the endgame, that likely derive from its architecture, but I think it definitely qualifies as an example of the fact that simply being a next-token predictor doesn't mean something cannot perform tasks that require intelligence and planning.

The r's in strawberry is more of a fundamental limitation of our tokenization procedures, not the transformer architecture. We could easily train an LLM with byte-sized tokens that would nail those problems. It can also be easily fixed with harnessing (i.e., for this class of problems, write a script rather than solve it yourself). I mean, we do this all the time ourselves; even mathematicians and physicists will run to a calculator for all kinds of problems they could in principle solve in their heads.

[1] https://arxiv.org/abs/2402.04494
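The "write a script" fix really is this small once the model can call a tool that works at the character level instead of the token level (hypothetical helper, shown for illustration):

```python
def count_letter(word, letter):
    # Trivial at the character level. The difficulty for an LLM is that
    # BPE tokens, not individual characters, are what the model sees.
    return word.count(letter)
```

A model that emits this call and reads the result sidesteps the tokenization blind spot entirely.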


But chess models aren't trained the same way LLMs are trained. If I'm not mistaken, they are trained directly from chess moves using pure reinforcement learning, and it's definitely not trivial; for instance, AlphaZero took 64 TPUs to train.

You can train them in a very similar way.

Modern LLMs often start with "imitation learning" pre-training on web-scale data and continue with RLVR for specific verifiable tasks like coding. You can pre-train a chess engine transformer on human or engine chess games, "imitation learning" mode, and then add RL against other engines or as self-play - to anneal the deficiencies and improve performance.

This was used for a few different game engines in practice. Probably not worth it for chess unless you explicitly want humanlike moves, but games with wider state and things like incomplete information benefit from the early "imitation learning" regime getting them into the envelope fast.


I meant trivial in the sense it's a solved problem, I'm sure it still costs a non-negligible amount of money to train it. See for example the chess transformer built by DeepMind a couple of years ago which I referred to in a sibling comment [1].

[1] https://arxiv.org/abs/2402.04494


None of this is a logical certainty of "X, therefore Y", it's just opinions. You can trivially add memory to a model by continuing to train it, we just don't do it because it's expensive, not because it can't be done.

Also, the phone book example is off the mark, because if I take a human who's never seen a phone and ask them to memorise the phone book, they would (or not), while not knowing what a phone number was for. Did you expect that a human would just come up on knowledge about phones entirely on their own, from nothing?


Next token prediction is about predicting the future by minimizing the number of bits required to encode the past. It is fundamentally causal and has a discrete time domain. You can't predict token N+2 without having first predicted token N+1. The human brain has the same operational principles.
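The "minimizing the number of bits required to encode the past" framing can be made concrete: a next-token model's loss is the code length of the data under its predictions. The toy probabilities below are illustrative.

```python
import math

def code_length_bits(model_probs, sequence):
    """Total bits to encode `sequence` given the model's per-token
    probabilities: the sum of -log2 p(token). A better predictor
    assigns higher probability to what actually comes next, so the
    code length shrinks - which is what the training loss minimizes."""
    return sum(-math.log2(model_probs[token]) for token in sequence)

# Toy "model": predicts 'a' half the time, 'b' and 'c' a quarter each.
toy_probs = {"a": 0.5, "b": 0.25, "c": 0.25}
```

Under these probabilities, encoding "a" costs 1 bit and "b" costs 2 bits, matching the classic entropy-coding picture.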

Next-token prediction is just the training objective. I could describe your reply to me as “next-word prediction” too, since the words necessarily come out one after another. But that framing is trivial. It tells you what the system is being optimized to do, not how it actually does it.

Model training can be summed up as: 'This is what you have to do (objective), figure it out. Here's a little skeleton that might help you out (architecture).'

We spend millions of dollars and months training these frontier models precisely because the training process figures out numerous things we don't know or understand. Every day, Large Language Models, in service of their reply, in service of 'predicting the next token', perform sophisticated internal procedures far more complex than anything any human has come up with or possesses knowledge of. So for someone to say that they 'know how the models work under the hood', well it's all very silly.


> Btw, don’t take this reductionist approach as being synonymous with thinking these models aren’t incredibly useful and transformative for multiple industries. They’re a very big deal. But OpenAI shouldn’t give up because Opus 4.whatever is doing better on a bunch of benchmarks that are either saturated or in the training data, or have been RLHF’d to hell and back. This is not AGI.

It's sad that you have to add this postscript lest you be accused of being ignorant or anti-AI because you acknowledge that LLMs are not AGI.


If you typed your comment by reading all the others in the chain, then typed your response in one go, then you 'just' did next-token prediction based on textual input.

I would still argue that does not prevent you from having intelligence, so that's why this argument is silly.


They have a _text_ model. There is some correlation between the text model and the world, but it’s loose and only because there’s a lot of text about the world. And of course robotics researchers are having to build world models, but these are far from general. If they had a real world model, I could tell them I want to play a game of chess and they would be able to remember where the pieces are from move to move.

What makes you think that text is inherently a worse reflection of the world than light is?

All world models are lossy as fuck, by the way. I could give you a list of chess moves and force you to recover the complete board state from it, and you wouldn't fare that much better than an off the shelf LLM would. An LLM trained for it would kick ass though.


> I could give you a list of chess moves and force you to recover the complete board state from it, and you wouldn't fare that much better than an off the shelf LLM would

idk, I would expect anyone with an understanding of the rules of chess, and of whatever notation the moves are in, to be able to do it reasonably well? Does that really sound so hard to you? People used to play correspondence chess. Heck, I remember people doing it over email.

In comparison, current AI models start to completely lose the plot after 15 or so moves, pulling third, fourth and fifth bishops, rooks etc. out of thin air, claiming checkmate erroneously, etc., to the point that it's not possible to play a game with them in a coherent manner.


I would expect that off the shelf GPT-5.4 would be able to do it when prompted carefully, yes. Through reasoning - by playing every move step by step and updating the board one move at a time to arrive at a final board state.

On the other hand, recovering the full board state in a single forward pass? That takes some special training.

Same goes for meatbag chess. A correspondence chess aficionado might be able to take a glance at a list of moves and see the entire game unfold in his mind's eye. A casual player who only knows how to play 600-Elo chess on a board that's in front of him would have to retrace every move carefully, and might make errors while at it.
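The "retrace every move carefully" procedure is entirely mechanical. A toy sketch using from-square/to-square coordinate moves like "e2e4" (no legality checks, castling, or promotion - an illustration, not a chess engine):

```python
def initial_board():
    """Standard starting position as a dict of square -> piece code,
    e.g. 'wP' for a white pawn, 'bK' for the black king."""
    board = {}
    back_rank = ["R", "N", "B", "Q", "K", "B", "N", "R"]
    for i, file in enumerate("abcdefgh"):
        board[file + "1"] = "w" + back_rank[i]
        board[file + "2"] = "wP"
        board[file + "7"] = "bP"
        board[file + "8"] = "b" + back_rank[i]
    return board

def replay(moves):
    """Recover the board state by applying moves one at a time,
    the way a careful human (or a step-by-step reasoning model) would."""
    board = initial_board()
    for move in moves:
        src, dst = move[:2], move[2:]
        board[dst] = board.pop(src)  # a capture simply overwrites the target
    return board
```

Recovering the same state in a single forward pass, with no intermediate steps, is the much harder trick.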


Try to play a simple over-the-board-style game with 5.4 using whatever notation you choose (or just descriptions, literally anything). Prediction: it will start out fine, but in the midgame it will be very hard to keep on track, and the endgame will make you give up.

> What makes you think that text is inherently a worse reflection of the world than light is?

What does the color green look like?


A color without form can't look like anything.

It doesn't look like anything to me.

"What makes you think that text is inherently a worse reflection of the world than light is?"

Come on man, did you think before you asked that one :)?


People are finding it hard to grasp that emergent properties can appear at very large scales and dimensions.

The "fundamental limitations" being what exactly?

I used to think it was the quadratic complexity of attention, but I guess that's not a concern anymore now that there are more hardware-aware attention kernels. The other one I remember is continual learning, but that may be solved in the near-term future. I'm not completely confident about it.

Humans do have an upper limit on how much working memory they have. Which I see as the closest thing to the "O(N^2) attention curse" of LLMs.

That doesn't stop an LLM from manipulating its context window to take full advantage of however much context capacity it has. Today's tools like file search and context compression are crude versions of that.
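The "O(N^2) attention curse" mentioned above is just the score matrix: every token attends to every other token. A back-of-the-envelope cost model (a toy count that ignores constants, softmax, and the rest of the transformer):

```python
def attention_score_ops(n_tokens, head_dim):
    """Rough multiply-add count for the raw QK^T score matrix in
    vanilla self-attention. Each of the n_tokens queries is dotted
    with each of the n_tokens keys, and each dot product costs
    head_dim multiply-adds - so cost grows as O(n^2 * d)."""
    return n_tokens * n_tokens * head_dim
```

Doubling the context quadruples this term, which is exactly why the architectural workarounds in the thread exist.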


Human brain's prediction loop is bayesian in nature.

Damn, the research moves fast. I was wrong again: https://arxiv.org/abs/2507.11768

The divide seems to come down to: do you enjoy the "micro" of getting bits of code to work and fit together neatly, or the "macro" of building systems that work?

If it's the former, you hate AI agents. If it's the latter, you love AI agents.


I'd say that the divide seems to come down to whether you want to be a manager or a hacker. Skimming the posts in this submission, many of the most enamored with LLMs seem to be project managers, people managers, principal+ engineers who don't code much anymore, and other not hands-on people who are less concerned with quality or technical elegance than getting some kind of result.

Bear in mind also that the inputs to train LLMs on future languages and frameworks necessarily have to come from the hacker types. Somebody has to get their hands dirty, the "micro" of the parent post, to write a high quality corpus of code in the new tech so that LLMs have a basis to work from to emit their results.


I want to "hack" at a different level.

What I want to do is create bespoke components that I can use to create a larger solution to solve a problem I have.

What I don't want to do is spend 45 minutes wrangling JSON to a struct so that I can get the damn component working =)

A quick example: I wanted a component that could see if I have new replies on HN using the Algolia API. ~10 minutes of wall-clock time with Claude, maybe a minute of my time. Just reading through the API spec is 15 minutes. Not my idea of fun.


I think it's pretty obvious what category you see yourself in.

I don't think you're a hacker. I think you enjoy writing code (good for you). Some of us just enjoy making the computer execute our ideas - like a digital magician. I've also gotten very good at the code writing and debugging part. I've even enjoyed it for long periods of time but there's times where I can't execute my ideas because they're bigger than what I can reasonably do by myself. Then my job becomes pitching, hiring, and managing humans. Now I write code to write code and no project seems too big.

But I'm looking forward to collapsing the many layers of abstraction we've created to move bits and control devices. It was always about what we could do with the computers for me.


“Technical excellence” has never been about whether you are using a for loop or while loop. It’s architecture, whether you are solving the right problem, scalability, etc

Performance critical applications (game engines etc) don't agree with that

Most people aren’t writing game engines. Hell, most people at BigTech aren’t worried about scalability. They are building on top of scalable internal frameworks - not code frameworks, but things like Google Borg.

The reason your login is slow is not because someone didn’t use the right algorithm.

Most game developers are just using other company’s engines.

While yes, you need to learn the architecture, the code isn’t the gating factor.

One example is the Amazon Prime Video team using AWS Step functions when they shouldn’t have and it led to inefficiencies. This was a public article that I can’t find right now.

(And before someone from Amazon Retail chimes in and says much of Amazon Retail doesn’t run on AWS and uses the legacy CDO infrastructure - yes I know. I am a former AWS employee).


That is an amazing summary. It might not seem that amazing, but I feel like I've read pages about this, and nothing has expressed it as elegantly and succinctly.

I do love the former, but it's been nice to take a break from that and work at a higher level of abstraction.

Same. After 40+ years of typing code on a keyboard, my hands aren't as nimble as they were; a little pain sometimes builds up (whether it's arthritis or carpal tunnel or something, I'm not sure). Being able to have large amounts of code written with much less input is a godsend - and it's been great to learn and see what models like Claude can really do, if you can remain organized and focused on the APIs/interfaces.

Do you have WisprFlow or similar STT setup? It's a real Star Trek moment vocally telling my computer what to build, and then to have it build it.

I tried WisprFlow after you mentioned it, but after spending ages clicking through all the dialogs I found it didn't work out of the box with my terminal (I use the Claude CLI almost exclusively). Could have been something wrong with my terminal, I guess, since I wrote my own.

I enjoy both. There’s still plenty of micro to do even in web dev if you have high standards. Read Claude’s output and you’ll find issues. Code organization, style, edge cases, etc.

But the important thing is getting solutions to users. Claude makes that easier.


> do you enjoy the "micro" of getting bits of code to work and fit together neatly, or the "macro" of building systems that work?

These are not toys. I want to make money. The customers want feature after feature, in a steady stream. It's bad business if the third or fourth feature takes ages. The longer the stream, the better financially.

That the code "works" on any level is elementary, Watson, what must "work" is that stream of new features/pivots/redesigns/fixes flowing.


How dare you go online without a clean IP at a first world country home ISP! You should be subjected to 99 captchas a minute for it!

It’s even a residential IP from a German tier 1 ISP, as reputable as it gets. Works fine on computers and for everyone around me.

But somehow, Turnstile seems to think that traffic from a Linux phone == robot traffic.


Not a given. We've already seen LLMs that got SFT'd by "national teams" adopt ESL speech patterns.

They won’t make punctuation mistakes though.

Wouldn't they do exactly that if they were trained on enough text with punctuation mistakes?

No because of post training

Good. GeoIP should be dead, and "IP reputation" should be meaningless garbage.

IP reputation is only as meaningful as the duration of ownership. If it's the same owner for years, then reputation is meaningful and should count. If it changes hands every 6 hours, being assigned to VPS clients or whatnot, then make the reputation stick to the /24 owner, and so on, with varying degrees of scope and duration, so that the responsible parties - the shady companies renting their IPs to bad people - actually have their reputations stick. Then block the /24 or larger subnets, or aggressively block all ranges owned by the company, isolating them and their clients, good and bad.

That sort of pressure can work. But then you risk brigading and activist fueled social media mobs and that's definitely no way to run the internet.
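A minimal sketch of the "reputation sticks to the /24, not the individual address" idea; the score store, function names, and threshold are all made up for illustration.

```python
import ipaddress
from collections import defaultdict

def reputation_key(ip):
    """Aggregate reputation at the /24 level (IPv4) so it attaches to
    the block's owner rather than to a short-lived individual address."""
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net)

# Toy score store: abuse reports accumulate against the whole /24.
scores = defaultdict(int)

def report_abuse(ip, weight=1):
    scores[reputation_key(ip)] -= weight

def is_blocked(ip, threshold=-10):
    # Every address in a bad /24 inherits the block, including ones
    # that never attacked anyone themselves.
    return scores[reputation_key(ip)] <= threshold
```

The same scheme extends to larger aggregates (whole ASNs, all ranges of one owner) by swapping out `reputation_key`.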


What's the purpose of blocking them, anyway? Is it to make you feel good? To clean up logs? To reduce spam? With the residential proxy industry - which, I note, is directly boosted by such blocking practices and funnels money into organized crime - IPs don't mean a whole lot to those who can pay.

100% agree with your point regarding long term ownership allowing for meaningful reputation.

I don't necessarily think that's 'no way to run the internet' or even 'no way to run anything', in that people can choose to whom they listen in regards to blocking, protesting, boycotting.

As long as none of the different groups of opinions are forced on anyone else, then pick and choose those you apply and those you ignore.

With my lists of blocking, I classify them, personally, into different tiers such as Basic, Recommended, Aggressive, and Paranoid when I apply the rules to other people's (family) setups - I'm the only one that uses Paranoid.


How do you protect against DDoS?

Temporary blocks if and when you are actually being DDoSed, presumably?

Large DDoS botnets will have hundreds of thousands of return-path-capable IP addresses. Your temporary blocks will have to be very sensitive (i.e. trigger on a relatively small number of requests within the time window) for an application-level DDoS to be usefully mitigated.

So how does your other plan solve that?

Once an IP in a botnet attacks someone, it ends up on a blocklist and can’t attack anyone else who uses that blocklist. This is a big part of Cloudflare’s DDoS model: if you attack one CF property (with non-volumetric DDoS) you will not be able to attack any others with the same bot for an extended period. This makes attacks to CF properties limited in scope and way more costly, because you have to essentially “burn” IP addresses after sending relatively little traffic.

How long does it take for a whole major ISP, say Verizon, to get on your blocklist?

Considering nobody blocks the entirety of Verizon, apparently a long time. You can act like this is some insane plan, but it’s happening all the time and while it can lead to annoyance for end users the internet chugs on. Which it wouldn’t if there was no way to mitigate DDoS other than rate limits.

Not to a significant degree.

Preventing 1% of sunlight from hitting Earth is more than enough to offset climate change heating. It's not enough to make agriculture or photovoltaics uneconomical. In many regions, it might make agriculture more viable on the net, not less - by reducing climate risks.

