Bitter Lesson is about AI agents (ankitmaloo.com)
139 points by ankit219 81 days ago | 105 comments



For a blog post of 1,200 words, the Bitter Lesson has done more damage to AI research and funding than blowing up a nuclear bomb at NeurIPS would.

Every time I try to write a reasonable blog post about why it's wrong it blows up to tens of thousands of words and no one can be bothered to read it, let alone the supporting citations.

In the spirit of low-effort anecdata pulled from memory:

The raw compute needed to brute force any problem can only be known after the problem is solved. There is no sane upper limit to how much computation, memory and data any given task will take and humans are terrible at estimating how hard tasks actually are. We are after all only 60 years late for the undergraduate summer project that would solve computer vision.

Today VLMs are the best brute force approach to solving computer vision we have, and they look like they will take a PB of state to solve and the compute needed to train them will be available some time around 2040.

What do we do with the problems that are too hard to solve with the limited compute that we have? Lie down for 80 years and wait for compute to catch up? Or solve a smaller problem using specialized tricks that don't require a $10B super computer to build?

The bitter lesson is nothing of the sort, there is plenty of space for thinking hard, and there always will be.


In my experience, "the bitter lesson" is just wrong. Yes, a hypothetical AI model that has a suitable architecture so that it can memorize your handcrafted features can improve upon them, if it is perfectly trained. But that's a huge if. There's a reason that large AI models only work with some architectures and you need to randomly initialize tens to hundreds of times and you need exactly the right optimizer with the right hyperparameters and then the right training data in the right order... and the reason is that generic function approximators get stuck in a local minima incredibly easy.

People used to write handcrafted features to constrain the optimization search space. Now people use specific pretraining data and initialization functions to do the same. It's just a different way to express the same constraints.

In short, the "bitter lesson" assumes a change that did not happen yet.


>and the reason is that generic function approximators get stuck in a local minima incredibly easy.

The key insight from massively overparameterized models (aka LLMs) has been that all the local minima are very similar. Picking one over the other doesn't actually benefit you that much.


Absolutely not! LLM training quite easily gets stuck. People call it "divergence" or "loss spike". See for example:

https://dl.acm.org/doi/10.5555/3524938.3525354

"Gradient optimization with attention layers can be notoriously difficult requiring tricks such as learning rate warmup to prevent divergence.“

"we then propose a new weight initialization scheme with theoretical justification, that enables training without warmup"

Or "Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes"

https://arxiv.org/html/2410.05052v1

If all local minima were similar, then these papers would represent a lot of wasted effort. Or said in a different way: If labs like DeepMind are willing to pay for large teams that are trying to optimize initialization, then surely this represents a real existing problem.
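For illustration, here is a minimal sketch (in Python, with made-up constants, not taken from either paper) of the kind of learning rate warmup schedule these papers are trying to make unnecessary:

    import math

    def lr_at_step(step, base_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=3e-5):
        # Linear warmup: ramp the LR up so early, badly-scaled gradients
        # from attention layers don't blow up the optimizer state.
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        # After warmup, cosine-decay down to min_lr.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

    for s in (0, 1000, 2000, 50_000, 100_000):
        print(s, round(lr_at_step(s), 6))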


There are basically no people who have done pre-training of LLMs.

What you're describing is absolutely the case, but everyone not in a top 100 AI lab would have never had the displeasure of having to deal with that.

That said, I strongly suggest everyone follow along with a GPT-2 tutorial like https://github.com/karpathy/minGPT to get a feel for what the big boys have to deal with. Any $100,000 training run can end in failure for no reason whatsoever because you just got unlucky with your random seed.
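As a toy illustration of that seed sensitivity (this is not minGPT, just a tiny PyTorch model on synthetic data, with every number made up for the example), training the same model on the same data with only the initialization seed changed can land at visibly different final losses:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(256, 16)
    Y = torch.sin(X.sum(dim=1, keepdim=True))  # fixed synthetic target

    def train_once(seed, steps=500):
        torch.manual_seed(seed)  # only the initialization changes per run
        model = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 1))
        opt = torch.optim.SGD(model.parameters(), lr=0.5)  # deliberately aggressive lr
        for _ in range(steps):
            opt.zero_grad()
            loss = ((model(X) - Y) ** 2).mean()
            loss.backward()
            opt.step()
        return loss.item()

    for seed in range(5):
        print(seed, round(train_once(seed), 4))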

If you want something that is both SOTA and cheap: https://huggingface.co/blog/modernbert. You can get from pre-training to fine-tuning for a couple hundred dollars with a model that is better than anything available to the public. BERT models are so under-researched that you can legitimately push the state of the art for them on a workstation.

Also feel free to drop a line at my profile's address.


nit: a divergent training run is somewhat separate from the question of local minima. The latter observes the loss curve stabilize and asks 'is this the _best_ (loss-minimizing) set of parameters for this loss/data/model/hypers/seed/etc.?'. The former observes the loss curve explode, and asks for another 10k GPU hours to try again with a different choice of loss/data/model/... .

They're definitely related in some instances e.g. 'flat' vs 'steep' local minima with comparable losses.


> There's a reason that large AI models only work with some architectures and you need to randomly initialize tens to hundreds of times and you need exactly the right optimizer with the right hyperparameters and then the right training data in the right order..

GPT-4 used the Tensor Programs research (Greg Yang, now at xAI) to effectively transfer hyperparameters from smaller models, where experiments can go fast, to the larger one in a predictable way, and got a very smooth loss curve throughout training.
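Very roughly, the idea looks like the sketch below. This is a caricature for illustration, not the actual muP parameterization (the real rules are per-layer and derived in the Tensor Programs papers); the scaling factors and key names here are placeholders:

    def transfer_hparams(tuned, base_width, target_width):
        # Caricature of hyperparameter transfer: tune on a small width,
        # then rescale width-sensitive knobs instead of re-tuning at scale.
        scale = base_width / target_width
        return {
            "hidden_lr": tuned["hidden_lr"] * scale,        # LR shrinks as width grows
            "init_std": tuned["init_std"] * scale ** 0.5,   # usual 1/sqrt(fan_in)-style init
            "warmup_steps": tuned["warmup_steps"],          # width-independent knobs carry over
            "adam_beta2": tuned["adam_beta2"],
        }

    small = {"hidden_lr": 1e-3, "init_std": 0.02, "warmup_steps": 2000, "adam_beta2": 0.95}
    print(transfer_hparams(small, base_width=256, target_width=8192))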


But that's OP's point. You need to care about those hyperparameters. Good on them for finding them in smaller models, but they still had to find them, because without those hyperparameters pre-training is at best unstable and at worst divergent.

Also, small here means hundreds of millions to billions of parameters - that's still larger than any model pre-2020.


Prior to that work on hyperparameter transfer, though, things would go wrong much more often when training the big expensive model, requiring restarts from checkpoints and retuning of parameters. Yeah, it's still a lot of fiddling around, but it is orders of magnitude less expensive.

But that work also went far beyond hyperparameter transfer and mathematically solved a lot of problems around avoiding overfitting; great video overview of it here:

https://www.youtube.com/watch?v=1aXOXHA7Jcw


In my opinion, you've grossly mischaracterized and misunderstood the bitter lesson. The bitter lesson is not saying that brute force will win out against better algorithms, it's saying that simple, efficient algorithms, in a theoretical sense, will win out against intricate and specialized algorithms.

That is, it's saying that worrying about the constant factor in the "big-Oh" is doomed to failure. My reading is that it's not saying that we should abandon polynomial time algorithms in favor of exponential ones, it's saying we should focus on polynomial time algorithms that are conceptually simple and let Moore's law take care of the constant factor differences between intricate, complex and "clever" algorithms. It's saying that Moore's law will obliterate the constant factor differences in algorithms, and potentially the small polynomial exponent differences, not that we shouldn't care about exponential time complexity.

The "Bitter Lesson" blog post came out during a time when people were debating whether simple algorithms on large datasets could win out against more complex algorithms with intricate domain knowledge. That is, did AGI, vision, language, translation, speech, etc. need intricate and deep domain knowledge or was it good enough to get coarser and "stupider" algorithms, that were still theoretically efficient/polynomial time efficient algorithms.

I can see why you would think this way and I certainly don't claim to have deep insight into Sutton to claim whether my reading is more correct than yours. From my perspective, the good faith reading of Sutton's article is talking about constant factor differences in algorithms, not in differences between polynomial time and exponential time worst case complexities.


I wonder if an analogy with graphics would be that "simple algorithms that leveraged compute" would not be "brute force everything with 1000 gpus + ray tracing", it might be a pixel shader that simulated water taking the state of the art a bit further.


The bitter lesson is about the balance of human ingenuity and compute being thrown at the problem. We've seen a few years of LLM compute being scaled up 10x every year, but this is hitting limits (fabs), and we will see more human effort as it becomes comparatively cheaper.

Also the current crop of models are inherently limited. Even for something as simple as following a JSON schema, models alone are not good enough [0]

Of course, as Moore's law refuses to die, we'll continue seeing 1.5-2x or so every year, but that's far from 10x.

[0] https://openai.com/index/introducing-structured-outputs-in-t... - see plot


>Of course as the Moore law refuses to die, we'll continue seeing 1.5-2x or so every year, but that's far from 10x.

This is another one of those anecdata throwaway sentences that take thousands of words to disprove - with a lot of graphs - that no one reads.

More hot takes: Moore's law has been effectively dead since the Pentium 4 on CPUs. It's been dead on GPUs since 2020. Right now we're not seeing 1.5-2x compute growth per year; we've seen zero growth for 5 years. The only way GPUs have gotten faster is by running ever hotter and by building out a trillion dollars' worth of data centers.

No one cares because the current hotness in AI is transformers, which are memory bound in both training and inference. If someone manages to make diffusion models the next hotness, all of a sudden everyone will realize this is a problem, since those are compute bound by a huge margin and current-gen GPUs are fire hazards when run at 100% utilization for weeks on end.


Yeah, my experience is that the rapid growth of computing capabilities of my childhood more or less ended in 2013: the desktop I built then is fine for most of my uses and the significant performance improvements since then haven’t been CPUs getting noticeably faster but increases in the speed of disks (SSDs) and growth of memory sizes and network speeds.


>More hot takes: Moores law has been effectively dead since the Pentium 4 on CPUs.

AMD is offering double digit percentage increases in IPC every generation. What you were saying might have been valid in early 2017, back when Intel was cranking up the clock speeds for a 5% increase, but when Zen came out suddenly AMD started delivering again.


The only escape from the Black Pill of AI ruining society is if we hit some hard limits and technology stops progressing at such a rapid rate, tapers off, and gives humanity a bit of time to adapt culturally.


wat... my 9950X CPU that I just bought is way faster than the similarly priced CPU I bought 6 years ago.

The difference is night and day. What are you talking about?


The equivalent chip 6 years ago was the 3950X, which has the same number of cores clocked ~20% slower. If you add ~30% IPC (for AVX-512 and some general cleanups), you get a ~60% speedup in 6 years, which is way below the ~10x speedup between the similarly spaced Pentium 4 and Sandy Bridge.


It's closer to 2x faster on the Chromium compile benchmark, at least for the 9950X3D, and there are even bigger benefits for X3D in use cases other than code compiling; the 9950X3D is only a few percent better than the 9950X for compiling.


How about power consumption, cache size, etc.?

What I know is that in my particle simulations and multithreaded compilation tasks it is easily 2-10x faster. Where before I could choke on 1 million particles, now I'm choking on 30 million.

I specifically bought this processor for its Rust compilation times.


>what i know is that in my particle simulations and multithreaded compilation tasks it is easily 2-10x faster.

If Moore's law still worked like it did between 1975 and 2005 you'd be getting a 16x performance boost in _single_ threaded applications.
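The back-of-the-envelope version of that claim, assuming the classic reading of Moore's law as a doubling of single-thread performance roughly every 18 months (the generous interpretation):

    years = 6
    doubling_period_years = 1.5
    moores_law_speedup = 2 ** (years / doubling_period_years)  # 2^4 = 16x

    # vs. the rough 3950X -> 9950X estimate from upthread:
    observed = 1.2 * 1.3  # ~20% clocks * ~30% IPC ~= 1.56x

    print(f"Moore's-law expectation: {moores_law_speedup:.0f}x, observed: ~{observed:.2f}x")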


The fastest single-threaded, non-overclocked, commercially available CPU was the Intel i3-7350K.

I haven't seen anything bench faster than a system set up for that machine on a business board.

I think now, with advancements such as AVX (the newer versions that chips like Atoms don't have), the 7350K can be viewed as a power-hungry pig.

I'm not sure though. The systems I purpose built using those CPUs still work great, no real significant performance issues even with current software and web. And they can use a single battery UPS.

I still have a 16-core 5950X as my main machine; it is 10% faster than the 40-core Xeon box it replaced, at 1/4 the wattage at the wall.


It's much higher than 1.5x-2x, from both scale-up and scale-out. Moving to FP4 alone will offer a huge speedup.


Agree (though that's perhaps not a charitable interpretation of TBL). Prematurely articulated principles can do a lot of damage.

Btw, I found the blog post to be one of the lowest-quality ones in terms of information content posted on HN - almost like it was written by ChatGPT or something.


I don't feel like the bitter lesson becomes applicable until you've demonstrated some initial degree of success with your technique on a single machine/GPU/CPU/thread. If you cannot make it work in practical environments, it's going to be an uphill battle the entire way.

This is why I've moved toward CPU-only techniques in my experimentation. Being able to execute arbitrary UTMs with high performance provides a significantly richer computational landscape to work with than matrix multiplication. I am perfectly happy with something taking longer as long as it scales roughly linearly, i.e., adding another CPU provides ~2x search speed. I am NOT happy with ideas like taking a 100x hit on token generation rate because half my parameters are paged out to disk at any given moment due to not having a step-size amount of VRAM.

The rigidity of the GPU solution stack makes exploration of clever techniques largely a slap on the wrist experience. Anything with rapid control flow changes is verboten for a fancy pants cluster. The latency domain of L1 cache in the CPU is impossible to compete with if you need to serialize all of your events. I strongly believe that control flow is where the magic happens. This is where you can cut through 100 million parameters of linear math bullshit and solve the problem with a lookup table and 2 interpreter cycles. You get about half a billion of these cycles to work with per second per thread, so there is a lot of room to play with ideas.


Meanwhile in robotics everyone is glad that computers are getting faster.

There are a lot of things that used to be impossible to do inside a 1000 Hz control loop.

>What do we do with the problems that are too hard to solve with the limited compute that we have? Lie down for 80 years and wait for compute to catch up? Or solve a smaller problem using specialized tricks that don't require a $10B super computer to build?

Solving a smaller problem using specialized tricks has gone nowhere in robotics. Almost all the advancements in robotics control happened in the last ten years as a result of computers getting faster.

We are very close to the cusp of non linear MPC becoming a solved problem for up to 64 degrees of freedom and a horizon of 20 time steps at 1000Hz, but we aren't there yet. It would definitely be possible with an ASIC built for MPC.

>The bitter lesson is nothing of the sort, there is plenty of space for thinking hard, and there always will be.

The bitter lesson doesn't say that human ingenuity is worthless. It guides it in a useful direction. A lot of human ingenuity was put into compute scaling solutions for transformers.


"The bitter lesson" didn't merely claim that a sufficiently large amount of compute would obsolete an engineered solution. Its claim was far stronger: the time it takes for the compute growth to catch up with the hand-engineered solution is so short that the investment in the latter won't pay off in the sense of a researcher's personal career investment, or in the sense of a big-tech R&D effort ROI.

You may well dispute such a claim, of course. Would be interesting to read your thoughts if you are willing to share them.


> The raw compute needed to brute force any problem can only be known after the problem is solved. There is no sane upper limit to how much computation, memory and data any given task will take and humans are terrible at estimating how hard tasks actually are. We are after all only 60 years late for the undergraduate summer project that would solve computer vision.

I feel like you’re conflating conceptual difficulty and computational difficulty


I appreciate this comment. I'm currently working hard on something seemingly straightforward (address matching), and sometimes I feel demotivated because it feels like whatever progress I make, the bitter lesson will get me in the end. Reading your comment made me feel that maybe it's worth the effort after all. I have also taken some comfort in the fact that current LLMs cannot perform this task very well.


Can you provide more about the problems in address matching and what you are trying to solve?

Do you mean street address matching? Isn’t that already solved? (excuse the naive question)


The problem is that people write addresses down in different ways (I'm in the UK)

You will typically have a master/canonical list of addresses from an official source. In the UK that's Ordnance Survey's AddressBase.

You will then have 'messy addresses' that humans have written down. For simple addresses, they'll often be the same as the master version (e.g. 5 Rainbow Road, Hemel Hempstead, AB1 2BC).

But there are many harder addresses that exhibit lots of variations, especially flats and subunits.

These may all be the same: Flat A 1 High Street vs 1A High Street vs Basement Flat, 1 High Street

There's no guarantee there will be a number in the address. In 'THE OLD FARM COTTAGE PAD FARM BADGERCROFT ROAD PIKING', 'THE' and 'PAD FARM' may be missing, which doesn't seem like a problem until you find out there's also a PAD FARM COTTAGE on Badgercroft Road. There's no guarantee tokens will be in the same order.

FWIW, my work is open source, and it's here: https://github.com/RobinL/uk_address_matcher
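To make the fragility concrete, here is a deliberately naive token-overlap scorer. This is not how uk_address_matcher works (that project uses probabilistic record linkage), and the candidate addresses are invented from the example above; the point is that the correct match wins only by a hair, so one dropped or reordered token can flip it:

    import re

    def tokens(address):
        return set(re.findall(r"[A-Z0-9]+", address.upper()))

    def overlap_score(messy, canonical):
        a, b = tokens(messy), tokens(canonical)
        return len(a & b) / len(a | b) if a | b else 0.0

    canonical_list = [
        "PAD FARM COTTAGE, BADGERCROFT ROAD, PIKING",
        "THE OLD FARM COTTAGE, PAD FARM, BADGERCROFT ROAD, PIKING",
    ]

    messy = "OLD FARM COTTAGE BADGERCROFT ROAD PIKING"
    for cand in sorted(canonical_list, key=lambda c: -overlap_score(messy, cand)):
        print(round(overlap_score(messy, cand), 2), cand)
    # 0.75 for the right cottage vs 0.71 for the wrong one.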


Thanks for explaining. Sounds messy indeed. Wasn’t there this startup trying to offer an alternative? Three words to describe a location or something. But of course it’s not really easy to change an address system, so using AI to make the existing system less messy might be more pragmatic.


I don’t know of any open source solution. (I work in mapping)


Going back to the original "Bitter Lesson" article, I think the analogy to chess computers could be instructive here. A lot of institutional resources were spent trying to achieve "superhuman" chess performance, it was achieved, and today almost the entire TAM for computer chess is covered by good-enough Stockfish, while most of the money tied up in chess is in matching human players with each other across the world, and playing against computers is sort of what you do when you're learning, or don't have an internet connection, or you're embarrassed about your skill and don't want to get trash-talked by an Estonian teenager.

The "Second Bitter Lesson" of AI might be that "just because massive amounts of compute make something possible doesn't mean that there will be a commensurately massive market to justify that compute".

"Bitter Lesson" I think also underplays the amount of energy and structure and design that has to go into compute-intensive systems to make them succeed: Deep Blue and current engines like Stockfish take advantage of tablebases of opening and closing positions that are more like GOFAI than deep tree search. And the current crop of LLMs are not only taking advantage of expanded compute, but of the hard-won ability of companies in the 21st century to not only build and resource massive server farms, but mobilize armies of contractors in low-COL areas to hand-train models into usefulness.


The main useful outcome we get from chess is entertainment.

Entertainment that comes from a Human vs. Human match is higher than Human vs. AI, at least for spectators.

But many sectors of the economy don't gain much from it being done by humans. I don't care if my car was made by all humans or all robots, as long as it's the best car I can get for the money.

I think you're extrapolating a bit too much from the specific case of chess.


It’s not really about how the compute-intensive resources come to bear. You can draw a parallel to Moore’s law. Node advancement is one of the most expensive and cutting edge efforts by humanity today. But it’s also simultaneously true that software companies have succeeded or failed by betting for or against computers getting faster. There are famous examples of companies in the 80’s that designed software that was simply not usable on the computers on hand when the project began, but was incredible on the (much faster) computers of launch day.

The bitter lesson is very similar. In essence, when building on top of AI models, bet on the AI models getting much faster and more capable.


And there is software today that is simply not usable on computers today, but will be incredible on computers in 20 years' time if clock speeds continue doubling every 2 years.

Most of it is written in Electron.


Hah, point hilariously made. Although I might argue electron commits the sin of betting on endless increases in memory performance :)


The time span over which these developments take place matters a lot for whether the bitter lesson is relevant to a particular AI deployment. The best AI models of the future will not have 100K lines of hand-coded edge cases, and developing those to make the models of today better won't be a long-term way to move towards better AI.

On the other hand, most companies don't have unlimited time to wait for improvements on the core AI side of things, and even so, building competitive advantages like a large existing customer base or really good private data sets to train next-gen AI tools has huge long-term benefits.

There's been an extraordinary amount of labor hours put into developing games that could run, through whatever tricks were necessary, on whatever hardware actually existed for consumers at the time the developers were working. Many of those tricks are no longer necessary, and clearly the way to high-definition real-time graphics was not in stacking 20 years of tricks onto 2000-era hardware. I don't think anyone working on that stuff actually thought that was going to happen, though. Many of the companies dominating the gaming industry now are the ones that built up brands and customers and experience in all of the other aspects of the industry, making sure that when better underlying scaling came there they had the experience, revenue, and know-how to make use of that tooling more effectively.


Why must the best model not have 100k edge cases hand coded?

Our firsthand experiences as humans can be viewed as such. People constantly over index on their own anecdata, and are the best "models" so far.


Previous experience isn't manual edge cases, it's training data. Humans have incredible scale (100 trillion synapses): we're incredibly good at generalizing, e.g., how to pick up objects we've never seen before or understanding new social situations.

If you want to learn how to play chess, understanding the basic principles of the game is far more effective than trying to memorize every time you make an opening mistake. You surely need some amount of rote knowledge, but learning how to appraise new chess positions scales much, much better than trying to learn an astronomically small fraction of chess positions by heart.


Actually companies can just wait. Multiple times my company has said: "a new model that solves this will probably come out in like 2-4 months anyways, just leave the old one as is for now".

It has been true like ten times in the past two years.


It's not that technical work is guaranteed to be in your codebase 10 years from now, it's that customers don't want to use a product that might be good six months from now. The actors in the best position to use new AI advances are the ones with good brands, customer bases, engineering know-how that does transfer, etc.


"those who have more capital have an advantage"


> Investment Strategy: Organizations should invest more in computing infrastructure than in complex algorithmic development.

> Competitive Advantage: The winners in AI won’t be those with the cleverest algorithms, but those who can effectively harness the most compute power.

> Career Focus: As AI engineers, our value lies not in crafting perfect algorithms but in building systems that can effectively leverage massive computational resources. That is a fundamental shift in mental models of how to build software.

I think the author has a fundamental misconception about what making the best use of computational resources requires. It's algorithms. His recommendation boils down to not doing the one thing that would allow us to make the best use of computational resources.

His assumptions would only be correct if all the best algorithms were already known, which is clearly not the case at present.

Rich Sutton said something similar, but when he said it, he was thinking of old engineering intensive approaches, so it made sense in the context in which he said it and for the audience he directed it at. It was hardly groundbreaking either, the people whom he wrote the article for all thought the same thing already.

People like the author of this article don't understand the context and are taking his words as gospel. There is no reason not to think that there won't be different machine learning methods to supplant the current ones, and it's certain they won't be found by people who are convinced that algorithmic development is useless.


I'm of the same mind.

I dare say GPT-3 and GPT-4 are the only recent examples where pure compute produced a significant edge compared to algorithmic improvements. And that edge lasted a solid year before others caught up. Even among the recent improvements:

1. Gaussian splatting, a hand-crafted method, blew the entire field of NeRF models out of the water.
2. DeepSeek R1 is used for training reasoning without a reasoning dataset.
3. Inception Labs' 16x speedup is done using a diffusion model instead of next-token prediction.
4. DeepSeek distillation, compressing a larger model into a smaller model.

That sets aside the introduction of the Transformer and diffusion model themselves, which triggered the current wave in the first place.

AI is still a vastly immature field. We have not explored it carefully and formally, but rather randomly tested things. Good ideas are being dismissed in favor of whatever randomly worked elsewhere. I suspect we are still missing a lot of fundamental understanding, even at the activation function level.

We need clever ideas more than compute. But the stock market seems to have mixed them up.


>There is no reason not to think that there won't be different machine learning methods to supplant the current ones,

Sorry, is that a triple negative? I'm confused, but I think you're saying there WILL be improved algorithms in the future? That seems to jibe better with the rest of your comment, but I just wanted to make sure I understood you correctly!

So.. Did I?


This misses that if the agent is occasionally going haywire, the user is leaving and never coming back. AI deployments are about managing expectations - you’re much better off with an agent that’s 80 +/- 10% successful than 90 +/- 40%. The more you lean into full automation, the more guardrails you give up and the more variance your system has. This is a real problem.


Sutton might have said you just need a loss function which penalises variance and the model will learn to reduce variance itself. He thinks this will be more effective than hand coded guardrails. He's probably right.

I don't know how you write that loss function mind you. Sounds tricky. But I doubt Sutton was saying it's easy, just that if you can do it then it's effective.


Penalises during training? Not at runtime? That's the risk.


You don't have to tolerate the agent/AI going haywire. Take a simple example: multiple parallel generations. It's compute-intensive, and it reduces the probability of your agent going haywire. You still need mechanisms and evals to detect the best output in this scenario, of course; that is still important. With more compute, you prevent your final output from going haywire despite the variance.


Do you have a real world example of this? Claude Code for example doesn’t fit the pattern of “higher success but more variance.” If anything the variance is lower as the model (and tightly coupled agent) gets better.


The only AI I've ever dealt with is unwillingly, when companies use AI chat bots to replace human support. They certainly make me want to leave and not come back.


Good stuff but the original "Bitter Lesson" article has the real meat, which is that by applying more compute power we get better results (just more accurate token predictions, really) than with human guiderails.


The counterargument is the bitter lesson Tesla is learning from Waymo, and that lesson might be bitter enough to tank the company. Waymo's approach to self-driving isn't end to end - they have classical control combined with tons of deep learning, creating a final product that actually works in the real world. Meanwhile, the purely data-driven approach from Tesla has failed to deliver a working product.


Tesla / Waymo is a perfect illustration of the point, but the Bitter Lesson doesn’t allow us to pick a winner here. The Bitter Lesson tells us that the Tesla approach (fully end to end, minimizing hand coded features / logic) will _ultimately_ win out. The Bitter Lesson does not tell us that this approach has to economically justify itself 1 year in, 5 years in, or that the approach when the technology is immature will allow a company to avoid bankrupting itself in the meantime while they wait for the data and compute to scale.

In other words, just because we know that ultimately (possibly in 20+ years) the Tesla compute-only approach will be simpler and more effective, Tesla might not survive to see this happen. Instead, manual feature engineering and hacking can always give temporary gains over data and compute driven approaches. The bitter lesson was clear about this. I suspect Waymo will win, and at some point in the future once they are out of their growth at all costs stage, they will transition into their maximum value extraction stage, in which vision will make significantly more economic sense than LiDAR. But once they win, they’ll have plenty of time to see the bitter lesson through its ultimate consequences. Elon is right, but he’s probably too early.


That's religion, not a predictive theory.

The Bitter Lesson has held up in a lot of domains where injecting human inductive bias was detrimental. Adding LIDAR for example is not inductive bias - it's a strictly superior form of sensing. You won't call a wolf's sense of smell "hand engineered features" or a cat's reflexes a failure of evolution to extract more signal from an inferior sensory input.

Waymo will win because they want to make a product that works and not be ideological about it - that's ultimately what matters.


I'd argue that the bitter lesson might be the other way around. Waymo has been experimenting with more end-to-end approaches and is likely to end up with something that looks more like that than a "classical control" approach, though maybe not quite the same approach as Tesla's current setup.

IMO, this is the best public description of the current state of the art: https://www.youtube.com/watch?v=92e5zD_-xDw

I expect Waymo to continue to evolve in a similar direction.


The lesson from Tesla is that AI is not just a magic box where you can put in data and get out intelligence. There is more to working systems than compute, and when they operate in the real world, data isn't enough. The key problem with Tesla cars that keeps them from succeeding is not that they don't have enough data, but that they have no idea what to do with it. Even if they had infinite compute and all the driving videos in the world, it wouldn't be enough to overcome the limitations of their sensors.


> The key problem with Tesla cars that keep them from succeeding is not that they don't have enough data, but they have no idea what to do with it. Even if they had infinite compute and all the driving videos in the world, it wouldn't be enough to overcome the limitations of their sensors.

Isn't this effectively a refutation of the "bitter lesson"?


Tesla is a poor counterargument because it is no longer a market leader. It has poor management compared to 10 years ago and seems to be unable to attract top talent (poor labor relations).

Tesla is being leapfrogged by competitors across the auto industry. All it has is first mover status (charging network).

Tesla purposefully limits the capabilities of its self driving by refusing to implement it with sensors that go beyond smartphone cameras.

My belief is that Tesla doesn’t want to actually deliver a car that can drive itself because the end result of Waymo is that fewer people will need to own a car and fleets of short term rental self-driving cars won’t spend frivolous money on prestige and luxury like consumer car buyers. They won’t lease a car and replace it every 2-3 years like some car owners do just because they like having a new car. Fleet vehicle operators purchase cars with razor thin margins and make decisions based solely on economics, as well as having a lot more purchasing leverage over car manufacturers.

I don’t think Tesla ever wants self driving to work, they just want to sell the idea of the software.


Tesla removed the LIDAR and thought advances in AI would be able to do without one. They were wrong.


Tesla didn't remove LIDAR, they never had it. So far, that bet is looking pretty reasonable. It seems evident at the moment that the most formidable competitors in this space could build a solid FSD product with cameras alone, with the biggest variable being time.


Is that evident?

The only level 3 certified drive system available in the US (Mercedes) utilizes LIDAR.

The only operating robotaxi service also uses it.

I would say it’s the opposite of evident that cameras are enough.


While Mobileye has focused on using lidar+cameras with each effectively serving as a form of "backup", they have also demonstrated a full city drive using nothing but cameras. I've seen enough from Waymo to see that they could easily do the same thing.

I expect both to keep radar and lidar for quite a few years, but I think its usage will consistently decline to zero or near zero over the next 10 years.


The problem with this prediction is... why? Cameras don't actually replace LiDAR, even if we solve getting depth maps from cameras 100%. The recent test in which the car ran over a child-sized dummy hidden in a smoke screen is evidence of this. If the purpose of driverless cars is to be better than humans, we should be aiming for them to have superhuman sensing with camera+lidar+radar etc., instead of human-level sensing with just cameras.


Sure, but that test was tailor made for lidar to "succeed" even if lidar actually couldn't see the hidden obstacle. The lidar was largely blocked by the "rain", which meant it could treat it like a wall. For a real world example of this, see Waymo's experience with a water main break: https://www.reddit.com/r/SelfDrivingCars/comments/1g75ftb/wa...

The lidar didn't necessarily see the child, but it didn't have to.

A real test of lidar would have been perfectly transparent obstacles. It wouldn't have been realistic, but neither were the other tests.

There are two things that will drive companies to eliminate it:

- Cost - Even though the cost can come down quite a bit, the current multi-lidar systems add both up-front cost and maintenance cost.

- Complexity - They don't do the same things that cameras do, and can in fact never replace cameras. This means that they are not 100%-fidelity backups, and therefore the decision of which sensor to trust needs to be made. A Mobileye-like solution is likely the best answer here, with a discriminator network to (hopefully) accurately judge which sensor to believe for each detected point, but this is not trivial. In fact, just getting this right may be nearly as hard as the driving policy tasks themselves.

I still think lidar may stay around for special purpose tasks (eg, curb detection, or low obstacles), but the era of them being relied on heavily to detect VRUs or other vehicles will surely end.


It's not really tailor-made though -- this was a problem researchers faced in the DARPA Urban Challenge in 2007, which was held in the desert, and during which dust storms frequently confounded sensor systems. LiDAR was the technology that enabled them to complete those tasks.

> A real test of lidar would have been perfectly transparent obstacles. It wouldn't have been realistic, but neither were the other tests.

LiDAR would fail that test, which is why you use thermal imaging to see transparent obstacles. Roboticists have known for a long time that orthogonal sensors are the key to building robust systems that perform in a wide variety of environments.

> - Cost - Even though the cost can come down quite a bit, the current multi-lidar systems they add both up front cost and maintenance cost.

I don't understand why you think the LiDAR system is more expensive. The only evidence you have for camera based systems potentially ever working is that if you squint really hard, eyeballs seem kind of like cameras, and people drive with eyeballs, therefore AI should be able to as well. But you're not factoring in the cost of the AI as part of that system. Your proof of concept requires a human-level intelligence being in control of the car. How is that cheaper than LiDAR? Both in R&D and in the amount of resources it would take to run said AI on the car?

> - Complexity - They don't do the same things that cameras do, and can in fact never replace cameras.

Again, how is it less complex? If you need an AGI in the mix for it to even get up to the level of systems with LiDAR, how can it be simple?

> A Mobileye like solution is likely the best answer here, with a discriminator network to (hopefully) accurately judge which sensor to believe for each detected point, but this is not trivial

Huh? This is not typically how robots are made: you don't choose which sensor to believe; you construct a belief that is informed by various sensors, each of which produces data with some sort of error that is accounted for. State estimation allows us to take very untrustworthy and error-filled sensor measurements and produce accurate beliefs. This is why orthogonal sensor modalities are so important to robust system design.

You talk about minimizing complexity, but who cares about complexity? What people care about is robustness. Robust/complex solutions are preferred to brittle/simple solutions when human lives are on the line.


> Your proof of concept requires a human-level intelligence being in control of the car. How is that cheaper than LiDAR?

This is the core of the problem. Lidar doesn't actually change any of that, it still needs AI that's just as strong, if not stronger. All it gets you is a more reliable point-cloud (in some situations, maybe less reliable in others), that you'll then have to interpret.

There's no getting around the need of having a powerful AI behind it all.


Tesla is actually an example of relying too much on human domain knowledge.

Waymo is brute-forcing the problem with hardware. They use Lidar.

Elon Musk's argument against Lidar is that humans only need two eyes and therefore stereoscopic vision is enough.

"Human drivers use two eyes, therefore self driving cars need two eyes." is exactly the type of thing the bitter lesson is warning against if you stretch the analogy to hardware.


I bring this up often at work. There is more ROI in assuming models will continue to improve, and planning/engineering with that future in mind, rather than using a worse model and spending a lot of dev time shoring up its weaknesses, prompt engineering, etc. The best models today will be cheaper tomorrow. The worst models today will literally cease to exist. You want to lean into this - have the AI handle as much as it possibly can.

E.g.: We were using Flash 1.5 for a while. Spent a lot of time prompt engineering to get it to do exactly what we wanted and be more reliable. Probably should have just done multi-shot and said "take the best of 3", because as soon as Flash 2.0 came out, all the problems evaporated.


That's the core of the argument. We are switching from a 100% deterministic and controlled worldview (in software terms) to a scenario where it's probabilistic, and we haven't updated ourselves accordingly. Best of n (with parallelization) is probably the simplest fix instead of such rigorous prompt engineering. Still, many teams do want a deterministic output and spend a lot of time on prompts (as opposed to evals to choose the best output).
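A minimal sketch of best-of-n with an eval, where call_model and score_output are hypothetical stand-ins (a real version would call your LLM client and a proper eval or judge model):

    import random
    from concurrent.futures import ThreadPoolExecutor

    def call_model(prompt):
        # Hypothetical stand-in for an LLM call; replace with your client.
        return random.choice([
            '{"refund": true}',
            "sure, let me check that for you!!",
            '{"refund": false}',
        ])

    def score_output(output):
        # Hypothetical eval: prefer outputs that look like the structure we want.
        return 1.0 if output.startswith("{") and output.endswith("}") else 0.0

    def best_of_n(prompt, n=3):
        # Sample n candidates in parallel, keep the one the eval scores highest.
        with ThreadPoolExecutor(max_workers=n) as pool:
            candidates = list(pool.map(lambda _: call_model(prompt), range(n)))
        return max(candidates, key=score_output)

    print(best_of_n("Does order #123 qualify for a refund?"))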


Honestly I feel the takeaways are the opposite.

There’s no point in building something non functional now simply because it will be replaceable with something functional later.

You should either do it without AI or not do it at all. You’re not actually adding value with a placeholder for “future AI”.


It's more that if you don't know for sure whether it's possible (and usually you don't), then adding your expertise onto an AI system is never going to pay off compared to building out the AI compute infrastructure and training data.

This has the best chance of being functional in the long term, in the face of uncertainty.

If you already know it can work, then you can improve with specific expertise, but it's a fixed solution at that point.


Eh, so in reality there are a lot of AI products people are trying to build and it's very unclear at the outset "if it's possible", where "possible" is a business question that includes factors like:

- How hard is the task? Can it be completed with cheaper/faster models or does it require heavyweight SOTA tier models?

- What's your cost envelope for AI compute?

- How are you going to test/refine the exact prompt and examples you give the AI?

- How much scaffolding (aka, dev time = $$$) do you need to set up to integrate the AI with other systems?

- Is the result reliable enough to productize and show to users?

What you realize when designing these systems is that there is a sliding scale: the more scaffolding and domain expertise you put into the system as a whole, the less you need to rely on the AI, but the more expensive it is in man-hours to develop and maintain. It looks more and more just like a traditional system. And vice versa: perhaps with the most powerful SOTA models you can just dump 20K tokens of context and get an answer that is highly reliable and accurate with almost no extra work on your end (but costs more to run).

It's very individualized and task-dependent. But we do know from recent history, you can generally assume models are going to get faster/smarter/cheaper pretty quickly. So you try to figure out how close to the latter scenario you can get away with for now, knowing that in 6 months the equation could have completely changed in favor of "let the AI do most of the work".

As an addendum, I think it's completely crazy right now to be in the business of training your own models unless you have HIGHLY specialized needs or like to light money on fire. You are never going to achieve the performance/$ of the big AI labs, and they/their investors are doing all your R&D for FREE. It's like if Ford was releasing a new car every 6 months made out of ever more efficient and stronger carbon nanotubes or whatever, because the carbon nanotube companies were all competing for market share and wanted to win the "carbon nanotube race". It's crazy, never seen anything like it.


It's not wrong, but I find the underlying corollary pretty creepy: that actually trying to understand those problems and fix errors at edge cases is also a fool's errand, because why try to understand a specific behavior if you can just (try to) fine-tune it away?

So we'll have to get used for good to a future where AI is unpredictable, usually does what you want, but has a 0.1% chance of randomly going haywire and no one will know how to fix it?

Also, the focus on hardware seems to imply that it's strictly a game of capital - who has access to the most compute resources wins, the others can stop trying. Wouldn't this lead to massive centralization?


>So we'll have to get used for good to a future where AI is unpredictable, usually does what you want, but has a 0.1% chance of randomly going haywire and no one will know how to fix it?

Just like humans. I don't think it's a solvable problem either.


It’s not “just like”, because humans can be held accountable. Also, I suspect that the distribution of failure modes is actually substantially different between LLMs and humans.


Late to reply, but my point was that even with the ability to be held accountable on pain of unemployment, imprisonment or even death, sometimes even these disincentives are not enough to stop a certain small percentage of humans from going haywire. It's all just something we have to live with.


Add more humans and LLMs to correct for errors. If humans sometimes go crazy and try to randomly end the world at a rate of 0.1%, requiring two humans to turn two keys synchronously to end the world reduces the error rate to 0.0001% (assuming the failures are independent).

So, to avoid depressed AIs ending the world randomly, have a stable of multiple AIs with different provenance (one from Anthropic, one from OpenAI, one from Google...) and require majority agreement to reduce the error rate. Adjust thresholds depending on the criticality of the task at hand.
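A sketch of that voting scheme, with hypothetical model answers and the error arithmetic spelled out (the arithmetic assumes failures are independent, which is exactly what the reply below is worried about):

    from collections import Counter

    def majority_vote(answers):
        # Return the majority answer, or None (escalate) if there is no majority.
        top, count = Counter(answers).most_common(1)[0]
        return top if count > len(answers) / 2 else None

    print(majority_vote(["hold", "launch", "hold"]))  # -> "hold"

    # Two independent keys each failing at 0.1%: both must fail together.
    p = 0.001
    print(p ** 2)  # 1e-06, i.e. 0.0001%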


Let's hope they're not correlated.


> For instance, in customer service, an RL agent might discover that sometimes asking a clarifying question early in the conversation, even when seemingly obvious, leads to much better resolution rates. This isn’t something we would typically program into a wrapper, but the agent found this pattern through extensive trial and error. The key is having enough computational power to run these experiments and learn from them.

I am working on a gpt wrapper in customer support. I’ve focused on letting the LMs do what they do best, which is writing responses using context. The human is responsible for managing the context instead. That part is a much harder problem than RL folks expect it to be. How does your AI agent know all the nuance of a business? How does it know you switched your policy on returns? You’d have to have a human sign off on all replies to customer inquiries. But then, why not make an actual UI at that point instead of an “agent” chatbox.

Games are simple, we know all the rules. Like chess. Deepmind can train on 50 million games. But we don’t know all the rules in customer support. Are you going to let an AI agent train itself on 50 million customer interactions and be happy with it sucking for the first 20 million?


The bitter lesson would suggest eventually the LM agent will train itself, brute force, on something and extract the context itself. Perhaps it will scrape all your policy documents and figure out which ones are most recently dated.


> For instance, in customer service, an RL agent might discover that sometimes asking a clarifying question early in the conversation, even when seemingly obvious, leads to much better resolution rates.

Why does this read to me as the bot finding a path of “Annoy the customer until they hang up and mark the case as solved.” ?


YES to the nature analogy.

We are not guaranteed a world pliable to our human understanding. The fact that we feel entitled to such things is just a product of our current brief moment in the information-theoretic landscape, where humans have created and have domination over most of the information environment we navigate. This is a rare moment for any actor. Most of our long history has been spent in environments that are unmanaged ecologies that have blossomed around any one actor.

IMHO neither we nor any single AI agent will understand that world as fully as we understand today's. We should retire the idea that we are destined to be privileged to that knowledge.

https://nodescription.net/notes/#2021-05-04


The bitter lesson is a good demonstration of how people have really short memories and distributed work loses information

Every AI "breakthrough" comes at a lag because the people who invent a new architecture or method aren't the ones who see its full potential. Because of the financial dynamics at play, the org or team that sees the crazy-looking result often didn't invent the pieces they used. Even if they did, it's been years and in a fast-moving field that stuff has already started to feel "standard" and "generic". The real change they saw was something like more compute or more data

Basically, even the smartest people in the world are pretty dumb, in the sense of generalizing observations poorly


Has anyone empirically assessed the claims of the Bitter Lesson? The article may sound convincing, but ultimately it's just a few anecdotes. It seems to have a lot of 'cultural' impact in AI research, so it would be good to have some structured data-based analysis before we dismiss entire research directions.


The Bitter Lesson is about general methods that scale to hard problems once you unlock a minimum threshold for compute, so neural nets definitely qualify.

Traditional ML methods exist for all of the things we use neural nets for, but none of them are as effective, for a plethora of reasons, but one of the biggest reasons is how much training data they can handle. If you have to invert an NxN matrix, for example, where N is the size of your training set, you aren't getting very far. But a neural net scales to datasets containing billions of samples, and can be adapted to multiple domains that previously had their own special techniques. The bottleneck was being able to train them, and letting go of restrictions like provable optimality. Once we could train them, we quickly discovered that scaling to larger datasets produced models that dominated everything else.
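Rough numbers behind the matrix-inversion point (the dataset and minibatch sizes below are made up for illustration):

    N = 1_000_000_000      # a billion training samples
    bytes_per_float = 4

    gram_matrix_bytes = N * N * bytes_per_float   # exact kernel methods need O(N^2) memory
    print(gram_matrix_bytes / 1e18, "exabytes just to store the NxN matrix")

    minibatch, tokens_per_sample = 1024, 512      # SGD only touches a minibatch at a time
    print(minibatch * tokens_per_sample * bytes_per_float / 1e6, "MB of data per SGD step")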


We got better and better models when we threw more and more compute? I gotta work on my snarkiness. Seriously, that's pretty good empirical evidence. The smaller models we get are all some kind of distillation or student model of a larger model, so they can never claim they are not the result of large compute.


An empirical study could start by quantifying model quality and available compute and plot them over time, see how they correlate, investigate whether there might be confounding factors,... in essence, putting numbers behind the qualitative statements. I see no reason for snarkiness here.


Better and better how ? Isn't this also based on an overcomplicated web of anecdotes ?


No.


I don't get how RL can be applied in a domain where there is no simulator.

So for customer service, to do RL on real customers... well this sounds like it's going to be staggeringly slow and very expensive in terms of peeved customers.


I dunno seems like a decent area to start with. A team that collects detailed data on customer service interactions as well as satisfaction of outcomes should be able to create a decent dataset. Then you grade outputs in a simulator to train the model. No need to train on the fly, at least not at first.


There are some challenges there:

- Service interactions are pretty complicated
- Satisfaction depends on a lot of factors
- Customers mostly give 7/10 because giving an NPS score is so unnatural


More generally beats better. That’s the continual lesson from data intensive workloads. More compute, more data, more bandwidth.

The part that I’ve been scratching my head at is whether we see a retreat from aspects of this due to the high costs associated with it. For CPU-based workloads this was a workable solution, since the price has been falling. GPUs have generally scaled pricing as a constant multiple of available FLOPS, and the current hardware approach equates to pouring in power to achieve better results.


It's actually about LLMs. They're fundamentally limited by our preconceptions. Can we go back to games and AlphaZero?


It’d be nice if this post included a high-level cookbook for training the 3rd approach. The hand-waving around RL sounds great, but how do you accurately simulate a customer for learning at scale?


Amazon Connect (as just one example) allows for post-call transcription (“calls may be recorded for quality assurance purposes”). Feed several thousand of these into an LLM alongside the ticket summary of initial problem and final outcome. With a month’s worth of calls you probably have a reasonable dataset to distill out a cheap 1B or 3B Llama model that is practically free compared to the real model used in your support agent workflow.
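A hypothetical sketch of what that distillation dataset prep could look like; the field names and JSONL chat format are assumptions, not Amazon Connect's or any provider's actual schema:

    import json

    def to_training_example(ticket):
        # One transcript/ticket pair becomes one supervised fine-tuning example.
        return {
            "messages": [
                {"role": "system", "content": "You are a support agent. Resolve the issue."},
                {"role": "user", "content": ticket["initial_problem"]},
                {"role": "assistant", "content": ticket["final_outcome_summary"]},
            ]
        }

    tickets = [
        {"initial_problem": "Package arrived damaged, order #4821.",
         "final_outcome_summary": "Apologised, confirmed damage photos, issued a replacement."},
    ]

    with open("distill_train.jsonl", "w") as f:
        for t in tickets:
            f.write(json.dumps(to_training_example(t)) + "\n")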


I think this goes for almost all software. Hardware is still getting impressively faster every year despite Moore's law expiring.


I think an even more bitter lesson is coming very soon: AI will run out of human-generated content to train on.

Already AI companies are probably training AI with AI generated slop.

Sure there will be tweaks etc, but can we make it more intelligent than its teachers?


> probably

that's how it's been for the last 2 years

synthetic datasets is one of the terms used

this is what the "fine tuning" space relies upon for infinite permutations of base models a moment after every release, and what larger organizations are also doing to create their base models


Please, please stop letting AI rewrite for you. I'm so tired of reading AI slop.

Instead, ask it to be a disagreeable editor and have it ask questions about your draft. You'll go so much further, and the writing won't be nonsensical.


> My plants don’t need detailed instructions to grow. Given the basics (water, sunlight, and nutrients), they figure out the rest on their own.

They do need detailed instructions to grow. The instructions are encoded in their DNA, and they didn’t just figure them out in real time.


If only artificial intelligence was intelligent!

Oh, well...



