Ask HN: Why is it taken for granted that LLM models will keep improving?
49 points by slalomskiing on Dec 5, 2023 | 64 comments
Whenever I see discussion of stuff like ChatGPT it seems like there is this common assumption that it will get better every year.

And in 10-20 years it’ll be capable of some crazy stuff

I might be ignorant of the field but why do we assume this?

How do we know it won’t just plateau in performance at some point?

Or that say the compute requirements become impractically high




The scaling laws (the original Kaplan paper, Chinchilla, and OpenAI's very opaque scaling graphs for GPT-4) suggest indefinite improvement for the current style of transformers with additional pre-training data and parameters.

No one has hit a model/dataset size where the curves break down, and they're fairly smooth. Simple models that accurately predict performance usually extrapolate well near the regimes where they were fit, so I expect trillion or 10-trillion parameter models to land on the same curve.

What we haven't seen yet (that I'm aware of) is whether the specializations layered onto existing models (LoRA, RLHF, different attention methods, etc.) follow similar scaling laws, since most of that effort has gone into matching performance with smaller/sparser models rather than into pouring large amounts of money into huge experiments. It will be interesting to see what DeepMind's Gemini reveals.


This is the most accurate answer so far re: the scaling laws. It has been demonstrated that LLMs follow quite clean power laws with respect to performance; in practice, a model's loss can be predicted quite accurately from its parameter count and the amount of data it is trained on. The Wikipedia article on neural scaling laws provides a brief, accessible summary. Both data and parameters are expected to increase in the coming years, so models are expected to improve.
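
To make the power-law point concrete, here is a rough sketch of a Chinchilla-style loss predictor. The functional form L(N, D) = E + A/N^alpha + B/D^beta is from the scaling-law literature, but the constants below are only ballpark values plugged in for illustration, not fitted numbers to rely on:

    def predicted_loss(n_params, n_tokens,
                       E=1.7, A=406.0, B=411.0, alpha=0.34, beta=0.28):
        """Chinchilla-style prediction: loss falls smoothly as a power law
        in both parameter count and training-token count."""
        return E + A / n_params**alpha + B / n_tokens**beta

    # Scaling parameters and data together just walks you down the curve:
    for n, d in [(7e9, 140e9), (70e9, 1.4e12), (700e9, 14e12)]:
        print(f"{n:.0e} params, {d:.0e} tokens -> predicted loss {predicted_loss(n, d):.2f}")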


> No one has hit a model/dataset size where the curves break down, and they're fairly smooth.

The same was true of transistors, until it wasn't: sometime around the late NetBurst era they started diverging from the predictions about how they would behave when very small. The Pentium 4/NetBurst architecture was sunk by exactly this problem: its designers assumed it would scale to 8-10GHz on a sane power budget, and it simply didn't, as the improvement per transistor shrink became smaller and smaller.


LLMs are comprised of just three elements

Data

Compute

Algorithms

All three are just scratching the surface of what is possible.

Data: What has been scraped off the internet is probably less than 0.001% of human knowledge; most platforms cannot be scraped easily, and a great deal of the rest exists in non-text formats like video and audio, or on plain old undigitized paper. Finally, there are probably techniques to increase the amount of data through synthetic means, which is purportedly OpenAI's secret sauce behind GPT-4's quality.

Compute: While 3nm processes are approaching an atomic limit (0.21nm for Si), there is still room to explore more densely packed transistors, other materials like gallium nitride, or optical computing. Beyond that, there is a lot of room in hardware architecture for more parallelism and 3D-stacked transistors.

Algorithms: The transformer and other attention mechanisms have several sub-optimal components, such as fairly arbitrary design decisions in the Transformer itself and the quadratic time complexity of attention. There also seems to be a large space of augmentations on top of the base LLM, like RLHF for instruction following, improvements to factuality, and other mechanisms.
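
To make the "quadratic attention" point concrete, here is a toy single-head attention in NumPy; the n-by-n score matrix is where the quadratic cost in sequence length comes from (this is just a sketch, ignoring batching, masking, and multiple heads):

    import numpy as np

    def toy_attention(Q, K, V):
        """Scaled dot-product attention for a sequence of n tokens.
        Q, K, V have shape (n, d); the intermediate score matrix is (n, n)."""
        n, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                    # (n, n): the quadratic part
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
        return weights @ V                               # (n, d)

    # Doubling the context length quadruples the number of attention scores:
    for n in (1_000, 2_000, 4_000):
        print(n, "tokens ->", n * n, "scores per head per layer")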

And these ideas are just from my own limited experience. So I think it's fair to say that LLMs have plenty of room to improve.


> LLMs are comprised of just three elements

> Data

> Compute

> Algorithms

Not to be facetious, but so is all other software. LLMs appear to scale with the first two, but it's not clear what that relationship is, and that is exactly the basis of the question being asked.


For data though: as LLMs generate more output, wouldn't they be expected, over time, to pollute themselves with their own generated data?

Wouldn't that be the wall we'll hit? Think of how shitted up Google Search is with generated garbage. I'm imagining we're already in the 'golden age', where we were able to train on good datasets before they got 'polluted' with LLM-generated data that may not be accurate, and things just continue to become less accurate over time.


I don't really buy that line of argument. There are still useful signals, like upvotes, known human writing, or just plain spending time/money to label data yourself. There's also the option of training better algorithms on pre-LLM datasets. It's something to consider, but not any sort of crisis.


Cleaning and preparing the dataset is a huge part of training. As the OP mentioned, OpenAI likely has some high-quality automation for doing this, and that's what has given them a leg up over all other competitors. You can apply the same automation to clear out low-quality AI content the same way you remove low-quality human content. It's not about the source; only the quality matters.


There must be signals in the data about generated garbage, otherwise humans wouldn't be able to tell. Something like PageRank would be a game changer and potentially solve this issue.


We need models that need less language data to train. Babies learn to talk on way less data than the entire internet. We need something closer to human experience. Kids have a feel for what is bullshit before they have consumed the entire internet :-).

I think feeding the internet into a LLM will be seen as the mainframe days of AI.


My counter-point to this is that babies are born with a sort of basic pre-trained LLM. Humans are born with the analogous weights & biases in our brains partly optimized to learn language, math, etc. Before pre-training an LLM, the weights & biases of its analogous brain are initialized with random values. Training on the internet can, IMO, be seen as a kind of "pre-training".


> Babies learn to talk on way less data than the entire internet.

Is this actually true? My gut check says yes, but I'm also unaware of any meaningful way to quantify the volume of sensory data processed by a baby (or anyone else, for that matter), and it wouldn't shock me to discover that, if we could, we'd find it to be a huge volume.


Ah yes, I should be more precise: less data that is textual. Of course other data sources are plentiful, including internal and external sensory input.


Babies in ancient societies certainly had less exposure to written language, much lower vocabulary, less exposure to music, etc.


Sure, the breadth is (maybe) smaller, but the question is volume. Babies get years of people talking around them, as well as data from their own muscles and vocalizations fed back to them. Is the volume they have consumed by the point they begin talking actually less than the volume consumed by an LLM?


If you're talking about babies in ancient societies (which I am), the answer is absolutely yes. They were exposed to much less language, and much less sound, than we are.


Really? How much less? I'm far from convinced that if you sum up the sheer volume of noises heard, as well as the other neurological inputs that go into learning to speak (e.g. proprioception), you'd come out with a smaller number than what LLMs are trained on, but I'm open to any real data on this.


I'm sorry, but you can have process nodes smaller than an atom; the size of atoms is irrelevant here. The process node name refers to the dimensions a theoretical planar transistor would need to have to be equivalent to the current 3D transistors. If you stack multiple transistors on top of one another, the nominal process node gets smaller regardless of the size of an atom.


Because a lot of smart people are spending a lot of time, money, and effort on this. It's as simple as that. We could go into all sorts of details: how increases in GPU/TPU capabilities will improve training, both in scale and speed; how better techniques will make training on the same dataset produce better models; how other improvements will make better use of existing models; or how additions to training datasets will improve models using existing techniques. But it really all boils down to a lot of smart people, some with a lot of money, who are personally invested (with time and money) in making these models better.

That doesn't mean there isn't a plateau somewhere, but it's probably way off in the distance.


I mean, a lot of highly paid/intelligent people have worked on crypto, fusion, and quantum computing, and all of those fields are still evolving rather gradually.


Crypto is actually picking up again after FTX and Binance, though since its problem is a social one of adoption rather than a technical issue, I'm not sure it's comparable.

The problem with fusion and quantum computing is that advances are being made, but because those advances aren't consumer facing, you don't see them. E.g. in December 2022, researchers got more energy out of a fusion experiment than they put in. That's huge! I'm not going to see an effect on my power bill for another couple of decades, if ever, but it's real, solid progress. For quantum computing, they're moving past single-qubit tech demonstrations and into practical applications like making chips that can talk to each other.** Again, it doesn't remotely affect me or my laptop today, but we've moved past the 1998 Stanford/IBM 2-qubit computer.

Meanwhile, I can adopt a newly dropped model with an afternoon of work and see the results in milliseconds, as in the case of StableDiffusion-Turbo.

* https://www.technologyreview.com/2023/11/16/1083491/whats-co...

** https://www.technologyreview.com/2023/01/06/1066317/whats-ne...


We know why those fields move slowly.

Cryptocurrency's need to be fully decentralised is the thorn in its side. "Be your own bank" is a bit too much for most people used to cash or a bank account they can call up if there is a problem. It has fundamental social problems that there may be solutions to, but probably not.

Fusion and quantum are massive physics and engineering challenges. With ML, we are already building the chips at scale, so we know it is scalable and doable.

It is a 50-to-100 problem, not a 0-to-1 problem.


And we might know what those problems are for AI in a year or two; look back at how crypto was viewed by some...

I also was not arguing that AI won't improve by quite a bit more. I actually think AI will make a few more big steps forward, but the guarantee for this isn't anchored in the inflowing capital/talent; it's anchored in the relatively clear path forward for the technology and the partly known inefficiencies of the current architecture.


a more apt analogy is Moore's law


LLMs might hit a wall. Any technology could hit a wall. ChatGPT could be the next Segway. But, like the Segway, LLMs are useful now. I think the impact of "stuff like ChatGPT" on software engineering will equal the impact of the compiler, in that eventually no one will consider writing software without something like ChatGPT in the toolchain, the same way no one works without a compiler now. LLMs are useful now, and they've only existed for a few years.

But that's just my opinion and no one knows the future. If you read papers on arxiv.org, progress is being made. Papers are being written, low-hanging fruit consumed. So we're going to try because PhDs are there for the taking on the academic side, and generational wealth is there for the taking on the business side.

E. F. Codd invented the relational database and won the Turing Award. Larry Ellison founded Oracle to sell relational databases and that worked out well for him, too.

There's plenty of motivation to go around.


I don't know about the specifics of mikewarot's point below, but I think he's close to verbalizing a fairly-important truth: there is no reason whatsoever to think that Von Neumann machines are the best way to implement neural networks. There are lots of reasons to think they aren't, starting with the VRAM bottleneck. The impressive results that have been achieved so far have almost certainly come from using the wrong tools. That's cause for optimism IMHO.

Digital computer architecture evolved the way it did because there was no other practical way to get the job done besides enforcing a strict separation of powers between the ALU, memory, mass storage, and I/O. We are no longer held to those constraints, technically, but they still constitute a big comfort zone. Maybe someone tinkering with a bunch of FPGAs duct-taped together in their basement will be the first to break out of it in a meaningful way.


It appears as if improving data corpus quality and size, along with improving processing capacity, is still driving performance gains. I have no idea of the functional relationship, and it's likely not a Moore's-law kind of thing, although that would be an underlying driver of available capacity until saturation.


I don't think it's a universal assumption. Some people do think it will hit a wall (and maybe do so soon), others think it can keep improving easily by scaling up the compute or the training data.

Good LLMs like ChatGPT are a relatively new technology so I think it's hard to say either way. There might be big unrealized gains by just adding more compute, or adding/improving training data. There might be other gains in implementation, like some kind of self-improvement training, a better training algorithm, a different kind of neural net, etc. I think it's not unreasonable to believe there are unrealized improvements given the newness of the technology.

On the other hand, there might be limitations to the approach. We might never be able to solve for frequent hallucinations, and we might not find much more good training data as things get polluted by LLM output. Data could even end up being further restricted by new laws meaning this is about the best version we will have and future versions will have worse input data. LLMs might not have as many "emergent" behaviors as we thought and may be more reliant on past training data than previously understood, meaning they struggle to synthesize new ideas (but do well at existing problems they've trained on). I think it's also not unreasonable to believe LLMs can't just improve infinitely to AGI without more significant developments.

Speculation is always just speculation, not a guarantee. We can sometimes extrapolate from what we've seen, but sometimes we haven't seen enough to know the long term trend.


Not an expert, but have wondered the same thing. From what I've read, it comes down to optimism and extrapolation from current trends. Both of these have problems of course, but what else can you do? My working hypothesis is that we'll reach a practical limit on the quality of what we can get from the current class of models, and to extend beyond that would require a new approach, rather than just more data and more horsepower. The new breakthrough would have to be as significant as the last, but would be more likely to happen in a short time span because there is so much more activity in AI research now than even 5 years ago. Again, I'm a dummy about this stuff, not claiming more than that.


Your skepticism is, I think, very well founded -- especially with such unclear definitions of "improvement."

I think I have a corollary idea: why are LLMs not perhaps like "Linux", something that never really needs to be REWRITTEN from scratch, merely added to or improved on? In other words, isn't it fair to think that LoRAs are the really important thing to pay attention to?

(And perhaps, like Google Fuchsia or whatever, new LLMs might mostly be a waste of time from an innovator's POV?)


When it comes to training LLMs, the definition of "improvement" is incredibly clear, as one must literally code a loss function that the model then minimizes.

It gets murkier when trying to map that to actual capabilities, but so far, lower loss has led to much stronger capabilities.
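
For the curious, the pre-training loss in question is just next-token cross-entropy. A minimal sketch (shapes and names are illustrative):

    import numpy as np

    def next_token_loss(logits, targets):
        """logits: (seq_len, vocab_size) scores the model assigns to each possible
        next token at each position; targets: (seq_len,) ids of the tokens that
        actually came next. Returns the mean negative log-probability of the truth."""
        logits = logits - logits.max(axis=-1, keepdims=True)                      # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))   # log-softmax
        return -log_probs[np.arange(len(targets)), targets].mean()

Lower loss just means the model puts more probability on the token that actually follows, which is the sense in which "improvement" is unambiguous.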


The recent history of bigger LLMs suddenly being capable of new things is kind of miraculous. This blog post is a decent overview: https://blog.research.google/2022/11/characterizing-emergent... " In many cases, the performance of a large language model can be predicted by extrapolating the performance trend of smaller models."


I dunno if LLMs will get better, but ML in general is a task of compression, and there is definitely a whole bunch of human knowledge and history that neural nets can compress.

It's not infeasible that in the future you'll have a box at home that you can ask a fairly complicated question, like "how do I build a flying car", and it will have the ability to:

- tell you step by step instructions of what you need to order

- write and run code to simulate certain things

- analyze your work from video streams and provide feedback

- possibly even have a robotic arm with attachments that can do some work.


Unbounded possibilities! Imagine, and hear me out, asking the box "how do I build a box that can answer fairly complicated questions?" and getting the output in an automated way, all the way up from atoms, ready to be plugged into the power grid.


It's the wishing-for-more-wishes-from-a-genie scenario. But it isn't enough to just have a box that can do this; the box needs a body, preferably a humanoid one, so that it can interface easily with our existing infrastructure.


Like others in this thread have said, we're just starting to explore the technology. I view it as akin to the gap between early CPUs like the 6502, which did only the absolute minimum, and today's monsters with large memory caches, predictive logic, dedicated circuits, thousands of binary calculation shortcuts, and more all built in. Each small improvement adds up.

From a software perspective, I've wondered for a while whether, as LLM usage matures, there will be an effort to optimize hotspots, like what happened with VMs, or auto-indexing in relational DBs. I'm sure there are common data paths that get more usage and could somehow be prioritized, either through pre-processing or dynamically, helping speed up inference.
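
One simple, real-world version of that "hotspot" idea is caching the work done for prompt prefixes that many requests share (e.g. a common system prompt). A toy sketch, where model.prefill is a made-up stand-in for whatever state-building call a given inference stack actually exposes:

    prefix_cache = {}

    def cached_prefill(model, prefix):
        """Reuse the model's internal state for a prompt prefix we've already
        processed, instead of recomputing it on every request."""
        if prefix not in prefix_cache:
            prefix_cache[prefix] = model.prefill(prefix)   # hypothetical API
        return prefix_cache[prefix]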

Also, GPT-4 seems to include multiple LLMs working in concert. There's bound to be way more fruit to be picked along that route as well. In short, there are tons of areas where improvements large and small can be made.

As always in computer science, the maxim "make it work, make it right, make it fast" applies here as well. We're collectively still at step one.


The 6502, more than any other chip, democratized computing. After that, I would say the Z80 was the next most impactful chip, going off the technological complexity of 8080 systems and the cost of x86 systems prior to the clones. In a way, work products like LLaMA 1 & 2 and Mistral 7B have gone a long way toward democratizing AI in the way the early home computers did, especially with Georgi Gerganov's llama.cpp effort, which cannot be lauded enough.


What's not mentioned here is test-time compute. The idea is that, sure, you can spend a ton of compute on pre-training and fine-tuning, but generation is still difficult. So instead of spending all of your time and power there, how about spending some of it at inference time: have the model generate a bunch of candidate answers, then spend the rest of the time having a model verify what's been generated for correctness. That's the idea behind "Let's Verify Step by Step".

Great video to talk about this: https://www.youtube.com/watch?v=ARf0WyFau0A

In threads on LLMs, this point doesn't get brought up as much as I'd expect, so I'm curious whether I'm missing discussions of it or whether it's just wrong. But I see this as the way forward: models generating tons of answers, other models picking out the correct ones, the combination reaching beyond human ability, and humans doing their own verification afterward.

Edit:

Think of it this way. Trying to create something isn't easy. If I were to write a short story, it'd be very difficult, even if I spent years reading what others have written to learn their patterns. If I then tried to write and publish a single one myself, there's no chance it'd be any good.

But _judging_ short stories is much easier to do. So if I said screw it, read a couple of stories to get the initial framework, then wrote 100 stories in the same amount of time I'd have spent reading and learning more about short stories, I could then go through the 100, pick out the one I think is best, and publish that.

That's where I see LLMs going and what the video and papers mentioned in the video say.
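
Mechanically, that generate-then-verify loop is just best-of-N sampling with a scorer on top; a rough sketch, where generate and score are placeholders for a generator model and a separately trained verifier (both names are made up here):

    def best_of_n(prompt, generate, score, n=100):
        """Sample n candidate answers and keep the one the verifier rates highest.
        generate(prompt) -> str, score(prompt, answer) -> float."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda answer: score(prompt, answer))

All the extra compute is spent at inference time rather than on a bigger training run, which is the trade-off described above.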


Isn't that just a GAN, but with LLMs?


I'm curious whether limits of, like, thermodynamics won't play a part here. Or maybe also ecological limits: how long will we allow corporations to use essential, scarce resources to train models without paying their fair share? [0]

I'm not an expert here either, but I wonder if there will be the same "leap" we saw from GPT-3 to GPT-4, or if there's a diminishing curve to performance, i.e. adding another trillion parameters has less of a noticeable effect than the first few hundred billion.

[0] https://fortune.com/2023/09/09/ai-chatgpt-usage-fuels-spike-... -- I am fairly certain they paid for that water, but it was not a commensurate price given the circumstances, and if they had been required to ask first, a reasonable environmental stewardship organization would have said no.


The main assumption of techno-optimism is that a large enough computer can do anything people can do and it can do it better. The goal of techno-optimism is to create a mechanical god that will rule the planet and scaling LLMs is a stepping stone to that goal.

I, of course, already know how to do all this for a mere $80B.


I think I can do it far, far cheaper than that... and I've been talking about it, in public, for over a decade.[1] I could easily be wrong, of course. It really all depends on how much power a 4x4 LUT and a 4-bit latch leak, how much energy it takes to clock data through them for a cycle, and how fast they can be cycled. If the numbers are good, this thing will be amazingly cool. I can't find good numbers anywhere.

If you've made chips with latches and LUTs, any performance data you can share, no matter how old, would be helpful.

It's an idea that's been bouncing around in my head since reading George Gilder's call to waste transistors. Imagine the worst possible FPGA: no routing hardware, and slowed down even more with a latch on every single LUT. Optimize it slightly by making cells with 4 bits in and 4 bits out (64 bits of programming per cell), with the cells clocked in 2 phases, like the colors of a chess board. This means that each white cell has static inputs from the black cells... and is thus fully deterministic, and easy to reason about. The complement happens on the other phase. Together, it becomes Turing complete.

The thing is, it does computing with NO delays between compute and memory; all the memory is effectively transferred to compute every clock cycle. The latency sucks because it takes N/2 cycles to get data across an NxN grid, but you get an answer every clock cycle after that.
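
If it helps to picture the chessboard phasing, here is a deliberately over-simplified toy in Python (far cruder than the real thing; which neighbor bit feeds which LUT input is an arbitrary choice here), just to show why each updating cell only ever sees static inputs:

    import random

    N = 8   # a toy N x N grid
    # Each cell: a 16-entry lookup table mapping 4 input bits -> 4 output bits
    # (64 bits of program per cell), plus a latched 4-bit output held in `state`.
    lut   = [[[random.randrange(16) for _ in range(16)] for _ in range(N)] for _ in range(N)]
    state = [[0] * N for _ in range(N)]

    def inputs(st, x, y):
        """Gather one bit from each of the four neighbors (edges read as 0)."""
        bits = 0
        for i, (dx, dy) in enumerate([(-1, 0), (1, 0), (0, -1), (0, 1)]):
            nx, ny = x + dx, y + dy
            if 0 <= nx < N and 0 <= ny < N:
                bits |= ((st[ny][nx] >> i) & 1) << i
        return bits

    def tick(st, phase):
        """Update only the 'white' or 'black' cells this half-cycle; the other
        color holds its latched value, so every updating cell sees static inputs."""
        nxt = [row[:] for row in st]
        for y in range(N):
            for x in range(N):
                if (x + y) % 2 == phase:
                    nxt[y][x] = lut[y][x][inputs(st, x, y)]
        return nxt

    for step in range(2 * N):   # data needs on the order of N/2 full cycles to cross
        state = tick(state, step % 2)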

Imagine a million GPT-4 tokens/second.... not related to each other, of course, but parallel streams, interleaved as the data streams across the chips.

Imagine a bad cell.... you can detect it, and route around it. Yields don't have to be 100%.

The extreme downside is that tools for programming this thing don't exist; VHDL etc. aren't appropriate, so I'm going to have to build them. I've been stuck in analysis paralysis, but I've decided to try to get Advent of Code done using my BitGrid simulator. I hope to be done before it starts again next December. ;-)

[1] https://bitgrid.blogspot.com/

[2] https://github.com/mikewarot/Bitgrid


I think $80B is very cheap because the outcome is going to be a technological utopia. Honestly, I think my price is a bargain deal for the inhabitants of Earth.


> I think $80B is very cheap because the outcome is going to be a technological utopia.

I wish I shared your optimism. However I've seen no evidence that society is prepared to deal with a large swath of jobs being obsoleted by AI. I have no doubt that the "haves" will call it a technological utopia, but I strongly suspect the "have nots" will be larger than ever.


In my architecture everyone is treated equally like an idiot so there are no haves and have not because the machine god treats all people the same and provides for all their needs within the "panoptic computronium cathedral"™.


You've prompted me to write a slightly more serious rant on the subject, thank you.

https://snafuhall.com/p/aioracle.html


No problem. Cool site, very minimalist.


I can do it for $20B and a pony


I did the numbers already using a small LLM and it said the real number is $80B and no ponies.


Would that be $80B in 2020 or 2024 dollars? What vintage is your small LLM?


It accounted for inflation, Trumponomics, Bidenomics, and regular economics. I trust the number, it feels right to me.


Only because you're cheating by using the neurons of the pony.


Shh.

We're the only AI company that can offer HorseSense (TM)


"create a mechanical god that will rule the planet" -- on what basis people call this 'optimism'?!


They don't. They say things like "infinite resources for everyone" and "compute/electricity too cheap to meter" and so on and so forth. I have distilled the techno-optimist manifesto down to what it actually looks like in reality, i.e. a global panopticon that controls everything with algorithms.

Per usual, I can build this technological panopticon/utopia for a bargain price of $80B. Some people think it can be done for cheaper but they haven't spent as much time as I have on this problem. I have the architecture ready to go, all I need is the GPUs, cameras, microphones, speakers, and wireless data network. The software is the easy part but the panoptic infrastructure is what requires the most capital. The software/brain can be done for maybe $2B but it needs eyes, ears, and a mouth to actually be useful.

The second stage is building up the actuators to bypass people but once the panopticon is ready it won't be hard to build up the robot factories to enact the will of AGI directly via robots acting on the environment.


Happens every time. LLMs, crypto value, stocks, CPU performance, GPU performance, etc.

Anything that has seen continual growth will be assumed to have further continual growth at a similar rate.

Or, how I mentally model it even if it's a bit incorrect: People see sigmoidal growth as exponential.


> it won’t just plateau in performance at some point?

I suspect that we've already seen the shape of the curve: a 1B-parameter model can index a book; a 4B model can converse; a 14B model can be a little more eloquent. Beyond that, no real gains will be seen.

The "technology advancement" phase has already happened mostly, but the greater understanding of theory, that would discourage foolish investments hasn't propagated yet. So there's probably at least another full year of hype cycle before the next buzzword is brought out to start hoovering up excess investment funds.


I enjoyed Tom Scott's YT video monologue about this. To summarize, he postulates that most major innovations follow a sigmoid growth curve, wherein they ramp up, explode, and then level off. The question then becomes, where are we on this curve? He concludes that we will probably only know in hindsight.

https://www.youtube.com/watch?v=jPhJbKBuNnA


I think part of it is that some people say a person is about 20 petaflops of compute.

So if we have that much compute power already, why can't we just configure it in the right way to match a human brain?

I'm not sure I totally buy that logic, though, since I'd think the architecture/efficiency of a brain is way different from a computer's.


what if we already did and all that's missing is the skeletomuscular system?


They are going to add more abilities onto the system, for example tool use (toolformers) or goal planning (like the recent Q* stuff at OpenAI people are talking about). This will make the overall product very powerful.

But even if you're looking just at the LLM, it seems like there are still a lot of ways it can be improved.


Because no class of software was as good as it'll ever be at launch -- in other words, there is a normal expectation of improvement after introduction in the software world.


> How do we know it won’t just plateau in performance at some point?

We don't.

But that's also the sort of thing you can't say when seeking huge amounts of funding for your LLM company.


It will plateau at best. But the crowd is never smart, yet it can scream.



