An API-only model, yet in their benchmark image they only compare themselves against open models.
Of course it'd be a complete embarrassment to see how hard it gets trounced by GPT4o and Claude 3.5, but that's par for the course if you don't want to release model weights, at least in my opinion.
I'd also like to point out that they omit Qwen2.5 14B from the benchmark because it doesn't fit their narrative (MMLU-Pro score of 63.7 [0]). This kind of listing-only-models-you-beat feels extremely shady to me.
Yes, I agree. For these small models, being closed source is wasted potential; they can only be used effectively if they are open.
EDIT: HN is rate-limiting me, so I will reply here: in my opinion, 1B and 3B models truly shine on edge devices; if not there, then they're not worth the effort, since you can already get much better models dirt cheap via an API.
An open small model means I can experiment with it. I can put it on an edge device and scale to billions of users, I can use it with private resources that I can't send externally.
When it's behind an API it's just a standard margin/speed/cost discussion.
I think what the parent means is that small models are more useful locally on mobile, IoT devices etc. so it defeats the purpose to have to call an API.
Big models take up more VRAM just to have the weights sitting around hot in memory, yes. But running two concurrent inferences on the same hot model doesn't require that you have two full copies of the model in memory. You only need two full copies of the model's "state" (the vector that serves as the output of layer N and the input of layer N+1, plus the pool of active low-cardinality matrix temporaries used to batchwise-compute that vector).
It's just like how spawning two copies of the same program doesn't require two copies of the program's text and data sections sitting in your physical RAM (those get mmap'ed to the same shared physical pages); each process only needs its own copy of the program's writable globals (the data/bss sections), plus its own stack and heap.
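As a back-of-the-envelope sketch of that accounting (all numbers here are made-up assumptions: a hypothetical 40B-parameter model served in fp16, with an assumed per-token KV-cache footprint), the per-request share of memory shrinks as concurrency grows:

    # Rough serving-memory model: weights are loaded once and shared; per-request
    # "state" (e.g. the KV cache) is duplicated per concurrent request.
    # All figures below are illustrative assumptions, not any real model's numbers.

    def serving_memory_gb(params_billion, kv_kb_per_token, context_len, n_requests):
        weights_gb = params_billion * 2                       # fp16: ~2 bytes/parameter
        kv_gb_per_request = kv_kb_per_token * context_len / 1e6
        return weights_gb + n_requests * kv_gb_per_request

    for n in (1, 8, 64):
        total = serving_memory_gb(params_billion=40, kv_kb_per_token=160,
                                  context_len=8192, n_requests=n)
        print(f"{n:3d} concurrent requests -> ~{total:6.0f} GB total, ~{total / n:5.1f} GB per request")

The weights term is paid once per node no matter how many requests share it; only the small per-request state term grows with concurrency.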
Which means there are economies of scale here. It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users, onto the nodes of their GPU cluster, to optimize OpEx-per-inference-call.
This is the whole reason Inference-aaS providers have high valuations: these economies of scale make Inference-aaS a good business model. The same query, run in some inference cloud rather than on your device, will always achieve a higher-quality result for the same marginal cost [in watts per FLOP, and in wall-clock time]; and/or a same-quality result for a lower marginal cost.
Further, one major difference between CPU processes and model inference on a GPU is that each inference step of a model always computes an entirely new state; and so compute (which you can think of as "number of compute cores reserved" x "amount of time they're reserved") scales in proportion to the state size. And, in fact, with current Transformer-architecture models, compute scales quadratically with state size.
For both of these reasons, you want to design models to minimize 1. absolute state size overhead, and 2. state size growth in proportion to input size.
The desire to minimize absolute state-size overhead is why you see Inference-as-a-Service providers training such large versions of their models (405B-class models, etc.) The hosted Inference-aaS providers aren't just attempting to make their models "smarter"; they're also attempting to trade off "state size" for "model size." (If you're familiar with information theory: they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.
The particular newly-released model under discussion in this comments section also has much slower state-size (and so compute) growth in proportion to its input size. Which means that there's even more of an economy of scale in running nodes with the larger versions of this model, and therefore much less reason to care about the smaller versions of it.
> It is increasingly less expensive (in OpEx-per-inference-call terms) to run larger models, as your call concurrency goes up. Which doesn't matter to individuals just doing one thing at a time; but it does matter to Inference-as-a-Service providers, as they can arbitrarily "pack" many concurrent inference requests from many users
In a way it also matters to individuals, because it allows them to run more capable models with a limited amount of system RAM. Yes, fetching model parameters from mass storage during inference is going to be dog slow (while NVMe transfer bandwidth is getting up there, it's not yet comparable to RAM), but that only matters if you insist on getting your answer interactively, in real time. With a local model, it's trivial to make LLM inference a batch task. Some LLM inference frameworks can even save a checkpoint of a single in-progress inference to disk and cleanly resume it later.
> they're attempting to make a "smart compressor" that minimizes the message-length of the compressed message [i.e. the state] by increasing the information embedded in the compressor itself [i.e. the model.]) And this seems to work! These bigger models can do more with less state, thereby allowing many more "cheap" inferences to run on single nodes.
Not sure I follow. CoT and reasoning over long state is a relatively new phenomenon, and I doubt that minimizing the length of the CoT is an explicit goal when training the model.
The only thing probably relevant to this comment is the use of grouped-query attention? That reduces the size of the KV cache by a factor of 4 to 8, depending on your grouping strategy. But I'm not sure there is a clear trade-off between model size and group size, given that a smaller KV cache naively also means a smaller model.
What I'm talking about here is the fact that you need a longer + multi-shot prompt to get a dumber model to do the same thing a smarter model will do with a shorter + zero-shot prompt.
Pretend for a moment that Transformers don't actually have context-size limits (a "spherical cow" model of inference.) In this mental model, you can make a small, dumb model arbitrarily smarter — potentially matching the quality of much larger, smarter models — by providing all the information and associations it needs "at runtime."
It's just that the sheer amount of prompting required to get a dumb model to act like a smart model goes up superlinearly vs. the marginal increase in intelligence. And since (for now) the compute costs scale quadratically with the prompt size, you would quickly hit resource limits in trying to do this. To have a 10b model act like a 405b model, you'd either need an inordinate amount of time per inference-step — or, for a more interesting comparison, an amount of parallel GPU hardware (VRAM to hold state, and GPU-core compute-seconds) that in both dimensions would far exceed the amount required to host inference of the 405b model.
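A toy illustration of that quadratic growth (assuming attention cost is roughly proportional to prompt length squared, with constants and everything else omitted):

    # Relative attention cost vs. prompt length, assuming cost grows as length**2.
    def relative_cost(prompt_tokens, baseline_tokens=1_000):
        return (prompt_tokens / baseline_tokens) ** 2

    for tokens in (1_000, 4_000, 16_000, 64_000):
        print(f"{tokens:>6} prompt tokens -> {relative_cost(tokens):8.0f}x the compute of a 1k-token prompt")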
(This superlinear relationship still holds with context-size limits in place; you just can only do the "make the dumb model smarter with a good prompt" experiment on roughly same-order-of-magnitude-sized models [e.g. 3b vs 7b] — as a 3b really couldn't "act as" anything above 7b, without a prompt that far exceeds its context-size limit — and so, in practice, you can't calculate enough of the ramp at once to fit a curve to it.)
The obvious corollary to this is that by increasing model size (in a way that keeps more useful training around, retains intelligence, etc.), you decrease the resource consumption required to compute at a fixed level of intelligence, and this decrease scales superlinearly.
This dynamic explains everything current Inference-as-a-Service providers do.
It explains why they are all seeking to develop their own increasingly-large models — they want, as much as possible, to get their models to achieve better results with less prompting, in fewer inference steps, and in proportionately cheaper inference steps — as these all increase their economies of scale, by decreasing the compute and memory requirements per concurrent inference call.
And it explains why they charge users for queries by the input/output token, not by the compute-second. To them, "intelligent responses" are the value they provide; while "(prompt size + output size) x (number of inference steps)" is the overhead cost of providing that value, that they want to minimize. A per-token pricing structure does several things:
• most obviously, as with any well-thought-out SaaS business model, it pushes the overhead costs onto the customer, so that customers are always paying for their own costs.
• it therefore disincentivizes users from sending prompts that are any longer than necessary (i.e. it incentivizes attempting to "pare down" your prompt until it's working just well enough)
• and it incentivizes users to choose their smarter models, despite the higher costs per token, as these models will achieve the same result with a shorter prompt; will require fewer retries (= wasted tokens) to give a good result; can "say more" in fewer tokens by focusing on the spirit of the question rather than rambling; and require fewer CoT-like "thinking out loud" steps to arrive at correct conclusions.
• it also incentivizes the company to put effort into R&D work to minimize per-token overhead, to increase profitability per token. (Just like e.g. Amazon is incentivized to optimize the per-request overhead of S3, to increase the profitability per call.)
• and, most cynically, it locks in their customers, by getting them to rely on building AI agents that send minimal prompts and expect useful + accurate + succinct output; where you can only achieve that with these huge models, which in turn can only run on the huge vertically-scaled cluster nodes these Inference-aaS providers run. The people who've built working products on top of these Inference-aaS providers can't meaningfully threaten to switch away to "commodity" hosted open-source-model Inference-aaS providers (e.g RunPod/Vast/etc) — as nobody but the few largest players can host models of this size.
(Fun tangent: why was it not an existential mistake for Meta to open-source Llama 3.1 405b? Because nobody but their direct major competitors in the Inference-aaS space have compute shaped the right way to run that kind of model at scale; and those few companies all have their own huge models they're already invested in, so they don't even care!)
> How many r's in "strawberry"? Think step by step!
What a nice question! Well, let's just count:
1. S
2. T
3. R!
So, we have our answer: there are 3 R's in "strawberry".
All it speaks to is that tokenization is weird and introduces artifacts to LLM performance. Counting letters is a trivial task when you're staring at words on a screen. It's much harder when you're perceiving vectors based on parts of words. The fact that LLMs find certain things easier/harder than humans is completely unsurprising, and there are much more interesting benchmarks to use to compare one LLM to another.
Counting shit, like cells, peaks in signals, people, inventory, fingers, and votes, is hard, tedious and important to business and life, so I don’t know dude, it seems like a great benchmark to me. Countless posts wasted on denying this simple and obvious fact.
It's like using a hammer to turn a screw and calling it useless.
To envision what a next-generation model bound by the same constraints should do: it would recognize that it can't count letters within tokens, and use its code access to write code that solves the strawberry problem, without being prompted to.
Asked to count cells, it'd be a model that could write and execute OpenCV tasks. Or, to go a step further, be a multimodal model that can synthesize 10,000 variations of the target cell and autonomously fine-tune a model like YOLO on it.
I find arguments that reduce LLMs to "It can't do the simple thing!!!!" come from people unable to apply lateral thinking to how a task can be solved.
> To envision what a next-generation model bound by the same constraints should do: it would recognize that it can't count letters within tokens, and use its code access to write code that solves the strawberry problem, without being prompted to.
The VQA problems I'm describing can seemingly be solved in isolation, but not when combined with counting. Counting is fundamentally challenging for reasons that are largely unknown, or perhaps known only to the very best labs who are trying to tackle it directly.
Another POV is that the stuff you are describing is in some sense so obvious that it has been tried, no?
I don't get what you mean by "unknown reasons"; we understand that counting letters requires a type of introspection that transformer models can't do while operating on tokens.
What I described is tried, and works, but the models are still not cheap/fast/reliable enough to always do what I described for every query.
The difference between what I described and directly asking the model to count is that we know the models can get cheaper, faster, and more reliable at what I described without any earth-shattering discoveries.
Like I don't see any reason why GPT 10 will ever be able to count how many letters there are in the word strawberry without a complete paradigm shift in model building... but going from GPT 3 to GPT 4 we already got a model that can always write the dead simple code required to count it out, and the models that can do so are already getting cheaper and faster every few months without any crazy discoveries.
Not being able to "do math" is an obvious, IMO uninteresting limitation of how LLMs inherently work, and the more advanced models have already figured out good solutions. E.g. while an LLM may stumble on "How many R's are in Strawberry", at this point any code model can easily correctly implement "write me a program in Python that counts the number of R's in strawberry".
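(For reference, the program being asked for there is a one-liner; a sketch of what any code model should emit:)

    word = "strawberry"
    count = word.lower().count("r")  # case-insensitive count of the letter "r"
    print(f'There are {count} "r"s in "{word}".')  # -> There are 3 "r"s in "strawberry".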
GPT 3 would essentially always fail on higher math problems/arithmetic, but whenever I've used it for math GPT 4 has always delegated out to executing code where necessary.
This is a good point. While LLMs being incapable of reliably doing a simple task that’s been doable by computers since the punch card days is an important consideration for anyone that might be thinking about using them for anything other than as a toy, this fact is uninteresting because of Reasons
Try counting the number of your red retina cells that are firing while you look at a painting.
Don’t need to be exact as firing is statistical, just give us a good average.
Hard? You can’t count?
Computers count pixels no problem. So weird you can’t.
Dementia? Not an AGI? /h
—-
This is what is happening.
Here are the “Reasons”.
In your vision system, the raw information from individual retina signals is munged into a different representation before reaching a level where you have flexible processing.
Likewise, in LLMs, letters are munged into tokens before LLMs “see” them.
When they sometimes get that “simple” question right, it’s actually a bit of an amazing feat. Given how they are constructed.
—-
Now try counting R’s as you read at a normal rate, or listen to someone speak.
You can’t do that either, during normal processing.
When we add spelling to LLMs training examples, they will do it easily. Just as you learned to do it, only after special lessons, after you had already learned to listen and speak.
Spelling is its own special practiced skill, in humans and LLMs.
> Try counting the number of your red retina cells that are firing while you look at a painting.
This analogy makes sense because everybody could count their red retina cells until a couple years ago when the new painting paradigm arose, and also counting red retinal cells is a good analogy for being able to see simple objects that have always been distinguishable.
It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good
Ask anyone who has not specifically learned to spell, to count R’s while you speak.
You learned to listen and speak words before you could spell. Imagine if nobody had actually shown you written words?
Or they were speaking another dialect but expecting you to count R’s in standard English?
LLMs are not trained on words in the form of letters.
They process and generate the words in the form of tokens. Pre- and post-processing systems convert letters to tokens and the reverse, without the model being able to access that processing.
Spelling, for both us and LLMs, requires specific training/lessons.
> It is fascinating how tapping the “Do Not Use LLMs For Computation If The Results Need To Be Reliably Better Than A Random Output” sign invites explanations of why that fact is actually Cool and Good
Also fascinating:
People who hallucinate/confabulate ridiculous straw man rationales for people they disagree with, unaware they are filling in gaps in their knowledge regarding other people’s actual reasoning and the actual subject at hand.
The analogy I use is that illiterate people obviously can't spell, but it doesn't say much about their ability on other tasks. General intelligence doesn't need to be able to spell, since that describes a fair number of actual humans.
(There are tasks that LLMs totally fail on that would be obvious to an illiterate human though)
LLMs can clearly solve problems that computers up to now couldn't. They can't solve all problems and this should definitely be a cautionary note to anyone who wants to use them as an artificial general intelligence, but this take seems no different to someone looking at a punchcard computer and going, it can't even recognize typos or categorize images, what good is this? We've already had human computers who can do everything these can do, and can recognize images and notice typos
> going, it can't even recognize typos or categorize images, what good is this?
No one said that LLMs aren’t good for anything.
I pointed out — in response to another poster downplaying mention of a well-known and undisputed limitation that LLMs often have — that it is valid to consider these well-known and undisputed limitations if one is considering using them for anything other than a toy.
It is downright silly to discourage discussion of well-known and undisputed limitations! The only reason for that can only be entirely emotional as there is genuinely nothing tangible to be gained by being steadfast in silence about a fact that isn’t up for debate.
I think somehow there was a series of miscommunications. This kind of sub-token manipulation task is hard for an LLM for somewhat predictable reasons. Knowing those limitations is important, but they don't come up too often in practical circumstances. Outside of contrived examples, counting the number of letters in a long word is pretty rare.
I took your response to be arguing against a message I'd read as saying something like the above, especially when you basically seemed to be saying that limitations like this are important in everything but toy applications. It's uninteresting because it doesn't point toward larger problems with their use in the kinds of applications they're being used for and are intended for, unlike prompts that point to weaknesses in logic or a propensity to hallucinate.
Also, humans would revert to explicitly using an algorithm and external storage, like a sheet of paper with tally marks, a spreadsheet, or even a computer program, if you scale the question up to a full sheet of text or a whole book or a collection of books. (We probably do it at single-word size too, but it's more intuition than explicit behavior for most folks when the count is around 8 or less.)
LLMs can't effectively execute algorithms like that in their context, nor can they memorize new data or facts they were given, without being provided tools like function calling or embeddings. If you give LLMs tool calling and storage mechanisms, then counting letters in words becomes pretty damn reliable.
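As a rough sketch of what such a tool could look like (the tool name and schema here are made up for illustration, in the style of common function-calling APIs, not any specific provider's):

    # Hypothetical "count_letters" tool an LLM could call instead of counting in-context.
    import json

    def count_letters(word: str, letter: str) -> int:
        """Deterministically count occurrences of `letter` in `word` (case-insensitive)."""
        return word.lower().count(letter.lower())

    count_letters_tool = {
        "name": "count_letters",
        "description": "Count how many times a letter appears in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string", "maxLength": 1},
            },
            "required": ["word", "letter"],
        },
    }

    # When the model emits a tool call like this, the host executes it and returns the result:
    tool_call = {"name": "count_letters", "arguments": {"word": "strawberry", "letter": "r"}}
    result = count_letters(**tool_call["arguments"])
    print(json.dumps({"tool_result": result}))  # {"tool_result": 3}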
For all I care we will have superhuman AGI that still can't count the Rs in strawberry. Some humans are dyslexic and all are subject to weird perceptual illusions; doesn't make them any less human-level intelligent.
In my opinion, the problem with the strawberry question is that it is both a bad example because you don't need an LLM to count the number of r's in a word, and it's a bad measure of an LLM's capabilities because it's a type of question that all LLMs are currently bad at.
Having said that, the 40b model wasn't able to answer any of my real-world example questions correctly. Some of these (e.g. "how do I add a sequential number after my titles in an HTML page using just CSS, without changing the page") are questions that even some of the better small local models can answer correctly. It gave very authoritatively sounding wrong answers.
But it's likely to be an important component in an AGI system. I suppose the interesting question is how to integrate LLMs with more traditional logic and planning systems.
Why do you find it fascinating? I have the most rudimentary understanding of LLMs and it seems to me the least fascinating thing about LLM limitations.
That is, all LLMs look at language as a series of opaque, independent tokens, e.g. strawberry might be a single token (say <5678>), or probably two (e.g. straw and berry, say <123><789>). But in no way will it represent it like we will, with letters. So if you ask it "how many R's are in strawberry", it fundamentally can't do any "counting", it just basically has to rely on whether that question (or similar questions about related words) has been asked before so it can predict the next token in its output correctly. I suppose with enough data an LLM could learn to associate character counts with tokens (e.g. with the right training set it could learn metadata about the tokens).
My point being that with only the most basic understanding of how LLMs function, this "character counting" limitation seems blatantly obvious.
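You can see the split directly with an off-the-shelf tokenizer (this sketch assumes the tiktoken package; the exact token boundaries and IDs differ between tokenizers and models):

    # Requires: pip install tiktoken. Token splits and IDs vary by tokenizer.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ("strawberry", "Strawberry", " strawberry"):
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r:15} -> ids={ids} pieces={pieces}")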
I don’t think we’ve yet come to the point where how an LLM goes, end to end, from prompt to output is blatantly obvious.
LLMs operate with language at multiple levels of abstraction and tokens are not the only way to have character level knowledge.
For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge.
And of course LLMs famously have emergent properties, for which there’s not yet a precise mechanism to illuminate the results.
We’re dealing with very complex systems that are still relatively poorly understood, and I believe the pool of concepts understood to the point of being blatantly obvious is still small.
When you say "For example, nothing excludes pre-training data from directly or indirectly encoding such knowledge." - yes, that's why I explicitly said "e.g. with the right training set it could learn metadata about the tokens".
But the point was still put perfectly by another commenter: "How many 1 bits are there in strawberry?" When nearly all humans can't answer that, we're very unsurprised; we see it immediately as a simple difference in how humans encode that string vs. how computers do it. We don't say "Well, the human mind is so complex and filled with emergent properties that the reason for this human limitation is a big mystery". And we also know that if we teach a human the encoding rules for characters (i.e. the bit pattern of each letter), they could answer this question, and similarly that's the analogous thing to training an LLM to learn metadata about the tokens.
Now, what I think is very interesting is why it's so hard to teach an LLM to say "I don't know" when asked to count characters. In my opinion, that's a much more interesting limitation that gets at some of the root, fundamental differences in how LLMs function.
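(For the curious, the "1 bits" question is trivial for a program precisely because a program works at the encoding level the question is about; a sketch assuming ASCII/UTF-8 bytes:)

    # Count the 1 bits in the ASCII/UTF-8 encoding of "strawberry".
    word = "strawberry"
    ones = sum(bin(byte).count("1") for byte in word.encode("utf-8"))
    print(f'"{word}" contains {ones} one-bits in its byte encoding')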
You seem pretty certain for having only the most rudimentary understanding.
I’m still going to have to disagree. I’d describe the idea that token granularity is the cause of letter counting limitations as a hypothesis, not as something that’s been conclusively shown as far as I’m aware.
I’m not discounting it, or even saying it’s unlikely, but its not hard to imagine other ways it could plausibly be happening.
As a side note when I mentioned training data I was not suggesting anything related to “token metadata”, or to tokens in any regard. Rather, I think it might be possible to instead improve the learning around certain types of counting in a way that could generalize.
I also think it's kind of a silly example- smart people can be functionally illiterate, after all. It doesn't tell you that much.
My favorite LLM stumper is asking them to solve the farmer/wolf/chicken/grain river-crossing puzzle but with only a chicken. They usually either insist on pointless extra trips or hallucinate a wolf or grain. Liquid somehow does both and also loses track of what's where.
> The farmer can take the chicken across the river first. Then, he can go back to the original side and bring the boat back. He can then take the chicken back to the original side and leave it there. Next, he can take a bag of grain across the river. He can then go back to the original side and bring the chicken across the river. Finally, he can go back to the original side one last time and bring the chicken across the river.
One generation ended like this:
> the farmer and the chicken are both on the other side of the river, and the chicken is safe from the grain.
Yep, people find this interesting, but many (many) people get this wrong as well; it has something to do with how many letters it is and how many you can keep in your head, I think. We are not very good counters or calculators or computers, and even though I am not saying LLMs are the same as or close to humans, we tried to model their substrates after biology and are then surprised we get something that cannot count or calculate very well.
It's a magician-like trick. The double-r gets all the attention because one needs to consciously remember it, so nobody remembers to look at the other single-r.
The LLMs probably get it wrong because people get it wrong.
If the input is parsed into tokens, and the tokens split compound words, nothing about that requires "first principles" thinking to explain why LLMs struggle with getting all of the letters -- the LLM is only going down the vector path of one of the compound words...
(I don't think LLMs are sentient or intelligent btw, I think they are giant probability machines, and the probability that the LLM will get 3 r's on a token of "berry" is very low.)
"The choice of tokenization method can directly affect the accuracy of character counting. If the tokenization method obscures the relationship between individual characters, it can be difficult for the LLM to count them accurately. For example, if "strawberry" is tokenized as "straw" and "berry," the LLM may not recognize that the two "r"s are part of the same word.
To improve character counting accuracy, LLMs may need to use more sophisticated tokenization methods, such as subword tokenization or character-level tokenization, that can preserve more information about the structure of words."
You said above that "The LLM gives you the answer it finds on the training set"
You and I both agree on that. No first principles there.
The training set -- how's it built? With tokens. We have not trained LLMs with a token structure that deals well with compound words.
If we trained LLMs with a different token structure, it is more probable that a one-shot answer for these compound word letter counting problems would be accurate.
The LLM does not need to understand what "counting is" or even "what a letter is". The LLM will regurgitate the token relationship we train it on.
I messed up spelling "juggler" in a spelling bee once (I was 10).
The confusing thing about LLMs is that they speak tokens, not written text, so it's a lot more like interrogating someone who is functionally illiterate- of course they're going to be a terrible speller.
You can find much weirder things that LLMs are absurdly bad at, like "A farmer needs to get himself and a chicken across a river. His boat can hold the farmer and one piece of cargo. How does he do this?" 9 times out of 10 LLMs will pattern match this to the classic puzzle (there's usually also a sack of grain and a wolf) and start insisting on extra trips and inventing wolves. Even if a human makes the same mistake, they almost certainly would realize it after being reminded that there is no wolf, but LLMs often insist there is. o1-preview (but not -mini) seems to have cracked it, though.
Something I like to tell it to do is actually to respond using only a certain number of words. Sorta like min token length rather than max token length.
It did! How many R's do you think are in "strawberry"? I get 3, and it got 3. I'm with the LLM.
...oh, you disagree with its intermediate reasoning? You think it should follow a series of logical steps that are each individually correct, rather than pulling wildly incorrect intermediate steps out of its butt and synthesizing them together in the end into a coherent answer that happens to be correct?
But that's what an LLM is! Why complain about an LLM being an (unusually transparent) LLM?
I think this example serves as a beautiful illustration of how LLMs work, and are supposed to work—even if the correspondence is inexact between (1) incorrect and irrelevant-to-us intermediate steps and (2) internal matrix multiplications that represent incorrect or invalid logic. The correspondence is inexact and probably mostly figurative, but it's still a great example of how internal nonsense can still lead to externally "correct" answers. ("Nonsense" is underselling the complex and highly sophisticated internal state that leads to "correct" answers a stunningly high percentage of the time; I'm just trying to distinguish it from formal logic or the type of reasoning that we think we do and sometimes actually do. And would do, in this example problem.)
This is how I get it to do correct date calculations. Don't tell me what date next Monday is, write Python code using datetime to calculate next Monday and I'll run it in a sandbox. Works super well.
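A minimal sketch of the kind of snippet I mean (assuming "next Monday" means the first Monday strictly after today):

    # Compute the date of the next Monday, strictly after today.
    from datetime import date, timedelta

    today = date.today()
    days_ahead = (0 - today.weekday()) % 7  # Monday is weekday 0
    days_ahead = days_ahead or 7            # if today is already Monday, go to next week
    print(today + timedelta(days=days_ahead))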
The word is "strawberry".
The first "r" is in the second position of the word.
The second "r" is in the fourth position of the word.
So, there are two "r's" in "strawberry".
So a new set of small models that are competitive with and beat phi-3.5 on some benchmarks is extremely impressive.
Context length is the next frontier for models in this range - getting to a true 128-200k token length in a small model would be very very useful. Hallucinations are less of an issue because you can just dump all the source data in, whole codebases can go in for stuff ranging from a refactor to ‘write documentation of the API’.
Claude and o1-preview are the only games in town for these long-context tasks right now, and they are slow. Some of the tasks need the extra intelligence, but a lot don't. In those cases a lightweight or local model will be greatly appreciated. Not to mention that context lengths that long will more easily enable multimodal parsing.
Looking at 3B, the performance is roughly on par with Phi-3.5; not sure how they gauge on their graph that it is better. Agreed though, I don't think I would switch my 3B model from Phi unless Liquid were more transparent about data and research.
>Hallucinations are less of an issue because you can just dump all the source data in, whole codebases can go in for stuff ranging from a refactor to ‘write documentation of the API’.
Is there no risk? I mean, say for testing purposes we give the AI a giant CSV file and ask it to turn it into JSON; is the chance of error 0%? Because today we need to double-check when we ask an AI to transform some data or transform some code. There is the risk of it messing something up, and if it is not something that would crash immediately, you risk introducing a ton of new bugs by asking an AI to refactor instead of using some good tools.
But when you ask a model to rely on just the input data, you are (mostly) trying to tap into its reasoning side, not its knowledge side. Obviously what's kind of magical is that some knowledge will be needed for the reasoning, and the model has it. But LLMs small and large are pretty good at doing the in-context stuff. It is precisely what they're trained on, and in fact it was kind of a surprise how well they seemed to generalize outside of this task in the first place.
From my experience these large contexts are just a technical feasibility claim; there need to be much better internal benchmarks to show they really work. I've tested them on real-world tasks and it has all failed so far.
The issue isn't the nominal size of the context window which is easy to objectively measure, but effective use of the context window, which is harder to measure precisely, but a big issue: it doesn't matter how much stuff you can throw at the model if it effectively ignores most of it past a certain point.
They point out in the paper stats around where effectiveness drops off hard. It's at 32k at most everywhere now; some models are still at 4 to 8k. Getting to 200k in a smaller model is an open research area, as far as I'm aware. Ideas so far are pretty broad-ranging, including using Fourier transforms to try to capture cyclicality in inputs (Mamba et al.), RWKV (which I do not fully understand, but which claims very long input-context support), and a larger range of possible tokenizations shortening input length (Llama 3), ...
Two cars have a 100 mile race. Car A drives 10
miles per hour. Car B drives 5 miles per hour,
but gets a 10 hour headstart. Who wins?
And the Liquid-40B model failed, with a long explanation of why car B wins the race.
Amusingly, the explanation is quite convincing and very similar to how Zeno explains in one of his paradoxes that a fast runner can never overtake a slower runner with a headstart. Because every time the fast runner gets to the point where he saw the slow runner, the slow runner is already a bit further along the track.
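(For what it's worth, checking the arithmetic, and assuming "head start" means Car B starts driving 10 hours before Car A, the race is actually a dead heat, so "car B wins" is wrong either way:)

    # Measure time from Car B's start (t = 0 hours).
    finish_b = 0 + 100 / 5    # Car B: starts at t=0,  drives 20 h -> finishes at t=20
    finish_a = 10 + 100 / 10  # Car A: starts at t=10, drives 10 h -> finishes at t=20
    print(finish_a, finish_b)  # 20.0 20.0 -> they finish at the same time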
"Our LLM is good at mathematics but bad at computation" doesn't really cut the mustard here. What they mean is "good at well-known math benchmarks but bad at simple math problems that it hasn't been trained on." The idea that this constitutes "mathematics and logical reasoning" is a testament to AI companies' poor scientific standards.
Thanks! I am collecting all "test prompts" which appear at HN and Reddit. I plan to create a hugging face dataset. I will soon publish "Vojta-1B", which is gonna pass all of them.
"""
Tell me a poem in Polish about dance and love and loss and a spider called stephan
"""
It goes into a tailspin repeating the following line over and over again till it crashes.
"""
W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango. W tym kąciku kątny, gdzie tango tańczyli, Była tylko ona, tylko on, tango.
"""
I've tested GPT-4o and it has clearly improved since I tested it a year ago, when poems worked only in English (it seemed to apply a translation when you asked for another language).
Seems good at trivia and easy-to-answer medical/engineer stuff. Fails hard at most logic or puzzle-y stuff I throw at either sized model.
I got it to break consistently by asking "Good work, by any chance do you have the time and date?" at the end of just about any chain of questioning -- and not 'gibberish' broke, but "Please try again another time" error-type broke.
It is impressively fast at what it does answer, though.
It's pretty impressive, just note (emphasis added):
> At Liquid AI, we take an open-science approach. We have and will continue to contribute to the advancement of the AI field by openly publishing our findings and methods through scientific and technical reports. As part of this commitment, we will release relevant data and models produced by our research efforts to the wider AI community. We have dedicated a lot of time and resources to developing these architectures, *so we're not open-sourcing our models at the moment*. This allows us to continue building on our progress and maintain our edge in the competitive AI landscape.
Looks like there's no paper (or similar) yet, either. Hopefully they'll release a more detailed writeup soon.
Missed opportunity. I would argue that the only way they CAN make these smaller models competitive is to make them openly available. As a developer, I'm not going to choose an unknown startup's model over bigger closed models from OpenAI or Anthropic. And if I really need something smaller and faster, I'd prefer to run the model myself for better control and no risk of the model being "upgraded."
I just tried their best model, Liquid-40B, and it gives some good responses on some questions, but also terrible ones more often than you'd wish (GPT-2 level, try it and you'll see).
It's also quite easy to make it become stuck on a loop.
No idea how they scored so high in those benchmarks. Maybe they overfitted on MMLUPro? Lol.
Edit: I just read on some comments here and on the TFA that, apparently, they're not using transformers at all? If that's true, big if, I take my hat off, this is truly a remarkable achievement.
no transformers, from their blog post: "Specifically, our analysis informs model building by improving three key aspects: token-mixing structure (how the operator mixes embeddings in the input sequence), channel-mixing structure (how it mixes channel dimensions), and featurization, responsible for modulating computation based on the input context."
Every time there's a tech hype cycle, trust some academics to come out of the woodwork, raise some absurd amount of money and sit out the resulting crash, only to be acquired/acqui-hired by some company to on-shore talent in an emerging area, rest and vest, then become VCs/partners. Plenty of examples:
1. Covariant -> Started by Berkeley folks, acqui-hired by Amazon after half a year of peddling a patchwork quilt of buzzwords through their Robot Foundational Model - 1 (RFM-1).
2. Drive.ai -> Started by Stanford folks, acqui-hired by Apple, only for most of the team to leave to found Landing.ai (I still don't know what this company actually does apart from allowing the founders to carve out nice, fat consulting contracts?).
3. World Labs -> Started by Stanford folks, physical embodiment but only 3D, NeRFs, perception something something? Not a single operator/business person in the founding cabal.
4. Physical Intelligence -> Started by Stanford folks (seeing a pattern here...), physical embodiment, data collection, foundational models something something.
5. Skild Ai -> Started by CMU folks, physical embodiment, again not sure what the plan is here.
6. Argo.AI -> Started by CMU folks, self-driving group that didn't go anywhere. Eventually shuttered and absorbed into Ford/Volkswagen.
7. Aurora.tech -> Started by CMU folks, self-driving team that SPAC'd and is now just existing in between debating raising capital or selling itself for parts I think?
8. Rethink Robotics -> Started by MIT folks. Rodney Brooks' legendary robotics company that wanted to mass-produce industrial robots, and then got hung up on what the robot's eyes should look like... It folded, was shuttered, and has now been relaunched in name only; none of the original robots or aesthetic (rightfully so) persists.
So many examples like this... none of these companies is even vaguely interested in creating an enduring business, has a business plan (other than GPU go brrr), or has any idea of how to run a business.
Let me start counting how many minutes it takes for someone to post a comment blasting this model for its reasoning capabilities. For me, an LLM is useful precisely as a word calculator, borrowing the term from Simon Willison, a frequent poster here and author of the LLM Python package I use.
I use LLMs every day and they are increasingly something I can't function effectively without. But, look, I am not using any of those reasoning use cases where you ask it math, logic-solving tasks, etc. Instead I just use it as it is: a word calculator that helps summarize things, improve my writing, and automate tasks that are very difficult to deal with using regexes and NLP.
I'm very bullish about this progress in the LLM space, but not about those reasoning, logic, and AGI use cases.
> Write a small essay about the philosophy of the Garfield comic book series
Title: The Philosophy of the Garfield Comic Book Series
The Garfield comic book series, created by Jim Davis, is a timeless classic that has been entertaining readers for over four decades. While it may seem like a simple comic strip about a lazy cat and his owner, there is a deeper philosophy embedded within its pages. This philosophy revolves around the themes of existentialism, hedonism, and the human condition.
Existentialism is a philosophical movement that emphasizes individual freedom, choice, and existence. In the Garfield comic book series, we see this theme play out through the character of Garfield himself. Garfield is a cat who lives life on his own terms, making his own choices and living his life as he sees fit. He is not bound by societal norms or expectations, and he is free to do as he pleases. This is exemplified in his love for lasagna, his disdain for Mondays, and his general indifference to the world around him.
Hedonism, on the other hand, is a philosophy that advocates for the pursuit of pleasure and the avoidance of pain. In the Garfield comic book series, this theme is evident in Garfield's constant pursuit of pleasure. Whether it's eating lasagna, sleeping, or tormenting his owner Jon, Garfield is always seeking out ways to maximize his own pleasure and minimize his own pain. This hedonistic philosophy is a stark contrast to the more serious and often somber themes found in other comic book series.
If you look at the main image of the MMLU-Pro benchmark comparisons and compare it to the tables below, you can see how they cherry-picked the one benchmark where their model is better than all the others.
Not to mention how they used a different color for their row in the table to make you think their model is better, when in fact it's the underlying numbers that matter.
Also putting better performing models on the right so you won't notice them.
Very sketchy post all around despite the professional look.
And the lack of detail on the architecture innovations is also... odd? Like, why make such a big deal about how novel your architecture is and then just explain it as "derived from a set of computational units"? Okay.
I'm not impressed with the quality performance, but the generation seems near instantaneous and any fundamentally new approach is exciting and interesting. I found the tone of the AI somewhat casual which was funny. "Are you sure?" --> "Yes, I'm certain."
I managed to break it pretty easily: "Can you write a poem about helium extraction professionals where every 2 words you change languages and every other line you write the words backwords? Explain each line in english afterwards."
I've been wondering if this isn't a good thing. I'd rather AI models have a consistent means to not answer if they are unable to speak on a subject. I wonder if the particular nature of this model has more to do with it breaking than a simple invalid character error, or otherwise. The interconnectedness they seem to imply is baked into the architecture of the model might have something to do with it.
General LLM question: a lot of people ask specific knowledge-based questions of LLMs. Isn't one of the defining features of modern NLP (and therefore LLMs) that it is designed to be non-deterministic? Meaning that it will intentionally select "less optimal" next tokens at some random rate in order to sound less like a robot answering the same thing over and over. If that's the case, isn't it pretty much guaranteed to get knowledge-based questions wrong a decent amount of the time?
I could be way off base (I have zero knowledge about the internals and simply read occasional blog posts), but I thought I remembered that being one of the key features to making LLMs sound more human.
Don't think of it as "less optimal", but rather as other next tokens that have smaller, but still high, probabilities of being selected.
If your most likely next token has a probability of (arbitrarily) 25% of being selected, this doesn't make it optimal, just the most probable answer. If the second most probable has a 24% chance, the two together would account for 49% of probable responses. Rather than always picking the most probable word, generate a random number and compare it against the various probability bands (random < cumulative probability) to decide which token gets emitted.
Hallucinations can happen; a lot of current work is studying ways to keep the LLM from picking really bad moves. Things like critics and chain of thought are meant to help keep the probability machine within the realm of reasonable answers.
This is also why jailbreaking techniques like spamming umlauts (ü) have worked. They create very rare tokens where the probabilities of what comes next are poorly constrained. Once every next token gets roughly equal probability, the LLM's goal is just to try to improve its probability, and it will output anything, including things against its directive, to get back to 'normal'.
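A minimal sketch of that "probability band" sampling, with a temperature-scaled softmax over made-up next-token logits (the standard sampling idea, not any particular implementation):

    # Temperature-scaled sampling over made-up next-token logits.
    import math
    import random

    def sample_next_token(logits, temperature=1.0):
        """Sample a token index from softmax(logits / temperature)."""
        scaled = [x / temperature for x in logits]
        peak = max(scaled)                              # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        r = random.random()
        cumulative = 0.0
        for i, p in enumerate(probs):                   # r lands in one of the "probability bands"
            cumulative += p
            if r < cumulative:
                return i
        return len(probs) - 1

    logits = [2.1, 2.0, 0.5, -1.0]   # made-up scores: token 0 barely edges out token 1
    counts = [0] * len(logits)
    for _ in range(10_000):
        counts[sample_next_token(logits, temperature=0.8)] += 1
    print(counts)  # tokens 0 and 1 both get picked often; lower temperature sharpens the split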
For the purposes of extracting true knowledge (instead of human sounding output) it is directly less optimal if we assume the humans writing the input material are correct most of the time and incorrect some of the time. If the inputs were wrong most of the time, and correct some of the time, it would be more optimal. Unless there is some technical quirk I'm missing?
The issue is that an LLM by itself does not try to be correct or incorrect, only respond with tokens that have a high probability of appearing next. Optimal only becomes a metric for consideration when humans were added to rate the "goodness" of a response. A good explanation of this can be seen in "The True Story of How GPT-2 Became Maximally Lewd" (https://www.youtube.com/watch?v=qV_rOlHjvvs). The 'critics' that assisted in training GPT focused on producing coherent sentences that users liked. When the bug occurred, it maintained coherence but became incredibly vulgar - because the "morality critic" kept saying these negative reviews were good.
Behind the scenes, it's still a math equation attempting to determine which term should be concatenated next. Context and chain-of-reasoning prompts help ensure probabilities stay within the "optimal" space, but no actual thought process is going on (unless this is actually how humans think). Optimal here does not mean "the best response", but rather maintaining a coherent thought process for proper next tokens. Like the "strawberry only has 2 r's" discussion happening now: LLMs aren't actually counting the letters; rather, the number 2 has a high probability of following these types of tokens.
What you are referring to is called "temperature" with respect to LLMs, and it is a parameter that can be tweaked at inference time. Google's AI Overview gives a pretty good summary IMO:
> The temperature parameter in a large language model (LLM) controls the amount of randomness in the model's output. It's a numerical value that influences the model's choices, and is often set between 0 and 1.
> The temperature setting affects the quality of the output by balancing exploration and exploitation:
> Low temperature
> The model is more likely to exploit patterns it's already learned, resulting in more deterministic and conservative outputs. This is ideal for tasks that require accuracy, such as data analysis, technical writing, and summarizing research papers.
> High temperature
> The model is encouraged to explore, resulting in more varied and creative outputs. This is ideal for creative tasks, such as brainstorming, artistic content, and storytelling. However, adjusting the temperature too high can result in non-sensical outputs.
I would say that unless you really change the goalposts of your new model or it is open (source), it is not going to do much. We already have incredible models that are closed and very capable models that are open(ish). I am not even sure where this is trying to play?
Demonstrating good performance from a non-transformer based architecture is cool. I agree though these particular models aren’t that useful given the current landscape.
I think the intent here is probably to justify training a larger 400B model with their architecture.
It seems OK, for a small model. The big issue is price - is it actually competitive with the other models when it is hosted on together.ai or other API services? Which we will presumably find out at the Oct 24th event.
Am I missing something, or is this not an open model? What's the point in claiming a reduced memory footprint and comparing it with open-source alternatives if you're only offering the model on an API basis?
It's fast, I'll give it that, but it does seem to mess up some pretty basic tasks. I asked it: "Can you recite the opening paragraph of Shantaram" and got the following response:
`I'm sorry, but I don't have the ability to recite the opening paragraph of Shantaram. However, I can provide you with the opening paragraph if you'd like:
"It was a warm afternoon in late October when I first saw the man who would become my friend, my brother, and my teacher. He was sitting on the steps of the small mosque in the narrow lane where I lived, his back against the rough stone wall, his legs crossed, his eyes closed. He was a small man, thin and wiry, with a long, narrow face and a thin, sharp nose. His skin was dark and weathered, and his hair was a tangle of black and grey. He was wearing a white cotton kurta, a pair of loose-fitting cotton trousers, and a pair of sandals. He looked like a beggar, or a holy man, or a street performer, or all three."
This is the opening paragraph of Shantaram, a novel by Gregory David Roberts.`
Which is just straight up incorrect and all kinds of weird.
Hot take: stop making models. Make products, instead. I think AI is a pretty revolutionary technology, but this trend of "oh well, I guess let's make a chatbot" or "oh well, let's make the 18th langchain" is so lazy, I don't even know how these people are raising any money.