A friendly reminder: if you have tech-illiterate people in your life (parents, grandparents, friends, etc), please reach out to them and inform them about advances in AI text, image, audio, and (as of very recently) video generation. Many folks are not aware of what modern algorithms are capable of, and this puts them at risk. GenAI makes it easier and cheaper than ever for bad actors to create targeted, believable scams. Let your loved ones know that it is possible to create believable images, audio, and videos which may depict anything from “Politician Says OUTRAGEOUS Thing!” to “a member of your own family is begging you for money.” The best defense you can give them is to make them aware of what they’re up against. These tools are currently the worst they will ever be, and their capabilities will only grow in the coming months and years. They are already widely used by scammers.
Anecdotally, I feel like I've heard "powerhouse" used somewhat regularly when describing impressive people, e.g.: "Such-and-such is an absolute powerhouse on the field", or "That person is so productive, they're a powerhouse".
So maybe the original usage has been subsumed by "power plant", but I think the word has alternative meanings which persist.
The counterargument here is that you can just scale the size of the hidden state sufficiently such that it can hold compressed representations of whatever-length sequence you like. Ultimately, what I care about is whether RNNs could compete with transformers if FLOPs are held constant—something TFA doesn't really investigate.
Well, that's what Transformer already does... One problem with the scaling you're describing is that there would be a massive amount of redundant information stored in hidden activations while training the RNN. The hidden state at each time step t in the sequence would need to contain all info that (i) could be useful for predicting the token at time t and (ii) could be useful for predicting tokens at times >t. (i) is obvious, and (ii) holds because all information about the past is transferred to future predictions through the current hidden state. In principle, Transformers can avoid storing redundant info in multiple hidden states at the cost of having to maintain and access (via attention) a larger hidden state at test/eval time.
> there would be a massive amount of redundant information stored in hidden activations
Is there a way to prove this? One potential caveat that comes to mind for me is that perhaps the action of lerping between the old state and the new could be used by the model to perform semantically meaningful transformations on the old state. I guess in my mind it just doesn't seem obvious that the hidden state is necessarily a collection of "redundant information" — perhaps the information is culled/distilled the further along in the sequence you go? There will always be some redundancy, sure, but I don't think that such redundancy necessarily means we have to use superlinear methods like attention.
All information about the past which will be available for predicting future tokens must be stored in the present state. So, if some bits of info about a past token at time t_p will be used for predicting some future token at time t_f, those bits must be carried through every state from t_p to t_f via the recurrence. Once information about past tokens is lost from the hidden state it is gone forever, so it must be stored and carried across many steps up until it finally becomes useful.
The information cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training, or big improvements in hardware could make memory use less of a bottleneck.
> The information cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training, or big improvements in hardware could make memory use less of a bottleneck.
Isn't this essentially what Mamba [1] does via its 'Hardware-aware Algorithm'?
And since the proposed hidden states and mix factors for each layer are both only dependent on the current token, you can compute all of them in parallel if you know the whole sequence ahead of time (like during training), and then combine them in linear time using parallel scan.
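To make that concrete, here's a rough sketch of what I mean (my own toy illustration, not the paper's code; the names, shapes, and the sigmoid on the mix factor are all assumptions on my part):

    import torch

    def toy_recurrence(x, w_z, w_h):
        # x: (seq_len, d_in); w_z, w_h: (d_in, d_hidden) -- made-up shapes
        z = torch.sigmoid(x @ w_z)   # lerp mix factors, one per token
        h_tilde = x @ w_h            # proposed hidden states, one per token
        # Both depend only on the current token, so every position can be
        # computed at once. Only the mixing below is sequential, and since
        # h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t composes as an affine
        # map, the loop can be replaced by an associative (parallel) scan.
        h = torch.zeros_like(h_tilde)
        prev = torch.zeros(h_tilde.shape[-1])
        for t in range(x.shape[0]):
            prev = (1 - z[t]) * prev + z[t] * h_tilde[t]
            h[t] = prev
        return h

The explicit loop is just for readability; during training you'd swap it for a scan over the (1 - z_t, z_t * h_tilde_t) pairs.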
The fact that this is competitive with transformers and state-space models in their small-scale experiments is gratifying to the "best PRs are the ones that delete code" side of me. That said, we won't know for sure if this is a capital-B Breakthrough until someone tries scaling it up to parameter and data counts comparable to SOTA models.
One detail I found really interesting is that they seem to do all their calculations in log-space, according to the Appendix. They say it's for numerical stability, which is curious to me—I'm not sure I have a good intuition for why running everything in log-space makes the model more stable. Is it because they removed the tanh from the output, making it possible for values to explode if calculations are done in linear space?
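My best guess at the intuition, as a toy example (my own illustration, not the paper's exact formulation): the recurrence ends up multiplying long chains of gate-like values in (0, 1), and those products underflow quickly in linear space, while their logs stay perfectly representable:

    import torch

    gates = torch.full((2000,), 0.9)                   # 2000 gate values in (0, 1)
    linear = torch.cumprod(gates, dim=0)               # 0.9**2000 underflows to 0.0 in float32
    log_space = torch.cumsum(torch.log(gates), dim=0)  # 2000 * log(0.9) is about -210.7
    print(linear[-1])     # tensor(0.)
    print(log_space[-1])  # tensor(-210.7...), still usable for further log-space arithmetic

If you then need sums of such quantities, logsumexp-style tricks keep everything finite without ever leaving log space.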
EDIT: Another thought—it's kind of fascinating that this sort of sequence modeling works at all. It's like if I gave you all the pages of a book individually torn out and in a random order, and asked you to try to make a vector representation for each page as well as instructions for how to mix that vector with the vector representing all previous pages — except you have zero knowledge of those previous pages. Then, I take all your page vectors, sequentially mix them together in-order, and grade you based on how good of a whole-book summary the final vector represents. Wild stuff.
FURTHER EDIT: Yet another thought—right now, they're just using two dense linear layers to transform the token into the proposed hidden state and the lerp mix factors. I'm curious what would happen if you made those transforms MLPs instead of singular linear layers.
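Something like this is what I have in mind (purely hypothetical sizes and layer choices on my part, not anything from the paper):

    import torch.nn as nn

    d_in, d_hidden = 256, 512  # made-up dimensions

    # Roughly the current setup as I understand it: one dense layer per
    # quantity (the sigmoid on the mix factor is my assumption).
    to_mix = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
    to_state = nn.Linear(d_in, d_hidden)

    # The variant I'm curious about: small MLPs instead.
    to_mix_mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                               nn.Linear(d_hidden, d_hidden), nn.Sigmoid())
    to_state_mlp = nn.Sequential(nn.Linear(d_in, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_hidden))

    # Either way the transforms only see the current token, so the
    # parallel-scan trick at training time should be unaffected.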
This architecture, on the surface, seems to preclude the basic function of recognizing sequences of tokens. At the very least, it seems like it should suffer from something like the pumping lemma: if [the ][cat ][is ][black ] results in the output getting close to a certain vector, [the ][cat ][is ][black ][the ][cat ][is ][black ][the ][cat ][is ][black ] should get even closer to that vector and nowhere close to a "why did you just repeat the same sentence three times" vector? Without non-linear mixing between input token and hidden state, there will be a lot of linear similarities between similar token sequences...
Counterpoint: the hidden state at the beginning of ([the][cat][is][black]) x 3 is (probably) initialized to all zeros, but after seeing those first 4 tokens, it will not be all zeros. Thus, going into the second repetition of the sentence, the model has a different initial hidden state, and should exhibit different behavior. I think this makes it possible for the model to learn to recognize repeated sequences and avoid your proposed pitfall.
The new hidden state after the first repetition will just be a linear combination between zero and what the non-recurring network outputs. After more repetitions, it will be closer to what the network outputs.
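A quick toy simulation (made-up gates and proposed states, just to show the fixed-point behavior) illustrates this:

    import torch

    torch.manual_seed(0)
    d = 8
    z = torch.rand(4, d) * 0.5 + 0.25   # hypothetical mix factors for a 4-token sentence
    h_tilde = torch.randn(4, d)         # hypothetical proposed states for those tokens

    def after_repetitions(n):
        state = torch.zeros(d)
        for _ in range(n):
            for t in range(4):
                state = (1 - z[t]) * state + z[t] * h_tilde[t]
        return state

    limit = after_repetitions(50)
    for n in (1, 2, 3):
        print(n, (after_repetitions(n) - limit).norm().item())
    # Each extra repetition shrinks the distance to the limiting state
    # geometrically, so two copies of the sentence already land very
    # close to three copies.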
I don't think it's a capital-B Breakthrough, but recurrent networks are everywhere, and a simplification that improves training and performance clears the stage to build complexity back up again to even greater heights.
Log space is important if the token probabilities span a large range of values (many orders of magnitude). There is a reason that maximum likelihood fitting is always performed with log likelihoods.
This is fascinating—I've recently been experimenting with something spookily similar after coming across the same paper as OP, only I'm trying to make random DAGs of logic gates reconstruct images rather than audio signals. Great minds like a think, I guess :P
Haven't written anything up about it, but the code is here: https://github.com/mkaic/dagnabbit. I haven't gotten it to work particularly well yet, so I don't have any demos posted.
If I ever get it to a point where I feel proud of the result, I'll write a blog post about it and submit it to HN.
I have used Comic Mono as my coding font for the past 2 years and unironically love it. I installed it as a joke so I could take some screenshots and get funny reactions out of my friends, but found myself genuinely enjoying the readability. These days I frequently forget it's even installed except when someone new joins the team and sees my IDE setup for the first time:
"What font is that??"
"Oh, haha, yeah... It's Comic Sans, but monospaced!"
Regarding "What makes people fat?", I feel like there are few charts as damning as the one [0] halfway through TFA showing a spike in obesity rates that coincides nearly perfectly with the introduction of the modern USDA Food Pyramid.
The quote below it sums things up nicely, imo:
> Dr. Michael Eades: “A few years ago, I went back and pulled some labels off some feed sacks in a farmer’s coop and ran them through my little nutrition computer and there’s virtually no difference in the macronutrient composition that farmers use to feed animals to fatten them up and that the USDA uses to tell us to supposedly slim down.”
It is truly insane how we continue to act as though anything other than drastic carbohydrate intake reduction will fix the American obesity epidemic. We've been eating way too much flour and sugar for decades now.
As much as I love their work, I can't be the only one who really struggles to see a path to profitability for Mistral, right? How do you make money selling API access to a model which anyone else can spin up an API for (license is Apache 2.0) on AWS or GCP or similar? Do they have some sort of magic inference optimization that allows them to be cheaper per-token than other hosting providers? Why would I use their API instead of anybody else's?
Asking these questions as a genuine fan of this company—I really want to believe they can succeed and not go the way of StabilityAI.
If VC funding for AI dries up and the French continue investing in Mistral, that would prevent much of the damage from an AI winter that could make OpenAI and Anthropic fail.
Even if Google had, it would have been of little value, since running Google, even back then, required a lot of computers and a lot of humans.
Which is rather like Mistral - running large models is expensive, and hosting lets you amortise that across lots of users who individually use the model very little.
Google ostensibly started on beige boxes, though. They used whatever computers they could get cheaply and quickly; even older hardware sufficed. There was a niche global group of people who could make stuff like that work as a much larger compute system (Beowulf clusters, etc.). I don't know that it took "a lot of humans" to bootstrap.
That hasn't been true for their largest model since the 2407 release of Mistral Large 2 (https://mistral.ai/news/mistral-large-2407/); it's now under a non-commercial license.
> How do you make money selling API access to a model which anyone else can spin up an API for (license is Apache 2.0) on AWS or GCP
Uhhh... easily. Don't host it on AWS or GCP, where everyone pays a 10x markup for proprietary infrastructure? Don't hire thousands of unnecessary employees? Don't bank on outrageous valuations? Lots of ways to compete with big tech.
I guess I was just under the impression that cloud inference is such a competitive market that it'd be nigh-on-impossible to compete with the major players.
Except Dropbox always kept their users tied to their platform, which allowed them to gradually enshittify their offering, starting with removal of directly addressable content in "Public" folder, and continuing through various changes and side products that all had very little to do with "a folder that syncs". Mistral can't successfully enshittify if they can't keep users captive.
It's somewhat common to open source the core yet still monetize a version (browser vendors, SaaS, games). People will still pay for convenience, reliability, or for the best product.
> Why would I use their API?
To pay 15 cents per million tokens instead of $5.
> license is Apache 2.0
There are browser vendors that use Chromium despite "competing" with Chrome - even though it's the same kind of web browser product, there are some benefits if they allow the other options to exist. The same can be said of open source games and frenemy situations like Uber vs Lyft in the early days - it doesn't necessarily hurt to have others playing your game, especially if you have a common enemy (Firefox, other games, cabs, respectively).
I run mostly Mistral offline in Terminal (via ollama cli), but in the case where I need a text-to-text LLM for an app and users pay for access to the LLM-powered stuff, why not use Mistral's API? Then I could have a super cheap app set up on Vercel or whatever and do everything through an API key. The app would "be AI" and yet it runs on a calculator for cents.
The main thing that comes to mind regarding "just spin it up on AWS" is the considerable backend needs (GPUs) and the cost to train and run LLMs. In the same way you ask "why use the LLM vendor's cloud option when I could use AWS's cloud option?", you could also ask the inverse (or just host it yourself for free after the initial setup, if it's cost you're after).
If you need geo-located instances and some other specific requirements, use IaaS, but otherwise I think IaaS like AWS and GCP are a nightmare to manage - the awful IAM experience, all the vendor-specific jargon, navigating the hell that is Amazon.com. For something like an LLM, "just spin it up on AWS" is just funny when you really consider what you're getting yourself into.
My intention was less to imply "I'll just spin it up" and more to imply "Some competitor of Mistral's will spin it up". I agree that from my perspective as a casual user, Mistral's API is quite convenient. What I don't understand is why they aren't driven to zero-margin instantaneously by an onslaught of clones of their business model.
But it is verifiable. We could quibble over "who judges novelty," but I bet if there were regular examples of it doing so, and there were some community agreement the ideas were indeed suitably novel, we'd pretty quickly shout "existence proof!" and be done.