Question: I only ever hear how RWKV is threatening the dominance of transformers... from the RWKV folks, while everyone independent is happily using transformers. Is there any substance to the above claim that's been independently verified?
Not an NLP person, so apologies in advance if I've missed any obvious papers or references.
I've been catching up on the papers, and the mechanics of RWKV are interesting, but I'm more of a hands-on guy. I played a bunch with RWKV earlier this year, when its quality was comparable to the early LLaMA models of similar size and it was significantly faster (its CPU inference was comparable to the latter's GPU inference speeds six months ago), but I feel like those advantages have largely dissipated.
With 4-bit quants, the optimizations for batch=1 with llama.cpp, Exllama, and MLC on the edge side, and continuous batching with vLLM and the like on the high-throughput side, I think performance (in terms of speed) is not as pressing, at least while people are pushing hardest on the capabilities front. Meanwhile, the quality of the best Llama 2 (and even the last-gen Llama 1) models has blown past the existing RWKV models (although I'm definitely interested in kicking the tires on RWKV-5 when it's out). Also, I recently stepped back into the RWKV world to check up on what was going on, and I was actually a bit shocked at how night-and-day the tooling/ecosystem difference is now. While it's not all roses on the open LLM side, there's enough critical mass that the evolution has been ... extremely rapid, to say the least.
I think over the next few years there will obviously be architectural improvements that succeed the current crop of dense transformer LLMs (one might argue that's already happening piece by piece), and I'm happy there are experiments like RWKV going on, so I guess we'll just have to see what scales and what doesn't.
Quantisation, thankfully, is as applicable to RWKV as it is to transformers, most notably in our rwkv.cpp community project: https://github.com/saharNooby/rwkv.cpp
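For anyone unfamiliar with what that involves, here is a toy Python sketch of blockwise 4-bit weight quantization. It is not the actual rwkv.cpp/GGML format (the real formats pack the codes into nibbles and differ in block size and other details); it just illustrates why the technique is architecture-agnostic: you are only compressing weight matrices, which RWKV and transformers both have.

    import numpy as np

    def quantize_4bit(weights, block_size=32):
        # Toy blockwise symmetric 4-bit quantization: one float scale per
        # block plus integer codes in [-8, 7]. Real formats differ in detail.
        flat = weights.reshape(-1, block_size)
        scales = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-8) / 7.0
        codes = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
        return codes, scales

    def dequantize_4bit(codes, scales, shape):
        return (codes.astype(np.float32) * scales).reshape(shape)

    # Usage: quantize a random weight matrix and check reconstruction error
    w = np.random.randn(1024, 1024).astype(np.float32)
    codes, scales = quantize_4bit(w)
    w_hat = dequantize_4bit(codes, scales, w.shape)
    print("mean abs error:", np.abs(w - w_hat).mean())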
Tooling/ecosystem is something that I am actively working on, as there is still a gap to transformers' level of tooling. But I'm glad that there is a noticeable difference!
And yes, experiments are important to ensure improvements in the architecture, even if "linear transformers" end up replacing "transformers". Alternatives should always be explored, so we can learn from the trade-offs, to the benefit of the ecosystem.
(This was lightly covered in the podcast, where I shared my opinion that we should have more research into text-based diffusion networks.)
Glad to hear that you're focused on improving tooling; as you point out, there is definitely a gap (maybe I was too circumspect about it originally, but to be clear, the difference wasn't in RWKV's favor).
Some of it may just be the docs. One easy improvement: when you search for RWKV, you end up on the GitHub page, but the README basically goes from a description of the model, to a short snippet of sample inference code, to a list of implementations, to a bunch of random research notes. It's a bit scattered and pretty hard to read, but it should be easy to clean up.
From a practical perspective, for example, there's no list/description in the repo or the model cards of all the different versions at https://huggingface.co/BlinkDL (world, raven, pile, novel, pileplus, etc.), and while it says RWKV-4-World is the "best" model, it's only a 7B, and Raven 14B is listed above it and says it's fine-tuned (while World is not). Which is actually better? Unsure; there are no benchmarks (even the spreadsheet screenshots eventually listed don't give any info, nor do the model cards). Even as someone returning to RWKV, this was very confusing. Compare it to what the Llama community is doing these days with model cards (e.g. https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-... or, for quants, https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.0...).
Also, for those not familiar with the project at all, I think a slightly expanded and better-organized "Usage" section would be helpful for getting up and running (it doesn't help that multiple implementations each have their own quant formats). ExLlama and MLC both do a good job of getting a user up and running easily. Again, some of this might just be a pass of cleaning up and better organizing the docs.
Thanks for clarifying, and pointing out where docs can be improved.
For the most part, I am trying to ensure everything gets properly reorganised under the official wiki page here. Based on your feedback, I have just added a guideline of sorts to help navigate which model should be used for which purpose.
From an SEO standpoint, the first landing page for RWKV should eventually be optimised to be the wiki or the rwkv.com page itself, and not BlinkDL's original trainer repo.
This would be more aligned with the fact that the vast majority of searches for the model naturally come from individuals who want to try the model (and not train or finetune it from scratch).
The wiki looks like a much better organized starting point! I wonder if that could simply be linked after the intro in the BlinkDL repo until search rankings catch up: "For those looking to get started using or testing out RWKV models, visit: https://wiki.rwkv.com/"
I do also think that having a recent benchmark table with the latest models and references to a few similar-sized transformer base models (most importantly llama and llama2) would be super useful - maybe a table with memory size at various context lengths, or other ways to highlight RWKV's unique advantages. While local LLMs are still niche, I do think a lot of why RWKV gets overlooked is because if it has lower capabilities, doesn't inference faster, and is harder to set up, it's not exactly clear why to check it out (beyond as a curiosity).
(interviewer here, but not on the RWKV team and no ties) so if you're looking for social proof - this is something we cover in the pod, but the original idea came from Apple (the Attention Free Transformer) and Microsoft recently put out RetNet, which is a similar idea. If you believe that the reasoning benchmarks shown were legit (and these are pretty standard benchmarks imo, and audited by other academics not strictly on the RWKV team like Quentin Anthony (also our past guest!), tho he is with Eleuther), then that's pretty much all the proof you need. plus the fact that you can spin it up on huggingface spaces right now and try it out yourself (link in the show notes).
bottom line is this thing is being built by the community that uses it - there is no higher form of dogfooding. if you wanna wait 3-4 years for big names to use it before you take it seriously, then we'll just have to wait. i call this out explicitly in the pod intro that i'm trying to be early here; i think we might be the first big ai pod to seriously cover RWKV
Do note that both works cited are competing alternatives in the linear-transformer space, and in some sense compete with RWKV in this upcoming "linear transformer architecture" trend.
---
While all of the above validates the architecture, it's important to separate architecture design from dataset.
The quality of an AI model's answers is as much about the dataset as about the architecture used. This is why it makes less sense to compare LLaMA 2 directly with RetNet/RWKV, for example, due to the large differences in dataset size.
It has good potential, and nobody ever thought that RNNs would be able to scale as they have under BlinkDL's lead.
Well, we did at Stability AI, which is why we provided and scaled compute for it.
They have the potential for far better edge inference and also benefits for longer context windows.
There are other architectures out there that we are supporting too; I would be surprised if the current transformer LLM paradigm is what we use in a few years.
With all due respect, I consider you one of the "RWKV folks" I mentioned in the initial comment :) (assuming you are Emad Mostaque?)
> They have the potential for far better edge inference and also benefits for longer context windows. ... I would be surprised if the current transformer LLM paradigm is what we use in a few years.
What are the studies/evidence that convinced you? Given the current landscape, the last claim is quite bold.
Btw, I also realise on rereading that both statements can be true at the same time.
In the podcast as well, I was being very transparent that I am happily using transformers (Salesforce codegen) in production for my primary work use cases, and I am still happily doing so.
My work on RWKV is to support a new use case which transformers cannot support even today. It in no way should threaten your happiness in using existing AI models.
I suspect that in the long run, the AI architecture wars may slowly mirror the database wars in software development (SQL vs NoSQL vs ??).
Each will have substantial benefits in one use case over another, with many overlaps, and lots of happy users for all of them =)
Personally, I find the RNN "mode" runs substantially faster with similar quality to other models of the same size. RWKV is limited by compute, and thus model size, at present, which makes it hard to say how it will stack up against models like Falcon and Llama 2 at their larger sizes.
There are also trade-offs around its specific attention mechanism. It's approximate and consists of an exponential decay across time steps and a constant that represents the magnitude of the attention at that time step. The decay means that for long-running contexts it will, in theory, lose some of the ability to recall nuance that you see with perfect attention on longer sequences. This isn't a problem specific to RWKV, but of any decay-based attention.
A perk of the formulation is that the attention computation is linear in time complexity and only requires one additional serial scan at the end, so you get nicer scaling properties than quadratic perfect attention, in exchange for the lossy attention.
The examples in the original paper don't show this to be a problem, but they also use much shorter sequences. In the real world it's a toss-up; I'd expect it to be faster but more lossy for longer context windows.
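To make the trade-off concrete, here is a toy Python sketch of a decay-based recurrence along the lines described above. It's deliberately simplified (a single fixed scalar decay, and none of the per-channel learned decay or "bonus" term for the current token that the actual RWKV formulation has), but it shows why the cost is linear in sequence length: each step just updates a running decayed numerator and denominator instead of attending over the whole history.

    import numpy as np

    def decay_attention(keys, values, w=0.5):
        # keys, values: (T, d) arrays. w is a scalar decay rate (an assumed
        # simplification; RWKV learns a per-channel decay). Each output is an
        # exponentially decayed, exp(k)-weighted average of all past values,
        # so older tokens contribute less -- the "lossy" part of the trade-off.
        T, d = values.shape
        decay = np.exp(-w)
        num = np.zeros(d)   # running sum of exp(k_i) * v_i, decayed each step
        den = np.zeros(d)   # running sum of exp(k_i), decayed each step
        out = np.empty_like(values)
        for t in range(T):  # a single serial scan: O(T * d), not O(T^2 * d)
            num = decay * num + np.exp(keys[t]) * values[t]
            den = decay * den + np.exp(keys[t])
            out[t] = num / den
        return out

    # Usage on a tiny random sequence
    rng = np.random.default_rng(0)
    out = decay_attention(rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))

The flip side is visible in the same loop: contributions from older tokens get multiplied by the decay at every step, which is exactly the gradual loss of nuance mentioned above.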
TLDR: It has some teeth, but I think saying it's threatening is wishful thinking to some degree. Good model, looking forward to seeing larger sizes.
Agreed that as a whole, we do need more public benchmarks (for both transformers and RWKV) with much longer sequence lengths.
I view this as a case of the "benchmarks" and "datasets" not keeping up with current progress, as the vast majority of benchmarks presume a limit of up to 2k tokens, which was cutting edge last year but isn't today.
Internally, we do have some benchmarking of how the decay over time affects memory loss, and we are making major improvements in this direction for v5, aiming for no loss within 8k to 100k context sizes and beyond. Hopefully, with time, there will be more formalised standard benchmarks for larger context sizes.
Edit/PS: I believe that until we get a sponsor for the compute required to train at a similar scale and dataset size to LLaMA 2, so that we can have that "showdown", we will be stuck in this non-threatening status, as there will always be that shadow of doubt over whether it can scale to the next milestone that transformers have already crossed.
Occam's razor tells us that if it's a great architecture/technical breakthrough, it would have taken the world by storm by now. Similar to the original transformer paper and model. Since 2017, the only successful models are variations of transformers. RNNs are nowhere in the picture.
Simply believing an architecture is superior doesn't make it so. Nothing converges and performs as well as a model with attention, in both training and inference. The difference is night and day.
> Occam's razor tells us that if it's a great architecture/technical breakthrough, it would have taken the world by storm by now. Similar to the original transformer paper and model.
I think you're making the opposite case here without even realizing it yourself. The Transformer was proposed in 2017, with softmax-based attention proposed in 2014, but it wasn't until 2023, when GPT-3 et al. took the world by storm, that people started really using the Transformer as it is today.
The timeline going from being proposed to being used in production models also took multiple steps, with things improving at each step. It's an iterative process, not "dump -> done".
> Simply believing an architecture is superior doesn't make it so
I agree. But I also agree with the opposite: just because you believe it's not superior doesn't make it so. If you claim "RWKV performs poorly in practice", I expect you to have something to back that up; definitely more than "simply believing an architecture is superior doesn't make it so".
I've now presented on this model and worked with it. It's not phenomenally better than other models, but it has some attractive scaling and speed properties that may or may not be worth the trade-off relative to its drawbacks, which I detail in my other comment in the GP's thread.
This is really a strange take. Encoder-only transformers like BERT and RoBERTa have been wildly popular in NLP for years now, replacing pretty much every model that came prior to it and beating pretty much all traditional NLP benchmarks (tagging, parsing, etc.).
> The Transformer was proposed in 2017, with softmax-based attention proposed in 2014, but it wasn't until 2023, when GPT-3 et al. took the world by storm, that people started really using the Transformer as it is today.
Ideas and trends move much slower than we think, especially outside of Silicon Valley, or whatever current bubble we as individuals are in.
We as humans have a tendency not to want things to change, nor to accept new ideas that challenge our existing ones, and we shape our memories accordingly.
Transformer example: the world didn't switch over to transformers immediately in 2017; in fact, the original model had issues converging past 100M params that needed to be sorted out, and it arguably only picked up steam after BERT a year later.
Day-to-day example: the average person outside the tech bubble still has not tried ChatGPT - and unfortunately it has not taken the world by storm yet.
So while it is true that traditional RNNs with LSTMs do not converge as well, the changes presented here are substantial (we removed the LSTM, for example).
And it's not a question of belief: the RWKV code is fully open source, public, and available. All claims can be tested, with results that can be replicated, by anyone willing to put in the time to do so.
I realize you say "not authoritative", but just to clarify - LAION isn't the creation of Emad Mostaque or Stability AI. Emad is/was a funder who showed up well after its inception. The same applies for his work with Eleuther. Although some members of both LAION and Eleuther did go on to become Stability employees. LAION was also partially born from _frustrations with_ Eleuther's TPU culture and inability to ship anything (at the time, and excepting Katherine Crowson's great work) for text-to-image work.
will mention, thanks for the correction. would love to interview someone from the LAION side to get that part of the story. (it's really hard to trace as an outsider)
It all started originally on lucidrains/dalle-pytorch in the months following the release of DALL-E (1). The group started as `dalle-pytorch-replicate` but was never officially "blessed" by Phil Wang who seems to enjoy being a free agent (can't blame him).
https://github.com/lucidrains/DALLE-pytorch/issues/116 is where the discord got kicked off originally. There's a lot of other interactions between us in the github there. You should be able to find when Phil was approached by Jenia Jitsev, Jan Ebert, and Mehdi Cherti (all starting LAION members) who graciously offered the chance to replicate the DALL-E paper using their available compute at the JUWELS and JUWELS Booster HPC system. This all predates Emad's arrival. I believe he showed up around the time guided diffusion and GLIDE, but it may have been a bit earlier.
Data work originally focused on amassing several of the bigger datasets of the time. Getting CC12M downloaded and trained on was something of an early milestone (robvanvolt's work). A lot of early work was like that though, shuffling through CC12M, COCO, etc. with the dalle-pytorch codebase until we got an avocado armchair.
Christophe Schumann was an early contributor as well and great at organizing and rallying. He focused a lot on the early data scraping work for what would become the "LAION5B" dataset. I don't want to credit him with the coding and I'm ashamed to admit I can't recall who did much of the work there - but a distributed scraping program was developed (the name was something@home... not scraping@home?).
I should also mention Romain Beaumont, an original member. His work on clip-retrieval was great and allowed us to eventually ship CLIP embeddings data for LAION5B and do semantic search over the dataset. He worked on a lot there however, including code for training and model architecture. Smart guy! I have heard he's now with Stability? Haven't confirmed though.
The discord link on Phil Wang's readme at dalle-pytorch got a lot of traffic and a lot of people who wanted to pitch in with the scraping effort.
Eventually a lot of people from Eleuther and many other teams mingled with us, some sort of non-profit org was created in Germany I believe for legal purposes. The dataset continued to grow and the group moved from training DALLE's to finetuning diffusion models. Eventually it was decided by core members that we should get a proper name and LAION was chosen.
The `CompVis` team were great inspiration at the time and much of their work on VQGAN and then latent diffusion models basically kept us motivated. As I mentioned a personal motivation was Katherine Crowson's work on a variety of things like CLIP-guided vqgan, diffusion, etc.
I believe Emad Mostaque showed up around the time GLIDE was coming out? I want to say he donated money for scrapers to be run on AWS to speed up data collection. I was largely hands off for much of the data scraping process and mostly enjoyed training new models on data we had.
As with any online community things got pretty ill-defined, roles changed over, volunteers came/went, etc. I would hardly call this definitive and that's at least partially the reason it's hard to trace as an outsider. That much of the early history is scattered about GitHub issues and PR's can't have helped though.
I haven't been on the Eleuther Discord in a year or so, but the last time I was there it was bustling and thriving on some really hard topics.
So while I'm not sure that the RWKV movement will see more widespread adoption than the corporate LLM implementation movement, I think there's probably still more goodness to come out of the Eleuther community.
Discord is a cancer. While I don’t understand the appeal, I’ve been around long enough to no longer be surprised by people sacrificing control of their information for trivial short term gain.
As a user from the early internet forum era, I do find it a step back as well.
But unfortunately, the public at large has kinda spoken and voted with their usage. And ultimately, for community engagement, their decision is something I have to accept as well.
Better to have a messy, growing Discord with active users than a forum with little to no engagement.
For better search indexing, and to make things more accessible, I constantly help move links/items from the Discord into the more public GitHub wiki: https://wiki.rwkv.com/
(PS: our Discord is fully public for anyone to sign up and access)
This doesn't seem to be true, unless you don't trust what Eleuther themselves are saying.
> The work we do would not be possible without the generous support of our donors and other sponsors: Core Weave, HuggingFace, Google TRC, Nat Friedman, Lambda Labs
least capitalist hacker news comment. do the arrows mean flow of capital? i don't think so. i think it means flow of people or cultural influence or leadership
it was meant to be people :) i obviously have even less insight on flow of funds than i do of people movement. i'm sure Emad could draw a much better chart of the Eleuther mafia than I can (and happy to update my chart to reflect the truth Emad)
I think there is something there, it's just maybe not in the Yudsphere. I spent a few months in grad school reading FATML papers, but that is obviously something very different.
Another tangent — I don’t get why people writing technical blogs use Substack, where support for math and syntax highlighting is nonexistent (math) or minimal (code). Something with Hugo plus PaperPad theme would be far better. I suppose Substack is popular because of the newsletter features.
Same reason anyone would use a hosted service vs setting it up themselves: they don't want to deal with the hassle. For many of us it's trivial, sometimes because we've done it before, but for others it's a huge hill to climb, which seems unnecessary if all they want to do is publish some content.
This makes sense for people who don’t code, but I see plenty of developer oriented bloggers who use Substack, which is puzzling to me. I also don’t know why Substack doesn’t get with the program and just support MathJax and colored syntax highlighting already.
> This makes sense for people who don’t code, but I see plenty of developer oriented bloggers who use Substack, which is puzzling to me
No, it makes sense for any person with any skill set. Just because you are a developer doesn't mean you want to set up everything yourself. We use plenty of tools to make our own experience better and more focused, based on what we want to do.
If I'm a programmer, and I want to publish blog posts, sometimes I don't want to care about the setup that makes me able to publish blog posts, I just want to post something.
> why Substack doesn’t get with the program and just support MathJax and colored syntax highlighting already
This, I don't know. Should have been done ages ago.
(author here) i have spent years setting up my own blogstack (google swyxkit) so i know how to do it. but comments like this miss that Substack is an *email social network*, and as a newsletter author that is the one currency that loosely/poorly tracks my success and distribution (right down to ability to book quality guests). Substack is the best email network period right now (it is not a very good podcast host but I deal with its many annoyances because it does one thing all the other hosts cannot - auto email my subscribers when a new episode is out). The moment Hashnode improves I'll switch to it.
Yes Substack excels at being an email social network. So I’ve settled on the following workflow:
1. Make my blog using mkdocs material plus GitHub pages so it has all the nice code and math rendering features (plus things like admonitions and all the other nice things in mkdocs material).
2. On each post have a Substack signup button.
3. When I make a new post I want to send to subscribers, I manually create a short text-only teaser post on Substack (paragraph or two), with a link to my main blog that says “read more here”.
I know this doesn’t auto-push new posts to subscribers but I am ok with that given that I won’t be writing more than once a week at most.
I can do it myself and it's not a huge hill to climb, but I don't necessarily have the time to do it. There is also the time that needs to be dedicated towards maintenance (upgrades etc.).