
> "Our experiments reveal that RWKV performs on par with similarly sized Transformers"



This sounds pretty bad, right? Their model is way smaller than SOTA transformers, and small size is one of their selling points.


The paper is meant to compare architecture against architecture at similar model size and dataset scale, to inform future architecture design decisions.

Its main benefit, with the above held roughly constant, is significantly lower running and training cost without a performance penalty.
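To see where the lower running cost comes from, here is a minimal sketch (not the paper's code) contrasting per-token decoding cost: a transformer attends over its whole KV cache, while an RWKV-style layer folds history into a fixed-size state. Names like attention_step, rwkv_step, d, and the decay w are illustrative, and the real RWKV wkv recurrence additionally has a per-token bonus term and numerically stable exponent handling.

    import numpy as np

    d = 64  # head / channel dimension

    def attention_step(q_t, K, V):
        # Transformer decoding: each new token attends over ALL t cached
        # keys/values, so per-token cost is O(t*d) and the KV cache
        # grows with context length.
        scores = K @ q_t / np.sqrt(d)        # (t,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                   # (d,)

    def rwkv_step(k_t, v_t, state, w):
        # RWKV-style decoding: history is folded into a fixed-size
        # state (a, b), so each token costs O(d) regardless of how
        # long the context is.
        a, b = state
        a = np.exp(-w) * a + np.exp(k_t) * v_t
        b = np.exp(-w) * b + np.exp(k_t)
        return a / b, (a, b)

The asymptotics are the point: attention_step does work proportional to the number of tokens seen so far, rwkv_step does constant work per token.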

If you want to compare any <20B model against GPT-3.5 / GPT-4 / 100B-model evals, that's another paper altogether.



