Hacker News

1 token/s is way too slow for dialogue, especially since a token isn't even a full word but often only part of one. 1 t/s might be sufficient for asynchronous processing, but if you want a ChatGPT-like therapy dialogue, that's not good enough.
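To put rough numbers on that point: a back-of-the-envelope sketch, assuming the commonly cited figure of roughly 1.3 BPE tokens per English word (an assumption; the exact ratio depends on the tokenizer and text):

```python
def words_per_second(tokens_per_second: float, tokens_per_word: float = 1.3) -> float:
    """Convert a token throughput into an approximate word throughput."""
    return tokens_per_second / tokens_per_word

def seconds_for_reply(n_words: int, tokens_per_second: float,
                      tokens_per_word: float = 1.3) -> float:
    """Approximate wall-clock time to generate an n_words-long reply."""
    return n_words * tokens_per_word / tokens_per_second

# At 1 token/s, a modest 50-word reply takes over a minute.
print(round(words_per_second(1.0), 2))      # ~0.77 words/s
print(round(seconds_for_reply(50, 1.0)))    # ~65 seconds
```

That minute-plus latency per short reply is why 1 t/s rules out interactive dialogue even though it may be fine for batch work.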


Do we know what the speed difference actually is? I'm not sure which benchmarks would measure that. My best plan so far is to just run a smaller model on one of the GPUs and time how long generation takes.
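That timing plan can be sketched generically: wrap whatever generation call your runtime exposes, time it, and divide new tokens by elapsed seconds. A minimal sketch, assuming `generate` is any callable that takes prompt tokens and a token budget and returns the full token sequence (the dummy model below is purely illustrative):

```python
import time

def measure_tokens_per_second(generate, prompt_tokens, max_new_tokens):
    """Time one generation call and return throughput in new tokens per second."""
    start = time.perf_counter()
    output_tokens = generate(prompt_tokens, max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = len(output_tokens) - len(prompt_tokens)
    return new_tokens / elapsed

def dummy_generate(prompt_tokens, max_new_tokens):
    """Stand-in 'model' that emits one token per millisecond."""
    out = list(prompt_tokens)
    for i in range(max_new_tokens):
        time.sleep(0.001)  # simulate per-token decode latency
        out.append(i)
    return out

tps = measure_tokens_per_second(dummy_generate, [101, 102, 103], 50)
print(f"{tps:.0f} tokens/s")
```

In practice you'd swap `dummy_generate` for your actual inference call and use a decent token budget, since per-token decode speed is roughly constant once the prompt is processed, so a single timed run gives a usable estimate.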




