ubermenchh's comments

Yes, it does continuous batching along with paged attention and prefix caching. I am also going to be adding some more inference techniques.
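For readers unfamiliar with the term, here is a rough sketch of what continuous batching means. It is a toy simulation, not the repo's actual code: the `Request` class, the `tokens_left` counter, and the `continuous_batching` function are all hypothetical names, and a "decode step" here just decrements a counter instead of running a model. The idea it illustrates is that finished requests leave the batch right away and waiting requests take their slots before the next step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    tokens_left: int  # hypothetical stand-in for the decode steps still needed


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    Returns a dict mapping request id to the step at which it finished.
    """
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free batch slots from the waiting queue before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One decode step: every running request produces one token.
        for req in running:
            req.tokens_left -= 1
        # Retire finished requests immediately so their slots free up.
        still_running = []
        for req in running:
            if req.tokens_left <= 0:
                finished_at[req.rid] = step
            else:
                still_running.append(req)
        running = still_running
    return finished_at


if __name__ == "__main__":
    reqs = [Request("a", 2), Request("b", 5), Request("c", 1)]
    print(continuous_batching(reqs, max_batch_size=2))
```

In this example, request "c" joins the batch as soon as "a" finishes, instead of waiting for the long request "b" to complete as it would with a fixed batch.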


Haha, I just wanted my repo to be out here. If someone finds it interesting, they can always just check the repo. And you're close: it's about getting faster responses from the model by manipulating the request queues and memory.

