Hacker News
ActorNightly
5 months ago
on:
Ask HN: How does LLM serving infra work?
Generally, unloading data from GPU RAM and loading a new set of weights is quite expensive, so my guess is that the backend is built so that each input gets a reservation on some cluster that already has the model loaded, based on some ordering, and the resulting batch is then run through together.
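To make that guess concrete, here is a rough sketch of the idea in Python. It is not how any real provider does it. Worker, Router, submit, next_batch, and the model and GPU names are all made up for illustration. The only point is that requests are routed to a worker whose weights are already resident, so nothing gets swapped in or out of GPU memory per request, and each worker then drains its queue in batches.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical GPU worker that keeps one model's weights resident."""
    name: str
    loaded_model: str
    queue: deque = field(default_factory=deque)


class Router:
    """Hypothetical router: never reloads weights, only picks a warm worker."""

    def __init__(self, workers, max_batch_size=4):
        self.max_batch_size = max_batch_size
        # Index workers by the model they already hold in GPU memory.
        self.by_model = defaultdict(list)
        for worker in workers:
            self.by_model[worker.loaded_model].append(worker)

    def submit(self, model, prompt):
        """Reserve a slot on the least-loaded worker that has `model` loaded."""
        candidates = self.by_model.get(model)
        if not candidates:
            # Assumption: this sketch refuses rather than triggering a reload.
            raise LookupError(f"no worker has {model!r} loaded")
        worker = min(candidates, key=lambda w: len(w.queue))
        worker.queue.append(prompt)
        return worker.name

    def next_batch(self, worker):
        """Pop up to max_batch_size prompts in arrival order for one forward pass."""
        batch = []
        while worker.queue and len(batch) < self.max_batch_size:
            batch.append(worker.queue.popleft())
        return batch


if __name__ == "__main__":
    workers = [
        Worker("gpu-0", "model-a"),
        Worker("gpu-1", "model-a"),
        Worker("gpu-2", "model-b"),
    ]
    router = Router(workers, max_batch_size=2)
    for prompt in ["p1", "p2", "p3"]:
        print(prompt, "->", router.submit("model-a", prompt))
    print("batch on gpu-0:", router.next_batch(workers[0]))
```

The ordering here is plain FIFO per worker and the placement rule is "shortest queue wins". A real system would presumably also weigh things like sequence length and memory headroom, but that is beyond what the guess above claims.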