
Don’t use GPUs at inference (serving) time unless you prove that you need to.
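"Prove that you need to" usually just means measuring tail latency on both devices against your budget. A rough sketch, assuming PyTorch, where `model` and `example` are placeholders for your own trained model and a representative input:

    # Measure p99 single-request latency on a given device.
    import time
    import torch

    def p99_latency_ms(model, example, device, n=200):
        model = model.to(device).eval()
        example = example.to(device)
        times = []
        with torch.no_grad():
            for _ in range(n):
                start = time.perf_counter()
                model(example)
                if device.type == "cuda":
                    torch.cuda.synchronize()  # wait for GPU kernels to finish
                times.append((time.perf_counter() - start) * 1000)
        return sorted(times)[int(0.99 * n) - 1]

    # Compare both against, say, a 50 ms budget:
    # p99_latency_ms(model, example, torch.device("cpu"))
    # p99_latency_ms(model, example, torch.device("cuda"))

If the CPU number already clears the budget, you're done.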

The only consistent case where I've found it necessary (across a variety of NLP and computer-vision services with latency requirements under 50 milliseconds) is certain very deep RNNs, especially with long input sequences and large vocabulary embeddings.

I’ve never found any need for it with deep, huge CNNs for image processing.

Also consider a queue system if utilization becomes a problem after moving off GPUs. Create batch endpoints that accept small batches (say 8-64 instances) and put a queue in front to collate the stream of incoming requests into batch calls and uncollate the results back to the callers (this works well for GPU services too).
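A minimal sketch of that collate/uncollate loop, assuming Python threads and a hypothetical predict_batch() endpoint that takes a list of inputs and returns outputs in the same order:

    import queue
    import threading
    import time
    from concurrent.futures import Future

    MAX_BATCH = 32       # stay within what the batch endpoint accepts
    MAX_WAIT_S = 0.005   # never hold a request more than 5 ms waiting for peers

    request_q = queue.Queue()  # holds (input, Future) pairs

    def submit(x):
        """Called once per incoming request; returns a Future with the result."""
        fut = Future()
        request_q.put((x, fut))
        return fut

    def batch_worker(predict_batch):
        while True:
            items = [request_q.get()]  # block until at least one request arrives
            deadline = time.monotonic() + MAX_WAIT_S
            while len(items) < MAX_BATCH:
                try:
                    remaining = max(0.0, deadline - time.monotonic())
                    items.append(request_q.get(timeout=remaining))
                except queue.Empty:
                    break
            inputs, futures = zip(*items)
            outputs = predict_batch(list(inputs))   # one collated call
            for fut, y in zip(futures, outputs):    # uncollate back to callers
                fut.set_result(y)

    # threading.Thread(target=batch_worker, args=(predict_batch,), daemon=True).start()

The per-request handler blocks on submit(x).result(), while the single worker amortizes one batch call across all callers that arrived within the wait window.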



