
One NVIDIA L4 GPU (24GB vRAM) per Cloud Run instance (many instances per Cloud Run service).

Scale to zero: When there are no incoming requests, Cloud Run stops all remaining instances and you’re not charged.

Fast cold start: When scaling from zero, processes in the container can use the GPU in approximately 5 seconds.

Open large language models up to 13B parameters run great, including: Gemma 2 (9B), Llama 3.1 (8B), Mistral (7B), Qwen2 (7B).
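As a concrete sketch of what a deployment might look like: a GPU-backed Cloud Run service can be created with `gcloud`, requesting one L4 via the `--gpu` and `--gpu-type` flags (available on the beta track at the time of writing). The service name, image path, region, and resource sizes below are illustrative placeholders, not values from the post:

```shell
# Hypothetical deployment of a container serving an open LLM on one NVIDIA L4.
# Image path, region, CPU/memory sizes, and concurrency are placeholders --
# tune them for your own model and project.
gcloud beta run deploy ollama-gemma2 \
  --image us-docker.pkg.dev/MY_PROJECT/my-repo/ollama-gemma2 \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --cpu 8 \
  --memory 32Gi \
  --concurrency 4 \
  --min-instances 0
```

With `--min-instances 0`, the service scales to zero when idle, which is what makes the scale-to-zero billing behavior above apply.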

From a cold start, Gemma 2 (2B, Q4_0) can begin returning tokens in approximately 11 seconds (best case).
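One way to observe that cold-start latency yourself is to time a request against a service that has scaled to zero. This sketch assumes an Ollama-style `/api/generate` endpoint and a placeholder service URL, neither of which comes from the post:

```shell
# Hypothetical request; SERVICE_URL stands in for your Cloud Run service URL.
# If the service is scaled to zero, the total time includes the cold start.
time curl -s https://SERVICE_URL/api/generate \
  -d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": false}'
```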


