
One NVIDIA L4 GPU (24GB vRAM) per Cloud Run instance (many instances per Cloud Run service).

Scale to zero: When there are no incoming requests, Cloud Run stops all remaining instances and you’re not charged.

Fast cold start: When scaling from zero, processes in the container can use the GPU in approximately 5 seconds.

Open large language models up to 13B parameters run great, including: Gemma 2 (9B), Llama 3.1 (8B), Mistral (7B), Qwen2 (7B).
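As a concrete sketch of what a deployment might look like: a GPU-backed Cloud Run service can be created with `gcloud`, requesting one L4 via the `--gpu` and `--gpu-type` flags (available on the beta track at the time of writing). The service name, image path, region, and resource sizes below are illustrative placeholders, not values from the post:

```shell
# Hypothetical deployment of a container serving an open LLM on one NVIDIA L4.
# Image path, region, CPU/memory sizes, and concurrency are placeholders --
# tune them for your own model and project.
gcloud beta run deploy ollama-gemma2 \
  --image us-docker.pkg.dev/MY_PROJECT/my-repo/ollama-gemma2 \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --cpu 8 \
  --memory 32Gi \
  --concurrency 4 \
  --min-instances 0
```

With `--min-instances 0`, the service scales to zero when idle, which is what makes the scale-to-zero billing behavior above apply.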

From a cold start, Gemma 2 (2B, Q4_0) can begin returning tokens in approximately 11 seconds (best case).
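One way to observe that cold-start latency yourself is to time a request against a service that has scaled to zero. This sketch assumes an Ollama-style `/api/generate` endpoint and a placeholder service URL, neither of which comes from the post:

```shell
# Hypothetical request; SERVICE_URL stands in for your Cloud Run service URL.
# If the service is scaled to zero, the total time includes the cold start.
time curl -s https://SERVICE_URL/api/generate \
  -d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": false}'
```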


