Cached input at $0.003625/M, output at $0.435/M. Aggressive pricing.
For anyone doing the "should I self-host on rented GPUs?" math: at this rate you'd need to push roughly 1B output tokens/day to break even against an 8xH100 fleet on Vast/Lambda (assuming 3-5k tokens/sec aggregate throughput). The vast majority of "I should run my own LLM" use cases don't come close to that volume.
Every API price drop kills another tranche of "self-host the open model" use cases. The implied bet: even if regular pricing ($1.74/M output) is also subsidized, exponential demand growth eventually makes the unit economics work. We'll see.
For anyone doing the "should I self-host on rented GPUs?" math: at this rate you'd need to push roughly 1B output tokens/day to break even against an 8xH100 fleet on Vast/Lambda (assuming 3-5k tokens/sec aggregate throughput). The vast majority of "I should run my own LLM" use cases don't come close to that volume.
Every API price drop kills another tranche of "self-host the open model" use cases. The implied bet: even if regular pricing ($1.74/M output) is also subsidized, exponential demand growth eventually makes the unit economics work. We'll see.