How do they “nerf the models”? Are they quietly compacting context to reduce kv ... | Hacker News


dd8601fn 1 day ago | on: Will It Mythos?

How do they “nerf the models”? Are they quietly compacting context to reduce KV cache usage before the actual compaction? Like there’s a slider for how much to compress it, and that’s never revealed to us?
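
For a sense of scale, here is a rough sketch of how KV cache size grows with context length, assuming a standard transformer that stores one key and one value vector per token, per layer, per KV head. The model shape (`SHAPE`) and the helper name `kv_cache_bytes` are hypothetical and picked only for illustration; nothing here shows that any provider actually compacts context this way.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence: keys + values at every layer."""
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model shape (assumption, not any real model's config).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

full = kv_cache_bytes(100_000, **SHAPE)
compacted = kv_cache_bytes(50_000, **SHAPE)

print(f"full context:      {full / 2**30:.1f} GiB")
print(f"compacted to half: {compacted / 2**30:.1f} GiB")
```

Under these assumptions the cache scales linearly with the number of tokens kept, so halving the retained context halves the memory per request.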

airstrike 1 day ago [–]

I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
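
To make "quantize" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The weight values and the function names `quantize_int8` and `dequantize` are made up for illustration; production systems typically use per-channel or group-wise schemes, and this is not a claim about what any provider runs.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.013, -0.742, 0.305, 0.0009, -0.118, 0.551]  # made-up values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.6f}  max abs error={max_err:.6f}")
```

Each weight shrinks from 16 or 32 bits to 8, at the cost of a rounding error of at most half a quantization step per weight. Small weights such as `0.0009` collapse to zero, which is one way quality can degrade.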

lwarfield 1 day ago [–]

There's also lowering the number of experts you run in MoE models.
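
A toy sketch of what running fewer experts means, assuming the common top-k routing scheme: a router scores every expert for a token, only the `k` highest-scoring experts run, and their outputs are mixed with softmax weights. The scalar "experts", the router scores, and the function name `moe_forward` are all hypothetical; real experts are feed-forward blocks operating on vectors.

```python
import math


def moe_forward(x, router_logits, experts, k):
    """Route x to the k experts with the highest router logits and mix their outputs."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the gate weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]
    y = sum(g * experts[i](x) for g, i in zip(gates, top))
    return y, top, gates


# Toy scalar experts and made-up router scores for a single token (assumptions).
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
router_logits = [1.2, 0.3, -0.5, 0.9]

y2, top2, gates2 = moe_forward(3.0, router_logits, experts, k=2)
y1, top1, gates1 = moe_forward(3.0, router_logits, experts, k=1)

print("k=2:", top2, y2)
print("k=1:", top1, y1)
```

Dropping from `k=2` to `k=1` halves the expert compute for this token but changes the output, since the second expert's contribution is simply discarded.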
