The biggest takeaway is that they claim SOTA for multi-modal input, even ahead of proprietary models, and still released it as open weights. My first tests suggest this might actually be true; I'll keep testing. Wow
Most multi-modal input implementations suck, and a lot of them suck big time.
Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I suspect this improves performance even outside of math competitions. It'll be fascinating to see how much further this frontier can go.
The article also suggests that the system used isn't too far ahead of their upcoming general "DeepThink" model / feature, which they announced for this summer.
The rate limits apply only to the Gemini API. There is also Vertex AI on GCP, which offers the same models (and even more, such as Claude) at the same pricing, but with much higher rate limits (basically none, as long as they don't need to throttle anyone to protect provisioned throughput, iiuc) and with a process to get guaranteed throughput.
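For anyone comparing the two surfaces: the same model is reached through different endpoints and auth schemes. A minimal sketch of the two REST URL shapes (project, region, and model names here are placeholders, and the URL patterns are from memory of the public docs, so double-check before relying on them):

```python
# Sketch: the two REST surfaces that serve the same Gemini models.
# PROJECT / location / model values below are illustrative placeholders.

def gemini_api_url(model: str) -> str:
    # Gemini API (AI Studio): authenticated with an API key,
    # rate-limited per key/tier.
    return (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )

def vertex_url(project: str, location: str, model: str) -> str:
    # Vertex AI: OAuth / service-account auth, quota managed per
    # GCP project, with provisioned throughput as a paid option.
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"publishers/google/models/{model}:generateContent"
    )

print(gemini_api_url("gemini-1.5-pro"))
print(vertex_url("my-project", "us-central1", "gemini-1.5-pro"))
```

Same request body either way; only the host path and the credential change.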
Have you checked out https://github.com/prefix-dev/pixi? It's built by the folks who developed Mamba (a faster Conda implementation). It supports PyPI dependencies via UV, offers first-class support for multiple environments and lockfiles, and can manage other system dependencies like CUDA. Its CLI also embraces much of the UX of UV and other modern dependency-management tools.
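Concretely, a single pixi manifest can mix conda and PyPI dependencies and declare a CUDA requirement. This is a hedged sketch from memory of the pixi.toml format (package names and versions are illustrative), so check the pixi docs for the exact keys:

```toml
[project]
name = "demo"
channels = ["conda-forge"]
platforms = ["linux-64"]

# Conda packages, resolved by pixi's solver
[dependencies]
python = "3.11.*"
pytorch = "*"

# PyPI packages, resolved and installed via uv
[pypi-dependencies]
rich = "*"

# Declare the CUDA version the environments expect on the host
[system-requirements]
cuda = "12"
```

Everything resolves into one lockfile, which is what makes the multi-environment story work.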
Mastermind intrigued me in the same way it did the author some time ago, and I've used it as a standard problem when trying out new computational frameworks and methods ever since.
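For anyone who wants to try the same exercise: the core of any Mastermind solver is the feedback function (exact-position "black" pegs, right-colour-wrong-slot "white" pegs). This isn't the author's code, just a minimal Python sketch of that standard scoring rule:

```python
from collections import Counter

def score(secret: str, guess: str) -> tuple[int, int]:
    """Return (black, white) Mastermind pegs for a guess against a secret."""
    # Black pegs: right colour in the right slot.
    black = sum(s == g for s, g in zip(secret, guess))
    # Colour overlap regardless of position; subtracting the exact
    # matches leaves the white pegs.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return black, overlap - black

print(score("1122", "1212"))  # → (2, 2)
```

Once this is in place, Knuth-style solvers reduce to filtering the candidate set by consistency with each observed score, which is why it makes such a tidy benchmark problem.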