It's impressive how the MCP template-search example (https://docs.vlm.run/mcp/examples/template-search) retains visual context across multiple images and tool calls. Unlike most chat interfaces, it enables seamless multi-step reasoning—like finding a logo in one image and tracking it in another—without losing state. This makes it ideal for building stateful, iterative visual workflows.
Basically, there's no model-and-schema combination that works out of the box. If you prompt an open-source model with the schema, it doesn't produce results in the expected format. The main contribution is making these models conform to your specific needs and return output in a structured format.
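To make the failure mode concrete, here's a minimal sketch of the naive "schema in the prompt" approach. It assumes an open-source VLM (Qwen2.5-VL is used as a placeholder) served behind an OpenAI-compatible endpoint; the server URL, model name, image URL, and `Invoice` schema are all illustrative, not from the docs:

```python
# Naive approach: paste the JSON schema into the prompt and hope the model complies.
# With many open-source VLMs the reply comes back as prose or malformed JSON,
# so the parse step below can fail.
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    vendor: str
    total: float


# Hypothetical local OpenAI-compatible server (e.g. vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = json.dumps(Invoice.model_json_schema())
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": f"Extract the invoice as JSON matching this schema:\n{schema}"},
        ],
    }],
)

try:
    invoice = Invoice.model_validate_json(response.choices[0].message.content)
except ValidationError:
    # The failure mode described above: the reply doesn't conform to the schema.
    print("Model reply did not match the schema:", response.choices[0].message.content)
```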
Wait, but we're doing that already, and it works well (Qwen 2.5 VL)? If need be, you can always resort to structured generation to enforce schema conformity?
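For example, here's a rough sketch of the structured-generation fallback, assuming Qwen 2.5 VL is served through vLLM's OpenAI-compatible endpoint (e.g. `vllm serve Qwen/Qwen2.5-VL-7B-Instruct`) and that the vLLM version supports the `guided_json` extra parameter; the schema, image URL, and prompt are placeholders:

```python
# Sketch: constrained decoding forces the output tokens to match the schema,
# so the reply is guaranteed to parse.
from openai import OpenAI
from pydantic import BaseModel


class LogoDetection(BaseModel):
    brand: str
    present: bool
    bounding_box: list[int]  # [x_min, y_min, x_max, y_max]


client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/store.jpg"}},
            {"type": "text", "text": "Find the Acme logo and return its bounding box."},
        ],
    }],
    # vLLM's guided decoding constrains generation to the JSON schema.
    extra_body={"guided_json": LogoDetection.model_json_schema()},
)

result = LogoDetection.model_validate_json(response.choices[0].message.content)
print(result)
```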