We've had a decent amount of luck with InternVL 2.0 w/ Llama, and we're pretty excited about Llama 3.2.
It's still super early in the open source x vision model space. The limiting factor actually seems to be the vision encoder -- advancements there will pay huge dividends.