I’m actually amazed at the output since GLM doesn’t have eyes. If GLM 5.2 costs ...

horsawlarway · 2026-06-22T14:44:35 1782139475

I'm also very impressed at the output given the lack of image support.

They picked a task that heavily favors models with multi-modal image support, and GLM still came within striking distance.

What I'm hearing from this article is that next-generation open models that include better multi-modal support are basically no-brainers for adoption.

Seems like a HUGE win for Z.ai and open models in general here.

killingtime74 · 2026-06-23T04:09:46 1782187786

Yes, it could just make one call to a multimodal LLM to describe the scene.