
I’m actually amazed at the output, since GLM doesn’t have eyes. If GLM 5.2 costs 1/5 as much, it seems like it could be set up to reach out to a multimodal model for vision tasks when required. That would bring it closer to parity while probably still being significantly cheaper.



I'm also very impressed at the output given the lack of image support.

They picked a task that heavily favors a model that can do multi-modal with images, and GLM still came within striking distance.

What I'm hearing from this article is that next-generation open models that include better multi-modal support are basically no-brainers for adoption.

Seems like a HUGE win for Z.ai and open models in general here.


Yes, it could just make one call to a multimodal LLM to describe the scene.
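
Roughly what that routing might look like, as a minimal Python sketch rather than anyone's actual setup. Both `text_model` and `describe_image` are hypothetical callables standing in for whatever text-only and multimodal APIs you would use; no real client library is assumed here.

```python
from typing import Callable, Optional


def answer_with_vision_fallback(
    prompt: str,
    image: Optional[bytes],
    text_model: Callable[[str], str],       # hypothetical: text-only model call
    describe_image: Callable[[bytes], str],  # hypothetical: multimodal model call
) -> str:
    """Send a task to a text-only model, describing the image first if there is one."""
    if image is None:
        # No vision needed: the cheaper text-only model handles it alone.
        return text_model(prompt)

    # Exactly one call to the multimodal model to describe the scene.
    description = describe_image(image)

    # Hand the description to the text-only model as plain text.
    augmented = (
        "Scene description from a vision model:\n"
        f"{description}\n\n"
        f"Task:\n{prompt}"
    )
    return text_model(augmented)
```

The multimodal model is only paid for when an image is actually present, which is where the cost saving would come from. Whether a text description carries enough detail for a given vision task is a separate question this sketch does not answer.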


