I don't think we are getting anywhere. Look, can you point to any instance of machine vision being used to improve a language model of English? Especially any case where the text-only language model took more computing power to train than the vision-aided one?

I don't think anything like that exists today, or ever will. And in fact you are making an even stronger claim than that: not just that vision will be helpful, but that it is absolutely necessary.

