
1. For me, transformer multimodal processing is still language-like, since the work is done via tokens/patches. In fact, that makes its image-processing capabilities limited in some scenarios compared to other techniques.

2. When on a bike you need to be fully alert and listening all the time, or you might cause an accident. Sure, you can use headphones, but it is not recommended. Also, that robot you showed is narrow AI; it should be out of the discussion when arguing about complex end-to-end models that are supposedly comparable to humans. If not, we could just code any missing capability as a tool, but that would automatically invalidate the "LLMs reason/think/process the same way as humans" argument.

3. Agree, I expect the same in the future. I'm talking about the present though.

4. ;) a robot that helps with daily chores can't come soon enough. More important than coding AI for me.
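To make point 1 concrete: a minimal sketch of how a ViT-style model turns an image into a flat sequence of patch tokens before any attention is applied. The function name `patchify` and the 16-pixel patch size are illustrative assumptions, not any specific model's API; the point is that the image becomes a token sequence, just like text.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the token sequence a vision transformer actually attends over.
    (Illustrative sketch; names and sizes are assumptions.)"""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the grid: (rows, patch_h, cols, patch_w, C)
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # Reorder so each patch is contiguous, then flatten each to one vector
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

img = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

Anything inside a 16x16 patch is collapsed into a single token, which is one way to see why fine-grained spatial tasks can suffer compared to, say, convolutional or pixel-level techniques.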

PS: Not sure if we're talking about the same argument anymore. I'm following the line of "LLMs work/reason the same as humans", not "AI can't perform as well as humans (now and in the future)".






> PS: Not sure if we're talking about the same argument anymore. I'm following the line of "LLMs work/reason the same as humans", not "AI can't perform as well as humans (now and in the future)".

Yeah, it's hard to keep track :)

I think we're broadly on the same page with all of this, though. (Even if I might be a little more optimistic on tokenised processing of images).



