I would argue that some would add time to that as well; a lot of our data are missing spatial and temporal information. But if we're able to take text2text models and add in audio/vision, then I suspect we can apply the same technique to add in spatial and temporal intelligence. However, unlike audio and visual data, the data for those are nonexistent.