No need even for synthetic data. Move away from doing statistics on troves of human-generated data and toward doing statistics on raw environmental data, the way humans do. This will also likely mean moving away from the GPT architecture entirely.
This is the way. Robotics, self-driving cars, and mobile devices themselves, whether phones or glasses...
They can generate the data needed for the next steps. (pun intentional)
Text was easy; that process has been defined. Multi-modality improves the models. The next set of data is the world itself: everything we currently observe with our senses, plus data outside our senses, like radio waves. Build a better world model.
Seems silly to say: "Instead of training this LLM on internet data, we will train the LLM on the output of an OCR model pointed at a screen scrolling internet data. This is clearly more ethical."
I made no allusion to ethics here; merely logic. Humans don't learn by scraping the web, but by observing the environment. The web is an incomplete distillation of humanity's knowledge of facts and various interpretations of them (which also includes "creative" works, since one cannot create anything without basing it on environmental observations).