People condense information from more than a few thousand tokens ago into higher-level chunks. So the whole first chapter of Infinite Jest becomes “child tennis prodigy is weird and annoying”.
It’s not clear how to learn this. Transformers work so well in part because there are only a few layers between any input token and the output. Adding a summarization step in between is a whole new research project.
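As a rough sketch of what that summarization step might look like (my illustration, not a proposal from the post): keep a recent window of tokens verbatim and replace everything older with chunk-level summaries. The `summarize` stand-in below just keeps each chunk’s first sentence; in a real system it would be another model call, and learning it end-to-end is exactly the hard part. All names and window sizes here are hypothetical.

```python
def summarize(chunk: str) -> str:
    # Stand-in summarizer: keep the first sentence as the "higher-level
    # chunk". A real system would use a learned model here.
    return chunk.split(". ")[0].rstrip(".") + "."


def compress_context(tokens: list[str], window: int = 2048,
                     chunk_size: int = 512) -> str:
    """Keep the last `window` tokens verbatim; replace everything older
    with one short summary per `chunk_size`-token chunk."""
    old, recent = tokens[:-window], tokens[-window:]
    summaries = [
        summarize(" ".join(old[i:i + chunk_size]))
        for i in range(0, len(old), chunk_size)
    ]
    # Summaries stand in for the distant past; recent tokens stay exact.
    return " ".join(summaries + recent)
```

The effective context stays bounded: the distant past costs one summary per chunk instead of thousands of raw tokens, at the price of whatever the summarizer throws away.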