Personally I’ve been thinking about this problem for some time and have had a neat idea that im almost tempted to patent, I have yet to test it via an actual implementation but the core idea is quite simple…
IMHO, your article is missing an important point: 90% of implementations today flatten documents to plain text before chunking them. Why not consider the visual appearance that the human gave to the document?
Using layout information combined with semantics, you can increase rag performances by +160% (tested via benchmarks), so why do most of us only use text?