I like the approach! But does it work on long documents? If you're doing a single LLM pass, how does the LLM keep track of the chunks it's already made?
For long documents we use a rolling-window strategy: we cut the document into 5,000-token groups for inference, with a 400-token overlap between adjacent groups, and for the overlapping tokens we prefer the values from the earlier group.
For example, if Group #0 overlaps with Group #1 at index 5,200, we use the logprob from Group #0, because it had more preceding context. Group #1 still gets the benefit of indices 5,000-5,400 as context for the rest of its window, even though we toss out its logprobs for that range.
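To make the merge concrete, here's a minimal sketch in Python. The window/stride pair (5,400 and 5,000) is my reading of the indices in the example above, and `score_group` is a hypothetical stand-in for whatever inference call produces one heatmap value per token for a group:

```python
from typing import Callable, List, Optional

WINDOW = 5_400   # tokens per inference call: 5,000 new + 400 overlap context (assumed)
STRIDE = 5_000   # step between group starts, so adjacent groups share 400 tokens

def rolling_heatmap(tokens: List[int],
                    score_group: Callable[[List[int]], List[float]]) -> List[float]:
    """Merge per-group scores into one heatmap, preferring the earlier
    group's value for overlap tokens (it saw more preceding context)."""
    heat: List[Optional[float]] = [None] * len(tokens)
    for start in range(0, len(tokens), STRIDE):
        group = tokens[start:start + WINDOW]
        for offset, score in enumerate(score_group(group)):
            idx = start + offset
            if heat[idx] is None:   # first writer wins, i.e. the earlier group
                heat[idx] = score
    return heat                     # every index is covered since STRIDE < WINDOW
```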
There's no need for the model to keep track of chunks it's already made; we just want heatmap values. We then use that heatmap to split at the hottest character near our target chunk length (or pick a threshold value and binary-search it to hit a target number of chunks or average chunk size).
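Here's a sketch of both splitting strategies, again with illustrative names: `heat` is the merged heatmap from above, `split_near_target` cuts greedily at the hottest index near the target length, and `threshold_for_chunk_count` binary-searches a threshold for a target chunk count. The slack window and iteration count are assumptions, not the actual implementation:

```python
from typing import List

def split_near_target(heat: List[float], target_len: int, slack: int = 200) -> List[int]:
    """Place each boundary at the hottest index within +/- `slack`
    of the target length from the previous boundary."""
    boundaries, start = [], 0
    while start + target_len + slack < len(heat):
        lo = start + target_len - slack
        hi = start + target_len + slack
        cut = max(range(lo, hi), key=lambda i: heat[i])   # hottest index in the window
        boundaries.append(cut)
        start = cut
    return boundaries

def threshold_for_chunk_count(heat: List[float], want_chunks: int, iters: int = 30) -> float:
    """Binary-search a heat threshold so the number of cut points
    (indices at or above it) lands near the desired chunk count."""
    lo, hi = min(heat), max(heat)
    for _ in range(iters):
        mid = (lo + hi) / 2
        chunks = sum(h >= mid for h in heat) + 1
        if chunks > want_chunks:
            lo = mid   # too many cuts -> raise the threshold
        else:
            hi = mid   # too few cuts -> lower the threshold
    return (lo + hi) / 2
```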