
What techniques do they actually use to achieve 100k? I assumed they load the document into a vector database and then build some kind of views into it.
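Something like this is what I have in mind (just a toy sketch, no idea what they actually do; the embedding function here is a fake stand-in for a real embedding model):

    # Chunk the document, embed each chunk, and at query time pull only the
    # most similar chunks into the prompt. Purely illustrative.
    import numpy as np

    def fake_embed(text):
        # Stand-in: hash characters into a fixed-size unit vector.
        # A real system would call an embedding model here.
        vec = np.zeros(128)
        for i, ch in enumerate(text):
            vec[(i + ord(ch)) % 128] += 1.0
        return vec / (np.linalg.norm(vec) + 1e-9)

    def build_index(document, chunk_size=1000):
        chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
        return chunks, np.stack([fake_embed(c) for c in chunks])

    def retrieve(query, chunks, embeddings, k=3):
        scores = embeddings @ fake_embed(query)   # cosine similarity (vectors are unit-norm)
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]           # these get stuffed into the prompt

    chunks, emb = build_index("some very long document " * 500)
    print(retrieve("long document", chunks, emb, k=2)[0][:40])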



I would assume training the LLM on sequences of 100k tokens would be the right way.


Makes me wonder whether we could get really huge contexts much more efficiently by feeding back a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.

(I think I’ve got to read up on how transformers actually work.)
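Roughly what I'm imagining, as a toy sketch (the sizes, the average-pooling "compression", and the use of plain torch modules are all made up for illustration):

    # Process a long document in segments, and feed a pooled ("compressed")
    # version of each segment's top-layer states back in alongside the next
    # segment, so recent text is seen in full and older text only in summary.
    import torch
    import torch.nn as nn

    d_model, n_mem = 256, 8   # hidden size; compressed memory vectors per segment
    encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
    embed = nn.Embedding(32000, d_model)   # toy vocabulary

    def encode_long(token_ids, seg_len=512):
        memory = torch.zeros(1, 0, d_model)          # starts empty
        outputs = []
        for start in range(0, token_ids.size(1), seg_len):
            seg = embed(token_ids[:, start:start + seg_len])
            x = torch.cat([memory, seg], dim=1)      # recent text in full, older text compressed
            h = encoder(x)
            seg_h = h[:, memory.size(1):]            # keep only the new segment's states
            outputs.append(seg_h)
            # "Compress" the segment's top-layer states into n_mem vectors by average pooling.
            pooled = nn.functional.adaptive_avg_pool1d(seg_h.transpose(1, 2), n_mem).transpose(1, 2)
            memory = torch.cat([memory, pooled], dim=1).detach()  # stop gradients across segments
        return torch.cat(outputs, dim=1)

    out = encode_long(torch.randint(0, 32000, (1, 2048)))
    print(out.shape)   # torch.Size([1, 2048, 256])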


Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well on modern hardware. And vanishing gradients.


I had the same thought as the comment you're responding to.

Recurrent neural networks are bad when the recurrence runs for 100 steps or more. And with token-at-a-time recurrence, chains that long are unavoidable: that's roughly what you need to process even one paragraph.

But if you wrap an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence step, not +1.

E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!
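A concrete toy version of that loop (llm is a placeholder for whatever 32K-context model you'd call, and the prompt wording is made up):

    # Chunk-level recurrence: each step feeds the previous step's compressed
    # state (here, a running text summary) plus the next chunk into the model.
    def llm(prompt):
        # Placeholder: swap in a real model call; this stub just echoes the tail of the prompt.
        return prompt[-200:]

    def recurrent_read(document_tokens, chunk_size=32_000):
        state = ""                                   # the "hidden state" carried between steps
        for start in range(0, len(document_tokens), chunk_size):
            chunk = " ".join(document_tokens[start:start + chunk_size])
            state = llm(
                "Summary of the document so far:\n" + state +
                "\n\nNext part of the document:\n" + chunk +
                "\n\nUpdate the summary so it covers everything read so far."
            )
        return state   # a 128K-token document takes ~4 of these steps with a 32K-context model

    print(recurrent_read(["token"] * 100_000)[:40])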



