Makes me wonder whether we could get really huge contexts much more efficiently by feeding a higher layer back into the tail end of the model. That way it has a very clear picture of the recent text but only a compressed picture of the earlier parts of the document.
(I think I’ve got to read up on how transformers actually work.)
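Roughly what I'm picturing, as a very hand-wavy PyTorch sketch (the class name, layer sizes, number of memory slots, and the average-pooling "compression" are all made up, and I'm ignoring the causal mask entirely):

```python
import torch
import torch.nn as nn

class ChunkedSummaryLM(nn.Module):
    """Read the document in chunks; after each chunk, compress a higher
    layer's hidden states into a few 'memory' vectors and prepend them to
    the next chunk, so recent text is seen in full detail but earlier
    text only as a compressed summary."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, n_mem=16, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)
        self.n_mem = n_mem

    def forward(self, chunk_ids, memory=None):
        x = self.embed(chunk_ids)                   # (B, T, d)
        if memory is not None:
            x = torch.cat([memory, x], dim=1)       # prepend compressed past
        h = self.encoder(x)                         # attend over chunk + memory
        h_chunk = h[:, -chunk_ids.size(1):]         # drop the old memory positions
        # "Compress" the chunk into n_mem vectors for the next step
        # (average pooling as a placeholder; a real model would learn this;
        # assumes the chunk length is a multiple of n_mem).
        B, T, D = h_chunk.shape
        new_memory = h_chunk.reshape(B, self.n_mem, T // self.n_mem, D).mean(dim=2)
        return self.lm_head(h_chunk), new_memory
```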
Afaik you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well on modern hardware. That, and vanishing gradients.
I had the same thought as the comment you're responding to.
Recurrent neural networks break down when the recurrence chain gets to ~100 steps or more, and with token-at-a-time recurrence that's what you need just to process a single paragraph.
But if you wrap an RNN around a Transformer-based LLM, then each recurrence step adds +4K or +8K tokens, not +1.
E.g. GPT-4 32K would need just 4 RNN steps to reach 128K tokens!
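Concretely, the "RNN" is just a plain loop over chunks, carrying a compressed memory forward. Toy driver for the sketch in the comment above (it assumes that class is in scope; chunk and document sizes are shrunk so it actually runs, but 4 chunks of 32K would give you ~128K of context):

```python
model = ChunkedSummaryLM()
doc = torch.randint(0, 32000, (1, 4 * 1024))   # pretend document: 4 chunks of 1024 tokens
memory = None
for chunk in doc.split(1024, dim=1):           # one recurrence step per chunk
    logits, memory = model(chunk, memory)      # memory = compressed view of earlier chunks
```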