Honestly, I swear to god, I've been working 12 hours a day with these for a year now, llama.cpp, Claude, OpenAI, Mistral, Gemini:
The long context window isn't worth much and is currently creating more problems than it's worth for the bigs, with their "unlimited" use pricing models.
Let's take Claude 3's web UI as an example. We build it and go the obvious route: we simply pack as much of the chat history into the context as will fit.
Well, now once you're 50-100K tokens in, the initial prefill takes forever, on the order of 10 seconds. So we have to display a warning whenever that's the case.
Now we're generating an extreme amount of GPU load on prefill, and it's extremely unlikely to be helpful. Writing code? The previous messages are likely ones that needed revisions. Input isn't arbitrary/free: at ~$0.02 per 1,000 tokens, prefill is expensive, and it runs on the GPU.
Less expensive than generation, but not by that much. So now we're burning ~$2 worth of GPU time on a 100K-token conversation. And all of the bigs price with a flat fee per month.
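To make the arithmetic above concrete, here's a back-of-envelope sketch. The ~$0.02/1K-token input rate and the 20-turn conversation are assumptions taken from (or extrapolated beyond) the comment, not a vendor's actual billing:

```python
# Back-of-envelope prefill cost, assuming the ~$0.02 / 1K-token
# input rate mentioned above (a rough round number, not a quote).
COST_PER_1K_INPUT_TOKENS = 0.02  # dollars

def prefill_cost(context_tokens: int) -> float:
    """Dollar cost of prefilling `context_tokens` of chat history."""
    return context_tokens / 1000 * COST_PER_1K_INPUT_TOKENS

# A 100K-token conversation, re-prefilled on a single turn:
per_turn = prefill_cost(100_000)

# Hypothetical: if every one of 20 turns resends the full 100K history,
# prefill alone costs ~20x that, against a flat monthly subscription.
total = per_turn * 20
print(f"${per_turn:.2f} per turn, ${total:.2f} over 20 turns")
```

The point isn't the exact figure, it's that prefill cost scales with context length while the subscription price doesn't.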
Now, even our _paid_ customers face message limits on all our models. (This is true: Anthropic quietly introduced them at the end of last week.)
Functionally:
The output limit is 4,096 tokens, so tasks that are a map function (e.g. reword Moby Dick in Zoomer slang) need the input split into ~4,096-token chunks anyway.
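The "map function" pattern above can be sketched roughly like this. Note this is an illustration, not anyone's production code: token counts are approximated by whitespace-split words (a real implementation would use the model's tokenizer), and `rewrite_fn` is a hypothetical stand-in for one model call per chunk:

```python
# Because output is capped at ~4,096 tokens, a rewrite-the-whole-book
# task has to be chunked to roughly that size on input too.
CHUNK_TOKENS = 4096

def chunk_text(text: str, max_tokens: int = CHUNK_TOKENS) -> list[str]:
    # Crude approximation: one whitespace-separated word ~= one token.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

def map_rewrite(text: str, rewrite_fn) -> str:
    # rewrite_fn stands in for a single model call on one chunk.
    return " ".join(rewrite_fn(chunk) for chunk in chunk_text(text))
```

So the long context window buys nothing here: the work is bounded by the output limit either way.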
The only use cases I've seen thus far that _legitimately_ benefit are needle-in-a-haystack stuff, video with Gemini, or cases with huge inputs and small outputs: like putting 6.5 Harry Potter books into Gemini and getting out a Mermaid diagram connecting the characters.
As a user, I've been putting in some long mathematical research papers and asking detailed questions about them in order to understand certain parts better. I feel some benefit from it, because with the full paper in context it's less likely to misunderstand notation that was defined earlier, etc.
This is similar to my thoughts: "code" is for humans. AI doesn't need a game engine or massive software; some future video game just needs to output the next frame and respond to input. Little to no code required.
That's really interesting. Even if you specifically tell it to "write non-rhyming, free verse, iambic pentameter prose", it absolutely cannot generate appropriate output.
Am I missing something? 8K seems quite low in the current landscape.