Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context.
Not bad compared to rolling your own, but among frontier models the main competitive differentiator was native multimodality. With the release of GPT-4o I'm not clear on why an organization not bound to GCP would pick Gemini. 128k context (4o) is fine unless you're processing whole books/movies at once. Is anyone doing this at scale in a way that can't be filtered down from 1M to 100k?
With 1M tokens you can dump 2000 pages of documents into the context window before starting a chat.
Gemini's strength isn't in answering logic puzzles; its strength is its context length. Studying for an exam? Just put the entire textbook in the chat. Need to use a dead language for an old test system with no information on the internet? Drop the 1300 page reference manual in and ask away.
According to https://ai.google.dev/pricing it's $0.70/million input tokens (for a long context). That will be per-exchange, so every little back and forth will cost around that much (if you're using a substantial portion of the context window).
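Back of the envelope (a rough sketch, assuming the $0.70/1M long-context input rate above and that the whole history is re-sent every turn; output-token costs ignored):

```python
# Rough cost sketch: re-sending a large context on every turn of a chat.
# Assumes the $0.70 per 1M input tokens long-context rate quoted above.
INPUT_RATE_PER_TOKEN = 0.70 / 1_000_000

def conversation_input_cost(context_tokens, turns, tokens_per_message=500):
    """Total input cost if the whole history is re-sent with each exchange."""
    total = 0.0
    history = context_tokens
    for _ in range(turns):
        history += tokens_per_message          # each turn grows the history a little
        total += history * INPUT_RATE_PER_TOKEN
    return total

# e.g. a 900k-token document plus a 20-turn chat:
print(f"${conversation_input_cost(900_000, 20):.2f}")  # ~$12.67 of input tokens
```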
And while I haven't tested Gemini, most LLMs get increasingly wonky as the context goes up, more likely to fixate, more likely to forget instructions.
That big context window could definitely be great for certain tasks (especially information extraction), but it doesn't feel like a generally useful feature.
That per-exchange context cost is what really puts me off using cloud LLMs for anything serious. I know batching is needed in the data center, and keeping the KV cache around is important, but you basically need to fully take over a machine for an interactive session before the context cost scales with sequence length instead of being re-billed every turn. So it's useful, but more in a local-LLaMA type situation if you want a conversation.
I wonder if we could implement the equivalent of JIT compilation, where context sequences that get repeatedly reused trigger an online fine-tune.
Is there a way to amortize that cost over several queries, i.e. "pre-bake" a document into a context persisted in some form to allow cheaper follow-up queries about it?
They announced that today, calling it "context caching" - but it looks like it's only going to be available for Gemini Pro 1.5, not for Gemini Flash.
It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications.
> It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications
That's on a model with $3.5/1M input token cost, so half price on cached prefix tokens for $4.5/1M/hour breaks even at a little over 2.5 requests/hour using the cached prefix.
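Spelled out as arithmetic (a minimal sketch using only the prices quoted above):

```python
# Break-even point for context caching, using the numbers quoted above.
normal_rate = 3.50 / 1_000_000       # $ per input token without caching
cached_rate = normal_rate / 2        # cached prefix tokens billed at half price
storage_per_hour = 4.50 / 1_000_000  # $ per cached token per hour to keep it warm

saving_per_request = normal_rate - cached_rate            # $1.75 per 1M prefix tokens
breakeven_requests_per_hour = storage_per_hour / saving_per_request
print(breakeven_requests_per_hour)   # ~2.57 requests/hour before caching pays off
```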
Depending on the output window limit, the first query could be something like: "Summarize this down to its essential details" -- then use that to feed future queries.
Tediously, it would be possible to do this chapter by chapter, building up something for future inputs that's larger than a single response's output limit would allow.
Of course, the summary might not fulfill the same functionality as the original source document. YMMV
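A minimal sketch of that chapter-by-chapter pre-baking, with `summarize_with_llm` standing in as a placeholder for whatever model call you actually use:

```python
# Sketch of a chapter-by-chapter "pre-bake": summarize each chunk separately so
# no single response has to fit the whole document in its output window, then
# reuse the concatenated summaries as a compact context for follow-up queries.

def summarize_with_llm(text: str) -> str:
    raise NotImplementedError("placeholder: call your LLM of choice here")

def prebake_summary(chapters: list[str]) -> str:
    summaries = []
    for i, chapter in enumerate(chapters, 1):
        prompt = f"Summarize this down to its essential details:\n\n{chapter}"
        summaries.append(f"Chapter {i}: {summarize_with_llm(prompt)}")
    return "\n\n".join(summaries)

# Later queries prepend prebake_summary(chapters) instead of the full text.
```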
Can anyone speculate on how G arrived at this price, and perhaps how it contrasts with how OAI arrived at its updated pricing? (realizing it can't be compared directly against GPT-x at the moment)
Isn't there retrieval degradation with such a large context size? I would still think that a RAG system on 128K is better than no RAG + a 1M context window, no? (assuming text only)
You don't really use it, right? There's no way to debug if you're doing it like this. Also, the accuracy isn't high, and it can't answer complicated questions, making it quite useless for the cost.
I've been trying to work Gemini 1.5 Pro into our workstream for all kinds of stuff and it is so bad. Unbelievable amount of hallucinations, especially when you introduce video or audio.
I'm not sure I can think of a single use case where a tiny multimodal model with a high hallucination rate is practical in most businesses. Without reliability it's just a toy.
> With the release of GPT-4o I'm not clear on why an organization not bound to GCP would pick Gemini.
Price, for anything that doesn't need GPT-4 quality, particularly multimodal tasks, where GPT-4o is the cheapest option on OpenAI's side. GPT-3.5-Turbo, which is itself 1/10 the cost of GPT-4o, is $0.50/1M tokens on input and $1.50/1M on output, with a 16K context window. Gemini 1.5 Flash, for prompts up to 128K, is $0.35/1M tokens on input and $0.53/1M tokens on output.
For tasks that require multimodality but not GPT-4 smarts (which I think includes a lot of document-processing tasks, for which GPT-4 with Vision and now GPT-4o are magical but pricey), Gemini Flash looks like close to a 95% price cut.
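Rough sanity check on that figure, using only the input prices already quoted in this thread (the ~$5/1M for GPT-4o is implied by the 1/10 ratio above, not an official number, and this ignores output tokens):

```python
# Sanity check on the "~95% price cut" claim, input side only.
gpt4o_input = 0.50 * 10      # implied ~$5.00 per 1M input tokens for GPT-4o
flash_input = 0.35           # Gemini 1.5 Flash, prompts up to 128K

cut = 1 - flash_input / gpt4o_input
print(f"{cut:.0%}")          # 93% -- in the right ballpark
```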
Long context is a little bit different than extra email storage. Having 1 GB of storage instead of 50 MB has essentially no downside to the user experience.
But submitting 1M input tokens instead of 100k input tokens:
- Causes your costs to go up ~10x
- Causes your latency to go up ~10x (or between 1x and 10x)
- Can result in worse answers (especially if the model gets distracted by irrelevant info)
So longer context is great, yes, but it's not a no-brainer like more email storage. It brings costs. And whether those costs are worth it depends on what you're doing.
E.g. I want to send an entire codebase as context. It might not fit into 128k.
Filtering down is a complex task by itself. It's much easier to call a single API.
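For what it's worth, a naive sketch of what "filtering down" might look like; real pipelines use embedding similarity, structure-aware chunking, and reranking rather than keyword overlap, which is part of why it's complex:

```python
# Naive "filter 1M tokens down to ~100k" sketch: score fixed-size chunks by
# keyword overlap with the question and keep the best ones under a budget.
# The 400k-character budget is a rough stand-in for ~100k tokens.

def filter_context(document: str, question: str, budget_chars: int = 400_000) -> str:
    chunk_size = 4_000
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    query_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    kept, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        if used + len(chunk) > budget_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return "\n...\n".join(kept)
```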
Regarding quality of responses, I've seen both disappointing and brilliant responses from Gemini. So maybe worth trying. But it will probably take several iterations until it can be relied upon.