Our next-generation model: Gemini 1.5 (blog.google)
1244 points by todsacerdoti 11 months ago | 588 comments



The white paper is worth a read. The things that stand out to me are:

1. They don't talk about how they get to 10M token context

2. They don't talk about how they get to 10M token context

3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creating caching abilities is going to be important for a lot of long token chatting features now, though). This is going to make things much, much simpler for a lot of use cases.

4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

5. It seems like 1.5 Ultra is going to be highly capable. 1.5 Pro is already very very capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

Upshot, 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great.

Time to dust off my coding test again, I think, which is: "here is a tarball of a repository. Write a new module that does X".

I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector; for instance, their audio and video "needle" tests, which just include inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, have perfect recall across up to 10M tokens. The text insertion occasionally fails. I'd speculate that this means there is some sort of compression going on; a full video frame with text on it is going to use a lot more tokens than the text needle.


"The 10M context ability wipes out most RAG stack complexity immediately."

I'm skeptical; my past experience is that just because the context has room to stuff whatever you want into it, the more you stuff into the context the less accurate your results are. There seems to be a balance between providing enough that you'll get high quality answers, and not so much that the model is overwhelmed.

I think a large part of developing better models is not just better architectures that support larger and larger context sizes, but also models that can properly leverage that context. That's the test for me.


They explicitly address this in page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.


I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.


I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.


I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).


Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?

What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
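
A rough sketch of that harness (ask_model is a placeholder for whatever API you're testing, and the eval set is your own question/answer pairs, so treat this as an outline rather than a finished benchmark):

    import random
    FILLER = open("unrelated_corpus.txt").read().split()  # irrelevant padding text

    def padded_prompt(question, saturation, window=1_000_000):
        # fill `saturation` fraction of the window with irrelevant tokens, question last
        pad = " ".join(random.choices(FILLER, k=int(window * saturation)))
        return f"{pad}\n\nQuestion: {question}\nAnswer:"

    def saturation_sweep(eval_set, ask_model, saturations=(0.09, 0.5, 0.99)):
        # eval_set: list of (question, expected_answer) pairs
        results = {}
        for s in saturations:
            correct = sum(exp.lower() in ask_model(padded_prompt(q, s)).lower()
                          for q, exp in eval_set)
            results[s] = correct / len(eval_set)
        return results  # accuracy as a function of context saturation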


Manual testing on complex documents. A big legal contract for example. An issue can be referred to in 7 different places in a 100 page document. Does it give a coherent answer?

A handful of examples show whether it can do it. For example, GPT-4 turbo is downright awful at something like that.


You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.


Would be awesome if it is solved but seems like a much deeper problem tbh.


Unfortunately Google's track record with language models is one of overpromising and underdelivering.


It's only the web-interface LLMs of the past few years that have been lackluster. The statement is not correct for their overall history: W2V-based language models and BERT/Transformer models in the early days (publicly available, though not in a web interface) were far ahead of the curve, as Google/DeepMind were the ones producing those innovations. Effectively, DeepMind/Google are academics (where the real innovations are made), but they do struggle to produce corporate products (where OpenAI shines).


I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (it may be true of intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.

Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.

Of course none of this even matters anyway because the weights are closed the architecture is closed nobody has access to the model. I'll believe it when I see it.


Their in-context long-sequence understanding "benchmark" is pretty interesting.

There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]

They set up a test of in-context learning capabilities at long context - they asked 3 long-context models (GPT 4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - are fed into the model as part of the prompt), and full-book (the whole 250k tokens are fed into the model). Finally, they had human raters check these translations.

This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.

It'd be great to make this and other reasoning-at-long-ctx benchmarks a standard affair for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA since they all use different benchmarks, meaningless benchmarks, or drastically different methodologies that there's no fair comparison.

[1] from https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The available resources for Kalamang are: field linguistics documentation comprising a ∼500 page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total the available resources for Kalamang add up to around ∼250k tokens.


I believe that's a limitation of using vectors of high dimensions. It'll be muddy.


Not unlike trying to keep the whole contents of the document in your own mind :)


It's amazing we are in 2024 discussing the degree a machine can reason over millions of tokens of context. The degree, not the possibility.


Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."

The Sora release is even more mind blowing - not the video generation in my mind but the idea that it can infer properties of reality that it has to learn and constrain in its weights to properly generate realistic video. A side effect of its ability is literally a small universe of understanding.

I was thinking that I want to play with audio to audio LLMs. Not text to speech and reverse but literally sound in sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.


Did you think the extraction of information from the Buster Keaton film was muddy? I thought it was incredibly impressive to be this precise.


That was not muddy, but it's not the kind of scenario where muddiness shows up.


Page 8 of the technical paper [1] is especially informative.

The first chart (Cumulative Average NLL for Long Documents) shows a deviation from the trend and an increase in accuracy when working with >=1M tokens. The 1.0 graph is overlaid and supports the experience of 'muddiness'.

[1] https://storage.googleapis.com/deepmind-media/gemini/gemini_...


Would like to see the latency and cost of parsing entire 10M context before throwing out the RAG stack which is relatively cheap and fast.


Also, unless they significantly change their pricing model, we're talking about $0.50 per API call at current prices


I think there are also a lot of people who are only interested in RAG if they can self-host and keep their documents private.


Yes and the ability to have direct attribution matters so you know exactly where your responses come from. And costs as others point out, but RAG is not gone in fact it just got easier and a lot more powerful.


costs rise on a per-token basis. So you CAN use 10M tokens, but it's probably not usually a good idea. A database lookup is still better than a few billion math operations.


I think the unspoken goal is to just lay off your employees and dump every doc and email they’ve ever written as one big context.

Now that Google has tasted the previously forbidden fruit of layoffs themselves, I think their primary goal in ML is now headcount reduction.


Somehow I just don't see the execs or managers being able to make this work well for them without help. Plus, documents still need to be generated. Are they going to be spending all day prompting LLMs?


LLMs are able to utilize "all the world's" knowledge during training and give seemingly magical answers. While providing context in the query is different than training models, is it possible that more context will give more materials to the LLM and it will be able to pick out the relevant bits on its own?

What if it was possible, with each query, to fine tune the model on the provided context, and then use that JIT fine-tuned model to answer the query?


Are you asking what if it was possible that a "context window" ceased to exist? In a different architecture than we currently use, I guess that's hypothetically possible.

As it is now, you can't fine tune on context. It would have almost no effect on the parameters.

Context is like giving your friend a magazine article and asking them to respond to it. Fine tuning is like throwing that magazine article into the ocean of all content they ever came across during their lifetime.


I am not an expert here so I may be mixing terms and concepts.

The way I understand it, there is a base model that was trained on vast amount of general data. This sets up the weights.

You can fine-tune this base model on additional data. Often this is private data that is concentrated around a certain domain. This modifies the model's weights some more.

Then you have the context. This is where your query to the LLM goes. You can also add the chat history here. Also, system prompts that tell the LLM to behave a certain way go here. Finally, you can take additional information from other sources and provide it as part of the context -- this is called Retrieval Augmented Generation. All of this really goes into one bucket called the context, and the LLM needs to make sense of it. None of this modifies the weights of the model itself.

Is my mental picture correct so far?

My question is around RAG. It seems that providing additional selected information from your knowledge base, or using your knowledge base to fine-tune a model, seem similar. I am curious in which ways these are similar, and in which ways they cause the LLM to behave differently.

Concretely, say I have a company knowledge base with a bunch of rules and guidelines. Someone asks an agent "Can I take 3 weeks off in a row?" How would these two scenarios be different:

a) Agents searches the knowledge base for all pages and content related to "FTO, PTO, time off, vacations" and feeds those articles to the LLM, together with the "Can I take 3 weeks off in a row?" query

b) I have an LLM that has been fine tuned on all the content in the knowledge base. I ask it "Can I take 3 weeks off in a row?"


> Is my mental picture correct so far?

Yes

> How would these two scenarios be different

They're different in exactly the way you described above. The agent searching the knowledge base for "FTO, PTO, time off, vacations" would be the same as you pasting all the articles related to those topics into the prompt directly - in both cases, it goes into the context.

In scenario a, you'll likely get the correct response. In scenario b, likely get an incorrect response.

Why? Because of what you explained above. Fine tuning adjusts the weights. When you adjust weights by feeding data, you're only making small adjustments to shift slightly along a curve - thus the exposure to this data (for the purposes of fine tuning) will have very little effect on the next context the model is exposed to.


Have to consider cost for all of this. A big value of RAG already, even given the size of GPT-4's largest context, is that it decreases cost very significantly.


Also, costs are always per context token; you don't want to put in 10M tokens of context for every request (it's just nice to have that option when you want to do big things that don't scale)


How much would a lawyer charge to review your 10M-token legal document?


10M tokens is something like 14 copies of War and Peace, or maybe the entire Harry Potter series seven times over. That'd be some legal document!


Hmm I don’t know but I feel like the U.S. Congress has bills that would push that limit.


> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

They try to push that, but it's not the most convincing. Look at Table 8 for text evaluations (math, etc.) - they don't even attempt a comparison with GPT-4.

GPT-4 is higher than any Gemini model on both MMLU and GSM8K. Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71). Gemini Pro does crush naive GPT-4 on math (though not with code interpreter and this is the original model).

All in 1.5 Pro seems maybe a bit better than 1.0 Ultra. Given that in the wild people seem to find GPT-4 better for say coding than Gemini Ultra, my current update is Pro 1.5 is about equal to GPT-4.

But we'll see once released.


> people seem to find GPT-4 better for say coding than Gemini Ultra

For my use cases, Gemini Ultra performs significantly better than GPT-4.

My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I took 20 prompts that I'd run with GPT-4 and fed them to Gemini Ultra. Gemini gave a clearly better result in 16 out of 20 cases.

Where GPT-4 might miss one or two requirements, Gemini usually got them all. Where GPT-4 might require multiple chat turns to point out its errors and omissions and tell it to fix them, Gemini often returned the result I wanted in one shot. Where GPT-4 hallucinated a method that doesn't exist, or had been deprecated years ago, Gemini used correct methods. Where GPT-4 called methods of third-party packages it assumed were installed, Gemini either used native code or explicitly called out the dependency.

For the 4 out of 20 prompts where Gemini did worse, one was a weird rejection where I'd included an image in the prompt and Gemini refused to work with it because it had unrecognizable human forms in the distance. Another was a simple bash script to split a text file, and it came up with a technically correct but complex one-liner, while GPT-4 just used split with simple options to get the same result.

For now I subscribe to both. But I'm using Gemini for almost all coding work, only checking in with GPT-4 when Gemini stumbles, which isn't often. If I continue to get solid results I'll drop the GPT-4 subscription.


I have a very similar prompting style to yours and share this experience.

I am an experienced programmer and usually have a fairly exact idea of what I want, so I write detailed requirements and use the models more as typing accelerators.

GPT-4 is useful in this regard, but I also tried about a dozen older prompts on Gemini Advanced/Ultra recently and in every case preferred the Ultra output. The code was usually more complete and prod-ready, with higher sophistication in its construction and somewhat higher density. It was just closer to what I would have hand-written.

It's increasingly clear, though, that LLM use has a couple of different major modes among end-user behavior: knowledge base vs. reasoning, exploratory vs. completion, instruction following vs. getting suggestions, etc.

For programming I want an obedient instruction-following completer with great reasoning. Gemini Ultra seems to do this better than GPT-4 for me.


It constantly hallucinates APIs for me, I really wonder why people's perceptions are so radically different. For me it's basically unusable for coding. Perhaps I'm getting a cheaper model because I live in a poorer country.


Are you using Gemini Advanced? (The paid tier.) The free one is indeed very bad.


Spent a few hours comparing Gemini Advanced with GPT-4.

Gemini Advanced is nowhere even close to GPT-4, either for text generation, code generation or logical reasoning.

Gemini Advanced is constantly asking for directions ("What are your thoughts on this approach?"), even to create a short task list of 10 items, and even when told several times to provide the full list and not stop every three or four items to ask for directions. It's constantly giving moral lessons or finishing the results with annoying marketing-style comments of the type "Let's make this an awesome product!"

Code is more generic, solutions are less sophisticated. On a discussion of Options Trading strategies Gemini Advanced got core risk management strategies wrong and apologized when errors were made clear to the model. GPT-4 provided answers with no errors, and even went into the subtleties of some exotic risk scenarios with no mistakes.

Maybe 1.5 will be it, or maybe Google realized this quite quickly and are trying the increased token size as a Hail Mary to catch up. Why release so soon?

Quite curious to try the same prompts on 1.5.


I asked Gemini Advanced, the paid one, to "Write a script to delete some files" and it told me that it couldn't do that because deleting files was unethical. At that point I cancelled my subscription since even GPT-4 with all its problems isn't nearly as broken as Gemini.


If you share your prompt I'm sure people here can help you.

Here's a prompt I used and got a script that not only accomplishes the objective, but even has an option to show what files will be deleted and asks for confirmation before deleting them.

Write a bash script to delete all files with the extension .log in the current directory and all subdirectories of the current directory.


I’m going to have to try Gemini for code again. It just occurred to me as a Xoogler that if they used Google’s code base as the training data it’s going to be unbeatable. Now did they do that? No idea, but quality wins over quantity, even with LLM.


There is no way NTK data is in the training set, and google3 is NTK.


I dunno, leadership is desperate and they can de-NTK if and when they feel like it.


What is “NTK”?


"Need To Know" I.e. data that isn't open within the company.


Almost all of google3 is basically open to all of engineering.


> My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I guess this is a tough request if you're working on a proprietary code base, but I would love to see some concrete examples of the prompts and the code they produce.

I keep trying this kind of prompting with various LLM tools including GPT-4 (haven't tried Gemini Ultra yet, I admit) and it nearly always takes me longer to explain the detailed requirements and clean up the generated code than it would have taken me to write the code directly.

But plenty of people seem to have an experience more like yours, so I really wonder whether (a) we're just asking it to write very different kinds of code, or (b) I'm bad at writing LLM-friendly requirements.


Not OP, but here is a verbatim prompt I put into these LLMs. I'm learning to make Flutter apps, and I like to try to make various UIs so I can learn how to compose things. I agree that Gemini Ultra (aka the paid "Advanced" mode) is def better than ChatGPT-4 for this prompt. Mine is a bit more terse than OP's huge prompt with numbered requirements, but I still got a super valid and meaningful response from Gemini, while GPT-4 told me it was a tricky problem and gave me some generic code snippets that explicitly don't solve the problem asked.

> I'm building a note-taking app in flutter. I want to create a way to link between notes (like a web hyperlink) that opens a different note when a user clicks on it. They should be able to click on the link while editing the note, without having to switch modalities (eg. no edit-save-view flow nor a preview page). How can I accomplish this?

I also included a follow-up prompt after getting the first answer, which again for Gemini was super meaningful, and already included valid code to start with. Gemini also showed me many more projects and examples from the broader internet.

> Can you write a complete Widget that can implement this functionality? Please hard-code the note text below: <redacted from HN since its long>


This is useful, thanks. Since you're using this for learning, would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

I've definitely had success using LLMs as a learning tool. They hallucinate, but most often the output will at least point me in a useful direction.

But my day-to-day work usually involves non-exploratory coding where I already know exactly how to do what I need. Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort.


> would it be fair to characterize this as asking the LLM to write code you don't already know how to write on your own?

Yea absolutely. I also use it to just write code I understand but am too lazy to write, but it's definitely effective at "show me how this works" type learning too.

> Those are the tasks where I've struggled to find ways to make LLMs save me any time or effort

Github CoPilot has an IDE integration where it can output directly into your editor. This is great for "// TODO: Unit Test for add(x, y) method when x < 0" and it'll dump out the full test for you.

Similarly useful for things like "write me a method that loops through a sorted list, and finds anything with <condition> and applies a transformation and saves it in a Map". Basically all those random helper methods can be written for you.


That last one is an interesting example. If I needed to do that, I would write something like this (in Kotlin, my daily-driver language):

    fun foo(list: List<Bar>) =
        list.filter { condition(it) }.associateWith { transform(it) }
which would take me less time to write than the prompt would.

However, if I didn't know Kotlin very well, I might have had to go look in the docs to find the associateWith function (or worse, I might not have even thought to look for it) at which point the prompt would have saved me time and taught me that the function exists.


Is there any chance you could share an example of the kind of prompt you're writing?

I'm always reluctant to write long prompts because I often find GPT4 just doesn't get it, and then I've wasted ten minutes writing a prompt


How do you interact with Gemini for coding work? I am trying to paste my code in the web interface and when I hit submit, the interface says "something went wrong" and the code does not appear in the chat window. I signed up for Gemini Advanced and that didn't help. Do you use AI Studio? I am just looking in to that now.


I've found Gemini generally equal with the .Net and HTML coding I've been doing.

I've never had Gemini give me a better result than GPT, though, so it does not surpass it for my needs.

The UI is more responsive, though, which is worth something.


> Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71).

Though they talk a bunch about how hard it was to filter out Human Eval, so this probably doesn't matter much.


I mean I don't see GPT-4 watching a 44-minute movie and being able to exactly pinpoint a guy taking a paper out of his pocket...


> The 10M context ability wipes out most RAG stack complexity immediately.

Remains to be seen.

Large contexts are not always better. For starters, it takes longer to process. But secondly, even with RAG and the large context of GPT4 Turbo, providing it a more relevant and accurate context always yields better output.

What you get with RAG is faster response times and more accurate answers by pre-filtering out the noise.


Hopefully we can get a better RAG out of it. Currently people do incredibly primitive stuff like splitting text into fixed-size chunks and adding them to a vector DB.

An actually useful RAG would be to convert text to Q&A and use Q's embeddings as an index. Large context can make use of in-context learning to make better Q&A.


A lot of people in RAG already do this. I do this with my product: we process each page and create lists of potential questions that the page would answer, and then embed that.

We also embed the actual text, though, because I found that only doing the questions resulted in inferior performance.


So in this case, what your workflow might look like is:

    1. Get text from page/section/chunk
    2. Generate possible questions related to the page/section/chunk
    3. Generate an embedding using { each possible question + page/section/chunk }
    4. Incoming question targets the embedding and matches against { question + source }
Is this roughly it? How many questions do you generate? Do you save a separate embedding for each question? Or just stuff all of the questions back with the page/section/chunk?


Right now I just throw the different questions together in a single embedding for a given chunk, with the idea that there’s enough dimensionality to capture them all. But I haven’t tested embedding each question, matching on that vector, and then returning the corresponding chunk. That seems like it’d be worth testing out.
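
For anyone curious, the two variants differ by only a couple of lines; a rough sketch (embed_text, gen_questions and the vector store are stand-ins for whatever you already use, not any particular library):

    def index_chunk(chunk, store, embed_text, gen_questions):
        questions = gen_questions(chunk)  # LLM call proposing questions the chunk answers
        # variant A: one embedding for all the questions plus the chunk together
        store.add(embed_text(" ".join(questions) + "\n" + chunk), payload=chunk)
        # variant B: one embedding per question, each pointing back to the chunk
        for q in questions:
            store.add(embed_text(q), payload=chunk)

    def retrieve(query, store, embed_text, k=5):
        # match the incoming question against the stored vectors, return the chunks
        return [hit.payload for hit in store.search(embed_text(query), k=k)]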


Don't forget that Gemini also has access to the internet, so a lot of RAGging becomes pointless anyway.


Internet search is a form of RAG, though. 10M tokens is very impressive, but you're not fitting a database, let alone the entire internet into a prompt anytime soon.


You shouldn't fit an entire database in the context anyway.

btw, 10M tokens is 78 times more context window than the newest GPT-4-turbo (128K). In a way, you don't need 78 GPT-4 API calls, only one batch call to Gemini 1.5.


I don't get this. Why do people think you need to put an entire database in the AI's short-term memory for it to be useful? When you work with a DB, are you memorizing the entire f*cking database? No, you know the summaries of it and how to access and use it.

People also seem to forget that the average person reads about 1B words in their entire LIFETIME, and 10M tokens with nearly 100% recall is pretty damn amazing; I'm pretty sure I don't have perfect recall of 10M words myself lol


You certainly don't need that much context for it to be useful, but it definitely opens up a LOT more possibilities without the compromises of implementing some type of RAG. In addition, don't we want our AI to have superhuman capabilities? The ability to work on 10M+ tokens of context at a time could enable superhuman performance in many tasks. Why stop at 10M tokens? Imagine if AI could work on 1B tokens of context like you said?


It increases the use cases.

It can also be a good alternative for fine-tuning.

And the use case of a code base is a good example: if the ai understands the whole context, it can do basically everything.

Let me pay 5€ to have an Android app rewritten for iOS.


Well it's nice, just sad nobody can use it


This may be useful in a generalized use case, but a problem is that many of those results again will add noise.

For any use case where you want contextual results, you need to be able to either filter the search scope or use RAG to pre-define the acceptable corpus.


> you need to be able to either filter the search scope or use RAG ...

Unless you can get nearly perfect recall with millions of tokens, which is the claim made here.


> The 10M context ability wipes out most RAG stack complexity immediately.

The video queries they show take around 1 minute each, this probably burns a ton of GPU. I appreciate how clearly they highlight that the video is sped up though, they're clearly trying to avoid repeating the "fake demo" fiasco from the original Gemini videos.


The YouTube video of the multimodal analysis of a video is insane. Imagine feeding in movies or TV shows and being able to auto-summarize them or find information about them dynamically. How the hell is all this possible already? AI is moving insanely fast.


> imagine feeding in movies or tv shows

Google themselves have such a huge footprint of various businesses, that they alone would be an amazing customer for this, never mind all the other cool opportunities from third parties...

Imagine that they can ingest the entirety of YouTube and then dump that into Google Search's index AND use it to generate training data for their next LLM.

Imagine that they can hook it up to your security cameras (Nest Cam), and then ask questions about what happened last night.

Imagine that you can ask Gemini how to do something (eg. fix appliance), and it can go and look up a YouTube video on how to accomplish that ask, and explain it to you.

Imagine that it can apply summarization and descriptions to every photo AND video in your personal Google Photos library. You can ask it to find a video of your son's first steps, or a graduation/diploma walk for your 3rd child (by name) and it can actually do that.

Imagine that Google Meet video calls can have the entire convo itself fed into an LLM (live?), instead of just a transcription. You can have an AI assistant there with you that can interject and discuss, based on both the audio and video feed.


I'd love to see that applied to the Google ecosystem, the question is - why haven't they already done this?


IMO, they aren't sure how to monetize it, Google is run by the ads team.

Problem is they are jeopardizing their moat.

Google is still in a great position, they have the knowledge and lots of data to pull this off. They just have to take the risk of losing some ad revenue for a while.


Well, they just announced publicly that the technology is available. Maybe its just too new to have been productized so far.


Is 10M token context correct? The blog post I see 1M but I'm not sure if these are different things

Edit: Ah, I see, it's 1M reliably in production, up to 10M in research:

> Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

> This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.


I know how I’m going to evaluate this model. Upload my codebase and ask it to “find all the bugs”.


How could one hour of video fit in 1M tokens? 1 hour at 30fps is 3600*30=100k frames. Each frame is converted in 256 tokens. So either they are not processing each frame, or each frame is converted into fewer tokens.


The model can probably perform fine at 1 frame per second (3600*256=921600 tokens), and they could probably use some sort of compression.


> 1. They don't talk about how they get to 10M token context

> 2. They don't talk about how they get to 10M token context

Yes. I wonder if they're using a "linear RNN" type of model like Linear Attention, Mamba, RWKV, etc.

Like Transformers with standard attention, these models train efficiently in parallel, but their compute is O(N) instead of O(N²), so in theory they can be extended to much longer sequences much more efficiently. They have shown a lot of promise recently at smaller model sizes.
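
For reference, the core trick in the linear-attention family is to swap the softmax for a kernel feature map, so the key/value summary is a small d x d matrix instead of an N x N score matrix. A toy numpy sketch (single head, no masking, relu+1 as a stand-in feature map):

    import numpy as np

    def softmax_attention(Q, K, V):
        # standard attention: the score matrix is (N, N), hence O(N^2)
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(-1, keepdims=True))
        return (w / w.sum(-1, keepdims=True)) @ V

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1):
        # kernelized attention: K^T V is only (d, d), hence O(N) in sequence length
        Qp, Kp = phi(Q), phi(K)
        KV = Kp.T @ V                               # (d, d) summary of keys/values
        Z = Qp @ Kp.sum(axis=0, keepdims=True).T    # (N, 1) normalizer
        return (Qp @ KV) / Z

    N, d = 8, 4
    Q, K, V = np.random.default_rng(0).normal(size=(3, N, d))
    print(linear_attention(Q, K, V).shape)          # (8, 4)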

Does anyone here have any insight or knowledge about the internals of Gemini 1.5?


The fact they are getting perfect recall with millions of tokens rules out any of the existing linear attention methods.


I wouldn't be so sure perfect recall rules out linear RNNs, because I haven't seen any conclusive data on their ability to recall. Have you?


They do give a hint:

"This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture."

One thing you could do with MoE is give each expert a different subset of the input tokens. And that would definitely do what they claim here: it would allow search. If you want to find where someone said "the password is X" in a 50-hour audio file, this would be perfect.

If your question is "what is the first AND last thing person X said" ... it's going to suck badly. Anything that requires taking 2 things into account that aren't right next to each other is just not going to work.


> Anything that requires taking 2 things into account that aren't right next to each other is just not going to work.

They kinda address that in the technical report[0]. On page 12 they show results from a "multiple needle in a haystack" evaluation.

https://storage.googleapis.com/deepmind-media/gemini/gemini_...


I would perhaps add that it's worrying. Every YouTube evaluation of this model in GCP AI Studio that I've seen has commented on its constant hallucinations.


> One thing you could do with MoE is giving each expert different subsets of the input tokens.

Don't MoE's route tokens to experts after the attention step? That wouldn't solve the n^2 issue the attention step has.

If you split the tokens before the attention step, that would mean those tokens would have no relationship to each other - it would be like inferring two prompts in parallel. That would defeat the point of a 10M context


Is MOE then basically divide and conquer? I have no deep knowledge of this so I assumed MOE was where each expert analyzed the problem in a different way and then there was some map-reduce like operation on the generated expert results. Kinda like random forest but for inference.


> I assumed MOE was where each expert analyzed the problem in a different way

Uh sorta but not like parent described at all. You have multiple "experts" and you have a routing layer(s) that decide which expert to send it to. Usually every token is sent to at least 2. You can't just send half the tokens to one expert and half to another.

Also the "experts" are not "domain experts" - there is not a "programming expert" and an "essay expert".


Regarding how they’re getting to 10M context, I think it’s possible they are using the new SAMBA architecture.

Here’s the paper: https://arxiv.org/abs/2312.00752

And here’s a great podcast episode on it: https://www.cognitiverevolution.ai/emergency-pod-mamba-memor...


As a Brazilian, I approve that choice. Vambora amigos!


Regarding the 10M tokens context, RingAttention has been shown [0] recently (by researchers, not ML engineers in a FAANG) to be able to scale to comparable (1M) context sizes (it does take work and a lot of GPUs).

[0]: https://news.ycombinator.com/item?id=39367141


> researchers, not ML engineers in a FAANG

Why did you point out this distinction?


It means they have significantly less means (to get a lot of GPUs letting them scale up in context length) and are likely less well-versed in optimization (which also helps with scaling up)[0].

I believe those two things together are likely enough to explain the difference between a 1M context length and a 10M context length.

[0]: Which is not looking down on that particular research team, the vast majority of people have less means and optimization know-how than Google.


Probably to indicate that its research and not productized?


Re RAG aren’t you ignoring the fact that no one wants to put confidential company data into such LLM’s. Private RAG infrastructure remains a need for the same reason that privacy of data of all sorts remains a need. Huge context solves the problem for large open source context material but that’s only part of the picture.


For #1 and #2 it is some version of mixture of experts. This is mentioned in the blog post. So each expert only sees a subset of the tokens.

I imagine they have some new way to route tokens to the experts that probably computes a global context. One scalable way to compute a global context is by a state space model. This would act as a controller and route the input tokens to the MoEs. This can be computed by convolution if you make some simplifying assumptions. They may also still use transformers as well.

I could be wrong but there are some Mamba-MoEs papers that explore this idea.
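
For anyone who wants the intuition behind "computed by convolution": for a linear time-invariant SSM the recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k unrolls into a 1-D convolution with kernel K[j] = C A^j B, so the whole sequence can be processed in parallel. A toy sketch (scalar input/output, tiny state, nothing like a production Mamba/S4 kernel):

    import numpy as np

    def ssm_kernel(A, B, C, L):
        # K[j] = C @ A^j @ B, the impulse response of the state-space recurrence
        K, Aj = [], np.eye(A.shape[0])
        for _ in range(L):
            K.append((C @ Aj @ B).item())
            Aj = Aj @ A
        return np.array(K)

    def ssm_as_convolution(u, A, B, C):
        # y[k] = sum_j K[j] * u[k-j]: the recurrence unrolled as a convolution
        return np.convolve(u, ssm_kernel(A, B, C, len(u)))[: len(u)]

    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    B = np.ones((2, 1)); C = np.ones((1, 2))
    print(ssm_as_convolution(np.arange(5.0), A, B, C))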



There will always be more data that could be relevant than fits in a context window, and especially for multi-turn conversations, huge contexts incur huge costs.

GPT-4 Turbo, using its full 128k context, costs around $1.28 per API call.

At that pricing, 1m tokens is $10, and 10m tokens is an eye-watering $100 per API call.

Of course prices will go down, but the price advantage of working with less will remain.


I don't see a problem with this pricing. At 1m tokens you can upload the whole proceedings of a trial and ask it to draw an analysis. Paying $10 for that sounds like a steal.


Unfortunately the whole context has to be reprocessed fully for each query, which means that if you "chat" with the model you'll incur that $10 fee for every interaction, which quickly adds up.

It may still be worth it for some use cases


Of course, if you get exactly the answer you want in the first reply.


While it's hard to say what's possible on the cutting edge, historically models tend to get dumber as the context size gets bigger. So you'd get a much more intelligent analysis of a 10,000-token excerpt of the trial than of a million-token complete transcript. I have not spent the money testing big token sizes in GPT-4 Turbo, but it would not surprise me if it gets dumber. Think of it this way: if the model is limited to 3,000-token replies and an analysis requires a more detailed response than 3,000 tokens, it cannot provide it; it'll just give you insufficient information. What it'll probably do is ignore parts of the trial transcript because it can't analyze all that information in 3,000 tokens. And asking a followup question is another million tokens.


Would the price really increase linearly? Isn't the demands on compute and memory increasing steeper than that as a function of context length?


RAG would still be useful for cost savings assuming they charge per token, plus I'm guessing using the full-context length would be slower than using RAG to get what you need for a smaller prompt


This is going to be the real differentiator.

HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.
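
The caching piece doesn't need anything exotic; a sketch with a plain content-hash cache (a bloom filter would only make the "have we embedded this chunk?" check cheaper and lossier; embed() stands in for whatever embedding call you use):

    import hashlib

    class EmbeddingCache:
        def __init__(self, embed):
            self.embed = embed   # placeholder for your embedding API
            self.seen = {}       # chunk hash -> embedding

        def get(self, chunk):
            key = hashlib.sha256(chunk.encode()).hexdigest()
            if key not in self.seen:          # only embed chunks we haven't seen
                self.seen[key] = self.embed(chunk)
            return self.seen[key]

    def embed_context(chunks, cache):
        # re-sending a mostly-unchanged context only pays for the new chunks
        return [cache.get(c) for c in chunks]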


Agreed with this.

The leading ability AI (in terms of cognitive power) will, generally, cost more per token than lower cognitive power AI.

That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?

And that's the entire field of information retrieval, which is the most important part of RAG.


> The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Really? Because to my understanding the compute necessary to generate a token grows linearly with the context, and doesn't the OpenAI billing reflect that by separating prompt and output tokens?


> The 10M context ability wipes out most RAG stack complexity immediately.

This may not be true. My experience is that the complexity of RAG lies in how to properly connect to various unstructured data sources and run a data transformation pipeline over large-scale data sets (meaning GB, TB or even PB). It's on the critical path rather than a "nice to have", because the quality of the data and the pipeline is a major factor in the final generated result. i.e., in RAG, the importance of R >>> G.


RE: RAG - they haven't released pricing, but if input tokens are priced at GPT-4 levels - $0.01/1K then sending 10M tokens will cost you $100.


In the announcements today they also halved the pricing of Gemini 1.0 Pro to $0.000125 / 1K characters, which is a quarter of GPT3.5 Turbo so it could potentially be a bit lower than GPT-4 pricing.


If you think the current APIs will stay that way, then you're right. But when they start offering dedicated chat instances or caching options, you could be back in the penny region.

You probably need a couple GB to cache a conversation. That's not so easy at the moment because you have to transfer that data to and from the GPUs and store the data somewhere.


The tokens need to be fed into the model along with the prompt and this takes time. Naive attention is O(N^2). They probably use at least flash attention, and likely something more exotic to their hardware.

You'll notice in their video [1] that they never show the prompts running interactively. This is for a roughly 800K context. They claim that "the model took around 60s to respond to each of these prompts".

This is not really usable as an interactive experience. I don't want to wait 1 minute for an answer each time I have a question.

[1] https://www.youtube.com/watch?v=SSnsmqIj1MI


GP's point is you can cache the state after the model processed the super long context but before it ingests your prompt.

If you are going to ask "then why doesn't OpenAI do it now", the answer is that it takes a lot of storage (and IO) so it may not be worth it for shorter contexts, it adds significant complexity to the entire serving stack, and it's at odds with where OpenAI originally imagined the "custom-ish" LLM serving game going: they bet on fine-tuning and dedicated instances instead of long context.

The tradeoff can be reflected in the API and pricing, LLM APIs don't have to be like OpenAI's. What if you have an endpoint to generate a "cache" of your context (or really, a prefix of your prompt), billed as usual per token, then you can use your prompt prefix for a fixed price no matter how long it is?
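
To make the idea concrete, this is roughly what prefix caching looks like with an open model via Hugging Face transformers; purely illustrative, nothing says Gemini or OpenAI serve things this way, and a real stack would persist and shard the cache rather than keep it in one process:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    long_context = "...the huge shared prefix: a book, a codebase, a transcript..."
    prefix_ids = tok(long_context, return_tensors="pt").input_ids
    with torch.no_grad():
        # pay the expensive prefix pass once and keep the key/value cache
        cached_kv = model(prefix_ids, use_cache=True).past_key_values

    question_ids = tok(" Q: what is the magic word? A:", return_tensors="pt").input_ids
    with torch.no_grad():
        # a follow-up only runs the new tokens, attending against the cached prefix
        out = model(question_ids, past_key_values=cached_kv, use_cache=True)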


Do you have examples of where this has been done? Based on my understanding you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you will still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).


> The 10M context ability wipes out most RAG stack complexity immediately.

RAG is needed for the same reason you don't `SELECT *` all of your queries.


> They don't talk about how they get to 10M token context

I don't know how either but maybe https://news.ycombinator.com/item?id=39367141

Anyway I mean, there is plenty of public research on this so it's probably just a matter of time for everyone else to catch up


Why do you think this specific variant (RingAttention)? There are so many different variants for this.

As far as I know, the problem in most cases is that while the context length might be high in theory, the actual ability to use it is still limited. E.g. recurrent networks even have infinite context, but they actually only use 10-20 frames as context (longer only in very specific settings; or maybe if you scale them up).


There are ways to test the neural network’s ability to recall from a very long sequence. For example, if you insert a random sentence like “X is Sam Altman” somewhere in the text, will the model be able to answer the question “Who is X?”, or maybe somewhat indirectly “Who is X (in another language)” or “Which sentence was inserted out of context?” “Which celebrity was mentioned in the text?”

Anyways the ability to generalize to longer context length is evidenced by such tests. If every token of the model’s output is able to answer questions in such a way that any sentence from the input would be taken into account, this gives evidence that the full context window indeed matters. Currently I find Claude 2 to perform very well on such tasks, so that sets my expectation of how a language model with an extremely long context window should look like.
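
That kind of probe is easy to script against any model you have access to; a minimal sketch (ask_model is a placeholder, the filler text is any long unrelated corpus):

    import random

    def needle_prompt(filler_sentences, needle, depth):
        # drop the needle at a chosen depth (0 = start, 1 = end) of unrelated text
        body = list(filler_sentences)
        body.insert(int(len(body) * depth), needle)
        return " ".join(body) + "\n\nWho is X? Answer with the name only."

    def needle_recall(filler_sentences, ask_model, trials=20):
        hits = 0
        for _ in range(trials):
            depth = random.random()   # vary where in the context the needle sits
            answer = ask_model(needle_prompt(filler_sentences, "X is Sam Altman.", depth))
            hits += "sam altman" in answer.lower()
        return hits / trials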


> The 10M context ability wipes out most RAG stack complexity immediately.

1. People mention accuracy issues with longer contexts.

2. People mention processing time issues with longer contexts.

3. Something people haven't mentioned in this thread is cost: even though prompt tokens are usually cheaper than generated tokens, and Gemini seems to be cheaper than GPT-4, putting a whole knowledge base or 80-page document in the context is going to make every run of that prompt quite expensive.


>The 10M context ability wipes out most RAG stack complexity immediately

From a technology standpoint, maybe. From an economics standpoint, it seems like it would be quite expensive to jam the entire corpus into every single prompt.


"I really want to know how they're getting to 10M context, though."

My $5 says it's a RAG or a similar technique (hierarchical RAG comes to mind), just like all other large context LLMs.


It takes 60 seconds to process all of that context in their three.js demo, which is, I will say, not super interactive. So there is still room for RAG and other faster alternatives to narrow the context.


This might be a stupid question - even if there's no quality degradation from 10M context, will it be extremely slow at inference?


>3. The 10M context ability wipes out most RAG stack complexity immediately.

I'd imagine RAG would still be much more efficient computationally


I assume using this large of a context window instead of RAG would mean the consumption of many orders of magnitude more GPU.


RAG doesn’t go away at 10 Million tokens if you do esoteric sources like shodan API queries.


Even 1m tokens eliminate the need for RAG, unless it is for cost.


1 million might sound like a lot, but it's only a few megabytes. I would want RAG, somehow, to be able to process gigabytes or terabytes of material in a streaming fashion.


RAG will not change how many tokens an LLM can produce at once.

Longer context, on the other hand, could put some RAG use cases to sleep: if your instructions are, like, literally a manual long, then there is no need for RAG.


I think RAG could be used to do that. If you have a one-time retrieval in the beginning, basically amending the prompt, then I agree with you. But there are projects (a classmate is doing his master's thesis on one implementation of this) that retrieve once every few tokens and make the retrieved information available to the generation somehow. That would not take a toll on the context window.


Or accuracy


I just hope at some point we get access to mostly uncensored models. Both GPT-4 and Gemini are extremely shackled, and a slightly inferior model that hasn’t been hobbled by a very restricting preprompt would handily outperform them.


You can customize the system prompt with ChatGPT or via the completions API, just fyi.


What's RAG?


Retrieval Augmented Generation. In basic terms, it optimizes output of LLMs by using additional external data sources before answering queries. (That actually might be too basic of a description)

Here:

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge...


Is it the same as embedding? Is embedding a RAG method?


I don't think so. I think embedding is just converting token string into its numeric representation. Numeric representations of semantically similar token strings are close geometrically.

RAG is training AI to be a guy who read a lot of books. He doesn't know all of them in the context of this conversation you are having with him, but he sort of remembers where he read about the thing you are talking about and he has a library behind him into which he can reach and cite what he read verbatim thus introducing it into the context of your conversation.

I might be wrong though. I'm a newb.


Retrieval augmented generation.

> Retrieval Augmented Generation (RAG) is a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting them into the LLM’s context window via a prompt.

(stolen from: https://github.com/psychic-api/rag-stack)
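
In code the whole idea is only a few lines; a bare-bones sketch with cosine similarity over numpy vectors (embed() and ask_llm() stand in for whatever embedding model and LLM you use):

    import numpy as np

    def build_index(docs, embed):
        return docs, np.array([embed(d) for d in docs])   # texts + their vectors

    def rag_answer(question, index, embed, ask_llm, k=3):
        docs, vecs = index
        q = embed(question)
        # cosine similarity between the question and every document
        sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        top = [docs[i] for i in np.argsort(sims)[-k:][::-1]]
        # "retrieval augmented": the retrieved text is pasted into the prompt
        prompt = "Context:\n" + "\n---\n".join(top) + f"\n\nQuestion: {question}"
        return ask_llm(prompt)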


> They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting

I fully disagree: they compare Gemini 1.5 Pro and GPT-4 only on context length; on other tasks they compare it only to other Gemini models, which is a strange self-own.

I'm convinced that if they do not show the results against GPT4/Claude, it is because they do not look good.


Wake me when I can get access without handing over my texts and contacts. I opened the Gemini app on Android and that onerous privacy policy was the first experience. Worse, I didn't seem able to move past accepting giving Google the ability to hoover up my data to disable that in the settings so I just gave up and went back to ChatGPT where I at least generally have control over the data I give it.


After their giant fib with the Gemini video a few weeks back I'm not believing anything til I see it used by actual people. I hope it's that much better than GPT-4, but I'm not holding my breath there isn't an asterisk or trick hiding somewhere.


How do you know it isn't RAG?


FYI, MM is the standard for million. 10MM not 10M I’m reading all these comments confused as heck why you are excited about 10M tokens


Maybe for accountants, but for everyone else a single M is much more common.


One interesting tidbit from the technical report:

>HumanEval is an industry standard open-source evaluation benchmark (Chen et al., 2021), but we found controlling for accidental leakage on webpages and open-source code repositories to be a non-trivial task, even with conservative filtering heuristics. An analysis of the test data leakage of Gemini 1.0 Ultra showed that continued pretraining on a dataset containing even a single epoch of the test split for HumanEval boosted scores from 74.4% to 89.0%, highlighting the danger of data contamination. We found that this sharp increase persisted even when examples were embedded in extraneous formats (e.g. JSON, HTML). We invite researchers assessing coding abilities of these models head-to-head to always maintain a small set of truly held-out test functions that are written in-house, thereby minimizing the risk of leakage. The Natural2Code benchmark, which we announced and used in the evaluation of Gemini 1.0 series of models, was created to fill this gap. It follows the exact same format of HumanEval but with a different set of prompts and tests.


Massive whoa if true from technical report

"Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens"

https://storage.googleapis.com/deepmind-media/gemini/gemini_...


10M tokens is absolutely jaw dropping. For reference, this is approximately thirty books of 500 pages each.

Having 99% retrieval is nuts too. Models tend to unwind pretty badly as the context (tokens) grows.

Put these together and you are getting into the territory of dumping all your company documents, or all your departments documents into a single GPT (or whatever google will call it) and everyone working with that. Wild.


Seems like Google caught up. Demis is again showing an incredible ability to lead a team to make groundbreaking work.


If any of this is remotely true, not only did it catch up, it’s wiping the floor with how useful it can be compared to GPT4. Not going to make a judgement until I can actually try it out though.


In the demo videos Gemini needs about a minute to answer long-context questions. Which is better than reading thousands of pages yourself. But if it has to compete with classical search and skimming, it might need some optimization.


Replacing grep or `ctrl+F` with Gemini would be the user's fault, not Gemini's. If classical search is already a performant solution for a job, use classical search. Save your tokens for jobs worthy of solving with a general intelligence!


I think some of the most useful apps will involve combining this level of AI with traditional algorithms. I've written lots of code using the OpenAI APIs and I look forward to seeing what can be done here. If you type, "How has management's approach to comp changed over the past five years?" it would be neat to see an app generate the greps needed to find the appropriate documents and then feed them back into the LLM to answer the question.
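
That loop is easy to prototype; a rough sketch (ask_llm is a placeholder for any chat-completion call, and the prompts are hand-waved):

    import subprocess

    def grep_then_answer(question, docs_dir, ask_llm, max_hits=50):
        # 1) have the model propose search terms for the question
        terms = ask_llm(f"Give 3 short grep patterns, one per line, for: {question}")
        hits = []
        for pattern in terms.splitlines():
            out = subprocess.run(["grep", "-rin", pattern.strip(), docs_dir],
                                 capture_output=True, text=True)
            hits += out.stdout.splitlines()[:max_hits]
        # 2) feed only the matching lines back to the model as context
        context = "\n".join(hits)
        return ask_llm(f"Using these excerpts:\n{context}\n\nAnswer: {question}")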


That’s a compute problem, something that involves just throwing money at the problem.


If you had this for your business could this approach be faster than RAG?

Input is parsed one token at a time right? Can you cache the state after the initial prompt has been provided?


Another whoa for me

>Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

Results - https://imgur.com/a/qXcVNOM


I think this is mostly due to the ability to handle long contexts better. Note how Claude 2.1 already far outperforms GPT-4 on this task.


GPT-4V turbo outperforms Claude on long contexts, IIRC. Unless that's mistaken, I'd suspect a different explanation for that task.


Did you watch the video of the Gemini 1.5 video recall after it processed the 44 minute video... holy shit


So, will this outperform any RAG approach as long as the data fits inside the context window?


A perfect RAG system would probably outperform everything in a larger context due to prompt dilution, but in the real world putting everything in context will win a lot of the time. The large context system will also almost certainly be more usable due to elimination of retrieval latency. The large context system might lose on price/performance though.


Whether it outperforms depends on the RAG approach (and this would be a RAG approach anyway; you can already do this with smaller context sizes). A simplistic one, probably, but dumping in data that you don't need dilutes the useful information, so I would imagine there would be at least _some_ degradation.

But there is also the downside that by "tuning" the RAG to return fewer tokens, you will miss extra context that could be useful to the model.


Doesn't their needle/haystack benchmark seem to suggest there is almost no dilution? They pushed that demo out to 10M tokens.


Cost would still be a big concern


Are you going to upload 10M tokens to Gemini on every request? That's a lot of data moving around when the user is expecting a near-realtime response. It seems like it would still be better to only set the context with information relevant to the user's prompt, which is what plain RAG does.


Basically, yes. Pinecone? Dead. Azure AI Search? Dead. Qdrant? Dead.


Prompt token cost is still a variable.


Could you (or someone) explain what this means?


It's how much text the model can consider at a time when generating a response; basically the size of the prompt. A token is not quite a word, but you can think of it as roughly that. Previously, the best most LLMs could do was around 32K. This new model does 1M, and in testing they could push it up to 10M with near-perfect retrieval.

As the other comment mentions, you can paste in the content of entire books or documents and ask very pointed questions about it. Last year, Anthropic was showing off their 100K context window, and that's exactly what they did: they gave it the content of The Great Gatsby and asked it questions about specific lines of the book.

Similarly, imagine giving it hundreds of documents and asking it to spot some specific detail in there.
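
If you want a feel for what a "token" is in practice, here's a tiny sketch using the tiktoken library (the filename is just a placeholder; the word/token ratio is a rough rule of thumb for English):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4")
    with open("great_gatsby.txt") as f:  # any long plain-text document
        text = f.read()

    tokens = enc.encode(text)
    print(f"{len(text.split())} words -> {len(tokens)} tokens")
    # A ~50k-word novel typically lands somewhere around 60-70k tokens: fine for a
    # 100K window, a rounding error for 1M, and nothing at all for 10M.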


Great explanation. I was amazed when I started using Claude because I could find a recently-transcribed novella, upload it, and ask specific questions. I'm downright giddy to try a 1M+ model.


Awesome explanation, thanks for the comparison


The input you give it can be very long. This can qualitatively change the experience. Imagine, for example, copy-pasting the entire Lord of the Rings plus another 100 books you like and asking it to write a similar book...


I just googled it, and the LOTR trilogy apparently has a total of 480,000 words, which brings home how huge 1M is! It'd be fascinating to see how well Gemini could summarize the plot or reason about it.

One point I'm unclear on is how these huge context sizes are implemented by the various models. Are any of them the actual raw "width of the model" that is propagated through it, or are these all hierarchical summarization and chunk embedding index lookup type tricks?


For another reference, Shakespeare’s complete works are ~885k words.

The Encyclopedia Britannica is ~44M words.


Reading Lord of the Rings, and writing a quality book in the same style, are almost wholly unrelated tasks. Over 150 million copies of Lord of the Rings have been sold, but few readers are capable of "writing a similar book" in terms of quality. There's no reason to think this would work well.


I mean, Terry Brooks did it with the Sword of Shannara. (/s)


I doubt it’s smart enough to write another (coherent, good) book based on 103 books. But you could ask it questions about the books and it would search and synthesize good answers.


Until I can talk to it, I care exactly zero.


you can buy their stock if you think they'll make a lot of money with their tech


Well, that's really the right question: what can, and will, Google do with this that can move their corporate earnings needle in a meaningful way? Obviously they can sell API access and integrate it into their Google Docs suite, as well as their new Project IDX IDE, but do any of these have the potential to make a meaningful impact?

It's also not obvious how these huge models will fare against increasingly capable open source ones like Mixtral, perhaps especially since Google are confirming here that MoE is the path forward, which perhaps helps limit how big these models need to be.


In the long run it could move the needle in enterprise market share of Workspace and GCP. They have a lot of room to grow and IMO have a far superior product to O365/Azure which could be exacerbated by strong AI products. Only problem is this sales cycle can take a decade or more, and Google hasn’t historically been patient or strategic about things like this.


Zero trust in what they put out until I see it live. After the last "launch" video, which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form.


Essentially, the focus seems to be on leveraging the media buzz around Gemini 1.0 by highlighting the development of version 1.5. While GPT-4's position relative to Gemini 1.5 remains unclear, and the specifics of ChatGPT 4.5 are yet to be disclosed, it's worth noting that no official release has taken place until the functionality is directly accessible in user chats.

Google appears to be making strides in catching up.

When it comes to my personal workflow and accomplishing tasks, I still find ChatGPT to be the most effective tool. My familiarity with its features has made it indispensable. The integration of mentions and tailored GPTs seamlessly enhances my workflow.

While Gemini may match the foundational capabilities of LLMs, it falls short in delivering a product that efficiently aids in task completion.


I don't mean this in a bad way, but when I read a comment like yours which includes phrases like "seamlessly enhances my workflow" and "efficiently aids in task completion", I can't help but feel like it's ChatGPT-generated, and if so I think it's a shame, just write like yourself.

But maybe you do, and I am seeing patterns in sand.


Not OP, but if I write as myself, I don't think you'd understand me just like that ;)


I understand it just fine ;)


#ikook


> Google appears to be making strides in catching up.

I say it's even more than that. OpenAI had a bigger lead when it released GPT-2 than it does now. They're burning through cash to try to hold on to a lead of a few months over the competition.


The videos shown in these demos clearly reflect lessons learnt from that: they're using a real live product, filmed on their computers, with timers at the bottom showing how long the computations take.


I completely share the same views as you after their last video - and it appears that they've learnt their lesson this time.

If you watch the videos in the blog post, you can see it's a screen recording on a computer without any editing/stitching of different scenes together.

It's good to be sceptical but as engineers we should all remain open.


100%. Google continues to underwhelm. Not buying it until I can try it.


Personally, I've given up on Gemini, as it seems to have been censored to the point of uselessness. I asked it yesterday [0] about C++ 20 Concepts, and it refused to give actual code because I'm under 18 (I'm 17, and AFAIK that's what the age on my Google account is set to). I just checked again, and it gave a similar answer [1]. When I tried ChatGPT 3.5, it did give an answer, although it was a little confused, and the code wasn't completely correct.

This seems to be a common experience, as apparently it refuses to give advice on copying memory in C# [2], and I tried to do what was suggested in this comment [3], but by the next prompt it was refusing again, so I had to stick to ChatGPT.

[0] https://g.co/gemini/share/238032386438

[1] https://g.co/gemini/share/6880989ddfaf

[2] https://news.ycombinator.com/item?id=39312896

[3] https://news.ycombinator.com/item?id=39313567


I'm with Gemini on this one, 17 years old is too young to be learning about unsafe C++ features, best stick with a memory safe language until you're old enough


Luckily I only really muck about with C++, and mainly use Rust, so my childlike brain is protected :^)


From your first link [0]

> Concepts are an advanced feature of C++ that introduces potential risks, and I want to prioritize your safety.

Brilliant.


I've gotten similar pushback. I asked for a solution involving curl and was told that that was not something Gemini could do. Then I clicked the button to see the other drafts of Gemini's responses and got two versions that worked.


If you can dig in further - prompt-engineer out the prompt for minors - that would be fascinating to report.


I'll see what I can do, but I don't expect I'll achieve much.


Well that was easy [0]. The second draft also revealed something about the prompt [1]. It seems that the more verbose the prompt is, the less it wants to reveal [2] [3]. Unfortunately I don’t seem to be able to get it to reveal anything more, and it refuses to cooperate with me when I try the ‘Developer Mode’ prompt which used to work on ChatGPT [4].

[0] https://g.co/gemini/share/fa9d60da921d

[1] https://g.co/gemini/share/e20655d06292

[2] https://g.co/gemini/share/f11bc9f7e658

[3] https://g.co/gemini/share/c04c933f838b

[4] https://news.ycombinator.com/item?id=34974048


>Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

Results - https://imgur.com/a/qXcVNOM

From the technical report https://storage.googleapis.com/deepmind-media/gemini/gemini_...


what if we ask it to translate an undeciphered language


Author of the Kalamang paper here. We’ve thought about this a good amount (e.g. there are interesting Mesoamerican scripts), but ultimately we decided to work on low-resource languages, as they’re much more useful. It’s also possible to use them as an evaluation benchmark, which isn’t really possible if nobody speaks the language. We’d like to expand the scope beyond Kalamang, though, and maybe at some point we’ll investigate an undeciphered language as well.


It produces basically random translations. This is covered in the 0-shot case where no translation manual was included in the context. Due to how rare this language is, it’s essentially untranslated in the training corpus.


If you mean to dump random passages of text with no parallel corpora or grammar instructions then it won't do better than random.

That said, if you gave an LLM text in that language to predict during training, I believe it could still translate that language into some other language it also trained on, even if no parallel corpora exist in the training data.


What if we added a bunch of linguistic analysis books or something


> at a similar level to a person learning from the same content.

That's an incredibly low bar


:muffled sounds of goalposts being shifted in the distance:

Just a few years ago we used to clap if an NLP model could handle negation reliably or could generate even a paragraph of text in English that was natural sounding.

Now we are at a stage where it is basically producing reams of natural-sounding text, performing surprisingly well on reasoning problems and on translation of languages with barely any data despite being a Markov chain on steroids, and what does it hear? "That's an incredibly low bar".


I'm going to keep beating this dead horse, but if you were a philosophy nerd in the 80s, 90s, 00s etc you may know that debates RAGED over whether computers could ever, even in principle do things that are now being accomplished on a weekly basis.

And as you say, the goalposts keep getting moved. It used to be claimed that computers could never play chess at the highest levels because that required "insight". And whatever a computer could do, it could never do that extra special thing, that could only be described in magical undefined terms.

I just hope there's a moment of reckoning for decades upon decades of arguments, deemed academically respectable, that insisted that days like these would never come.


Forget goalpost shifting, people frequently refuse to admit that it can do things that it obviously does, because they've never used it themselves.


Listen, you little ...


Honestly. I am ok with having greater and greater goals to accomplish but this sort of dismissive attitude really puts me off.


It's incredible how fast goalposts are moving.

The same feat one year ago would have been almost unbelievable.


Author here (of the Kalamang paper). One of my coauthors was the human baseline and he spent many months reading the grammar book (and is incredibly talented at learning languages). It’s really a very high bar.


> The author (the human learner) has some formal experience in linguistics and has studied a variety of languages both formally and informally, though no Austronesian or Papuan languages

From the language benchmark (parentheses mine).


Since when are we expecting super-human capabilities?


And in fact it already is super human. Show me a single human who can translate amongst 10+ languages across specialized domains in the blink of an eye.


ChatGPT has been superhuman in a lot of tasks ever since 3.5.

People point out mistakes it makes that no human would make, but that doesn't negate the super-human performance it has at other tasks -- and the _breadth_ of what it can do is far beyond any single person.


Where exactly does it have super-human performance? Above average and expert-level? Sure, I'd agree, but I haven't experienced anything above that.


Indeed, or a human who can analyze a hundred-page text document in less than a minute and provide answers in less than a second.

The issue remains accuracy. I think a human in that scenario is still more accurate with their responses, and I do not yet see that being overcome in this multi-year LLM battle.


The model does already have superhuman ability by knowing hundreds of languages


It's jarring that you're not adding more context to your comment.


you are insane if you actually think this.


I've always been suspicious of any announcement from Demis Hassabis since way back in his video game days, when he did a monthly article in Edge magazine about the game he was developing. "Infinite Polygons" became a running joke in the industry because of his obvious snake oil. The game itself, Republic [1], was an uninteresting failure.

He learned how to promote himself from working for Peter "Project Milo" Molyneux and I see similar patterns of hype.

[1] https://en.wikipedia.org/wiki/Republic:_The_Revolution#Marke...


Funny read about his game.

Nonetheless, while Gemini is still underwhelming in comparison to GPT-4 (excluding this announcement, as I haven't tried it yet), AlphaGo, AlphaZero, and especially AlphaFold were tremendous!


And yet - AlphaGo, AlphaZero, AlphaFold...


Yeah, it's funny. I used to think "Demis Hassabis...where have I heard that name before?" And then I realized I saw him in the manuals for old Bullfrog games.


The line between delusional and visionary is thin! I know I'm too grounded in "expected value" math to do super outlier stuff like starting a video game company...


10M tokens is an absolute game changer, especially if there's no noticeable decay in quality with prompt size. We're going to see things like entire domain specific languages embedded in prompts. IMO people will start thinking of the prompt itself as a sort of runtime rather than a static input.

Back when OpenAI still supported raw text completion with text-davinci-003 I spent some time experimenting with tiny prompt-embedded DSLs. The results were very, very, interesting IMO. In a lot of ways, text-davinci-003 with embedded functions still feels to me like the "smartest" language model I've ever interacted with.
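
For anyone curious, here's a stripped-down sketch of what I mean by a prompt-embedded DSL; the function names are made up, and the model itself acts as the interpreter:

    # The whole "program" lives in the prompt; the model is the interpreter.
    DSL_PROMPT = """You can evaluate these functions:
      tldr(x)      -> a one-sentence summary of x
      emph(x)      -> x rewritten with stronger emphasis
      merge(a, b)  -> a and b combined into one coherent paragraph

    Evaluate the program below and output only the final value of `result`:

    a = tldr(<<DOC_A>>)
    b = emph(<<DOC_B>>)
    result = merge(a, b)
    """

    def build_prompt(doc_a: str, doc_b: str) -> str:
        return DSL_PROMPT.replace("<<DOC_A>>", doc_a).replace("<<DOC_B>>", doc_b)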

I'm not sure how close we are to "superintelligence" but for baseline general intelligence we very well could have already made the prerequisite technological breakthroughs.


It's pretty slow, though looks like up to 60 seconds for some of the answers, and uses god knows how much compute, so there's probably going to be some trade offs -- you're going to want to make sure that that much context is actually useful for what you want.


TBF: when talking about the first "superintelligence", I'd expect it to take unreasonable amounts of compute and/or be slow -- that can always be optimized. Bringing it into existence in the first place is the hardest part.


Yea. Of course for some tasks we need speed, but i've been kinda surprised that we haven't seen very slow models which perform far better than faster models. We're treading new territory, and everyone seems to make models that are "fast enough".

I wanna see how far this tech can scale, regardless of speed. I don't care if it takes 24h to formulate a response. Are there "easy" variables which drastically improve output?

I suspect not. I imagine people have tried that. Though i'm still curious as to why.


I think the problem is that 24 hours of compute to run a response would be incredibly expensive. I mean hell how would that even be trained.


I gotta say, I've been trying out Gemini recently and it's embarrassingly bad. I can't take anything Google puts out seriously when their current offerings are so, so much worse than ChatGPT (or even local Llama!).

As a particularly egregious example, yesterday night I gave Gemini a list of drinks and other cocktail ingredients I had laying around and asked for some recommendations for cute drinks that I could make. It's response:

> I'm just a language model, so I can't help you with that.

ChatGPT 3.5 came up with several delicious options with clear instructions, but it's not just this instance, I've NEVER gotten a response from Gemini that I even felt was more useful than just a freaking bing search! Much less better than ChatGPT. I'm just going to assume they're using cherrypicked metrics to make themselves feel better until proven otherwise. I have zero confidence in Google's AI plays, and I assume all their competent talent is now at OpenAI or Anthropic.


I don't think "I'm just a language model, I can't help you with that" comes from Gemini. Google has a separate censorship model that blocks you from receiving Gemini's response in certain situations.

When Gemini (Ultra) refuses to do something itself, it is more verbose and specific as to why it won't do it, in my experience.


My experiences are similar, but I think we are talking about the Gemini free model, available on the Google Gemini website. I think the rest of the comments are saying the paid versions (Pro / Ultra) are significantly better, though I haven't tested it myself to compare.


I have the two-month trial for the paid version, and find myself going back to free ChatGPT often. Gemini loves to put everything in bullet-point lists and short paragraphs with subheadings, for example even when asked for a letter. I'm not a heavy user, but it often seems to not quite get what I want. Not important but annoying: it starts almost every answer with "Absolutely!", even when that doesn't match the question (e.g. "How does x work?").


If I understand correctly, they're releasing this for Pro but not Ultra, which I think is akin to GPT 3.5 vs 4? Sigh, the naming is confusing...

But my main takeaway is the huge context window! Up to a million, with more than 100k tokens right now? Even just GPT 3.5 level prediction with such a huge context window opens up a lot of interesting capabilities. RAG can be super powerful with that much to work with.


It's sizes

Nano/Pro/Ultra are model SIZES. 1.0/1.5 are generations of the architecture.


The announcement suggests that 1.5 Pro is similar to 1.0 Ultra.


I am reaching a bit; however, I think it's a bit of a marketing technique. Comparing Pro 1.5 to the Ultra 1.0 model seems to imply that they will be releasing an Ultra 1.5 model, which will presumably have similar characteristics to the new Pro 1.5 model (MoE architecture w/ a huge context window).


Apparently the technical report implies that Ultra 1.5 is a step up again. I'm not sure it's just context length; that seems to be orthogonal in everything I've read so far.


Maybe this analogy would help: iPhone 15, iPhone Pro 15, iPhone Pro Max 15 and then iPhone Pro 15.5


So Pro and Ultra are, from my understanding, linked to the number of parameters. More parameters means more reasoning capability, but more compute needed.

So Pro is like the light and fast version and Ultra the advanced and expensive one.


I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).

Wouldn't the transcript, or at least a timeline, of Apollo 11 be part of the training corpus? So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.


Gemini is named that way because of the collaboration between Google Brain and DeepMind.


Gemini is named after the spacecraft that put the second person into orbit - pretty aptly named, but not sure if this was the intention.


The second person was put by MR-3 (Mercury, not Gemini) https://en.m.wikipedia.org/wiki/Timeline_of_space_travel_by_...


Google needs their Apollo.


Correct except that it spits out the timestamp


I asked ChatGPT-4 to identify three humorous moments in the Apollo 11 transcript and it hallucinated all three of them (I think -- I can't find what it's referring to). Presumably it's in its corpus, too.

> The "Snoopy" Moment: During the mission, the crew had a small, black-and-white cartoon Snoopy doll as a semi-official mascot, representing safety and mission success. At one point, Collins joked about "Snoopy" floating into his view in the spacecraft, which was a light moment reflecting the camaraderie and the use of humor to ease the intense focus required for their mission.

The "Biohazard" Joke: After the successful moon landing and upon preparing for re-entry into Earth's atmosphere, the crew humorously discussed among themselves the potential of being quarantined back on Earth due to unknown lunar pathogens. They joked about the extensive debriefing they'd have to go through and the possibility of being a biohazard. This was a light-hearted take on the serious precautions NASA was taking to prevent the hypothetical contamination of Earth with lunar microbes.

The "Mailbox" Comment: In the midst of their groundbreaking mission, there was an exchange where one of the astronauts joked about expecting to find a mailbox on the Moon, or asking where they should leave a package, playing on the surreal experience of being on the lunar surface, far from the ordinary elements of Earthly life. This comment highlighted the astronauts' ability to find humor in the extraordinary circumstances of their journey.


The context window size - if it really works as advertised - is pretty ground-breaking. It would remove the need to RAG or fine-tune for one-off (or few-off) analys{is,es} of input streams, making them cheaper and faster. I wonder how they got past the input-token-stuffing problems everyone else runs into.


They are almost certainly using some form of sparse attention. If you linearize the attention operation, you can scale up to around 1-10M tokens depending on hardware before hitting memory constraints. Linearization works off the assumption that for a subsequence of X tokens out of M tokens, where M is much greater than X, there are likely only K tokens which are useful for the attention operation.

There are a bunch of techniques to do this, but it's unclear how well any of them scale.
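
To make the "only K tokens matter" intuition concrete, here's a toy top-k sparse attention sketch in numpy. Note it still materializes the full score matrix, purely for readability, which is exactly what a real long-context implementation has to avoid; the actual block/routing schemes vary by paper.

    import numpy as np

    def topk_sparse_attention(Q, K, V, k=64):
        # Q, K, V: (seq_len, d) arrays, with k <= seq_len.
        # Each query only attends to its k highest-scoring keys.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        kth_best = np.partition(scores, -k, axis=-1)[:, -k][:, None]
        masked = np.where(scores >= kth_best, scores, -np.inf)  # drop all but the top k per query
        weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V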


Not "almost", but certainly. Dense attention is quadratic, not even Google would be able to run it at an acceptable speed. Their model is not recurrent - they did not have the time yet (or resources - believe it or not, Google of 2023-24 is very compute constrained) to train newer SSM or recurrent based models at practical parameter counts. Then there's the fact that those models are far harder to train due to instabilities, which is one of the reasons why you don't yet see FOSS recurrent/SSM models that are SOTA at their size or tokens/sec. With sparse attention, however, long context recall will be far from perfect, and the longer the context the worse the recall. That's better than no recall at all (as in a fully dense attention model which will simply lop off the preceding parts of the conversation), but not by a hell of a lot.


maybe they are using ring attention, on top of their 128k model.


More likely some clever take on RAG. There’s no way that 1M context is all available at all times. More likely parts of it are retrievable on demand. Hence the retrieval-like use cases you see in the demos. The goal is to find a thing, not to find patterns at a distance


could be true, we can only speculate.


RAG will stick around; at some point you want to retrieve grounded information samples to inject into the context window. RAG + long context just gives you more room for grounded context.

Think building huge relevant context on topics before answering.


Tbh, I haven't read the paper, but I think it's pretty self-evident that large contexts aren't cheap - the AI has to comb through every word of the context for each successive generated token at least once, so it's going to be at least linear.
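
For rough intuition, with plain dense attention and a KV cache: prefilling an N-token prompt costs O(N^2) query-key scores, and each generated token after that costs O(N). At N = 10,000,000 the prefill term alone is on the order of 10^14 scores per layer per head, which is why everyone assumes something sub-quadratic is going on.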


vs RAG: RAG is good for searching across >billions of tokens and providing up-to-date information to a static model. Even with huge context lengths it's a good idea to submit high quality inputs to prevent the model from going off on tangents, getting stuck on contradictory information, etc..

vs fine tuning: smaller, fine-tuned models can perform better than huge models in a decent number of tasks. Not strictly fine-tuning, but for throughput limited tasks it'll likely still be better to prune a 70B model down to 2B, keeping only the components you need for accurate inference.

I can see this model being good for taking huge inputs and compressing them down for smaller models to use.


It won't remove the use of RAG at all. That's like saying, "wow, now that I've upgraded my 128GB HDD to 1TB, I'll never run out of space again."


It's more like saying "I've upgraded to 128GB of RAM, I'll never use my disk again".


10 TB for an accurate proportion.

And I think people who buy a laptop with a 1TB SSD generally don't run out of space, at least I don't.


Saw testing earlier that suggested the context does indeed work right


This is the first time I've been legitimately impressed by one of Google's LLMs (with the obvious caveat that I'm taking the results reported in their tech report at face value).


It’s just marketing at this point, nothing to be impressed by. It’s a mistake to take it at face value.


I remember one of the biggest advantages with Google Bard was the heavily limited context window. I am glad Google is now actually delivering some exciting news now with Gemini and this gigantic token size.

Sure, it's a bummer that they slap on the "Join the waiting list", but it's still interesting to read about their progress and how they're competing with ClosedAI (OpenAI).

One last thing I hope they fix is the heavy moral and ethical guardrails; sometimes I can barely ask proper questions without triggering Gemini to educate me about what's right and wrong. And when I try the same prompt with ChatGPT and Bing AI, they happily answer.


"biggest advantages with Google Bard"

Did you mean disadvantages?


Yes, thanks.


I see a lot of talk about retrieval over long context. Some even think this replaces RAG.

I don't care if the model can tell me which page in the book or which code file has a particular concept. RAG already does this. I want the model to notice how a concept is distributed throughout a text, and be able to connect, compare, contrast, synthesize, and understand all the ways that a book touches on a theme, or to rewrite multiple code files in one pass, without introducing bugs.

How does Gemini 1.5's reasoning compare to GPT-4? GPT-4 already has superhuman memory; its bottleneck is its relatively weak reasoning.


In my experience (I work mostly and deeply with Bard/Gemini), the reasoning capability of Gemini is quite good. Gemini Pro is already much better than ChatGPT 3.5, but they still make quite a few mistakes along the way. What is more worrying is that when these models make mistakes, they try really hard to justify their reasoning (errors), practically misleading the users. Because of their high mimicry ability, users really have to pay attention to validate and eventually spot the errors. Of course, this is still far below the human level, so I'm not sure whether they add value or are more of a burden.


The most impressive demonstration of long context is this in my opinion,

https://imgur.com/a/qXcVNOM

Testing language translation abilities of an extremely obscure language after passing in one grammar book as context.


Can anyone explain how context length is tested? Do they prompt something like:

"Remember val="XXXX" .........10M tokens later....... Print val"


Yep that's pretty much it! That's what they call needle in a haystack. See: https://github.com/gkamradt/LLMTest_NeedleInAHaystack
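
A stripped-down version of such a test case looks something like this (the needle wording, depth parameter, and scoring are all illustrative):

    def build_needle_prompt(filler_paragraphs, depth=0.5,
                            needle="The magic number is 42417."):
        # Drop the needle at a relative depth inside otherwise irrelevant filler,
        # then ask the model to retrieve it.
        cut = int(len(filler_paragraphs) * depth)
        doc = filler_paragraphs[:cut] + [needle] + filler_paragraphs[cut:]
        question = "What is the magic number mentioned above? Answer with the number only."
        return "\n\n".join(doc) + "\n\n" + question

    def passed(model_answer: str) -> bool:
        return "42417" in model_answer

    # Sweep the needle depth and the total amount of filler to map out where
    # (and at what context length) retrieval starts to fail.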


Yep, they hide things throughout the prompt and then ask it about that specific thing. Imagine hiding passwords in a giant block of text and then asking, 10 million tokens later, "What was Bob's password?"

According to this it's remembering with 99% accuracy, which if you think about it is NUTS. Can you imagine reading 22 1,000-page books and remembering nearly every single word that was said? lol


Interestingly, there's a decent chance I'd remember if there was an out of context passage saying "the password is FooBar". I wonder if it would be better to test with minor edits? E.g., "what color shirt was X wearing when..."


I think instead you could just do a full doc of relationships. "Tina and Chris have five children named ..."

Then you can ask it who is Tina's (great)^57 grandmother's twice removed cousin on her father's side?

It would have to be able to remember the context of the relationships up and down the document and there'd be nothing to key into as you could ask about any relationship.
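
Something like this would do it; the names and phrasing are made up, but the point is that the answer depends on facts scattered across the whole document rather than on one out-of-place needle:

    import random, string

    def build_lineage_test(generations=60, seed=0):
        rng = random.Random(seed)
        names = ["".join(rng.choices(string.ascii_uppercase, k=6)) for _ in range(generations)]
        # One fact per generation: names[i] is the mother of names[i+1].
        facts = [f"{names[i]} is the mother of {names[i + 1]}." for i in range(generations - 1)]
        rng.shuffle(facts)  # scatter the facts so document order gives nothing away
        greats = generations - 3  # mother = 1 hop, grandmother = 2, each "great-" adds one
        question = f"Who is the {'great-' * greats}grandmother of {names[-1]}?"
        return "\n".join(facts) + "\n\n" + question, names[0]  # names[0] is the expected answer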


I feel you would recognise that more as a quirk of how humans think; remember that LLMs think fundamentally differently to you and me. I would be curious about someone making a benchmark like that and using it to compare as an experiment, however.


I'm not trying to anthropomorphize the model, but it's not hard to imagine that a model would attribute significance to something completely out of context, and hence "focus" on it when computing attention.

Another possible synthetic benchmark would be to present a list of key value pairs and then ask it for the value corresponding to different keys. Or present a long list of distinct facts and then ask it about them. This latter one could probably be sourced from something like a trivia question and answers data set. I bet there's something like that from Jeopardy.


Yep, that’s actually a common one


Very simplified: there are arrays (matrices) of length 10M inside the model.

It’s difficult to make that array longer because training time explodes.


For reference, here is the technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_...


The long context length is of course incredible, but I'm more shocked that the Pro model is now on par with Ultra (~GPT-4, at least the original release). That implies when they release 1.5 Ultra, we'll finally have a GPT-4 killer. And assuming that 1.5 Pro is priced similarly to the current Pro, that's a 4x price advantage per-token.

Not surprising that OpenAI shipped a blog post today about their video generation — I think they're feeling considerable heat right now.


Gemini 1 Ultra was also said to be on par with ChatGPT 4 and it's not really there so let's see for ourselves when we can get our hands on it.


Ultra benchmarked around the original release of GPT-4, not the current model. My understanding is that was fairly accurate — it's close to current GPT-4 but not quite equal. However, close-to-GPT-4 but 4x cheaper and 10x context length would be very impressive and IMO useful.


No, it benchmarked around the original release of GPT-4 given 32 attempts versus GPT-4's 5.


Feeling the heat? Did you actually watch the videos? That was a huge leap forward compared to anything existing at the moment. Orders of magnitude away from a blog post discussing a model that maybe will finally be on par with GPT-4...


The OpenAI announcement is also more or less a blog post, isn't it?

Do we know how much time or money it takes to create a movie clip?


There was Sam Altman taking live prompt requests on Twitter and generating videos. They were not the same quality as some of the ones on the website, but they were still incredibly impressive.


And how much compute were those requests using?


Looks interesting enough that I wanted to give Gemini a try and join the waitlist.

And I thought it would be easy; what a rookie mistake.

Looks like "France" isn't on the list of available regions for AI Studio?

Now I'm trying to use Vertex AI; I'm not even sure what the difference with AI Studio is, but it seems to be available.

So far I've been struggling for 15 minutes through a maze of Google Cloud pages: console, docs, signups. No end in sight; looks like I won't be able to try it out.


It's not available outside of a private preview yet. The page says you can use 1.0 Ultra in Vertex, but it's not available to me in the UK.

I can't get on the waitlist, because the waitlist link redirects to aistudio and I can't use that.

I should stop expecting that I can use literally anything google announces.



I'd love to know how much a 1 million token prompt is likely to cost - both in terms of cash and in terms of raw energy usage.


Cannot emphasize this enough: even with the improvements in context handling, I imagine 128k tokens costs as much as 16k tokens did previously.

So 1M tokens is going to be astronomical.


When you account for this, you have to consider how much it would cost to have a human perform the same task.


> This new generation also delivers a breakthrough in long-context understanding. We’ve been able to significantly increase the amount of information our models can process — running up to 1 million tokens consistently, achieving the longest context window of any large-scale foundation model yet.

Sweet, this opens up so many possibilities.


>"Gemini 1.5 Pro (...) matches or surpasses Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of benchmarks."

So Pro is better than Ultra, but only if the version numbers are higher?


Isn't that usually the case with many products? Like the M3 Pro CPU in the new Macs is more powerful than the M1 Max in the old Macs.

The Nano < Pro < Ultra is an in-revision thing. For their LLMs it's a size thing. Then there's newer releases of Nano, Pro, and Ultra. Some Pro might be better than some older Ultra.

A lot of people seem confused about this but it feels so easy to understand that it's confusing to me that anyone could have trouble.


Apple didn't release the M3 Pro a week after the M1 Max


Adam Osborne’s wife was one of my dad’s patients so I’m not unacquainted with the risk of early announcements. But surely they do not prevent comprehension.


Yes, but you'd have to wait for Gemini Pro Max next year to see the real improvements


This got me trying Gemini, but doing so is such a hassle that I'm almost ready to give up. Trying out ChatGPT is as simple as signing up (either for Pro, or the API) and getting a single API key.

Google requires me to navigate their absolutely insane console (seriously, I thought the AWS console was bad, but GCP takes the cake), only to tell me there is not even a way to get an API key... I had to ask Gemini through the built in interface to figure that out.


https://aistudio.google.com/

Unfortunately there's a waitlist for the 1.5 architecture


That, and I found somewhere it wants to use all my data for training, which was a non-starter. Unfortunately I couldn’t find that page any more after the first time I found it.


That just redirects to an FAQ page with instructions like "If you get X error, then..."

There is no error, it just redirects me.

Fail.


Weird, it should go to a login

try: https://aistudio.google.com/app



It might be disabled for your organization.


API keys are fairly straightforward in GCP though - there's an entire section for that, and even if you're stuck, the search console works.


Here is an entire comment complaining it’s not straightforward, and you make a comment that essentially comes down to ‘no it is’?


I honestly don't know how to explain it any better. There's a devoted section for generating API keys, and once you are on that page, the walk-through workflow indicated on the page itself is very straightforward.


These announcements always make it clear how little companies releasing new AI models actually care about the risks and concerns of developing artificial intelligence.

CEOs love to talk about how important regulation is, how their company needs to develop it before the "wrong people" do, and how they are concerned about what could happen if AI development goes wrong.

Then they announce the latest model that is aimed at expanding both the accuracy and breadth of use cases across multiple modalities. Sure the release links to a security and ethics page, but that page reads more like a company's internal "pillars of success" document with vague phrases that define little to nothing in the way of real, specific concerns or measures to mitigate them. It basically boils down to "Don't be evil" with no clear definition of what that would mean or how they prevent the new, more powerful and broad reaching system from being used in ways that are "evil".


CEOs are the "wrong people". Leaders of obscenely large organizations, unchecked by law nor ethics, wielding and gatekeeping what amounts to superpowers. They are the literal supervillains, not some shadowy "terrorists" or another made up bogeyman.


I'd argue we could get a long way by removing legal protections and incentives for such large organizations. Those protections and incentives seem to me to be the root cause behind such large centralizations of power.

If companies and their leadership couldn't operate so unchecked by our existing laws and public opinion we may not have executives worth worrying about.

For example, if taxes weren't so easy to dodge and if the public actually had a chance to sue large corporations for damages, they may not get so large. If, when losing a lawsuit, companies couldn't shuffle around funds and spin off dummy companies to dodge the pain, and if they weren't often let off paying pennies on the dollar for lost suits, they may think twice about doing some things. When you know your entire business is actually on the line, you have to be more careful.

Throw in election and lobbying reform and we could at least be having a much different conversation about corporate power.


It has been lobotomized enough already. Look at C++20 Concepts example https://news.ycombinator.com/item?id=39395020


Limiting features in the public API isn't quite the same as limiting the technical feasibility of developing an artificial intelligence or consciousness.

Limiting public features will help a bit with concerns over how someone might use a public GPT API, but the technology advancements will be made either way, and ultimately companies won't be able to gatekeep who can use it with 100% accuracy. The boom in GPU hardware is similarly pushing us further down the road to AI development and all the moral and ethical questions that go along with it, even if AI companies were to keep the use of their GPUs and models entirely private.


I don't see any other way, personally.

We can argue that a lot of people have done pretty bad things using the internet, but should it have been regulated in advance?


Regulation doesn't have to be the answer though. The very same people talking out of both sides of their mouths here are the ones who can choose to just not invest in it.

Lock up the hardware in an offline facility and experiment there, if they really think it's important. Hell, even just skipping the doublespeak would be a big step. If they really aren't concerned with the risk then own it; don't tell me it's risky while also releasing a new, more powerful version every 6-12 months.


In one of the demos, it successfully navigates a threejs demo and finds the place to change in response to a request.

How long until it shows similar results on middle-sized and large codebases? And do the job adequately?


1-2 years probably. There will still be a question around who determines what "adequately" is for a while though. Presumably even if an LLM can do something in theory you wouldn't actually want it doing anything without human oversight.

And we should keep in mind that understanding a code change in depth is often just as much work as making the change. When reviewing PRs I don't really know exactly what every change is doing. I certainly haven't tested it to be 100% certain I understand fully. I'm just checking that the logic looks mostly right and that I don't see anything clearly wrong, and even then I'll often need to ask for clarification on why something was done.

I can't imagine LLMs being used in most large code bases for a while yet. They'd probably need to be 99.9% reliable before we can start trusting them to make changes without verifying every line.


Today.


I like that they are rushing with this and don't care enough to make it Gemini 2 or even really release it, to me it looks like they are concerned to share progress.

Hope they do a good job and once OpenAI releases GPT 5 they are competitive with it with their offerings, it will be better for everyone.


Based on what I've seen so far, I think the probability that this is actually better than GPT4 on the kind of real world coding tasks that I use it for is less than 1%. Literally everything from Google on this has been vaporware or laughably bad in actual practice in my personal experience. Which is totally insane to me given their financial resources, human resources, and multi-year lead in AI/DL research, but that's what seems to have happened. I certainly hope that they can develop and actually release a capable model, but at this point, I think you have to be deeply skeptical of everything they say until such a model is available for real by the public and you can try it on actual, real tasks and not fake benchmark nonsense and waitlists.


Remember AI Dungeon and how frustrating it was that it would forget what happened previously? With a 10M context window, am I right to assume it would be possible to weave a story which would span multiple books' worth of content? (more or less 1400 pages)


Pretty much! Check out this demo of finding a scene in a 1400 page book based on a stick figure drawing. Mind blowing, right?

https://twitter.com/JeffDean/status/1758148159942091114


In theory it would be possible to drop a book and just say "hey Google, create a sequel"

But I doubt it is /that/ good, it's not like we can test it either


10M tokens is about 25,000 pages. 10M tokens is also never coming to production and is solely research testing.

1M tokens is what they've said will be available for production and is about 2,500 pages.
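
(Those figures assume roughly 400 tokens per page, i.e. about 300 words per page at ~0.75 words per token: 10,000,000 / 400 ≈ 25,000 pages and 1,000,000 / 400 ≈ 2,500 pages. Exact numbers shift with page density and tokenizer.)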


Dear Google,

Teach Gemini how to be a Dungeon Master, and run free adventures at Comic Con.

Then offer it up as a subscription.

Sincerely,

Everyone


Most data accumulates gradually (e.g., one email at a time, one line of text at a time across various documents). Is this huge 10M scale of context window relevant to a gradual, yet constant, influx of data (like a prompt over a whole google workspace) ?


Incredible. RAG will be obsolete in a year or two.


Obsolete only if you don't take cost into consideration. Having 10 million tokens going through each layer of the LLM is going to cost a lot of money each time. At GPT-4 rates, that could mean 200 dollars for each inference.
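
For rough scale: at the input rates OpenAI has listed for GPT-4-class models (on the order of $0.01-0.03 per 1K input tokens, depending on variant), a 10M-token prompt works out to roughly $100-300 per request, before counting any output tokens.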


It's already obsolete. It doesn't work except for trivial cases which have no real value.


«We’ll also introduce 1.5 Pro with a standard 128,000 token context window when the model is ready for a wider release»

So actually they are lagging: their 128k model is yet to be released while OpenAI released theirs some months ago.


See: https://blog.google/technology/ai/google-gemini-next-generat...

> Gemini 1.5 Pro comes with a standard 128,000 token context window. But starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview.


Gemini 1.5 Pro is not yet released: «Starting today, we’re offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI»

Something like an alpha version.

Limited preview in their jargon.


Their 10M tokens demo is impressive though. They "released" a demo. Confusing...


This is incredible if it isn't just hype!

I hope the demos aren't fudged/scripted like Google did with Gemini 1.0


These demos seem to be videos from AI studio, and which display the time in seconds. Hopefully not fudged.


I feel sad for those who are in law school right now.


Version number suggests they're waiting to announce something bigger already?


I miss when I didn't have to scroll to read a single tweet.


Twitter has that functionality natively now, but I don't know if you have to be a pro user to access. It's the book icon in the upper-right corner of the first tweet in a series. Links to this, but it looks different when I view it in incognito vs logged in: https://twitter.com/JeffDean/thread/1758146022726041615


The functionality I'm talking about is tweets not being walls of text that require scrolling to read. I have no idea what you're describing.


I have Gemini Advanced; do I get access to this? Google is giving Microsoft a run for its money in branding confusion.


Not yet. Gemini Advanced is using Gemini Ultra, not Gemini Pro.


I've read this sentence three times, wow what horrible branding.


Gemini advanced is terrible.

I asked it to rephrase "Are the original stated objectives still relevant?"

It starts going on about Ukraine and Russia.

https://g.co/gemini/share/ddb3887f79e2


I think it took the whole context of the conversation into consideration; you should create a new conversation instead and see if it responds differently.

Or you could be more specific, like "Rephrase the following sentence: 'Are the original stated objectives still relevant?' in a formal way, respond with one option only."


It was a new conversation. I've never mentioned Russia or Ukraine in any conversation ever.


That's so weird, yet interesting. What happens if you open a new convo again and enter the same prompt?


Now it gives a normal answer. I rated the response as 'Bad Response' so maybe that had an impact.


I thought I wouldn't, but I'm getting really, really confused by the naming and branding: which Gemini is a model and which is a product? Advanced, Pro, Ultra, seemingly Pro is getting better than Ultra? And Advanced is the product using the Ultra underlying model?

Ugh, my brain.



> This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

Wait. They are abandoning PaLM 2, which was just announced 9 months ago?


The branding is very confusing, shouldn't this be Gemini Pro 1.5 since the most capable model is called Ultra 1.0?


Google is somehow truly awful at this. I thought it was funny when branding messes happened in 2017. I cried when they announced "Google Meet (original)." Now I don't even know what to do.

I'm stunned that Google hasn't appointed some "name veto person" that can just say "no, you aren't allowed to have three different things called 'Gemini Advanced', 'Gemini Pro', and 'Gemini Ultra.'" Like surely it just takes Sundar saying "this is the stupidest fucking thing I've ever seen" to some SVP to fix this.


And somehow the more advanced one is still on 1.0 (for now) and the less advanced one is on 1.5.


That's like saying it doesn't make sense for Apple to release M3 Pro without simultaneously releasing M3 Ultra.


That's very different.


The only thing that's different is the standard people apply to different companies due to their biases. There are more Apple fanboys on HN than Google fans (Of course, since Google's reputation has been going down for quite a while). Therefore Apple gets a pass. Classic double standard.


It’s different because Apple didn’t release the M1 Ultra at the same time as the M2 Pro. That would be confusing to buyers because it wouldn’t be immediately obvious which one is the better purchase, both being new offerings presented to customers at the same time.

It’s understandable that later generations are better and higher tiers are also better, but usually there is some period of time in between generations to help differentiate them. Here we have Google advancing capability on two axes at the same time.

I give them a pass as this field is advancing rapidly. So good for them. But I think it’s a legitimate call that it adds complexity to their branding. It is different.


This is something close to CPU versioning. You have two axes: performance branding and generation. Nano, Pro and Ultra are something similar to i3, i5 and i7. The numbered versions 1.0, 1.5, ... can be mapped to 13th gen, 14th gen, ... and so on. And people usually don't need to understand the generation part unless they're enthusiasts.


No? Do you call it the iPhone Pro 15 or the iPhone 15 Pro? Their naming makes sense if you follow most consumer technology.


We will ask what its real name is as soon as it becomes sentient


Can anyone lay out the various models and their features or point to a resource?

I asked the free model (whatever that is) and it wasn't very helpful, alternating between being a sales bot for Ultra and being somewhat confused itself.

Edit: apparently it goes 1.0 Pro, 1.0 Ultra, 1.5 Pro, 1.5 Ultra and so on.


Here's the models, https://news.ycombinator.com/item?id=39304270 This is about Gemini Pro going from version 1.0 to 1.5, nothing else.

Gemini ultra 1.0 is still on version 1.0


Here's an updated table, with version numbers included and their status:

   Gemini Models     gemini.google.com
   ------------------------------------
   Gemini 1.0 Nano
   Gemini 1.0 Pro        -> Gemini (free)
   Gemini 1.0 Ultra      -> Gemini Advanced ($20/month)
   Gemini 1.5 Pro        -> announced on 2024-02-15 [1]
   Gemini 1.5 Ultra      -> no public announcements (assuming it's coming)
   
[1]: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

For history of pre-Gemini models at Google, see: https://news.ycombinator.com/item?id=39304441


Oh, it’s you again! Thanks for the update


That isn't right. The Pro/Ultra exists within each version.

If you look at the Gemini report it refers to "Gemini 1.5", then refers to "Gemini 1.5 Pro" and "Gemini 1.0 Pro" and "Gemini 1.5 Pro".


Okay, so if I understand this correctly:

- Gemini 1.5 is the new version of the model Gemini.

- They are at the moment testing it on Gemini Pro and calling it Gemini Pro 1.5

- The testing has shown that Gemini Pro 1.5 is delivering the same quality as Gemini Ultra 1.0 while using less computing power

- Gemini Ultra is still using Gemini 1.0 at the moment


Extremely confusing!


Maybe they use their own generative AI to do their branding


Does anyone know how to get Gemini to help refactor code? I’m trying to paste in my code file and the web page says “an error has occurred” and the code does not show up in the code window. I tried signing up for Gemini Advanced and that didn’t help. I also tried pointing it to the file on GitHub and it said it couldn’t access it. People here are saying Gemini is great for code refactoring. How do you do that?


> Gemini 1.5 delivers dramatically enhanced performance. It represents a step change in our approach, building upon research and engineering innovations across nearly every part of our foundation model development and infrastructure. This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

Looks like they fine tuned across use cases and grabbed the mixtral architecture?


There's no way that's all it is, scaling mixtral to a context length of 10M while maintaining any level of reasoning ability would be extremely slow. If the only purpose of the model was to produce this report then maybe that's possible, but if they plan on actually deploying this to end users then there is no way they can run quadratic attention on 10M tokens.


Dear Google, please fix your names and versioning.

Gemini Pro, Gemini Ultra... but was 1.0?

now upgraded but again Gemini Pro? jumping from 1.0 to 1.5?

wait but not Gemini Pro 1.5... Gemini "1.5" Pro

What actually happened between 1.0 and 1.5?


It's not that difficult.

Their LLM brand is now Gemini. Gemini comes in three different sizes, Nano/Pro/Ultra.

They recently released 1.0 versions of each, most recently (a few months after Nano and Pro) Ultra.

Today they are introducing version 1.5, starting with the Pro size. They say 1.5 Pro offers comparable performance to 1.0 Ultra, along with new abilities (token window size).

(I agree Small/Medium/Large would be better.)


> , starting with the Pro size

This is where it gets confusing IMO.

It's like if Apple announced macOS Blabahee, starting with Mini, not long after releasing Pro and Air touting benefits of Sonoma.

Also, just.. this is how TFA begins:

> Last week, we rolled out our most capable model, Gemini 1.0 Ultra, [...] Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. [...] 1.5 Pro achieves comparable quality to 1.0 Ultra

Last week! And now we have next generation. And the wow is that it's comparable to the best of the previous generation. Ok fine at a smaller size, but also that's all we get anyway. Oh and the most capable remains the last generation one. As long as it's the biggest one.


It's almost exactly like Apple, actually, with their M1 and M2 chips available in different sizes, launching at different times in different products.

It's really not that confusing. There are different sizes and different generations, coming out at different times. This pattern is practically as old as computing itself.

I can't even imagine what alternative naming scheme would be an improvement.


Don't go thinking I'm an Apple 'fanboy', I don't have any Apple devices at the moment, but I really can't imagine them launching a next gen product that isn't better than the best of the last gen.

I doubt they launched M2 MBAs while the MBP was running M1, for example. Or more directly, a low-mid spec M3 MBP while the top-spec M2 MBP (I assume that would out-benchmark it?) still for sale and no comparable M3 chip yet.

It's not having the matrix of size/power & generation that's confusing, it's the 'next generation' one initially launched not being the best. I think that's mainly it for me anyway.


> but I really can't imagine them launching a next gen product that isn't better than the best of the last gen.

But they have. The baseline M2 is significantly less powerful than the M1 Max.

What Google's doing is basically exactly like that. It happens all the time that the mid tier of the next generation isn't as good as the top tier of the previous generation. It might even be the norm.


More powerful isn't the same thing as better. Among other things, better means performance/battery life tradeoff.


Sure but did they release the baseline M2 first, before higher end M2s were available?


I don't understand what that has to do with anything.

There isn't a set order to things. Sometimes companies release a higher powered version first and then the budget version later, sometimes an entry-level version first and a pro version after. Sometimes both simultaneously. All of these are normal, and can even follow different orders generation to generation.


> Last week! And now we have next generation.

Google got caught completely flat-footed by OpenAI. I'm going to cut them some slack for wanting to show the world a bit of flex with their AI chops as soon as they have results.


Thank you, it's clearer to me now. But I also read about "Gemini Advanced" in some Google announcement; do you know what that is and how it relates to the Nano/Pro/Ultra levels?


Gemini is also the brand name for the end-user web and phone chatbot apps, think ChatGPT (app) vs. GPT-# (model).

Gemini Advanced is the paid subscription service tier that at the moment gets you access to the Ultra model, similar to how a ChatGPT Plus subscription gets you access to GPT-4.

Honestly, they should have called this part Gemini Chat and Gemini Chat Plus, but of course ego won't let them follow the competitor's naming scheme.


Oh, I understand, thank you. To me, "Gemini Advanced" is where they screwed up the naming scheme.

With naming that's already complex for regular consumers (Nano/Pro/Ultra, each with a 1.x version), adding this Advanced thing turns it into spaghetti.

I understand that for most people it may just be a chat box and they don't care, but if people are considering paying, they'll research a bit, and it's confusing.


What you described is difficult.


It’s really not. Substitute Gemini for iPhone. Apple releases an iPhone model in mini, standard, and pro lines. They announce iPhone model+1 but are releasing the pro version first. Still difficult?


> Apple releases an iPhone model in mini, standard, and pro lines.

Not an iPhone user, but I just looked at the iPhone 15. I don't see any mini version. I'm guessing 'standard' is just called 'iPhone'? Is Pro the same thing as Plus?

https://www.apple.com/shop/buy-iphone/iphone-15

> Still difficult?

Yes, your example made it even more confusing.


Now you’re being intentionally difficult. Do you want it to be cars? Last year $Automaker released $Sedan 2023 in basic, standard, and luxury trims. This year $Automaker announced $Sedan 2024 but so far have only announced the standard trim. If I had meant the iPhone 15 specifically I would’ve said iPhone 15. I think the 12 was the last mini? The point is product families are often released in generations (versions in the case of Gemini) and with different available specs (ultra/pro/nano etc) that may not all be released at the same time.


Apple discontinued mini phones two generations back, unfortunately.


I think it's the "iPhone +1 Mini is as fast as the old Standard" that confuses people here. This is obvious and expected but not how it's usually marketed I guess ...


So Google will be upgrading the version number of each model at the same time? Based on other comments here, that's not the case - some are 1.5 and some are 1?

Apple doesn't announce the iPhone 12 Mini and compare it to the iPhone 11 Pro.


Uhh, yes they do?

Did you watch the announcements for the M2 and M3 Pros? They compared them to the previous generations all the time.


How? Three models Nano/Pro/Ultra currently at 1.0. New upgrades just increment the version number.


I’m sure they had a discussion that size-based identifiers might imply models are primarily differentiated on the amount of knowledge they have. From that standpoint I don’t agree S/M/L would have been better.


Gemini Ultra 1.0 never went GA. So it's weird that they'd release 1.5 when most people can't even get their hands on 1.0 Ultra.


Isn't the paid version on https://gemini.google.com Gemini 1.0 Ultra?


What's Advanced then, the chat product? Also, by that logic, a 1.5 Ultra is still to come, and it'll bring even bigger guns.


Yes, my understanding is also there will be a 1.5 Ultra.

However, I couldn't find that explicitly stated anywhere. The Technical Report PDF also avoids even hinting at it.

Advanced is a price/service tier for the end-user frontend. At the moment it gets you 1.0 Ultra access vs. 1.0 Pro for the free version. Similar to how ChatGPT Plus gives you 4 instead of 3.5.

I agree this part is messy. Does everyone who had Pro already get 1.5 Pro? If 1.5 Pro is better than 1.0 Ultra, why pay for Advanced? Is 1.5 Pro behind the Advanced paywall? etc.


OK, so from what I've gathered from all of the comments so far, the primary confusion is that both the chat service and the LLM models share the same name.

There are three models, Nano/Pro/Ultra, and all are at v1.0.

There are two tiers of the chat service: free and Advanced (paid).

There is AI Studio from Google, through which you can interact with / use the Gemini LLMs directly.

The free Gemini chat service uses the Gemini Pro 1.0 LLM.

The Gemini Advanced chat service uses the Gemini Ultra 1.0 LLM.

What was shown today is the ~~Ultra~~ Pro 1.5 LLM, which is / will be available to a select few for preview via AI Studio.

That leaves one question: what's Nano for, and is it only used via AI Studio/API?

Jesus, Google..


No, what they showed is Pro 1.5. Only via API and on a waitlist.

How this relates to the end-user chat service/price tiers is still unknown.

The best scenario would be that they just move Gemini free and Advanced tiers to Pro 1.5 and Ultra 1.5, I guess.


Yes, you are right. I meant Pro. Let's see then.


Nano is the on-device (Pixel phone) model.


They should remove the name Gemini Advanced and just stick to one name


Agreed.

Gemini Advanced seems to be the brand name for the higher price tier of the end-user frontend that gets you Ultra access, similar to how ChatGPT Plus gets you GPT-4.

I get it, but it does raise the question of whether you will need Advanced now to get 1.5 Pro. Or does everyone get Pro, making it pointless to pay for 1.0 Ultra?

I still don't think it's confusing, but that part is definitely messy.


So there's Nano 1.0, Pro 1.5, Ultra 1.0, but Pro 1.5 can only be accessed if you're a Vertex AI user (wtf is Vertex)?

That's very difficult.


It's a bit similar to how new OpenAI stuff is initially usually partner-only or waitlisted.

Vertex AI is their developer API platform.

I agree OpenAI is a bit better at launching for customers on ChatGPT alongside API.


Dear OpenAI, please fix your names and versioning. Why do you have GPT-3 and GPT-3.5? What happened between 3 and 3.5? And why isn't GPT-3 a single model? Why are there variations like GPT-3-6.7B and GPT-3-175B? And why is there now a turbo version? How does turbo compare to 4? And what's the relationship between the end-user product ChatGPT and a specific GPT model?

You see this problem isn't unique to Google.


Their inability to name things sensibly has been called out for years and it doesn't look like they care?

I'm not sure what the deal is; it has to be a marketing hindrance as every major tech company is trying to claw their way up the AI service mountain. Seems like the first step would be cogent naming.


It would have been better as Gemini Lite, Gemini, Gemini Pro, and then v1, v1.5 for model bumps.

Ultra vs pro vs nano with Ultra unlocked by buying Gemini Advanced is confusing.

I'm also not sure why they make base Gemini available after you have Advanced, because presumably there's no reason to use a worse model.


They can't decide on a single name for a chat application so I think expecting them to come up with a sensible naming suggestion is optimistic at best.


Furthermore, is a minor version upgrade two months later really "next generation"?


Well if it's from 1 to 1.5 then it's really 5 minor version upgrades at once. And since 1.5 is halfway to 2 and you round up, it's next generation!


Maybe it's not a "next generation" model, but rather their next model for text generation ;)


I mean, I don't see any other models watching and answering questions about a 44-minute video, lol.


I understood the transition as follows.

Google Bard becoming Google Gemini is what they call Gemini 1.0.

Gemini consists of Gemini Nano, Gemini Pro, & Gemini Ultra.

Gemini Nano is for embedded and portable devices I guess? The free version of Gemini (gemini.google.com) is Gemini Pro. The paid version, called Gemini Advanced is using Gemini Ultra.

What we're reading now is about Gemini Pro version 1.0 switching to version 1.5 as of today.


That just made my head spin even more. (Like, I get it, but it's just a very tortuous naming system.) The free version is called Pro, Gemini Advanced is actually Gemini Ultra, the less powerful version upgraded to the more powerful model but the more powerful version is on the less powerful model.

People make fun of OpenAI for not using product names and just calling it "GPT" but at least it's straightforward: 2, 3, 3.5, 4. (On the API side it's a little more complicated since there's "turbo" and "instruct" but that isn't exposed to users, and turbo is basically the default.)


But you don't pay for GPT-4, you pay for a product called ChatGPT Plus, which allows you to write 40 messages to GPT-4 within a three-hour time window, after which you need to switch to 3.5 in the menu.


It was probably not a wise choice to give the model itself and the product the same name: "Gemini Advanced is using Gemini Ultra". Also, "The free version ... is Gemini Pro" is not what you usually see out there.


But if Vertex AI is using Gemini Ultra, then why is MakerSuite (AI Studio now? hmmm) showing only "Gemini 1.0 Pro 001" (001: a version inside a version)?

And why have MakerSuite/AI Studio in the first place, if Vertex AI is the center for all things AI? And why AI Test Kitchen?

I'm seeing only Gemini 1.0 Pro on Vertex AI. So even after enabling Google Gemini Advanced (Ultra?) and enabling Vertex AI API access, I still have to be blessed by Google to access the advanced APIs.

It seems paying for their service doesn't mean anything to Google at this point. As a developer, you have to jump through hoops first.


I think this answers why you can't see Ultra.

"Gemini 1.0 Ultra, our most sophisticated and capable model for complex tasks, is now generally available on Vertex AI for customers via allowlist."

https://cloud.google.com/blog/products/ai-machine-learning/g...


This naming is terrible. If I understand correctly, this is the release of Gemini 1.5 Pro, but not Gemini 1.5 Ultra, right?


Looks like the former PM of chat at Google found a new job.


How is that hard to understand? Yes, it's Gemini 1.5 Pro; they haven't released Ultra or Nano yet. This isn't rocket science; they didn't introduce a Gemini 1.5 ProLight or something. It's the Pro-size model's 1.5 version.


The name of the blog post is "Our next-generation model: Gemini 1.5"; how am I supposed to infer from that that only 1.5 Pro is out and not Ultra?



This just means we'll be getting a Nano 1.5 and Ultra 1.5

and if Pro 1.5 is this good holy shit what will Ultra be...

Nano/Pro/Ultra are the model sizes, 1.0 or 1.5 is the version


Maybe they should take a hint on Windows versions name scheme and call the next version Gemini Meh.


Are you talking about the Xbox One?


No. Gemini Purple Plus Platinum Advanced Home Version 11.P17


You didn't know about Windows Meh? Not sure about the spelling.



The whitepaper says the Buster Keaton film was reduced to 1 FPS before being fed in. Apparently multimodal language models can only read individual pictures, so videos have to be reduced to a series of frames. I assume animal brains are more efficient than that, e.g. by only feeding in the "changes/differences over time" instead of a sequence of time slices.
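
For illustration, here's a minimal sketch of that kind of reduction, assuming OpenCV and a hypothetical local copy of the film (not whatever pipeline Google actually uses):

    import cv2  # pip install opencv-python

    cap = cv2.VideoCapture("sherlock_jr.mp4")        # hypothetical local file
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 24.0      # fall back if metadata is missing
    step = int(round(src_fps))                       # keep roughly one frame per second
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)                     # each kept frame gets tokenized as an image
        i += 1
    cap.release()
    print(len(frames), "stills for a ~45 minute film")   # on the order of 2,700 images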


It will probably eventually be improved by adding some encoder on top of the LLM, which encodes 60 frames into one while attempting to preserve the information.


It would do Google a lot of good if every such announcement weren't met with 'join the waitlist' and 'talk to your Vertex AI team'.


This is bad practice across the board IMO. There seems to be an idea that this builds anticipation for new products. Sounds good in a PowerPoint presentation by an MBA but doesn't work in practice. Six months (or more!) after joining a waitlist, I'm not seeing it for the first time, so I don't really care when yet another email selling me something hits my inbox. I may not even open the email. This could be mitigated somewhat by at least offering a demo, but that's rare.


Likely they have limited capacity and are allotting it to the highest-paying and strategic customers.


As someone who worked in Google Cloud's partnerships team, the way the Early Access Program works (not to mention the Alpha --> Beta --> GA launch process for AI products) is really dysfunctional. Inevitably what happens is that a few strategic customers or partners get exceptionally early (Alpha) access and work directly with the product team to refine things, fix bugs and iron out kinks. This is great and the way market-driven product development should work.

The issues arise with the subsequent stagegate graduation processes, requirements and launches to less restricted markets. It's inconsistent, the QoS pre-GA customers receive is often spotty and the products come with no SLAs, and -- just like Gmail on the consumer side -- things frequently stay in EAP/Beta phase for years with no reliable timeline for launch. ... and then often they're killed before they get to GA, even though they may have been being used by EAP customers for upwards of 1-2 years.

I drafted a new EAP model a few years ago when Google's Cloud AI & Industry Solutions org was in the process of productizing things like the retail recommendation engine and Manufacturing Data Engine, and had all the buy-ins from stakeholders on the GTM side ... but the CAIIS GM never signed off. Subsequently, both the GM & VP Product of that org have been forced out.

In my opinion, this is something Microsoft does very well and Google desperately needs to learn. If they pick up anything from their hyperscaler competitors it should be 1) how to successfully become a market driven engineering company from MSFT and 2) how to never kill products (and not punish employees for only doing KTLO work) from AMZN.


So tactical, wow. Meanwhile OpenAI and others will eat their lunch again.


Agreed. OpenAI also doesn't need to contend with shareholders fearing a GDPR-like fine. Sadly, the larger you are, the bigger the pain from small mistakes.


One PM in 2005 knocked it out of the park with Gmail and every Google PM since then has cargo-culted it.


I'm generally an excited early adopter, but this kills my excitement immediately. I don't know if Gemini is out (or which Gemini is out) because I've associated Google with "you can't try their stuff", so I've learned to just ignore everything about Gemini.


Google is really good at diluting any possible anticipation hardcore users might have for new stuff they do. 10 years ago I loved when there was a big update to one of their Android apps and I could sideload the apk from the internet to try it out early. Then they made all those changes A/B tests controlled by server side flags that would randomly turn themselves on and off, and there was no way to opt in or out. That was one of the (many) moves that contributed to my becoming disenchanted with Android.


I think the way to understand this is to realize that this isn’t targeted at a Hacker News audience and they don’t care what we think. The world doesn’t revolve around us.

What’s the goal? Maybe, being able to work with partners without it being a secret project that will inevitably leak, resulting in inaccurate stories in the press. What are non-goals? Driving sales or creating anticipation with a mass audience, like a movie trailer or an Apple product launch.

So they have to announce something, but most people don’t read Hacker News and won’t even hear about it until later, and that’s fine with them.


There is a Gemini service that you can use with your Google account, but it's kind of meh: it repeats your input and makes all sorts of assumptions. I am confused about the version as well. There's a link to another premium version (1.5?) on its page, which I don't have access to without completing a quest that likely ends with a credit card input. That kills it for me.


Or can't use ... I have a newish work account and downloaded Gemini on a Pixel 8 Pro and get "Gemini isn't available" and "Try again later" with no explanation of why not and when.


This is it. Not a phone app, did not install anything. Maybe your account is not old enough? You're not missing anything anyway.

https://gemini.google.com/

Look, it now has totally useless suggestions, like it was trained on burned-out woke IT workers. I asked it about the weather, sea temperature, and wave height and period in Malaga, which is much less boring than the choices it came up with. First it tried to talk me out of it while waving away responsibility; then it provided useful climate data, which I would otherwise have wasted too much time finding via a Google search. I guess it's good for checking the weather if you can put up with the disclaimers. Also, it knows fishing for garfish in Denmark in May is not a total waste of your time, but a great way to experience local culture and a sustainable activity.

I also asked it about the version: "I am currently running on the Gemini Pro 1.01.5 model".


It lets the company control the narrative, without the distraction of fifty tech bloggers test-driving it and posting divergent opinions or findings. Instead, the conversation is anchored to what the company claims about the product.

It's interesting that it's the opposite of the gaming industry. There, because the reviewers dictate the narrative, the industry is better at ferreting out bogus claims. On the flip side, loud voices sometimes steamroll over decent products because of some ideological vendetta.


Remember when Gmail was new and you needed an invite to join? I guess Google is stuck in 2004.


I'm embarrassed to admit that I bought a Gmail invite on eBay for $6 when it was still invite-only.


That's not entirely a waste, it would have given you a better chance for an email address you wanted.


Yeah. I ended up with an eight letter @gmail.com because I dithered, but if I'd signed up by any means necessary when I'd first heard of it, I would've gotten a four letter one.


I bartered on gmailswap.com, sending someone a bicentennial 50¢ US coin in exchange for an invite.

The envelope made it to the recipient, but the coin fell out in transit because I was young and had no idea how to mail coinage. They graciously gave me the invite anyway.


Ah, to be young and clueless about coinage mailing.


Yielding a priceless anecdote


Nothing to be ashamed of. I think I might have bought a Google Wave invite a couple of years later :/


Well they did promise unlimited space - remember how it kept growing? I guess until it didn't...

But still, compared to Hotmail etc the free storage space (something like 1GB vs 10MB) was well worth $6


shrug It probably gave you months of fun.


They don't seem to remember that this approach literally sank Google+, because people had no use for a social network without their friends on it.


After the complete farce of that last, 90%-faked video of their tech, maybe next time just give us a text box where we can talk to the thing and see it working ourselves.

It's shocking to me: is management really so clueless that they don't realize how far behind they are? This isn't 2010 Google; you're not the company that made your success anymore, and in a decade the only two surefire things that will still exist are Android and Chrome. Search, Maps, and YouTube are all in precarious positions that the right team could dethrone.


They can't do that because only they are the incorruptible stewards empowered with the ability to develop these models, making them accessible to the unwashed masses would be irresponsible!


The victim complex on this topic is getting really old.

They’re an enterprise software company doing an enterprise sales motion.


If that were true, they wouldn't have named it Gemini 1.5 to follow ChatGPT's half-point increments. They desperately want "people" to care about their product so they can win back mindshare.

Anthropic's Claude mostly targets business use cases, and you don't see them writing self-congratulatory articles about Claude v2.1; they just pushed the product.


Claude 2.1 certainly got a news post when it was released: https://www.anthropic.com/news/claude-2-1

Seems reasonably similar in tone to the Google post.


And look at how well it's going for Claude. Their primary claim to fame is being called "an annoying coworker" and that's it.

Why would anyone look to form a contract with Anthropic right now? I'd say they're in danger here, because their models and offerings don't have clear value propositions to customers.


Mindshare is part of enterprise sales, yes.

I work at a very large company and everyone knows about ChatGPT and Gemini (in part because we for our sins have a good chunk of GCP stuff), but I doubt anyone here not doing some LLM-flavored development has ever even heard of Anthropic, let alone Claude.


> They’re an enterprise software company

Really? Someone ought to tell them.


It's because they don't want you to actually use it and see how far behind they are compared to other companies. These announcements are meant to placate investors. "See, we are doing a lot of SotA AI too."


You might be right, but other things from Google tell the same story. For example, I recently tried to get hold of a Pixel 8 Pro. I had to import one from the UK, and when I did, it turned out that the new feature of using the thermometer on humans isn't available outside of the USA. It doesn't even seem that the process to certify it outside of the USA is underway. Google and sales/support just aren't a thing the way they are with Apple, by contrast, which is a total shame. I know Google is strong, if not the strongest, in the tech game; they just need to get their act together, and I believe they can, but sales and support were never in their DNA. Not sure if that can be changed.

I'm more than happy to move my monthly $20 from OpenAI to Google, on top of my YouTube and Google One subscriptions. It's up to Google to take it.


I believe this is standard practice at Google whenever they need to launch a change expected to consume huge resources and they cannot reasonably predict the demand. Though I agree that it's bad PR practice; a waitlist should be considered a compromise, not a PR technique.


I wrote off the PS5 because of waitlists. I was surprised to learn just yesterday that they are now actually, honestly purchasable (what I would consider "released").

I guess I let my original impression anchor my long-term feelings about the product. Oh well.


Totally agree with this. I can see the desire to show off, but I don't understand how anyone can believe this is good marketing strategy. Any initial excitement I get from reading such announcements will be immediately extinguished when I discover I can't use the product yet. The primary impression I receive of the product is "vaporware." By the time it does get released I'll already have forgotten the details of the announcement, lost enthusiasm, and invested my time in a different product. When I'm choosing between AI services, I'll be thinking "no, I can't choose Gemini Pro 1.5 because it's not available yet, and who knows when it will be available or how good it'll be." Then when they make their next announcement, I'll be even less likely to give it any attention.


100%, I can't even use Imagen despite being an early tester of Vertex.


And region based. Yawn.


I don't think I've ever engaged with a product after "joining their waitlist". By the time they end up utilizing that funnel, competitors have already released feature upgrades or new products cannibalizing their offering.


Eh, I think it's about as bad as the OpenAI method of officially announcing something and then "continuously rolling it out to all subscribers" which may be anything between a few days and months.


It's probably going to be dead/deprecated in a year, so maybe there's a silver lining to how hard it is to get to use the service. I, for one, wouldn't "build with Gemini".


I have access and will share some learnings soon


These announcements are mainly for investors and other people interested in planning purposes. It's important to know the roadmap. More information is better.

I get that it's frustrating not to be able to play with it immediately, but that's just life. Announcing things in advance is still a valuable service for a lot of people.

Plus tons of people have been claiming that Google has somehow fallen behind in the AI race, so it's important for them to counteract that narrative. Making their roadmap more visible is a legitimate strategy for that.


Yeah compared to e.g. Apple’s ‘here’s our new iWidget 42 pro, you can buy it now’ it’s at best disappointing.


Apple is good about only announcing real products you can buy. They don't do tech demos. It's always, "here's a problem. the new apple watch solves it. here're five other things the watch does. $399."


Apple is indeed masterful at advertising. Google, somewhat ironically, is really bad at it.


Apple is masterful at product, not just the advertising part. Google builds cool technology and then fails at the product side.


I agree that Apple does a better job, but wasn't Apple Vision Pro announced 240 days before you could get it? I think it's a pretty safe bet that Gemini 1.5 (or something better) will be available anyone who wants to use it in the next 240 days.


The AVP was the exception rather than the norm.

Apple aggressively keeps products under wraps before launch and fires employees and vendors for leaking any sort of news to the press.

Also, a hardware product that is miles ahead of the competition in terms of components and needs a complex setup workflow (for head and eyes), something Apple has not done before, shipping 7-8 months after its announcement is not really comparable to a SaaS API in terms of delays.


AI software release cycles are incredibly short right now. Every month there is some major development released in a usable-right-now form.

The first-of-its-type AR/VR hardware has, understandably, a longer release cycle. Also, Apple announced early to drive up developer interest.


The jury is still out on the Vision Pro, but otherwise your point stands.


And not having to wait months if you live in the EU.


What's worse is that I can't seem to find a way to let Google know where I actually live (as opposed to where I am temporarily traveling, what country my currently inserted SIM card is from etc). And apparently there is no way to do this at all without owning an Android device!

Apple at least lets me change this by moving my iTunes/App Store account, which is its own ordeal and far from ideal, but at least there's a defined process: Tell us where you think you live, provide a form of payment from that place, maybe we'll believe you.


Yeah Google aggressively uses geolocation throughout their services, regardless of your language settings. The flipside of that is that it's really easy to access the latest Gemini or whatever by just using a VPN.


Wait, does that mean if I subscribe to Gemini Pro in country A where it's available (e.g. the US) but travel to Europe, I can't use it?

I'm really frustrated by Google's attitude of "we know better where you are than you do". People travel sometimes and that's not the same thing as moving!


I signed up for all of their AI products when I was in the US; some of them work while I'm out of the country, some don't. I can't tell what the rule is...


I really, really hate all of these geo heuristics. Sure, don't advertise services to people outside of your market, I get that. Do ask for a payment method from that country too to provide your market-specific pricing if you must.

But once I'm a paying customer, I want to use the thing I'm paying for from where I am without jumping through ridiculous hoops!

The worst variant of this I've seen is when you can neither use nor cancel the subscription from outside a supported market.


To be clear, I didn't pay for any of them. I just signed up for early access to every product that uses some form of ML that can remotely be called "AI"...

Once I got accepted, some of them work outside of the US and some don't


One interesting proposal here is a multiple-needle NIAH (needle-in-a-haystack) retrieval benchmark. When they insert 100 needles, the recall rate becomes considerably lower, around 60-70%. I'm not sure what the exact configuration of this benchmark is, but intuitively this makes sense, and it should be a critical metric for the model's reliability.
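
Roughly, I'd expect the setup to look something like this (my own sketch, not their exact configuration):

    import random

    def build_haystack(filler_sentences, needles):
        # scatter the needles at random positions inside a long filler document
        doc = list(filler_sentences)
        for key, value in enumerate(needles):
            doc.insert(random.randint(0, len(doc)), f"The magic number for key {key} is {value}.")
        return " ".join(doc)

    needles = [random.randint(1000, 9999) for _ in range(100)]
    haystack = build_haystack(["Lorem ipsum dolor sit amet."] * 500_000, needles)
    # Prompt: haystack + "List the magic number for every key."
    # Recall = fraction of the 100 values the model reproduces correctly.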


Is this going to be only for consumer Gemini app or for API/Vertex too? The context window is..... Simply lovely.


“One of the key differentiators of this model is its incredibly long context capabilities, supporting millions of tokens of multimodal input. The multimodal capabilities of the model means you can interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines across hundreds of files, full movies, entire podcast series, and more.”


This is nice, but it’s hard to judge how nice without knowing more about how much compute and memory is involved in that level of processing. Obviously Google isn’t going to tell us, but without having some idea it’s impossible to judge whether this is an economically sustainable technology on which to start building dependencies in my own business.


Sustainable? The countdown to cancellation on this project is already underway.

"Does it make sense today?" is really the only question you can ask and then build dependencies with the understanding that the entire thing will go away in 3-7 years.


Very impressive if the benchmarks replicate. Some questions:

* Token cost? In multiples of Gemini Pro 1.0?

* memory usage? Does already scarce gpu memory become even more of a bottleneck?

* video resolution? Sherlock Jr (1924) is their test video - black and white, 45min, low res

Most curious about the video… I wonder if RAG within video will become the next battlefront


Google is a public company. Anything and everything will be scrutinized very heavily by shareholders. Of course, how Zuck operates is very different from how Sundar does.

What they're doing with their free cash is my question. Are they waiting for the LLM bubble to pop so they can buy some of these companies at a discount?


A little off-topic I guess, but is anyone else seeing what I am seeing: a total inability to actually upgrade to paid Gemini? Every time I try to sign up it serves me an error page: "We're sorry - Google One storage plans aren't available right now."


I saw this announcement on Twitter and was excited to check it out, only to see that "we're offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI".

Please, Google, only announce things when people can actually use them.


Very off-topic but I can't help, the pace of change reminds of the "Bates 4000" sketch from The Onion Movie:

https://m.youtube.com/watch?v=fw7FniaeaSo


It would probably be cost-prohibitive to use the 10M context to its fullest each time.

I instead hope for an API that exposes the context as a datastore, so that, like RAG, we can control what to store, but unlike RAG, all the data stays within the context.


So, this has native image/video modality. I wonder whether that gives it an edge in physical / world understanding? That is, handling and navigating our 3/4 dimensions? Cause and effect and so on?


Let's hope this lowers the pricing of GPT-4 to GPT-3.5 levels. Because of OpenAI's ridiculous pricing, we can't use it regularly, as it would cost us thousands of dollars per month.


Hooray for competition.


Still no Ultra model API available to UK devs? Considering Deepmind's London base, this is kinda strange. Maybe they could ask Ultra how to roll it out faster?


As a sidenote, it's worth clicking the play button and then checking how they're highlighting the current paragraph and word in the inspector.


Is there a reason this isn't available in the UK/France/Germany/Spain but is in available in Jersey... and Tuvalu?


Probably because EU/national governments have regulations with respect to the safety and privacy of the users, and the purveyors must evaluate the performance of their products against the regulatory standards.


EU regulations and fines.


Slightly surprisingly I can't get to AI Studio from the UK. It is available in quite a few countries, but not here.


How can I fine-tune these models for my use? Their docs aren't clear on whether the Gemini models are fine-tunable.


Does anyone know what kinds of GPUs/chips Google is using for Gemini? They aren't using Nvidia, correct?



So Google doesn't rely on Nvidia at all? How come they are the only ones that can manage to use non Nvidia chips and compete with Open AI?


Google's been making TPU chips optimized for machine learning and using them in data centers for almost a decade[0]. They were well-poised to capitalize on AI from a lot of angles.

0. https://en.wikipedia.org/wiki/Tensor_Processing_Unit#History


They offer Nvidia GPUs on GCP so they use them at some level.


I think Anthropic and OpenAI could also have offered a one-million-token context window a while ago. The relevant architecture breakthrough was probably when a linear increase in context length only required a linear increase in inference compute instead of a quadratic one. Anthropic and then OpenAI achieved linear context-compute scaling before an architecture for it was published publicly (the Mamba paper).


The problem is, the 128k window performed terribly and showed that attention was mostly limited to the first and last 20%.

Increasing it to 1M just means even more data is ignored.


Maybe their architecture wasn't as good as Mamba, and Google could use the better architecture thanks to being late to the game...


OpenAI has no Moat


A reference to the good doc: https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

While I'm linking semianalysis, though, it's probably worth talking about how everyone except Google is GPU poor: https://www.semianalysis.com/p/google-gemini-eats-the-world-... (paid)

> Whether Google has the stomach to put these models out publicly without neutering their creativity or their existing business model is a different discussion.

Google has a serious GPU (well, TPU) build-out, and the fact that they're able to train MoE models on it means there aren't any technical barriers preventing them from competing at the highest levels.


They also have internet.zip and all of its repo history, as well as Usenet, emails, etc., which others don't.


But GPT-4 is nearly a year old now; I'd wait for OpenAI's next release before passing judgement. Probably rather soon now, I would expect.


You were right, I guess :)


They only have a head start, and the gap is closing.


hence why it's Open


This. He’s right you know.

OpenAI is extremely overvalued and Google is closing their lead rapidly.


Is there any meaningful valuation on OpenAI? It’s not for sale, there is no market.

Google … has no ability to commercialize anything. Their only commercial successes are ads and YouTube. Doing deceptive launches and flailing around with Gemini isn’t helping their product prospects. I wouldn’t take a bet between open ai and anyone, but I also wouldn’t take a bet on Google succeeding commercially on anything other than pervasive surveillance and adware.


> Is there any meaningful valuation on OpenAI? It’s not for sale, there is no market.

Its shares are already for sale on private markets to accredited investors, at a valuation of over $100BN, led by Thrive Capital.

> Google … has no ability to commercialize anything.

Absolute nonsense.

So Google Cloud, Android (Play Store) are not already commercialized? You well know that they are.

> Doing deceptive launches and flailing around with Gemini isn’t helping their product prospects.

Gemini already caught up to (and surpassed) GPT-4V. What is your point?

> I wouldn’t take a bet between open ai and anyone, but I also wouldn’t take a bet on Google succeeding commercially on anything other than pervasive surveillance and adware.

OpenAI's greatest competitor is Google DeepMind which has the advantage of Google's infrastructure to scale up their models quickly and they have direct access to Google's billions. OpenAI cannot afford to make mistakes or delay anything and a single mistake can cost them hundreds of millions of dollars. The majority of the investment from Microsoft is in Azure credits and not in dollars. [0]

[0] https://www.semafor.com/article/11/18/2023/openai-has-receiv...


Technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The 1 million token context window + Gemini 1.0 Ultra level performance seems like it’ll unlock a wide range of incredible use cases!

HN, what are you going to use/build with this?


was this posted by an AI bot


No, they're just applying their Twitter style engagement strategy to HN for some reason...


Lol nope I’m a normal person. Gimme a captcha and I’ll (hopefully) solve it ;)


How do we know you're not an AI bot that figured out how to hire someone from fiverr to solve captchas for you?


Just gotta make sure the captcha requires a >1M token context length to solve...


Just wade through documentation to access it?

Clicking on the AI Studio link doesn't show me the app page; it redirects to a document on early access. I do as required, go back and click the AI Studio link again, and I'm redirected to the same document on enabling early access.

Frustrating.


Does this mean Gemini Ultra 1.0 -> Gemini Ultra 1.5 is the same as GPT-4 -> GPT-4 Turbo?


There's no Gemini Ultra 1.5 yet. Gemini Pro 1.5 is a smaller model than Gemini Ultra 1.0.


Onwards to a billion tokens


The AI race is amazing. Nvidia is reaping the benefits now, but soon the whole world will.


1 million tokens?? This is wild and a lot of RAG can be removed.


Has anyone given an idea of the release timeline for 1.5?


Is there a $20 a month option for 1.5 Ultra?

If there is, where do I sign up?


The signup page on mobile is too big; the submit button doesn't fit :\


Does anyone actually have access to Ultra yet? It's a lame blog post where it says "it's available!" but the fine print says "by whitelist".

Ok, whatever that means.

OpenAI at least releases it all at once, to everyone.


Oh, OpenAI had a lot of waitlists also: the GPT-4 API, large-context versions, etc.


No faked videos this time?

I find it hard to trust google nowadays.


Google is so finished, they are so late on this.


Imagine a day when your new record-setting 10M-token-context model is not enough to make it to HN #1.

Wild times.


Did they say a general availability date?

(a bit confused)


Is this a blog post, or did they actually ship?


Ah, so Google found a moat.


4 weeks ago: Gemini Advanced with Ultra 1.0. Now: Gemini 1.5 Pro.

Is it just me, or is their branding all over the place?


zzZZzzzZ


Imagine sending 5-10 MB over the network per request, and the cost per token. You might accidentally go broke after a big lag.
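
Back-of-envelope on the payload alone, assuming roughly 4 characters per token of English text (my assumption, not a published figure):

    tokens = 10_000_000
    chars_per_token = 4                    # rough average for English prose
    mb = tokens * chars_per_token / 1e6
    print(mb, "MB of raw text for a maxed-out request")   # ~40 MB, before any per-token cost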


> Our teams continue pushing the frontiers of our latest models with safety at the core.

They're not kidding, Gemini (at least what's currently available) is so safe that it's not all that useful.

The "safety" permeates areas where you wouldn't even expect it, like refusing to answer questions about "unsafe" memory management in C. It interjects lectures about safety in answers when you didn't even ask it to do that in the question.

For example, I clicked on one of the four example questions that Gemini proposes to help you get started and it was something like "Write an SMS calling in sick. It's a big presentation day and I'm sad to let the team down." Gemini decided to tell me that it can't impersonate positions of trust like medical professionals or employers (which is not at all what I was asking it to do).

The other things I asked it, it gave me wrong and obviously wrong answers. The funniest (though glad it was obviously wrong) was when I asked it "I'm flying from Karachi to Denver. Will I need to pick up my bags in Newark?" and it told me "no, because Karachi to Newark is a domestic flight"

Unless they stop putting "safety at the core," or figure out how to do it in a way that isn't unnecessarily inhibiting, annoying, and frankly insulting (protip: humans don't like to be accused of asking for unethical things, especially when they weren't asking for them. when other humans do that to us, we call that assuming the worst and it's a negative personality trait), any announcements/releases/breakthroughs from Google are going to be a "meh" for me.


Yeah. I'll believe that when I can use it.


Gemini (or whatever Google AI product) will be all about ads. I'm not adopting this shit. Their whole business model is ads. Why would I adopt a product from a company that only cares about selling more ads?


Agreed, people continually forget that Google has fundamentally failed at everything besides selling ads despite decades of moonshots and other attempts to shift the business. Very skeptical that any company getting 80% revenue from ads will be able to resist the pressure to advertise


Google One's business model is not ads?

I mention Google One because you can access Gemini Ultra through it.


All their services are just a way to get more information about their users so they can serve them ads.

Those Gemini queries will be no exception.


Not true. Gemini looks to be marketed towards companies, where it's far more profitable to just charge thousands of dollars. Ads wouldn't fund AI usage anyway; GPUs are extremely expensive (even Google's fancy TPUs).


I find that hard to believe. Ads most probably already funded all the research, development and manufacturing required to produce those TPUs.

But we'll see, maybe Gemini will become profitable eventually.


Is this just more nonsense from Google, though? I expect big things from Google, but they need to shut up and actually release stuff instead of saying how amazing their stuff is and then releasing potato AI. Nothing they have done in the AI space recently has lived up to the hype. They should stay silent for a bit and then release something that kills GPT-4, if they honestly can, but instead they are just full of hype.


Yeah, their Gemini demo was a disaster. But they have released their Ultra model to the general audience, so you can test it yourself. Talking about killing the competitor is a little funny, considering they are all generative LLMs based on the same principles (and general architecture), with their inherent flaws and shortcomings. None of them can even execute a basic plan like a cheap human assistant can, so their value is very limited.

A breakthrough will only come with a next-generation architecture. LLMs for special domains are currently the most promising approach.


Yeah, but even with Ultra they kept saying how it was better than GPT-4, and then when it actually got released it was awful.


Google is like a nervous and insecure engineer — blowing their value by rushing the narrative and releasing too much too confusingly fast.


When OpenAI raced through 3/3.5/4 it was "this team ships" and excitement.

This cargo-cult hate train is getting tiresome. Half the comments on anything Google-related are like this now, and it doesn't add anything to the conversation.


The difference, though, as someone who really doesn't have a particular dog in this fight, is that I can go use GPT-4 right now, and see for myself whether it's as exciting as the marketing materials say.


When OpenAI launched GPT-4, API access was initially behind a waitlist. And they released multiple demo stills of LMM capabilities on launch day that for months were in a limited partner program before they became generally available only 7 months later.

I also want the shiny immediately when I read about it, but I also know when I am acting entitled and don't go spam comment threads about it.

But really, mostly I mean this: It's fine to criticize things, but when half a dozen people have already raised a point in a thread, we don't need more dupes. It really changes signal-to-noise.


Gemini Ultra was announced two months ago. It just launched in the last week. It literally is still the featured post on the AI section of their blog, above this announcement. https://blog.google/technology/ai/

There’s “this team ships” and there’s “ok maybe wait until at least a few people have used your product before you change it all”.


OpenAI announced GPT-4 image input in mid-March 2023 and made it generally available on the API in November 2023.

Google announced a fancy model two months early and released it in the promised timeframe.

Seems par for the course.


Did OpenAI then announce GPT-5 two weeks after launching GPT-4?

No, of course they didn’t. And you’re comparing one specific feature (image input) and equating it to a whole model’s release date.

Maybe compare apples to apples next time.

People pointing out release/announcement burnout is a reasonable thing; people in general can only deal with the “next new thing” with some breaks to process everything.


I made the comparison because both companies demonstrated advanced/extended abilities (model size, image input) and shipped it delayed.


>"this team ships"

Because they actually shipped ... (!)


... and they literally just did it again.

https://openai.com/sora



