
Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP.

(Opinions our own and not of Google DeepMind.)

PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957

I'm comparing Gemma3 12 B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).

- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval

- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval

Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?


Thank you for the report! We are working with the Ollama team directly and will look into it.

Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.

I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?


Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.

The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down to different model sizes.
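For readers curious what "distillation from a bigger teacher" looks like mechanically, here is a minimal sketch of a standard logit-distillation loss in PyTorch; the temperature, weighting, and framework choice are illustrative assumptions, not Gemma's published recipe:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soften both distributions, then minimise KL(teacher || student).
        # Temperature and scaling are illustrative, not the Gemma recipe.
        t_probs = F.softmax(teacher_logits / temperature, dim=-1)
        s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
        return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2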


Thanks again, very interesting.

One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!

Even a smaller model like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.

This got me thinking how such smaller, but very capable models might be a game-changer in communities where internet might not be available or too expensive for continuous use. It's almost like having a portion of the internet in a box, just add electricity.


Thank you for the feedback! This is why we are so excited to push more and more on small models for both low end and high end smartphones!

Can you provide more information about this “bigger teacher” model?

How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?

We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.


Hey, I'm from the Gemma team. There are a couple of angles to your question.

We do care about prompted instructions, like JSON schemas, and it is something we eval for and encourage you to try. Here's an example from Gemma2 to guide folks looking to do what it sounds like you're interested in:

https://www.youtube.com/watch?v=YxhzozLH1Dk

Multilinguality was a big focus in Gemma3. Give it a try

And for structured output Gemma works well with many structured output libraries, for example the one built into Ollama

https://github.com/ollama/ollama/blob/main/docs/api.md#struc...

In short you should have all the functionality you need!
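For anyone who wants to try that quickly, a minimal sketch with a recent ollama Python client, which accepts a JSON schema via the `format` field (the model tag and schema are just examples, and the exact client API may differ by version):

    import json
    import ollama  # pip install ollama

    # Illustrative schema; decoding is constrained so the reply matches it.
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "capital": {"type": "string"},
        },
        "required": ["name", "capital"],
    }

    response = ollama.chat(
        model="gemma3:12b",
        messages=[{"role": "user", "content": "Describe France as JSON."}],
        format=schema,
    )
    print(json.loads(response["message"]["content"]))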


The Ollama stuff is the old llama.cpp stuff that constrains output tokens.

It's great, I've used it to get outputs from as small a model as 1B.

But it's a stark difference in quality from, say, Phi-4's native tool-calling.

If Gemma 3 is natively trained on tool-calling, i.e. y'all are benching on, say, the Berkeley Function Calling Leaderboard, that'd be great to know out here.

Tangentially, github.com/ochafik is a Googler who landed an excellent overhaul of llama.cpp's tool-calling, might be worth reaching out to (if you're not working with him already!)


Just tried gemma3:4b for structured output and it fails with a strange error (Ollama is the latest version):

Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.

Not sure this is Ollama or gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical, only difference is model id).



As per the technical report, every 5 layers you have a global attention layer. The global attention layer can have as much as a 128k context length during training (though I understand it is usually 32k).

Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse ?

If dense, wouldn't the attention memory requirement here be O(n^2), where n is 128k, for each global layer?


We never train at 128k, only 32k, changing the scaling factor at the end.

We wanted the long-context recipe to be friendly for finetuning, and training at 128k is enough of a pain that we don't do it. For inference, we see that RAM usage at 128k with the 5/1 pattern is close to that of a fully-global-layer model at 32k.

Individual attention layers are always dense.
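To make the RAM comparison concrete, here is a back-of-the-envelope KV-cache estimate. Only the 5:1 local/global ratio comes from this thread; the layer count, KV width, 1024-token sliding window, and bf16 precision below are illustrative assumptions:

    # KV cache per layer: 2 (K and V) * tokens_cached * kv_dim * bytes_per_value.
    def kv_cache_bytes(n_layers, ctx, kv_dim=2048, window=1024,
                       local_per_global=5, bytes_per=2):
        total = 0
        for layer in range(n_layers):
            is_global = layer % (local_per_global + 1) == local_per_global
            # Local layers only cache their sliding window; global layers cache everything.
            tokens = ctx if is_global else min(ctx, window)
            total += 2 * tokens * kv_dim * bytes_per
        return total

    n_layers = 48  # assumed layer count, for illustration only
    print(kv_cache_bytes(n_layers, 128_000) / 2**30)                     # ~8.1 GiB, 5:1 pattern at 128k
    print(kv_cache_bytes(n_layers, 32_000, local_per_global=0) / 2**30)  # ~11.7 GiB, all-global at 32k

With these toy numbers, the 128k cache under the 5:1 pattern comes out slightly below a fully-global cache at 32k, which is consistent with the answer above.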


Thanks for your answer! So in the 32k global layer, every token attends to each of the other 32k tokens?

[Edit: You answered the question when you said that individual attention layers are always dense.]


Thank you!

Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, would you not see gains from also developing models restricted to a small set of languages (e.g. the top four "western" languages, by cultural production, that share an alphabet, or a similar set)?

Edit: of course the multilingual capability can be welcome. On the other hand, there are evident cases in which efficiency is paramount. One can wonder about the tradeoff: how much efficiency is sacrificed for these features.


That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language- or region-specific finetunes [1] [2]. Also, there is a lot of cultural context and nuance in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.

[1] https://huggingface.co/aiplanet/buddhi-indic

[2] https://ai.google.dev/gemma/gemmaverse/sealion


Just wanted to say that Gemini 1.5-Pro is still the SOTA foundational model for certain languages (non-Google models included), so it's disappointing to have received the email that it will be removed in September - it will cause our product quality to go backwards when we're forced to replace it with a worse model. Unless a better one appears by then, but we've extensively tested all the big models and, for the languages in question, none of them perform on the same level.

Happy to elaborate if there's a way to get in touch, in case the team isn't aware of this.


And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to know whether you are sacrificing some response quality, whether any such sacrifice is interestingly negligible, or whether - even more interestingly - quality increases with the added proficiency.

Yes, we have measured the tradeoff. We don't see English perplexity getting worse when introducing multilingual data, and there is only a slight drop (~1%) in some English-language-specific evals.

There are enough small-model teams competing that I feel confident one of them will try this, and if just sticking to English gives a large boost, the others will be forced to follow suit.

It would also kind of suck for non-English speakers, because it would just be another feather in the cap of "English eats the world".


Some numbers to get an idea: if I understand correctly, Gemma3 uses a fixed 256k-entry vocabulary across its sizes; the smallest 1B version has ~300M embedding parameters and ~700M non-embedding parameters; the largest 27B version has ~5x the embedding parameters and ~35x the non-embedding parameters.

Multilingualism covering 140 languages is quite a feat. Gemma3 apparently aims to be compact and efficient, and those two goals put together raise questions. You wonder, for example, how much such extensive multilingualism impacts the above numbers for a comparable level of quality. It is a general question how much multilingualism complicates an embedding space (owing, e.g., to homographic collisions), and the question becomes more prominent when 140 languages are crammed into one model.
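As a sanity check on the embedding figure (the hidden width here is an assumption used only to make the arithmetic concrete):

    vocab_size = 262_144     # the "256k" vocabulary (2**18 entries), shared across sizes
    hidden_width_1b = 1152   # assumed embedding width for the 1B model
    print(vocab_size * hidden_width_1b / 1e6)  # ~302M, in line with the ~300M embedding parameters above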

> non-english speakers

You would produce more specialized models (where it makes sense): Eng; Eng-Fra-Esp-Deu; Man-Can... For a billion weights per model it could probably be financially acceptable.


Will there ever be a Gemma 3 Thinking? How copyable is the Flash Thinking approach to the Gemma series?

That's a very interesting area, but nothing we can announce today.

Excellent work. What optimizer did you use? I assume AdamW? I didn't see it listed.

Is this what powers Gemini?

Google is using Greenhouse for ATS now?

What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?

We recommend using <start_of_turn>user for the system prompt as well.
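Concretely, that means folding any system instructions into the first user turn of Gemma's chat template, along these lines (the instruction text and the blank-line separator are just illustrative choices):

    system = "You are a terse assistant. Answer in one sentence."  # example instructions
    user = "Why is the sky blue?"

    # No dedicated system role: prepend the instructions to the first user turn.
    prompt = (
        f"<start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )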

I was under the impression that the purpose of the "system" prompt is to encode the instruction boundary explicitly, to reduce the risk of injection. Do you enforce some kind of security invariant that we could rely on? For example, does the alignment regimen include adversarial demonstrations so that out-of-order instruction-following (such as contradicting preceding instructions) is penalised?


