Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP.
Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.
I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?
Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.
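For example (made-up numbers, just to illustrate what the ratio means):

    # Hypothetical config, not an actual Gemma size: a model 3840 units wide
    # with 44 layers has a width/depth ratio of ~87, i.e. "around 90".
    width, depth = 3840, 44
    print(width / depth)  # ~87.3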
The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down to different model sizes.
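For folks curious what the distillation part looks like mechanically, here's a minimal sketch of generic logit distillation in PyTorch (the temperature handling is an illustrative assumption, not our exact recipe): the student is trained to match the teacher's next-token distribution rather than only the one-hot labels.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soften both distributions, then take KL(teacher || student).
        t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
        s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
        # kl_div takes the student log-probs as input and, with log_target=True,
        # the teacher log-probs as target.
        return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                        reduction="batchmean") * temperature ** 2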
One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!
Even a smaller model like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.
This got me thinking how such smaller, but very capable models might be a game-changer in communities where internet might not be available or too expensive for continuous use. It's almost like having a portion of the internet in a box, just add electricity.
How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?
We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.
Hey, I'm from the Gemma team. There are a couple of angles to your question.
We do care about prompted instructions, like JSON schema, and it is something we eval for and encourage you to try. Here's an example from Gemma 2 to guide folks looking to do what it sounds like you're interested in.
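Something like this works as a starting point (a rough sketch using the Ollama Python client's structured-output support; treat the exact API surface and model tag as assumptions on my part):

    import json
    import ollama

    # JSON schema we want the answer to conform to
    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"},
        },
        "required": ["city", "population"],
    }

    resp = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user",
                   "content": "Return the largest city in France and its "
                              "population as JSON."}],
        format=schema,  # recent Ollama versions accept a JSON schema here
    )
    print(json.loads(resp["message"]["content"]))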
The Ollama structured-output support is the old llama.cpp grammar machinery that constrains output tokens.
It's great, I've used it to get outputs from as small a model as 1B.
But it's a stark difference in quality from, say, Phi-4's native tool-calling.
If Gemma 3 is natively trained on tool-calling, i.e. y'all are benchmarking on, say, the Berkeley Function Calling Leaderboard, that'd be great to know out here.
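For reference, this is roughly what I mean by the native path (a sketch via Ollama's tools= parameter; the weather function is made up, and whether gemma3 emits real tool calls through it is exactly the open question):

    import ollama

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = ollama.chat(
        model="gemma3:4b",
        messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
        tools=tools,
    )
    # With native tool-calling the call shows up as tool_calls on the message;
    # otherwise you just get plain text back.
    print(resp["message"])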
Tangentially, github.com/ochafik is a Googler who landed an excellent overhaul of llama.cpp's tool-calling, might be worth reaching out to (if you're not working with him already!)
Just tried gemma3:4b for structured output and it fails with a strange error (Ollama is on the latest version):
Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.
Not sure whether this is an Ollama or a gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical; the only difference is the model id).
As per the technical report, every 5 layers there is a global attention layer. The global attention layer can have as much as a 128k context length during training (though I understand it is usually 32k).
Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse?
If dense, would the attention memory requirement be O(n^2), where n is 128k, for each global layer?
We never train at 128k, only 32k, changing the scaling factor at the end.
We wanted the long context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 pattern is close to that of a fully-global-layer model at 32k.
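Back-of-the-envelope version of that RAM claim, counting KV-cache slots rather than exact bytes (the 1024-token sliding window and a 48-layer example are ballpark numbers for the sketch; real usage also scales with KV heads, head dim, and precision, which cancel out in the ratio):

    LAYERS = 48
    WINDOW = 1024  # sliding-window span of the local layers

    def kv_slots(context, local_per_global=5, window=WINDOW, layers=LAYERS):
        group = local_per_global + 1
        n_global = layers // group
        n_local = layers - n_global
        return n_local * min(context, window) + n_global * context

    hybrid_128k = kv_slots(128 * 1024)                    # 5 local : 1 global
    global_32k = kv_slots(32 * 1024, local_per_global=0)  # every layer global
    print(hybrid_128k / global_32k)                       # ~0.7x in this sketch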
Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, wouldn't you also gain from developing models on a selected, limited number of languages (e.g. the topmost four "western" ones in cultural production, with a shared alphabet, or a similar set)?
Edit: of course the multilingual capability can be welcome. On the other hand, there are evident cases in which efficiency is paramount. We can wonder about the tradeoff: how much efficiency is sacrificed for these features.
That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language- or region-specific finetunes [1] [2]. Also, there is a lot of cultural context and nuance in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.
Just wanted to say that Gemini 1.5 Pro is still the SOTA foundational model for certain languages (non-Google models included), so it's disappointing to have received the email that it will be removed in September - it will cause our product quality to go backwards when we're forced to replace it with a worse model. Unless a better one appears by then, but we've extensively tested all the big models, and for the languages in question none of them perform at the same level.
Happy to elaborate if there's a way to get in touch, in case the team isn't aware of this.
And have you measured the trade-off that could come with embracing such a large number of languages and alphabets? It would be interesting to know whether you are sacrificing some response quality, or if the supposed sacrifice is negligible, or if - even more interestingly - quality increases with the added proficiency.
Yes, we have measured the tradeoff. We don't see a drop in English perplexity when introducing multilingual data, and there is only a slight drop (~1%) in some English-specific evals.
There are enough small-model teams competing that I feel confident one of them will try this, and if just sticking to English gives a large boost, the others will be forced to follow suit.
It would also kind of suck for non-English speakers, because it would just be another feather in the cap of "English eats the world".
Some numbers to get an idea: if I understand correctly, Gemma 3 uses a fixed vocabulary of 256k entries across its sizes; the smallest 1B version has ~300M embedding parameters and ~700M non-embedding parameters; the largest 27B version has ~5x the embedding parameters and ~35x the non-embedding parameters.
Multilingualism covering 140 languages is quite a big feat. Gemma 3 apparently aims to be compact and efficient. The two goals put together raise questions. You wonder, for example, how much such extensive multilingualism impacts the above numbers for comparable benchmark results. It may be a general question how much multilingualism complicates an embedding space (owing e.g. to homographic collisions), and the question becomes more prominent when you cram 140 languages into one model.
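For reference, the embedding counts fall straight out of vocabulary size times model width (the widths below are my reading of the configs, so treat them as approximate):

    VOCAB = 256 * 1024  # ~256k-entry vocabulary, shared across model sizes

    for name, width in [("1B", 1152), ("27B", 5376)]:
        print(name, VOCAB * width / 1e6, "M embedding parameters")
    # -> ~302M for the 1B and ~1.4B for the 27B, i.e. roughly the ~5x ratio above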
> non-English speakers
You would produce more specialized models (where it makes sense): Eng; Eng-Fra-Esp-Deu; Man-Can... At a billion weights per model, it would probably be financially acceptable.
What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?
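For context, the shipped templates seem to do roughly this, folding the system text into the first user turn (my approximation of the GGUF/Ollama template, not an official spec):

    def format_turns(system, user):
        # No dedicated system role: prepend the system text to the first user turn.
        first_user = (system + "\n\n" + user) if system else user
        return ("<start_of_turn>user\n" + first_user + "<end_of_turn>\n"
                "<start_of_turn>model\n")

    print(format_turns("You only answer in JSON.", "List two EU capitals."))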
I was under the impression that the purpose of the "system" prompt is to encode the instruction boundary explicitly, to reduce the risk of injection. Do you enforce some kind of security invariant that we could rely on? For example, does the alignment regimen include adversarial demonstrations so that out-of-order instruction-following (such as contradicting a preceding system instruction) is penalised?
(Opinions our own and not of Google DeepMind.)
PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957