Google calls Gemma 3 the most powerful AI model you can run on one GPU (theverge.com)
127 points by gmays 33 days ago | 100 comments



Apparently it can also pray. Seriously, I asked it for biblical advice about a tough situation today and it said it was praying for me. XD


This reminds me of a recent chat I had with Claude, trying to identify what looked like an unusual fossil. The responses included things along the lines of "What a neat find!" or "That's fascinating! I'd love it if you could share more details". The statements were normal nice things to hear from a friend, but I found them pretty off-putting coming from a computer that of course couldn't care less and isn't a living thing I have a relationship with.

This sort of thing worries me quite a bit. The modern internet has already sparked an awful lot of pseudo or para-social relationships through social media, OnlyFans and the like, with serious mental health and social cohesion costs. I think we're heading into a world where a lot of the remaining normal healthy social behavior gets subsumed by LLMs pretending to be your friend or romantic interest.


Trying to find the silver lining in this makes me God's advocate I guess?

I was able to reflect a lot on my upbringing by reading reddit threads. Advice columns, relationships, parenting advice, just dealing with people. It was great to finally have a normalized, standardized world view to bounce my own concepts off. It was like an advice column in an old magazine, but infinitely big. In my early 20s I must have spent entire days on there.

I guess LLMs are the modern, ultra personalized version of that. Internet average, westernized culture, infinite supply, instantly. Just add water and enjoy a normal view of the world, no matter your surroundings or how you grew up. This is going to help out so many kids.

And they're not evil yet. Host your own LLMs before you tell them your secrets, people.


> It was great to finally have a normalized, standardized world view to bounce my own concepts off. It was like an advice column in an old magazine, but infinitely big. In my early 20s I must have spent entire days on there. I guess LLMs are the modern, ultra personalized version of that. Internet average, westernized culture, infinite supply, instantly.

That's a really interesting way to put it and actually made me look back at my own heavily internet-influenced upbringing. Setting healthy personal boundaries? Mindfulness for emotional management? Elevated respect for all types of people and ways of life beyond what my parents were exposed to? Yes. These were not automatically taught to me by my inherited culture or family. I would not have heard about them in a transformative way without the internet. Maybe passively, as something "those weird rich people" do, but not enough to become embedded in my mental operating system. Not to disparage the old culture. I still borrow a lot from it, but yeah I like westernized internet average culture.


I’m in the same boat. And judging by the people I’ve met at Google and FB (before it was Meta) a lot of us are refugees from conservative minded illiberal cultures within North America, Asia, and Europe. Memes are our currency. A lot of the internal cultures of these two companies are steeped in formative memes of those born in the mid-80s who only had the internet to find their people in the early 2000s.


Although I agree with you and GP, there are cynics who will say: "Ha! You think that the totality of Reddit posts is some kind of normalized, standardized, Internet average world view? HA!" There are people deep in ideological bubbles that think Reddit is too liberal! or Reddit is too young! or Reddit is too atheist! or other complaints that amount to "The average Internet-Person doesn't match what I think the average should be!" and they would not be interested in using that ISO Standard World View for anything.

I have a feeling if there is a market for this kind of LLM sounding board, the software writers will need to come up with many different models that differ ideologically, have different priors, and even know different facts and truths, in order to actually be acceptable to a broad swath of users. At the limit, you'd have a different model, tailored to each individual.


I was also quick to dive into early internet forums and feel like I got a lot out of them, but LLMs just seem different. Forums were a novel medium, but it was still real people interacting and connecting with each other, often over a shared interest. With LLMs, none of the social interactions are genuine, and they will always be shallow.

I'm sure some nerds will continue to host their own models but I would bet that 99.9% of social-type LLM interactions will be with corporate hosted models that can and will be tweaked and weighted in whatever ways the host company thinks will make it the most money.

It all reminds me a lot of algorithmic social media feeds. The issues were foreseen very early on even if we couldn't predict the exact details, and it's an unsurprising disappointment that all of the major sites have greatly deemphasized organic interactions with friends and family in favor of ads and outrage bait. LLMs are still in their honeymoon phase, but with the amount of money being plowed into them I don't expect that to last much longer.


New term, "God's advocate"!


>I think we're heading into a world where a lot of the remaining normal healthy social behavior gets subsumed by LLMs pretending to be your friend or romantic interest.

This is already happening.

https://x.com/zymillyyy/status/1902181493553733941


I had to upload that screenshot to ChatGPT and ask it to translate. I must be getting old! ... "Clock my tea", haha. Never heard that one before.


I installed a new “operating system” that sounds a lot like Scarlett Johansson. (Her 2013)


Wow, Twitch parasocial relationships have _nothing_ on this.


I loved when Gemini called out what I thought was a very niche problem as classic. I think there are very few people attempting this stack, to the point where the vendor's documentation is incorrect and hasn't been updated in two years.

"Ah, the "SSL connection already established" error! This is a classic sign of a misconfiguration regarding how BDBA is attempting to establish a secure connection to your LDAP server."


I spent a good half hour "talking" to 4 mini about why Picard never had a family and the nature of the crew as his family despite the professional distance required. It really praised me when I brought up the holodeck scene where Data plays King Henry walking amongst his men. I felt pretty smart, and then realized I'd not actually garnered the admiration of anyone or anything.


I think there's a similar trap when you're using it for feedback on an idea or to brainstorm features and it gives you effusive praise. That's not a paying customer or even a real person. Like those people you quickly learn aren't worth seeking out for feedback because they rave about everything just to be nice.


> “…I found them pretty off-putting coming from a computer that of course couldn't care less and isn't a living thing I have a relationship with.”

I’ve prolly complained about it here, but Spectrum cable’s pay by phone line in NYC has an automated assistant with a few emotive quirks.

I’m shocked at how angry that robot voice makes me feel. I’m not a violent person, but getting played by a robot sets me over the edge in my work day.

Reminds me of a BoingBoing story from years ago about greeter robots being attacked in Japan. Japan has a tradition of verbally greeting customers as they enter the building, and large department stores will have dedicated human greeters stationed at the entrance. IIRC this was a large store that replaced its human greeters with these robots. Rando customers were attacking them. I now know how they feel.


https://www.youtube.com/watch?v=LghsLs3DYUs

Snip Snip was pleasant to talk to until the very end.


> The statements were normal nice things to hear from a friend

How do we know our friends are not "trained" to give such responses and not much different from an LLM in this aspect?

One might say that curiosity is built into our DNA, but DNA is also a part of our training.


People can want things, so there's the possibility that your friends are being honest.

LLMs cannot desire, and in fact are so devoid of interiority that they are not even lying when they say that they do.


After giving me continuously wrong answers, ChatGPT decided it would try to let me indulge it in a "learning opportunity" instead.

I completely understand your frustration, and I genuinely appreciate you pushing me to be more accurate. You clearly know your way around Rust, and I should have been more precise from the start.

If you’ve already figured out the best approach, I’d love to hear it! Otherwise, I’m happy to keep digging until we find the exact method that works for your case. Either way, I appreciate the learning opportunity.


Coding with ChatGPT always makes me angry in the end. Claude at least stays professional. ChatGPT always ends up sounding kinda mean.


Which gives strong reason to suspect "parroting" qualities, the kind that should have been fought in the implementation since day 0.


I always laughed when ChatGPT would reply with the same emoji I typed, regardless of context. Not sure if that's parroting exactly, but I assumed it would understand the meaning (if not the context) of emoji?


It's fun to pick it apart sometimes and get it to correct itself, but you often would never know unless you had direct or deep-cut knowledge to interrogate it with.


I'm not sure how much people parrot on the daily.


And we tackled that issue long ago with education.

What are you trying to imply? Imitating fools or foolery is not a goal; replicating the unintelligent is not intelligence, it is strictly undesirable.


I asked Claude for advice and it said something about it being heartbreaking.


LLMs always do. "I'm very sorry to hear that, seems like you are going through a lot" yadda yadda.

ChatGPT has been giving me too many emojis lately, however. I might tell it to avoid the use of emojis.


And henceforth, I'll treat that statement as a pithy, one-GPU-level effort from whoever offers it.

"Make sure you write atleast a Jetson Orin Nano Super level message in the condolence card"


Reply sounds like it was written by someone who has never had to ask AWS to up their vCPU limit.


Douglas Adams' Electric Monk finally available to the public!


I'm wondering how small of a model can be "generally intelligent" (as in LLM intelligent, not AGI). Like there must be a size too small to hold "all the information" in.

And I also wonder at what point we'll see specialized small models. Like if I want help coding, it's probably ok if the model doesn't know who directed "Jaws". I suspect that is the future: many small, specialized models.

But maybe training compute will just get to the point where we can run a full-featured model on our desktop (or phone)?


> Like there must be a size too small to hold "all the information" in.

We're already there. If you run Mistral-Large-2411 and Mistral-Small-2409 locally, you'll find the larger model is able to recall more specific details about works of fiction. And DeepSeek-R1 is aware of a lot more.

Then you ask one of the Qwen2.5 coding models, and they won't even be aware of it, because they're:

> small, specialized models.

> But maybe training compute will just get to the point where we can run a full-featured model on our desktop (or phone)?

Training time compute won't allow the model to do anything out of distribution. You can test this yourself if you run one of the "R1 Distill" models. E.g., if you run the Qwen R1 distill and ask it about niche fiction, no matter how long you let it <think> for, it can't tell you something the original Qwen didn't know.
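If you want to see this for yourself, here's a rough sketch using the Ollama Python client. The model tags and the question are just placeholders: use a base model, its R1 distill at the same size, and a piece of fiction obscure enough that you can check the answer.

    import ollama  # pip install ollama; assumes a local Ollama daemon with both models pulled

    # Placeholder question: pick niche fiction you actually know the answer to.
    question = "In <your niche novel here>, what is the narrator's hometown?"

    # Placeholder tags: a base Qwen model and its R1 distill at a comparable size.
    for tag in ("qwen2.5:14b", "deepseek-r1:14b"):
        reply = ollama.chat(model=tag, messages=[{"role": "user", "content": question}])
        print(f"--- {tag} ---")
        print(reply["message"]["content"])

The distill will <think> at length, but the final answer shouldn't contain facts the base model didn't already have.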


I suppose we could eventually get to a super-MoE architecture. Models are limited to 4-16GB in size, but you could have hundreds of models on various topics. Load from storage to RAM and unload as needed. Should be able to load up any 4-16GB model in a few seconds. Maybe as well as a 4GB "Resident LLM" that is always ready to figure out which expert to load.
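A minimal sketch of that dispatch loop, assuming Ollama handles the actual loading and unloading from disk (the topic list and model tags here are made up):

    import ollama  # pip install ollama; the daemon does the load/unload for us

    EXPERTS = {                  # hypothetical topic -> specialist model mapping
        "code": "qwen2.5-coder:7b",
        "math": "qwen2.5:7b",
        "general": "gemma3:4b",
    }
    ROUTER = "gemma3:1b"         # the small "resident" model that only picks an expert

    def answer(prompt: str) -> str:
        # Ask the resident model which expert to wake up.
        route = ollama.chat(model=ROUTER, messages=[{
            "role": "user",
            "content": f"Answer with one word (code, math or general): {prompt}",
        }])["message"]["content"].strip().lower()
        expert = EXPERTS.get(route, EXPERTS["general"])
        # keep_alive=0 asks Ollama to unload the expert right after it answers,
        # so only one specialist occupies RAM/VRAM at a time.
        reply = ollama.chat(model=expert,
                            messages=[{"role": "user", "content": prompt}],
                            keep_alive=0)
        return reply["message"]["content"]

The few-second load from fast storage is then the cost you pay on every topic switch.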


> We're already there. If you run Mistral-Large-2411 and Mistral-Small-2409 locally, you'll find the larger model is able to recall more specific details about works of fiction.

Oh, for sure. I guess what I'm wondering is if we know the Small model (in this case) is too small -- or if we just haven't figured out how to train well enough?

Like, have we hit the limit already -- or, in (say) a year, will the Small model be able to recall everything the Big model does (say, as of today)?


It's a sliding scale based on what you consider "generally intelligent", but they're getting smaller and smaller. This 27B model is comparable to 400B models from not much over a year ago. But we'll start to see limits on how far that can go, maybe soon.

You can try different sizes of gemma3 models, though. The biggest one can answer a lot of things factually, while the smallest one is a hilarious hallucination factory, and the others are different levels in between.


That's interesting. Is there some quantitative way to know that a modern 27b model is equal to an older 400b model?


https://medium.com/@elmo92/gemma-3-a-27b-multimodal-llm-bett...

> It comes in sizes from 1 billion to 27 billion parameters (to be precise: 1B, 4B, 12B, 27B), with the 27B version notably competing with much larger models, such as those with 400B or 600B parameters by LLama and DeepSeek.


Maybe Llama 3.3 70B doesn't count as running on "one GPU", but it certainly runs just fine on one Mac, and in my tests it's far better at holding onto concepts over a longer conversation than Gemma 3 is, which starts getting confused after about 4000 tokens.


Gemma 3 is a lot better at writing for sure, compared to 2, but the big improvement is I can actually use a 32k+ context window and not have it start flipping out with random garbage.


It lasted until Mistral released 3.1 Small a week later. Such is the pace of AI...


Yeah, I can't even keep up these days, so now I mainly focus on what I can run locally via Ollama.


Gemma 3 is on Ollama now https://ollama.com/library/gemma3 but surprisingly they don't have Mistral 3.1 yet.

I've managed to run Mistral 3.1 on my laptop using MLX, notes here https://simonwillison.net/2025/Mar/17/mistral-small-31/
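The basic mlx-lm version looks roughly like this; the mlx-community repo name is from memory, so treat it as a placeholder and check Hugging Face for the quant you actually want:

    from mlx_lm import load, generate  # pip install mlx-lm (Apple Silicon only)

    # Placeholder repo: look up the current Mistral Small 3.1 quant on the
    # mlx-community Hugging Face org.
    model, tokenizer = load("mlx-community/Mistral-Small-3.1-24B-Instruct-2503-4bit")

    print(generate(model, tokenizer,
                   prompt="What does a 4-bit quant trade away?",
                   max_tokens=200))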


Are you running vision tasks? Works just fine in ollama as an LLM but the vision component is unimplemented.


I'm using gemma3:27b with the latest ollama (0.6.2) and vision is working.


Yes, but we're talking about Mistral 3.1 here, which doesn't have the vision component implemented. It looks like vLLM supports it, though.



How's their lmarena ELO?


Technically, the 1.58-bit Unsloth quant of DeepSeek R1 runs on a single GPU+128GB of system RAM. It performs amazingly well, but you'd better not be in a hurry.
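For reference, it's just llama.cpp's usual partial offload. Through the Python bindings it looks roughly like this, with the shard filename standing in for whichever Unsloth GGUF you downloaded:

    from llama_cpp import Llama  # pip install llama-cpp-python, built with CUDA support

    llm = Llama(
        model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder shard name
        n_gpu_layers=20,  # offload what fits in the single GPU; the rest lives in the 128GB of system RAM
        n_ctx=4096,
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Briefly: what does a 1.58-bit quant give up?"}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])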


What's the recommended way to run LLMs these days?

Ollama seems to work with DeepSeek R1 with enough memory using an older CPU but it's around 1 token/second on my desktop.


I've looked into it and the only sane answer right now is still "If it flies, floats, or infers, rent it." You need crazy high memory bandwidth for good inference speed, and that means GPUs which are subject to steep monopoly pricing. That doesn't look to be changing anytime soon.

Second place is said to be the latest Macs with lots of unified memory, but it's a distant second place.

The recently announced hardware from nvidia is either underpowered, overpriced, or both, so there's not much point waiting for it.


Makes sense, thanks for sharing! I take it your recommendation not to buy also includes the upcoming Project DIGITS box from Nvidia?


From what I've seen and read (273 GB/s, wtf?), DGX Spark nee DIGITS is a non-starter.

If this pricing turns out to be true: https://www.reddit.com/r/LocalLLaMA/comments/1jgnye9/rtx_pro...

... then this generation of RTX Pro hardware sounds better to me. At the end of the day I don't know anything I didn't see on YouTube or /r/LocalLLaMA, though.


You probably meant 128GB


Edited. TBF, I did say it was slow...


Have you tried the 0.000000001-bit quant? IIUC it's a single param, so you'd get much faster speeds.


I found Mistral Small 3.1, which was released slightly after Gemma 3, much better.

Far fewer refusals, more accurate, less babbling, generally better, especially at coding.



My instinct is that it would be cheaper overall to buy API credits when needed, compared with buying a top-of-the-line GPU which sits idle for most of the day. That also opens up access to larger models.


It's a choice. Running local means personal safety and privacy. It could also mean easier compliance with any enterprise that doesn't want to share data.


This agrees with my own experience. I have a 4070 Super, which of course is nothing to brag about, but TPS using the quantized 27B model is miserable. I could go down to 12B or even smaller, but it would sacrifice quality. Then I could upgrade my gear, but I realize that however much I spend, the experience is not going to be as smooth as off-the-shelf LLM products, and definitely not worth the cost.

Of course it is nice to have an LLM running locally where nobody gets to know your data. But I don't see a point in me spending thousands of $ to do that.


Yeah. Classic capacity / utilization problem.


Does it run on the severed floor?


I love how this show (2022) is just seeping into pop culture.

Stoked for the season 2 finale today. It'll be like watching the Super Bowl.


Actually, the finale is on Friday, not today!


They come out Thursday nights (in the US) at 9pm EST.


what show?


Severance


Severance


Severance


Severance


Does anyone use Google AI? For an AI company with an AI CEO using AI language translation, I think their actual GPT-style products are all terrible and have a terrible rep. And who wants their private conversations shipped back to Google for spying?


I use Gemini (Advanced) over ChatGPT. Google's privacy issues are concerning, but no more so than giving my conversations to OpenAI.

In my experience, Gemini simply works better.


Gemini 2.0 Flash et al. are excellent. I think you need to try them again.


I tried a lot of models on openrouter recently, and I have to say that I found Gemini 2.0 flash to be surprisingly useful.

I’d never used one of Google’s proprietary models before that, but it really hits a sweet spot in the quality vs latency space right now.


The thing they ship on Android is just horrible. It appears closer to a Markov chain than actual AI.


So says Gemma 3.


...until this coming tuesday? ...let's talk value.

EDIT: I do feel like a fool, thank you.


Good news: it's free. Infinite value.


Plenty of free things with no value


There's nothing more expensive than free.


I call it the biggest bs since I had my supper.


It’s a 27B model, I highly doubt that.


Which part do you doubt? That it is the most powerful? That it runs on a single GPU?


That it is the most powerful


The claim is not that it's the most powerful, it is that it's the most powerful model that fits on one card.


What's a better model that can run on a single GPU?


It depends what you're trying to do.

Coding - Mistral-Small-2503 or Qwen2.5-32b-Coder
Reasoning - QwQ-32b
Writing - Gemma-3-27b is good at this.
etc.


Right, this thread is about which models are better than Gemma-3-27B.

I'm a fan of Mistral Small 3 personally but I've not spent enough time with it, Gemma and the new Mistral Small 3.1 to have an opinion of which of those is the "best" model.

The best indicator of model quality I can find right now is still https://lmarena.ai/?leaderboard=

Gemma 3 27B holds an impressive 8th place right now, the second-highest non-proprietary model after DeepSeek R1 (at 6th).

QwQ-32B is 12th. Weirdly I couldn't find either of the Mistral Small 3 models on there.


What's a better model to run on one GPU?


QwQ-32B q4 or deepseek-r1-qwen-32b for reasoning; Qwen2.5 Coder 32B q4 (paired with QwQ-32B) for coding.


QwQ-32


If I ask QwQ-32 anything that is even slightly complicated, it will ramble until it exceeds the context window, then forget my question. That's with the Q4K quant, which is all that fits (with context) on a 3090.

Gemma3 27B gives me a rapid 1shot response, and actually works really well for the type of rubber duck brainstorming partner I often need.


Try giving it a river crossing puzzle with substitutions. QwQ can take a lot of time but it will solve it. Gemma will just confidently give you a wrong answer, and will keep giving you wrong answers if you point out the mistakes.

Now, yes, QwQ will take a lot of tokens to get there (in one case it took it over 5 minutes running on Mac Studio M1 Ultra). Nevertheless, at least it can solve it.


Yeah, but how many river crossing puzzles and murder mystery games was it trained on, and how many times do I actually need to solve a river crossing puzzle?


I’ve had this same exact experience.


It's unequivocally not. What use case do you have where QwQ-32 is outperforming Gemma 3? In real-world use I didn't even prefer it to Gemma 2.


Anything that requires reasoning rather than regurgitating. For a simple example, try the classic river crossing puzzle with non-trivial substitutions. Gemma can't solve it even if you keep pointing out where it fucks up. To be fair, this also goes for all non-CoT 70B models, so it's not surprising. But QwQ can solve it through sheer persistence because it can actually fairly reliably detect when it's wrong, and it just keeps hammering until it gets it done.


Yes ok but when does this come up in the real world?


It's an example of a problem that requires actual reasoning to solve, and also an example of a "looks similar therefore must use similar solution" trap that LLMs are so prone to.

Translating this to code, for example, it means that Gemma is that much more likely to pretend to solve a more complicated problem that you give it by "simplifying" it to something it already knows how to solve.



