It can't make outbound network calls though, so this fails:
llm -m gemini-2.0-flash-exp -o code_execution 1 \
'write python code to retrieve https://simonwillison.net/ and use a regex to extract the title, run that code'
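For reference, the code it needs to run is only a few lines. A rough sketch of the kind of thing it generates for that prompt (my own illustration, not its actual output):

```
import re
import requests

# Fetch the page, then pull the contents of the <title> tag with a regex.
html = requests.get("https://simonwillison.net/").text
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
print(match.group(1).strip() if match else "No title found")
```

In the code execution sandbox it's the `requests.get()` call that fails, because outbound network access is blocked.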
Code execution is okay, but soon runs into the problem of missing packages that it can't install.
Practically, sandboxing hasn't been super important for me. Running Claude with MCP-based shell access has been working fine, as long as you instruct it to use a venv, a temporary directory, etc.
Brown Pelican (Pelecanus occidentalis) heads are white in the breeding season. Birds start breeding aged three to five. So technically the statement is correct but I wonder if Gemini didn't get its pelicans and cormorants in a muddle. The mainland European Great Cormorant (Phalacrocorax carbo sinensis) has a head that gets progressively whiter as birds age.
Big companies can be slow to pivot, and Google has been famously bad at getting people aligned and driving in one direction.
But, once they do get moving in the right direction they can achieve things that smaller companies can't. Google has an insane amount of talent in this space, and seems to be getting the right results from that now.
Remains to be seen how well they will be able to productize and market, but hard to deny that their LLM models are really, really good.
> Remains to be seen how well they will be able to productize and market
The challenge is trust.
Google is one of the leaders in AI and is home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
It's hard to justify committing developers and money to a product when there's a good chance you'll just have to pivot again once they get bored. Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
> they also have an incredibly bad track record of supporting their products
Incredibly bad track record of supporting products that don't grow. I'm not saying this to defend Google, I'm still (perhaps unreasonably) angry because of Reader, it's just that there is a pattern and AI isn't likely to fit that for a long while.
I’m sad for Reader but it was a somewhat niche product. Inbox I can’t forgive. It was insanely good and was killed because it was a threat to Gmail.
My main issue with Google is that internal politics affect users all the time. See the debacle of anything built on top of Android being treated as a second-class citizen.
You can’t trust a company which can’t shield users from its internal politics. It means nothing is aligned in a way that takes users seriously.
The question is seldom "why" they kill it (I'd argue ultimately it doesn't matter), it's about how fast and what they offer as a migration path for those who boarded the train.
That also means the minute Gemini stops looking like a growing product, it's gone from this world, whereas Microsoft-backed alternatives have a fighting chance to get some leeway to recover or pivot.
Yeah, either AI is significant, in which case Google isn't going to kill it. Or AI is a bubble, in which case any of the alternatives one might pick can easily crash and die long before Google end-of-lifes anything.
This isn't some minor consumer play, like a random tablet or Stadia. Anyone who has been paying attention would have noticed that AI has been an important, consistent, long-term strategic interest of Google's for a very long time. They've been killing off the failing/minor products to invest in this.
Yes. Imagine Google banning your entire Google account / Gmail because you violated their gray area AI terms ([1] or [2]). Or, one of your users did via an app you made using an API key and their models.
With that being said, I am extremely bullish on Google AI for a long time. I imagine they land at being the best and cheapest for the foreseeable future.
For me that is a reason for not touching anything from Google for building stuff. I can afford losing my Amazon account, but losing my Google one would be too much. At least they should be clear in their terms that getting banned from their cloud doesn't mean getting banned from Gmail/Docs/Photos...
That won't help. Their TOS and policies are vague enough that they can terminate all accounts you own (under "Use of multiple accounts for abuse" for instance).
Even if it is warranted on their part, the 1% of false positives will be detrimental to those affected. And we all know there is no way to reach out to them in case the account is automatically flagged.
I asked Gemini about banning risks, and it answered:
Gemini: Yes, there is a potential risk of your Google account being suspended if your SaaS is used to process inappropriate content, even if you use Gemini to reject the request. While Gemini can help you filter and identify harmful content, it's not a foolproof solution.
Here are some additional measures you can take to protect your account:
* Content moderation: Implement a robust content moderation system to filter out inappropriate content before it reaches Gemini. This can include keyword-based filtering, machine learning models, and human review.
...
* Regularly review usage: Monitor your usage of Gemini to identify any suspicious activity.
* Follow Google's terms of service: Make sure that your use of Gemini complies with Google's terms of service.
By taking these steps, you can minimize the risk of your account being suspended and ensure that your SaaS is used responsibly.
---
In a follow up question I asked about how to implement robust content moderation and it suggested humans reviewing each message...
> Google is one of the leaders in AI and is home to incredibly talented developers. But they also have an incredibly bad track record of supporting their products.
This is why we've stayed with Anthropic. Every single person I work with on my current project is sore at Google for discontinuing one product or another - and not a single one of them mentioned Reader.
We do run some non-customer facing assets in Google Cloud. But the website and API are on AWS.
> But they also have an incredibly bad track record of supporting their products.
I don't know about that: my wife built her first SME on Google Workspace / GSuite / Google Apps for domain (this thing changed names so many times I lost track). She's now running her second company on Google tools, again.
All she needs is a browser. At one point I switched her from Windows to OS X. Then from OS X to Ubuntu.
Now I just installed Debian GNU/Linux on her desktop: she fires up a browser and opens up Google's GMail / GSuite / spreadsheets and does everything from there.
She's been a happy paying customer of Google products for a great many years, and there's actually phone support for paying customers.
I honestly don't have many bad things to say. It works fine. 2FA is top notch.
It's a much better experience than being stuck in the Windows "Updating... 35%", "here's an ad on your taskbar", "your computer is now slow for no reason" world.
I don't think they'll pull the plug on GSuite: it's powering millions and millions of paying SMEs around the world.
>Say what you will about Microsoft, but at least I can rely on their obsession with supporting outdated products.
Eh... I don't know about that. Their tech graveyard isn't as populous as Google's, but it's hardly empty. A few that come to mind: ATL, MFC, Silverlight, UWP.
Besides Silverlight (which was supported all the way until the end of 2021!), you can still not only run but _write new applications_ using all of the listed technologies.
That doesn't constitute support when it comes to development platforms. They've not received any updates in years or decades. What they've done is simply not remove the build capability from the toolchains; that is, they haven't even done the work that would be required to stop supporting them. Compare that to C#, which has evolved rapidly over the same time period.
Only because they operate different businesses. Google is primarily a service provider. They have few software products that are not designed to integrate with their servers. Many of Microsoft's businesses work fundamentally differently. There's nothing Microsoft could do to Windows to disable all MFC applications and only MFC applications, and if there was it would involve more work than simply not doing anything else with MFC.
I can write something with Microsoft tech and expect it with reasonable likelihood to work in 10 years (even their service-based stuff), but can't say the same about anything from Google.
That alone stops me/my org buying stuff from Google.
I'm not contending that Microsoft and Google are equivalent in this regard, I'm saying that Microsoft does have a history of releasing technologies and then letting them stagnate.
They have to not get blindsided by Sora, while at the same time fighting the cloud war against MS/Amazon.
Weirdly Google is THE AI play. If AI is not set to change everything and truly is a hype cycle, then Google stock withstands and grows. If AI is the real deal, then Google still withstands due to how much bigger the pie will get.
> and Google has been famously bad at getting people aligned and driving in one direction.
To be fair, it's not that they're bad at it -- it's that they generally have an explicit philosophy against it. It's a choice.
Google management doesn't want to "pick winners". It prefers to let multiple products (like messaging apps, famously) compete and let the market decide. According to this way of thinking, you come out ahead in the long run because you increase your chances of having the winning product.
Gemini is a great example of when they do choose to focus on a single strategy, however. Cloud was another great example.
I definitely agree that multiple competing products is a deliberate choice, but it was foolish to pursue it for so long in a space like messaging apps that has network effects.
>> hard to deny that their LLM models are really, really good.
The context window of Gemini 1.5 pro is incredibly large and it retains the memory of things in the middle of the window well. It is quite a game changer for RAG applications.
Bear in mind that a "1 million token" context window isn't actually that. You're being sold a sparse attention model, which is guaranteed to drop critical context. Google TPUs aren't running inference on a TERABYTE of fp8 query-key inputs, let alone TWO of fp16.
> but hard to deny that their LLM models are really, really good
Although I do still pay for ChatGPT, I find it dog slow. ChatGPT is simply way too slow to generate answers. It feels like (even though of course it's not doing the same thing) I'm back to the 80s with my 8-bit computer printing things line by line.
Gemini OTOH doesn't feel like that: answers are super fast.
To me low latency is going to be the killer feature. People won't keep paying for models that are dog slow to answer.
I'll probably be cancelling my ChatGPT subscription soon.
BERT and Gemma 2B were both some of the highest-performing edge models of their time. Google does really well - in terms of pushing efficiency in the community they're second to none. They also don't need to rely on inordinate amounts of compute because Google's differentiating factor is the products they own and how they integrate it. OpenAI is API-minded, Google is laser-focused on the big-picture experience.
For example, those little AI-generated YouTube summaries that have been rolling out are wonderful. They don't require heavyweight LLMs to generate, and can create pretty effective summaries using nothing but a transcript. It's not only more useful than the other AI "features" I interact with regularly, it doesn't demand AGI or chain-of-thought.
I disagree - another way you could phrase this is that Google is presbyopic. They're very capable of thinking long-term (e.g. Google DeepMind and AI as a whole, cloud, video, Drive/GSuite, etc.), but as a result they struggle to respond to quick market changes. AdSense is the perfect example of Google "going long" on a product and reaping the rewards to monopolistic ends. They can corner a market when they set their sights on it.
I don't think Google (or really any of FAANG) makes "good" products anymore. But I do think there are things to appreciate in each org, and compared to the way Apple and Microsoft are flailing helplessly I think Google has proven themselves in software here.
Yet, Google continues to show it'll deprecate its APIs, services, and functionality to the detriment of your own business. I'm not sure enterprises will trust Google's LLM over the alternatives. Too many have been burned throughout the years, including GCP customers.
I read parent comment "grew 35% last quarter" as (income on 2024-09-30) is 1.35 * (income on 2024-07-01)
The balance sheet shows (income on days from 2024-07-01 through 09-30) is 1.35 * (income on days from 2023-07-01 through 09-30)
These are different because with heavily handwavey math the first is growing 35% in a single quarter and the second is growing 35% annually (by comparing like-for-like quarters)
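A quick back-of-the-envelope illustration of why the two readings are so different (made-up numbers, not Google's actual figures):

```
# Reading 1: 35% quarter-over-quarter (vs. the previous quarter).
q2_2024 = 100.0
q3_2024_qoq = q2_2024 * 1.35
# Sustained for a full year, that compounds to ~3.3x, not 1.35x:
print(1.35 ** 4)  # ~3.32

# Reading 2: 35% year-over-year (vs. the same quarter last year),
# which is the like-for-like comparison the filing actually makes.
q3_2023 = 100.0
q3_2024_yoy = q3_2023 * 1.35
```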
> hard to deny that their LLM models are really, really good
I'm so scarred by how much their first Gemini releases sucked that the thought of trying it again doesn't even cross my mind.
Are you telling us you're buying this press release wholesale, or you've tried the tech they're talking about and love it, or you have some additional knowledge not immediately evident here? Because it's not clear from your comment where you are getting that their LLM models are really good.
Beats Gemini 1.5 Pro at all but two of the listed benchmarks. Google DeepMind is starting to get their bearings in the LLM era. These are the minds behind AlphaGo/Zero/Fold. They control their own hardware destiny with TPUs. Bullish.
No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays if a model would be SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.
I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.
Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.
Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.
You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?
Regarding TPUs, sure, for the stuff that’s running on the cloud.
However their on device TPUs lag behind the competition and Google still seem to struggle to move significant parts of Gemini to run on device as a result.
Of course, Gemini is provided as a subscription service as well so perhaps they’re not incentivized to move things locally.
I am curious if they’ll introduce something like Apple’s private cloud compute.
i don’t think they need to win the on device market.
we need to separate inference and training - the real winners are those who have the training compute. you can always have other companies help with inference
> i don’t think they need to win the on device market.
The second Apple comes out with strong on-device AI - and it very much looks like they will - Google will have to respond on Android. They can't just sit and pray that e.g. Samsung makes a competitive chip for this purpose.
I think Apple is uniquely disadvantaged in the AI race to a point people don't realize.
They have less training data to use, having famously been focused on privacy for their users, and thus have no particular advantage in this space due to not having customer data to train on.
They have little to no cloud business, and while they operate a couple of services for their users, they do not have the infrastructure scale to compete with hyperscaler cloud vendors such as Google and Microsoft.
Most of what they would need to spend on training new models would require that they hand over lots of money to the very companies that already have their own models, supercharging their competition.
While there is a chance that Apple might come out with a very sophisticated on-device model, the problem is that they would only be able to compete with other on-device models. The magnitude of compute needed to keep pace with SOTA models is not achievable on a single device. It will take many generations of Apple silicon in order to compete with the compute of existing datacenters.
Google also already has competitive silicon in this space with the Tensor series processors, which are being fabbed at Samsung plants today. There is no sitting and praying necessary on their part as they already compete.
Apple is a very distant competitor in the space of AI, and I see no reason to assume this will change, they are uniquely disadvantaged by several of the choices they made on their way to mobile supremacy.
The only thing they currently have going for them is the development of their own ARM silicon which may give them the ability to compete with Google's TPU chips, but there is far more needed to be competitive here than the ability to avoid the Nvidia tax.
It is likely Apple can get additional data by creating synthetic data for user interactions.
About 7 years ago I trained GAN models to generate synthetic data, and it worked so well. The state of the art has increased a lot in 7 years, so Apple will be fine.
"having famously been focused on privacy for its users and thus having no particular advantage in this space due to not having customer data to train on"
That may not be as big a disadvantage as you think.
Anthropic claim that they did not use any data from their users when they trained Claude 3.5 Sonnet.
sure but they certainly acquired data from mass scraping (including of data produced by their users) and/or data brokering aka paying someone to do the same.
yeah i’ve never understood the outsized optimism for apple’s ai strategy, especially on hn.
they’re a little bit less of a nobody than they used to be, but they’re basically a nobody when it comes to frontier research/scaling. and the best model matters way more than on-device which can always just be distilled later and find some random startup/chipco to do inference
Theory: Apple's lifestyle branding is quite important to the identity of many in the community here. I mean, look at the buy-in at launch for Apple Vision Pro by so many people on HN--it made actual Apple communities and publications look like jaded skeptics.
At what point does the on device stuff eat into their market share though? As on device gets better, who will pay for cloud compute? Other than enterprise use.
I’m not saying on device will ever truly compete at quality, but I believe it’ll be good enough that most people don’t care to pay for cloud services.
That makes no sense. Inference costs dwarf training costs pretty quickly if you have a successful product. Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
> Afaik there is no commodity hardware that can run state of the art models like chatgpt-o1.
Stack enough GPUs and any of them can run o1. Building a chip to infer LLMs is much easier than building a training chip.
Just because one cost dwarfs another does not mean that this is where the most marginal value from developing a better chip will be, especially if other people are just doing it for you. Google gets a good model, inference providers will be begging to be able to run it on their platform, or to just sell google their chips - and as I said, inference chips are much easier.
Each GPU costs ~$50k. You need at least 8 of them to run mid-sized models. Then you need a server to plug those GPUs into. That's not commodity hardware.
more like ~$16k for 16 3090s. AMD chips can also run these models. The parts are expensive but there is a competitive market in processors that can do LLM inference. Less so in training.
I don’t think the AI market will ever really be a healthy one until inference vastly outnumbers training. What does it say about AI if training is done more than inference?
I agree that the in-device inference market is not important yet.
And yet these new models still haven’t reached feature parity with Google Assistant, which can turn my flashlight on, but with all the power of burning down a rainforest, Gemini still cannot interact with my actual phone.
Poor reception is rapidly becoming a non-issue for most of the developed world. I can’t think of the last time I had poor reception (in America) and wasn’t on an airplane.
As the global human population increasingly urbanizes, it’ll become increasingly easy to blanket it with cell towers. Poor(er) regions of the world will increase reception more slowly, but they’re also more likely to have devices that don’t support on-device models.
Also, Gemini Flash is basically positioned as a free model, (nearly) free API, free in GUI, free in Search Results, Free in a variety of Google products, etc. No one will be paying for it.
Many major cities have significant dead spots for coverage. It’s not just for developing areas.
Flash is free for api use at a low rate limit. Gemini as a whole is not free to Android users (free right now with subscription costs beyond a time period for advanced features) and isn’t free to Google without some monetary incentive. Hence why I also originally ask about private cloud compute alternatives with Google.
Your first sentence has a fallacy: you're attributing the whole cost of the device to a single feature and weighing it against the cost of that single feature alone.
Most people are unlikely to buy the device for the AI features alone. It’s a value add to the device they’d buy anyway.
So you need the paid for option to be significantly better than the free one that comes with the device.
Your second sentence assumes the local one is dumb. What happens when local ones get better? Again how much better is the cloud one to compete on cost?
To your last sentence, it assumes data fetching from the cloud. Which is valid but a lot of data is local too. Are people really going to pay for what Google search is giving them for free?
I think it's a more likely assumption that on-device performance will trail off-device models by a significant margin for at least the next few years - of course, if you can magically make it work locally with the same level of performance, it would be better.
Plus a lot of the "agentic" stuff is interaction with the outside world, connectivity is a must regardless.
My point is that you do NOT need the same level of performance. You need an adequate level of performance that the cost to get more performance isn’t worth it to most people.
And my point is that it's way too early to try to optimize for running locally, if performance really stabilizes and comes to a halt (which may likely happen) then it makes more sense to optimize.
Plus once you start with on device features you start limiting your development speed and flexibility.
Yes, but I can't imagine situations where I "have" to run a model when I don't have internet at that time. My life would be more affected by losing the rest of the internet than by having to run a small stupid model locally. At the very least until hallucination is completely solved, as I need the internet to verify the models.
You’re assuming the model is purely for generation though. Several of the Gemini features are lookup of things across data available to it. A lot of that data can be local to device.
That is currently Apple’s path with Apple Intelligence for example.
Yeah they've been slow to release end-user facing stuff but it's obvious that they're just grinding away internally.
They've ceded the fast mover advantage, but with a massive installed base of Android devices, a team of experts who basically created the entire field, a huge hardware presence (that THEY own), massive legal expertise, existing content deals, and a suite of vertically integrated services, I feel like the game is theirs to lose at this point.
The only caution is regulation / anti-trust action, but with a Trump administration that seems far less likely.
Buried in the announcement is the real gem — they’re releasing a new SDK that actually looks like it follows modern best practices. Could be a game-changer for usability.
They’ve had OpenAI-compatible endpoints for a while, but it’s never been clear how serious they were about supporting them long-term. Nice to see another option showing up. For reference, their main repo (not kidding) recommends setting up a Kubernetes cluster and a GCP bucket to submit batch requests.
It's interesting that just as the LLM hype appears to be simmering down, DeepMind is making big strides. I'm more excited by this than by any of OpenAI's announcements.
OT: I’m not entirely sure why, but "agentic" sets my teeth on edge. I don't mind the concept, but the word itself has that hollow, buzzwordy flavor I associate with overblown LinkedIn jargon, particularly as it is not actually in the dictionary...unlike perfectly serviceable entries such as "versatile", "multifaceted" or "autonomous"
To play devil's advocate, the correct use of the word would be when multiple AIs are coordinating and handing off tasks to each other with limited context, such that the handoffs are dynamically decided at runtime by the AI, not by any routine code. I have yet to see a single example where this is required. Most problems can be solved with static workflows and simple rule based code. As such, I do believe that >95% of the usage of the word is marketing nonsense.
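To make that distinction concrete, here's a toy sketch (hypothetical `call_llm` and `TOOLS` stand-ins, not any particular framework): in the static workflow, routine code fixes the sequence of steps; in the agentic version, the model's own output decides the next handoff at runtime.

```
# Toy stand-ins; in practice call_llm would hit a real model API.
def call_llm(prompt: str) -> str:
    return "DONE"  # stub for illustration

TOOLS = {
    "search": lambda ctx: ctx + " [searched]",
    "calculator": lambda ctx: ctx + " [calculated]",
}

def static_workflow(doc: str) -> str:
    # Routine code decides the sequence: always summarize, then translate.
    summary = call_llm(f"Summarize:\n{doc}")
    return call_llm(f"Translate to French:\n{summary}")

def agentic_handoff(task: str) -> str:
    # The model's own output picks the next tool at runtime.
    context = task
    for _ in range(5):  # cap the number of handoffs
        decision = call_llm(f"Task: {context}\nReply with one of {list(TOOLS)} or DONE.").strip()
        if decision == "DONE" or decision not in TOOLS:
            return context
        context = TOOLS[decision](context)
    return context
```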
You nailed an interesting nuance there about agents needing to make their own decisions!
I'm getting fairly excited about "agentic" solutions to the point that I even went out of my way to build "AgentOfCode" (https://github.com/JasonSteving99/agent-of-code) to automate solving Advent of Code puzzles by iteratively debugging executions of generated unit tests (intentionally not competing on the global leaderboard).
And even for this, there's actually only a SINGLE place in the whole "agent" where the models themselves actually make a "decision" on what step to take next, and that's simply deciding whether to refactor the generated unit tests or the generated solution based on the given error message from a prior failure.
I actually have built such a tool (two AIs, each with different capabilities), but still cringe at calling at agentic. Might just be an instinctive reflex.
I think this sort of usage is already happening, but perhaps in the internal details or uninteresting parts, such as content moderation. Most good LLM products are in fact using many LLM calls under the hood, and I would expect that results from one are influencing which others get used.
I'm personally very glad that the word has adhered itself to a bunch of AI stuff, because people had started talking about "living more agentically" which I found much more aggravating. Now if anyone states that out loud you immediately picture them walking into doors and misunderstanding simple questions, so it will hopefully die out.
That's not necessarily a good thing, because those existing words are overloaded while novel jargon is specific.
We use new words so often that we take it for granted. You've passively picked up dozens of new words over the last 5 or 10 years without questioning them.
No, we need a scientific understanding of autonomous intelligent decision-making. The problem with “agentic AI” is the same old “Artificial Intelligence, Natural Stupidity” problem: we have no clue what “reasoning” or “intelligence” or “autonomous” actually means in animals, and trying to apply these terms to AI without understanding them (or inventing a new term without nailing down the underlying concept) is doomed to fail.
We need something to describe a behavioral element in business processes. Something goes into it, something comes out of it - though in this case nondeterminism is involved and it may not be concrete outputs so much as further actioning.
This is what other replies are missing - I've been following AI closely since GPT 2 and it's not immediately clear what agentic means, so to other people, the term must be even less clear. Using the word autonomous can't be worse than agentic imo.
The Gemini 2 models support native audio and image generation, but the latter won't be generally available till January. Really excited for that, as well as 4o's image generation (whenever that comes out). Steerability has lagged behind aesthetics in image generation for a while now and it'd be great to see a big advance in that.
Also, a whole lot of computer vision tasks (via LLMs) could be unlocked with this. Think Inpainting, Style Transfer, Text Editing in the wild, Segmentation, Edge detection, etc.
Maybe some of these tasks are arguably not aligned with the traditional applications of CV, but Segmentation and Edge detection are definitely computer vision in every definition I've come across - before and after NNs took over.
Anyway, I'm glad that this Google release is actually available right away! I pay for Gemini Advanced and I see "Gemini Flash 2.0" as an option in the model selector.
I've been going through Advent of Code this year, and testing each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet, Opus, Gemini Pro 1.5). Gemini has done decent, but is probably the weakest of the bunch. It failed (unexpectedly to me) on Day 10, but when I tried Flash 2.0 it got it! So at least in that one benchmark, the new Flash 2.0 edged out Pro 1.5.
I look forward to seeing how it handles upcoming problems!
I should say: Gemini Flash didn't quite get it out of the box. It actually had a syntax error in the for loop, which caused it to fail to compile, which is an unusual failure mode for these models. Maybe it was a different version of Java or something (I'm also trying to learn Java with AoC this year...). But when I gave Flash 2.0 the compilation error, it did fix it.
For the more Java proficient, can someone explain why it may have provided this code:
`for (int[] current = queue.remove(0)) {`
which was a compilation error for me? The corrected code it gave me afterwards was just
`for (int[] current : queue) {`
and with that one change the class ran and gave the right solution.
I use Claude and Gemini a lot for coding, and I've realized there is no single good or best model. Every model has its upsides and downsides. I was trying to get authentication working according to the newer Manifest V3 guidelines for browser extensions, and every model is terrible. It is one use case where there is not much information or proper documentation, so every model makes up stuff. But this is my experience and I don't speak for everyone.
Relatedly, I'm starting to think more and more that AI is great for mediocre stuff. If you just need to do the 1000th website, it can do that. Do you want to build a new framework? Then there will probably be fewer useful suggestions. (Still not useless though. I do like it a lot for refactoring while building xrcf.)
EDIT: One reason that led me to think it's better for mediocre stuff was seeing the Sora model generate videos. Yes, it can create semi-novel stuff through combinations of existing stuff, but it can't stick to a coherent "vision" throughout the video. It's not like a movie by a great director like Tarantino, where every detail is right and all details point to the same vision. Instead, Sora is just flailing around. I see the same in software. Sometimes the suggestions go towards one style and the next moment into another. I guess AI is currently just much more limited in its context length. Tarantino has been refining his style for 30 years now, always tuning his model towards his vision. AI in comparison seems to just take everything and turn it into one mediocre blob. It's not useless, but that's currently good to keep in mind, I think: you can only use it to generate mediocre stuff.
That's when having a huge context is valuable. Dump all of the new documentation into the model along with your query and the chances of success hugely increase.
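With the `google-generativeai` Python SDK that's just a matter of pasting the docs ahead of the question; a minimal sketch (the model name and file path are placeholders):

```
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # long-context model

# Dump the relevant (new) documentation into the context, then ask the question.
docs = open("manifest_v3_docs.md").read()  # placeholder path
prompt = (
    "Documentation:\n" + docs +
    "\n\nUsing only the documentation above, how do I implement "
    "authentication in a Manifest V3 browser extension?"
)
print(model.generate_content(prompt).text)
```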
This is true for all newish code bases. You need to provide the context it needs to get the problem right. It has been my experience that one or two examples with new functions or new requirements will suffice for a correction.
I can't comment on why the model gave you that code, but I can tell you why it was not correct.
`queue.remove(0)` gives you an `int[]`, which is also what you were assigning to `current`. So logically it's a single element, not an iterable. If you had wanted to iterate over each item in the array, it would need to be:
```
for (int[] current : queue) {
    for (int c : current) {
        // ...do stuff...
    }
}
```
Alternatively, if you wanted to iterate over each element in the queue and treat the int array as a single element, the revised solution is the correct one.
I like that https://artificialanalysis.ai/leaderboards/models describes both quality and speed (tokens/s and first chunk s). Not sure how accurate it is; anyone know? Speed and variance of it in particular seems difficult to pin down because providers obviously vary it with load to control their costs.
Leaderboards are not that useful for measuring the real-life effectiveness of the models, at least in my day-to-day usage.
I am currently struggling to diagnose an IPv6 misconfiguration in my enormous AWS CloudFormation YAML code. I gave the same input to Claude Opus, Gemini, and ChatGPT (o1 and 4o).
4o was the worst. verbose and waste of my time.
Claude completely went off-tangent and began recommending fixes for IPv4 while I specifically asked about IPv6 issues
o1 made a suggestion which I tried out and it fixed it. It literally found a needle in the haystack. The solution is working well now.
Gemini made a suggestion which almost got it right but it was not a full solution.
I must clarify diagnosing network issues on AWS VPC is not my expertise and I use the LLMs to supplement my knowledge.
With the accelerating progress, the "be ready to look again" is becoming a full time job that we need to be able to delegate in some way, and I haven't found anything better than benchmarks, leaderboards and reviews.
FWIW I've found the 'coding' category of the leaderboard to be reasonably accurate. Claude was the best, then o1-mini was typically stronger, and now Gemini Exp 1206 is at the top.
I find myself just paying a la carte via the API rather than paying the $20/mo so I can switch between the models.
poe.com has a decent model where you buy credits and spend them talking to any LLM which makes it nice to swap between them even during the same conversation instead of paying for multiple subscriptions.
Though gpt-4o could say "David Mayer" on poe.com but not on chat.openai.com which makes me wonder if they sometimes cheat and sneak in different models.
I tried accessing Gemini 2.0 Flash through Google AI Studio in the Safari browser on my iPhone, and to my surprise it worked. After I gave it access to my microphone and camera, I was able to have a pretty smooth conversation with it about what it saw through the camera. I pointed the camera at things in my room and asked what they were, and it identified them accurately. It was also able to read text in both English and Japanese. It correctly named a note I played on a piano when I showed it the keyboard with my finger playing the note, but it couldn’t identify notes by sound alone.
The latency was low, though the conversation got cut off a few times.
It's easier for consultants and sales people to sell to enterprise if the terminology is familiar but mysterious.
Bad
1. installed Antivirus software
2. added screen-size CSS rules
3. copied 'Assets' harddrive to DropBox
4. edited homepage to include Bitcoin wallet address link
5. upgraded to ChatGPT Pro
The beauty of LLMs isn’t just that these coding objects speak human vernacular; it's that they can be concatenated with human-vernacular prompts, and that itself can sensibly be used as an input, command, or output without necessarily causing errors, even if a given combination of inputs wasn't preprogrammed.
I have an A.I. textbook, written in pre-LLM days, that covers agent terminology. Agents are just autonomous-ish code that loops on itself with some extra functionality. LLMs, in their elegance, can more easily self-loop out of the box just by concatenating language prompts, sensibly. They are almost agent-ready out of the box thanks to this very elegant quality (the textbook's agent diagram is just a conceptual self-perpetuation loop), except…
Except they fail at a lot, or get stuck at hiccups. But here is a novel thought: what if an LLM becomes more agentic (i.e. more able to sustain autonomous chains of prompts that take actions without a terminal failure) and less copilot-like, not by more complex controlling wrapper code, but by training the core LLM itself to function more fluidly in agentic scenarios?
A better agentically performing LLM that isn't mislabeled with a bad buzzword might not reveal itself in its wrapper control code, but simply by performing better in a typical agentic loop or environment, whatever the initiating prompt, control wrapper code, or pipeline that kicks off its self-perpetuation cycle.
Windows Phone was actually great though, and would've eventually been a major player in the space if Microsoft were stubborn enough to stick with it long enough, like they did with the Xbox.
By his own admission, Gates was extremely distracted at the time by the antitrust cases in Europe, and he let the initiative die.
The Web is dead. It’s pretty clear future web pages, if we call them that, will be assembled on-the-fly by AI based on your user profile and declared goals, interests and requests.
I've been using gemini-exp-1206 and I notice a lot of similarities to the new gemini-2.0-flash-exp: they're not that much actually smarter but they go out of their way to convince you they are with overly verbose "reasoning" and explanations. The reasoning and explanations aren't necessarily wrong per se, but put them aside and focus on the actual logical reasoning steps and conclusions to your prompts and it's still very much a dumb model.
The models do just fine on "work" but are terrible for "thinking". The verbosity of the explanations (and the sheer amount of praise the models like to give the prompter - I've never had my rear end kissed so much!) should lead one to beware any subjective reviews of their performance rather than objective reviews focusing solely on correct/incorrect.
> We're also launching a new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, exploring complex topics and compiling reports on your behalf. It's available in Gemini Advanced today.
Anyone seeing this? I don't have an option in my dropdown.
Anecdotally, using the Gemini App with "Gemini Advanced 2.0 Flash Experimental", the response quality is noticeably improved and faster at some basic Python and C# generation.
Based on initial interactions, it's extremely verbose. It seems to be focused on explaining its reasoning, but even after just a few interactions I have seen some surprising hallucinations. For example, to assess current understanding of AI, I mentioned "Why hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded with text that included "Why haven't they released Claude 3.5 Sonnet First? That's an interesting point." There's clearly some reflection/attempted reasoning happening, but it doesn't feel competitive with o1 or the new Claude 3.5 Sonnet that was trained on 3.5 Opus output.
Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.
Was this written by an LLM? It's pretty bad copy. Maybe they laid off their copywriting team...?
> "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones"
and
> "We’re getting 2.0 into the hands of developers and trusted testers today. And we’re working quickly to get it into our products, leading with Gemini and Search. Starting today our Gemini 2.0 Flash experimental model will be available to all Gemini users."
That's not really any better, since "all of our products" already includes the subset that has at least 2B users. "I brought all my shoes, including all my red shoes."
Again, that's covered by "all our products". Why do we need to be reminded that Google has a lot of users? Someone oblivious to that isn't going to care about this press release.
Scale and cost are defining considerations of LLMs. By saying they're rolling out to billions of users, they're pointing out they're doing something pretty unique and have confidence in a major competitive advantage. Point billions of devices at other high-performing competitors' offerings, and all of them would fall over.
That's not what the sentence says. "Now millions of developers are building with Gemini. And it’s helping us reimagine all of our products — including all 7 of them with 2 billion users — and to create new ones." That does not imply that Gemini will start receiving requests from billions of users. At best it says that they'll start using it in some unspecified way.
That phrasing still sucks, I am neither a native speaker nor a wordsmith but I've worked with professional English writers who could make that look and sound infinitely better.
We’re definitely going to need better benchmarks for agentic tasks, and not just code reasoning. Things that are needlessly painful that humans go through all the time
I've been using Gemini Flash for free through the API using Cline for VS Code. I switch between Claude and Gemini Flash, using Claude for more complicated tasks. Hope that the 2.0 model comes closer to Claude for coding.
Agreed - tried some sample prompts on our data and the rough vibe check is that flash is now as good as the old pro. If they keep pricing the same, this would be really promising.
Does anyone have any insights into how Google selects source material for AI overviews? I run an educational site with lots of excellent information, but it seems to have been passed over entirely for AI overviews. With these becoming an increasingly large part of search--and from the sound of it, now more so with Gemini 2.0--this has me a little worried.
Anyone else run into similar issues or have any tips?
Their Mariner tool for controlling the browser sounds scary and exciting. At the moment, it's an extension, which means JavaScript. Some web sites block automation that happens this way, and developers resort to tools such as Selenium. These use the Chrome DevTools API to automate the browser. It's better, but can still be distinguished from normal use with very technical details. I wonder if Google, who still own Chrome, will give extensions better APIs for automation that can not be distinguished from normal use.
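One of the simpler tells that sites check for today: a WebDriver-driven browser announces itself via `navigator.webdriver`. A minimal sketch with Python Selenium (assuming Chrome/chromedriver are installed):

```
from selenium import webdriver

# Drive Chrome via chromedriver.
driver = webdriver.Chrome()
driver.get("https://example.com/")

# Automated sessions expose themselves: this returns True under Selenium,
# and sites can check it (among many other fingerprints).
print(driver.execute_script("return navigator.webdriver"))
driver.quit()
```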
You would think so, and probably starting the productivity suite (Docs, Sheets, etc.). If Mariner could seamlessly integrate with those, then times are about to get real interesting (if they aren't already!)
Did anyone get to play with the native image generation part? In my experience, Imagen 3 was much better than the competition so I'm curious to hear people's take on this one.
At least when it comes to Go code, I'm pretty impressed by the results so far. It's also pretty good at following directions, which is a problem I have with open source models, and seems to use or handle Claude's XML output very well.
Overall, especially seeing as I haven't paid a dime to use the API yet, I'm pretty impressed.
Does anyone know how to sign up to the speech output wait list or tester program? I have a decent spend with GCP over the years, if that helps at all. Really want DemoTime videos to use those voices. (I like how truly incredible, best-in-the-world TTS is just a footnote in this larger announcement.)
Anyone else annoyed how the ML/AI community just adopted the word "reasoning" when it seems like it is being used very out of context when looking at what the model actually does?
Static computation is not reasoning (these models are not building up an argument from premises; they are merely finding statistically likely completions). Computational thinking/reasoning would be breaking down a problem into algorithmic steps. The model is doing neither. I wouldn't confuse the fact that it can break a problem into steps if you ask it, because again that is just regurgitation. It's not going through that process without your prompt. That is not part of its process to arrive at an answer.
Models trained to do chain-of-thought do it without any specific prompting, and a specific system prompt that enables this behavior is often part of the setup, because that is what the model was trained on.
I don't see how that is "regurgitation", either, if it performs the reasoning steps first, and only then arrives at the answer.
I kinda agree with you but I can also see why it isn't that far from "reasoning" in the sense humans do it.
To wit, if I am doing a high school geometry proof, I come up with a sequence of steps. If the proof is correct, each step follows logically from the one before it.
However, when I go from step 2 to step 3, there are multiple options for step 3 I could have chosen. Is it so different from a "most-likely-prediction" an LLM makes? I suppose the difference is humans can filter out logically incorrect steps, or prune chains of steps that won't lead to the actual theorem quicker. But an LLM predictor coupled with a verifier doesn't feel that different from it.
The point is emergent capabilities in LLMs go beyond statistical extrapolation, as they demonstrate reasoning by combining learned patterns.
When asked, “If Alice has 3 apples and gives 2 to Bob, how many does she have left?”, the model doesn’t just retrieve a memorized answer—it infers the logical steps (subtracting 2 from 3) to generate the correct result, showcasing reasoning built on the interplay of its scale and architecture rather than explicit data recall.
Does it help to explore the annoyance using gap analysis? I think of it as heuristics. As with humans, it's the pragmatic "whatever seems to work" where "seems" is determined via training. It's neither reasoning from first principles (system 2) nor just selecting the most likely/prevalent answer (system 1). And chaining heuristics doesn't make it reasoning, either. But where there's evidence that it's working from a model, then it becomes interesting, and begins to comply with classical epistemology wrt "reasoning". Unfortunately, information theory seems to treat any compression as a model, leading to some pretty subtle delusions.
Also no pricing is live yet. OpenAI's audio inputs/outputs are too expensive to really put in production, so hopefully Gemini will be cheaper. (Not to mention, OAI's doesn't follow instructions very well.)
The Multimodal Live API is free while the model/API is in preview. My guess is that they will be pretty aggressive with pricing when it's in GA, given the 1.5 Flash multimodal pricing.
If you're interested in this stuff, here's a full chat app for the new Gemini 2 APIs with text, audio, image, camera video and screen video. This shows how to use both the WebSocket API and how to route through WebRTC infrastructure.
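If you just want to poke at the raw WebSocket side first, the skeleton is small. A rough sketch with the `websockets` library; the endpoint path and message shapes below are my reading of the preview docs, so treat them as assumptions and check the current API reference:

```
import asyncio, json, os
import websockets

# Assumed preview endpoint for the Multimodal Live API; verify against the docs.
URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
    f"?key={os.environ['GOOGLE_API_KEY']}"
)

async def main():
    async with websockets.connect(URI) as ws:
        # Assumed setup/turn message shapes.
        await ws.send(json.dumps({"setup": {"model": "models/gemini-2.0-flash-exp"}}))
        await ws.recv()  # setup acknowledgement
        await ws.send(json.dumps({"client_content": {
            "turns": [{"role": "user", "parts": [{"text": "Hello!"}]}],
            "turn_complete": True,
        }}))
        async for message in ws:
            print(message)

asyncio.run(main())
```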
Unfortunately the 10rpm quota for this experimental model isn't enough to run an actual Agentic experience on.
That's my main issue with Google: there are several models we want to try with our agent, but quota is limited and we have to jump through hoops to see if we can get it raised.
I think they are really overloading the word "Agent". I know there isn't a standard definition, but I think Google is stretching the meaning way thinner than the level most C-suite execs talk about agents at.
I think DeepMind could make progress if they focused on the agent definition of multi-step reasoning + action through a web browser, and deliver a ton of value, outside of lumping in the seldom used "Look at the world through a camera" or "Multi modal Robots" thing.
If Google cracked robots, past plays show that the market for those isn't big enough to interest Google. Like VR, you just can't get a billion people to be interested in robots - so even if they make progress, it won't survive under Google.
The "Look at the world through a camera" thing is a footnote in an Android release.
Agentic computer use _is_ a product a billion people would use, and it's adjacent to the business interests of Google Search.
I'm quite impressed with the Flash demo's reasoning capabilities. I did the 20 questions game with it, and it found the computer mouse I had in my head. At first it was confused about our roles and said something weird: it thought that it had to guess its own word. Afterwards I had a meta conversation about that weirdness and it gave impressive insights:
"
Why My "I Guess What I Have in Mind" Statement Doesn't Make Sense
The Problem of Agency: You're spot-on when you say it doesn't make sense for me to state "I guess what I (the computer) have in mind." AI does not have a "mind" in the same way that humans do. My internal state is a complex pattern of numerical data, not conscious thoughts or intentions in the way you have things in mind. My phrasing gave the illusion of having a "thing in mind," but in reality, I don't have this capacity.
Misinterpretation of the Task: My statement "I guess what I have in mind" misinterprets the task. You were clearly signaling that I needed to ask questions to uncover what you were thinking of. My failure to grasp this created a weird scenario where it seemed like I was trying to determine my own data set!"
Small models don't "know" as much so they hallucinate more. They are better suited for generations that are based in a ground truth, like in a RAG setup.
A better comparison might be Flash 2.0 vs 4o-mini. Even then, the models aren't meant to have vast world knowledge, so benchmarking them on that isn't a great indicator of how they would be used in real-world cases.
Yes, it's not an apples to apples comparison. My point is the position it's at on the lmarena leaderboard is misplaced due to the hallucination issues.
"Over the last year, we have been investing in developing more agentic models, meaning they can understand more about the world around you, think multiple steps ahead, and take action on your behalf, with your supervision."
"With your supervision". Thus avoiding Google being held responsible. That's like Teslas Fake Self Driving, where the user must have their hands on the wheel at all times.
The best thing about Gemini models is the huge context window: you can just throw big documents at them and find stuff really fast, rather than struggling with cut-offs in Perplexity or Claude.
I work with LLMs and MLLMs all day (as part of my work on JoyCaption, an open source VLM). Specifically, I spend a lot of time interacting with multiple models at the same time, so I get the chance to very frequently compare models head-to-head on real tasks.
I'll give Flash 2 a try soon, but I gotta say that Google has been doing a great job catching up with Gemini. Both Gemini 1.5 Pro 002 and Flash 1.5 can trade blows with 4o, and are easily ahead of the vast majority of other major models (Mistral Large, Qwen, Llama, etc). Claude is usually better, but has a major flaw (to be discussed later).
So, here's my current rankings. I base my rankings on my work, not on benchmarks. I think benchmarks are important and they'll get better in time, but most benchmarks for LLMs and MLLMs are quite bad.
1) 4o and its ilk are far and away the best in terms of accuracy, both for textual tasks as well as vision-related tasks. Absolutely nothing comes even close to 4o for vision-related tasks. The biggest failing of 4o is that it has the worst instruction following of commercial LLMs, and that instruction following gets _even_ worse when an image is involved. A prime example is when I ask 4o to help edit some text, to change certain words, verbiage, etc. No matter how I prompt it, it will often completely re-write the input text to its own style of speaking. It's a really weird failing. It's like their RLHF tuning is hyper-focused on keeping it aligned with the "character" of 4o to the point that it injects that character into all its outputs no matter what the user or system instructions state. o1 is a MASSIVE improvement in this regard, and is also really good at inferring things so I don't have to explicitly instruct it on every little detail. I haven't found o1-pro overly useful yet. o1 is basically my daily driver outside of work, even for mundane questions, because it's just better across the board and the speed penalty is negligible. One particular example of o1 being better came up yesterday. I had it re-wording an image description, and thought it had introduced a detail that wasn't in the original description. Well, I was wrong and had accidentally skimmed over that detail in the original. It _told_ me I was wrong, and didn't update the description! Freaky, but really incredible. 4o never corrects me when I give it an explicit instruction.
4o is fairly easy to jailbreak. They've been turning the screws for awhile so it isn't as easy as day 1, but even o1-pro can be jailbroken.
2) Gemini 1.5 Pro 002 (specifically 002) is second best in my books. I'd guesstimate it at being about 80% as good as 4o on most tasks, including vision. But it's _significantly_ better at instruction following. Its RLHF is a lot lighter than the ChatGPT models', so it's easier to get these models to fall back to pretraining, which is really helpful for my work specifically. But in general the Gemini models have come a long way. The ability to turn off model censorship is quite nice, though it does still refuse at times. The Flash variation is interesting; oftentimes on par with Pro, with Pro edging it out maybe 30% of the time. I don't frequently use Flash, but it's an impressive model for its size. (Side note: The Gemma models are ... not good. Google's other public models, like so400m and OWLv2, are great, so it's a shame their open LLM forays are falling behind). Google also has the best AI playground.
Jailbreaking Gemini is a piece of cake.
3) Claude is third on my list. It has the _best_ instruction following of all the models, even slightly better than o1. Though it often requires multi-turn to get it to fully follow instructions, which is annoying. Its overall prowess as an LLM is somewhere between 4o and Gemini. Vision is about the same as Gemini, except for knowledge-based queries, which Gemini tends to be quite bad at (who is this person? Where is this? What brand of guitar? etc). But Claude's biggest flaw is the insane "safety" training it underwent, which makes it practically useless. I get false triggers _all_ the time from Claude. And that's to say nothing of how unethical their "ethics" system is to begin with. And what's funny is that Claude is an order of magnitude _smarter_ when it's reasoning about its safety training. It's the only real semblance of reason I've seen from LLMs ... all just to deny my requests.
I've put Claude third out of respect for the _technical_ achievements of the product, but I think the developers need to take a long look in the mirror and ask why they think it's okay for _them_ to decide what people with disabilities are and are not allowed to have access to.
4) Llama 3. What a solid model. It's the best open LLM, hands down. Nowhere near the commercial models above, but for a model that's completely free to use locally? That's invaluable. Their vision variation is ... not worth using. But I think it'll get better with time. The 8B variation far outperforms its weight class. 70B is a respectable model, with better instruction following than 4o. The ability to finetune these models to a task with so little data is a huge plus. I've made task specific models with 200-400 examples.
5) Mistral Large (I forget the specific version for their latest release). I love Mistral as the "underdog". Their models aren't bad, and behave _very_ differently from all other models out there, which I appreciate. But Mistral never puts any effort into polishing their models; they always come out of the oven half-baked. Which means they frequently glitch out, have very inconsistent behavior, etc. Accuracy and quality are hard to assess because of this inconsistency. On its best days it's up near Gemini, which is quite incredible considering the models are also released publicly. So theoretically you could finetune them to your task and get a commercial-grade model to run locally. But I rarely see anyone do that with Mistral, I think partly because of their weird license. Overall, I like seeing them in the race and hope they get better, but I wouldn't use it for anything serious.
Mistral is lightly censored, but fairly easy to jailbreak.
6) Qwen 2 (or 2.5 or whatever the current version is these days). It's an okay model. I've heard a lot of praise for it, but in all my uses thus far it's always been really inconsistent, glitchy, and weak. I've used it both locally and through APIs. I guess in _theory_ it's a good model, based on benchmarks. And it's open, which I appreciate. But I've not found any practical use for it. I even tried finetuning with Qwen2-VL 72B, and my tiny 8B JoyCaption model beat it handily.
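As promised above, here's a rough sketch of that kind of small-data finetune, assuming the Hugging Face transformers/peft/trl stack; the model choice, file name, and hyperparameters are all illustrative rather than what was actually used:

    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # A few hundred examples in JSONL, each with a "text" field holding the full
    # prompt + response for the task (path and format are made up for this sketch).
    dataset = load_dataset("json", data_files="task_examples.jsonl", split="train")

    trainer = SFTTrainer(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # loaded from the hub by name
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA keeps the run cheap
        args=SFTConfig(
            output_dir="llama3-task-lora",
            num_train_epochs=3,
            per_device_train_batch_size=2,
            dataset_text_field="text",
        ),
    )
    trainer.train()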
That's about the sum of it. AFAIK that's all the major commercial and open models (my focus is mainly on MLLMs). OpenAI are still leading the pack in my experience. I'm glad to see good competition coming from Google finally. I hope Mistral can polish their models and be a real contender.
There are a couple smaller contenders out there like Pixmo/etc from allenai. Allen AI has hands down the _best_ public VQA dataset I've seen, so huge props to them there. Pixmo is ... okayish. I tried Amazon's models a little but didn't see anything useful.
NOTE: I refuse to use Grok models for the obvious reasons, so fucks to be them.
Claude MCP does the same thing. It's the setup that is hard. It will do push, pull, and create branches automatically from a single prompt. $500 a month for Devin could be worth it if you want the setup taken care of, plus use of the models for a team, but a single person can set it up themselves.
Agents are the worst idea. I think AI will start to progress again when we get better models that drop this whole chat idea, and just focus on completions. Building the tools on top should be the work that is given over to the masses. It's sad that instead the most powerful models have been hamstrung.
Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now, after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and ahead of o1-preview and o1-mini:
It is interesting to see that they keep focusing on the cheapest model instead of the frontier model. Probably because of their primary (internal?) customers' needs?
It's cheaper and faster to train a small model, which is better for a research team to iterate on, right? If Google decides that a particular small model is really good, why wouldn't they go ahead and release it while they work on scaling up that work to train the larger versions of the model?
On the other hand, once you have a large model, you can use various distillation techniques to train smaller models faster and with better results. Meta seems to be very successful doing this with Llama, in particular.
I have no knowledge of Google-specific cases, but in many teams smaller models are trained from bigger frontier models through distillation. So the frontier models come first, then the smaller models later.
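To make the pattern concrete, here is a minimal sketch of knowledge distillation as it's usually described, assuming a PyTorch setup; the function and its hyperparameters are illustrative, not any lab's actual pipeline:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: push the student to match the teacher's full output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary next-token cross-entropy against the ground truth.
        hard = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
        return alpha * soft + (1 - alpha) * hard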
Training a "frontier model" without testing the architecture is very risky.
Meta trained the smaller Llama 3 models first, and then trained the 405B model on the same architecture once it had been validated on the smaller ones. Later, they went back and used that 405B model to improve the smaller models for the Llama 3.1 release. Mistral started with a number of small models before scaling up to larger models.
I feel like this is a fairly common pattern.
If Google had a bigger version of Gemini 2.0 ready to go, I feel confident they would have mentioned it, and it would be difficult to distill it down to a small model if it wasn't ready to go.
The problem is that the last generation of the largest models failed to beat smaller models on the benchmarks; see the lack of a new Claude Opus or GPT-5. The problem is probably in the benchmarks, but anyway.
Considering so many of us would like more vRAM than NVIDIA is giving us for home compute, is there any future where these Trillium TPUs become commodity hardware?
Power concerns aside, individual chips in a TPU pod don't actually have a ton of vRAM; they rely on fast interconnects between a lot of chips to aggregate vRAM and then rely on pipeline / tensor parallelism. It doesn't make sense to try to sell the hardware -- it's operationally expensive. By keeping it in house Google only has to support the OS/hardware in their datacenter and they can and do commercialize through hosted services.
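As a rough illustration of what aggregating memory across chips looks like in practice, here's a toy JAX sketch that shards one large array across whatever accelerator devices are attached, so no single chip has to hold all of it; the mesh shape and array size are made up:

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = np.array(jax.devices())            # e.g. the TPU cores visible on one host
    mesh = Mesh(devices, axis_names=("model",))

    # Split the second dimension of a big weight matrix across the "model" axis,
    # so each chip stores only 1/len(devices) of it.
    sharding = NamedSharding(mesh, P(None, "model"))
    weights = jax.device_put(jnp.zeros((8192, 8192)), sharding)

    print(weights.sharding)                      # shows how the array is laid out across chips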
Why do you want the hardware vs just using it in the cloud? If you're training huge models you probably don't also keep all your data on prem, but on GCS or S3, right? It'd be more efficient to use training resources close to your data. I guess inference on huge models? Still, isn't just using a hosted API simpler / what everyone is doing now?
Reminder that implied models are not actual models. Models have repeatedly failed to materialize and vanished without further mention. I assume no one is trying to be misleading, but at this point, maybe they're overly optimistic.
Speed looks good vis-a-vis 4o-mini, and quality looks good so far against my eval set. If it's cheaper than 4o-mini too (which it probably will be?), then OpenAI have a real problem, because switching between them is just changing a value in a config file.
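A sketch of how small that switch can be, assuming both models are reachable through an OpenAI-compatible endpoint (Google exposes one for Gemini); the env-var names and model strings here are illustrative:

    import os
    from openai import OpenAI

    # The "config file" part: one model name and one base URL, flipped between e.g.
    # "gpt-4o-mini" on the default endpoint and "gemini-2.0-flash-exp" on Google's
    # OpenAI-compatible endpoint.
    MODEL = os.environ.get("CHEAP_MODEL", "gpt-4o-mini")
    client = OpenAI(
        base_url=os.environ.get("LLM_BASE_URL"),  # None falls back to api.openai.com
        api_key=os.environ["LLM_API_KEY"],
    )

    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
    )
    print(resp.choices[0].message.content)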
Is this what it feels like to become one of the gray bearded engineers? This sounds like a bunch of intentionally confusing marketing drivel.
When capitalism has pilfered everything from the pockets of working people so that people are constantly stressed over healthcare and groceries, and there's little left to further line the pockets of plutocrats, the only marketing that makes sense is to appeal to other companies in order to raid their coffers by tricking their directors into buying a nonsensical product.
Is that what they mean by "agentic era"? Cause that's what it sounds like to me. Also smells a lot like press-release-driven development, where the point is to put a feather in the cap of whatever poor Google engineer is chasing their next promotion.
> Is that what they mean by "agentic era"? Cause that's what it sounds like to me.
What are you basing your opinion on? I have no idea how well these LLM agents will perform, but it's definitely a thing. OpenAI is working on them, as are Anthropic (Claude) and certainly Google.
Yeah, it's a lot of marketing fluff, but these tools are genuinely useful, and it's no wonder Google is working hard to prevent them from destroying its search-dependent bottom line.
Marketing aside, agents are just LLMs that can reach out of their regular chat bubbles and use tools. Seems like just the next logical evolution
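For what it's worth, the core loop really is that simple. Here's a bare-bones sketch using the OpenAI tool-calling API as one concrete example; the get_weather tool is made up for illustration:

    import json
    from openai import OpenAI

    client = OpenAI()

    def get_weather(city: str) -> str:
        return f"Sunny and 20C in {city}"   # stand-in for a real API call

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Reykjavik?"}]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
    msg = resp.choices[0].message

    # If the model decided to reach out of its chat bubble, run the tool and
    # hand the result back so it can produce a final answer.
    if msg.tool_calls:
        call = msg.tool_calls[0]
        result = get_weather(**json.loads(call.function.arguments))
        messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)

    print(resp.choices[0].message.content)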
Side note on Gemini: I pay for Google Workspace simply to enable e-mail capability for a custom domain.
I never used the web interface to access email until recently. To my surprise, all of the AI shit is enabled by default. So it’s very likely Gemini has been training on private data without my explicit consent.
Of course G words it as “personalizing” the experience for me but it’s such a load of shit. I’m tired of these companies stealing our data and never getting rightly compensated.
Gmail is hosting your email. Being able to change the domain doesn't change that they're hosting it on their terms. I think there are other email providers that have more privacy-focused policies.
Gemini in search is answering so many of my search questions wrong.
If I ask natural language yes/no questions, Gemini sometimes tells me outright lies with confidence.
It also presents information as authoritative - locations, science facts, corporate ownership, geography - even when it's pure hallucination.
Right at the top of Google search.
edit:
I can't find the most obnoxious offending queries, but here was one I performed today: "how many islands does georgia have?".
Compare that with "how many islands does georgia have? Skidaway Island".
This is an extremely mild case, but I've seen some wildly wrong results, where Google has claimed companies were founded in the wrong states, that towns were located in the wrong states, etc.
At first this was true, but now it has gotten pretty good. The times it gets things wrong are often not the model's fault, just Google Search's fault.
"GVP stands for Good Pharmacovigilance Practice, which is a set of guidelines for monitoring the safety of drugs. SVP stands for Senior Vice President, which is a role in a company that focuses on a specific area of operations."
Seems like there's a lot of pharma regulation in my telecom company.
I am sure Google has the resources to compete in this space. What I'm less sure about is whether Google can monetize AI in a way that doesn't cannibalize their advertising income.
Who the hell wants an AI that has the personality of a car salesman?
No mention of Perplexity yet in the comments but it's obvious to me that they're targeting Perplexity Pro directly with their new Deep Research feature (https://blog.google/products/gemini/google-gemini-deep-resea...). I still wonder why Perplexity is worth $7 billion when the 800-pound gorilla is pounding on their door (albeit slowly).
Just tried Deep Research. It's a much, much slower experience than Perplexity at the moment, taking many minutes to return a result. Maybe it's more extensive, but I use Perplexity a lot for quick information summaries and this is a very different UX.
Haven't used it enough to evaluate the quality, however.
"Slow Perplexity" was something I was pretty set on building, before I dropped it for a different project that got some traction.
Perplexity is a much less versatile product than it could be because of its chase for speed: you can only chew through so many tokens, do so much CoT, etc. in a given amount of time.
They optimized for virality (it's just as fast as Google but gives me more info!) but I suspect it kills the stickiness for a huge number of users since you end up with some "embarrassing misses": stuff that should have been a slam dunk, goes off the rails due to not enough search, or the wrong context being surfaced from the page, etc. and the user just doesn't see value in it anymore.
I know this isn't really a useful comment, but, I'm still sour about the name they chose. They MUST have known about the Gemini protocol. I'm tempted to think it was intentional, even.
It's like Microsoft creating an AI tool and calling it Peertube. "Hurr durr they couldn't possibly be confused; one is a decentralised video platform and the other is an AI tool hurr durr. And ours is already more popular if you 'bing' it hurr durr."
Worth noting that the Gemini models have the ability to write and then execute Python code. I tried that like this:
Here's the result: https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...
It can't make outbound network calls though, so this fails:
Amusingly Gemini itself doesn't know that it can't make network calls, so it tries several different approaches before giving up: https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...
The new model seems very good at vision:
I got back a solid description, see here: https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...