
Exactly. I barely know who this actress is. To me, it sounds like the tens of thousands of other white American voices. How is it remotely too similar?


So much negativity. Is it perfect? No. Is there room for improvement? Definitely. I don't know how you can get so fucking jaded that a demo like this doesn't at least make you a little bit excited, or happy, or awestruck at what humans have been able to accomplish.


You help me feel sane. People, it is a commercial. Nothing more. Don't get your panties in a bundle. If you don't like it, change the channel, don't buy their product, go outside on a hike. The things people get upset about today are fascinating. GO OUTSIDE


Why get so bothered by other people being upset? Apple is going to be fine, you don’t need to worry on their behalf. No need to get your undergarments of choice in a twist. Good opportunity to step outside and get some fresh air.


It doesn’t really ‘bother’ me and I’m not worrying on their behalf. If you’re actually interested: it makes me question whether there’s less emotional resilience in our society, and I value emotional resilience because I think we need it when life truly tests us.


Sounds like you are getting your panties in a bundle about other people getting their panties in a bundle. Why do you care so much what other random people on the internet think?

Maybe it is you who needs to go outside and stop reading these comments which make you feel 'insane'?


And why do you care so much about what foobar thinks, to the point of passive-aggressively asking him?

"Why do you care about X" questions are inane.


....and you have continued the pattern by joining in and asking me the question. Well done!


People are exercising their God-given right; why do you care so much about it?


Getting emotional are we?


This is clear as day. If they had gotten an early lead in the LLM/AI space like OpenAI did with ChatGPT, things would be very different. Attributing the open sourcing to "good will" and Meta being righteous seems like some 16-year-old's ill-founded, overly simplistic view of the world. Meta is a business. Period.


I agree with your analogy. Also, there is quite a bit of "standing on the shoulders of giants" going on. Every company's latest release will (and should) be a bit better than the models released before it. AI enthusiasts are getting a bit annoying - "we got a new leader, boys!!!!" with each new model released.


Yeah if I had to stack rank 'obscene' energy consumption technologies, Bitcoin would be number 1 and AI would be pretty low on the list.


Feels like another pivotal moment in AI. Feel like I’m watching history live. I think I need to go lay down.


Agreed. The whole thing reeks of desperation. Half the video is them jerking themselves off about having done AI longer than anyone, and then they "release" (not actually available in most countries) a model that is only marginally better than the current GPT-4 on cherry-picked metrics after nearly a year of lead time?!?!

That's your response? Ouch.


Have you seen the demo video? It is really impressive, and AFAIK OpenAI does not have a similar product offering at the moment, demo or released.

Google essentially claimed a novel approach of a natively multi-modal LLM, unlike OpenAI's non-native approach, and according to them this has the potential to further improve the LLM state of the art.

They have also backed up their claims in a paper for the world to see, and the results for the Ultra version of Gemini are encouraging, losing only on the sentence-completion dataset to GPT-4. Remember, the new natively multi-modal Gemini has just started out and is only at version 1.0. Imagine it at version 4, as ChatGPT is now. Competition is always good, whether it is desperate or not, because in the end the users win.


If they put the same team on that Gemini video as they do on Pixel promos, you're better off assuming half of it is fake and the other half exaggerated.

Don't buy into marketing. If it's not in your own hands to judge for yourself, then it might as well be literally science fiction.

I do agree with you that competition is good and when massive companies compete it's us who win!


The hype video should be taken with a grain of salt but the level of capability displayed in the video seems probable in the not too distant future even if Gemini can't currently deliver it. All the technical pieces are there for this to be a reality eventually. Exciting times ahead.


It’s not the future, it’s the present. You can build it with GPT-4V today.


I can use GPT-4 right now. Until I can use Gemini, I won't believe a thing Google says.


Apparently, now you can since Bard is already powered by Gemini Pro [1]:

[1] Google’s Bard chatbot is getting way better thanks to Gemini:

https://www.theverge.com/2023/12/6/23989744/google-bard-gemi...


I would like more details on Gemini's 'native' multimodal approach before assuming it is something truly unique. Even if GPT-4V were aligning a pretrained image model and pretrained language model with a projection layer like PaLM-E/LLaVA/MiniGPT-4 (unconfirmed speculation, but likely), it's not as if they are not 'natively' training the composite system of projection-aligned models.

There is nothing in any of Google's claims that precludes the architecture being the same kind of composite system, maybe with some additional blending in of multimodal training earlier in the process than has been published so far. And perhaps, also unlike GPT-4V, they have aligned a pretrained audio model to eliminate the need for a separate speech recognition layer and possibly to solve multi-speaker recognition by voice characteristics, but they didn't even demo that... Even this would not be groundbreaking, though. ImageBind from Meta demonstrated the capacity to align an audio model with an LLM in the same way image models have been aligned with LLMs. I would perhaps even argue that Google skipping the natural-language intermediate step between LLM output and image generation actually supports the position that they may be using projection layers to create interfaces between these modalities. However, this direct image-generation projection example was also a capability published by Meta with ImageBind.

What seems more likely, and not entirely unimpressive, is that they refined those existing techniques for building composite multimodal systems and created something that they plan to launch soon. However, they still have crucially not actually launched it here. Which puts them in a similar position to when GPT-4 was first announced with vision capabilities, but then did not offer them as a service for quite an extended time. Google has yet to ship it, and as a result fails to back up any of their interesting claims with evidence.

Most of Google's demos here are possible with a clever interface layer to GPT-4V + Whisper today. And while the demos 'feel' more natural, there is no claim being made that they are real-time demos, so we don't know how much practical improvement in the interface and user experience would actually be possible in their product when compared to what is possible with clever combinations of GPT-4V + Whisper today.
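To make the 'projection-aligned composite system' idea concrete, here is a minimal PyTorch sketch of the LLaVA/MiniGPT-4-style recipe described above: a frozen vision encoder's patch features are mapped by a small trainable projection into the LLM's token-embedding space and prepended to the text embeddings. This assumes a HuggingFace-style decoder-only LM; the module names and dimensions are illustrative, not anything Google or OpenAI has confirmed about Gemini or GPT-4V.

    import torch
    import torch.nn as nn

    class ProjectionAlignedLM(nn.Module):
        """Toy composite multimodal model: frozen vision encoder + frozen LLM,
        joined only by a small trainable projection (LLaVA/MiniGPT-4 style)."""

        def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.vision_encoder = vision_encoder  # e.g. a ViT, kept frozen
            self.llm = llm                        # HuggingFace-style decoder-only LM, frozen
            # The only new parameters: map patch features into the LLM's embedding space.
            self.projection = nn.Linear(vision_dim, llm_dim)

        def forward(self, pixel_values, input_ids):
            with torch.no_grad():
                patch_feats = self.vision_encoder(pixel_values)       # (B, P, vision_dim), assumed shape
            image_tokens = self.projection(patch_feats)               # (B, P, llm_dim)
            text_embeds = self.llm.get_input_embeddings()(input_ids)  # (B, T, llm_dim)
            # Prepend the projected "image tokens" to the text and run the LLM as usual.
            inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
            return self.llm(inputs_embeds=inputs_embeds)

Whether 'native' means something architecturally deeper than this (e.g. interleaved multimodal pretraining from scratch) is exactly what Google's claims so far don't let us distinguish.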


If what they're competing with is other unreleased products, then they'll have to compete with OpenAI's thing that made all its researchers crap their pants.


What makes it native?


Good question.

Perhaps for audio and video it means directly integrating the spoken sound (audio modality -> LLM) rather than translating the sound to text and feeding the text to the LLM (audio modality -> text modality -> LLM).

But to be honest, I'm guessing here; perhaps LLM experts (or the LLM itself, since they claimed capability comparable to human experts) can verify whether this is truly what they meant by a natively multi-modal LLM.


It's highly unlikely for a generative model to be able to reason about language at this level based on audio features alone. Gemini may use audio cues, but text tokens must be fed into the very early layers of the transformer for complex reasoning to be possible. And because the Gemini paper only mentions a transformer architecture, I don't see a way for them to implement speech-to-text inside such an architecture (while also allowing direct text input). Maybe 'native' here means that such a stack of models was simply trained together.


The transformer architecture is not limited to text-based token input, but again, I'm not an expert on how this new LLM, namely Gemini, is implemented, or whether text-based tokens are necessary. If Google has truly cracked native multi-modal input without the limitation of text-based input, then it's as novel and revolutionary as they claim it to be.


I’m impressed that it’s multimodal and includes audio. GPT-4V doesn’t include audio afaik.

Also I guess I don’t see it as critical that it’s a big leap. It’s more like “That’s a nice model you came up with, you must have worked real hard on it. Oh look, my team can do that too.”

Good for recruiting too. You can work on world class AI at an org that is stable and reliable.



That’s different. It’s essentially using the Whisper model for audio-to-text and feeding that into ChatGPT.

Multimodal would be watching YouTube without captions and asking “how did a certain character know it was raining outside?” based on the sound of rain but no image of rain.
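For contrast, here is roughly what that cascaded (non-native) pipeline looks like with the openai-python v1 client: the audio is collapsed to a transcript before the LLM ever sees it, so non-speech cues like the sound of rain are simply gone. The file name, model names, and prompt are placeholders, not a claim about how any particular product is wired up.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Step 1: speech-to-text. Everything that isn't words (rain, tone,
    # overlapping speakers) is discarded at this point.
    with open("clip_audio.mp3", "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # Step 2: the LLM only ever sees the transcript text.
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Transcript: {transcript.text}\n\n"
                       "How did the character know it was raining outside?",
        }],
    )
    print(answer.choices[0].message.content)

A natively multimodal model would consume the audio (or audio tokens) directly, so it could at least in principle answer from the rain sound itself.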


I don't know if it's related to Gemini, but Bard seems to be able to do this by answering questions like "how many cups of sugar are called for in this video". Not sure if it relies on subtitles or not.

From https://bard.google.com/updates:

> Expanding Bard’s understanding of YouTube videos

> What: We're taking the first steps in Bard's ability to understand YouTube videos. For example, if you’re looking for videos on how to make olive oil cake, you can now also ask how many eggs the recipe in the first video requires.

> Why: We’ve heard you want deeper engagement with YouTube videos. So we’re expanding the YouTube Extension to understand some video content so you can have a richer conversation with Bard about it.


Interesting. Will take Bard for a spin.


Ah that’s right. I guess my question is: is it a true multimodal model (able to produce arbitrary audio), or is it a speech-to-text system (OpenAI has a model called Whisper for this) feeding text to the model and then using text-to-speech to read the output aloud?

Though now that I am reading the Gemini technical report, it can only receive audio as input, it can’t produce audio as output.

Still based on quickly glancing at their technical report it seems Gemini might have superior audio input capabilities. I am not sure of this though now that I think about it.


One of the demo videos explicitly addresses this point: https://youtu.be/D64QD7Swr3s?si=_bBa9aPmqGbo-Iej


Oh that’s actually pretty good then. It also seems it does output audio despite the PDF from google I was reading saying otherwise. Hmm.


Google is stable and reliable?


They can certainly pretend they are for hiring purposes. Compared to a company that fired their CEO, nearly had the whole company walk out, then saw the board ousted and the CEO restored, Google does look more reliable.

Just don’t speak to xooglers about it. ;)


> Compared to a company that [...]

Time to press some keys on my keyboar-

> Just don’t speak to xooglers about it. ;)

Oh shit, nevermind, you get it.


:)


I worked at Google up until 8 weeks ago and knew there _had_ to be a trick --

You know those stats they're quoting for beating GPT-4 and humans? (both are barely beaten)

They're doing K = 32 chain of thought. That means running an _entire self-talk conversation 32 times_.

Source: https://storage.googleapis.com/deepmind-media/gemini/gemini_..., section 5.1.1 paragraph 2
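For anyone unfamiliar with the notation, here is a rough sketch of what chain-of-thought with K = 32 samples amounts to in practice: sample 32 full reasoning chains per question and majority-vote the extracted final answers, optionally falling back to a single greedy answer when consensus is low (my reading of the report's "uncertainty-routed" variant). generate_cot_answer is a hypothetical helper standing in for one sampled chain-of-thought completion; nothing here is taken from Google's actual evaluation code.

    from collections import Counter

    def cot_at_k(question, generate_cot_answer, k=32, consensus_threshold=0.0):
        """Majority-vote over k sampled chain-of-thought completions.

        generate_cot_answer(question, temperature) is a hypothetical helper that
        runs one full reasoning chain and returns only the extracted final answer.
        A consensus_threshold > 0 gives the "uncertainty-routed" behaviour:
        fall back to one greedy answer when the vote is too split.
        """
        answers = [generate_cot_answer(question, temperature=0.7) for _ in range(k)]
        best_answer, votes = Counter(answers).most_common(1)[0]
        if votes / k >= consensus_threshold:
            return best_answer
        return generate_cot_answer(question, temperature=0.0)  # greedy fallback

Either way, the headline number is the product of 32 model runs per question, which is the point about it not being a like-for-like comparison unless the other model is evaluated the same way.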


How do you know GPT-4 is 1-shot? The details about it aren't released; it is entirely possible it does stuff in multiple stages. Why wouldn't OpenAI use their most powerful setup to get better stats, especially when they don't say how they got them?

Google being more open here about what they do is in their favor.


There's a rumour that GPT-4 runs every query either 8x or 16x in parallel, and then picks the "best" answer using an additional AI that is trained for that purpose.


It would have to pick each token then, no? Because you can get a streaming response, which would completely invalidate the idea of the answer being picked after.


It's false; it's the nine-months-down-the-line telephone game of an unsourced rumor re: the mixture-of-experts model. Drives me absolutely crazy.

Extended musings on it, please ignore unless curious about evolution patterns of memes:

Funnily enough, it's gotten _easier_ to talk about over time -- i.e. on day 1 you can't criticize it because it's "just a rumor, how do you know?" -- on day 100 it's even worse because that effect hasn't subsided much, and it spread like wildfire.

On day 270, the same thing that gave it genetic fitness, the alluring simplicity of "ah yes, there's 8x going on", has become the core and only feature of the Nth round of the telephone game. There are no more big expert-sounding words around it to make it seem plausible.


As with most zombie theories, it exists because there is a vacuum of evidence to the contrary, not because it’s true.


It thinks about your question forever...


That genetic fitness is exactly why those stupid conspiracy theories refuse to die out: they've adapted to their hosts.


I recall reading something about it being an MoE (mixture of experts), which would align with what you are saying.


That does make sense if you consider the MIT paper on debating LLMs.


So beam search?


Same way I know the latest BMW isn't running on a lil nuke reactor. I don't, technically. But there's not enough comment room for me to write out the 1000 things that clearly indicate it. It's a "not even wrong" question on your part


Where are you seeing that 32-shot vs 1-shot comparison drawn? In the PDF you linked, it seems like they run it various times using the same technique on both models and just pick the technique that Gemini wins with the most.


This reminds me of their last AI launch. When Bard came out, it wasn't available in EU for weeks (months?). When it finally arrived, it was worse than GPT-3.


Still isn't available in Canada.

Silicon Valley hates Canada.


maybe they are trying to project stability (no pun intended)


Google are masters at jerking themselves off. I mean come on... "Gemini era"? "Improving billions of people’s lives"? Tone it down a bit.

It screams desperation to be seen as ahead of OpenAI.


Google has billions of users whose lives are improved by their products. What is far fetched about this AI improving those product lines?

Sounds like it's you that needs to calm down a bit. God forbid we get some competition.


It's just arrogant to name an era after your own product you haven't even released yet. Let it speak for itself. ChatGPT's release was far more humble and didn't need hyping up to be successful.


If by "lives improved" you mean that they have locked people into their products, spied on them, profiled them, and made us into the product so they can make lots of money, yeah, you're totally right.


I'm a user, and there's no way my life was improved! If anything, they made me sad and miserable. They give you all these nice-looking things that you really should not use; you want to but can't; you make the mistake anyway; it is nice for a while, then taken away.

It would be funny if it only happened 10 or 20 times.


In other words, you keep trying them because they improve your life. This comment would be funny if it wasn't posted 20 times a day.


It needs to be repeated endlessly.

I'm sure they will deliver a great API for this AI, then change it in a way that breaks everything.

You will fix yours; I will delete mine. I will feel dumb. You will improve your life again and again, basically 20+ times. Enjoy!


The Greybeards Of AI...


Yeah seems pretty straight forward to me. Guy has been getting GOOG RSUs for 15 years straight and is now a multi-millionaire. Why would he rock his own boat? It is much easier to ignore any wrongdoing of the hand that feeds.


Microsoft under Satya has really stepped up. Gaming is now larger than TV and Movies in terms of revenue. I know their cloud rollout didn't work as anticipated, but if they get this acquisition right they will have a money printer for the next decade.


Kind of true; however, Microsoft under Satya really killed the Windows desktop and burned bridges with the Windows developer community, which has now decided to re-focus on Win32, Windows Forms, and WPF, ignoring pretty much everything else.

Windows is still the best alternative to expensive Macs (in most countries, for common folks) or the perpetually-around-the-corner Year of Desktop Linux, but not thanks to Satya.


I'd say that Windows and Office are the two areas that Satya didn't really touch or influence in much of any way and mostly let continue on autopilot as he focused on every other area of Microsoft.


Satya's entire strategy seems to be going all in on turning Microsoft into the one-stop-shop enterprise productivity SaaS company. Office is their main product at this point. I don't see how you can say he didn't touch or influence it.


Disagree on the Office part; he is clearly behind the push to move everyone onto the Azure version of Office and the subscription model.


Brilliant leadership? Considering that games out-grossed movies overall about 15 years ago, that trend only continued when the largest game launch of all time beat the largest movie release of all time... and that has happened multiple times in 10 years.

