Hacker News

That's not the only thing wrong. Gemini makes a false statement in the video, serving as a great demonstration of how these models still outright lie so frequently, so casually, and so convincingly that you won't notice, even if you have a whole team of researchers and video editors reviewing the output.

It's the single biggest problem with LLMs and Gemini isn't solving it. You simply can't rely on them when correctness is important. Even when the model has the knowledge it would need to answer correctly, as in this case, it will still lie.

The false statement is after it says the duck floats, it continues "It is made of a material that is less dense than water." This is false; "rubber" ducks are made of vinyl polymers which are more dense than water. It floats because the hollow shape contains air, of course.
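A rough back-of-the-envelope sketch of the point above: an object floats when its average density (total mass over total enclosed volume) is below that of water, even if the shell material itself is denser. The vinyl density and the shell and cavity volumes below are illustrative assumptions, not measurements of any real duck.

```python
# Illustrative assumptions, not measurements of a real toy.
WATER_DENSITY = 1.00   # g/cm^3
VINYL_DENSITY = 1.30   # g/cm^3, assumed; vinyl polymers are denser than water
AIR_DENSITY = 0.0012   # g/cm^3, roughly air at room conditions


def average_density(shell_volume_cm3, cavity_volume_cm3,
                    shell_density=VINYL_DENSITY, fill_density=AIR_DENSITY):
    """Total mass divided by total enclosed volume (shell plus cavity)."""
    mass = shell_volume_cm3 * shell_density + cavity_volume_cm3 * fill_density
    return mass / (shell_volume_cm3 + cavity_volume_cm3)


def floats(shell_volume_cm3, cavity_volume_cm3):
    """True if the object as a whole is less dense than water."""
    return average_density(shell_volume_cm3, cavity_volume_cm3) < WATER_DENSITY


hollow = average_density(20.0, 150.0)  # thin shell around a large cavity
solid = average_density(170.0, 0.0)    # same outer size, solid vinyl
print(f"hollow duck: {hollow:.3f} g/cm^3, floats: {floats(20.0, 150.0)}")
print(f"solid duck:  {solid:.3f} g/cm^3, floats: {floats(170.0, 0.0)}")
```

Setting `fill_density` to zero barely changes the result, which suggests it is the hollow shape, not the air as such, that does the work.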




This seems to be a common view among some folks. Personally, I'm impartial.

Search engines, and even expert human beings, are prone to providing incorrect results. I'm unsure where this expectation of 100% absolute correctness comes from. I'm sure there are use cases, but I assume it's the vast minority and most can tolerate larger than expected inaccuracies.


> I'm unsure where this expectation of 100% absolute correctness comes from.

It's a computer. That's why. Change the concept slightly: would you use a calculator if you had to wonder if the answer was correct or maybe it just made it up? Most people feel the same way about any computer based anything. I personally feel these inaccuracies/hallucinations/whatevs are only allowing them to be one rung up from practical jokes. Like I honestly feel the devs are fucking with us.


Speech to text is often wrong too. So is autocorrect. And object detection. Computers don't have to be 100% correct in order to be useful, as long as we don't put too much faith in them.


Call me old fashioned, but I would absolutely like to see autocorrect turned off in many contexts. I much prefer to read messages with 30% more transparent errors rather than any increase in opaque errors. I can tell what someone meant if I see "elephent in the room", but not "element in the room" (not an actual example, autocorrect would likely get that one right).


Your caveat is not the norm though, as everyone is putting a lot of faith in them. So, that's part of the problem. I've talked with people that aren't developers, but they are otherwise smart individuals that have absolutely not considered that the info is not correct. The readers here are a bit too close to the subject, and sometimes I think it is easy to forget that the vast majority of the population do not truly understand what is happening.


Nah, I don’t think anything has the potential to build critical thinking like LLMs en masse. I only worry that they will get better. It’s when they are 99.9% correct we should worry.


People put too much faith in conspiracy theories they find on YT, TikTok, FB, Twitter, etc. What you're claiming is already not the norm. People already put too much faith into all kinds of things.


Okay, but search is done on a computer, and like the person you’re replying to said, we accept close enough.

I don’t necessarily disagree with your interpretation, but there’s a revealed preference thing going on.

The number of non-tech ppl I’ve heard directly reference ChatGPT now is absolutely shocking.


> The number of non-tech ppl I've heard directly reference ChatGPT now is absolutely shocking.

The problem is that a lot of those people will take ChatGPT output at face value. They are wholly unaware of its inaccuracies or that it hallucinates. I've seen it too many times in the relatively short amount of time that ChatGPT has been around.


So what? People do this with Facebook news too. That's a people problem, not an LLM problem.


People on social media are absolutely 100% posting things deliberately to fuck with people. They are actively seeking to confuse people and cause chaos and divisiveness, among other ill-intended purposes. Unless you're saying that the LLM developers are actively doing the same thing, I don't think comparing what people find on the socials with what they get back as a response from a chatbot is a logical comparison at all.


There are far more people who post obviously wrong, confusing and dangerous things online with total conviction. There are people who seriously believe Earth is flat, for example.


How is that any different from what these AI chatbots are doing? They make stuff up that they predict will be rewarded highly by humans who look at it. This is exactly what leads to truisms like "rubber duckies are made of a material that floats over water" - which looks like it should be correct, even though it's wrong. It really is no different from Facebook memes that are devised to get a rise out of people and be widely shared.


Because we shouldn't be striving to make mediocrity. We should be striving to build better. Unless the devs of the bots are wanting to have a bot built on trying to deceive people, I just don't see the purpose of this. If we can "train" a bot and fine tune it, we should be fine tuning truth and telling it what absolutely is bullshit.

To avoid the darker topics to keep the conversation on the rails, if there were a misinformation campaign that was trying to state that the Earth's sky is red, then the fine tuning should be able to inform that this is clearly fake so when quoting this it should be stated as incorrect information that is out there. This kind of development should be how we can clean up the fake, but nope, we're seemingly quite happy at accepting it. At least that's how your question comes off to me.


Sure, but current AI bots are just following the human feedback they get. If the feedback is naïve enough to score the factoid about rubber duckies as correct, guess what, that's the kind of thing these AIs will target. You can try to address this by prompting them with requests like "do you think this answer is correct and ethical? Think through this step by step" ('reinforcement learning from AI feedback') but that's very ad hoc and uncertain - ultimately, the humans in the loop call the shots.


At the end of the day, if there is no definitive answer to a question, it should respond in such a manner. "While there are compelling reasons to think A or B, neither A nor B has been verified. They are just the leading theories." That would be a much better answer than "Option A is the answer even if some people think B is." when A is just as unproven as B, but because it answers so definitively, people think it is the right answer.

So the labels thing is something that obviously will never work. But the system has all of the information it needs to know if the question is definitively answerable. If it is not, do not phrase the response definitively. At this point, I'd be happy if it responded to "Is 1+1 = 2?" with a wishy-washy answer like, "Most people would agree that 1+1 = 2", and if it wanted to say "In base 10, that is the correct answer. However, in base 2, 1+1 = 10", that would also be acceptable. Fake it till you make it is not the solution here.
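The base-10 versus base-2 remark is easy to check. A tiny sketch: the sum is the same number either way, and only its written representation differs.

```python
total = 1 + 1

decimal_text = format(total, "d")  # the value written in base 10
binary_text = format(total, "b")   # the same value written in base 2

print(decimal_text, binary_text)

# Reading the string "10" as a base-2 numeral gives back the same number.
same_value = int(binary_text, 2) == total
print(same_value)
```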


If we rewind a little bit to the mid to late 2010s, filter bubbles, recommendation systems and unreliable news being spread on social media were a big problem. It was a simpler time, but we never really solved the problem. Point is, I don’t see the existence of other problems as an excuse for LLM hallucination, and writing it off as a “people problem” really undersells how hard it is to solve people problems.


Literally everything is a "people problem"

You can kill people with a fork, it doesn't mean you should legally be allowed to own a nuclear bomb "because it's just the same". The problem always comes from scale and accessibility.


So you're saying we need a Ministry of Truth to protect people from themselves? This is the same argument used to suppress "harmful" speech on any medium.


I've gotten to the point where I want "advertisement" stamped on anything that is, and I'm getting to the point I want "fiction" stamped on anything that is. I have no problem with fiction existing. It can be quite fun. People trying to pass fiction as fact is a problem though. Trying to force a "fact" stamp would be problematic though, so I'd rather label everything else.

How to enforce it is the real sticky wicket though, so it's only something best discussed at places like this or while sitting around chatting while consuming


And who gets to control the "fiction" stamp? Especially for hot button topics like covid (back in 2020)? Should asking an LLM about lab leak theory be auto-stamped with "fiction" since it's not proven? But then what if it's proven later?


why should all computing be deterministic?

Let me show you what this "genius"/"wrong-thinking" person has to say about AL (artificial life) and deterministic computing.

https://www.cs.unm.edu/~ackley/

https://www.youtube.com/user/DaveAckley

To sum up a bunch of their content: You can make intractable problems solvable/crunchable if you allow just a little error into the result (which is reduced the longer the calculation calculates). And this is acceptable for a number of use cases where initial accuracy is less important than instant feedback.
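As a generic illustration of trading accuracy for speed (this is not Ackley's model, just a standard Monte Carlo estimate chosen for the example), here is a sketch of a computation whose answer is usable early and whose error tends to shrink the longer it runs. The function name and sample counts are assumptions made for illustration.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi from random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / num_samples


# More samples cost more time and typically give a smaller error.
for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```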

It is radically different from a Von Neumann model of a computer, where a deterministic 'totalitarian finger pointer' pointing to some register (and only one register at a time) is an inherently limiting factor. In this model, each computational resource (a unit of RAM and a processing unit) fights for and coordinates reality with its neighbors without any central coordination.

Really interesting stuff. still in its infancy...


"Computer says no" is not a meme for no reason.


I'm a software engineer, and I more or less stopped asking ChatGPT for stuff that isn't mainstream. It just hallucinates answers and invents config file options or language constructs. Google will maybe not find it, or give you an occasional outdated result, but it rarely happens that it just finds stuff that's flat out wrong (in technology at least).

For mainstream stuff on the other hand ChatGPT is great. And I'm sure that Gemini will be even better.


The important thing is that with Web Search as a user you can learn to adapt to varying information quality. I have a higher trust for Wikipedia.org than I do for SEO-R-US.com, and Google gives me these options.

With a chatbot that's largely impossible, or at least impractical. I don't know where it's getting anything from - maybe it trained on a shitty Reddit post that's 100% wrong, but I have no way to tell.

There has been some work (see: Bard, Bing) where the LLM attempts to cite its sources, but even then that's of limited use. If I get a paragraph of text as an answer, is the expectation really that I crawl through each substring to determine their individual provenances and trustworthiness?

The shape of a product matters. Google as a linker introduces the ability to adapt to imperfect information quality, whereas a chatbot does not.

As an exemplar of this point - I don't trust when Google simply pulls answers from other sites and shows it in-line in the search results. I don't know if I should trust the source! At least there I can find out the source from a single click - with a chatbot that's largely impossible.


> it rarely happens that it just finds stuff that's flat out wrong

"Flat out wrong" implies determinism. For answers which are deterministic such as "syntax checking" and "correctness of code" - this already happens.

ChatGPT, for example, will write and execute code. If the code has an error or returns the wrong result it will try a different approach. This is in production today (I use the paid version).
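A minimal sketch of the write, run, and retry loop described above. This is not how ChatGPT's code execution is actually implemented; the hard-coded `attempts` list is a hypothetical stand-in for successive model outputs, and `run_candidate` and `first_working` are names invented for the example. Running `exec` on untrusted code is unsafe outside a sandbox.

```python
def run_candidate(source, func_name, test_cases):
    """Execute candidate source; return True if it passes every test case."""
    namespace = {}
    try:
        exec(source, namespace)      # may raise SyntaxError, NameError, ...
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                 # any error counts as a failed attempt


def first_working(candidates, func_name, test_cases):
    """Try candidates in order; return the first that passes, else None."""
    for source in candidates:
        if run_candidate(source, func_name, test_cases):
            return source
    return None


# Hypothetical "model outputs": stand-ins for successive attempts.
attempts = [
    "def add(a, b) return a + b",        # syntax error
    "def add(a, b):\n    return a - b",  # runs, but returns the wrong result
    "def add(a, b):\n    return a + b",  # passes the checks
]
tests = [((1, 2), 3), ((-1, 1), 0)]
print(first_working(attempts, "add", tests))
```

The loop only catches errors that the checks can detect, which is why it helps with syntax and testable behavior but not with claims that have no executable check.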


Dollars to doughnuts says they are using GPT3.5.


I'm currently working with some relatively obscure but open source stuff (JupyterLite and Pyodide) and ChatGPT 4 confidently hallucinates APIs and config options when I ask it for help.

With more mainstream libraries it's pretty good though


I use chatgpt4 for very obscure things

If I ever worried about being quoted then I’ll verify the information

otherwise I’m conversational, have taken an abstract idea into a concrete one and can build on top of it

But I’m quickly migrating over to mistral and if that starts going off the rails I get an answer from chatgpt4 instead


I know exactly where the expectation comes from. The whole world has demanded absolute precision from computers for decades.

Of course, I agree that if we want computers to “think on their own” or otherwise “be more human” (whatever that means) we should expect a downgrade in correctness, because humans are wrong all the time.


> The whole world has demanded absolute precision from computers for decades.

Computer engineers maybe. I think the general population is quite tolerant of mistakes as long as the general value is high.

People generally assign very high value to things computers do. To test this hypothesis all you have to do is ask folks to go a few days without their computer or phone.


> The whole world has demanded absolute precision from computers

The opposite. Far too tolerant of the excuse "sorry, computer mistake." (But yeah, just at the same time as "the computer says so".)


Is it less reliable than an encyclopedia? Is it less reliable than Wikipedia? Those aren't infallible but what's the expectation if it's wrong on something relatively simple?

With the rush of investment dollars and the push to use these in places like healthcare, government, security, etc., there should be absolute precision.


Humans are imperfect, but this comes with some benefits to make up for it.

First, we know they are imperfect. People seem to put more faith into machines, though I do sometimes see people being too trusting of other people.

Second, we have methods for measuring their imperfection. Many people develop ways to tell when someone is answering with false or unjustified confidence, at least in fields they spend significant time in. Talk to a scientist about cutting edge science and you'll get a lot of 'the data shows', 'this indicates', or 'current theories suggest'.

Third, we have methods to handle false information that causes harm. Not always perfect methods, but there are systems of remedies available when experts get things wrong, and these even include some level of judging reasonable errors from unreasonable errors. When a machine gets it wrong, who do we blame?


Absolutely! And fourth, we have ways to make sure the same error doesn't happen again; we can edit Wikipedia, or tell the person they were wrong (and stop listening to them if they keep being wrong).


I find it ironic that computer scientists and technologists are frequently uberrationalists to the point of self parody but they get hyped about a technology that is often confidently wrong.

Just like the hype with AI and the billions of dollars going into it. There’s something there but it’s a big fat unknown right now whether any part of the investment will actually pay off - everyone needs it to work to justify any amount of the growth of the tech industry right now. When everyone needs a thing to work, it starts to really lose the fundamentals of being an actual product. I’m not saying it’s not useful, but is it as useful as the valuations and investments need it to be? Time will tell.


>I'm unsure where this expectation of 100% absolute correctness comes from. I'm sure there are use cases, but I assume it's the vast minority and most can tolerate larger than expected inaccuracies.

As others hinted at, there's some bias because it's coming from a computer, but I think it's far more nuanced than that.

I've worked with many experts and professionals through my career, across medicine, various types of engineering, science, academia, research and so on, and what always bothers me is the level of certainty they present. The same pattern is often embedded in LLM responses.

While humans don't typically quantify the certainty of their statements, the best SMEs I've ever worked with make it very clear what level of certainty they have when making professional statements. The SMEs who seem to be more often wrong than not speak in certainty quite often (some of this is due to cultural pressures and expectations surrounding being an "expert").

In this case, I would expect a seasoned scientist to say something like this in response to the duck question: "many rubber ducks exist and are designed to float, this one very well might, we'd really need to test it or have far more information about the composition of the duck, the design, the medium we want it in (Water? Mercury? Helium?)" and so on. It's not an exact answer but you understand there's uncertainty there and we need to better clarify our question and the information surrounding that question. The fact is, it's really complex to know if it'll float or not from visual information alone.

It could have an osmium ball inside that overcomes most of the assumed buoyancy the material contains, including the air demonstrated to make it squeak. It's not transparent. You don't know for sure and the easiest way to alleviate uncertainty in this case is simply to test it.

There's so much uncertainty in the world, around what seem like the most certain and obvious things. LLMs seem to have grabbed some of this bad behavior from human language and culture where projecting confidence is often better (for humans) than being correct.


Most people I've worked with tell me either "I don't know" or "I think x, but I'm not sure" when they are not sure about something. The issue with LLMs is they don't have this concept.


The bigger problem is lack of context. When I speak with a person or review search results, I can use what I know about the source to evaluate the information I'm given. People have different areas of expertise and use language and mannerisms to communicate confidence in their knowledge or lack thereof. Websites are created by people (most times) and have a number of contextual clues that we have learned to interpret over the years.

LLMs do none of this. They pose as a confident expert on almost everything, and are just as likely to spit out BS as a true answer. They don't cite their sources, and if you ask for the source sometimes they provide ones that don't contain the information cited or don't even exist. If you hired a researcher and they did that you wouldn't hire them again.


1. Humans may also never be 100% - but it seems they are more often correct. 2. When AI is wrong it's often not only slightly off, but completely off the rails. 3. Humans often tell you when they are not sure. Even if it's only their tone. AI is always 100% convinced it's correct.


It’s not AI it’s a machine learning model


If it’s no better than asking a random person, then where is the hype? I already know lots of people who can give me free, maybe incorrect guesses to my questions.

At least we won’t have to worry about it obtaining god-like powers over our society…


> At least we won’t have to worry about it obtaining god-like powers over our society…

We all know someone who's better at self promotion than at whatever they're supposed to be doing. Those people often get far more power than they should have, or can handle—and ChatGPT is those people distilled.


Let's see, so we exclude law, we exclude medical.. it's certainly not a "vast minority" and the failure cases are nothing at all like search or human experts.


Are you suggesting that failure cases are lower when interacting with humans? I don't think that's my experience at all.

Maybe I've only ever seen terrible doctors but I always cross reference what doctors say with reputable sources like WebMD (which I understand likely contain errors). Sometimes I'll go straight to WebMD.

This isn't a knock on doctors - they're humans and prone to errors. Lawyers, engineers, product managers, teachers too.


You think that if you ask your legal assistant to find some precedents related to your current case, they will come back with an A4 page full of made-up cases that sound vaguely related and convincing but are not real? I don't think you understand the failure case at all.


That example seems a bit hyperbolic. Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

What I'm saying is that the tolerance for mistakes is strongly correlated to the value ChatGPT creates. I think both will need to be improved but there's probably more opportunity in creating higher value.

I don't have a horse in the race.


> Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

I generally agree with you, but it's funny that you use this as an example when it already happened. https://arstechnica.com/tech-policy/2023/06/lawyers-have-rea...


facepalm


> Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research

I really don’t recommend using ChatGPT (even GPT-4) for legal research or analysis. It’s simply terrible at it if you’re examining anything remotely novel. I suspect there is a valuable RAG application to be built for searching and summarizing case law, but the “reasoning” ability and stored knowledge of these models is worse than useless.


> Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

You don't?

https://fortune.com/2023/06/23/lawyers-fined-filing-chatgpt-...


What would be the point of a lawyer using ChatGPT if they had to root through every single reference ChatGPT relied upon? I don't have to double-check every reference of a junior attorney, because they actually know what they are doing, and when they don't, it's easy to tell and won't come with fraudulently created decisions, pleadings, etc.


> Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

Oh dear.


Guessing from the last sentence that you are one of those "most" who "can tolerate larger than expected inaccuracies".

How much inaccuracy would that be?


Where did you get the 100% number from? It's not in the original comment, it's not in a lot of similar criticisms of the models.


Honestly I agree. Humans make errors all the time. Perfection is not necessary and requiring perfection blocks deployment of systems that represent a substantial improvement over the status quo despite their imperfections.

The problem is a matter of degree. These models are substantially less reliable than humans and far below the threshold of acceptability in most tasks.

Also, it seems to me that AI can and will surpass the reliability of humans by a lot. Probably not by simply scaling up further or by clever prompting, although those will help, but by new architectures and training techniques. Gemini represents no progress in that direction as far as I can see.


There's a huge difference between demonstrating something with fuzzy accuracy and playing something off as if it's giving good, correct answers. An honest way to handle that would be to highlight where the bot got it wrong instead of running with the answer as if it was right.

Deception isn't always outright lying. This video was deceitful in form and content and presentation. Their product can't do what they're implying it can, and it was put together specifically to mislead people into thinking it was comparable in capabilities to gpt-4v and other competitors' tech.

Working for Google AI has to be infuriating. They're doing some of the most cutting edge research with some of the best and brightest minds in the field, but their shitty middle management and marketing people are doing things that undermine their credibility and make them look like untrustworthy fools. They're a year or more behind OpenAI and Anthropic, barely competitive with Meta, and they've spent billions of dollars more than any other two companies, with a trashcan fire for a tech demo.

It remains to be seen whether they can even outperform Mistral 7b or some of the smaller open source models, or if their benchmark numbers are all marketing hype.


If a human expert gave wrong answers as often and as confidently as LLMs, most would consider no longer asking them. Yet people keep coming back to the same LLM despite the wrong answers to ask again in a different way (try that with a human).

This insistence on comparing machines to humans to excuse the machine is as tiring as it is fallacious.


Aside: this is not what impartial means.


To be fair, one could describe the duck as being made of air and vinyl polymer, which in combination are less dense than water. That's not how humans would normally describe it, but that's kind of arbitrary; consider how aerogel is often described as being mostly made of air.


Is an aircraft carrier made of a material that is less dense than water?


I think you can safely say that air is a critical component of an aircraft carrier. I suppose the frame of it is not made of air, but the ballasts are designed with air in mind and are certainly made to utilize air. The whole system fails without air, meaning that it requires air to function. It comes down to a definitional argument of the word "made" which is pointless.


I guess it's a purely philosophical question. But no normal person would say "my house is made of air" or "atoms are made of vacuum".


only if you average it out over volume :P


Is an aircraft carrier made of metal and air? Or just metal?


Where’s the distinction between the air that is part of the boat, and the air that is not? If the air is included in the boat, should we all be wearing life vests?


If I take all of the air out of a toy duck, it is still a toy duck. If I take all of the vinyl/rubber out of a toy duck, it is just the atmosphere remaining


The material of the duck is not air. It's not sealed. It would still be a duck in a vacuum and it would still float on a liquid the density of water too.


Well this seems like a huge nitpick. If a person said that, you would afford them some leeway, maybe they meant the whole duck, which includes the hollow part in the middle.

As an example, when most people say a balloon's lighter than air, they mean an inflated balloon with hot air or helium, but you catch their meaning and don't rush to correct them.


The model specifically said that the material is less dense than water. If you said that the material of a balloon is less dense than air, very few people would interpret that as a correct statement, and it could be misleading to people who don't know better.

Also, lighter-than-air balloons are intentionally filled with helium and sealed; rubber ducks are not sealed and contain air only incidentally. A balloon in a vacuum would still contain helium (if strong enough) but would not rise, while a rubber duck in a vacuum would not contain air but would still easily float on a liquid of similar density to water.

The reason why it seems like a nitpick is that this is such an inconsequential thing. Yeah, it's a false statement but it doesn't really matter in this case, nobody is relying on this answer for anything important. But the point is, in cases where it does matter these models cannot be trusted. A human would realize when the context is serious and requires accuracy; these models don't.


I’m not an expert but I suspect that this aspect of lack of correctness in these models might be fundamental to how they work.

I suppose there are two possible solutions: one is a new training or inference architecture that somehow understands “facts”. I’m not an expert so I’m not sure how that would work, but from what I understand about how a model generates text, “truth” can’t really be an element in the training or inference that affects the output.

The second would be a technology built on top of the inference to check correctness, some sort of complex RAG. Again, I'm not sure how that would work in a real-world way.
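To make the second idea concrete, here is a deliberately tiny sketch of a post-hoc check against a trusted store. It is a toy under heavy assumptions, not a real RAG system: `TRUSTED_FACTS` and `check_claim` are hypothetical, retrieval is a plain dictionary lookup, and the hard part (turning free text into structured claims) is skipped entirely.

```python
# Hypothetical trusted store; a real system would retrieve from documents.
TRUSTED_FACTS = {
    ("vinyl", "denser_than_water"): True,
    ("air", "denser_than_water"): False,
}


def check_claim(subject, prop, claimed_value, facts=TRUSTED_FACTS):
    """Return 'supported', 'contradicted', or 'unknown' for a structured claim."""
    key = (subject, prop)
    if key not in facts:
        return "unknown"   # nothing retrieved: do not assert either way
    return "supported" if facts[key] == claimed_value else "contradicted"


# The demo's statement about the duck's material, reduced to a structured claim.
print(check_claim("vinyl", "denser_than_water", False))
# A subject the store knows nothing about.
print(check_claim("osmium", "denser_than_water", True))
```

The "unknown" branch is the interesting one: a checker like this can only flag what its store covers, so it limits errors without guaranteeing correctness.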

I say it might be fundamental to how the model works because as someone pointed out below, the meaning of the word “material” could be interpreted as the air inside the duck. The model’s answer was correct in a human sort of way, or to be more specific, in a way that is consistent with how a model actually produces an answer: it outputs in the context of the input. If you asked it if PVC is heavier than water it would answer correctly.

Because language itself is inherently ambiguous and the model doesn’t actually understand anything about the world, it might turn out that there’s no universal way for a model to know what’s true or not.

I could also see a version of a model that is “locked down” but can verify the correctness of its statements, but in a way that limits its capabilities.


> this aspect of lack of correctness in these models might be fundamental to how they work.

Is there some sense in which this isn't obvious to the point of triviality? I keep getting confused because other people seem to keep being surprised that LLMs don't have correctness as a property. Even the most cursory understanding of what they're doing makes it clear that it is, fundamentally, predicting words from other words. I am also capable of predicting words from other words, so I can guess how well that works. It doesn't seem to include correctness even as a concept.

Right? I am actually genuinely confused by this. How is it that people think it could be correct in a systematic way?


I think very few people on this forum believe LLMs are correct in a systematic way, but a lot of people seem to think there's something more than predicting words from other words.

Modern machine learning models contain a lot of inscrutable inner layers, with far too many billions of parameters for any human to comprehend, so we can only speculate about what's going on. A lot of people think that, in order to be so good at generating text, there must be a bunch of understanding of the world in those inner layers.

If a model can write convincingly about a soccer game, producing output that's consistent with the rules, the normal flow of the game and the passage of time - to a lot of people, that implies the inner layers 'understand' soccer.

And anyone who noodled around with the text prediction models of a few decades ago, like Markov chains, Bayesian text processing, sentiment detection and things like that can see that LLMs are massively, massively better than the output from the traditional ways of predicting the next word.
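For readers who never played with the older approach, here is a minimal sketch of a bigram Markov chain, one of the simplest forms of "predicting words from other words". The corpus and function names are made up for illustration; nothing here resembles how an LLM is built.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, length, seed=0):
    """Walk the chain: repeatedly sample a successor of the last word."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        successors = model.get(output[-1])
        if not successors:
            break                  # dead end: this word never had a successor
        output.append(rng.choice(successors))
    return " ".join(output)


corpus = "the duck floats on the water and the duck is hollow"
model = build_bigram_model(corpus)
print(generate(model, "the", 8))
```

Each word depends only on the one before it, so the output is locally plausible and globally aimless, which is the gap the comment above is pointing at.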


> Is there some sense in which this isn't obvious to the point of triviality?

This is maybe a pedantic "yes", but is also extremely relevant to the outstanding performance we see in tasks like programming. The issue is primarily the size of the correct output space (that is, the output space we are trying to model) and how that relates to the number of parameters. Basically, there is a fixed upper bound on the amount of complexity that can be encoded by a given number of parameters (obvious in principle, but we're starting to get some theory about how this works). Simple systems or rather systems with simple rules may be below that upper bound, and correctness is achievable. For more complex systems (relative to parameters) it will still learn an approximation, but error is guaranteed.

I am speculating now, but I seriously suspect that the space of not only one or more human languages but also every fact that we would want to encode into one of these models is far too big for correctness to ever be possible without RAG. At least not without some massive pooling of compute, which long term may not be out of the question but would likely never be intended for individual use.

If you're interested, I highly recommend checking out some of the recent work around monosemanticity for what fleshing out the relationship between model-size and complexity looks like in the near term.


Just to play devil’s advocate: we can train neural networks to model some functions exactly, given sufficient parameters. For example simple functions like ax^2 + bx + c.

The issue is that “correctness” isn’t a differentiable concept. So there’s no gradient to descend. In general, there’s no way to say that a sentence is more or less correct. Some things are just wrong. If I say that human blood is orange that’s not more incorrect than saying it’s purple.
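
To make the quadratic case concrete, here is a rough sketch. Note the swap: this is not a neural network, just a three-parameter model a*x^2 + b*x + c fit by plain gradient descent on mean squared error. It only illustrates that when the target lies inside the model family and the loss is differentiable, descent can recover it essentially exactly. The target coefficients, learning rate and step count are arbitrary choices of mine.

```python
def fit_quadratic(xs, ys, lr=0.05, steps=5000):
    """Fit y = a*x^2 + b*x + c by gradient descent on mean squared error."""
    a = b = c = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for x, y in zip(xs, ys):
            err = (a * x * x + b * x + c) - y
            # Partial derivatives of the mean of err^2 w.r.t. a, b, c.
            ga += 2 * err * x * x / n
            gb += 2 * err * x / n
            gc += 2 * err / n
        a -= lr * ga
        b -= lr * gb
        c -= lr * gc
    return a, b, c

# Hypothetical target: 2x^2 - 3x + 1, sampled at 21 points in [-1, 1].
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x * x - 3.0 * x + 1.0 for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)
```

The squared error here tells the optimizer how far off it is and in which direction. There is no comparable signal for "blood is orange" versus "blood is purple".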


Because it is assumed that it can think or/and reason. In this case, knowing the concepts of density, the density of a material, detecting the material from an image, detecting what object this image is. And, most importantly, knowing that this object is not solid. Because then it could not float.


Maybe you simplify a bit what "guessing words from other words" means. HOW do you guess this, is what's mysterious to many: you can guess words from other words due to habit of language, a model of mind of how other people expect you to predict, a feedback loop helping you do it better over time if you see people are "meh" at your bad predictions, etc.

So if the chatbot is used to talking, knows what you'd expect, and listens to your feedback, why wouldn't it also want to tell the truth like you would instinctively, even best effort only ?

Sadly, the chatbot doesn't yet really care about the game it's playing, it doesn't want to make it interesting, it's just like a slave producing minimal low-effort outputs. I've talked to people exploited for money in dark places, and when they "seduce" you, they talk like a chatbot: most of it is lie, it just has to convince you a little bit to go their way, they pretend to understand or care about what you say, but end of the day, the goal is for you to pay. Like the chatbot.


Yeah. I think there's some ambiguity around the meaning of reasoning- because it is a kind of reasoning to say a Duck's material is less dense than water. In a way it's reasoned that out, and it might actually say something about the way a lot of human reasoning works.... (especially if you've ever listened to certain people talk out loud and say to yourself... huh?)


Bing chat uses GPT-4 and cites sources from its retrieval.


I think this problem needs to be solved at a higher level, and in fact Bard is doing exactly that. The model itself generates its output, and then higher-level systems can fact check it. I've heard promising things about feeding back answers to the model itself to check for consistency and stuff, but that should be a higher level function (and seems important to avoid infinite recursion or massive complexity stemming from the self-check functionality).


I'm not a fan of current approaches here. "Chain of thought" or other approaches where the model does all its thinking using a literal internal monologue in text seem like a dead end. Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too. Unfortunately it seems that Gemini represents no progress in this direction.


> "Chain of thought" or other approaches where the model does all its thinking using a literal internal monologue in text seem like a dead end. Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too.

Insofar as we can say that models think at all between the input and the stream of tokens output, they do it nonverbally. Forcing the structure to reduce some of it to verbal form short of the actual response-of-concern does not change that, just as the fact that humans reduce some of their thought to verbal form to work through problems doesn't change that human thought is mostly nonverbal.

(And if you don't consider what goes on between input and output thought, then chain of thought doesn't force all LLM thought to be verbal, because only the part that comes out in words is "thought" to start with in that case -- you are then saying that the basic architecture, not chain of thought prompting, forces all thought to be verbal.)


You're right, the models do think non-verbally. However, crucially, they can only do so for a fixed amount of time for each output token. What's needed is a way for them to think non-verbally continuously, and decide for themselves when they've done enough thinking to output the next token.


Is it clear that humans can think nonverbally (including internal monologue) continuously? As in, for difficult reasoning tasks, do humans benefit a lot from extra time if they are not allowed internal monologue? Genuine question.


The point of “verbalizing” the chain of thought isn’t that it’s the most effective method. And frankly I don’t think it matters that humans think non verbally. The goal isn’t to create a human in a box. Verbalizing the chain of thought allows us to audit the thought process, and also create further labels for training.


No, the point of verbalizing the chain of thought is that it's all we know how to do right now.

> And frankly I don’t think it matters that humans think non verbally

You're right, that's not the reason non-verbal is better, but it is evidence that non-verbal is probably better. I think the reason it's better is that language is extremely lossy and ambiguous, which makes a poor medium for reasoning and precise thinking. It would clearly be better to think without having to translate to language and back all the time.

Imagine you had to solve a complicated multi-step physics problem, but after every step of the solution process your short term memory was wiped and you had to read your entire notes so far as if they were someone else's before you could attempt the next step, like the guy from Memento. That's what I imagine being an LLM using CoT is like.


I mean a lot of problems are amenable to subdivision into parts where the process of each part is not needed for the other parts. It's not even clear that humans usually hold in memory all of the process of the previous parts, especially if it won't be used later.


> Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too.

That's a very interesting point, both technically and philosophically.

Where Gemini is "multi-modal" from training, how close do you think that gets? Do we know enough about neurology to identify a native language in which we think? (not rhetorical questions, I'm really wondering)


Neural networks are only similar to brains on the surface. Their learning process is entirely different and their internal architecture is different as well.

We don’t use neural networks because they’re similar to brains. We use them because they are arbitrary function approximators and we have an efficient algorithm (backprop) coupled with hardware (GPUs) to optimize them quickly.


I, a non-AGI, just ‘hallucinated’ yesterday. I hallucinated that my plan was to take all of Friday off and started wondering why I had scheduled morning meetings. I started canceling them in a rush. In fact, all week I had been planning to take a half day, but somehow my brain replaced the idea of a half day off with a full day off. You could have asked me and I would have been completely sure that I was taking all of Friday off.


EDIT: never mind, I missed the exact wording about being "made of a material..." which is definitely false then. Thanks for the correction below.

Preserving the original comment so the replies make sense:

---

I think it's a stretch to say that's false.

In a conversational human context, saying it's made of rubber implies it's a rubber shell with air inside.

It floats because it's rubber [with air] as opposed to being a ceramic figurine or painted metal.

I can imagine most non-physicist humans saying it floats because it's rubber.

By analogy, we talk about houses being "made of wood" when everybody knows they're made of plenty of other materials too. But the context is instead of brick or stone or concrete. It's not false to say a house is made of wood.


> In a conversational human context, saying it's made of rubber implies it's a rubber shell with air inside.

Disagree. It could easily be solid rubber. Also, it's not made of rubber, and the model didn't claim it was made of rubber either, so it's irrelevant.

> It floats because it's rubber [with air] as opposed to being a ceramic figurine or painted metal.

A ceramic figurine or painted metal in the same shape would float too. The claim that it floats because of the density of the material is false. It floats because the shape is hollow.

> It's not false to say a house is made of wood.

It's false to say a house is made of air simply because its shape contains air.


This is what the reply was:

> Oh, if it's squeaking then it's definitely going to float.

> It is a rubber duck.

> It is made of a material that is less dense than water.

Full points for saying if it's squeaking then it's going to float.

Full points for saying it's a rubber duck, with the implication that rubber ducks float.

Even with all that context though, I don't see how "it is made of a material that is less dense than water" scores any points at all.


Yeah, I think arguing the logic behind these responses misses the point, since an LLM doesn't use any kind of logic--it just responds in a pattern that mimics the way people respond. It says "it is made of a material that is less dense than water" because that is a thing that is similar to what the samples in its training corpus have said. It has no way to judge whether it is correct, or even what the concept of "correct" is.

When we're grading the "correctness" of these answers, we're really just judging the average correctness of Google's training data.

Maybe the next step in making LLMs more "correct" is not to give them more training data, but to find a way to remove the bad training data from the set?


I don't see it as a problem with most non-critical use cases (critical being things like medical diagnoses, controlling heavy machinery or robotics, etc).

LLMs right now are most practical for generating templated text and images, which when paired with an experienced worker, can make them orders of magnitude more productive.

Oh, DALL-E created graphic images with a person with 6 fingers? How long would it have taken a pro graphic artist to come up with all the same detail but with perfect fingers? Nothing there they couldn't fix in a few minutes and then SHIP.


>> Nothing there they couldn't fix in a few minutes and then SHIP.

If by ship, you mean put directly into the public domain then yes.

https://www.goodwinlaw.com/en/insights/publications/2023/08/...

and for more interesting takes: https://www.youtube.com/watch?v=5WXvfeTPujU&


After asserting it's a rubber duck, there are some claims without follow-up:

- Just after that it doesn't translate the "rubber" part

- It states there's no land nearby for it to rest or find food in the middle of the ocean: if it's a rubber duck it doesn't need to rest nor feed. (That's a missed opportunity to mention the infamous "Friendly Floatees spill"[1] in 1992 as some rubber ducks floated to that map position). Although it seems to recognize geographical features of the map, it fails to mention Easter Island is relatively nearby. And if it were recognized as a simple duck — which it described as a bird swimming in the water — it seems oblivious to the fact that the duck might feed itself in the water. It doesn't mention either that the size of the duck seems abnormally big in that map context.

- The concept of friends and foes doesn't apply to a rubber duck either. Btw labeling the duck picture as a friend and the bear picture as a foe seems arbitrary (e.g. a real duck can be very aggressive even with other ducks.)

Among other things, the astronomical riddle seems also flawed to me: it answered "The correct order is Sun, Earth, Saturn".

I'd like for it to state :

- the premises it used, like "Assuming it depicts the Sun, Saturn and the Earth" (there are other stars, other ringed-planets, and the Earth similarity seems debatable)

- the sorting criteria it used (e.g. using another sorting key like the average distance from us "Earth, Sun, Saturn" can be a correct order)

[1] https://en.wikipedia.org/wiki/Friendly_Floatees_spill
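
To illustrate the point about sorting criteria: the same three bodies come out in a different "correct order" depending on the key. This is a small sketch with rounded mean distances from the Sun in AU, and the Earth-relative key is a crude proxy I picked (difference of orbital radii), not a proper average-distance calculation.

```python
# Approximate mean distances from the Sun, in astronomical units (AU).
DISTANCE_FROM_SUN_AU = {"Sun": 0.0, "Earth": 1.0, "Saturn": 9.58}

def order_by_distance_from_sun(bodies):
    """Sort bodies by their mean distance from the Sun."""
    return sorted(bodies, key=lambda name: DISTANCE_FROM_SUN_AU[name])

def order_by_distance_from_earth(bodies):
    """Sort by a crude proxy for distance from Earth: difference in orbital radius."""
    earth = DISTANCE_FROM_SUN_AU["Earth"]
    return sorted(bodies, key=lambda name: abs(DISTANCE_FROM_SUN_AU[name] - earth))

bodies = ["Saturn", "Earth", "Sun"]
print(order_by_distance_from_sun(bodies))    # ['Sun', 'Earth', 'Saturn']
print(order_by_distance_from_earth(bodies))  # ['Earth', 'Sun', 'Saturn']
```

Both outputs are defensible, which is why stating the sorting key matters.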


I did some reading and it seems that rubber's relative density to water has to do with its manufacturing process. I see a couple of different quotes on the specific gravity of so-called 'natural rubber', and most claim it's lower than water.

Am I missing something?

I asked both Bard (Gemini at this point I think?) and GPT-4 why ducks float, and they both seemed accurate: they talked about the density of the material plus the increased buoyancy from air pockets and went into depth on the principles behind buoyancy. When pressed they went into the fact that "rubber"'s density varies by the process and what it was adulterated with, and if it was foamed.

I think this was a matter of the video being a brief summary rather than a falsehood. But please do point out if I'm wrong on the rubber bit, I'm genuinely interested.

I agree that hallucinations are the biggest problems with LLMs, I'm just seeing them get less commonplace and clumsy. Though, to your point, that can make them harder to detect!


Someone on Twitter was also skeptical that the material is more dense than water. I happened to have a rubber duck handy so I cut a sample of material and put it in water. It sinks to the bottom.

Of course the ultimate skeptic would say one test doesn't prove that all rubber ducks are the same. I'm sure someone at some point in history has made a rubber duck out of material that is less dense than water. But I invite you to try it yourself and I expect you will see the same result unless your rubber duck is quite atypical.

Yes, the models will frequently give accurate answers if you ask them this question. That's kind of the point. Despite knowing that they know the answer, you still can't trust them to be correct.


Ah good show :). I was rather preoccupied with the question but didn't have one handy. Well, I do, but my kid would roast me slowly over coals if I so much as smudged it. Ah the joy of the Internet, I did not predict this morning that I would end the day preoccupied with the question of rubber duck density!

I guess for me the question of whether or not the model is lying or hallucinating is if it's correctly summarizing its source material. I find very conflicting materials on the density of rubber, and most of the sources that Google surfaces claim a lower density than water. So it makes sense to me that the model would make the inference.

I'm splitting hairs though, I largely agree with your comment above and above that.

To illustrate my agreement: I like testing AIs with this kind of thing... a few months ago I asked GPT for advice as to how to restart my gas powered water heater. It told me the first step was to make sure the gas was off, then to light the pilot light. I then asked it how the pilot light was supposed to stay lit with the gas off and it backpedaled. My imagining here is that because so many instructional materials about gas powered devices emphasize to start by turning off the gas, that weighted it as the first instruction.

Interesting, the above shows progress though. I realized I asked GPT 3.5 back then, I just re-asked 3.5 and then asked 4 for the first time. 3.5 was still wrong. 4 told me to initially turn off the gas to dissipate it, then to ensure gas was flowing to the pilot before sparking it.

But that said I am quite familiar with the AI being confidently wrong, so your point is taken, I only really responded because I was wondering if I was misunderstanding something quite fundamental about the question of density.


That's a tricky one though since the question is, is the air inside of the rubber duck part of the material that makes it? If you removed the air it definitely wouldn't look the same or be considered a rubber duck. I gave it to the bot since when taking ALL the material that makes it a rubber duck, it is less dense than water.


A rubber duck in a vacuum is still a rubber duck and it still floats (though water would evaporate too quickly in a vacuum, it could float on something else of the same density).


A rubber duck with a vacuum inside (removing the air material) of it is just a piece of rubber with eyes. Assuming OP's point about the rubber not being less dense than water, it would sink, no?


No. Air is less dense than water; vacuum is even less dense than air. A rubber duck will collapse if you seal it and try to pull a vacuum inside with air outside, but if the rubber duck is in a vacuum then it will have only vacuum inside and it will still float on a liquid the density of water. If you made a duck out of a metal shell you could pull a vacuum inside, like a thermos bottle, and it would float too.


The metal shell is rigid though, so the volume remains the same with the vacuum. A rubber duck collapses with a vacuum inside of it, thus losing the shape of a duck and reducing the volume of the object =). That's why I said it's just a piece of rubber with eyes.


> A rubber duck collapses with a vacuum inside of it

Not if there is vacuum outside too. In a vacuum it remains a duck and still floats.


If you hold a rubber duck under water and squeeze out the air, it will fill with water and still be a rubber duck. If you send a rubber duck into space, it will become almost completely empty but still be a rubber duck. Therefore, the liquid used to fill the empty space inside it is not part of the duck.

I mean apply this logic to a boat, right? Is the entire atmosphere part of the boat? Are we all on this boat as well? Is it a cruise boat? If so, where is my drink?


Agree, then the question becomes how will this issue play out?

Maybe AI correctness will be similar to automobile safety. It didn’t take long for both to be recognized as fundamental issues with new transformative technologies.

In both cases there seems to be no silver bullet. Mitigations and precautions will continue to evolve, with varying degrees of effectiveness. Public opinion and legislation will play some role.

Tragically accidents will happen and there will be a cost to pay, which so far has been much higher and more grave for transportation.


Devil's advocate. It is made of a material less dense than water. Air.

It certainly isn't how I would phrase it, and I wouldn't count air as what something is made of, but...

Soda pop is chock-full of air, it's part of it! And I'd say carbon dioxide is a part of the recipe, of pop.

So it's a confusing world for a young LLM.

(I realise it may have referenced rubber prior, but it may have meant air... again, Devil's advocate)


When you make carbonated soda you put carbon dioxide in deliberately and use a sealed container to hold it in. When you make a rubber duck you don't put air in it deliberately and it is not sealed. Carbonated soda ceases to be carbonated when you remove the air. A rubber duck in a vacuum is still a rubber duck and it even still floats.


If the rubber duck has air inside, it is known, and intentional, for it is part of that design.

If you remove the air from the duck, and stop it so it won't refill, you have a flat rubber duck, which is useless for its design.

Much as flat pop is useless for its design.

And this nuance is even more nuance-ish than this devil's advocate post.


A rubber duck in a vacuum (not a duck in atmosphere with a vacuum only inside) would not go flat or pop. It would remain entirely normal, as useful as it ever was, and it would still float on a liquid the density of water. Removing the air would have no effect on the duck whatsoever. It's not part of the material of the duck in any reasonable interpretation.

But pedantic correctness isn't even what matters here. The model made a statement where the straightforward interpretation is false and misleading. A person who didn't know better would be misled. Whether you can possibly come up with a tortured alternative interpretation that is technically not incorrect is irrelevant.


There's nothing wrong with what you're saying, but what do you suggest? Factuality is an area of active research, and Deepmind goes into some detail in their technical paper.

The models are too useful to say, "don't use them at all." Hopefully people will heed the warnings of how they can hallucinate, but further than that I'm not sure what more you can expect.


The problem is not with the model, but with its portrayal in the marketing materials. It's not even the fact that it lied, which is actually realistic. The problem is the lie was not called out as such. A better demo would have had the user note the issue and give the model the opportunity to correct itself.


But you yourself said that it was so convincing that the people doing the demo didn't recognize it as false, so how would they know to call it out as such?

I suppose they could've deliberately found a hallucination and showcased it in the demo. In which case, pretty much every company's promo material is guilty of not showcasing negative aspects of their product. It's nothing new or unique to this case.


They should have looked more carefully, clearly. Especially since they were criticized for the exact same thing in their last launch.


The duck is indeed made of a material that is less dense. Namely water and air.

If you go to such technical routes your definition is wrong too. It doesn't float because it contains air. If you poke in the head of the duck it will sink. Even though at all times it contains air.


The duck is made of water and air? Which duck are we talking about here.


Is it possible for humans to be wrong about something, without lying?


I don't agree with the argument that "if a human can fail in this way, we should overlook this failing in our tooling as well." Because of course that's what LLMs are, tools, like any other piece of software.

If a tool is broken, you seek to fix it. You don't just say "ah yeah it's a broken tool, but it's better than nothing!"

All these LLM releases are amazing pieces of technology and the progress lately is incredible. But don't rag on people critiquing it, how else will it get better? Certainly not by accepting its failings and overlooking them.


“Broken” is a word used by pedants. A broken tool doesn’t work. This works, most of the time.

Is a drug “broken” because it only cures a disease 80% of the time?

The framing most critics seem to have is “it must be perfect”.

It’s ok though, their negativity just means they’ll miss out on using a transformative technology. No skin off the rest of us.


I think the comparison to humans is just totally useless. It isn’t even just that, as a tool, it should be better than humans at the thing it does, necessarily. My monitor is on an arm, the arm is pretty bad at positioning things compared to all the different positions my human arms could provide. But it is good enough, and it does it tirelessly. A tool is fit for a purpose or not, the relative performance compared to humans is basically irrelevant.

I think the folks making these tools tend to oversell their capabilities because they want us to imagine the applications we can come up with for them. They aren’t selling the tool, they are selling the ability to make tools based on their platform, which means they need to be speculative about the types of things their platform might enable.


If a broken tool is useful, do you not use it because it is broken ?

Overpowered LLMs like GPT-4 are both broken (according to how you are defining it) and useful -- they're just not the idealized version of the tool.


Maybe not if it's the case that your use of the broken tool would result in the eventual undoing of your work. Like, let's say your staple gun is defective and doesn't shoot the staples deep enough, but it still shoots. You can keep using the gun, but it's not going to actually do its job. It seems useful and functional, but it isn't and it's liable to create a much bigger mess.


So to continue the analogy, if the staple gun is broken and it requires you to do more than a working (but non-existent) staple gun BUT less work than doing the affixment without the broken staple gun, you would or would not use it ?


But nobody said they wouldn't use it. You said that. You came up with this idea and then demanded other people defend it.

I don't know why "critiquing the tool" is being equated to "refusing to use the tool."

I don't like calling something a strawman, because I think it's an overused argument, but...I mean...


I didn't come up with it nor ask anyone to defend it. I asked a different question about usefulness, and about what it means to him for something to be "broken".

My point is that the attempt to critique it was a failure. It provided no critique.

It was incomplete at the very least -- it assigned it the label of broken, but didn't explain the implications of that. It didn't define what level of failure it would need to be at to be valuable.

Additionally, I didn't indicate whether or not he would refuse to use it -- specifically because I didn't know, because he didn't say.

We all use broken tools built on a fragile foundation of imperfect precision.


I think you are missing the point. If I do use it, then my result will be a broken and defective product. How exactly is that not clear? That's the point. It might not be observable to me, but whatever I'm affixing with the staple gun will come loose because it's not working right and not sinking the staples in deep enough...

If I don't use it, then the tool is not used and provided no benefit...


It's not clear because it is false and I believe I can produce a proof if you are willing to validate that you accept my premise.

Your CPU, right now, has known defects. It will produce the wrong outputs for some inputs. It seems to meet your definition of broken.

Do you agree with that premise ?


One has nothing to do with the other. There's no rule about all broken tools because they can be broken in different ways. What's so difficult about my hypothetical? I laid it all out for you.


I assumed you understood we reached the end of the usefulness of your hypothetical to the original analogy since, as you said, the tools can be broken in different ways. I tried to introduce a scenario that was more applicable and less theoretical so that we could discuss those particular points.

If we do somehow try to apply your analogy, it would indicate that the LLM output is flawed in a way we cannot scrutinize -- the hidden failures that we aren't detecting (why? It's not specified, I am assuming because we didn't check to see if the tool was "broken" and not meeting some unspecified quality-level; that is, it's an unknown unknown failure mode).

This doesn't really comport with the LLM scenario, where the output is fully viewed, and the outputs are widely understood (that is it is a known failure mode).

This is more closely related to a computing service -- of which you are an active user of. You are using a "broken" computer right now, according to your definition of broken correct ?


>This doesn't really comport with the LLM scenario, where the output is fully viewed, and the outputs are widely understood (that is it is a known failure mode).

Yes, it does. People are regularly using LLMs to brief themselves on topics they are ignorant about. Are you actually serious?

>This is more closely related to a computing service -- of which you are an active user of. You are using a "broken" computer right now, according to your definition of broken correct ?

Doesn't really matter, because I'm not relying upon the computer for the verity of its functions


I think you're reading a lot into GP's comment that isn't there. I don't see any ragging on people critiquing it. I think it's perfectly compatible to think we should continually improve on these things while also recognizing that things can be useful without being perfect


I don't think people are disputing that things can be useful without being perfect. My point was that when things aren't perfect, they can also lead to defects that would not otherwise be perceived based upon the belief that the tool was otherwise working at least adequately. Would you use a staple gun if you weren't sure it was actually working? If it's something you don't know a lot about, how can you be sure it's working adequately?


Lying implies an intent to deceive, or giving a response despite having better knowledge, which I'd argue LLMs can't do, at least not yet. It just requires a more robust theory of mind than I'd consider them to realistically be capable of.

They might have been trained/prompted with misinformation, but then it's the people doing the training/prompting who are lying, still not the LLM.


To the question of whether it could have intent to deceive, going to the dictionary, we find that intent essentially means a plan (and computer software in general could be described as a plan being executed) and deceive essentially means saying something false. Furthermore, its plan is to talk in ways that humans talk, emulating their intelligence, and some intelligent human speech is false. Therefore, I do believe it can lie, and will whenever statistically speaking a human also typically would.

Perhaps some humans never lie, but should the LLM be trained only on that tiny slice of people? It's part of life, even non-human life! Evolution works based on things lying: natural camouflage, for example. Do octopuses and chameleons "lie" when they change color to fake out predators? They have intent to deceive!


Not to say this example was lying but they can lie just fine - https://arxiv.org/abs/2311.07590


They're lying in the same way that a sign that says "free cookies" is lying when there are actually no cookies.

I think this is a different usage of the word, and we're pretty used to making the distinction, but it gets confusing with LLMs.


You are making an imaginary distinction that doesn't exist. It doesn't even make any sense in the context of the paper I linked.

The model consistently and purposefully withheld knowledge it was directly aware of. This is lying under any useful definition of the word. You're veering off into meaningless philosophy that has no bearing on outcomes and results.


Most humans I professionally interact with don't double down on their mistakes when presented with evidence to the contrary.

The ones that do are people I do my best to avoid interacting with.

LLMs act more like the latter, than the former.


Given the misleading presentation by real humans in these "whole teams" that this tweet corrects, this doesn't illustrate any underlying powers by the model


>It's the single biggest problem with LLMs and Gemini isn't solving it.

I loved it when the lawyers got busted for using a hallucinating LLM to write their briefs.


People seem to want to use LLMs to mine knowledge, when really it appears to be a next-gen word-processor.


LLMs do not lie, nor do they tell the truth. They have no goal as they are not agents.


With apologies to Dijkstra, the question of whether LLMs can lie is about as relevant as the question of whether submarines can swim.


I totally agree with you on the confident lies. And it’s really tough. Technically the duck is made out of air and plastic right?

If I pushed the model further on the composition of a rubber duck, and it failed to mention its construction, then it’d be lying.

However there is this disgusting part of language where a statement can be misleading, technically true, not the whole truth, missing caveats etc.

Very challenging problem. Obviously Google decided to mislead the audience and basically cover up the shortcomings. Terrible behaviour.


Calling the air inside the duck (which is not sealed inside) part of its "material" would be misleading. That's not how most people would interpret the statement and I'm confident that's not the explanation for why the statement was made.


The air doesn’t matter. Even with a vacuum inside it would float. It’s the overall density of “the duck” that matters, not the density of the plastic.


A canoe floats, and that doesn't even command any thought regarding whether you can replace trapped air with a vacuum. If you had a giant cube half full of water, with a boat on the water, the boat would float regardless of whether the rest of the cube contained air or vacuum, and regardless of whether the boat traps said air (like a pontoon) or is totally vented (like a canoe). The overall density of the canoe is NOT influenced by its shape or any air, though. The canoe is strictly more dense than water (it will sink if it capsizes) yet in the correct orientation it floats.

What does matter, however, is the overall density of the space that was water and became displaced by the canoe. That space can be populated with dense water, or with a less dense canoe+air (or canoe+vacuum) combination. That's what a rubber duck also does: the duck+air (or duck+vacuum) combination is less dense than the displaced water.


No, the density of the object is less than water, not the density of the material. The duck is made of plastic, and it traps air. Similarly, you can make a boat that floats in water out of concrete or metal. It is an important distinction when trying to understand buoyancy.
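
To put rough numbers on the object-versus-material distinction, here is a minimal Python sketch. Everything in it is an assumption for illustration: the vinyl density of 1.30 g/cm^3, the 20 cm^3 shell, the 180 cm^3 cavity, and the helper names `overall_density` and `floats` are all made up, not measurements of a real duck.

```python
WATER = 1.00   # g/cm^3, fresh water
VINYL = 1.30   # g/cm^3, assumed illustrative value for a vinyl polymer
AIR = 0.0012   # g/cm^3, approximate value near sea level


def overall_density(shell_volume, cavity_volume, shell_density, fill_density=0.0):
    """Total mass (shell plus cavity contents) divided by total enclosed volume."""
    mass = shell_density * shell_volume + fill_density * cavity_volume
    return mass / (shell_volume + cavity_volume)


def floats(density, fluid_density=WATER):
    """An object floats when its overall density is below the fluid's density."""
    return density < fluid_density


# Hypothetical duck: 20 cm^3 of vinyl wrapped around a 180 cm^3 cavity.
duck_with_air = overall_density(20.0, 180.0, VINYL, AIR)
duck_with_vacuum = overall_density(20.0, 180.0, VINYL)  # cavity holds nothing
solid_lump = overall_density(20.0, 0.0, VINYL)          # same vinyl, no cavity

for name, rho in [("duck with air", duck_with_air),
                  ("duck with vacuum", duck_with_vacuum),
                  ("solid vinyl lump", solid_lump)]:
    print(f"{name}: {rho:.3f} g/cm^3, floats: {floats(rho)}")

# For a floating object, the submerged fraction equals density / fluid density.
print(f"submerged fraction of duck with air: {duck_with_air / WATER:.2f}")
```

Under these assumed numbers the material is denser than water in all three cases, yet only the solid lump sinks. The vacuum-filled duck comes out slightly less dense than the air-filled one, which matches the point above that the air itself is not what makes it float.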


It also says the attribute of squeaking means it'll definitely float.


That's actually pretty clever because if it squeaks, there is air inside. How many squeaking ducks have you come across that don't float?


You could call it clever or you could call it a spurious correlation.


language models do not lie. (this pedantic distinction being important, because language models.)



