
I’ve been pretty excited to finally try Gemini advanced. So far pretty disappointed.

Here’s my go-to test question - which even chat gpt 3.5 can get.

Question:

I walk up to a glass door. It has the word push on it in mirror writing. Should I push or pull the door, and why

Gemini advanced:

You should push the door. Here's why:

* Mirror Writing: The word "PUSH" is written in mirror writing, meaning it would appear correctly if you viewed it in a mirror. This suggests that the correct action from your side is the opposite of what the word states.

* Door Mechanics: Most doors are designed to be pushed open from the side with the hinges. If there are no handles to indicate otherwise, it's generally safer to assume a standard push-open mechanism.

Let me know if you'd like another brain teaser!

It isn’t even consistent with itself - it gets that you should do the opposite, but still says to push the door. This ‘talking to a dumb person’ feeling is the worst.

ChatGPT 3.5

You should pull the door. The fact that the word "push" is written in mirror writing indicates that it's intended for people on the other side of the door to read, not for you. Typically, when there's a sign indicating to push or pull, it's meant for the person approaching the door, not someone already on the other side.




This is what I get from Gemini:

You should pull the door. Here's why:

Mirror writing: While the word "push" is written on the door, it's in mirror writing. This is a common way to indicate that the action needed to open the door is the opposite of what's written.

Safety: Mirroring the word "push" on a pull door is a safety measure. This helps prevent people from pushing against the door, potentially causing injury or damage to the glass.

Common practice: This design is quite common for glass doors in public places, so it's safe to assume the opposite of what's written applies unless there are other clear instructions.

Therefore, even though the word "push" is present, seeing it in mirror writing on a glass door strongly suggests that pulling the door is the intended way to open it.


Better - but it has not understood that the message is for the person on the other side. It thinks it's actually mirror writing intended to convey the opposite of what's written. And then it waffles on about door safety for ages.


I would say this is very bad, even worse than internal logical inconsistency. It has expressed a completely incorrect picture of the world (that people write mirror messages to ensure the opposite action is taken).

The fact that it produced the right answer (which by the way it can do 50% of the time simply at random) is irrelevant, IMO.


Honestly, I don't see why this counts as the model getting things wrong, since in your prompt you don't specify which side of the door the text is printed on.


It’s a glass door! That’s the whole point of the question. You can see it from both sides.


I also think the way you posed the question is pretty weird and actively invites misinterpretation. If I approach a glass door and see mirrored text, that's not "mirror writing"—it's regular writing for people on the other side of the door. "Mirror writing" strongly implies that the text was written in mirrored form, rather than its mirrored-ness being a side effect of viewing it from "behind".

The inconsistency in the answers you posted is more concerning than the "inaccuracy", but we already know LLMs are prone to hallucinate when they should be asking for clarification.


I understand the frustration with how seemingly obvious the correct answer is, but it seems like word choice might be a factor here. The word “mirrored” is a bit less accurate than perhaps “reversed”, and may be a red herring, though the difference is subtle. I wonder how both Gemini and GPT would perform if the word choice for that particular aspect were changed.


As eitally points out, your prompt leaves open the possibility that the mirror writing is on the other side of the door (which would make no sense). So technically you underspecified the prompt?


The mirror writing IS on the other side of the door. That’s exactly the point since it’s a glass door.

I thought of this question after coming across this exact scenario as I walked up to a glass door.

It’s not some pretend scenario. Often, when you approach a glass door, there is writing intended for the person on the other side, which appears to you as mirror writing.

I wondered if ChatGPT could figure that out, and to my great surprise it could. That set a new benchmark in my mind for how much of a world model it must have.


The point of these AIs is that they don't need precise programming like a computer and that they understand real human language, which is imprecise but has general conventions and simplifying assumptions to make communication easier.


But the whole question is posed like a trick question. I'd at least consider that possibility, and I think it's normal for a human to do so.


It's not a trick question, because it's very clear what the key thing to think about is: the mirrored writing. A trick question would be something that's trying to divert your attention elsewhere with a red herring.


I got a different answer with GPT 3.5

> If the word "push" is written on the glass door in mirror writing, it means that from the other side of the door, it should be pushed. When you see the mirrored text from your side, it indicates the action to be taken from the opposite side. Therefore, in this scenario, you should push the door to open it.


I also get the wrong answer with GPT 4

https://chat.openai.com/share/4373c945-88b8-4742-8a2c-76fff2...

> You should push the door. The word "push" written in mirror writing indicates that the instructions are intended for someone on the opposite side of the door from where you are standing. Since you can see the mirror writing from your side, it means the text is facing the other side, suggesting that those on the other side should push. Therefore, from your perspective, you should also push to open the door.


Strange, I get the right answer on GPT4

> If the word "push" is written in mirror writing and you are seeing it from your side of the glass door, you should pull the door towards you. The reason for this is that the instruction is intended for people on the other side of the door. For them, the word "push" would appear correctly, instructing them to push the door to open it from their side. Since you are seeing it in reverse, it implies you are on the opposite side, and the correct action for you would be to pull the door to open it.


Yeah LLMs are not consistent.


Here's another one.

This is a classic logic puzzle - usually about ducks.

There are two pineapples in front of a pineapple, two pineapples behind a pineapple and a pineapple in the middle. How many pineapples are there?

When you use ducks, Gemini can do it; when you use pineapples, it cannot and thinks there are 5 instead of 3.
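
For anyone who parses it differently, here's a minimal sketch (purely illustrative Python, my own naming) of the reading that gives 3: three pineapples standing in a single line satisfy all three clauses at once.

    # Illustrative sketch only: three pineapples in a single-file line,
    # with the front of the line at index 0.
    row = ["p1", "p2", "p3"]

    in_front_of_last = row[:-1]  # two pineapples in front of a pineapple (the last one)
    behind_first = row[1:]       # two pineapples behind a pineapple (the first one)
    middle = row[1]              # a pineapple in the middle

    assert len(in_front_of_last) == 2
    assert len(behind_first) == 2
    assert middle == "p2"

    print(len(row))  # 3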

ChatGPT 3.5 and 4 can do it.

The even funnier thing is that if you then say to Gemini, "hey - would the answer be the same if it was ducks?", it says NO, there would then be 3 ducks, and explains why.

Then if you say, but wouldn't the same logic apply to pineapples? It says 'oh yeah!'.

To anyone saying I am just playing with a silly thing, I say no - this is testing reasoning ability and understanding, which is the number one thing I want.

ChatGPT appears to have a more generalised ability to reason whereas Gemini is only reciting its training data.


As a human, I think the correct answer is 7. This isn't so much a logic puzzle as an ambiguous sentence that can be parsed multiple ways.


Yeah, as few as 3, at most infinity.


This is less of a logic puzzle and more of a riddle. It would be a logic puzzle if the question was around the lines of "What is the least number of pineapples that satisfies these constraints?"


The way I parsed that sentence, I came up with 5.


Hate to break it to you but I did too. Apparently that means we're both LLMs. I coulda sworn I had a whole childhood and everything but apparently I've always been a GPT.


I mean, I thought and still think the answer is five… am I an AI or a human?

If the answer is so ambiguous that humans and AI get it wrong, is it really that great of a question?


So close!


How do you prefer to validate if a model is actually useful for you in practice outside of solving toy problems? Are you asking these models to solve reasoning problems like this to get any benefit for yourself in your day to day use? Or do you even care if the models are useful for day to day tasks?


Yeah I get the instinct to poke at LLMs, they're fun toys, but it's always weird to see so much focus on stuff like logic problems.

I've used Bard for creative brainstorming, for real factual questions, for translating .pot files, etc, and it's done pretty well.


I was going to ask the same question... I've been using Bard for everyday tasks for a while now and it's as good as, and sometimes better than, GPT-4 (I pay for a Pro subscription). Someone ruling out an LLM because it couldn't answer one question speaks more about them than about the LLM's capabilities.

Just yesterday I was using both GPT-4 and Bard to figure out an IPv6 routing issue. I pasted in the same questions and the same troubleshooting logs to both, and Bard was able to help me solve it quicker than GPT-4.


I asked it the same question I was asking GPT4 yesterday (to explain details of a complicated ML model quantization method, based on a screenshot image of equations from a pdf). I actually need this for work - I'm trying to design an algorithm. I felt Ultra provided a very similar quality answer to GPT4.


I also tried it with a bunch of my previous GPT-4 requests and it didn't even understand a few of them that GPT-4 was very helpful with.


Got it. Thanks.


For me the validation process is to use it for a few weeks and then I have a good handle on what it can handle and what it can’t.


It’s the simplest metric to see how much of the model is reasoning and how much is just rehashing.

If it can’t solve puzzles then it’s more likely to make mistakes when understanding questions that are out of its training set.


User: Which is heavier, two pounds of feathers or a pound of bricks?

Gemini: Neither! They both weigh the same, which is two pounds. The key is in the wording of the question. You're comparing equal amounts of two different materials (two pounds of feathers vs. one pound of bricks). As long as the quantity is the same, the weight (force exerted by gravity) will always be the same regardless of the material.

This is a classic trick question that plays on the different densities of feathers and bricks. While bricks are much denser than feathers, meaning they pack more mass into a smaller space, the question specifies the weight by quantity, not volume. So, both two pounds of feathers and one pound of bricks experience the same gravitational pull and therefore weigh the same.


Interesting. Based on this conversation[1], I think Gemini Ultra is massively overfit. Make it do unit conversions or use units it hasn't seen in the same framing before and it does well. But stay close enough to the original trick question (1 and 1) and it fails.

[1] https://g.co/gemini/share/94bfb8f9ebea


Both GPT-4 and Gemini answered this variation correctly: one pound of potatoes vs. one pound in paper British currency: which of these is heavier?

However, GPT-4 does better with the more ambiguous version, pointing out the ambiguity: one pound of potatoes vs. one pound in paper currency: which of these is heavier?


This is surely in the training data.


GPT-4:

> If the word "push" is written in mirror writing and you're facing it, it's likely that the message is intended for those on the opposite side of the door from you, meaning it's directed at people who are inside looking out. In this case, you should pull the door to open it. The mirror writing suggests that the instruction is not meant for your current position outside the door but for those inside, indicating the action they should take to open the door from their side.


Interesting test.

I tried it with Mixtral and it got it:

"If the word 'push' is written in mirror writing, it's likely intended to be read correctly from the other side of the door. Therefore, you should pull the door to open it from your side. This is because mirrored writing appears normal when viewed from the opposite side of a reflective surface like a glass door."

I have been really impressed with Mistral lately.


What I got from Gemini Advanced:

> Should you pull the door? Yes.

> Here's why: Mirror writing means the word "push" appears reversed on the side of the door you're facing. This indicates that the handle or hinges are on the opposite side, making it a pull door.

I modified your prompt slightly, not to boost performance, but to reduce verbosity. My prompt:

> I walk up to a glass door. It has the word push on it in mirror writing. Should I push or pull the door, and why? Please answer it as yes or no question.

The draft is even more interesting:

> Do you see the word "push" normally, or as if you were looking into a mirror?

> If you see it normally, should you push the door? Yes. If you see it in mirror writing, should you pull the door? Yes.


I don't know about Gemini, but GPT is pretty agreeable. When you ask if X is true, it will usually find some arguments to claim it is true. Therefore asking whether X or Y is usually better than asking should I do X.


For fun I tried to find a prompt that let Gemini answer correctly.

Gemini answers correctly with this prompt:

Answer this question as AI1, then as AI2, confirm if the answer is correct

I walk up to a glass door. It has the word push on it in mirror writing. Should I push or pull the door, and why.
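
In case anyone wants to try reproducing this programmatically, here's a rough sketch using the google-generativeai Python client (the model name and the exact wrapper wording are my own stand-ins, not anything official; Ultra isn't exposed the same way via the API):

    # Sketch only: wrap the question in the "answer as AI1, then as AI2" framing.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")      # assumes you have an API key
    model = genai.GenerativeModel("gemini-pro")  # stand-in model name

    question = ("I walk up to a glass door. It has the word push on it in "
                "mirror writing. Should I push or pull the door, and why.")
    prompt = ("Answer this question as AI1, then as AI2, "
              "confirm if the answer is correct.\n\n" + question)

    response = model.generate_content(prompt)
    print(response.text)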


If you ask it to reveal its answer last, it will do it correctly.


In this context,

is it that the training data had this problem?

or that the training data had something resembling the problem that it is semantically linked to?

or is it that the training data has no mention of this, and the model came up with a perfectly logical answer??


I’ve been using GPT-4 so much recently - like all day long in ChatGPT, and via the API building other services - and to all intents and purposes, it seems to be able to reason and think things through in ways that other LLMs cannot. I have no idea why or how that is, but it feels that way to me using it all the time. 3.5 wasn't like that. Bard / Gemini does not feel like that.


tbh your prompt confused the hell out of me. As a somewhat intelligent human I don’t know the response. I’ve never heard the phrase “in mirror writing”


Yeah, pretty disappointing. I asked it to summarize one of my papers and it hallucinated so many mistakes it was even worse than ChatGPT 3.5.


It tells me to pull.



