First Impressions with Google Gemini (roboflow.com)
83 points by zerojames 4 months ago | 25 comments



Weird that the model refused to answer the object detection prompt. What other variations of that prompt have we tried? Do you think the task was intentionally RLHF'd out, or is it a quirk of some kind?

Thinking out loud, some things I'd try (with a sketch for batch-testing them after the list):

Find the x/y position of the dog.

What is the center point of the dog?

Give the pixel coordinates (xywh) of the dog in this photo.

Simulate an object detection model trained to find dogs run on this photo; give your output in JSON format.
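
A quick way to batch-test variations like these: a minimal sketch using the google-generativeai Python SDK, where the model name, API key handling, and image path are assumptions to adjust to whatever you're running.

    import google.generativeai as genai
    from PIL import Image

    # Assumes an API key is available; "gemini-pro-vision" is the assumed
    # name of the vision model.
    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-pro-vision")

    image = Image.open("dog.jpeg")  # hypothetical local copy of the test photo

    prompts = [
        "Find the x/y position of the dog.",
        "What is the center point of the dog?",
        "Give the pixel coordinates (xywh) of the dog in this photo.",
        "Simulate an object detection model trained to find dogs run on this "
        "photo; give your output in JSON format.",
    ]

    # Run every prompt variation against the same image and print the replies.
    for prompt in prompts:
        response = model.generate_content([prompt, image])
        print(f"--- {prompt}\n{response.text}\n")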


I have experienced intermittent performance with the web interface.

I am really curious about the object detection performance, so I'll be rerunning that test regularly.


After some prompting, I got a response. It identified the dog in the image, but the dog was at the center point of the image. I gave Gemini the Home Alone image and it couldn't identify the Christmas tree, returning invalid coordinates.

The post has been updated accordingly.


Just guessing, but it might be some kind of safety measure so that Gemini can't be used to solve CAPTCHAs.


Thanks for the thorough analysis.

Wish articles like this weren't written to be optimized for SEO. It makes them much harder to read!


Readability is important to me. What can I do better? I always appreciate feedback.


It is an interesting article and some of the results are unexpected, but it is way too long and could be greatly condensed. At first glance, and speaking from how I like to lay things out in a paper:

- You do not really need to show the prompt interface. You can show it once and thereafter use a bulleted list format, or simply show the input image if it responded correctly.

- Your figures should be about half their size; they don't need to fit the width of the body.

- For comparative results against other models you can use a table with colored cells, with the test names on the rows and the model names on the columns.

- For the dog, show a side-by-side figure with the raw image on the left and the box on the right, and include the coordinates it gave you in the body.

- In your conclusion, show the full matrix table of comparative results and summarize the relative strengths of the model against the others.

In terms of the writing and methods, your conclusion says little and your tests do not go into significant depth. For example, with the tire image you could show that it succeeds when cropped, but that as the photo gets wider it begins to fail to correctly identify the text in the image's center. For an example of methodology and presentation, see this article: https://dynomight.net/ducks/

Also, the OCR test is too simple; even a 20-year-old OCR algorithm would probably recognize that. Experimenting with progressive degradation of the image could show the model's strengths, and analysis could show its accuracy at each level of degradation.


I'd say, trying to read this, the biggest problems are:

- tons of visual clutter, all those gradients and lines like the header or hero image

- a floating ToC which insists on jamming in 'recommended links' (?!) the entire time

- no outlines. Every single image or screenshot blends into the actual article.

- a visual summary which is hard to read because it has tiny text and looks like a correlation heatmap instead of a table

- highly inconsistent use of linking. Like, why does 'We have evaluated Gemini across four separate vision tasks:' link only 2 of the 4, and then not to the section in this article?

- highly repetitive screenshots, which add nothing, and in conjunction with the lack of outlines for the images and the many outlines inside the images, means that the benchmark sections are a frustrating visual jigsaw puzzle where you have to decode the screenshot again and again to look at the tiny text inside it. It would be better to provide one (1) screenshot of each model's UI, which is all I need to see to get an idea of what it looks like and the implied workflow and what sort of metadata/options it has, and then for each task simply show the image/prompt and each model's responses as a normal blockquote or text.




All: I sincerely appreciate the time spent sharing feedback. Your notes and comments are helpful and give me tools to be a better writer.

Regarding the screenshots, I am not a fan of this approach. We adopted it because of the early trend to share ChatGPT screenshots, and to ensure people could see the origin of our prompting (the web interface).

I will start a discussion about screenshots with the team. This can be better.

I will discuss the layout with the team, too. Machine learning and AI are difficult enough; to the extent that we can focus attention on the most important part of the page, the content, we should.

Thank you again for your notes! I appreciate it.


While I don't necessarily agree with all of these points,

> link only 2 of the 4, and then not to the section in this article?

This one is particularly prevalent on websites, and it's quite annoying. When a site has topic-explainer articles, the terms that refer to those topics are always linked to those other articles, presumably to increase ad impressions and keep users on the site, but when there are legitimate article-specific links (which are almost always what I want), I have no way to locate them (for instance, when trying to find an original source).

Back in the day, websites would use a different link style for these sorts of "internal plug" links, which was helpful. I guess it died out because users didn't want to click them. So the solution is: make it hard to tell which ones are internal plugs!


Obviously the core content wasn't AI-generated, but the article reads like unfiltered AI output. It doesn't really 'flow', for lack of a better word.


I'll note that your article was very easy to read in Safari's reader mode.


Interesting results, since Gemini scored "better" than GPT-4, but Gemini is more like a GPT-3.5-equivalent model. The Ultra one being released later is the GPT-4 equivalent.


You gave very different prompts to Gemini and GPT-4 ("how much money" vs. "how many coins", different coordinate systems, etc.).


Trying it myself with GPT-4 Vision (directly using the API), I get the response:

"The image shows a total of four coins."

For the exact same image with a prompt "how many coins do I have?"

I keep getting the same thing every time.
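
For anyone who wants to reproduce this, here is a minimal sketch with the openai Python SDK; the model name, image path, and max_tokens are assumptions, not necessarily what was used here.

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical local copy of the coins image, sent as a base64 data URL.
    with open("coins.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "how many coins do I have?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    print(response.choices[0].message.content)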

If I ask "what is the value shown on the image?"

I get the following:

"The image shows four coins, and from what I can discern, three of the coins have the number "1" on them, which likely indicates that they are one-unit coins in their respective currency. The fourth coin, which is a different color and appears to be bi-metallic, has the number "2" on it, suggesting it is a two-unit coin. The total value shown by the coins, therefore, appears to be 5 units of their currency. However, without knowing the specific currency, I cannot determine the exact monetary value in any other terms."

So to me it's obvious the GPT-4 Vision model can handle this question.

Edit: With regular ChatGPT I also get "You have a total of four coins in the image."

I then asked GPT-4 Vision to find the dog, and it said:

"The dog is located approximately at 0.29, 0.36, 0.70, 0.90." - Just eyeing it, it may seem decently accurate, but not 100%.

One difference is that I took a screenshot of your pictures rather than providing the original, though, so that could also affect the answers.

Prompt for the dog was: "Find the dog. Return its location in the format x_min, y_min, x_max, y_max. Respond with 0-1 as percentages."
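
To sanity-check a box like that beyond eyeballing it, a minimal sketch that scales the normalized coordinates back to pixels and draws them with Pillow; the image path is an assumption, and the box values are just the ones quoted above.

    from PIL import Image, ImageDraw

    # Normalized x_min, y_min, x_max, y_max as returned above.
    box = (0.29, 0.36, 0.70, 0.90)

    image = Image.open("dog.jpeg")  # hypothetical local copy of the test photo
    width, height = image.size

    # Scale the 0-1 coordinates to pixel coordinates.
    x_min, y_min = box[0] * width, box[1] * height
    x_max, y_max = box[2] * width, box[3] * height

    # Draw the box so it can be compared against the original image.
    draw = ImageDraw.Draw(image)
    draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=3)
    image.save("dog_with_box.jpeg")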


Thank you! We took screenshots of our tests, which you can download here: https://media.roboflow.com/lmms-tests.zip

(Gemini ones are not in there yet.)


That's not really answering anything.


The prompt for the coordinates is still different in GPT-4. The coins one looks to be the same, but the question is a little bit ambiguous, as GPT might be answering with the total value of the coins and did 1+1+1+2 = 5. For me, the tax one is working in GPT-4.


> The coins one looks to be the same, but the question is a little bit ambiguous, as GPT might be answering with the total value of the coins

Well, that wouldn't be a very intelligent "AI", would it? The prompt was completely unambiguous.


Interesting that it decided to draw a bounding box around the dog's head; maybe its training images are mostly dog portraits.


I would like to see a few more tests to make sure it's definitely not a coincidence that it was specifically a box in the center.


I am trying to use the HTTP API. I asked the simple question "How many fingers does a baby have?" and all I get is this empty response: [{'candidates': [{'content': {'role': 'model'}, 'finishReason': 'OTHER'}]}, {'candidates': [{'content': {'role': 'model'}, 'finishReason': 'OTHER'}], 'usageMetadata': {'promptTokenCount': 3, 'totalTokenCount': 3}}]

If I use their example "Give me a recipe for banana bread.", it returns a recipe.

Why won't it answer a simple question about number of fingers?
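
For anyone hitting the same thing, here is a minimal sketch of the raw request with Python's requests library; the v1beta endpoint and model name are assumptions based on the REST docs. The paste above looks like streamGenerateContent output, so comparing it with the non-streaming generateContent call may help isolate the problem, and a promptTokenCount of 3 looks low for that question, which makes it worth confirming the full text is actually being sent.

    import os
    import requests

    # Assumed v1beta REST endpoint and model name; adjust for your setup.
    # The API key is read from the environment.
    api_key = os.environ["GOOGLE_API_KEY"]
    url = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        f"gemini-pro:generateContent?key={api_key}"
    )

    body = {
        "contents": [
            {"parts": [{"text": "How many fingers does a baby have?"}]}
        ]
    }

    resp = requests.post(url, json=body, timeout=60)
    resp.raise_for_status()
    print(resp.json())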


good work!


These results say that Gemini performs better than GPT-4V, which is clearly not true from experience. We need better evals.




