
> Claude's ability to count pixels and interact with a screen using precise coordinate

I guess you mean its "Computer use" API that can (if I understand correctly) send mouse clicks at specific coordinates?

I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:

> Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.

This is 3.5 Sonnet (their most current model).
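
For anyone who wants to poke at this themselves, the test looks roughly like this with the Anthropic Python SDK (a minimal sketch; the model alias, prompt wording, and file name are illustrative, not a transcript of the session quoted above):

    # Send a screenshot plus a "give me pixel coordinates" prompt via the
    # Messages API. Model alias, prompt and file name are placeholders.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("keyboard_screenshot.png", "rb") as f:
        screenshot_b64 = base64.standard_b64encode(f.read()).decode("ascii")

    message = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": screenshot_b64}},
                {"type": "text",
                 "text": "Give me the pixel coordinates (x, y) of the centre "
                         "of the SPACE key in this screenshot."},
            ],
        }],
    )
    print(message.content[0].text)  # the model's reply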

And they explicitly call out spatial reasoning as a limitation:

> Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.

--https://docs.anthropic.com/en/docs/build-with-claude/vision#...

Since 2022 I've occasionally dipped in and tested this use-case with the latest models, but I haven't seen much progress on the spatial reasoning. The multi-modality has been a neat addition, though.


They report that they trained the model to count pixels, and based on the accurate mouse clicks coming out of it, that seems to be the case for at least some code path.

> When a developer tasks Claude with using a piece of computer software and gives it the necessary access, Claude looks at screenshots of what’s visible to the user, then counts how many pixels vertically or horizontally it needs to move a cursor in order to click in the correct place. Training Claude to count pixels accurately was critical.
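
So the loop is roughly: take a screenshot, ask the model for an (x, y) offset, click there. A generic sketch of that loop (this is not Anthropic's actual computer-use tool schema; ask_model_for_click is a hypothetical placeholder for the model call):

    # Generic screenshot -> coordinates -> click loop, as described in the quote.
    # ask_model_for_click() is a hypothetical stand-in for the model call that
    # returns pixel coordinates; it is not part of any real API.
    import io
    import pyautogui

    def ask_model_for_click(screenshot_png: bytes, instruction: str) -> tuple[int, int]:
        """Placeholder: send the screenshot + instruction to the model and
        parse the (x, y) it returns."""
        raise NotImplementedError

    def click_on(instruction: str) -> None:
        shot = pyautogui.screenshot()          # PIL Image of the current screen
        buf = io.BytesIO()
        shot.save(buf, format="PNG")
        x, y = ask_model_for_click(buf.getvalue(), instruction)
        pyautogui.click(x, y)                  # move the cursor there and click

    # click_on("Click the 'Save' button")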


Curious: what use cases do you use to test the spatial reasoning ability of these models?


I noticed in your demo it generated the prompt "tap on the 'Log in' button located directly below the 'Facebook Password' field".

Does your model consistently get the positions right? (above, below, etc). Every time I play with ChatGPT, even GPT-4o, it can't do basic spatial reasoning. For example, here's a typical output (emphasis mine):

> If YouTube is to the upper *left* of ESPN, press "Up" once, then *"Right"* to move the focus.

(I test TV apps where the input is a remote control, rather than tapping directly on the UI elements.)
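
The frustrating part is that the ground truth is trivial to compute once you know the grid positions. A toy sketch (the key names and coordinates are made up):

    # Which remote-control keys move focus from A to B, given (row, column)
    # grid positions. Purely illustrative.
    def keypresses(current: tuple[int, int], target: tuple[int, int]) -> list[str]:
        (row, col), (trow, tcol) = current, target
        presses = []
        presses += ["KEY_UP"] * max(0, row - trow) + ["KEY_DOWN"] * max(0, trow - row)
        presses += ["KEY_LEFT"] * max(0, col - tcol) + ["KEY_RIGHT"] * max(0, tcol - col)
        return presses

    # YouTube one row up and one column to the LEFT of ESPN (which has focus):
    print(keypresses(current=(1, 1), target=(0, 0)))  # ['KEY_UP', 'KEY_LEFT']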


Beautiful.



I'm a huge fan. But sometimes it's hard to produce the snapshots.


According to the ViperGPT paper their "ImagePatch.find()" uses GLIP.

According to the GLIP paper,† accuracy on a test set not seen during training is around 60%, so... neat demos, but whether it'll be reliable enough depends on your application.

† https://arxiv.org/abs/2206.05836


Could you implement (some of) astroid's inference using stack graphs? [1],[2]

That would allow a lot of caching optimisations, as you can "index" each file in isolation (rough sketch below).

[1]: https://github.blog/2021-12-09-introducing-stack-graphs/

[2]: https://github.com/github/stack-graphs
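
To make the caching idea concrete (this isn't stack graphs themselves, just the "index each file in isolation, keyed by its content" shape; index_file is a placeholder for the real per-file analysis):

    # Per-file indexing keyed by content hash, so unchanged files never
    # need to be re-analysed.
    import hashlib
    from pathlib import Path

    _cache: dict[str, dict] = {}  # content hash -> per-file index

    def index_file(source: str) -> dict:
        """Placeholder for the real per-file analysis (e.g. building a stack graph)."""
        return {"symbols": []}

    def get_index(path: Path) -> dict:
        source = path.read_text()
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in _cache:
            _cache[key] = index_file(source)  # only re-run when the file's content changes
        return _cache[key]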


It side-steps the problem of git conflicts, I suppose. You'd have to use their tool (`touca diff`? I don't know if that exists) instead of `git diff`.


Some ideas I got from Jeremias Rößler's talk: https://t.co/xWtA58Q9q5

- Snapshot testing is like version-control but for the outputs rather than the inputs (source code).

- Asserts in traditional unit tests are like "block lists" specifying which changes aren't allowed. Instead, snapshot testing allows you to specify an "allow list" of acceptable differences (e.g. timestamps).
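
The second point is easy to hand-roll, e.g. (a minimal sketch; the timestamp regex is the "allow list", and a real snapshot-testing library would handle the recording and diffing for you):

    # Hand-rolled snapshot test: timestamps are on the "allow list" of
    # acceptable differences; everything else must match the stored snapshot.
    import re
    from pathlib import Path

    TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

    def assert_matches_snapshot(output: str, snapshot_path: Path) -> None:
        normalised = TIMESTAMP.sub("<TIMESTAMP>", output)
        if not snapshot_path.exists():
            snapshot_path.parent.mkdir(parents=True, exist_ok=True)
            snapshot_path.write_text(normalised)  # first run records the snapshot
            return
        assert normalised == snapshot_path.read_text()

    def test_report():
        report = "Generated 2024-10-23T14:02:11\nTotal: 42 widgets\n"
        assert_matches_snapshot(report, Path("snapshots/report.txt"))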


Obviously a sensationalised title, but it's a neat illustration of how you'd apply the language models of the future to real tasks.


Would be ridiculously inefficient, while also being nondeterministic and opaque. Impossible to debug, verify, or test anything, and thus would be unwise to use for almost any kind of important task.

But maybe for a very forgiving task you can reduce developer hours.

As soon as you need to start doing any kind of custom training of the model, then you are reintroducing all developer costs and then some, while the other downsides still remain.

And if you allow users of your API to train the model, that introduces a lot of issues. See: Microsoft's Tay chatbot.

Also you would need to worry about "prompt injection" attacks.


> Would be ridiculously inefficient, while also being nondeterministic and opaque. Impossible to debug, verify, or test anything, and thus would be unwise to use for almost any kind of important task.

Not to defend a joke app, but I have worked on “serious” production systems in which, for all intents and purposes, it was impossible to recreate bugs in order to debug them. They took data from so many outside sources that the “state” of the software could not be easily replicated at a later time. Random microservice failures littered the logs and you could never tell if one of them was responsible for the final error.

Again, not saying GPT backend is better but I can definitely see use-cases where it could power DB search as a fall-through condition. Kind of like the standard 404 error - did you mean…?
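
Concretely, something like this, where the fallback branch is where a model call could slot in (a sketch; difflib stands in for the model so it stays self-contained and deterministic):

    # Exact lookup first; only fall through to "did you mean" suggestions on a
    # miss. The fallback is where a GPT call could go; difflib stands in here.
    import difflib

    PRODUCTS = {"blue widget": 101, "red widget": 102, "green gadget": 103}

    def search(query: str) -> dict:
        if query in PRODUCTS:
            return {"status": 200, "id": PRODUCTS[query]}
        suggestions = difflib.get_close_matches(query, list(PRODUCTS), n=3, cutoff=0.6)
        return {"status": 404, "did_you_mean": suggestions}

    print(search("blue widget"))  # exact hit
    print(search("blu widgit"))   # miss -> {'status': 404, 'did_you_mean': ['blue widget']}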


> They took data from so many outside sources that the “state” of the software could not be easily replicated at a later time.

By definition, that's a complex system, and reproducing errors would be equally complex.

A GPT author would produce that for every system. Worse, you would not be able to reproduce bugs in the author itself.

While humans do have bugs that cause them to misunderstand the problem, at least humans are similar enough for us to look at their wrong code and say "Hah, he thought the foobar worked with all frobzes, but it doesn't work with bazzed-up frobzes at all".

IOW, we can point to the reason the bug was written in the first place. With GPT systems it's all opaque - there's no reason or rhyme for why it emitted code that tried to work on bazzed-up frobzes the second time, and not the first time, or why it alternates between the two seemingly randomly ...


> They took data from so many outside sources that the “state” of the software could not be easily replicated at a later time.

Oh, I have fixed systems like those so that everything is deterministic and you can fake the state with a reasonably low amount of effort. It solved a few very important problems.

(But mine were data-integration problems. For operations-interdependence ones, the common advice is to write a fucking lot of observability into it. My favorite, less common piece of advice is "don't create it in the first place". I understand there are times you can do neither.)
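
The basic move for the determinism part is always the same: take the clock and the outside data sources as parameters instead of reaching for them globally. A generic sketch (not any particular system I worked on):

    # Determinism by injection: the clock and the upstream fetch are parameters,
    # so a test (or a bug reproduction) can pass fakes and replay an exact state.
    from datetime import datetime, timezone
    from typing import Callable

    def build_report(
        fetch_orders: Callable[[], list[dict]],
        now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    ) -> dict:
        orders = fetch_orders()
        return {"generated_at": now().isoformat(),
                "open": sum(1 for o in orders if o["open"])}

    # Production: build_report(fetch_orders=orders_api.fetch)
    # Reproducing a bug: replay a captured payload and freeze the clock.
    fake_orders = lambda: [{"id": 1, "open": True}, {"id": 2, "open": False}]
    fake_now = lambda: datetime(2023, 1, 1, tzinfo=timezone.utc)
    print(build_report(fetch_orders=fake_orders, now=fake_now))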


Wow I did not consider last ditch effort error handling, but that makes a lot of sense. Thank you for giving me something to think about!


Absolutely this. It's a solution looking for a problem.

If the developer task is really so trivial why not just have a human write actual code?

And even if it is actual code instead of a Rube Goldberg-esque restricted query service, I still don't think there's ever any time saved using AI for anything. Unless you also plan on assigning the code review to the AI, a human must be involved. To say that the reviews would be tedious is an understatement. Even the most junior developer is far more likely to comprehend their bug and fix it correctly. The AI is just going to keep hallucinating non-existent APIs, haphazardly breaking linter rules, and writing in plagiarized anti-patterns.


Guys, this is a joke. Don't take it so seriously. Literally the first thing in the README is a meme.


You may not take it seriously, and I may not take it seriously, but it takes one person to read this seriously, convince another person to invest, and then hire a third person and tell them, "make it so", for the joke to no longer be a joke.


A developer getting paid because an investor misunderstands a technology isn’t anything we need to get too worried about, I think. It seems to be a big part of our industry, and I don’t know if that’s ever going to change. I sometimes think of all the crapware dApps that got shoveled out in the last boom - little of meaning was created from a technical standpoint, but smart people got to do what they love to put bread on the table.

Perhaps I’m being overly simplistic, but I don’t see it as all that different from contractors getting paid to do silly and tasteless renos on McMansions. Objectively a bad way to reinvest one’s money, but it’s a wealth transfer in the direction I prefer, so I’ll hold my judgement.


Fair enough. I'm not going to complain much about money moving towards the workers, but I also hate obvious waste as a matter of principle. I also hate being dragged into bullshit work against my will.

I had a close call many years ago - my co-workers and I had to talk higher-ups out of a desperate attempt to add something, anything, that is even tangentially related to AI or blockchains, so either or both of those words could be used in an investor pitch...

That's when I fully grokked that buzzword-driven development doesn't happen because someone in management reads a HBR article and buys into the hype - it happens because someone in management believes the investors/customers buy into the hype. They're probably not wrong, but it still feels dirty to work on bullshit, so I steer clear.


Investors know to "sell the shovels" [to use a gold-rush concept] and are investing in well-diversified positions, which include the likes of the capacity behind GPT: NVIDIA, AMD, TSMC, MSFT, &c. These are the shovels that speculators must buy (or rent, via the kWh / price of someone else's GPT instance), and I assure you that is the case.


If somebody putting a few million into making this widespread were enough to make it a problem, then software development would already be doomed and we had better start learning woodwork right now.


The argument is stochastic. Maybe this joke will get ignored, but then we could've had the same conversation a few years ago about "prompt engineering" becoming a job, and here we are.

Or about launching a Docker container implementing a single, short-lived CLI command.

Or about all the other countless examples of ridiculously complicated and/or wasteful solutions to simple problems that become industry standards simply because they make it easier to do something quickly - all of them discussed/criticized regularly here and elsewhere, yet continuing to gain adoption.

Nah, our industry values development velocity much more than correctness, performance, ergonomics, or any kind of engineering or common sense.


> Maybe this joke will get ignored, but then we could've had the same conversation a few years ago about "prompt engineering" becoming a job, and here we are.

The joke is on all of us if we only treat this as a joke. Rails pioneered simple command-line templates and convention over configuration, and it took over the world for a while.

An AI as backend is the logical conclusion of that same trend.


The title is a play on "Attention Is All You Need", which is the paper that introduced transformers.


Thank you for this human-generated connection [I can still safely and statistically presume].

I already know personally how incredible GPT-like systems are and what they're capable of, and I've only "accepted" this future for about six weeks. I'm definitely having to process multitudes (beyond the technical) and start accepting that prompt engineering is real, and that there are about to be far more jobless than just the trucking industry lost to AI [the largest employer of males in the USA]; this is endemic.

The sky is falling. The sky is also blue. (This is the stupidest common question GPT is getting right now; instead ask "Why do people care that XYZ is blue/green/red/white/bad/unethical?")


Some good ideas here for when your tests are in a separate repo from the system under test (GPUs/drivers/compilers in the case of the author, but it's applicable to a variety of industries).


Tests in a separate repo is the worst anti-pattern I have seen. It’s extremely common that a change requires a change in the tests, but it’s impossible to correctly manage this situation if the tests can’t be updated in the same commit/PR.


The only time "tests in a separate repo" makes sense to me is if they are truly cross-functional end-to-end tests that exercise several systems.

Those tests should be as small as possible to verify that everything is still wired together correctly.

Everything else should be either unit tests or narrow integration tests between a small handful of components. And as you said, they should live in the repository of the software they test.
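
"As small as possible" meaning roughly this kind of thing (a sketch; the endpoints and field names are made up):

    # A minimal cross-system smoke test: is everything still wired together?
    # Base URL, endpoints and field names are hypothetical.
    import requests

    BASE = "https://staging.example.com"

    def test_order_flow_is_wired_up():
        assert requests.get(f"{BASE}/healthz", timeout=10).status_code == 200

        order = requests.post(f"{BASE}/api/orders", json={"sku": "demo"}, timeout=10)
        assert order.status_code == 201

        # The downstream fulfilment service should see the order too.
        status = requests.get(f"{BASE}/api/orders/{order.json()['id']}", timeout=10)
        assert status.json()["state"] in ("accepted", "fulfilled")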


I can't think of any project I've worked on where external test suites even make sense. I suppose it would work when you have a very clear spec or compliance document to write independent tests against, or if you're rewriting a system and need the public API to be bug-for-bug compatible with the old one, but other than those niche use cases I wouldn't want to keep the tests external at all.

Even if you do have external tests, you still need internal ones for the surface area your external tests don't check for. Unit tests and such don't make sense at all combined with a separate test repo.


Think systems integrators and compliance tests. I would imagine that each of the individual systems being "integrated" has its own unit tests, upstream, in its own repo.


In that case you have to release versions with compatibility for both the new and old way. At no point can I ever see it being a good idea to just let tests fail.


It also makes it impossible to test anything outside of the public API.

