lewisl9029's comments | Hacker News

This article is coming out at an interesting time for me.

We probably have different definitions of "budget", but I just ordered a super janky eGPU setup for my very dated 8th gen Intel NUC, with an m2->pcie adapter, a PSU, and a refurb Intel A770, for about $350 all-in. Not bad considering that's about the cost of a proper Thunderbolt eGPU enclosure alone.

The overall idea: the A770 seems like a really good budget LLM GPU, since it has more memory (16GB) and more memory bandwidth (512GB/s) than a 4070 but costs a tiny fraction of the price. The m2->pcie adapter should also give it a bit more bandwidth to the rest of the system than Thunderbolt would, so hopefully it'll make for a decent gaming experience too.

If the eGPU part of the setup doesn't work out for some reason, I'll probably just bite the bullet, order the rest of the PC for a couple hundred more, and return the m2->pcie adapter (I got it off Amazon instead of Aliexpress specifically so I could do this), ending up somewhere around $600 total. I think that's probably a more reasonable price of entry for something like this for most people.

Curious if anyone else has experience with the A770 for LLMs? I've been looking at Intel's https://github.com/intel/ipex-llm project and it looks pretty promising; that's what made me pull the trigger in the end. Am I making a huge mistake?
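
For context, the usage shown in the ipex-llm docs looks roughly like this (a rough sketch, not yet tested on this hardware; the model name is just an example and the package layout has shifted between releases, so check the docs for your version):

    # Sketch of ipex-llm's transformers-style flow on an Intel GPU ("xpu" device).
    # Model and options are illustrative only.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM  # drop-in wrapper with low-bit loading

    model_id = "Qwen/Qwen2-1.5B-Instruct"  # example model; pick whatever fits in 16GB
    model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, trust_remote_code=True)
    model = model.to("xpu")  # Arc GPUs show up as the "xpu" device
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    with torch.inference_mode():
        inputs = tokenizer("Is the A770 a decent budget LLM GPU?", return_tensors="pt").to("xpu")
        output = model.generate(inputs.input_ids, max_new_tokens=64)
        print(tokenizer.decode(output[0], skip_special_tokens=True))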


> refurb Intel A770 for about 350

I'm seeing A770s for about $500-$550. Where did you find a refurb one for $350 (or less, since you're also including other parts of the system)?


I got this one from Acer's eBay store for $220: https://www.ebay.com/itm/266390922629

It's out of stock now unfortunately, but it does seem to pop up again from time to time according to Slickdeals: https://slickdeals.net/newsearch.php?q=a770&pp=20&sort=newes...

I would probably just watch the listing and/or set up a deal alert on Slickdeals and wait. If you're in a hurry though, you can probably find a used one on eBay for not too much more.


I had a somewhat similar experience trying to use LLMs to do OCR.

All the models I've tried (Sonnet 3.5, GPT-4o, Llama 3.2, Qwen2-VL) have been pretty good at extracting text, but they failed miserably at finding bounding boxes, usually just making up random coordinates. I thought this might have been due to internal resizing of images, so I tried to get them to use relative, percentage-based coordinates, but no luck there either.

Eventually I gave up and went back to good old PP-OCR models (are these still state of the art? I'd love to try out some better ones). The actual extraction feels a bit less accurate than the best LLMs, but bounding box detection is pretty much spot on all the time, and it's literally several orders of magnitude more efficient in terms of memory and overall energy use.
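
For anyone curious, the PP-OCR route is only a few lines via PaddleOCR (a sketch from memory; the exact arguments and result structure vary a bit between PaddleOCR versions):

    # Sketch: text detection + recognition with PP-OCR models via PaddleOCR.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr("page.png", cls=True)

    for box, (text, confidence) in result[0]:
        # box is four corner points: [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]
        print(box, text, confidence)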

My conclusion was that current gen models still just aren't capable enough yet, but I can't help but feel like I might be missing something. How the heck did Anthropic and OpenAI manage to build computer use if their models can't give them accurate coordinates of objects in screenshots?


LLMs are inherently bad at this due to tokenization, scaling, and lack of training on the task. Anthropic's computer use feature has a specialized model for pixel-counting:

> Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. [1]

For a VLM trained on identifying bounding boxes, check out PaliGemma [2].

You may also be able to get the computer use API to draw bounding boxes if the costs make sense.

That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. Depends on the dataset and problem.

1. https://www.anthropic.com/news/developing-computer-use 2. https://huggingface.co/blog/paligemma


PaliGemma on computer use data is absolutely not good. The difference between a FT YOLO model and a FT PaliGemma model is huge if generic bboxes are what you need. Microsoft's OmniParser also winds up using a YOLO backbone [1]. All of the browser use tools (like our friends at browser-use [2]) wind up trying to get a generic set of bboxes using the DOM and then applying generative models.

PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.

[1] https://huggingface.co/microsoft/OmniParser?language=python [2] https://github.com/browser-use/browser-use


Maybe it's still worth it to separate the tasks: use a traditional text detection model to find bounding boxes, then crop the images. In a second stage, send those cropped samples to the higher-powered LLMs to do the actual text extraction, and don't rely on them for bounding boxes at all.
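
Something roughly like this, sketched with Pillow (the box coordinates here are made up, and the detector and LLM calls are placeholders for whatever you actually use):

    # Two-stage sketch: a traditional detector supplies boxes, crops go to a stronger model.
    from PIL import Image

    def crop_regions(image_path, boxes):
        """boxes: (left, top, right, bottom) pixel rectangles from your text detector."""
        image = Image.open(image_path)
        return [image.crop(box) for box in boxes]

    # e.g. boxes from PP-OCR's detection stage
    crops = crop_regions("page.png", [(40, 32, 410, 64), (40, 80, 390, 112)])
    for i, crop in enumerate(crops):
        crop.save(f"crop_{i}.png")  # then send each crop to the LLM/VLM for transcription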

There are some VLMs that seem to be specifically trained to do bounding box detection (Moondream comes to mind as one that advertises this?), but in general I wouldn't be surprised if none of them work as well as traditional methods.


We've run a couple of experiments and found that our open vision language model, Moondream, works better than YOLOv11 in general cases. If accuracy matters most, it's worth trying the vision language model; if you need real-time results, you can train YOLO models on data generated by Moondream. We have a Space for video redaction (which is just object detection) on our Hugging Face, and an online playground to try it out.
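
Roughly, detection via the Hugging Face checkpoint looks something like this (a sketch; the exact helper names and output format depend on the model revision):

    # Sketch: object detection with the moondream2 checkpoint via transformers.
    from PIL import Image
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("vikhyatk/moondream2", trust_remote_code=True)
    image = Image.open("screenshot.png")

    result = model.detect(image, "text field")  # detected regions with normalized coordinates
    print(result)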


AFAIK none of those models have been trained to produce bounding boxes. Gemini Pro, on the other hand, has, so it may be worth looking at for your use case:

https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
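
A sketch of what that looks like with the Python SDK (the prompt and parsing here are ad hoc; per Google's docs, Gemini returns [ymin, xmin, ymax, xmax] boxes normalized to a 0-1000 grid):

    # Sketch: asking Gemini for bounding boxes via the google-generativeai SDK.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-pro")
    image = Image.open("page.png")

    response = model.generate_content([
        image,
        "Return a bounding box for every line of text as JSON: "
        '[{"box_2d": [ymin, xmin, ymax, xmax], "text": "..."}]',
    ])
    print(response.text)  # parse the JSON, then rescale from the 0-1000 grid to pixel coordinates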


I am doing OCR on hundreds of PDFs using AWS Textract. It requires me to convert each page of the PDF to an image and then analyze the image, and it works well for converting to markdown format (which requires custom code). I want to try some vision models and compare how they do, for example Phi-3.5-vision-instruct.
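
For anyone wondering, the per-page loop is roughly this (a sketch with pdf2image and boto3; turning the returned blocks into markdown is where the custom code comes in):

    # Sketch: render each PDF page to an image, then run Textract's text detection on it.
    import io

    import boto3
    from pdf2image import convert_from_path  # needs poppler installed

    textract = boto3.client("textract")

    for page_number, page in enumerate(convert_from_path("document.pdf", dpi=200), start=1):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        response = textract.detect_document_text(Document={"Bytes": buffer.getvalue()})
        lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
        print(page_number, lines)  # custom code then assembles these blocks into markdown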


1. You need to look into the OCR-specific DL literature (e.g. UDOP) or segmentation-based approaches (e.g. Segment Anything)

2. BigTech and SmallTech train their fancy bounding box / detection models on large datasets that have been built using classical detectors and a ton of manual curation


> they failed miserably at finding bounding boxes, usually just making up random coordinates.

This makes sense to me. These LLMs likely have no statistics about the spatial relationships of tokens in a 2D raster space.


The spatial awareness is what grounding models try to achieve, e.g. UGround [1]

[1] https://huggingface.co/osunlp/UGround-V1-7B?language=python


Gemini 2 can purportedly do this; you can test it with the Spatial Understanding starter app inside AI Studio. The only caveat is that it's not production ready yet.


I think people have had success using PaliGemma for this. The computer use products probably use fine-tuned versions of the models rather than the base ones.


Relatedly, we find LLM vision models absolutely atrocious at counting things. We build school curricula, and one basic task for our activities is counting – blocks, pictures of ducks, segments in a chart, whatever. Current LLM models can't reliably count four or five squares in an image.


IMHO, that is expected, at least for the general case.

That is one of the implications of transformers being DLOGTIME-uniform TC0: they don't have access to counter analogs.

You would need to move to log-depth circuits, add mod-p_n gates, etc., unless someone finds some new mathematics.

Proposition 6.14 in Immerman is where this is lost if you want a cite.

It will be counterintuitive that division is in TC0, but (general) counting is not.


Have you played with Moondream? It's a pretty cool small vision model that did a good job with bounding boxes when I played with it.


Thanks for the shout out :)


Yeah I really struggle when I use my hammer to screw pieces of wood together too.


Congrats on the launch! :)

Apparently I signed up for Instant previously but completely forgot about it. I only realized I had an account when I went to the dashboard and found myself still logged in. I dug up the sign-up email and apparently I signed up back in 2022, so some kind of default invalidation period on your auth tokens would definitely make me a bit more comfortable.

Regardless, I'm still as excited about the idea of a client-side, offline-first, realtime syncing db as ever, especially now that the space has really been picking up steam with new entrants showing up every few weeks.

One thing I was curious about is how well the system currently supports users with multiple emails? GitHub popularized this pattern, and these days it's pretty much table stakes in the dev tools space to be able to sign in once and use the same account across personal accounts and orgs associated with different emails.

Looking at the docs I'm getting the sense that there might be an assumption of 1 email per user in the user model currently. Is that correct? If so, any plans to evolve the model to become more flexible?


Noted about the refresh tokens, thank you!

> One thing I was curious about is how well the system currently supports users with multiple emails? GitHub popularized this pattern, and these days it's pretty much table stakes in the dev tools space to be able to sign in once and use the same account across personal accounts and orgs associated with different emails

Right now there is an assumption of 1 `user` object per email. You could create an entity like `workspace` inside Instant, and tie multiple users together this way for now.

However, making the `user` support multiple identities, and creating recipes for common data models (like workspaces) is on the near-term roadmap.


This is the only real problem I have with how they handled the situation. But it's a big problem.

Whether or not these folks on free plans are ever going to convert to paid, they trusted PlanetScale to serve as a critical building block for their project/business. I think the least they could do is ease the transition by offering them a reasonable amount of time to offboard.

I personally would never trust critical infra to a company that has ever abruptly terminated a product offering with only 1 month notice.


The mental model that you get critical infra for free is wrong.


How many months' notice would be expected by industry standards?


Come on. It's a free offering. As the saying goes, you get what you pay for!


Have been looking forward to this release for quite a while! Huge props to the Cloudflare team for putting this out there!

I've been operating a cluster of NGINX nodes on Fly.io, using njs (NGINX's custom JS scripting engine) for all of my custom routing logic, and I've really been feeling its limitations (I had to spin up a separate companion app in Node.js to work around some of them). Having access to the entirety of the Rust language and ecosystem to customize routing behavior sounds incredibly compelling!

I did a quick scan of the codebase and couldn't see anything around disk caching like in NGINX, only memory caching. Curious whether Cloudflare is operating all their production nodes with memory caching as opposed to disk caching at the moment?

I'd love to see an option for disk caching for use cases that are a bit more cost sensitive.


Their cache is likely an entirely different per-PoP service that operates as a network service.


It's been a while, but the last time I checked, write latency on R2 was pretty horrendous: close to 1s, compared to S3's <100ms, tested from my laptop in SF. I wouldn't be surprised if they've made progress on this front, but definitely dig deeper if your workload is sensitive to write latency.

Another issue (that probably contributes directly to the write latency) is region selection and replication. S3 just offers a ton more control here. I have a bunch of S3 buckets replicating asynchronously across regions around the world to enable fast writes everywhere (my use case can tolerate eventual consistency). R2 still seems very light on region selection and replication options. Kinda disappointing since they're supposed to be _the_ edge company.


Vertical tabs in Edge seems to trigger false positives on this. Really hope that's not the only heuristic they're using.


The same goes for Sidebery in Firefox, but then it changes to "no" if I open the dev tools. As a non-web-dev, I find this behaviour truly weird.


Looking at the code [0], it just checks whether the difference between the browser's outer window size and the inner viewport exceeds a 170px threshold (defined on line 13) in either dimension, and assumes devtools are open if width or height is over the limit, unless both are, in which case it assumes they aren't. So when you open a second panel, both dimensions end up past the threshold and the tool flips back to "no".

The detection is hilariously primitive, entirely unreliable, and only knows about your devtools directly if you're using Firebug.

[0]: https://github.com/sindresorhus/devtools-detect/blob/main/in...


Great list!

Related to your point about supporting 0-transpilation workflows as a first-class citizen: Flow was explicitly just type annotations and checking, and aimed to introduce zero runtime constructs.

This is something Flow did from the beginning, whereas TypeScript only established it as a non-goal after already implementing several runtime constructs that it can no longer remove for backwards compatibility [1].

Though interestingly enough, Flow themselves recently announced they're going to start introducing runtime constructs, which is an interesting plot twist [2].

[1] https://github.com/Microsoft/TypeScript/wiki/TypeScript-Desi...

[2] https://medium.com/flow-type/clarity-on-flows-direction-and-...


In another plot twist, TypeScript is supporting the Types as Comments TC39 proposal, which would provide a true 0-transpilation workflow.

https://devblogs.microsoft.com/typescript/a-proposal-for-typ...


In case you're open to something for JS instead of Python, my life has been much better since I switched from AHK to nutjs for my own automation scripts: https://nutjs.dev/

A real programming language, and support for multiple platforms!


Alternatively, if you're into .NET, FlaUI is amazing for automation and gives you a sane environment. I moved to it when I couldn't deal with AutoIt scripts anymore, and it's everything I needed. https://github.com/FlaUI/FlaUI


Wow, and here I thought I was the only one to make this move from Firefox to Edge, precisely because of the excellent native vertical tabs functionality.

It's not perfect, but the level of UX polish and the first-class integration with other browser features is unmatched by any Firefox vertical tabs extension I've tried.

I still can't believe the other browsers haven't caught on to the fact that power users want vertical tabs and are willing to switch browsers for it after so many years of Firefox extensions showing the way...


Orion (Mac only) has a great implementation, and in my experience has been quite fast. It's also supposed to be good on battery life, though I've not really tested it.

