This is a problem with Ubuntu and not VirtualBox. I spent a few hours today fighting with 24.04.2 in libvirt only to realize that they are using some fancy new GUI library for the installer which crashes on all VMs: https://www.dell.com/support/kbdoc/en-us/000123893/manual-no...
Mandatory Ubuntu considered harmful.
If only NVidia considered Debian a first class distribution so I never had to use Ubuntu again.
Sure, but everyone knows humans end up bringing down the database too by writing an innocent looking test query nobody else blinks at, which is why you end up needing a testing strategy for ANY SQL before YOLO'ing into prod.
To offer a 3rd option - what testing pipeline? Incompetent managers aren't going to approve of developers "wasting their time" on writing high quality tests.
It always is for the first week. Then you find out that the last 10% matter a lot more than the other 90%. And finally they turn off the high-compute version and you're left with a brain-dead model that loses to a 32b local model half the time.
If a user eventually creates half a dozen projects with an API key for each, and prompts Gemini side-by-side under each key, and only some of the responses are consistently terrible…
Would you expect that to be Google employing cost-saving measures?
A PDF corpus with a size of 1 TB can mean anything from 10,000 really poorly scanned documents to 1,000,000,000 nicely generated LaTeX PDFs. What matters is the number of documents, and the number of pages per document.
For the first, I can run a segmentation model + traditional OCR in a day or two for the cost of heating my office in winter. For the second, you'd need a few hundred dollars and a cloud server.
Feel free to reach out. I'd be happy to have a chat and do some pro-bono work for someone building an open source tool chain and index for the rest of us.
>replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.
As someone who had to build custom tools because VLMs are so unreliable: anyone that uses VLMs on unprocessed images is in for more pain than all the providers that let LLMs interact directly with consumers without guard rails.
They are very good at image labeling. They are ok at very simple documents, e.g. single column text, centered single level of headings, one image or table per page, etc. (which is what all the MVP demos show). They need another trillion parameters to become bad at complex documents with tables and images.
Right now they hallucinate so badly that you simply _can't_ use them for something as simple as a table with a heading at the top, data in the middle and a summary at the bottom.
I've worked on this in my day job: extracting _all_ relevant information from a financial services PDF for a BERT-based search engine.
The only way to solve that is with a segmentation model followed by a regular OCR model and whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.
What worked was using DocLayNet-trained YOLO models to get the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet If you don't care about anything but text, you can feed the results into tesseract directly (but for the love of god read the manual). Congratulations, you're done.
Here are some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the resolution from ~700px to ~2100px horizontal for financial data segmentation.
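The segment-then-OCR pipeline described above can be sketched in a few lines. This is a minimal illustration, assuming the ultralytics package and a DocLayNet-trained YOLO checkpoint; the weights path "yolov8-doclaynet.pt" is a placeholder, not a real release name, and the row-bucketing constant in the reading-order helper is an arbitrary choice:

```python
def reading_order(boxes):
    """Sort (x0, y0, x1, y1) boxes roughly top-to-bottom, then left-to-right.
    Buckets y-coordinates so boxes on the same visual row sort left-to-right."""
    return sorted(boxes, key=lambda b: (round(b[1] / 50), b[0]))

def ocr_text_regions(page_path, model_path="yolov8-doclaynet.pt"):
    # Third-party dependencies imported lazily so the pure helper above
    # stays importable without them.
    from ultralytics import YOLO
    from PIL import Image
    import pytesseract

    model = YOLO(model_path)
    page = Image.open(page_path)
    # High input resolution, per the ~2100px horizontal note above.
    result = model(page, imgsz=2016)[0]
    text_boxes = [
        tuple(b.xyxy[0].tolist())
        for b in result.boxes
        if result.names[int(b.cls)] in ("Text", "Title", "Section-header")
    ]
    # Crop each detected region and OCR it separately,
    # instead of feeding tesseract the whole raw page.
    return [pytesseract.image_to_string(page.crop(box))
            for box in reading_order(text_boxes)]
```

The class names ("Text", "Title", "Section-header") follow the DocLayNet label set; tables and formulas would be routed to their own specialized models instead.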
VLMs, on the other hand, still choke on long text and hallucinate unpredictably. Worse, they can't understand nested data. If you give _any_ current model nothing harder than three nested rectangles with text under each, they will not extract the text correctly. Given that "nested rectangles" describes every table, no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all, a mining company should own a dozen bulldozers, right? And if they each cost $35,000, it must be an amazing deal they got, right?
That looks like a pretty good starting point, thanks. I've been dabbling in vision models but need a much higher degree of accuracy than they seem able to provide, opting instead for more traditional techniques and handling errors manually.
For non-table documents, a fine-tuned YOLOv8 + tesseract with _good_ image pre-processing has basically a zero percent error rate on monolingual texts. I say basically because the training data has worse labels than what the multi-model system gives out in the cases that I double-checked manually.
But no one reads the manual on tesseract and everyone ends up feeding it garbage, with predictable results.
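The "feeding it garbage" failure mode is usually skipping the basic pre-processing the tesseract docs call for (adequate resolution, grayscale, binarization). A minimal sketch of that step; the target width and threshold here are illustrative defaults, not tuned values from the pipeline above:

```python
from PIL import Image

def preprocess_for_tesseract(img, target_width=2500, threshold=180):
    """Upscale small scans, convert to grayscale, and binarize,
    roughly following tesseract's image-quality recommendations."""
    if img.width < target_width:
        scale = target_width / img.width
        img = img.resize((target_width, int(img.height * scale)), Image.LANCZOS)
    gray = img.convert("L")
    # Hard threshold: background -> white (255), ink -> black (0).
    return gray.point(lambda p: 255 if p > threshold else 0, mode="1")
```

Real scans often also need deskewing and border removal; a fixed global threshold is the crudest option, and adaptive binarization does better on uneven lighting.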
Tables are an open research problem.
We started training a custom version of this model: https://arxiv.org/pdf/2309.14962 but there wasn't a business case, since the BERT search model dealt well enough with the word soup that came out of EasyOCR. If you're interested, drop me a line. I'd love to get a model like that trained, since it's very low hanging fruit that no one has done right.
The first thing I did when I saw this thread was ctrl-f for doclaynet :)
I've been at this problem since 2013, and a few years ago turned my findings into more of a consultancy than a product. See https://pdfcrun.ch
However, due to various events, I burned out recently and took a permie job, so I'd love to stick my head in the sand and play video games in my spare time - but I was secretly hoping you'd see this, and I'd love to hear about your work.
DocLayNet is the easy part, and at triple the usual resolution the previous generation of YOLO models has solved document segmentation for every document I've looked at.
The hard part is the table segmentation. I don't have the budget to do a proper exploration of hyperparameters for the GridFormer models before starting a $50,000 training run.
This is a back-burner project, along with speaker diarization. I have no idea why those haven't been solved, since they are very low hanging fruit that would unlock tens of millions in productivity when deployed at scale, but regardless I can't justify buying an Nvidia DGX H200 and spending two months exploring architectures for each.
When I realized how powerful TRAMP was, I don't think I ever used screen/tmux again. I'm sure there are uses, mind. Just TRAMP fully hit all of my needs.
It really is magical, isn’t it? And although I rarely need to use it, I love the multihop setups where you can ssh to this system, then ssh again to this other, then mount an SMB filesystem using these credentials, and start editing.
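For reference, TRAMP chains hops in the file name itself with `|`; a multi-hop path like the one below (host names are made up for illustration) edits a remote file two SSH hops away, and recent TRAMP versions also allow an smb hop as the final leg:

```
/ssh:alice@bastion|ssh:alice@inner-host:/etc/nginx/nginx.conf
```

Open that with `C-x C-f` (or `find-file`) and Emacs handles the authentication for each hop in turn.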
>And while a VC fund is limited in what it can do in providing open-ended freedom. It can try to provide a meaningful simulacrum of that space and community, which is why I’m so excited about programs like 1517’s Flux that invests $100k in people, no questions asked and lets them explore for a few months without demanding KPIs or instantaneous progress.
>>You can move to the United States. (We will help with visas.)
This is no longer viable for anyone who isn't already a US citizen. I'm not sure how serious that VC is about investing in individuals, but from talking to 16 to 22 year olds, _none_ of them want to move to the US with ICE deporting students for saying the wrong thing online - or at least the perception that it does. US universities and businesses are suffering from a brain drain that, unless reversed in the next 3 years, will be a drag on the US economy for decades.
There should be a name for the phenomenon where people upset about some injustice pick the least plausible example to use as the cause célèbre of the injustice.
For a more modern take I can't understand why Daniel Shaver is not the face of police murder in the US. The video is on YouTube, you can find the unedited version with a Google search. There is no benefit of the doubt to give. It was straight up murder done on live cam. The more you read the worse it gets.
But it got buried in a week and no one remembers it.
It's unfortunate that the shooter was not convicted, but the mere fact that there was an investigation and a trial differentiates it from a lot of police violence causes célèbres.
That doesn't resonate with my experience. People know about the murder, but aren't sure what to do.
The murderer, who clearly had mental health issues (e.g., having "you're fucked" on the dust cover of his personal AR-15, which he used to commit the act), was acquitted (in a trial of strange circumstances). It's baffling that none of his colleagues - who saw the message on his weapon - ever pulled him aside to ask if he was OK.
And anyway, what does this have to do with your point of holding up an unlikely / outlying example to demonstrate a phenomenon?
His colleagues likely didn't find the dust cover noteworthy. Within contemporary American gun culture, it would seem like a minor bit of braggadocio akin to a "Protected by Smith & Wesson" sticker or a "Warning: We Don't Dial 911" placard; tacky and unprofessional, but not something to take seriously. There's a whole little industry around AR-15 customization, offering thousands of options for engraved dust covers with all kinds of symbols and messages:
I am not remotely aware of this case. How do those words, or any words, on a gun case/cover relate to mental health issues? This isn't a manifesto; it is more like a "guard dog? Beware of owner!" decal, or a Calvin pissing on a coexist sticker. Or truck nuts. These might be distasteful to some but are unrelated to mental health. I'd be more worried about my former neighbor who had an unhealthy love of maglite flashlights and owned like 50 of 'em. _That_ was strange.
The "you're fucked" was written on the inside of the ejection port dust cover so that it would become visible after the weapon was fired. The implication is that he was eager to shoot someone.
> There should be a name for the phenomenon where people upset about some injustice pick the least plausible example to use as the cause célèbre of the injustice.
The surest way to get flamed out of any crypto mailing list was to ask what the effective clearance rate for the coin was, then follow it up with how it could be sped up.
Today the bitcoin network is still stuck at ~7 transactions a second.
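The ~7 transactions per second figure falls straight out of the protocol's constants. A back-of-the-envelope check, where the average transaction size is an assumption (real blocks vary, and SegWit shifts these numbers):

```python
# Rough Bitcoin throughput from protocol constants:
# ~1 MB of base block space every ~10 minutes.
BLOCK_BYTES = 1_000_000
BLOCK_INTERVAL_S = 600
AVG_TX_BYTES = 250  # assumption, not a protocol constant

txs_per_block = BLOCK_BYTES // AVG_TX_BYTES
txs_per_second = txs_per_block / BLOCK_INTERVAL_S
print(f"{txs_per_block} txs/block, ~{txs_per_second:.1f} tx/s")
```

With those assumptions you get about 4,000 transactions per block and roughly 7 tx/s, which matches the commonly quoted ceiling.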
Which one are you referring to? What do you want - all the freedom in the world and no effort required to run a decentralized node?
Staking blockchains don't require many resources.
The ones that allow hundreds of txs per second, making verification of the entire tx history orders of magnitude harder. The limited tx throughput of bitcoin is a feature, not a bug.