Hacker News | tlofreso's comments

Demo = impressed.

How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like: decade-old SharePoint sites, PDFs with partial text layers, Excel, email .msg files, etc...


Yes, integration in complex legacy systems is always challenging. As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure. As SeekStorm is open-source, system integrators can take it from there.


Same as any other full-text search solution - it's your job to integrate it.


>Demo = impressed.

How did you demo? Did you spin up your own instance and index the Wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already-running instance.

Never mind, found that someone posted a link already.


On that topic, can anybody chime in on the state of the art in PDF OCR? Even if that's a multimodal LLM: I've used ChatGPT to extract tabular data from images, but I need something I can self-host for proprietary data.


Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.

What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.

[0] https://learn.microsoft.com/en-us/azure/ai-services/document...

[1] https://learn.microsoft.com/en-us/azure/ai-services/document...


Peculiar, thanks!


>accurate document extraction is becoming a commodity with powerful VLMs

Agree.

The capability is fairly trivial for orgs with decent technical talent. The tech / processes all look similar:

User uploads file --> Azure prebuilt-layout returns .MD --> prompt + .MD + schema sent to LLM --> JSON returned. Do whatever you want with it.
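
The flow above can be sketched in a few lines of Python. The OCR and LLM calls are left as comments since they're just HTTP requests to whichever services you use (the `call_layout_model` / `call_llm` names are hypothetical placeholders); the prompt assembly and JSON parsing are the parts worth getting right:

```python
import json

def build_extraction_prompt(markdown: str, schema: dict) -> str:
    """Assemble the prompt: instructions + target schema + document markdown."""
    return (
        "Extract the fields described by this JSON schema from the document.\n"
        "Return ONLY valid JSON.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Document:\n{markdown}"
    )

def parse_llm_json(response_text: str) -> dict:
    """LLMs often wrap JSON in ``` fences; strip them before parsing."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text)

# The full pipeline would then be:
#   md = call_layout_model(uploaded_file)                 # hypothetical OCR call
#   raw = call_llm(build_extraction_prompt(md, schema))   # hypothetical LLM call
#   data = parse_llm_json(raw)
```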


Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates it a lot today is complex inputs. For simple 1-2 page PDFs, what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in ways I described in another comment.


Are really large inputs solved at Midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15 pages, and I don't see any marketing around long-context or complex inputs on the site.

I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works OK.
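
For what it's worth, a minimal sketch of one way to divvy up the context (greedy packing on markdown heading boundaries; my own illustration, not Midship's actual approach):

```python
def chunk_markdown(md: str, max_chars: int = 8000) -> list[str]:
    """Greedily pack heading-delimited sections into chunks under a size budget.

    Splitting on section boundaries (rather than fixed offsets) reduces the
    chance of cutting a table in half mid-extraction.
    """
    # First, split the document into sections at markdown headings.
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Then pack whole sections into chunks that fit the budget.
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += sec
    if buf:
        chunks.append(buf)
    return chunks
```

Each chunk then goes through the extraction prompt separately, and the per-chunk results get merged (and de-duplicated) afterward.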

I do like the UI you appear to have for citing information: drawing polygons around the data, and showing where they appear in the PDF. Nice.


Why all those steps? Why not just file + prompt to JSON directly?


Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.


And if yes, be specific in answering. Emails are a bear! Emails can have several file types as attachments, including other emails, zip files, and in-line images where position matters for context.
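
To illustrate why nested attachments get hairy, here's a sketch using Python's stdlib `email` package that walks an email's attachments and recurses into attached emails (zip members would need a similar pass with `zipfile`, and in-line images a walk over the inline parts):

```python
from email.message import EmailMessage

def collect_attachments(msg: EmailMessage, depth: int = 0) -> list[tuple[int, str]]:
    """Recursively list (depth, name) for every attachment, descending into
    attached emails (message/rfc822 parts can nest arbitrarily deep)."""
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            found.append((depth, "attached-email"))
            inner = part.get_content()  # the attached EmailMessage itself
            found.extend(collect_attachments(inner, depth + 1))
        else:
            found.append((depth, part.get_filename() or part.get_content_type()))
    return found
```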


Congrats on the launch... You're in a crowded space. What differentiates Midship? What are you doing that's novel?


Cofounder here.

Great Q - there is definitely a lot of competition in dev-tool offerings, but less so in end-to-end experiences for non-technical users.

Some of the things we offer above and beyond dev tools:

1. Schema building to define "what data to extract"
2. A hosted web app to review, audit and export extracted data
3. Integrations into downstream applications like spreadsheets

Outside of those user facing pieces, the biggest engineering effort for us has been in dealing with very complex inputs, like 100+ page PDFs. Just dumping into ChatGPT and asking nicely for the structured data falls over in both obvious (# input/output tokens exceeded) and subtle ways (e.g. missing a row in the middle of the extraction).
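
As a concrete example of catching the subtle case, one cheap sanity check (my own sketch, not necessarily what Midship does) is to compare the number of data rows in the layout model's markdown tables against the number of records the LLM returned:

```python
def count_table_data_rows(md: str) -> int:
    """Rough completeness signal: count data rows across markdown tables.
    Assumes well-formed tables (one header row + one |---| separator each)."""
    pipe_rows = sep_rows = 0
    for line in md.splitlines():
        s = line.strip()
        if not (s.startswith("|") and s.endswith("|")):
            continue
        cells = [c.strip() for c in s.strip("|").split("|")]
        if cells and all(c and set(c) <= set("-:") for c in cells):
            sep_rows += 1   # a |---|:---:| style separator row
        else:
            pipe_rows += 1  # header or data row
    # Each separator implies one table, hence one header row to subtract.
    return pipe_rows - sep_rows

def check_extraction(md: str, records: list) -> bool:
    """Flag extractions that returned fewer records than the source tables hold."""
    return len(records) >= count_table_data_rows(md)
```

If the check fails, re-run the affected chunk rather than trusting the short output.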


I came across Bert Hubert during covid because of his incredible work on this article: https://berthub.eu/articles/posts/reverse-engineering-source...

Long before Bert was writing articles on the source code of mRNA vaccines, he helped build PowerDNS. He talks about that in a three part series starting here: https://berthub.eu/articles/posts/history-of-powerdns-1999-2...

A fascinating individual...

https://fosstodon.org/@bert_hubert

https://github.com/berthubert

https://berthub.eu/


blush :-)


I love the effort Ford is putting into their EV offerings, and Farley seems to get it better than the other legacy auto CEOs. Ford leveraged their two most powerful brands (F-150 / Mustang) to enter the EV market. I contrast this with GM's return to the space with Hummer... a brand they discontinued after a tumultuous 2008.

That said, Ford aligning on 150 kW chargers across their EV portfolio is a miss. I really hope 250 kW is on the roadmap for next-gen Ford EVs when they adopt NACS.


Their vehicle architecture & derating setup only seems to max out at 150 kW so far?

I've been researching the Ford Mustang Mach-E: the NCM li-ion batteries max out around 115 kW charging, and the newer LFP ones max out at 150 kW. Expect the next gen to be different, though.

At least in the earlier models, there's supposedly less temperature sensing than in, say, a Tesla, so the car can't safely push a higher charging speed.


Bjorn is the GOAT for EV testing



I was excited about Colorado's law, though for the most part there's very little value. Companies either exclude candidates from that state [1], or provide cop-out salary range information of $x - $3x/yr [2]

1: https://twitter.com/digitalocean/status/1395818629657149445

2: https://www.pwc.com/us/en/careers/coloradoifsseniormanager.h...


It's a cop-out range because, yes, it's a cop-out, but also because they're just looking for an IC who's good and aren't micromanaging levels.


Nice work! I built something very similar: https://recipemincer.com

It seems you're using the same Python scraper I am: https://github.com/hhursev/recipe-scrapers

