Hacker News | tlofreso's comments

Demo = impressed.

How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like: decade-old SharePoint sites, PDFs with partial text layers, Excel, email .msg files, etc...


Yes, integration in complex legacy systems is always challenging. As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure. As SeekStorm is open-source, system integrators can take it from there.


Same as any other full-text search solution - it's your job to integrate it.


>Demo = impressed.

How did you demo? Did you spin up your own instance and index the Wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already-running instance.

Never mind, found that someone posted a link already.


On that topic, can anybody chime in on the state of the art in PDF OCR? Even if that's a multimodal LLM: I've used ChatGPT to extract tabular data from images, but I need something I can self-host for proprietary data.


Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.

What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.

[0] https://learn.microsoft.com/en-us/azure/ai-services/document...

[1] https://learn.microsoft.com/en-us/azure/ai-services/document...


Peculiar, thanks!


>accurate document extraction is becoming a commodity with powerful VLMs

Agree.

The capability is fairly trivial for orgs with decent technical talent. The tech / processes all look similar:

User uploads file --> Azure prebuilt-layout returns .MD --> prompt + .MD + schema sent to LLM --> JSON returned. Do whatever you want with it.
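
The flow above can be sketched in a few lines of Python. The OCR and LLM calls are left as comments since they're just HTTP requests to whichever services you use (the `call_layout_model` / `call_llm` names are hypothetical placeholders); the prompt assembly and JSON parsing are the parts worth getting right:

```python
import json

def build_extraction_prompt(markdown: str, schema: dict) -> str:
    """Assemble the prompt: instructions + target schema + document markdown."""
    return (
        "Extract the fields described by this JSON schema from the document.\n"
        "Return ONLY valid JSON.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Document:\n{markdown}"
    )

def parse_llm_json(response_text: str) -> dict:
    """LLMs often wrap JSON in ``` fences; strip them before parsing."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(text)

# The full pipeline would then be:
#   md = call_layout_model(uploaded_file)                 # hypothetical OCR call
#   raw = call_llm(build_extraction_prompt(md, schema))   # hypothetical LLM call
#   data = parse_llm_json(raw)
```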


Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates it a lot today is complex inputs. For simple 1-2 page PDFs, what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in ways I described in another comment.


Are really large inputs solved at Midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15 pages, and I don't see any marketing around long-context or complex inputs on the site.

I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works OK.
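
For what it's worth, a minimal sketch of one way to divvy up the context (greedy packing on markdown heading boundaries; my own illustration, not Midship's actual approach):

```python
def chunk_markdown(md: str, max_chars: int = 8000) -> list[str]:
    """Greedily pack heading-delimited sections into chunks under a size budget.

    Splitting on section boundaries (rather than fixed offsets) reduces the
    chance of cutting a table in half mid-extraction.
    """
    # First, split the document into sections at markdown headings.
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Then pack whole sections into chunks that fit the budget.
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += sec
    if buf:
        chunks.append(buf)
    return chunks
```

Each chunk then goes through the extraction prompt separately, and the per-chunk results get merged (and de-duplicated) afterward.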

I do like the UI you appear to have for citing information: drawing polygons around the data, and showing where they appear in the PDF. Nice.


Why all those steps? Why not just file + prompt to JSON directly?


Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.


And if yes, be specific in answering. Emails are a bear! Emails can have several file types as attachments, including other emails, zip files, and in-line images where position matters for context.
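
To illustrate why nested attachments get hairy, here's a sketch using Python's stdlib `email` package that walks an email's attachments and recurses into attached emails (zip members would need a similar pass with `zipfile`, and in-line images a walk over the inline parts):

```python
from email.message import EmailMessage

def collect_attachments(msg: EmailMessage, depth: int = 0) -> list[tuple[int, str]]:
    """Recursively list (depth, name) for every attachment, descending into
    attached emails (message/rfc822 parts can nest arbitrarily deep)."""
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            found.append((depth, "attached-email"))
            inner = part.get_content()  # the attached EmailMessage itself
            found.extend(collect_attachments(inner, depth + 1))
        else:
            found.append((depth, part.get_filename() or part.get_content_type()))
    return found
```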


Congrats on the launch... You're in a crowded space. What differentiates Midship? What are you doing that's novel?


Cofounder here.

Great Q - there is definitely a lot of competition in dev-tool offerings, but less so in end-to-end experiences for non-technical users.

Some of the things we offer above and beyond dev tools:

1. Schema building to define "what data to extract"
2. A hosted web app to review, audit and export extracted data
3. Integrations into downstream applications like spreadsheets

Outside of those user facing pieces, the biggest engineering effort for us has been in dealing with very complex inputs, like 100+ page PDFs. Just dumping into ChatGPT and asking nicely for the structured data falls over in both obvious (# input/output tokens exceeded) and subtle ways (e.g. missing a row in the middle of the extraction).
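
As a concrete example of catching the subtle case, one cheap sanity check (my own sketch, not necessarily what Midship does) is to compare the number of data rows in the layout model's markdown tables against the number of records the LLM returned:

```python
def count_table_data_rows(md: str) -> int:
    """Rough completeness signal: count data rows across markdown tables.
    Assumes well-formed tables (one header row + one |---| separator each)."""
    pipe_rows = sep_rows = 0
    for line in md.splitlines():
        s = line.strip()
        if not (s.startswith("|") and s.endswith("|")):
            continue
        cells = [c.strip() for c in s.strip("|").split("|")]
        if cells and all(c and set(c) <= set("-:") for c in cells):
            sep_rows += 1   # a |---|:---:| style separator row
        else:
            pipe_rows += 1  # header or data row
    # Each separator implies one table, hence one header row to subtract.
    return pipe_rows - sep_rows

def check_extraction(md: str, records: list) -> bool:
    """Flag extractions that returned fewer records than the source tables hold."""
    return len(records) >= count_table_data_rows(md)
```

If the check fails, re-run the affected chunk rather than trusting the short output.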


I came across Bert Hubert during covid because of his incredible work on this article: https://berthub.eu/articles/posts/reverse-engineering-source...

Long before Bert was writing articles on the source code of mRNA vaccines, he helped build PowerDNS. He talks about that in a three part series starting here: https://berthub.eu/articles/posts/history-of-powerdns-1999-2...

A fascinating individual...

https://fosstodon.org/@bert_hubert

https://github.com/berthubert

https://berthub.eu/


blush :-)


I love the effort Ford is putting into their EV offerings, and Farley seems to get it better than the other legacy auto CEOs. Ford leveraged their two most powerful brands (F-150 / Mustang) to enter the EV market. I contrast this with GM's return to the space with Hummer... a brand they discontinued after a tumultuous 2008.

That said, Ford aligning on 150 kW chargers across their EV portfolio is a miss. I really hope 250 kW is on the roadmap for next-gen Ford EVs when they adopt NACS.


Their vehicle architecture & derating setup only seems to max out at 150 kW so far?

I've been researching the Ford Mustang Mach-E: the NCM li-ion batteries max out around 115 kW charging, and the newer LFP ones max out at 150 kW. Expect the next gen to be different, though.

At least in the earlier models, there's supposedly less temperature sensing than in, say, a Tesla, so the car can't safely push a higher charging speed.


Bjorn is the GOAT for EV testing



I was excited about Colorado's law, though for the most part there's very little value. Companies either exclude candidates from that state [1], or provide cop-out salary range information of $x - $3x/yr [2]

1: https://twitter.com/digitalocean/status/1395818629657149445

2: https://www.pwc.com/us/en/careers/coloradoifsseniormanager.h...


It's a cop-out range because, yes, it's a cop-out, but also because they're just looking for an IC who's good and aren't micromanaging levels.


Nice work! I built something very similar: https://recipemincer.com

It seems you're using the same Python scraper I am: https://github.com/hhursev/recipe-scrapers

