How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like decade-old SharePoint sites, PDFs with partial text layers, Excel, email .msg files, etc.?
Yes, integration in complex legacy systems is always challenging.
As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure.
As SeekStorm is open-source, system integrators can take it from there.
How did you demo? Did you spin up your own instance and index the Wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.
Never mind, found that someone posted a link already.
On that topic, can anybody chime in on state-of-the-art PDF OCR? Even if that's a multimodal LLM. I've used ChatGPT to extract tabular data from images, but I need something I can self-host for proprietary data.
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.
What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.
Totally agree that this is becoming the standard "reference architecture" for this kind of pipeline. The only thing that complicates this a lot today is complex inputs. For simple 1-2 page PDFs what you describe works quite well out of the box, but for 100+ page docs it starts to fall over in ways I described in another comment.
Are really large inputs solved at midship? If so, I'd consider that a differentiator (at least today). The demo's limited to 15pgs, and I don't see any marketing around long-context or complex inputs on the site.
I suspect this problem gets solved in the next iteration or two of commodity models. In the meantime, being smart about how the context gets divvied up works OK.
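For anyone wondering what "divvying up the context" might look like in practice, here is a minimal sketch, not anyone's actual pipeline. It assumes you already have per-page text from an OCR step, uses a character budget as a crude stand-in for a token budget, and repeats a page between windows so a table that straddles a page break shows up whole in at least one window. The name `chunk_pages` and its parameters are made up for illustration.

```python
from typing import List


def chunk_pages(pages: List[str], max_chars: int = 8000,
                overlap_pages: int = 1) -> List[List[int]]:
    """Group page indices into windows that fit a character budget.

    Consecutive windows share `overlap_pages` pages (hypothetical knob),
    so content split across a page break appears intact in one window.
    """
    chunks: List[List[int]] = []
    start, n = 0, len(pages)
    while start < n:
        end, size = start, 0
        # Always take at least one page, even if it alone exceeds the budget.
        while end < n and (end == start or size + len(pages[end]) <= max_chars):
            size += len(pages[end])
            end += 1
        chunks.append(list(range(start, end)))
        if end >= n:
            break
        # Step back by the overlap, but always move forward by at least one page.
        start = max(end - overlap_pages, start + 1)
    return chunks
```

Each window of page indices would then go to the model separately, and the overlap means the merge step has to de-duplicate rows that were extracted twice.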
I do like the UI you appear to have for citing information. Drawing the polygons around the data, and then where they appear in the PDF. Nice.
Having the text (for now) is still pretty important for quality output. The vision models are quite good, but not a replacement for a quality OCR step. A combination of Text + Vision is compelling too.
And if yes, be specific in answering. Emails are a bear! Emails can have several file types as attachments, including other emails, ZIP files, and inline images where position matters for context.
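To make the nesting problem concrete, here is a rough sketch using only Python's standard library. It assumes the message is already in RFC 822 (.eml) form; Outlook .msg files are a different container and would need converting first, which is not shown. The names `walk_parts` and `walk_raw` and the "kind" labels are made up for illustration.

```python
import io
import zipfile
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Iterator, Tuple

ZIP_TYPES = ("application/zip", "application/x-zip-compressed")


def walk_parts(msg: EmailMessage, depth: int = 0,
               max_depth: int = 5) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, kind, name) for each part, recursing into nested emails."""
    ctype = msg.get_content_type()
    if ctype == "message/rfc822":
        # An attached email: report it, then descend into the wrapped message.
        yield depth, "email", msg.get_filename() or ""
        if depth < max_depth:
            yield from walk_parts(msg.get_payload(0), depth + 1, max_depth)
        return
    if msg.is_multipart():
        for part in msg.get_payload():
            yield from walk_parts(part, depth, max_depth)
        return
    name = msg.get_filename() or ""
    disposition = msg.get_content_disposition()
    if ctype in ZIP_TYPES or name.lower().endswith(".zip"):
        yield depth, "zip", name
        data = msg.get_payload(decode=True) or b""
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for member in zf.namelist():
                yield depth + 1, "zip-member", member
    elif disposition == "inline" and msg.get_content_maintype() == "image":
        # The Content-ID is what the HTML body uses to position the image.
        yield depth, "inline-image", msg.get("Content-ID", name)
    elif disposition == "attachment":
        yield depth, "attachment", name
    else:
        yield depth, "body", ctype


def walk_raw(raw: bytes) -> Iterator[Tuple[int, str, str]]:
    """Parse raw .eml bytes with the modern policy and walk the result."""
    return walk_parts(message_from_bytes(raw, policy=policy.default))
```

This only inventories the tree. Nested ZIPs, encrypted archives, and mapping inline images back to their position in the HTML body are all left out, and those are where most of the pain is.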
Great Q - there is definitely a lot of competition in dev tool offerings but less so in end to end experiences for non technical users.
Some of the things we offer above and beyond dev tools:
1. Schema building to define “what data to extract”
2. A hosted web app to review, audit and export extracted data
3. Integrations into downstream applications like spreadsheets
Outside of those user facing pieces, the biggest engineering effort for us has been in dealing with very complex inputs, like 100+ page PDFs. Just dumping into ChatGPT and asking nicely for the structured data falls over in both obvious (# input/output tokens exceeded) and subtle ways (e.g. missing a row in the middle of the extraction).
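One cheap guard against the "missing a row in the middle" failure is to cross-check the model's output against something you can count independently. This is a minimal sketch, not a description of our actual implementation: it assumes each row carries an identifier that a regex can find in the OCR text, and `find_missing_keys`, `key_field`, and `key_pattern` are hypothetical names. The pattern is assumed to contain no capturing groups.

```python
import re
from typing import Dict, Iterable, List


def find_missing_keys(source_text: str,
                      extracted: Iterable[Dict[str, str]],
                      key_field: str,
                      key_pattern: str) -> List[str]:
    """Return row keys that appear in the source text but not in the extraction.

    Keys come back in the order they first appear in the source.
    """
    # dict.fromkeys de-duplicates while preserving first-seen order.
    expected = list(dict.fromkeys(re.findall(key_pattern, source_text)))
    got = {row.get(key_field) for row in extracted}
    return [key for key in expected if key not in got]


if __name__ == "__main__":
    text = "INV-001 widget 3\nINV-002 gadget 5\nINV-003 gizmo 1\n"
    rows = [{"id": "INV-001"}, {"id": "INV-003"}]
    print(find_missing_keys(text, rows, "id", r"INV-\d{3}"))  # ['INV-002']
```

A check like this only catches rows that were dropped. It says nothing about rows extracted with wrong values, and it does not help when the rows have no stable identifier.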
I love the effort Ford is putting into their EV offerings, and Farley seems to get it better than the other legacy auto CEOs. Ford leveraged their two most powerful brands (F-150 / Mustang) to enter the EV market. I contrast this with GM's return to the space with Hummer, a brand they discontinued after a tumultuous 2008.
That said, Ford aligning on 150kW chargers across their EV portfolio is a miss. I really hope 250kW is road mapped for next gen Ford EVs when they adopt NACS.
Their vehicle architecture & derating setup only seems to max out at 150 kW so far?
I've been researching the Ford Mustang Mach-E: the NCM Li-ion batteries max out around 115 kW charging, and the newer LFP ones max out at 150 kW. Expect the next gen to be different though.
At least in the earlier models there's supposedly less temperature sensing to figure out a higher speed compared to say Tesla.
I was excited about Colorado's law, though for the most part there's very little value. Companies either exclude candidates from that state [1] or provide cop-out salary range information of $x - $3x/yr [2].