How does SeekStorm fare in mid-cap enterprise settings? How hairy is the ingest pipeline for sources like decade-old SharePoint sites, PDFs with partial text layers, Excel files, email .msg files, etc.?
Yes, integration with complex legacy systems is always challenging.
As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure.
As SeekStorm is open-source, system integrators can take it from there.
How did you demo it? Did you spin up your own instance and index the Wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.
Never mind, found that someone posted a link already.
On that topic, can anybody chime in on the state of the art in PDF OCR, even if that's a multimodal LLM? I've used ChatGPT to extract tabular data from images, but I need something I can self-host for proprietary data.
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.
What's interesting is that they have a self-deployable container model[1] that only phones home for billing, so you can self-host the runtime and model.
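For the tabular-data case, here is a rough sketch of turning the JSON output back into rows. It assumes each detected table arrives as a flat list of cells carrying `rowIndex`, `columnIndex`, and `content`, plus `rowCount` and `columnCount` on the table, which is how I understand the layout result to be shaped. The `table_to_rows` helper and the sample data are made up for illustration, so check the field names against a real response before relying on this.

```python
from typing import Any


def table_to_rows(table: dict[str, Any]) -> list[list[str]]:
    """Rebuild a row-major grid from a flat list of table cells.

    Assumes the table dict has rowCount, columnCount, and a cells list
    whose entries carry rowIndex, columnIndex, and content.
    """
    # Start with an empty grid so missing cells become empty strings.
    grid = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]
    return grid


# Hand-written sample mimicking the assumed shape; not real service output.
sample_result = {
    "tables": [
        {
            "rowCount": 2,
            "columnCount": 2,
            "cells": [
                {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
                {"rowIndex": 0, "columnIndex": 1, "content": "Qty"},
                {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
                {"rowIndex": 1, "columnIndex": 1, "content": "3"},
            ],
        }
    ]
}

# One grid per detected table.
all_tables = [table_to_rows(t) for t in sample_result["tables"]]
```

Merged cells (row or column spans) are ignored here; a span-aware version would need to copy the content into every position the cell covers.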