I’m processing a large volume of PDFs (30–40 pages each) with inconsistent layouts.
Each document contains one specific table I need to extract, but every company formats it differently.
Current stack:
– Azure Document Intelligence (prebuilt)
– preprocessing (PyMuPDF renders each page to an image, then image filters are applied)
– a multimodal LLM to turn the detected table into clean JSON
Main issue:
– To localize the table, I currently rely on template-specific configs. This does not scale: there may be hundreds of unique layouts, and each one needs its own config.
Has anyone solved this class of problem?
Looking for strategies for:
– robust table localization across many templates,
– hybrid rule-based + ML approaches,
– layout-based detection,
– or “templateless” methods that generalize better.
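
To make the "hybrid rule-based" idea concrete, here is a minimal sketch of a template-free selector. It assumes an earlier step (a layout model or table detector) already returns every candidate table in a document as rows of cell strings, then ranks candidates by fuzzy header match plus the share of numeric body cells. The function names, the expected header list, the weights, and the thresholds are all hypothetical placeholders, not part of Azure Document Intelligence or any other library.

```python
import re
from difflib import SequenceMatcher

# Loose pattern for numeric cells such as "1,200", "(300)", "$4.5", "12%".
NUMERIC = re.compile(r"^[\s$€£(+-]*\d[\d,.\s]*[)%]?\s*$")


def header_score(header_row, expected_headers, threshold=0.75):
    """Fraction of expected headers fuzzily matched by some header cell."""
    cells = [c.strip().lower() for c in header_row if c.strip()]
    if not cells or not expected_headers:
        return 0.0
    hits = 0
    for want in expected_headers:
        best = max(SequenceMatcher(None, want.lower(), c).ratio() for c in cells)
        if best >= threshold:
            hits += 1
    return hits / len(expected_headers)


def numeric_ratio(body_rows):
    """Share of non-empty body cells that look numeric."""
    cells = [c for row in body_rows for c in row if c.strip()]
    if not cells:
        return 0.0
    return sum(bool(NUMERIC.match(c)) for c in cells) / len(cells)


def score_table(rows, expected_headers, w_header=0.7, w_numeric=0.3):
    """Weighted score; the first row is assumed to be the header row."""
    if not rows:
        return 0.0
    return (w_header * header_score(rows[0], expected_headers)
            + w_numeric * numeric_ratio(rows[1:]))


def pick_best_table(candidates, expected_headers, min_score=0.5):
    """Index of the best-scoring candidate, or None if nothing is convincing."""
    scored = [(score_table(rows, expected_headers), i)
              for i, rows in enumerate(candidates)]
    if not scored:
        return None
    best_score, best_idx = max(scored)
    return best_idx if best_score >= min_score else None


if __name__ == "__main__":
    # Hypothetical candidates from one document.
    candidates = [
        [["Name", "Role"], ["Alice", "CEO"]],
        [["Year", "Revenues", "Net Income"],
         ["2022", "1,200", "300"],
         ["2023", "1,500", "410"]],
    ]
    print(pick_best_table(candidates, ["year", "revenue", "net income"]))
```

The point of scoring content rather than position is that one list of expected headers could cover many layouts. Documents where no candidate clears the threshold return None and can be routed to a fallback such as the multimodal LLM or manual review.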