Ask HN: What is the state of AI data annotation of PDF documents using LLMs?
17 points by ak_111 60 days ago | 3 comments
As an example I have a corpus of scientific papers for which I would like to label the segments that contain an application to organic chemistry.

Assuming the labeling is not very sophisticated (literally "this passage contains an application to organic chemistry"), do I need to train models to detect and label these segments, or is it viable to feed the documents into an LLM with no prior training?

What are currently the best/cheapest services/libraries for this kind of workflow that don't involve reinventing the wheel?




Everything is an engineering trade-off between classification performance, data requirements (quality and quantity for supervised fine-tuning/training), dev time, and inference speed/resources.

You don't always have to do full document extraction. For a PDF form information-extraction task, I once implemented a pipelined approach on page images that extracts handwritten fields using a trained segmentation model followed by an extraction model. You get error propagation between the stages, but performance was in the 85-90 F1 range, and we had a human in the loop for uncertain predictions.
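Roughly, the shape of that pipeline looked like the sketch below. The model interfaces, the bbox/label attributes, and the 0.8 threshold are all illustrative, not what we actually shipped:

    # Sketch: a segmentation model proposes field regions, an extraction
    # model reads each cropped region, and low-confidence predictions are
    # routed to a human reviewer instead of being accepted automatically.
    def process_page(page_image, segmenter, extractor, review_queue, threshold=0.8):
        results = {}
        for field in segmenter.predict(page_image):     # regions with labels + bboxes
            crop = page_image.crop(field.bbox)          # cut out the field region
            text, confidence = extractor.predict(crop)  # e.g. a handwriting OCR model
            if confidence < threshold:
                review_queue.append((field.label, crop))  # human-in-the-loop fallback
            else:
                results[field.label] = text
        return results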

However, if top performance is a concern (and you're not dealing with handwritten material here), I would look into SotA document-extraction models for pulling out text, tables, graphs, etc., like LayoutLMv3 [1]. It's probably something like LayoutLM that OpenAI is using themselves. Every major cloud provider also has its own document AI service with similar performance, so that might get you going faster.
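For a feel of what that looks like, here's a minimal sketch of scoring one page image with LayoutLMv3 through Hugging Face transformers. Note the processor runs Tesseract OCR on the image by default (so pytesseract must be installed), the base checkpoint isn't fine-tuned so you'd have to train the classification head on your own labels first, and the binary num_labels=2 setup is just my assumption for your organic-chemistry label:

    from PIL import Image
    from transformers import AutoProcessor, LayoutLMv3ForSequenceClassification

    processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
    model = LayoutLMv3ForSequenceClassification.from_pretrained(
        "microsoft/layoutlmv3-base", num_labels=2  # e.g. organic-chem vs. not
    )

    image = Image.open("paper_page_1.png").convert("RGB")
    encoding = processor(image, return_tensors="pt", truncation=True)
    logits = model(**encoding).logits
    print("predicted class:", logits.argmax(-1).item())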

On the extracted text of scientific papers you will run into context-size issues, so take that into account if you use a zero-shot LLM approach.
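One common workaround is to classify fixed-size chunks instead of whole papers. A rough sketch, where the chunk size, overlap, prompt wording, and model name are all illustrative:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chunks(text, size=3000, overlap=200):
        # overlapping character windows so segment boundaries aren't missed
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def contains_organic_chem(passage):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Answer YES or NO: does this passage contain an "
                           "application to organic chemistry?\n\n" + passage,
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    labels = [contains_organic_chem(c) for c in chunks(open("paper.txt").read())]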

You could also go with a chunked topic-model approach (BERTopic [2] is a nice starting point).
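Something like this gets you started with BERTopic on pre-chunked passages; whether an "organic chemistry applications" topic actually emerges depends entirely on your corpus, and the input file here is just a placeholder:

    # BERTopic clusters document embeddings, so it needs a reasonably
    # large list of passages before the topics become meaningful.
    from bertopic import BERTopic

    passages = open("chunks.txt").read().split("\n\n")  # one passage per block
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(passages)
    print(topic_model.get_topic_info())  # inspect topics, map relevant ones to labels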

In any case, it all depends on the specific task (e.g., the number of topics to classify) and the engineering trade-offs. Only you have all the information to figure that one out.

1. https://huggingface.co/docs/transformers/model_doc/layoutlmv...

2. https://maartengr.github.io/BERTopic/index.html


You could try ingesting your files with open-parse. It analyzes the layout of your documents, converts them to markdown, and extracts tables.

Then you could feed this output to a small model like GPT-3.5, and it should be able to classify the topic quite easily.
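A minimal sketch of that two-step flow, using the DocumentParser interface from the README; the classification prompt is just illustrative:

    import openparse
    from openai import OpenAI

    parser = openparse.DocumentParser()
    parsed = parser.parse("paper.pdf")

    client = OpenAI()
    for node in parsed.nodes:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Does this passage contain an application to "
                           "organic chemistry? Answer YES or NO.\n\n" + node.text,
            }],
        )
        print(resp.choices[0].message.content, "->", node.text[:60])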

Alternatively, you could convert the PDF to images and feed those to a model. Or use naive text splitting as implemented by something like LlamaIndex. Tons of options!

Disclaimer: I'm the author.

https://github.com/Filimoa/open-parse


Have you tried uploading a sample paper to Google Gemini or ChatGPT (GPT-4) and seeing what results you get when you prompt it with something like "Label the sections containing applications to organic chemistry"?



