Ask HN: What is the state of AI data annotation of PDF documents using LLMs?
17 points by ak_111 60 days ago | 3 comments
As an example I have a corpus of scientific papers for which I would like to label the segments that contain an application to organic chemistry.

Assuming the labeling is not very sophisticated (literally "this passage contains an application to organic chemistry"), do I need to train models to detect and label these segments, or is it viable to feed the documents into an LLM with no prior training?

What are currently the best/cheapest services/libraries for this kind of workflow that don't involve reinventing the wheel?




Everything is an engineering trade-off between classification performance, data requirements (quality and quantity for supervised fine-tuning/training), dev time, and inference speed/resources.

You don't always have to do full document extraction. For a PDF form information-extraction task, I once implemented a pipelined approach on page images that extracts handwritten fields using a trained segmentation model followed by an extraction model. You get error propagation between the stages, but performance was in the 85-90 F1 range, and we had a human in the loop for uncertain predictions.
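Roughly, the shape of that pipeline looked like the sketch below. The model interfaces, the bbox/label attributes, and the 0.8 threshold are all illustrative, not what we actually shipped:

    # Sketch: a segmentation model proposes field regions, an extraction
    # model reads each cropped region, and low-confidence predictions are
    # routed to a human reviewer instead of being accepted automatically.
    def process_page(page_image, segmenter, extractor, review_queue, threshold=0.8):
        results = {}
        for field in segmenter.predict(page_image):     # regions with labels + bboxes
            crop = page_image.crop(field.bbox)          # cut out the field region
            text, confidence = extractor.predict(crop)  # e.g. a handwriting OCR model
            if confidence < threshold:
                review_queue.append((field.label, crop))  # human-in-the-loop fallback
            else:
                results[field.label] = text
        return results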

However, if top performance is a concern (and you're not dealing with handwritten material here), I would look into SotA document-extraction models for pulling out text, tables, graphs, etc., like LayoutLMv3 [1]. It's probably something like LayoutLM that OpenAI is using themselves. Every major cloud provider also has its own document AI service with similar performance, so that might get you going faster.
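For a feel of what that looks like, here's a minimal sketch of scoring one page image with LayoutLMv3 through Hugging Face transformers. Note the processor runs Tesseract OCR on the image by default (so pytesseract must be installed), the base checkpoint isn't fine-tuned so you'd have to train the classification head on your own labels first, and the binary num_labels=2 setup is just my assumption for your organic-chemistry label:

    from PIL import Image
    from transformers import AutoProcessor, LayoutLMv3ForSequenceClassification

    processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")
    model = LayoutLMv3ForSequenceClassification.from_pretrained(
        "microsoft/layoutlmv3-base", num_labels=2  # e.g. organic-chem vs. not
    )

    image = Image.open("paper_page_1.png").convert("RGB")
    encoding = processor(image, return_tensors="pt", truncation=True)
    logits = model(**encoding).logits
    print("predicted class:", logits.argmax(-1).item())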

On the extracted text of scientific papers you will run into context-size issues, so take that into account if you use a zero-shot LLM approach.
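One common workaround is to classify fixed-size chunks instead of whole papers. A rough sketch, where the chunk size, overlap, prompt wording, and model name are all illustrative:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chunks(text, size=3000, overlap=200):
        # overlapping character windows so segment boundaries aren't missed
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def contains_organic_chem(passage):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Answer YES or NO: does this passage contain an "
                           "application to organic chemistry?\n\n" + passage,
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    labels = [contains_organic_chem(c) for c in chunks(open("paper.txt").read())]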

You could also go with a chunked topic-model approach (BERTopic [2] is a nice starting point).
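Something like this gets you started with BERTopic on pre-chunked passages; whether an "organic chemistry applications" topic actually emerges depends entirely on your corpus, and the input file here is just a placeholder:

    # BERTopic clusters document embeddings, so it needs a reasonably
    # large list of passages before the topics become meaningful.
    from bertopic import BERTopic

    passages = open("chunks.txt").read().split("\n\n")  # one passage per block
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(passages)
    print(topic_model.get_topic_info())  # inspect topics, map relevant ones to labels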

In any case, it all depends on the specific task (e.g., the number of topics to classify) and the engineering trade-offs. Only you have all the information to figure that one out.

1. https://huggingface.co/docs/transformers/model_doc/layoutlmv...

2. https://maartengr.github.io/BERTopic/index.html


You could try ingesting your files with open-parse. It analyzes the layout of your documents, converts them to markdown, and extracts tables.

Then you could feed this output to a small model like GPT-3.5, and it should be able to classify the topic quite easily.
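A minimal sketch of that two-step flow, using the DocumentParser interface from the README; the classification prompt is just illustrative:

    import openparse
    from openai import OpenAI

    parser = openparse.DocumentParser()
    parsed = parser.parse("paper.pdf")

    client = OpenAI()
    for node in parsed.nodes:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Does this passage contain an application to "
                           "organic chemistry? Answer YES or NO.\n\n" + node.text,
            }],
        )
        print(resp.choices[0].message.content, "->", node.text[:60])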

Alternatively, you could convert the PDF to images and feed those to a model. Or use naive text splitting as implemented by something like LlamaIndex. Tons of options!

Disclaimer: I'm the author.

https://github.com/Filimoa/open-parse


Have you tried uploading a sample paper to Google Gemini or ChatGPT (GPT-4) and seeing what results you get when you prompt it with something like "Label the sections containing applications to organic chemistry"?



