As an example, I have a corpus of scientific papers in which I would like to label the segments that contain an application to organic chemistry.
Assuming the labeling is not very sophisticated (literally "this passage contains an application to organic chemistry"), do I need to train models to detect and label these segments, or is it viable to feed the text to an LLM with no prior training?
What are currently the best/cheapest services/libraries that help with this kind of workflow without reinventing the wheel?
You don't always have to do full document extraction. For a PDF form information extraction task, I once implemented a pipelined approach on images that extracts handwritten fields using a trained segmentation model and an extraction model. You get error propagation between the stages, but performance was 85-90 F1, and we had a human in the loop for uncertain predictions.
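To make the human-in-the-loop part concrete, here is a minimal sketch of that kind of confidence gate in Python; the threshold value and the FieldResult type are illustrative, not from the actual system:

    from dataclasses import dataclass

    @dataclass
    class FieldResult:
        status: str        # "accepted" or "needs_review"
        text: str
        confidence: float

    CONF_THRESHOLD = 0.8   # assumption: tuned on held-out validation data

    def route(text: str, confidence: float) -> FieldResult:
        # Low-confidence extractions are queued for human review
        # instead of being written straight to the output.
        status = "accepted" if confidence >= CONF_THRESHOLD else "needs_review"
        return FieldResult(status, text, confidence)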
However, if top performance is a concern (and you're not dealing with handwriting here), I would look into SotA document extraction models for extracting text, tables, graphs, etc., like LayoutLMv3 [1]. It's probably something like LayoutLM that OpenAI is using themselves. Every major cloud provider also has its own document AI service with similar performance, so that might get you going faster.
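For what it's worth, a minimal sketch of running LayoutLMv3 via the Hugging Face transformers library looks like this (the checkpoint is the public base model, whose classification head is randomly initialized, so you'd still fine-tune on your own labels; apply_ocr=True requires pytesseract):

    from PIL import Image
    from transformers import AutoProcessor, AutoModelForTokenClassification

    processor = AutoProcessor.from_pretrained(
        "microsoft/layoutlmv3-base", apply_ocr=True
    )
    model = AutoModelForTokenClassification.from_pretrained(
        "microsoft/layoutlmv3-base"
    )

    image = Image.open("paper_page.png").convert("RGB")
    encoding = processor(image, return_tensors="pt")  # OCR words + boxes + pixels
    outputs = model(**encoding)
    predictions = outputs.logits.argmax(-1)           # per-token label ids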
On the extracted text of scientific papers you will run into context size issues, so take that into account if you use a zero-shot LLM approach.
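A naive way around that is fixed-size chunking with overlap and a per-chunk yes/no prompt. This sketch uses the OpenAI Python client, but the model name and prompt wording are placeholders, not a recommendation:

    from openai import OpenAI

    client = OpenAI()

    def chunks(text, size=3000, overlap=300):
        # Overlapping windows so a relevant passage isn't split in half.
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def mentions_organic_chem(passage):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model works
            messages=[{
                "role": "user",
                "content": "Does this passage contain an application to "
                           "organic chemistry? Answer yes or no.\n\n" + passage,
            }],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    text = open("paper.txt").read()
    labels = [mentions_organic_chem(c) for c in chunks(text)]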
You could also go with a chunked topic model approach (BERTopic [2] is a nice starting point).
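Something like this is enough to see whether the topics separate cleanly, assuming the papers are already split into paragraph-sized chunks (the default UMAP + HDBSCAN stack wants at least a few hundred documents):

    from bertopic import BERTopic

    # One chunk (e.g. paragraph) per line, from the extracted papers.
    docs = open("chunks.txt").read().splitlines()

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info())  # topic ids, sizes, keyword labels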
In any case, it all depends on the specific task (the number of topics to classify) and the engineering trade-offs. Only you have all the information to figure that one out.
1. https://huggingface.co/docs/transformers/model_doc/layoutlmv...
2. https://maartengr.github.io/BERTopic/index.html