You can do some really cool things with these models now, like asking them to extract not just the text but also figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out, I tried this with a simple prompt plus a dump of a pydantic schema describing what I wanted, and it was spot on. The prompt was pretty much this (from before JSON mode was supported):
You are an expert in PDFs. You are helping a user extract text from a PDF.
Extract the text from the image as a structured json output.
Extract the data using the following schema:
{Page.model_json_schema()}
Example:
{{
"title": "Title",
"page_number": 1,
"sections": [
...
],
"figures": [
...
]
}}
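
For anyone who wants to try it, here is a rough sketch of what the schema side could look like, assuming pydantic v2 (which is where `model_json_schema()` comes from). Only `Page` and its four top-level fields appear in the prompt above; the `Section`, `Figure`, `Node`, and `Edge` models, their field names, and the `build_prompt` / `parse_reply` helpers are my own guesses for illustration, not the original code. The model call itself is left out.

```python
import json

from pydantic import BaseModel, Field


# Hypothetical sub-models: the field names below are assumptions.
class Node(BaseModel):
    id: str
    label: str


class Edge(BaseModel):
    source: str  # id of the node the edge starts at
    target: str  # id of the node the edge points to
    label: str = ""


class Figure(BaseModel):
    caption: str = ""
    nodes: list[Node] = Field(default_factory=list)
    edges: list[Edge] = Field(default_factory=list)


class Section(BaseModel):
    heading: str = ""
    text: str


# Top-level fields match the example in the prompt above.
class Page(BaseModel):
    title: str
    page_number: int
    sections: list[Section] = Field(default_factory=list)
    figures: list[Figure] = Field(default_factory=list)


def build_prompt() -> str:
    """Dump the JSON schema for Page into the system prompt."""
    schema = json.dumps(Page.model_json_schema(), indent=2)
    return (
        "You are an expert in PDFs. You are helping a user extract text from a PDF.\n"
        "Extract the text from the image as a structured json output.\n"
        "Extract the data using the following schema:\n"
        f"{schema}\n"
    )


def parse_reply(reply: str) -> Page:
    """Validate a model reply against Page.

    Without JSON mode the reply may be wrapped in prose, so this takes
    everything between the first '{' and the last '}' before validating.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return Page.model_validate_json(reply[start : end + 1])
```

Validating the reply with the same model that generated the schema means a malformed or off-schema answer raises a `ValidationError` instead of silently passing through, which is handy when the model is not constrained to emit JSON.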