You can do some really cool things with these models now, like asking them to extract not just the text but also figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with vision came out, I tried this with a simple prompt plus a dumped-in pydantic schema of what I wanted, and it was spot on. The prompt was pretty much this (before JSON mode was supported):

    You are an expert in PDFs. You are helping a user extract text from a PDF.

    Extract the text from the image as a structured json output.

    Extract the data using the following schema:

    {Page.model_json_schema()}

    Example:
    {{
      "title": "Title",
      "page_number": 1,
      "sections": [
        ...
      ],
      "figures": [
        ...
      ]
    }}
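
For anyone wanting to try it, here is a rough sketch of what the schema side could look like with pydantic v2. Only `Page` and `Page.model_json_schema()` come from the prompt above; the `Section`, `Figure`, `Node` and `Edge` models, their field names, and the `build_prompt` / `parse_reply` helpers are my own guesses for illustration, not the schema from the post. The actual vision API call is left out, and the reply below is hand-written rather than real model output.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field


# Hypothetical schema: only the name `Page` appears in the original prompt.
class Node(BaseModel):
    id: str
    label: str


class Edge(BaseModel):
    source: str  # id of the node the edge starts at
    target: str  # id of the node the edge points to
    label: Optional[str] = None


class Figure(BaseModel):
    caption: str = ""
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


class Section(BaseModel):
    heading: str = ""
    text: str


class Page(BaseModel):
    title: str
    page_number: int
    sections: List[Section] = Field(default_factory=list)
    figures: List[Figure] = Field(default_factory=list)


def build_prompt() -> str:
    """Build the text prompt with the JSON schema of Page embedded in it."""
    schema = json.dumps(Page.model_json_schema(), indent=2)
    return (
        "You are an expert in PDFs. You are helping a user extract text from a PDF.\n\n"
        "Extract the text from the image as a structured json output.\n\n"
        "Extract the data using the following schema:\n\n"
        f"{schema}\n"
    )


def parse_reply(reply: str) -> Page:
    """Validate a model reply against Page.

    Without JSON mode the model may surround the object with prose,
    so keep only the span from the first "{" to the last "}".
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return Page.model_validate_json(reply[start : end + 1])


# Hand-written stand-in for a model reply (not real output).
reply = (
    "Here is the page:\n"
    '{"title": "Title", "page_number": 1, '
    '"sections": [{"heading": "Intro", "text": "Hello"}], '
    '"figures": [{"caption": "Flow", '
    '"nodes": [{"id": "a", "label": "A"}, {"id": "b", "label": "B"}], '
    '"edges": [{"source": "a", "target": "b"}]}]}'
)
page = parse_reply(reply)
```

The nice part of validating against the same model you put in the prompt is that a reply with a missing field or a wrong type raises a `ValidationError` instead of silently passing through.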

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/