Show HN: PDF to MD by LLMs – Extract Text/Tables/Image Descriptives by GPT4o (github.com/yigitkonur)
30 points by yigitkonur35 5 hours ago | 8 comments
I've developed a Python API service that uses GPT-4o for OCR on PDFs. It features parallel processing and batch handling for improved performance. Not only does it convert PDF to markdown, but it also describes the images within the PDF using captions like `[Image: This picture shows 4 people waving]`.
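
Under the hood it boils down to one GPT-4o vision call per rendered page, roughly like this (a simplified sketch - the prompt here is paraphrased and the real code adds the parallelism, batching and retries on top):

    # Simplified sketch of the per-page conversion. The prompt is paraphrased;
    # the actual service layers parallel processing and batch handling on top.
    import base64
    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "Convert this page to clean Markdown. Preserve tables as Markdown tables "
        "and describe any pictures as [Image: <short caption>]."
    )

    def page_to_markdown(page_png: bytes) -> str:
        b64 = base64.b64encode(page_png).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content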

In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.

The project is open-source and available on GitHub. Feedback is welcome.






I haven't found any mention of accuracy. Since it's using an LLM, how accurate is the conversion? As in, does that NASA document match the PDF 100%, or did it introduce any made-up things (hallucinations)?

That converted NASA doc should be included in the repo and linked from the readme, if it isn't already.


People are really freaked out about hallucinations, but you can totally tackle that with solid prompts. The one in the repo right now is doing a pretty good job. Keep in mind though, this project is all about maxing out context for LLMs in products that need PDF input.

We're not talking about some hardcore archiving system for the Library of Congress here. The goal is to boost consistency whenever you're feeding PDF context into an LLM-powered tool. Appreciate the feedback, I'll be sure to add that in.


Was just looking for something like this. Does it convert equations to LaTeX or similar? How about rotated tables, i.e. a landscape-oriented table on a page that's still portrait?

I messed around with some rotated tables in that Apollo 17 demo video - you can check it out in the repo if you want. It's pretty straightforward to handle just by changing the prompt. You can customize that prompt section in the code to fit whatever you need.

Oh, and if you throw in a line about LaTeX, it'll make things even more consistent. Just add it to that markdown definition part I set up. Honestly, it'll probably work pretty well as is - should be way better than those clunky old OCR systems.
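
For example, appending something like this to the prompt does the trick (the wording below is just an illustration, not the exact prompt shipped in the repo):

    # Illustration only; the base prompt in the repo is longer and worded differently.
    BASE_PROMPT = (
        "Convert this page to clean Markdown. Preserve tables as Markdown tables "
        "and describe any pictures as [Image: <short caption>]."
    )
    LATEX_RULE = (
        "Render every mathematical expression as LaTeX: inline math as $...$ "
        "and display equations as $$...$$."
    )
    prompt = BASE_PROMPT + "\n" + LATEX_RULE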


How does this compare with commercial OCR APIs on a cost per page?

It is a lot cheaper! While cost-effectiveness may not be the primary advantage, this solution offers superior accuracy and consistency. Key benefits include precise table generation and output in an easily editable Markdown format.

Let's run some numbers:

- Average token usage per image: ~1,200
- Total tokens per page (including prompt): ~1,500
- GPT-4o input token cost: $5 per million tokens
- GPT-4o output token cost: $15 per million tokens

For 1,000 documents:

- Estimated total cost: $15
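
Rough back-of-the-envelope check of that figure; the ~500 output tokens of Markdown per page is my assumption here, not a measured number:

    # Cost per 1,000 single-page documents at GPT-4o pricing.
    # Output tokens per page (~500) is an assumed figure.
    pages = 1000
    input_tokens_per_page = 1500    # page image + prompt
    output_tokens_per_page = 500    # assumed Markdown output

    input_cost = pages * input_tokens_per_page / 1e6 * 5     # $5 per 1M input tokens
    output_cost = pages * output_tokens_per_page / 1e6 * 15  # $15 per 1M output tokens
    print(input_cost + output_cost)  # -> 15.0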

This represents excellent value considering the consistency and flexibility provided. For further cost optimization, consider:

1. Using GPT-4o mini: reduces the cost to approximately $8 per 1,000 documents.
2. Using the Batch API: further reduces the cost to around $4 per 1,000 documents.

I think it offers an optimal balance of affordability & reliability.

PS: Even one of the most affordable solutions on the market, CloudConvert, charges ~$30 for 1K documents (its PDFTron mode requires 4 credits).


> I think it offers an optimal balance of affordability & reliability.

It is hard to trust "you" when ChatGPT wrote that text. You never know which part of the answer is genuine and which part was made up by ChatGPT.

To actually answer that question: Pricing varies quite a bit depending on what exactly you want to do with a document.

Text detection generally costs $1.5 per 1k pages:

https://cloud.google.com/vision/pricing

https://aws.amazon.com/textract/pricing/

https://azure.microsoft.com/en-us/pricing/details/ai-documen...


You've got a point, but try testing it on a tricky example like the Apollo 17 document - you know, with those sideways tables and old-school writing. You'll see all three non-AI services totally bomb. Now, if you tweak it to batch = 1 instead of 10, you'll notice there's hardly any made-up stuff. When you dial down the temperature close to zero, it's super unlikely to see hallucinations with limited context. At worst, you might get some skipped bits, but that's not a dealbreaker for folks looking to feed PDFs into AI systems. Let's face it, regular OCR already messes up so much that...
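
To be concrete about those two knobs (the names below are illustrative; the real settings live in the repo's code):

    # Illustrative only; the parameter names in the repo may differ.
    BATCH_SIZE = 1      # one page per request instead of ten
    TEMPERATURE = 0.0   # passed to the GPT-4o call; near-deterministic output,
                        # so at worst content gets skipped rather than invented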


