How are folks parsing PDFs with GPT-4? Every third-party tool that supposedly uses GPT has failed for me on even the most regular PDF bank statements. Currently I am parsing the text contents manually with pdfbox to reconstruct historical bank statements, but this is error-prone because of page breaks and run-on lines.
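For what it's worth, the page-break and run-on problems can often be handled in plain code before any model sees the text. Here is a rough Python sketch of that cleanup, not a pdfbox recipe. It assumes you already have one text string per page, that headers and footers repeat verbatim on most pages, and that every transaction starts with a date like 03/14/2023. The names `strip_repeated_lines`, `merge_continuations`, and `DATE_RE` are made up for illustration.

```python
import re
from collections import Counter

# Assumed date format at the start of each transaction, e.g. 03/14/2023.
DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}\b")

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that appear on most pages (likely headers or footers)."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    kept = []
    for page in pages:
        for line in page.splitlines():
            line = line.strip()
            if line and line not in repeated:
                kept.append(line)
    return kept

def merge_continuations(lines):
    """Attach lines that do not start with a date to the previous row."""
    rows = []
    for line in lines:
        if DATE_RE.match(line) or not rows:
            rows.append(line)
        else:
            rows[-1] += " " + line
    return rows
```

Lines that vary per page, such as "Page 1 of 2", slip through the exact-match check, so a real statement would probably need a looser comparison.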
I had success with having it guess the formatting, which was a pretty cool experiment.
I had a table of registers and associated information in a PDF. I copied and pasted the text directly into ChatGPT as one big block and asked it to restructure the data as a table based on its best understanding, knowing it came from a table. To my surprise, it did a really good job. It needed a few small formatting edits and it missed a couple of values, but overall the cleanup took me a couple of minutes and I was on my way.
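Since a couple of values went missing, a crude automated check can help find them. The Python sketch below is only an illustration, not what I actually ran: it compares every digit-containing token in the pasted text against the table that comes back and lists the ones that are absent. `TOKEN_RE` and `missing_values` are hypothetical names.

```python
import re
from collections import Counter

# Any run of letters, digits, "_" or "." that contains at least one digit,
# so both "32" and "0x1F" count as values.
TOKEN_RE = re.compile(r"[A-Za-z0-9_.]*\d[A-Za-z0-9_.]*")

def missing_values(source_text, model_table):
    """Return value tokens that occur more often in the source than in the table."""
    missing = Counter(TOKEN_RE.findall(source_text)) - Counter(TOKEN_RE.findall(model_table))
    return sorted(missing.elements())
```

It only catches dropped values, not values placed in the wrong column, so it narrows the manual review rather than replacing it.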
Chrome's "Search images with Google" is already good at extracting text. I wish it could recognize table formats within an image or a page and allow them to be exported to Google Sheets. That would be a game changer.
My current solution is pdfplumber → GPT-3 API. I played around with a few different options, and this is personally what’s worked best for my use cases.
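A minimal sketch of what that kind of pipeline can look like in Python, under some assumptions: `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls, but the prompt wording, the CSV reply format, and the helper names are invented here. The model call is left as a caller-supplied `complete` function (hypothetical) so that no particular API client is implied.

```python
import csv
import io

def extract_page_texts(pdf_path):
    """Return the text of each page using pdfplumber (third-party, imported lazily)."""
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages with no text layer.
        return [page.extract_text() or "" for page in pdf.pages]

def build_prompt(page_text):
    """Wrap extracted text in an instruction that asks for CSV back (assumed wording)."""
    return (
        "The text below was extracted from a table in a PDF.\n"
        "Rebuild the table and reply with CSV only, one row per line.\n\n"
        + page_text.strip()
    )

def parse_csv_reply(reply):
    """Parse the model's CSV reply into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(reply.strip())) if row]

def pdf_to_rows(pdf_path, complete):
    """`complete` is any callable that takes a prompt string and returns the reply text."""
    rows = []
    for text in extract_page_texts(pdf_path):
        if text.strip():
            rows.extend(parse_csv_reply(complete(build_prompt(text))))
    return rows
```

Keeping the model call behind a plain callable also makes it easy to swap in a canned reply when testing the parsing step.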
Same deal for me: I've tried a few AI PDF pipelines with no good output. If I feed it cleaned-up, delimited text then everything's good[1], but by the time I get to that point I might as well be using Tabula + Orange / R. I'll keep trying though, because damn, PDF files need to go die under a rock.
[1] Well... kinda. The model has some weird ideas about what an edge table is.
I'm guessing that manually feeding it a table at a time, with prompts tailored to the shapes you see in that table, is more effective than using a tool that tries to do that for you. Automated tools on top of GPT will be great when they work, but they introduce a new communication joint where things can fail.
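To make the "one table at a time, tailored prompt" idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: `TableHint`, `tailored_prompt`, and `check_rows` are invented names, and the pipe-separated reply format is just one convenient choice. You describe the shape you see in a given table, build the prompt from that description, and then check that every returned row has the expected number of cells.

```python
from dataclasses import dataclass, field

@dataclass
class TableHint:
    """What a human noticed about one specific table."""
    columns: list
    notes: list = field(default_factory=list)

def tailored_prompt(table_text, hint):
    """Build a prompt for a single table from its observed shape."""
    lines = [
        "The text below is one table copied out of a PDF.",
        "Columns, in order: " + ", ".join(hint.columns) + ".",
    ]
    lines += ["Note: " + note for note in hint.notes]
    lines += [
        "Reply with one pipe-separated row per line and nothing else.",
        "",
        table_text.strip(),
    ]
    return "\n".join(lines)

def check_rows(reply, hint):
    """Split the reply into rows; return (good_rows, line_numbers_with_wrong_width)."""
    rows, bad = [], []
    for number, line in enumerate(reply.strip().splitlines(), start=1):
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == len(hint.columns):
            rows.append(cells)
        else:
            bad.append(number)
    return rows, bad
```

The width check is the part that guards the joint: a row that comes back with the wrong number of cells gets flagged for a human instead of silently landing in the output.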