How are folks parsing PDFs with GPT4? Every 3rd party tool that supposedly uses ...

Ductapemaster · on April 12, 2023

I had success with having it guess the formatting, which was a pretty cool experiment.

I had a table of registers and associated information in a PDF. I copied and pasted the text directly into ChatGPT as one big block and asked it to structure the data as a table again based on its best understanding of the data, given that it came from a table. To my surprise, it did a really good job. It needed a few small formatting edits and it missed a couple of values, but overall it took me a couple of minutes to fix up and I was on my way.
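Since the model "missed a couple values" here, one cheap safeguard is to compare the pasted text against the table that comes back and list anything that went missing. The following is only a minimal sketch of that idea: the register names and addresses are made-up sample data, and `missing_values` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import re


def tokens(text: str) -> list[str]:
    """Split text into alphanumeric tokens (names, hex addresses, numbers)."""
    return re.findall(r"[A-Za-z0-9_]+", text)


def missing_values(raw_text: str, table_text: str) -> list[str]:
    """Return tokens from the pasted text that appear nowhere in the model's table."""
    seen = set(tokens(table_text))
    missing = []
    for tok in tokens(raw_text):
        if tok not in seen and tok not in missing:
            missing.append(tok)
    return missing


# Hypothetical sample: text pasted from a PDF register table...
raw = "CTRL 0x00 RW Control register STATUS 0x04 RO Status register"

# ...and a model reply in which one access field was dropped.
table = """| Name | Address | Access | Description |
| --- | --- | --- | --- |
| CTRL | 0x00 | RW | Control register |
| STATUS | 0x04 | | Status register |"""

print(missing_values(raw, table))
```

This only catches dropped values, not values placed in the wrong cell, so it shortens the manual review rather than replacing it.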

passion__desire · on April 12, 2023

Chrome's "Search images with Google" is already good at extracting text. I wish it could recognize table formats within an image or a page and allow them to be exported to Google Sheets. That would be a game changer.

ragazzina · on April 12, 2023

I think he just copy-pasted everything from the PDF to GPT4? He says:

> with the only tedious aspect being the cut-and-paste between the raw data, GPT4, and the spreadsheet

zachwill · on April 12, 2023

Note: I haven’t worked with PDF bank statements.

My current solution is pdfplumber → GPT-3 API. I played around with a few different options, and this is personally what’s worked best for my use cases.
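A rough sketch of what the pdfplumber half of such a pipeline might look like, assuming the goal is plain page text grouped into chunks small enough to paste into a prompt. The `max_chars` budget and both function names are assumptions for illustration; the model call itself is left out because the comment does not say how it was made.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the extracted text of each page, using pdfplumber."""
    # Imported lazily so the chunking helper below works without the package.
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # Guard against pages that yield no text (e.g. scanned images).
        return [page.extract_text() or "" for page in pdf.pages]


def chunk_pages(page_texts: list[str], max_chars: int = 6000) -> list[str]:
    """Greedily group consecutive pages into chunks of at most ~max_chars.

    A single page longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for text in page_texts:
        # +2 accounts for the blank-line separator between pages.
        if current and len(current) + len(text) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks


# Usage (assumes a local file named statement.pdf exists):
#   for chunk in chunk_pages(extract_page_texts("statement.pdf")):
#       ...send chunk to the model of your choice...
```

Character counts are only a stand-in for token limits, so the budget would need tuning for whichever model sits at the end of the arrow.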

MilStdJunkie · on April 12, 2023

Same deal for me: I've tried a few AI PDF pipelines with no good output. If I feed it cleaned-up, delimited text then everything's good[1], but by the time I get to that point I might as well be using Tabula + Orange / R. I'll keep trying though, because damn, PDF files need to go die under a rock.

[1] Well... kinda. The model has some weird ideas about what an edge table is.

Jeff_Brown · on April 12, 2023

I'm guessing that manually feeding it a table at a time, with prompts tailored to the shapes you see in that table, is more effective than using a tool that tries to do that for you. Automated tools on top of GPT will be great when they work, but they introduce a new communication joint where things can fail.
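To make "a table at a time, with tailored prompts" concrete, here is one possible shape for it: each table gets its own column list and shape-specific notes, and the prompt is built from those. `TableJob`, `tailored_prompt`, and the sample bank-statement rows are all hypothetical, and the prompt wording is just one guess at what might work.

```python
from dataclasses import dataclass, field


@dataclass
class TableJob:
    """One table's raw text plus what a human noticed about its shape."""
    raw_text: str
    columns: list[str]
    notes: list[str] = field(default_factory=list)


def tailored_prompt(job: TableJob) -> str:
    """Build a prompt for a single table from its columns and notes."""
    lines = [
        "The text below was copied from a single table in a PDF.",
        "Rebuild it as a Markdown table with exactly these columns: "
        + ", ".join(job.columns) + ".",
        "Leave a cell empty rather than guessing a value.",
    ]
    lines += [f"Note: {note}" for note in job.notes]
    lines += ["", "Text:", job.raw_text]
    return "\n".join(lines)


# Hypothetical example: a table whose description cells wrap onto a second line.
job = TableJob(
    raw_text="01/03 COFFEE SHOP\nDOWNTOWN 4.50\n01/04 GROCERY 32.10",
    columns=["Date", "Description", "Amount"],
    notes=["Descriptions may wrap onto a second line; join them into one cell."],
)
print(tailored_prompt(job))
```

Keeping the notes per table means the human's observations about each shape go straight into the prompt, with no extra tool in between.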