Great project! I've had success using camelot-py (https://camelot-py.readthedocs.io) to extract tabular data from PDFs (for images, I use ImageMagick to convert them to PDF). If your table has borders, the default method (lattice) works quite well. For non-bordered tables there is the 'stream' option, but it usually requires a bit more preprocessing to get usable results.
As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] which will add OCR support (using EasyOCR) for image-based PDFs at some point.
From the link: "Camelot only works with text-based PDFs and not scanned documents." If you have character data, using it is almost always going to be more accurate than OCR.
I don't know how OP uses it with images converted to PDFs though, as that would be just like a scan, and ImageMagick doesn't do OCR as far as I can tell.
Yes I need to work on that PR, haven't been getting a lot of free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!
Hi!
Thank you for sharing this, it's a great tool I bumped into when searching for an image-to-CSV converter. But it seems to work with graphs only, if I'm not mistaken.
This is neat. Over at Docsumo, I've had fun building one of the pipelines [0] to extract tables from any kind of document.
Our older pipelines used image-processing-based approaches. However, they had too many assumptions built into them (for instance, about header text, column types, etc.).
Now we've moved on to an ML-based approach, training generic models that can be applied to a variety of documents for table structure recognition.
What was your motivation for making this? I see that you mention that it was a learning project but it seems like a specific enough tool that it is potentially solving some real-world problem for you.
You are right! I'm a student and my problem was some Supply Chain and Lean Management homework:
I needed to run computations on tables of numbers, but my teacher didn't have the data, only screenshots of those tables, so I spent a long time copying numbers into Excel until I decided to implement this. :-)
This is really neat. A lot of the hard work in converting scientific PDFs to text is dealing with the tables, which more often than not are graphical and usually do not have a text overlay.
Since it's been distributed to you, you are implicitly allowed to use it unless told otherwise, but possibly only for private use. Most websites, for example, are copyrighted, but you are allowed to read them.
.. maybe take every nth frame as a screenshot and stitch them together with standard photo stitching tools, then OCR the big long image..
all this to extract chat logs from MS Teams..
if you're an Exchange admin there might be a way. if you're just a pleb trying to capture project logs, or records of your coworkers bad-mouthing you, tough luck.
I had some success last year integrating tesseract OCR and OpenCV with Tabula (compiled to javascript). The purpose was to build a Google Docs pdf table import addon without requiring a backend. Happy to get in touch to figure out how I could contribute the work back to Tabula (if that makes sense).
The program runs with Python and Tesseract. It is quite fast (less than one second for a table of 100 numbers), though I never tested it with larger tables.
It detects numbers from an image of a table, which is assumed to be unrotated and cropped so that only the table is visible in the image. So, in order to process multiple tables per image, one needs to create a separate image for each table.
This program is rather simple I must say. ;-)
As for handwriting, I think Tesseract can handle the recognition if the writing is neat, but the table needs to satisfy the expected assumptions. Also, the pre-processing can't get rid of a lot of noise, so that can be a problem too!
Tangentially, I would like to be able to extract tables from PDF files for my Easy Data Transform software. So I would like to find a C++ library that does this. Can anyone recommend one? It needs to work with proprietary software (so no GPL). Doesn't have to be free. And what is the state of the art on this? How reliably can data tables be extracted from real-world PDFs?
Is there also a solution for automatic border detection?
Last year I tried reading bank statements, which were scanned slips. Unfortunately they didn't have any borders, which made it super difficult to extract content. Would be cool if someone could make something for this :) I thought it would be easy, but I banged my head against it for several days until I gave up.
The logic for detecting a table is to get rid of everything but vertical lines over a certain length, save that in one image, then get rid of everything but horizontal lines of a certain length, save that image. Then overlay the two and take the bounding rectangle. So you don't need the table to have a border as long as you have vertical and horizontal lines and they extend far enough to encompass all the data you need.
I am not sure if this works, since they are not forms but statements. I.e. there's no defined structure; only the columns are fixed-width, and the rows are different sizes without borders. Would be cool if it worked though. I'll give it a go.
The name "form recognizer" is perhaps poorly chosen, considering it can detect much more than forms (e.g. invoices, receipts). You can create your own custom models as well.
My experience with image-to-text conversion software has not given me the confidence to use it in production or as part of a reliable workflow. What's unique or different about this tool compared to its predecessors? Does it use any novel algorithms, neural nets, or other techniques?
Nice! Have used quite a few tools like this to convert data that government agencies report in PDFs into CSVs. The biggest challenge that existing tools fail to adequately address is when table formats vary (e.g., increasing levels of indentation). Perhaps formatting those as JSON first would be easier.
When you say increasing the level of indentation ... do you have an example handy? I'm working on a pdf / data (word, excel, docx, csv) tool at the moment, and I think it's pretty robust to things like this.
Yep, understandable. Right now it kicks out pdfs that don't fit the rules, but I think there are a few sensitivity variables / configs I can incorporate to make that seamless.
I had been meaning to find or write a tool like this for ages -- oftentimes the only place you can find pinout information for a chip is a table buried on page 7xx of a massive PDF datasheet. Trying to create a symbol for, e.g., a 200+ ball BGA is awful.
Some PDF readers would let you copy and paste the tables as tabular data into Excel, at least with some ST datasheets.
I had to find the right combination of reader and operating system for that.
Hi!
Tesseract is used to recognize numbers as strings, which are then cast to floats.
So it actually does recognize letters, but the cast will raise an error and the value is output as numpy.inf; this was a choice of mine, and one can easily change the code to cast detections to integers, except when it's a letter, in which case keep it as a string. :-)
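Both behaviors described above fit in a few lines; the function names here are mine, not the project's actual API:

```python
import numpy as np

def parse_cell(text):
    """Cast an OCR'd cell to float; non-numeric text becomes numpy.inf,
    mirroring the behavior described above."""
    try:
        return float(text)
    except ValueError:
        return np.inf

def parse_cell_keep_text(text):
    """The suggested variant: cast to int, but keep letters as strings."""
    try:
        return int(text)
    except ValueError:
        return text
```

Swapping one for the other is the only change needed to keep non-numeric cells instead of flagging them.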
Hi!
I think you can use my code to detect and extract the grid, then run a "color detector" script on all the regions instead of the Tesseract recognizer, and return the RGB/HSL/... values in a text file.
I don't know if the regions' pre-processing (denoising, clear border, ...) will be useful though.
For the color detection, I found this tutorial: https://www.pyimagesearch.com/2014/08/04/opencv-python-color... which comes from Adrian Rosebrock (https://github.com/jrosebr1), who makes great Python tutorials.
Hope it'll work!
Tangential question, but I remember some HN link about a SaaS business that was doing OCR on paper bills to make sense of them automatically; anybody remember the name of that service? I have like a thousand of these things to scan and extract tabular data from.