Hacker News
Show HN: I made a tool to convert images of tables to CSV (github.com/artperrin)
354 points by aperrin on March 9, 2021 | 60 comments



Great project! I've had success using camelot-py (https://camelot-py.readthedocs.io) to extract tabular data from PDFs (for images, I use ImageMagick to convert them to PDF). If your table has borders, the default method (lattice) works quite well. For non-bordered tables there is the 'stream' option, but it usually requires a bit more preprocessing to get usable results.
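
For anyone who wants to try the two modes, here is a minimal sketch. `pick_flavor` and `extract_tables` are hypothetical helper names of mine, the file path is a placeholder, and it assumes camelot-py is installed and the PDF contains real text.

```python
def pick_flavor(has_borders: bool) -> str:
    """Map 'does the table have ruled borders?' to a Camelot flavor name."""
    return "lattice" if has_borders else "stream"


def extract_tables(pdf_path: str, has_borders: bool = True, pages: str = "1"):
    """Return one pandas DataFrame per table Camelot finds on the given pages."""
    import camelot  # imported lazily; needs `pip install camelot-py`

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=pick_flavor(has_borders))
    # `tables.n` is the number of detected tables; `.df` is the extracted DataFrame.
    return [tables[i].df for i in range(tables.n)]


# Usage (placeholder file name):
# frames = extract_tables("report.pdf", has_borders=False)
# frames[0].to_csv("table.csv", index=False)
```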


How does Camelot extract tables from PDFs? Does it convert them to images and then run OCR?


Hey! Camelot maintainer here. You can check out this doc for details on how Camelot extracts tables from PDFs: https://camelot-py.readthedocs.io/en/master/user/how-it-work...

As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] which will add OCR support (using EasyOCR) for image-based PDFs in some time.

[1] https://github.com/camelot-dev/camelot/pull/209


From the link: "Camelot only works with text-based PDFs and not scanned documents." If you have character data, using it is almost always going to be more accurate than OCR.

I don't know how OP uses it with images converted to PDFs though, as that would be just like a scan, and ImageMagick doesn't do OCR as far as I can tell.


It uses pytesseract and OpenCV, so there is image processing.


Looks like it's a bit in-progress: https://github.com/camelot-dev/camelot/pull/209

"Update docs" isn't checked, and that's what I was going on.


Yes I need to work on that PR, haven't been getting a lot of free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!


This is similar to WebPlotDigitizer, which helps you extract data from graphs:

https://automeris.io/WebPlotDigitizer/index.html


Hi! Thank you for sharing this, it's a great tool that I came across when searching for an image-to-CSV converter. But it seems to work only with graphs, if I'm not mistaken.


Yes, your tool is a welcome addition!


Nice, I've used Engauge Digitizer in the past.

http://markummitchell.github.io/engauge-digitizer/


Hi! I couldn't find a tool like this when I needed it, so I made it as a Python beginner's project. Hope you'll find it useful. :-)



This is neat. At Docsumo, I've had fun building one of the pipelines [0] to extract tables from all kinds of documents. Our older pipelines used image-processing-based approaches. However, they baked in too many assumptions (for instance, about header text, column types, etc.).

Now we've moved on to an ML-based approach, training generic models that can be applied to a variety of documents for table structure recognition.

[0] - https://docsumo.com/free-tools/extract-tables-from-pdf-image...


What was your motivation for making this? I see that you mention that it was a learning project but it seems like a specific enough tool that it is potentially solving some real-world problem for you.


You are right! I'm a student, and my problem was some Supply Chain and Lean Management homework: I needed to run computations on tables of numbers, but my teacher didn't have the data, only screenshots of those tables. So I spent a long time copying numbers into Excel until I decided to implement this. :-)


This is really neat. A lot of the hard part of converting scientific PDFs to text is dealing with the tables, which more often than not are graphical and usually do not have a text overlay.


It would be cool if you could add a license to this!


Done, thank you for the tip! ;-)


What is the default license/freedom for work found on GitHub if it doesn't contain a license?


It's copyrighted by the author, and you are not licensed to use it.


Since it's been distributed to you, you implicitly are allowed to use it unless told otherwise, but possibly only for private use. Most websites, for example, are copyrighted, but you are allowed to read them.


The GitHub ToS allow users to view and fork the project on GitHub, but otherwise normal copyright rules apply.

https://docs.github.com/en/github/creating-cloning-and-archi...


Does anyone have a tool to convert video of scrolling text to a text file?


If the scrolling speed is constant, it should be possible to take every Nth frame and run text recognition on those frames by the usual means.
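
A rough sketch of how N could be picked, assuming you already know the frame rate, the scroll speed in pixels per second and the frame height. The function names and the overlap parameter are my own; grabbing the frames and the OCR itself are left out.

```python
def frame_step(fps, scroll_px_per_sec, frame_height_px, overlap_px=0):
    """How many frames to skip so consecutive samples overlap by roughly overlap_px."""
    if fps <= 0 or scroll_px_per_sec <= 0:
        raise ValueError("fps and scroll speed must be positive")
    if not 0 <= overlap_px < frame_height_px:
        raise ValueError("overlap must be non-negative and smaller than the frame height")
    px_per_frame = scroll_px_per_sec / fps           # how far the text moves per frame
    new_px_per_sample = frame_height_px - overlap_px  # fresh text wanted in each sample
    return max(1, int(new_px_per_sample // px_per_frame))


def sample_indices(total_frames, step):
    """Indices of the frames to keep: 0, step, 2*step, ..."""
    return list(range(0, total_frames, step))


# Example: 30 fps video, text scrolling at 120 px/s, 720 px tall frames,
# keeping 120 px of overlap so no line gets cut between samples.
step = frame_step(fps=30, scroll_px_per_sec=120, frame_height_px=720, overlap_px=120)
frames_to_ocr = sample_indices(total_frames=400, step=step)
```

Keeping some overlap means a line that is cut off at the bottom of one sample shows up whole in the next one, at the cost of having to de-duplicate lines afterwards.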


Maybe take every Nth frame and stitch the shots together with standard photo-stitching tools, then OCR the one big long image. All this to extract chat logs from MS Teams...


Don't they have an API? Also, for a Windows app it should be possible to access the text components directly from another app.


If you're an Exchange admin there might be a way. If you're just a pleb trying to capture project logs, or records of your coworkers bad-mouthing you, tough luck.



Hey, Tabula maintainer here. tabula-java only works with "vector" PDFs. That is, tables drawn with vector lines, squiggles and glyphs.

Integrating an OCR library is something we always wanted to do.


I had some success last year integrating tesseract OCR and OpenCV with Tabula (compiled to javascript). The purpose was to build a Google Docs pdf table import addon without requiring a backend. Happy to get in touch to figure out how I could contribute the work back to Tabula (if that makes sense).

Here is a GIF of table detection for a scanned PDF doc (the first run is slower as it requires fetching the OpenCV JS bundle): https://lh3.googleusercontent.com/-OobUBBtnydg/X6Vn_Ls3juI/A...

Here's a demo of the addon running outside of Google Docs: https://pdftableutil.possiblenull.com/app/


How fast is it? Does it work with rotated images? How about multiple tables per image?


The program runs on Python and Tesseract. It is quite fast (less than one second for a table of 100 numbers), though I never tested it with larger tables. It detects numbers in an image of a table, which is expected to be unrotated and cropped so that only the table is visible. So, in order to process multiple tables per image, one needs to create an image for each table. This program is rather simple, I must say. ;-)

As for handwriting, I think Tesseract can handle the recognition if the writing is good, but the table still needs to satisfy the expected assumptions. Also, the pre-processing can't remove a lot of noise, so that can be a problem too!


What about handwriting?


Try Azure Form Recognizer; it does very well with it in my experience.


Wow this is amazing. Simple and useful. Looking forward to using it


Tangentially, I would like to be able to extract tables from PDF files for my Easy Data Transform software. So I would like to find a C++ library that does this. Can anyone recommend one? It needs to work with proprietary software (so no GPL). It doesn't have to be free. And what is the state of the art on this? How reliably can data tables be extracted from real-world PDFs?


Is there also a solution for automatic border detection? Last year I tried reading bank statements, which were scanned slips. Unfortunately they didn't have any borders, which made it super difficult to extract the content. Would be cool if someone could make something for this :) I thought it would be easy, but I racked my brain over it for several days until I gave up.


https://github.com/eihli/image-table-ocr seems to automatically find tables within larger images, IDK if it works without borders though.


The logic for detecting a table is to get rid of everything but vertical lines over a certain length, save that in one image, then get rid of everything but horizontal lines of a certain length, save that image. Then overlay the two and take the bounding rectangle. So you don't need the table to have a border as long as you have vertical and horizontal lines and they extend far enough to encompass all the data you need.
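
To make the steps concrete, here is a toy sketch on a plain 0/1 grid (nested lists, 1 = ink). This is not the project's code: the function names and `min_len` are my own, and a real pipeline would presumably do the same thing on a thresholded image with an image-processing library.

```python
def keep_long_runs(rows, min_len):
    """Keep only horizontal runs of 1s that are at least min_len pixels long."""
    out = []
    for row in rows:
        kept = [0] * len(row)
        start = None
        for x, value in enumerate(list(row) + [0]):  # trailing 0 closes a run at the edge
            if value and start is None:
                start = x
            elif not value and start is not None:
                if x - start >= min_len:
                    kept[start:x] = [1] * (x - start)
                start = None
        out.append(kept)
    return out


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def table_bounding_box(binary, min_len):
    """Return (left, top, right, bottom) around all long lines, or None if there are none."""
    horizontal = keep_long_runs(binary, min_len)
    # Vertical lines are horizontal runs of the transposed image.
    vertical = transpose(keep_long_runs(transpose(binary), min_len))
    # Overlay the two line images and collect every pixel that survived in either.
    points = [(x, y)
              for y, (h_row, v_row) in enumerate(zip(horizontal, vertical))
              for x, (h, v) in enumerate(zip(h_row, v_row))
              if h or v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```

Short strokes such as text glyphs never reach `min_len` in either direction, so they drop out before the bounding rectangle is taken.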


Yep — reach out to the email in bio. It’s Mac based right now, I’m working on a windows and Linux version.


Azure FormRecognizer API


I am not sure this will work, since they are not forms but statements, i.e. there is no defined structure: only the columns are fixed width, while the rows are different sizes without borders. Would be cool if it worked though. I'll give it a go.


The name "form recognizer" is perhaps poorly given, considering it can detect much more than forms (eg invoices, receipts). You can create your own custom models as well.

Disclaimer: I work for the creator of said service.


Can confirm this is the best out there


My experience with image-to-text conversion software has not given me the confidence to use it in production or as part of a reliable workflow. What's unique or different about this tool compared to its predecessors? Does it use any novel algorithms, neural nets or any other techniques?


Nice! I have used quite a few tools like this to convert data that government agencies report in PDFs to CSVs. The biggest challenge that existing tools fail to adequately address is when table formats vary (e.g., increasing levels of indentation). Perhaps formatting those as JSON first would be easier.


When you say increasing levels of indentation... do you have an example handy? I'm working on a PDF / data (Word, Excel, docx, CSV) tool at the moment, and I think it's pretty robust to things like this.


Accounting tables often do this. This is not the perfect example, but here's a flavor of that. (last page of PDF) https://s23.q4cdn.com/574569502/files/doc_financials/2021/q4...


Yep, understandable. Right now it kicks out pdfs that don't fit the rules, but I think there are a few sensitivity variables / configs I can incorporate to make that seamless.


I had been meaning to find or write a tool like this for ages -- often times the only place where you can find pinout information for a chip is from a table buried on page 7xx of a massive pdf datasheet. Trying to create a symbol for, e.g. a 200+ ball BGA is awful.


Some PDF readers would let you copy and paste the tables as tabular data into Excel, at least with some ST datasheets. Had to find the right combination of reader and operating system for that.


Exactly the application I had in my mind. aperrin, does the tool recognize letters as well, or is it numbers only for now?


Hi! Tesseract is used to recognize numbers as strings, which are then cast to floats. So it actually does recognize letters, but the cast then raises an error and the cell is output as numpy.inf. This was a choice of mine, and one can easily change the code to cast detections to integers, except when the detection is a letter, in which case it is kept as a string. :-)
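
The two behaviours could look roughly like this. It is a sketch with made-up function names, not the code from the repo, and it uses `math.inf` where the tool uses `numpy.inf` so that it runs without NumPy.

```python
import math


def parse_cell_strict(text):
    """Current behaviour as described: a float, or infinity when the cast fails."""
    try:
        return float(text.strip())
    except ValueError:
        return math.inf


def parse_cell_lenient(text):
    """Alternative: int if possible, then float, otherwise keep the recognized string."""
    cleaned = text.strip()
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    return cleaned


# For a pinout table, the lenient version keeps labels such as "GND" or "A12"
# instead of turning them into infinity.
row = [parse_cell_lenient(cell) for cell in ["A12", "GND", "3", "1.8"]]
```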


Do you have any idea on how I could use your code to parse tables which have colored cells but no text?


Hi! I think you can use my code to detect and extract the grid, then run a "color detector" script on all the regions instead of the Tesseract recognizer, and return the RGB/HSL/... values in a text file. I don't know if the region pre-processing (denoising, border clearing, ...) will be useful though. For the color detection, I found this tutorial: https://www.pyimagesearch.com/2014/08/04/opencv-python-color... which comes from Adrian Rosebrock (https://github.com/jrosebr1), who makes very good Python tutorials. Hope it'll work!
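
As a starting point, the "color detector" part could be as simple as averaging each cell. This sketch assumes the image is a nested list of (R, G, B) tuples and that the grid step already gave you one box per cell; `mean_color`, `cell_colors` and the box layout are hypothetical names and conventions of mine.

```python
def mean_color(image, box):
    """Average (R, G, B) over box = (left, top, right, bottom); right/bottom are exclusive."""
    left, top, right, bottom = box
    pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not pixels:
        raise ValueError("empty cell region")
    count = len(pixels)
    return tuple(sum(pixel[channel] for pixel in pixels) // count for channel in range(3))


def cell_colors(image, boxes):
    """One averaged color per cell, in the same order as the boxes."""
    return [mean_color(image, box) for box in boxes]


# Tiny 4x2 image: left half red, right half blue, split into two cells.
RED, BLUE = (255, 0, 0), (0, 0, 255)
image = [[RED, RED, BLUE, BLUE],
         [RED, RED, BLUE, BLUE]]
colors = cell_colors(image, [(0, 0, 2, 2), (2, 0, 4, 2)])
lines = [",".join(str(value) for value in color) for color in colors]
```

Shrinking each box by a pixel or two before averaging would keep the grid lines from tinting the result.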


Tangential question, but I remember some HN link about a SaaS business that was doing OCR on paper bills to make sense of them automatically. Does anybody remember the name of that service? I have something like a thousand of these things to scan and extract tabular data from.


There was Shoeboxed over a decade ago, looks like it’s still around. I think there have been others in the meantime.



That's pretty neat.



