
Ask HN: How do you extract data from a PDF catalog and put it in a database? - Ceezy
Right now I hire people in India; they write Excel files and then I can insert them into my database. Do you know more efficient ways? Or even tools?
======
SQL2219
You might consider converting the PDF files to RTF; there are command-line
tools that do that. Once a file is converted to RTF, you can programmatically
read and parse through the text. I am assuming that your PDF files contain
tables, but even if they don't, this will still probably work.
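
To make the "read and parse through the text" step concrete, here is a minimal Python sketch. It assumes the converted file has already been reduced to plain text, that paragraphs are separated by blank lines, and that the first line of each paragraph is a product name. The `products` table, its columns, and the sample text are all hypothetical; a real catalog will need its own parsing rules.

```python
import re
import sqlite3


def split_products(text):
    """Split catalog text into (name, description) pairs.

    Assumption: paragraphs are separated by blank lines and the first
    non-empty line of each paragraph is the product name.
    """
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue  # skip empty blocks
        records.append((lines[0], " ".join(lines[1:])))
    return records


def load_into_db(conn, records):
    """Insert the parsed records into a hypothetical `products` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (name, description) VALUES (?, ?)", records
    )
    conn.commit()


# Made-up sample text standing in for the converted catalog.
sample = """Widget A
Small steel widget.
Sold in boxes of 10.

Widget B
Large brass widget.
"""

conn = sqlite3.connect(":memory:")
load_into_db(conn, split_products(sample))
rows = conn.execute(
    "SELECT name, description FROM products ORDER BY name"
).fetchall()
```

Swapping `":memory:"` for a file path would keep the data on disk. The parameterized `INSERT` avoids quoting problems when product names contain apostrophes.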

------
beamatronic
How clear are the PDFs? Have you tried any OCR?

~~~
Ceezy
Thank you guys! My PDFs are very UNCLEAR! They are mixed with photos and are at
least 30 MB each, a nightmare... Plus they aren't really structured; it's more
like every paragraph starts with the name of a product. The documents are really
made to be read by a seller and not by a computer. I will google OCR and RTF to
choose which one is best. Do you guys (and girls) have any idea how big companies
like Amazon deal with that?

~~~
SQL2219
Mechanical Turk

~~~
beamatronic
Are they single page PDFs? If so, you might want to first convert them to high
quality JPEGs, since that will be easier to deal with. With Mechanical Turk
you'll want to make a task that gives the worker one image at a time. Instead
of working with Excel sheets, have you considered using Google Forms? You can
give access to a Form and not to the underlying Sheet itself. If you use a
Google Sheet as your "system of record", you will automatically have it backed
up, and you will always have the option to export to Excel if you need to.
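
A rough Python sketch of the "one image at a time" idea follows. It assumes the poppler `pdftoppm` tool is installed for the PDF-to-JPEG step and that the page images end up hosted somewhere the workers can reach. The file names, the example.com URLs, and the `image_url` column name are placeholders, not anything Mechanical Turk requires.

```python
import csv
import io


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build a poppler `pdftoppm` command that writes one JPEG per page."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, out_prefix]


def write_task_csv(image_urls, out_file):
    """Write one row per image so each task shows a single page."""
    writer = csv.writer(out_file)
    writer.writerow(["image_url"])  # hypothetical column name
    for url in image_urls:
        writer.writerow([url])


# Hypothetical paths; to actually convert, pass the list to
# subprocess.run(cmd, check=True) on a machine with poppler installed.
cmd = pdftoppm_command("catalog.pdf", "pages/catalog")

# Hypothetical URLs of the uploaded page images.
urls = [
    "https://example.com/pages/page-1.jpg",
    "https://example.com/pages/page-2.jpg",
]
buf = io.StringIO()
write_task_csv(urls, buf)
task_csv = buf.getvalue()
```

Keeping one image per row means a slow or wrong answer only affects a single page, and the same CSV can be reused as a checklist of which pages have been entered.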

------
SQL2219
I think MS Office 2016 can convert PDF to RTF also.

