

HTML preview for doc, docx, pdf & rtf - _raghu
http://recruiterbox.com/blog/11/html-preview-for-doc-docx-pdf-rtf

======
afiler
Prompted by downloading a .doc file from Qwest only to find out that inside
was a monospaced text file, I set up a small, nearly UI-free site for doing
document conversions. [http://doc.mar.cx/<url>](http://doc.mar.cx/<url>)
gives an HTML or other sensible rendering of a URL (e.g.
[http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02...](http://doc.mar.cx/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc)
) and
[http://doc.mar.cx/<extension>/<url>](http://doc.mar.cx/<extension>/<url>)
attempts to convert the URL into the format with the given extension (e.g.
[http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/0...](http://doc.mar.cx/txt/http://www.itu.int/dms_pub/itu-t/oth/02/02/T02020000010001MSWE.doc)
).
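
A minimal sketch of the two URL patterns above, assuming the target URL is simply appended to the path as in the examples. The function name `conversion_url` is hypothetical and not part of the site.

```python
BASE = "http://doc.mar.cx"


def conversion_url(source_url, extension=None):
    """Build a request URL following the two patterns described above.

    Without an extension the site picks a sensible rendering; with one,
    it attempts a conversion into that format.
    """
    if extension is None:
        return f"{BASE}/{source_url}"
    # Accept both "txt" and ".txt" for convenience (an assumption of this sketch).
    return f"{BASE}/{extension.lstrip('.')}/{source_url}"
```

For instance, `conversion_url("http://example.com/a.doc", "txt")` builds the `/txt/` form of the request.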

I use wvHtml for doc->html, wvPDF for doc->pdf, but antiword for doc->txt. To
convert .docx, .xls, .xlsx, and WordPerfect files to HTML, I use OpenOffice,
by way of jodconverter. For ODF files, I use OdfConverter. Conversion of Excel
files to .csv files uses xls2csv. For PowerPoint files, I use ppthtml to
convert to html, and catppt to convert to text. For Lotus 1-2-3 files (I added
this after downloading some historical telecom data from the FCC!), I use
ssconvert.
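
The tool choices above amount to a lookup keyed on source and target format. A rough sketch of such a table follows; it only returns tool names and does not show how each tool is invoked. The extensions `wpd`, `odt`, and `wk1`, and the names `CONVERTERS`, `ANY_TARGET`, and `pick_tool`, are assumptions for illustration.

```python
import os
from urllib.parse import urlparse

# (source extension, target extension) -> tool named in the comment above.
CONVERTERS = {
    ("doc", "html"): "wvHtml",
    ("doc", "pdf"): "wvPDF",
    ("doc", "txt"): "antiword",
    ("docx", "html"): "jodconverter",  # drives OpenOffice
    ("xls", "html"): "jodconverter",
    ("xlsx", "html"): "jodconverter",
    ("wpd", "html"): "jodconverter",   # "wpd" for WordPerfect is an assumption
    ("xls", "csv"): "xls2csv",
    ("ppt", "html"): "ppthtml",
    ("ppt", "txt"): "catppt",
}

# Tools mentioned without a specific target format; extensions are assumptions.
ANY_TARGET = {
    "odt": "OdfConverter",
    "wk1": "ssconvert",
}


def source_extension(url):
    """Return the lowercased file extension of the URL's path, without the dot."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lstrip(".").lower()


def pick_tool(url, target):
    """Return the name of the converter for this URL and target, or None."""
    src = source_extension(url)
    return CONVERTERS.get((src, target.lower())) or ANY_TARGET.get(src)
```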

For any conversion that results in an HTML file (e.g. doc or pdf to html), I
bundle all the images into a single file using the data: URL scheme. To do
this, I wrote a utility called pagecan: <http://afiler.com/pagecan/>
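
As an illustration of the data: URL idea (not pagecan's actual implementation, which is not shown here), a minimal sketch might replace each local `<img src="...">` with a base64-encoded data: URL. The function name `inline_images` and the regex-based matching are assumptions; a real tool would more likely use an HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches <img ... src="..."> with a double-quoted src attribute only.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Return html with local image references replaced by data: URLs."""
    base = Path(base_dir)

    def replace(match):
        src = match.group(2)
        path = base / src
        # Leave remote, already-inlined, or missing images untouched.
        if src.startswith(("data:", "http:", "https:")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

The result is a single self-contained HTML file, at the cost of roughly a third more bytes per image from the base64 encoding.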

------
sushi
UX Suggestion: Please hyperlink the Blog text beside the Recruiterbox logo.
It's underlined so users expect it to be a link.

~~~
p4bl0
Also, a <title> tag would be useful :-).

But apart from this, I know I'll face this very problem soon (well, for a
relatively fluctuant value of "soon"), so thank you very much for sharing this
_raghu!

------
bravura
You should also consider 'pandoc', written in Haskell, for converting between
markup formats: <http://johnmacfarlane.net/pandoc/>

I am curious for more details about why Tika wasn't good enough. Please
explain.

~~~
_raghu
Tika is very good at converting documents to plain text. Very reliable too.
The problem for us was that most resumes have a lot of formatting in them.
For example, candidates use tables to structure data. When such a resume is
converted to plain text using Tika, it looks jumbled.

Will take a look at pandoc. Thanks for suggesting.

------
kalmi10
Based on the title I expected some html5 magic for converting binary files
into html in the browser.

------
dpapathanasiou
How would you compare abiword for doc/docx conversion versus antiword
(<http://www.winfield.demon.nl/>)?

Also, what are the limitations of abiword for doc/docx files?

~~~
_raghu
Haven't tried antiword. As of now I find abiword pretty stable for both doc
and docx. I need more data, but I found a few cases where it just hung while
converting. There is no specific pattern to when the program hangs. For now I
am logging such cases and timing out the conversion after 3 seconds.
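
A minimal sketch of the log-and-time-out approach, using Python's `subprocess` module. The function name `convert_with_timeout` is hypothetical, and the abiword arguments in the usage note are an assumption about its command line, so check them against your installed version.

```python
import logging
import subprocess

log = logging.getLogger("converter")


def convert_with_timeout(cmd, timeout_seconds=3.0):
    """Run a converter command; return its exit code, or None if it hung.

    subprocess.run kills the child process when the timeout expires,
    so a hung converter does not linger.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Record the failing command so hanging inputs can be studied later.
        log.warning("conversion timed out after %ss: %r", timeout_seconds, cmd)
        return None
    return result.returncode
```

Usage would look something like `convert_with_timeout(["abiword", "--to=html", "resume.doc"])`, with a `None` result treated as a failed conversion.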

~~~
dpapathanasiou
Thanks.

Where do you get your doc files?

Are they just the ones submitted to your site, or is there a pastebin or
similar repo of doc files?

------
tucosan
How about trying out calibre (<http://calibre-ebook.com>)? It can do all kinds
of conversions from a number of formats, it is quite reliable, and it can be
run headless.

------
jamesshamenski
Million Dollar Question:

How could you additionally parse the information to extract structured data?
For example: names of candidates, addresses, previous employers, job titles
held.

~~~
earle
That's been done across online job boards since 1996 when we launched hotjobs.
Resumes, although varying aesthetically, contain a pretty rigid structure that
lends itself well to localized extraction. This allows easy term extraction
for searching across a very large data set quickly.

A simple 30-line flex/yacc combo will work effectively at a high ninety
percentile.
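
To make the "localized extraction" idea concrete, here is a rough sketch that swaps in plain Python regular expressions and line scanning for the flex/yacc grammar mentioned above. The heading list and the names `SECTION_HEADINGS`, `split_sections`, and `extract_emails` are hypothetical; real resumes need a much longer heading list and looser matching.

```python
import re

# Hypothetical set of headings; a real list would be far longer.
SECTION_HEADINGS = ("experience", "education", "skills")

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_sections(text):
    """Group non-empty lines under the most recent recognized heading."""
    sections = {"header": []}
    current = "header"
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":").lower()
        if key in SECTION_HEADINGS:
            current = key
            sections[current] = []
        elif stripped:
            sections[current].append(stripped)
    return sections


def extract_emails(text):
    """Return every email-like token found in the text."""
    return EMAIL.findall(text)
```

Once the text is split this way, field-specific patterns only have to run on the section where the field is expected, which is what keeps the extraction "localized".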

------
Jakob
Please add a candidate delete function. I sent an email for one candidate with
multiple attachments and Recruiterbox created multiple candidates by mistake.

------
nopal
There's really not much here.

Could we see some code or a demo?

