
Pdf2htmlEX – Convert PDF to HTML without losing text or format - coolwanglu
https://github.com/coolwanglu/pdf2htmlEX
======
chill1
I've actually been using this to convert large PDF files to HTML to be
displayed in-browser. It's for my work, so I don't feel comfortable posting a
link to the demo instance here.

It is definitely the best solution I've found so far. The outputted HTML / CSS
/ images look almost identical to the source PDF. That being said, there are a
few issues still:

* One gigantic (600 KB) CSS file from a single PDF

* Hundreds of individual fonts

* HTML semantics are non-existent

These are all relatively easy to fix, I believe. I have found my own solutions
to most of the issues in post-processing.
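
The comment above doesn't say what that post-processing looks like. As one
hypothetical illustration (not chill1's actual solution, and not part of
pdf2htmlEX), a large generated stylesheet could be shrunk by merging rules
whose declaration blocks are textually identical. The function name
`merge_duplicate_rules` is made up for this sketch, and it assumes a flat
stylesheet with no nested blocks such as `@media` and no CSS comments.

```python
import re

# Matches flat rules of the form "selector { declarations }".
# Assumption: the stylesheet has no nested blocks (e.g. @media) or comments.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def merge_duplicate_rules(css: str) -> str:
    """Merge flat CSS rules whose declaration blocks are textually identical."""
    merged = {}  # key -> (selectors, declarations); dicts keep insertion order
    for index, match in enumerate(RULE_RE.finditer(css)):
        # Collapse runs of whitespace so trivially different blocks compare equal.
        selector = " ".join(match.group(1).split())
        declarations = " ".join(match.group(2).split())
        # At-rules such as @font-face cannot share a selector list,
        # so each one gets a unique key and is passed through on its own.
        key = (index,) if selector.startswith("@") else declarations
        merged.setdefault(key, ([], declarations))[0].append(selector)
    return "\n".join(
        f"{', '.join(selectors)} {{ {declarations} }}"
        for selectors, declarations in merged.values()
    )


if __name__ == "__main__":
    print(merge_duplicate_rules(".a { color: red; }\n.b{color: red;}\n.c { color: blue; }"))
```

Note that merging moves later selectors up to the first rule with the same
declarations, which can change which rule wins when two selectors have equal
specificity, so the result should be checked against the original rendering.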

Kudos to you, coolwanglu. Also, I'd like to get in touch with you about
lending a hand to fix some of the issues I've encountered.

Thanks for a cool piece of software!

~~~
coolwanglu
Hey thanks for the info!

The 2nd and 3rd are planned for the future, as I'm still working on accuracy
and speed. Issue #115 (<https://github.com/coolwanglu/pdf2htmlEX/issues/115>)
is about the 2nd one.

As for the first one, I don't have an elegant solution yet. Maybe a CSS file
per page?

Please file new issues at GitHub if you think it's necessary :)

~~~
acmecorps
I love this! Kudos for this awesome app.

------
ComputerGuru
Can anyone recommend an equally good opposite (HTML to PDF)?

wkhtmltopdf [0] is probably the most popular, but it's also ridiculously
buggy.

0: <https://code.google.com/p/wkhtmltopdf/>

~~~
SigmundA
<http://phantomjs.org/> is the best so far in my experience since it handles
all the client side javascript properly.

The PDFs it outputs are full vector, not just rasters. From my understanding,
it's the same engine used in Chrome to view PDFs and print web pages.

~~~
rgrieselhuber
We've tried everything, including PrinceXML, and PhantomJS has been the best
for us so far.

------
AndreasFrom
This works and displays correctly, but it is unbearably slow on an iPad 2,
whereas the PDF loads instantly. What is the point then? Or does it work a lot
better in desktop browsers?

~~~
coolwanglu
I heard that careful optimization on the server side and some clever JS may
solve this. So far the default UI just demonstrates the ability to read while
downloading.

The idea is that the document now becomes more controllable and accessible.
Say, you can put Google Analytics in your resume written in LaTeX, or maybe
build a social reading service where you can comment, annotate and share.

Unlike PDF viewers, web browsers were never optimized for this kind of messy
input. The next version of pdf2htmlEX will focus on optimizations, e.g.
smaller background images; hopefully that will help.

~~~
nwh
> social reading service

I truly wish there was at least one ground that hadn't been touched by
"social" crap.

~~~
twic
Porn?

~~~
nwh
Nope.

------
crazygringo
Interesting. So it converts all vector graphics to a background image per
page, but keeps all text as browser-rendered on top of it.

I guess I don't really see much practical purpose for it -- most browsers
these days seem perfectly fine opening PDF files natively, after all. But it's
a very cool technological demonstration.

Maybe this could be some kind of bridge tool for generating sites with fancy
typographical layout? You could use Adobe Illustrator etc. to do fancy column
work, drop caps, hyphenation, all that jazz -- and then "render" into HTML. It
would certainly be as anti-"responsive" as you can get, but it would certainly
have the ability to generate more advanced typography much faster than you can
produce with HTML/CSS by hand.

~~~
altrego99
As a practical purpose, how about being able to edit a PDF document? I
understand that it can be done through some other tools, but this is one more
- and would be free and easy.

Convert to HTML -> Edit -> Print back to PDF (if needed)
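
A minimal sketch of that round trip, assuming both pdf2htmlEX and wkhtmltopdf
(the HTML-to-PDF tool mentioned elsewhere in this thread) are installed and on
the PATH. The function names and file names are hypothetical; only the basic
`pdf2htmlEX <input.pdf> <output.html>` and `wkhtmltopdf <input> <output.pdf>`
invocations are used, and the edit step in between is done by hand.

```python
import subprocess
from typing import List


def round_trip_commands(pdf_in: str, html_name: str, pdf_out: str) -> List[List[str]]:
    """Return the command lines for PDF -> HTML and HTML -> PDF, in order."""
    return [
        # pdf2htmlEX takes the input PDF followed by an optional output HTML name.
        ["pdf2htmlEX", pdf_in, html_name],
        # wkhtmltopdf takes an input page followed by the output PDF.
        ["wkhtmltopdf", html_name, pdf_out],
    ]


def run_step(command: List[str]) -> None:
    """Run one command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    to_html, to_pdf = round_trip_commands("input.pdf", "input.html", "edited.pdf")
    run_step(to_html)
    input("Edit input.html, then press Enter to print it back to PDF...")
    run_step(to_pdf)
```

Any other HTML-to-PDF tool could be swapped in for the second command.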

~~~
StavrosK
I'm not sure the html will be clean enough to edit, sadly...

------
dannyrough
I do this almost daily. I use a PDF converter driver I found on the internet.
Install it and it becomes a selectable converter option. Then you can convert
PDFs to many formats from any program at all, including Adobe Acrobat. Just
open a PDF, select convert, and choose the format you want; the task finishes
in a few seconds. If you haven't found a good option, you can give it a try.
Best wishes.
[http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-...](http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-html/)

------
_DiskError
Question: does your public folder periodically delete files? I accidentally
uploaded something confidential and it seems to be gone. I was wondering
whether this was a manual deletion or an expiry, since I can still see files
that were uploaded around the same time.

------
alcuadrado
Can't Mozilla's pdf.js be used to get the same result? Great results anyway!

~~~
coolwanglu
You don't want to rely on the computing power at the client side, do you? :)

~~~
crazysim
I guess one possible setup would be pdf.js running on server-side and having
its output captured. One advantage of this, from what I can see, is that there
would probably be fewer external dependencies than this setup.

~~~
coolwanglu
Yes, actually they had this kind of plan, but I am not sure how it has been
going.

It would definitely be interesting that way, but in that case it may not be
worth it to rewrite everything in JS.

------
chucknelson
Promising start. Hopefully performance improves with each release.

~~~
coolwanglu
Right, that is on the schedule. I've heard enough complaints, in a good way.

------
Dnguyen
I didn't see any mention of tables in the doc. Does this mean it's outside of
the "good enough" range? Table extraction would be a great feature.

~~~
coolwanglu
It's still a young project, so currently it's focused on accurate rendering
and speed (which has not been achieved so far).

Recognition features are planned for the future. PDF viewers usually don't
recognize many things either, do they? :)

------
rcfox
How did you manage to get Mediafire to host your demo?

~~~
coolwanglu
MF uses pdf2htmlEX :) It also provides a public folder and a public dropbox
<- I really like that.

This means that you can create one of your own.

------
est
Just passing by to pay my respects to the master.

~~~
v-yadli
Just passing by to pay my respects to the master. +1

