

Pdf2htmlEX: A PDF to HTML converter - lispython
http://coolwanglu.github.com/pdf2htmlEX/

======
coolwanglu
Hello, I'm the author. MathML is not used. The PDF is rendered with only
HTML/CSS and a little JS.

Please comment on GitHub so that I can see it in time.

~~~
mgualt
Amazing - I am attempting to install this on Mac OS X Lion -- it is taking a
lot of time because of the dependencies. With so many dependencies, the
probability of failure is very high. Let's hope it works.

I urge you to find a way to allow people to install your software more easily.

I managed to get it to install (after about an hour and a half of tinkering).
However, I get "Segmentation fault" when I try running it:

pdf2htmlEX --debug=1 test.pdf

temporary dir: /tmp/pdf2htmlEX-LY9cOv

Preprocessing: ....

Working: Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__css

Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__pages

Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/p1.png

Install font: (29 0) -> f1

Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/f1.pfa

Segmentation fault: 11

~~~
mgualt
I was able to install

cmake, fontforge and libpoppler with homebrew,

gcc-4.7 using

<https://github.com/sol-prog/gcc-4.7-binary>

~~~
evandrix
you mean `poppler` instead of `libpoppler`

------
peterlai
I'm flattered the author mentions Crocodoc. Crocodoc is hiring by the way if
anyone wants to hack on stuff like this full time:
<https://crocodoc.com/jobs/>

------
wesley
What I'd like to see is a library that can extract multi-column text into a
readable format. From looking at the source of the HTML here, they're doing it
with absolute positioning. Nothing wrong with that for display purposes, but
I'd like to have a library that can extract text meaningfully from a multi-
column PDF.
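
As a rough illustration of what "meaningful" extraction could start from, here is a naive Python sketch that turns absolutely positioned fragments into column-by-column reading order. It is only a heuristic under stated assumptions: each fragment is a hypothetical `(x, y, text)` tuple with `y` growing downward, columns are told apart purely by a gap between left edges, and `reading_order` and `column_gap` are made-up names, not part of any existing library.

```python
def reading_order(fragments, column_gap=50.0):
    """Return fragment texts column by column, top to bottom.

    fragments: list of (x, y, text) tuples; x is the left edge and
    y grows downward (both assumptions of this sketch).
    """
    columns = []  # each entry is a list of fragments with nearby left edges
    for frag in sorted(fragments, key=lambda f: f[0]):
        # Start a new column when the left edge jumps by column_gap or more.
        if columns and frag[0] - columns[-1][-1][0] < column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])

    ordered = []
    for col in columns:
        # Within one column, read from top to bottom.
        ordered.extend(text for _, _, text in sorted(col, key=lambda f: f[1]))
    return ordered


if __name__ == "__main__":
    page = [(300, 10, "c"), (10, 30, "b"), (10, 10, "a"), (305, 30, "d")]
    print(reading_order(page))
```

Real documents would need more than this (headings spanning columns, figures, footnotes), so treat it as a starting point only.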

~~~
maxerickson
The pdftotext tool from xpdf does something like that. One option pads the
output text with spaces to roughly match the layout of the pdf (the -layout
option) and another option just strips the pdf formatting out (the -raw
option).

Depending on the structure of the pdf, one or the other may give better
output (the -layout output would need some more processing).
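
For anyone who wants to compare the two modes side by side, here is a minimal Python sketch that builds and runs the two command lines. It assumes a `pdftotext` binary is on the PATH; the helper names `pdftotext_command` and `extract_text` and the file names are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, mode="layout"):
    """Build a pdftotext command line for the given extraction mode."""
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    # -layout pads the text with spaces to mimic the page;
    # -raw drops that layout padding.
    return ["pdftotext", "-" + mode, pdf_path, txt_path]


def extract_text(pdf_path, txt_path, mode="layout"):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, mode), check=True)


if __name__ == "__main__":
    # Hypothetical file names; run both modes and compare the results by eye.
    extract_text("paper.pdf", "paper-layout.txt", mode="layout")
    extract_text("paper.pdf", "paper-raw.txt", mode="raw")
```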

------
fudged71
This is fantastic! I've been using LaTeX for a while now, and nothing has
really outputted HTML anywhere near this quality. I'm very impressed!

------
mgualt
Very interesting - It would be great if the author could outline his overall
goals and design ideas.

What are some of the constraints on the PDF in terms of page dimensions or
configuration?

How is the math translation done? Does it use MathML or something else?

For me, the interest is that I can now go LaTeX ---> Webpage.

~~~
agilebyte
Have you tried wiki.lyx.org/Tools/ELyXer for tex to html? I have used it on my
dissertation and was mightily impressed (I am easily impressed):

<http://patterns.radekstepan.com/>

~~~
mgualt
From my point of view, that's not really tex to html. That's tex markup to
html. I am talking about using the latex software, whose purpose is to do
typesetting. The amazing thing about this converter is that it takes the latex
OUTPUT and produces html.

------
ChuckMcM
Neat idea, make it generate epub and it moves the pdf->e-book ball a bit
further down the field. Looking at the source to this page
(view-source:<http://coolwanglu.github.com/pdf2htmlEX/demo/geneve.html>), it
looks like you can't yet generate a font from the characters; rather, it uses
the 'font trick' to put images on the page. That makes the epub problem harder
(which really really wants fonts, not images, it seems).

~~~
coolwanglu
What do you mean no fonts? You can try to copy the text out, which is not
possible if images are used.

~~~
ChuckMcM
"           "

There I cut and pasted a quote from the document linked into this response.
What do you see? I see a bunch of boxes.

~~~
coolwanglu
If you use an HTML inspector and remove that piece of mess, you'll find that
the text in the HTML also disappears.

It's a problem of font encoding, which is one of the differences between PDF
and HTML. Sometimes you cannot copy the text out of a PDF even though it reads
correctly.

I'm working on that problem. I made things like this so far because I think
visual accuracy is more important.

~~~
ChuckMcM
_"It's the problem of font encoding"_

Yes, and that was exactly my comment. It would be really cool if the converter
generated character code points for the characters on the screen. So that
cutting and pasting did what you might expect. But to make that work you need
to do some form of OCR on the document, figure out where the text is, and how
it is composed, then you create a font which re-creates the look based on the
imagery in the document and then you generate the CSS that lays down the text
and decorates it with the font and re-create the visual of the PDF. (or make
it an epub)

If you can get it to that point, there will be huge utility for folks who want
to convert paper books to e-books. Because the typical scanner will generate
PDF but the typical e-book will only flow e-pub (or .mobi or proprietary
formats).

~~~
coolwanglu
OCR is beyond the scope of pdf2htmlEX. I'm just trying to find out the real
meaning of the glyphs through glyph names.

Actually, you should usually be able to select/copy text without problems if
there are no Type 0 fonts.
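
To make the glyph-name idea concrete, here is a rough Python sketch. It is not pdf2htmlEX's actual code. It loosely follows the Adobe Glyph List naming conventions (`uniXXXX`, `uXXXX`, suffixes after a period, underscore-joined components). The table `KNOWN_GLYPHS` is a tiny excerpt and all function names are made up for illustration.

```python
import re

# Tiny excerpt of glyph-list style names; a real table has thousands of entries.
KNOWN_GLYPHS = {
    "space": " ",
    "A": "A",
    "a": "a",
    "fi": "\ufb01",      # fi ligature
    "endash": "\u2013",
}


def component_to_text(component):
    """Map one glyph-name component to text, or None if it is unknown."""
    if component in KNOWN_GLYPHS:
        return KNOWN_GLYPHS[component]
    # "uni" followed by one or more groups of four uppercase hex digits.
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
    if m:
        digits = m.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    # "u" followed by four to six uppercase hex digits.
    m = re.fullmatch(r"u([0-9A-F]{4,6})", component)
    if m:
        code = int(m.group(1), 16)
        if code <= 0x10FFFF:
            return chr(code)
    return None


def glyph_name_to_text(name):
    """Guess the text for a glyph name such as 'uni0041' or 'A_a.alt'."""
    base = name.split(".", 1)[0]  # drop a suffix such as ".alt"
    if not base:
        return None               # e.g. ".notdef"
    parts = [component_to_text(c) for c in base.split("_")]
    if any(p is None for p in parts):
        return None
    return "".join(parts)
```

A name like `g123` carries no meaning at all, so the sketch returns `None` for it; in that case only the visual shape of the glyph is left to go on.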

------
fpp
Looks pretty neat - saving it as html file works great, but you can't print
the docs (in Chrome print to pdf only shows a scroll bar, in FF it does not
properly format).

Great work anyway - I'll have a deeper look.

~~~
coolwanglu
Yeah, known issue. Currently I've no idea how to fix it :(

------
corry
Very cool, definitely an area where there needs to be lots of work done.

What is browser compatibility like? Is IE8 supported?

Edit: removed reference to HTML5/canvas, didn't see any in the source HTML.

~~~
coolwanglu
AFAIK, IE8 doesn't support enough HTML5 features, so no. IE9 should be OK.

------
SeanDav
Sorry for the very noob question, but how do you actually get this to run on a
Windows XP system?

I just want to run a quick test, but it seems I have to build the project - is
that correct?

~~~
lectrick
Step 1: Only use Windows for games. Fire up a *nix VM, fullscreen it and get
real work done in the big boy's open source developer land :)

------
akie
I'm very impressed. Can you post some more examples online, some non-technical
PDFs for example? I'm curious how well it does 'generic' PDFs (for example
magazine layouts).

~~~
coolwanglu
What do you suggest? I don't have any in mind right now.

------
guilloche
Amazing. Can the same trick be used for latex=>html? It would be better than
tth, which is also very good.

~~~
coolwanglu
Of course, you can compile LaTeX to PDF first.
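
A minimal Python sketch of that two-step pipeline, assuming both `pdflatex` and `pdf2htmlEX` are installed and on the PATH. The helper names `build_commands` and `latex_to_html` and the file name `paper.tex` are made up for illustration, and a document with cross-references may need more than one `pdflatex` pass.

```python
import subprocess
from pathlib import Path


def build_commands(tex_path):
    """Return the two command lines: LaTeX -> PDF, then PDF -> HTML."""
    tex = Path(tex_path)
    pdf = tex.with_suffix(".pdf")
    return [
        ["pdflatex", "-interaction=nonstopmode", tex.name],
        ["pdf2htmlEX", pdf.name],
    ]


def latex_to_html(tex_path):
    """Run both steps inside the directory that holds the .tex file."""
    workdir = Path(tex_path).resolve().parent
    for cmd in build_commands(tex_path):
        # check=True stops the pipeline if either tool fails.
        subprocess.run(cmd, cwd=workdir, check=True)


if __name__ == "__main__":
    latex_to_html("paper.tex")  # hypothetical input file
```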

------
neurostimulant
Very cool! This is exactly what I need. I'm going to play with it for a while.

------
Genmutant
Has somebody already built it for Windows and could upload the binary?

~~~
coolwanglu
I've tried and succeeded with Cygwin, but I have no idea how to distribute the
package with its dependencies.

------
coolwanglu
Please try commit f02e1d4 if any of you cannot build it on Mac.

------
dutchbrit
What about complex vectors with gradients etc?

~~~
coolwanglu
Just go ahead, try it and be pleased.

------
additive
Is it faster than pdftohtml?

~~~
coolwanglu
Probably not, as font conversion is slow. pdftohtml does not extract fonts for
now.

------
Evbn
Technically impressive, but what systems can render HTML and JS but not PDF?

~~~
nilliams
There are also various use-cases for doing this as part of a larger product.
Say you need to take a customer's crappy PDFs & reformat them for display
within a web app, on a public display or to send as an HTML email. You could
use this tool, convert to HTML, then drop in your own CSS stylesheet to
reformat it. If your customer had many of said crappy PDFs you could no doubt
automate the whole process.

Needless to say I had to do something pretty similar recently, though I ended
up having to ask the customer to provide better source data than the PDFs they
initially sent. This tool could have been very useful at the time, hope to
give it a spin soon.

------
Evbn
Is it possible to reflow the page, at least in simple cases like 2-column
documents? That would be awesome for mobile.

~~~
coolwanglu
That's beyond the scope of pdf2htmlEX.

~~~
mgualt
Indeed, and I hope it stays out of scope forever. The idea of reflow is
anathema to the idea of typesetting, as far as I can see.

