Pdf2htmlEX: A PDF to HTML converter (coolwanglu.github.com)
107 points by lispython on Sept 16, 2012 | 61 comments



I'm flattered the author mentions Crocodoc. Crocodoc is hiring by the way if anyone wants to hack on stuff like this full time: https://crocodoc.com/jobs/


Hello, I'm the author. MathML is not used. The PDF is rendered with only HTML/CSS and a little JS.

Please comment on GitHub so that I can see it in time.


This is clever, thanks for sharing.

When viewing the output (using the "computer science cheat sheet") I found some differences between browsers that I thought HN readers might find interesting. These aren't primarily issues with your tool, hence posting here.

- I primarily use Chrome (21) as my browser, and the cheat sheet renders very quickly. I noticed it doesn't seem to render some equations correctly (see bad operators here[1]).

- FF (15.0.1) seems to render more correctly, but it is glacially slow. The whole app (chrome and all) freezes for several seconds between clicks while the document is loaded in any tab.

- IE (9) renders the same page both correctly and quickly.

[1] http://imageshack.us/a/img88/3754/chromeformulas.png


The problem happens only on Windows.

For Chrome, if you zoom in, I think everything should be fine. But Chrome lacks antialiasing on Windows.

I'm trying to solve the Firefox problem.


Just to add to the compatibility list, all examples render perfectly on Opera 12, albeit a bit slow.


Amazing - I am attempting to install this on Mac OS X Lion -- it is taking a lot of time because of the dependencies. With so many dependencies the probability of failure is very high. Let's hope it works.

I urge you to find a way to allow people to install your software more easily.

I managed to get it to install (after about an hour and a half of tinkering). However, I get "Segmentation fault" when I try running it:

     pdf2htmlEX --debug=1 test.pdf
     temporary dir: /tmp/pdf2htmlEX-LY9cOv
     Preprocessing: ....
     Working: Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__css
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__pages
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/p1.png
     Install font: (29 0) -> f1
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/f1.pfa
     Segmentation fault: 11


I was able to install cmake, fontforge and libpoppler with Homebrew, and gcc-4.7 using https://github.com/sol-prog/gcc-4.7-binary


You mean `poppler`, not `libpoppler`.


     sudo apt-add-repository ppa:coolwanglu/pdf2htmlex   
     sudo apt-get update
     sudo apt-get install pdf2htmlex
I haven't yet tried to build on Mac OS, but on Ubuntu it was trivially simple.


This has been confirmed by some Mac users. We are working on it. Please hold on, and join the discussion on GitHub if you like. Thanks for your patience.


The compile problem should be fixed now. Could you please try the latest master branch and see whether it works, or whether it fails at an assertion?


Thank you for replying - I tried the new branch, and posted the problem I encountered as an issue on GitHub, with gists attached.


I cannot reproduce it with a 20110222 version of fontforge.

Would you mind sending me the PDF file so that I can debug it?

Does it always crash, even with other PDF files?


Yes - it crashes as described with any PDF file.


Sorry to hear that. Some people are working on a MacPorts port and a Homebrew formula.

I hope this helps: https://trac.macports.org/ticket/36028


The problem is that I don't have a Mac. Which version of fontforge have you installed?


This is what fontforge displays when I start it up:

Executable based on sources from 14:57 GMT 31-Jul-2012-D. Library based on sources from 14:57 GMT 31-Jul-2012.


I'm now trying to compile with an older version. But please update fontforge if you can.


I think my fontforge is the current version (see above) -- please correct me if I'm wrong.


I usually build from git. There have been some improvements relevant to pdf2htmlEX during the past month.

However, it should not crash, and the crash has now been confirmed by many people.

Could you please try commit f02e1d4?


Hi! This is an incredible project. I'm just curious: you mention that Crocodoc has been "consulted" for this project.

Did you ask them how they do their HTML5 conversion, or what exactly do you mean by that?

Anyway, a big thanks for creating this project!


I meant that I took a look at an HTML page generated by Crocodoc. Their approach was interesting.


This is awesome stuff! Thanks for sharing this.


Damn, that's cool. Somewhat full circle too, in light of the many pdf printer drivers in use today.


What I'd like to see is a library that can extract multi-column text into a readable format. From looking at the source of the HTML here, they're doing it with absolute positioning. Nothing wrong with that for display purposes, but I'd like to have a library that can extract text meaningfully from a multi-column PDF.


The pdftotext tool from xpdf does something like that. One option pads the output text with spaces to roughly match the layout of the pdf (the -layout option) and another option just strips the pdf formatting out (the -raw option).

Depending on the structure of the PDF, one or the other may give better output (the -layout output would need some more processing).
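
To compare the two modes on the same file, something like the following could work. This is only a sketch: it assumes `pdftotext` is on your PATH, and the helper names and the `.layout.txt` / `.raw.txt` output naming are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, mode="layout"):
    """Build the argument list for pdftotext in either -layout or -raw mode."""
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    return ["pdftotext", "-" + mode, pdf_path, txt_path]


def extract_both(pdf_path):
    """Run pdftotext once per mode so the two outputs can be compared."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    outputs = {}
    for mode in ("layout", "raw"):
        # Hypothetical naming scheme: paper.pdf -> paper.pdf.layout.txt, etc.
        txt_path = f"{pdf_path}.{mode}.txt"
        subprocess.run(pdftotext_command(pdf_path, txt_path, mode), check=True)
        outputs[mode] = txt_path
    return outputs
```

Calling `extract_both("paper.pdf")` would leave two text files next to the PDF; for a two-column paper you can then eyeball which one reads better.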


This is fantastic! I've been using LaTeX for a while now, and nothing has really outputted HTML anywhere near this quality. I'm very impressed!


Very interesting - It would be great if the author could outline his overall goals and design ideas.

What are some of the constraints on the PDF in terms of page dimensions or configuration?

How is the math translation done? Does it use MathML or something else?

For me, the interest is that I can now go LaTeX ---> Webpage.


Have you tried wiki.lyx.org/Tools/ELyXer for tex to html? I have used it on my dissertation and was mightily impressed (I am easily impressed):

http://patterns.radekstepan.com/


From my point of view, that's not really tex to html. That's tex markup to html. I am talking about using the latex software, whose purpose is to do typesetting. The amazing thing about this converter is that it takes the latex OUTPUT and produces html.


Neat idea, make it generate epub and it moves the pdf->e-book ball a bit further down the field. Looking at the source to this page view-source:http://coolwanglu.github.com/pdf2htmlEX/demo/geneve.html it looks like you can't yet generate a font from the characters, rather it uses the 'font trick' to put images on the page. That makes the epub problem harder (which really really wants fonts not images it seems)


What do you mean no fonts? You can try to copy the text out, which is not possible if images are used.


"           "

There I cut and pasted a quote from the document linked into this response. What do you see? I see a bunch of boxes.


If you use an HTML inspector and remove that mess, you'll find that the text in the HTML also disappears.

It's the problem of font encoding, which is one of the differences between PDF and HTML. Sometimes you cannot copy the text out of a PDF, even though you can read it correctly.

I'm working on that problem. I made things like this so far because I think visual accuracy is more important.


"It's the problem of font encoding"

Yes, and that was exactly my comment. It would be really cool if the converter generated character code points for the characters on the screen. So that cutting and pasting did what you might expect. But to make that work you need to do some form of OCR on the document, figure out where the text is, and how it is composed, then you create a font which re-creates the look based on the imagery in the document and then you generate the CSS that lays down the text and decorates it with the font and re-create the visual of the PDF. (or make it an epub)

If you can get it to that point, there will be huge utility for folks who want to convert paper books to e-books. Because the typical scanner will generate PDF but the typical e-book will only flow e-pub (or .mobi or proprietary formats).


OCR is beyond the scope of pdf2htmlEX. I'm just trying to find out the real meaning of the glyphs through their glyph names.

Actually, you should usually be able to select and copy text without problems, as long as there are no Type 0 fonts.
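
For anyone curious what "meaning through glyph names" can look like, here is a minimal sketch. It is not taken from pdf2htmlEX; it only illustrates the common convention where a glyph named `uniXXXX` or `uXXXX` encodes its code point in hex, plus a tiny hand-written table (`KNOWN_GLYPHS`, hypothetical and far from complete) for descriptive names.

```python
import re

# Tiny illustrative table; a real glyph list has thousands of entries.
KNOWN_GLYPHS = {"space": " ", "A": "A", "a": "a", "one": "1", "plus": "+"}

_UNI = re.compile(r"^uni((?:[0-9A-F]{4})+)$")  # uni0041, uni00660069, ...
_U = re.compile(r"^u([0-9A-F]{4,6})$")         # u0041, u1F600, ...


def glyph_name_to_text(name):
    """Return the text a glyph name stands for, or None if it cannot be guessed."""
    base = name.split(".", 1)[0]  # drop suffixes such as ".alt" or ".sc"
    if base in KNOWN_GLYPHS:
        return KNOWN_GLYPHS[base]
    m = _UNI.match(base)
    if m:
        digits = m.group(1)
        # Each group of four hex digits is one code point.
        return "".join(chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4))
    m = _U.match(base)
    if m:
        code = int(m.group(1), 16)
        if code <= 0x10FFFF:
            return chr(code)
    return None
```

Names that carry no information (an arbitrary `g123`, say) come back as `None`, which is exactly the case where copy and paste produces boxes.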


Looks pretty neat - saving it as html file works great, but you can't print the docs (in Chrome print to pdf only shows a scroll bar, in FF it does not properly format).

Great work anyway - I'll have a deeper look.


Yeah, known issue. Currently I've no idea how to fix it :(


I'm very impressed. Can you post some more examples online, some non-technical PDFs for example? I'm curious how well it does 'generic' PDFs (for example magazine layouts).


What do you suggest? I don't have one in mind right now.


Sorry for the very noob question, but how do you actually get this to run on a windows XP system?

I just want to run a quick test, but it seems I have to build the project - is that correct?


Step 1: Only use Windows for games. Fire up a *nix VM, fullscreen it and get real work done in the big boy's open source developer land :)


You can build it with Cygwin.


Very cool, definitely an area where there needs to be lots of work done.

What is browser compatibility like? Is IE8 supported?

Edit: removed reference to HTML5/canvas, didn't see any in the source HTML.


AFAIK, IE8 doesn't support enough HTML5 features, so no. IE9 should be OK.


Amazing. Can the same trick be used for latex=>html? It would be better than tth, which is also very good.


Of course, you can compile the LaTeX to PDF first.


Very cool! This is exactly what I need. I'm going to play with it for a while.


Has somebody already built it for Windows and could upload the binary?


I've tried and succeeded with Cygwin, but I have no idea how to distribute the package with its dependencies.


Please try commit f02e1d4 if any of you cannot build it on Mac.


What about complex vectors with gradients etc?


Just go ahead and try it; you should be pleased.


Is it faster than pdftohtml?


Probably not, as font conversion is slow. pdftohtml does not extract fonts for now.


Technically impressive, but what systems can render HTML and JS but not PDF?


There are also various use-cases for doing this as part of a larger product. Say you need to take a customer's crappy PDFs & reformat them for display within a web app, on a public display or to send as an HTML email. You could use this tool, convert to HTML, then drop in your own CSS stylesheet to reformat it. If your customer had many of said crappy PDFs you could no-doubt automate the whole process.

Needless to say I had to do something pretty similar recently, though I ended up having to ask the customer to provide better source data than the PDFs they initially sent. This tool could have been very useful at the time, hope to give it a spin soon.
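
Automating that could be as simple as the sketch below. It assumes `pdf2htmlEX` is installed and is called with just the input file, as in the invocation shown elsewhere in this thread; the function names are made up, and restyling the output with your own CSS is left out.

```python
import subprocess
from pathlib import Path


def conversion_commands(pdf_dir):
    """Return one pdf2htmlEX command per PDF found in pdf_dir, sorted by name."""
    return [["pdf2htmlEX", str(p)] for p in sorted(Path(pdf_dir).glob("*.pdf"))]


def convert_all(pdf_dir):
    """Run each command; collect the PDFs that failed instead of stopping."""
    failed = []
    for cmd in conversion_commands(pdf_dir):
        result = subprocess.run(cmd)
        # A non-zero return code also covers crashes such as a segfault.
        if result.returncode != 0:
            failed.append(cmd[-1])
    return failed
```

`convert_all("customer_pdfs")` would then return the list of files that need a second look by hand.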


Firefox and IE on Windows.


Is it possible to reflow the page, at least in simple cases like 2-column documents? That would be awesome for mobile.


That's beyond the scope of pdf2htmlEX.


Indeed, and I hope it stays out of scope forever. The idea of reflow is anathema to the idea of typesetting, as far as I can see.



