Please comment on GitHub so that I can see it in time.
When viewing the output (using the "computer science cheat sheet") I found some differences between browsers that I thought HN readers might find interesting. These aren't primarily issues with your tool, hence posting here.
- I primarily use Chrome (21) as my browser, and the cheat sheet renders very quickly. I noticed it doesn't seem to render some equations correctly (see bad operators here).
- FF (15.0.1) seems to render more correctly, but it is glacially slow. The whole app (chrome and all) freezes for several seconds between clicks while the document is loaded in any tab.
- IE (9) renders the same page both correctly and quickly.
For Chrome, I think everything should be fine if you zoom in, but Chrome lacks antialiasing on Windows.
I'm working on the Firefox problem.
I urge you to find a way to allow people to install your software more easily.
I managed to get it to install (after about an hour and a half of tinkering). However, I get "Segmentation fault" when I try running it:
pdf2htmlEX --debug=1 test.pdf
temporary dir: /tmp/pdf2htmlEX-LY9cOv
Working: Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__css
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__pages
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/p1.png
Install font: (29 0) -> f1
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/f1.pfa
Segmentation fault: 11
cmake, fontforge and libpoppler with homebrew,
sudo apt-add-repository ppa:coolwanglu/pdf2htmlex
sudo apt-get update
sudo apt-get install pdf2htmlex
Would you mind sending me the PDF file so that I can debug it?
Does it always crash, even with other PDF files?
I hope this helps.
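One way to answer the "other PDF files" question is to run the converter over several files and note which ones die from a signal. Below is a minimal sketch in Python, not part of pdf2htmlEX: the helper names `describe_exit` and `run_converter` are made up for illustration, and the only thing taken from this thread is the `pdf2htmlEX --debug=1 test.pdf` invocation. It relies on the POSIX convention that `subprocess` reports a negative return code when the child was killed by a signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"crashed ({name})"
    return f"failed (exit code {returncode})"


def run_converter(pdf_path, command=("pdf2htmlEX", "--debug=1")):
    """Run the converter on one file and classify how it exited.

    `command` defaults to the invocation quoted in this thread; it is a
    parameter so that another binary or flag set can be substituted.
    """
    try:
        result = subprocess.run(
            [*command, pdf_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return "converter not found"
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_pdfs.py a.pdf b.pdf c.pdf
    for path in sys.argv[1:]:
        print(f"{path}: {run_converter(path)}")
```

If only files with embedded fonts of one kind show up as crashed, that would narrow things down considerably before sending a sample file.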
Executable based on sources from 14:57 GMT 31-Jul-2012-D.
Library based on sources from 14:57 GMT 31-Jul-2012.
However, it should not crash, and it's been confirmed by many people now.
Could you please try commit f02e1d4?
Did you ask them how they do their HTML5 conversion or what exactly do you mean by that?
Anyway, a big thanks for creating this project!
Depending on the structure of the PDF, one or the other may give better output (the -layout output would need some more processing).
What are some of the constraints on the PDF in terms of page dimensions or configuration?
How is the math translation done? Does it use MathML or something else?
For me, the interest is that I can now go LaTeX ---> Webpage.
There I cut and pasted a quote from the document linked into this response. What do you see? I see a bunch of boxes.
It's a font encoding problem, which is one of the differences between PDF and HTML. Sometimes you cannot copy the text out of a PDF even though it displays correctly.
I'm working on that problem. I've done things this way so far because I think visual accuracy is more important.
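To illustrate why text can look right but copy as boxes, here is a toy model in Python. It is only a sketch under simplifying assumptions and not how pdf2htmlEX works internally: the tables `GLYPHS` and `TO_UNICODE` and the functions `render` and `copy_text` are invented for the example. The idea is that drawing uses a code-to-glyph table, while copying needs a separate code-to-Unicode table, which an embedded font may lack.

```python
# Hypothetical toy model; none of these names come from pdf2htmlEX.

# Used for drawing: character code -> glyph shape shown on screen.
GLYPHS = {1: "∑", 2: "x", 3: "²"}

# Used for copy/paste: character code -> Unicode text.
# A PDF font may ship without this mapping.
TO_UNICODE = {1: "\u2211", 2: "x", 3: "\u00b2"}


def render(codes):
    """What the reader sees: always correct, since glyphs are embedded."""
    return "".join(GLYPHS[c] for c in codes)


def copy_text(codes, to_unicode=None):
    """What lands on the clipboard.

    Without a code -> Unicode table, this toy model falls back to treating
    the raw character code as a code point, which yields control
    characters that typically show up as boxes.
    """
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


codes = [1, 2, 3]
print(render(codes))                        # drawn text
print(repr(copy_text(codes, TO_UNICODE)))   # copy with a mapping
print(repr(copy_text(codes)))               # copy without a mapping
```

In this model the first two calls agree and the third returns unprintable characters, which matches the "I see a bunch of boxes" symptom described above.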
Yes, and that was exactly my comment. It would be really cool if the converter generated character code points for the characters on the screen, so that cutting and pasting did what you might expect. But to make that work you need to do some form of OCR on the document to figure out where the text is and how it is composed. Then you create a font that re-creates the look based on the imagery in the document, and you generate the CSS that lays down the text, decorates it with the font, and re-creates the visual of the PDF (or make it an epub).
If you can get it to that point, there will be huge utility for folks who want to convert paper books to e-books. Because the typical scanner will generate PDF but the typical e-book will only flow e-pub (or .mobi or proprietary formats).
Actually, you should usually be able to select and copy text without problems, as long as there are no Type 0 fonts.
Great work anyway - I'll have a deeper look.
I just want to run a quick test, but it seems I have to build the project - is that correct?
What is browser compatibility like? Is IE8 supported?
Edit: removed reference to HTML5/canvas, didn't see any in the source HTML.
Needless to say I had to do something pretty similar recently, though I ended up having to ask the customer to provide better source data than the PDFs they initially sent. This tool could have been very useful at the time, hope to give it a spin soon.