

Ask HN: please review my app - html to pdf API - jgresula
http://pdfcrowd.com

======
dpapathanasiou
Have you thought about the reverse, i.e., a tool that could convert pdfs to
html faithfully?

I would be willing to pay money for a reliable tool that didn't need much
manual editing after processing.

Unfortunately, the pdftohtml project (<http://pdftohtml.sourceforge.net/>) has
been inactive, and the current version has trouble with even moderately
complex layouts.

~~~
jgresula
That's a non-trivial task. PDF has no objects such as tables, styles,
lists or paragraphs, so you would need to reconstruct this kind of
information. Also, text and vector graphics are positioned absolutely. Tagged
PDFs contain some meta-information about the document structure, which could
help, but it is still a lot of work.

The fundamental problem is that PDF stores the document presentation while
html defines the document and the presentation is created by the browser. And
obviously, restoring a document definition from its presentation is hard, as
a lot of information is missing.
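
To illustrate the kind of reconstruction involved, here is a minimal sketch that guesses text lines from absolutely positioned fragments. The `Fragment` type, the sample coordinates and the `y_tolerance` value are assumptions made up for this example; a real converter would also have to recover paragraphs, tables and lists.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text with an absolute position (hypothetical structure)."""
    x: float
    y: float
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Guess text lines by clustering fragments with nearly equal y values."""
    lines = []  # each entry: (reference y, list of fragments)
    # PDF's default y axis grows upward, so descending y reads top to bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within each guessed line, order fragments left to right and join them.
    return [" ".join(f.text for f in sorted(frags, key=lambda f: f.x))
            for _, frags in lines]


fragments = [
    Fragment(72, 700, "Hello"),
    Fragment(110, 700.5, "world"),
    Fragment(72, 680, "Second"),
    Fragment(120, 680, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'Second line']
```

Even this simple heuristic breaks on multi-column layouts, which hints at why faithful conversion is so much work.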

~~~
dpapathanasiou
_That's a non-trivial task._

Yes, that's true.

I only bring it up b/c if your goal is to turn pdfcrowd into an app that
people would pay money for (and I would be one of them), solving that problem
would go a long way towards achieving it.

~~~
brandnewlow
Total noob question, couldn't you programmatically capture a browsershot and
then convert that into a PDF?

HTML -> png seems to have been figured out. Is .png -> pdf that hard to do?

~~~
vibhavs
No, .png to .pdf is not difficult.

I believe dpapathanasiou's suggestion is not to blindly convert a pdf into
an html file containing one giant image of the pdf.

Instead, he wants to create an html document that maintains the same content
and layout from the pdf.

~~~
brandnewlow
D'Oh! Got myself mixed up there a bit.

------
petesalty
I can easily see a use for this. I'm doing a pro bono project for a small
nonprofit, and part of the project requires generating simple PDF reports. They
don't have any money so we need to keep it low cost.

One of the ways of doing this is to host it on a simple shared server (it's
not a heavily used app).

Downside of this is that it's unlikely we'll be able to use any of the PDF
tools I've used in the past (since they need to be installed). This should
work fine for our purposes.

Thanks, I was wondering how I'd get around this.

To all those who were dissing this because they couldn't immediately see a use
for it, try to have a more open mind.

~~~
wdewind
I'm also developing an HTML->PDF feature and jumped when I saw this! I tried
smashingmag.com - which is funny because I meant smashingmagazine.com but
smashingmag.com is actually some Japanese site. Either way, I got back a
totally blank pdf - maybe the Japanese character set is to blame?

One other request: the ability to render flash would be awesome as well. The
main function of pdf, as I understand it, is to create a document that
PRINTS identically on every setup, so people are frequently going
to want to print flash, which is already a huge pain in the ass. Unfortunately
it looks like the pdf blanks out completely if there is flash on the page
(2advanced.net).

If you could solve that I would start paying tomorrow.

------
raffi
The quality of <http://www.princexml.com> is amazing. It's not open source and
there is a cost to use it commercially (<1K, if I recall). I used it to
convert my HTML documentation for Sleep into a camera-ready PDF.

<http://sleep.dashnine.org/manual/> \- original docs
<http://sleep.dashnine.org/download/sleep21manual.pdf> \- result

~~~
jonallanharper
I have exhausted myself trying to persuade prince xml to _not_ blur my images.
That's the biggest hurdle for me.

If PDFCrowd can effectively handle images, I'll brand their logo into my
bicep.

~~~
corruption
Have you tried flying saucer? I found it to be excellent for my purposes.

~~~
jonallanharper
Have not. I will definitely look into it. Thanks!

------
thepsi
Nice execution - as per the comment below, something like this would've saved
me lots of manual fiddling back when I was doing lots of PDF stuff.

Given the focus on APIs I guess you're aiming it at those wanting to
programmatically generate PDFs using a familiar markup, rather than conversion
of existing (static) content into PDF? If so, maybe investigate the ability to
overlay rendering onto an existing PDF template at some point - in my
experience it's been a common requirement (think form letters, account
statements, etc).

Interesting that it appears to execute Javascript; guess it's a sign of the
times that you _need_ to in order to render many sites correctly nowadays. I
haven't poked it too hard, but suspect there might be one or two security
challenges there...

------
DanHulton
Well, your default HTML generates one screwy PDF. When viewed in Mac OS X
Preview, I get the text "T pe our HTML here..." Then, when I select the text,
certain letters get partially removed or overwritten and I end up with
gibberish.

I've just spent weeks working on HTML -> PDF conversion code, so I know it's
not just my viewer. I've put all kinds of crazy stuff through there.

~~~
thepsi
Exact same thing works perfectly for me (OS X Preview, version 5.0.1). I'd be
interested to know what this turns out to be.

------
karanbhangui
Slick design, but out of curiosity, why wouldn't developers just use
<http://code.google.com/p/wkhtmltopdf/> ?

~~~
jgresula
There is no doubt that many developers will use wkhtmltopdf.

I think that Pdfcrowd's selling points could be 1) wide availability -
only HTTP is needed, so it can theoretically be used on any platform 2) no need
to install any 3rd party software, which makes applications more portable
3) API bindings
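
As a rough sketch of what "only HTTP is needed" means for a client, the following builds a form-encoded POST using nothing but the Python standard library. The endpoint URL and the field names (`username`, `key`, `src`) are hypothetical placeholders, not the actual Pdfcrowd API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field names, for illustration only;
# they are not taken from the Pdfcrowd documentation.
API_URL = "https://api.example.com/convert"


def build_conversion_request(html, username, api_key):
    """Build (but do not send) a form-encoded POST carrying the HTML."""
    body = urlencode({"username": username, "key": api_key, "src": html})
    return Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_conversion_request("<h1>Report</h1>", "demo", "secret")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` and writing the response bytes to a `.pdf` file, assuming the service returns the document in the response body.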

------
latortuga
We used an html->pdf conversion service (I believe it was
<http://www.htm2pdf.co.uk/> but I'm not positive) for a while to do billing and
our biggest problem was that it went down _all the time_. We ended up
purchasing a (pretty cheap) license to a Java library that does pdf generation
for us and is pretty easy to use. This is definitely a service that people
will pay money to use - best of luck to you!

------
ivan_ah
NICE! You have beaten me (and I am sure a dozen other hackers) to the
realization of this idea...

Here is an idea for an extra feature: make a print bookmarklet -- clicking on
it you get a nice PDF version of the page you are viewing right now. I can't
stand firefox's print renditions of some pages... terrible...

(also you might want to set the page size to letter or A4 depending on the
geolocation of your visitor's ip address)
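
A minimal sketch of that page-size default, assuming the IP address has already been resolved to a country code by some geolocation lookup (not shown). The set of Letter countries is illustrative, not exhaustive.

```python
# Countries assumed here to default to US Letter; the list is illustrative,
# not exhaustive. Everything else falls back to A4.
LETTER_COUNTRIES = {"US", "CA"}


def default_page_size(country_code):
    """Pick a default paper size from an ISO 3166-1 alpha-2 country code."""
    if country_code and country_code.upper() in LETTER_COUNTRIES:
        return "Letter"
    return "A4"


print(default_page_size("us"), default_page_size("DE"))
```

Falling back to A4 when the country is unknown keeps the guess harmless, and the user should still be able to override it.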

------
watmough
Excellent stuff.

I notice there are some questions about how to make money. One may be to
position yourself as a way to get PDF reports generated from phone apps, in
which case you may want to do per app licensing and provide facilities for
email delivery of PDFs.

I could see this being useful for porting apps from iPhone (can easily generate
PDFs) to Android (which does not appear to support PDF output).

------
sjs382
A lot of html to pdf conversion is useless if page-break-* properties are not
followed. Shame, too. I've been building something like this all week.

~~~
jgresula
I don't know the exact status of how WebKit handles these properties. I know
that at least "page-break-after: always" works since that is what I use when
the user clicks the 'Insert Page Break' button in the editor
(<http://pdfcrowd.com/editor/>).
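
For anyone generating the HTML programmatically, a forced break of that kind can be sketched as below. The empty `div` wrapper and the helper name are assumptions for this example; the editor's actual markup may differ.

```python
# An empty block carrying the CSS property mentioned above.
PAGE_BREAK = '<div style="page-break-after: always"></div>'


def join_with_page_breaks(sections):
    """Concatenate HTML fragments, forcing a new page between each pair."""
    return PAGE_BREAK.join(sections)


html = join_with_page_breaks(["<h1>Invoice</h1>", "<h1>Terms</h1>"])
print(html)
```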

------
ricmo
have you seen this? <http://code.google.com/p/wkhtmltopdf/>

~~~
jgresula
Yes, I have seen it but have not tried it yet.

~~~
gridspy
Woah. A 3rd party does your entire value add and yet you hack up your own?

~~~
jgresula
I did not know about this project at the time I started with pdfcrowd. But
anyway, I just took my existing pdf library and integrated it with WebKit,
which was not as hard as one might think.

------
qeorge
This is awesome. I'm at once excited about using this in the future, and
dismayed thinking of the time I've spent manually generating PDFs because none
of the HTML -> PDF options worked.

I fed it my homepage, and it nailed it. I'm impressed.

------
deutronium
The pdf conversion is awesome! I just tried printing <http://times.com/> to a
pdf in firefox and it ended up putting the main content of the site on page 2,
whereas yours seemed to render it perfectly.

------
juliancox
Looks good. I'm keen to use (and pay for) a service like this - if it's
reliable and quick. With a ruby gem it's particularly attractive, as all other
rails to pdf solutions are incomplete, require a pdf-specific dsl or are very
expensive.

------
pstinnett
Haven't tested this, but great idea. I've used a couple of the PDF creation
tools and it seems so tedious to build out even a simple table view on a PDF.
Good luck with this!

------
carbocation
This is great! The only downside that I saw after converting one of my pages
is that the colors dulled substantially.

~~~
jgresula
That's a known problem on my todo list. The colors are dulled only in Acrobat
but other PDF readers render the colors correctly. Please, could you post the
link to that page if possible? Thanks.

~~~
gridspy
<http://your.gridspy.co.nz/powertech> dulled substantially.

Also, you don't support the CSS3 styling of the header text.

The fonts look super aliased.

Finally, you don't snap the rendered HTML to the nearest page, leading to a
page containing only the footer.

------
oskee80
Worked great for me, good job. I'd be interested in a PHP binding too, and
knowing what the eventual cost will be.

------
washingtondc
I like it, but my site didn't come out correctly (www.convertyourcds.com).
Perhaps my html is screwed up?

~~~
jgresula
Sorry, don't know why. Your site does not validate but it could be a problem on
my side as well.

------
va_coder
I tried a relatively complex site - CNN - expecting the results to look bad,
but it looks great

------
mleonhard
How much will it cost?

~~~
jgresula
My current plan is to charge for conversion tokens, but I haven't decided how
much yet.

~~~
juliancox
Check out: <http://www.htm2pdf.co.uk/htm2pdf-web-service.aspx> Their pricing
indicates 40,000 conversions for $90. I'd pay that.

~~~
Tomazaz
Similar, also supports pdf by e-mail: <http://www.web2pdfconvert.com>

------
jrockway
Uh, "a2ps file.html"? Doesn't even need an API key...

------
asnyder
Why no PHP API binding?

~~~
jbm
I've gone ahead and built one.

Try it:

<http://www.tokyomuslim.com/2010/04/php-class-to-run-pdfcrowd-com/>

I don't blame anyone for not wanting to use PHP's poorly documented CURL
classes.

~~~
asnyder
Thanks for getting this done. But come on, CURL is documented pretty well.
There are even examples. What's there to know? Init a connection, set the
flags, pass in whatever you like, submit, and check response. Pretty
straightforward to me.

~~~
jbm
The CURL stuff seems so oddly unlike the rest of the PHP commands; it's
more-or-less a direct port of the C library, names & all included.

The place where it reallllly irritates is the cookie management, but
thankfully I didn't have to deal with that in this case. (I did for a client
at a newspaper - nightmarish.)

------
alilja
Why do I need this?

