Show HN: Simple PDF to PNG Server (cloudbrowser.xyz)
78 points by slowenough on Nov 15, 2019 | 56 comments



Actually it just uses ImageMagick and soffice under the hood, so it can do way more than PDF. I've tested it with DOC, DOCX, XLS, and they all seem to work.

Edit: in case anyone wants an example, here's the conversion of tesla-model-s.pdf[0]:

https://secureview.cloudbrowser.xyz/uploads/filekat9.v05lnle...

[0]: https://www.tesla.com/sites/default/files/tesla-model-s.pdf


Sorry, I'm afraid I just cleaned the upload directory. HN, you have converted 1.6G of files.

Here's that Tesla example link again:

https://secureview.cloudbrowser.xyz/uploads/file5rc5.9nhts2j...


Forgive me, I've cleaned the uploads directory again. Over 3.5G of files converted.

Here's that Tesla link (I'm making it permanent now):

https://secureview.cloudbrowser.xyz/uploads/filecp9z.ohbtt6u...


Sounds like you need to auto-clean your uploads directory
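For what it's worth, an auto-clean pass could look roughly like the sketch below. It is only an illustration, not what the service actually runs: the function name `clean_uploads`, the one-hour threshold in the usage note, and the idea of calling it from cron or a timer are all assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional


def clean_uploads(upload_dir: str, max_age_seconds: float,
                  now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_seconds; return their names."""
    # `now` is injectable so the age check can be exercised deterministically.
    now = time.time() if now is None else now
    deleted = []
    for entry in Path(upload_dir).iterdir():
        # Only top-level regular files are touched; subdirectories are skipped.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            deleted.append(entry.name)
    return sorted(deleted)
```

Run periodically, e.g. `clean_uploads("uploads", 3600)`, this would drop conversions older than an hour. A permanent example file would have to live outside the cleaned directory.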


Nice. What is your tech stack to serve it?


Just a DigitalOcean droplet running FreeBSD, then PM2 running Node.js and ImageMagick built with multicore support.


One wonders if you can WASM-ify ImageMagick and then use a JS FileReader and a canvas or other PNG output approach to do this all client side. Hrmm, I might tackle this myself if I get the time...


Looks like someone has gotten pretty far WASM-ifying ImageMagick already [1]!

[1]: https://github.com/KnicKnic/WASM-ImageMagick


ImageMagick with "multicore" support won't speed things up once the number of concurrent requests reaches the number of cores you have. And in DigitalOcean's case you have only one core.


Actually I created a Droplet with 6 cores


But remember that multi-threading was shown to improve performance in the era of single-core computers.


Sure, but if the number of concurrent requests being served at least sometimes dips below the number of cores, it is a big win.

(This comes at a possible cost in top-end scale: it significantly improves response time under a normal workload and somewhat lowers the peak workload you can reach.)
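The trade-off in this subthread can be put into a small arithmetic sketch: split the cores evenly between the conversions currently in flight, so a lone request gets every core and a full house degrades to one thread each. The helper name `threads_per_job` is made up for illustration. Feeding the result to ImageMagick through its `MAGICK_THREAD_LIMIT` environment variable is one possible way to apply it, not something the service is known to do.

```python
def threads_per_job(cores: int, active_jobs: int) -> int:
    """Evenly divide cores between active jobs, never going below one thread."""
    if cores < 1 or active_jobs < 1:
        raise ValueError("cores and active_jobs must both be at least 1")
    # Integer division: leftover cores are simply left idle in this sketch.
    return max(1, cores // active_jobs)


# On the 6-core droplet mentioned above:
for jobs in (1, 2, 6, 10):
    print(jobs, "job(s) ->", threads_per_job(6, jobs), "thread(s) each")
```

With one job the conversion gets all six threads. At six or more concurrent jobs multithreading no longer buys anything, which matches the objection above.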


Github? I would love to contribute.


Yep, I just put it up now

https://github.com/crislin2046/p2.


Note that HN (and probably many others) won't correctly autolink URLs with a period at the end. It's also somewhat confusing. You may want to consider removing it.


Oh! Sorry. Let me change that

http://tiny.cc/p2github

edit: I like the dot ¯\_(ツ)_/¯


Package it as a docker image and you can then run it on cloud.run (Might be expensive though).

Disclaimer: I work for Google Cloud.


Hey, this is actually atop FreeBSD so it is not as simple as just making a Dockerfile.

It shouldn't be too much work tho, perhaps just making an alternative install_deps.sh script that uses apt, and writing a Dockerfile.

You're welcome to submit a PR but I might get to it first.

Edit: I just made some changes relating to this, including a Dockerfile. I have not tested in Docker, but it now works on *nix (Debian).


what's your experience running soffice server-side? I might have to do this soon, and I'm dreading it.


My experience is relatively small.

ImageMagick's `convert` command calls soffice by default if the input file is a LibreOffice type.

I actually started writing a shell script to detect if it was DOCX etc, and run soffice first, but then I found out that ImageMagick does that automatically, if libreoffice is installed.
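For readers curious what the explicit "detect the type and run soffice first" approach might look like, here is a minimal sketch that only builds the command lines and does not run them. According to the comment above it is redundant when LibreOffice is installed, since ImageMagick does this itself. The names `build_commands` and `OFFICE_EXTENSIONS` are hypothetical, the extension list is just the formats mentioned in this thread, and the default of 100 dpi is borrowed from elsewhere in the discussion.

```python
from pathlib import Path
from typing import List

# Formats the thread reports as working; extend as needed (assumption).
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls"}


def build_commands(input_path: str, out_dir: str, dpi: int = 100) -> List[List[str]]:
    """Return the command lines needed to turn input_path into PNG output."""
    src = Path(input_path)
    out = Path(out_dir)
    commands = []
    if src.suffix.lower() in OFFICE_EXTENSIONS:
        # Step 1: headless LibreOffice converts the office file to PDF.
        commands.append(["soffice", "--headless", "--convert-to", "pdf",
                         "--outdir", str(out), str(src)])
        src = out / (src.stem + ".pdf")
    # Step 2: ImageMagick rasterises the PDF at the requested density.
    # Multi-page input yields numbered files (name-0.png, name-1.png, ...).
    commands.append(["convert", "-density", str(dpi), str(src),
                     str(out / (src.stem + ".png"))])
    return commands
```

A PDF input produces only the `convert` step, while a DOCX input produces both steps in order.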


hmm, OK, I didn't know that. Thanks that's useful :)

It'll be interesting to see how your demo site holds up under HN front page traffic - I know ImageMagick is used a lot for random web sites, but I worry about scale a lot. Any info would be useful!


no probs. if you want to email me I can let you know once it all quiets down.


sent :)


We’ve been running it as part of Alfresco. It’s not very stable and hangs every few weeks. Since we kill it every night we didn’t notice any problems.


good tip, thanks


It's actually not too bad. You can run a pool of soffice processes, but unless you really care about the startup time, I'd suggest running individual processes per job, which you can run headless from the command line. For the most part libreoffice does well, although at some point you'll start to discover the quirks in their rendering...
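A process-per-job setup also makes it easy to guard against the occasional hang mentioned above: give each job a deadline and let the child be killed when it runs over. This is a rough sketch under stated assumptions. `run_job` is a made-up helper, and the command you pass in (for instance a headless soffice invocation) is up to you.

```python
import subprocess
from typing import Sequence


def run_job(command: Sequence[str], timeout_seconds: float) -> bool:
    """Run one conversion as its own process; True only if it exits with 0."""
    try:
        # subprocess.run kills the child itself if the timeout expires.
        result = subprocess.run(command,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout_seconds)
    except (subprocess.TimeoutExpired, OSError):
        # Hung job, or the executable could not be started at all.
        return False
    return result.returncode == 0
```

Because nothing is shared between jobs, a stuck conversion costs one timeout rather than a wedged pool. The price is paying soffice's startup time on every request.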


Does it support SVG?


as an output format?


I wondered about input format.


Hey, yeah I think it doesn't fully support SVG on the input end, but it does support it on the output end (if you change the $format variable in the convert script).

Input looks a bit weird, for example,

https://secureview.cloudbrowser.xyz/uploads/file1tda.4paa65d...

Which is from:

https://dev.w3.org/SVG/tools/svgweb/samples/svg-files/AJ_Dig...


Before I upload and download a file from a random Show HN with URLs like "secretpage-canneverbefound" and "very-secure-manifest-convert", I'd like to see some source code.

Typically with little projects like this it is customary to discuss how you built the application.


It looks like the URLs named like that are trying to hack you and steal your secrets, I get that.

The reason they're named that way is an artifact of how this was meant to be a private service to let a remote cloud browser view PDF (etc) files securely, without forcing the client to download them.

So initially I didn't intend to make this public as its own service because the service was supposed to have only 1 customer (this other service).

It wasn't easy at first, as ImageMagick took a long time to convert PDFs at 300 dpi, so I rebuilt it with multicore support, and still it took a long time. I eventually discovered a sweet spot at 100 dpi.

But aside from that I tried to keep it very simple: it's just some Node.js and some shell scripts running on FreeBSD on DigitalOcean.
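Some back-of-the-envelope arithmetic shows why dropping from 300 dpi to 100 dpi helps so much. The sketch assumes a US Letter page (8.5 x 11 inches) and that rasterisation cost scales roughly with pixel count. Both are illustrative assumptions, not measurements from this service.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in one rasterised page at the given density."""
    return round(width_in * dpi) * round(height_in * dpi)


hi_res = page_pixels(8.5, 11, 300)   # 2550 x 3300 pixels
lo_res = page_pixels(8.5, 11, 100)   # 850 x 1100 pixels
print(hi_res, lo_res, hi_res / lo_res)
```

Tripling the density means nine times as many pixels per page, so 100 dpi does roughly a ninth of the pixel work of 300 dpi under these assumptions.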


With superb open source, cross-platform libraries like pdfium available to the public, why wouldn't I just do this on the client?


OK, that's a good point. I think the only reason was I looked at PDF.js and felt the quality was not so good, and then I tried ImageMagick and found it gave good quality, so I just went with that.

Another reason is I made this to convert the PDF (or Word doc, XLS, etc) remotely without the file ever touching the user's device. So they couldn't be exposed to any exploits contained in the file or targeting the PDF engine (PDFium recently caused a Chrome 0-day I think).


Wow, wasn't expecting such a well constructed counter argument.


Thank you. The internet, eh? You never know what you're gonna get.


"to convert the PDF (or Word doc, XLS, etc) remotely without the file ever touching the user's device"

This made me think that the site accepts a URL to a PDF/XLS/DOC file and returns PNG.


It sounds like that makes you think the site accepts a link to a PDF/XLS/DOC and returns a PNG.

Actually, that is basically what happens in the original context for this service.

If you take a look at

https://free.cloudbrowser.xyz

open a remote browser, then search for a PDF/XLS/DOC etc and then click on it, you'll see a dialog which says the file is transferred.

What's happening is the remote browser downloads the file to a temp directory in the cloud, then some software does a POST request with that file to the service (Simple PDF to PNG Server) and gets the viewing link back immediately, before the conversion has completed.

It then sends the viewing link back to the remote browser which opens a tab. The client can then view the document as images securely without the file ever touching their device.

Originally, I didn't intend to release this viewer as a separate thing/product/service.

If you use the viewer (without the remote browser) you do have to submit the file. I didn't build it to fetch files from a URL. I made it as a "secure document viewer" for the remote browser product.
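The "get the viewing link back immediately, before the conversion has completed" flow described above can be sketched as a small job registry. This is not the actual Node.js implementation. The class `ConversionService`, its method names, and the URL in the usage note are hypothetical, and the real conversion is stood in for by whatever callable you pass as `convert`.

```python
import secrets
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class ConversionService:
    """Hands out a viewing link at once and converts in the background."""

    def __init__(self, convert: Callable[[str], str], base_url: str,
                 workers: int = 2) -> None:
        self._convert = convert
        self._base_url = base_url.rstrip("/")
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, file_path: str) -> str:
        # The link is derived from a random id, so it exists before any output.
        job_id = secrets.token_hex(8)
        self._jobs[job_id] = self._pool.submit(self._convert, file_path)
        return f"{self._base_url}/{job_id}"

    def is_ready(self, link: str) -> bool:
        return self._jobs[link.rsplit("/", 1)[-1]].done()

    def result(self, link: str, timeout: Optional[float] = None) -> str:
        # Blocks until the background conversion finishes (or times out).
        return self._jobs[link.rsplit("/", 1)[-1]].result(timeout=timeout)
```

A caller would do something like `link = service.submit("doc.pdf")`, open the link right away, and have the viewer poll `is_ready(link)` until the images exist.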


That's a great point.

I typically don't like using remote resources for stuff that I can do locally. But for iffy pdfs, it's a good tradeoff.

Edit: Nice to see the PDF capability added :)


Thanks, I remember you commented on my first share of the remote browser thing :)


I used PDF.js in a side project for comparing pdfs and it worked quite well. It runs completely client side, so I can build the site with a static site generator and html: https://www.parepdf.com/

It can start to bog down with large image heavy pdfs. But overall it runs performantly.


Does this use ghostscript? Was looking into doing this a while back but wasn't sure if I'd have to get a license from Artifex to use it in a SaaS app.


The `gs` I have installed says it is licensed with this[0][1] so I think it's fine.

[0]: https://www.gnu.org/licenses/agpl-3.0.en.html

[1]: https://tldrlegal.com/license/gnu-affero-general-public-lice...


Yeah, let's just upload our files to a page that's called "secretpage-canneverbefound.html", and with no info other than the 5 word title on HN. -_-


I mean you don't need to upload a PDF list of all your darkest secrets.


No, but some kind of notice or message to mention this is happening serverside, and that you are actually copying the file to the server, would be the correct thing to do imho.


> [...] mention this is happening serverside

Not being snarky, but the submission title does say "Server".


I know, and you, and probably all tech-savvy people, realise that clicking the browse file button probably means submitting a file to a server. But not everybody does, and those people do need an explicit warning in my opinion.


I'm glad I read the comments, I was about to do exactly that!


Do this with plot files and DXFs and you can sell it.


If you'd like to mail me at cris@dosycorp we could discuss what you need.


It seems like there are free sites that do that already.


Have you considered making this open source and distributing it as a container image that people can self-build/deploy/host on a serverless platform, like Google Cloud Run?

For example, this DOC to PDF converter: https://github.com/as-a-service/pdf


Thanks, this is actually on FreeBSD, so it's not so easy to make it for Docker, but the repo now has a Dockerfile for *nix systems. I haven't tested it tho.



I like the minimalism.



