

Generating Thousands of PDFs on EC2 with Ruby - ghotli
http://railsdog.com/blog/2009/12/generating-pdfs-on-ec2-with-ruby/

======
bensummers
It would have taken 30 hours to generate the files without using EC2. I wonder
how long it took to set up everything to do them in an hour, and whether it
was worth it. It seems quite a complex set of tools for a simple task.

Perhaps just splitting the 3600 files into four sets and starting each set by
hand on manually created instances might have been quicker overall?
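
For a sense of what the manual split would look like, here is a minimal
sketch. The job IDs are hypothetical stand-ins for the 3600 files, and
`split_into_batches` is an illustrative helper, not anything from the article.
If the work were spread evenly, the 30-hour serial estimate would come to
roughly 7.5 hours per instance with four instances.

```python
def split_into_batches(items, n_batches):
    """Split items into n_batches contiguous lists whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches each take one leftover item.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


# Hypothetical job IDs standing in for the 3600 PDFs.
job_ids = list(range(3600))
batches = split_into_batches(job_ids, 4)

for i, batch in enumerate(batches):
    print(f"instance {i}: {len(batch)} files, first={batch[0]}, last={batch[-1]}")
```

Each batch could then be handed to one instance as a plain list of files to
render. Unlike a queue, this gives no automatic retry if an instance dies
partway through.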

~~~
seancribbs
The client was slow in delivering the data and we needed it within 48 hours,
with the possibility of generating the output multiple times to discover any
errors or other issues.

------
zacharypinter
I'm really looking forward to the Chef Platform mentioned in this article. It
seems like Chef is a set of great core tools, but putting them all together
and getting an environment going is still more work than it should be.

The two times I've used Chef (first chef-solo, then setting up a chef-server)
ended up being far more work than just setting up a server manually (I haven't
had to scale out tons of servers yet, which I assume makes the current Chef
setup time worthwhile).

I'd like to see Chef evolve into something more like Heroku, where I set up a
few config files and then run a simple command line script.

The benefit of Chef Platform is that it would (hopefully) be a simple monthly
fee (say, $10/month) instead of a fee on top of every server resource I use.
Also, I like the idea of creating my own recipes (instead of relying on Heroku
to have an addon, which is problematic when they lack the addon for something
like MongoDB).

~~~
hallmark
RightScale is a service that manages cloud deployments - they started with EC2
and have expanded much beyond that. They started using Chef recently.

<http://blog.rightscale.com/2009/09/16/rackspace-rightlink-chef-machine-tags-vpc/>

Although they are quite expensive for a small shop, if you're interested in
deploying within Amazon's cloud and have some money to throw at the problem,
RightScale is definitely worth a look. You can get pretty far with just their
free developer accounts - as I've done - but the more powerful Chef
functionality comes online once you become a paid customer.

------
brown9-2
The NYTimes did something similar with EC2 and Hadoop to generate a PDF for
each article published between 1851 and 1980 from 4TB of TIFF files:
<http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/>

~~~
bbgm
Aron Pilhofer at DocumentCloud also comes from the NY Times, and their
CloudCrowd project is also great for this purpose:

<http://wiki.github.com/documentcloud/cloud-crowd>

------
lsb
They've been working on a website for 2 months, but they can't wait 1 day for
PDFs? That sounds silly.

~~~
seancribbs
This is the reality of many projects. We got the full dataset around 48 hours
before the output was needed. See my comment above too.

------
MartinMond
I'm missing the one thing that's actually interesting (read: hard). How did
they collect the generated files? Did they take any precautions in case a
worker died while creating a PDF? (Does AMQP have some sort of transactional
semantics? If so, how would one apply them?)

~~~
almost
Amazon's own SQS service has a simple solution to this problem. When you get
an item from the queue, it is temporarily hidden; once you've processed it
successfully, you explicitly delete the item from the queue. If the item is
not deleted within the set timeout period, it goes back into the queue, ready
for another worker to have a go at it. I assume RabbitMQ has something similar.
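
The hide-then-delete behavior described above can be sketched with a toy
in-memory queue. This is an illustration only, not the SQS API: the
`VisibilityQueue` class, its method names, and the 30-second timeout are all
assumptions made for the example, and time is passed in explicitly so the
behavior is easy to follow. (Real SQS deletes by a receipt handle rather than
a message id; the toy uses an id for brevity.)

```python
import itertools


class VisibilityQueue:
    """Toy in-memory queue with a visibility timeout; not the SQS API."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}      # message id -> body
        self._hidden_until = {}  # message id -> time it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._hidden_until[msg_id] = 0.0  # visible immediately
        return msg_id

    def receive(self, now):
        """Return (id, body) of the first visible message and hide it, or None."""
        for msg_id, body in self._messages.items():
            if self._hidden_until[msg_id] <= now:
                self._hidden_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Remove a message for good once it has been processed successfully."""
        self._messages.pop(msg_id, None)
        self._hidden_until.pop(msg_id, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("render report-0001.pdf")  # hypothetical job description

first = queue.receive(now=0)    # worker A takes the job, then dies without deleting
hidden = queue.receive(now=10)  # still hidden, so worker B gets nothing
retry = queue.receive(now=31)   # timeout elapsed, so worker B gets the same job
queue.delete(retry[0])          # worker B finished; the job is gone for good
done = queue.receive(now=100)   # nothing left to process
```

One consequence of this design is that a job can be processed more than once
if a worker is merely slow rather than dead, so the work itself should be safe
to repeat (regenerating a PDF and overwriting the previous output is).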

Actually, I'm not sure why they used RabbitMQ rather than SQS for this task.
SQS seems to me to be a perfect fit, and there's no extra effort involved to
set it up.

------
jmtame
EC2 isn't bad. Under Rails, I use Rio to store a lot of remotely fetched
images (it's one line to fetch a remote image in Rio) and HTTParty to handle
some JSONP, and EC2 bandwidth is free when using Heroku. Nice setup; the only
disadvantage is that Heroku's file systems are read-only, so you have to write
temporarily to /tmp, and then again to EC2.

