

How to use Amazon’s S3 web service for Scaling Image Hosting - jasonlgrimes
http://blog.teachstreet.com/homepage/how-to-use-amazon-s3-scaling-image-hosting/

======
d2viant
"We could push our images to Amazon, and never have to worry about backing
them up, or keeping extra copies in case of hardware failure"

Perhaps a clarification, but Amazon does not guarantee your data as part of
S3. They will do their best job (and have a great track record), but just like
any other service -- failures will occur. Ultimately their SLA may provide for
reimbursement of downtime, but that won't get your data back.

That being said, S3 is still (probably) orders of magnitude better than anything
you could homegrow.

~~~
johnswamps
Related: <http://news.ycombinator.com/item?id=1334187>. Some guy was storing
his data on Amazon EBS and it was all lost (not sure how EBS compares to S3 in
terms of probability of data loss).

~~~
timf
Werner Vogels says S3 is designed "for 99.999999999% durability." That is
_eleven nines_.

(From
[http://www.allthingsdistributed.com/2010/05/amazon_s3_reduce...](http://www.allthingsdistributed.com/2010/05/amazon_s3_reduced_redundancy_storage.html)
)

EBS volumes have a theoretical "annual failure rate (AFR) of between 0.1% –
0.5%, where failure refers to a complete loss of the volume."

(From <http://aws.amazon.com/ebs/> )

Considering how many EBS volumes there are, that failure rate is high enough
that we should expect to see plenty of cases. It happens, and the case at that
link is not the only one.

But that person didn't take the advice: he didn't snapshot his EBS volumes to
S3. It says so right there in the description (as well as in the user's guide):
"The durability of your volume depends both on the size of your volume and the
percentage of the data that has changed since your last snapshot."

------
zepolen
I've been doing the stuff in this article for almost a year now, but instead of
x-sendfile, I use nginx's proxy_cache, which does the 'get from S3, save to
disk, serve subsequent requests from disk' dance for you.

That way deploying more photo servers for speed just requires nginx and a few
lines of config.
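
Roughly like this (bucket name, paths and sizes are made up; tune to taste, and
proxy_cache_path goes in the http block):

    # in the http {} block
    proxy_cache_path /var/cache/nginx/images levels=1:2
                     keys_zone=s3images:10m max_size=50g inactive=30d;

    server {
        listen 80;
        server_name photos.example.com;

        location / {
            proxy_pass        http://my-bucket.s3.amazonaws.com;
            proxy_cache       s3images;
            proxy_cache_valid 200 30d;   # cache hits for 30 days
            proxy_cache_key   $uri;
        }
    }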

A couple notes:

1\. Make sure you partition your data on your local disk properly so that there
are never more than ~5000 files per directory; you can use nginx rewrite rules
to do this for you (and have nginx map the actual URL to nested folders on
disk).

2\. If possible, make thumbnails smaller than 4kb.

~~~
swindsor
1\. Good point - I didn't bring this up in the article. When we store the
images locally, we divide the end of the image's unique ID into two levels of
subdirectories. This gives us a "random enough" distribution across
subdirectories, and an easy way to look up the files on disk (a rough sketch of
the idea is below). This logic lives in our rails app, but we didn't want to
expose this pathing to end users. That way we can change it later if we need to
create further subdirectories.

2\. Good tip - we try to use our judgement here balancing good design/good
performance. We tend to opt for a pretty website first, then go back and make
it faster with optimization.
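
To illustrate the idea from point 1 (a made-up sketch, not our exact scheme or
code):

    # e.g. image ID 123456 -> "56/34/123456.jpg" on disk
    def local_path(image_id)
      id = image_id.to_s.rjust(4, '0')
      File.join(id[-2, 2], id[-4, 2], "#{image_id}.jpg")
    end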

~~~
zepolen
Regarding "This logic exists in our rails app, but we didn't want to expose
this pathing out to end users". I'm not sure if you understood, but I'm saying
exactly this. nginx will take a /whatever request and convert it to
/w/wh/wha/whatever on disk transparently - you can (and I have needed to)
change this whenever you like with no front facing changes.

~~~
swindsor
Oh, that's cool - I didn't know you could use matches from a location in an
alias. Is that a new feature in nginx?

~~~
zepolen
I think it's always been a feature of nginx:

    
    
        # e.g. takes /whatever and serves /w/wh/wha/whatever from disk
        location ~ "^/((.)(.)(.).*)$" {
            alias /var/images/$2/$2$3/$2$3$4/$1;
        }

You could even do:

    
    
            set $subdir /$2/$2$3/$2$3$4;
    

so that it can be used in another location handler.

------
jasonkester
You're not done.

You've done 99% of the work to get your images serving quickly to your users,
but then you stopped. Why??? Cloudfront takes exactly four minutes to set up,
and it's a full-blown CDN.

Nobody, not even Amazon, recommends serving content directly from S3 anymore.
Either this article is 2 years out of date or the author doesn't understand
his subject as well as he thinks.

~~~
brlewis
For whatever reason they want to serve arbitrary sizes on the fly. They do
mention cloudfront in their future optimizations section, but complain about
the price of CDNs in general. Cloudfront looks awfully cheap unless I'm
missing something.

~~~
ant5
_For whatever reason they want to serve arbitrary sizes on the fly._

The primary advantage here is that you don't have to coordinate generation of
new images just because someone, somewhere requires a specific size in some
client of your image system.

Of course, just stick a CDN in front of your on-demand scaling implementation,
potentially backed by S3, and you're done.

~~~
swindsor
We wanted to be able to serve arbitrary sizes so we can handle different
thumbnail demands and new use cases. Having to resize all of your images
because you've redesigned your homepage is no fun (especially if you have a
large set of images).

Cloudfront is really enticing, though. If we really had the need for a cheap
CDN, we'd probably migrate to Cloudfront, then write a task to batch-resize our
S3 images and stick the results back into S3. This could probably be done as a
one-off task on EC2 by spinning up a few instances, or even with Hadoop.

If anyone's taken this approach, I'd love to see it!
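
Just to sketch the per-image work I have in mind (the aws-s3 and RMagick gems;
bucket name and key layout are guesses, error handling omitted):

    require 'aws/s3'
    require 'RMagick'

    AWS::S3::Base.establish_connection!(
      :access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
      :secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
    )

    BUCKET = 'my-images-bucket'

    # walk the originals (first 1000 keys only; a real task would paginate)
    AWS::S3::Bucket.objects(BUCKET, :prefix => 'originals/').each do |obj|
      img   = Magick::Image.from_blob(obj.value).first
      thumb = img.resize_to_fit(200, 200)   # fit within the new 200x200 size
      AWS::S3::S3Object.store(obj.key.sub('originals/', 'thumbs/200/'),
                              thumb.to_blob, BUCKET,
                              :content_type => 'image/jpeg')
    end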

------
blasdel
I use <http://github.com/markevans/dragonfly> for this purpose, using
rack/cache to hold on to the generated images (also with no expiration
necessary).

Right now I'm working on doing streaming intermediate resizes that never load
the full decompressed image into memory, because I need to take uploads of
50-megapixel images and respond with new versions based on user input in a
timely manner. RMagick may be an extremely shitty implementation, but even a
more solid one is stymied by the algorithm.

The solution is to just do something else. For JPEGs, libjpeg can do a cheap
streaming resize to 1/2, 1/4, or 1/8 by partially sampling the 8x8 DCT blocks,
and for PNGs, MediaWiki's image server includes a utility for doing streaming
resizes, though it's finicky about the exact scale factors it'll allow:
<http://svn.wikimedia.org/viewvc/mediawiki/trunk/pngds/>
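
With the stock libjpeg command-line tools, the JPEG half looks roughly like
this (filenames and quality are just examples):

    # decode at 1/4 scale in the DCT domain, then re-encode
    djpeg -scale 1/4 huge_original.jpg | cjpeg -quality 85 -outfile thumb.jpg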

~~~
elektronaut
Thanks for the link; I hadn't seen that plugin before.

We use a plugin I wrote (<http://github.com/elektronaut/dynamic_image/>),
which seems to be based on similar ideas and has a similar syntax. For caching
we just do regular page caching, with no expiration. Since the request doesn't
have to hit the backend at all (except for the first view), this is pretty damn
fast.
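
In plain Rails terms, the caching half is just page caching on the controller
that serves the images. Something like this simplified sketch (Image#resized is
a made-up method, not dynamic_image's real API):

    class ImagesController < ApplicationController
      caches_page :show  # first hit writes the file under public/, later hits never reach Rails

      def show
        image = Image.find(params[:id])
        # made-up resize call; stream the result back
        send_data image.resized(params[:size]),
                  :type => 'image/jpeg', :disposition => 'inline'
      end
    end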

------
minalecs
Nice write-up. Just today I asked for a service that would do this:
<http://news.ycombinator.com/item?id=1514991> (sorry for the thread hijack).
Basically I didn't want to do image processing on the same instance the app is
running on, because of the constraints of cloud instances.

Based on the steps listed here (1. handle request, 2. fetch original source
image from S3, 3. resize/apply effects, 4. return result to the user): why not
just resize/apply at upload time and store the results in S3? Are you resizing
and applying effects on every request for an image?

~~~
jrnkntl
From the article:

"For performance, each of our image servers cache the source, and any resize,
locally to disk. Since images are never updated (only created), and get a
unique ID for each one, we don’t have to worry about cache invalidation, only
expiration. We can then write a simple script to remove images from this disk
cache with files of an access time greater than a certain threshold (say 30
days). That way, if we change from one size thumbnail to another, eventually
the old thumbnail sizes will get purged."
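
That cleanup script can be as simple as a cron'd find, e.g. (cache path made
up; assumes atime is tracked on that volume):

    find /var/cache/images -type f -atime +30 -delete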

------
cracell
I've been wondering for a bit whether there's enough potential profit to build
an image upload processing service. Image uploading and manipulation is one of
the most common pain points in Rails applications, and I assume in a lot of web
applications in general.

Something nice and pain-free: embed the uploader, set some options, and the
third party processes the images, uploads them to your S3 bucket, and pings
you. With the ability to reprocess old images to a new size and to integrate
with attachment_fu or paperclip (or ship its own similar plugin).

~~~
bittersweet
Just found <http://www.uploadjuicer.com/> that does this as well.

edit: Sorry, just found out they only do resizing/rescaling; you still have to
take care of the initial upload yourself.

~~~
stympy
I have just published a gem and a sample Rails app that show how you can
directly upload to S3, then call the API at uploadjuicer.com:

<http://github.com/uploadjuicer/>

