

Abusing Amazon Images - chaosmachine
http://aaugh.com/imageabuse.html

======
stephenjudkins
This approach to generating images has a lot of benefits when you're
working at scale. Before, when we wanted an image that had been thumbnailed,
cropped, or modified in a certain way, we would generate the image (or check
if it existed) at page render time, and pass on the URL. This meant that page
rendering blocked on either checking whether the image existed or generating
a new image. Further, when our flaky file store started freaking out, it
meant the entire site went down.

By encoding all the relevant information in the URL, like Amazon does, we
offload such work asynchronously to different web requests. If there's info in
this URL we want hidden from users (spam-protected email addresses), we
encrypt it using Base64-encoded AES. If the file store were to go down--which
hasn't happened since we switched to using S3--users would only see broken
images instead of the entire site crashing. Further, we store the images on S3
using the exact same URL we fetch them from, and have our static web server check
there first. That means if the image is cached it never hits our application
stack.
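
A minimal sketch of that URL-building step, assuming Python and the
cryptography package's Fernet recipe (AES with URL-safe Base64 output); the
host, filename layout, and key handling below are made up:

    # Sketch only: parameter layout, host, and key handling are hypothetical.
    from cryptography.fernet import Fernet

    SECRET = Fernet.generate_key()   # in practice a fixed key shared with the image service
    fernet = Fernet(SECRET)

    def image_url(image_id, width, height, email):
        # Anything we want hidden (e.g. an email address) gets encrypted; Fernet
        # output is already URL-safe Base64, so it can sit in a path segment.
        token = fernet.encrypt(email.encode()).decode()
        return "http://img.example.com/%s_%dx%d_%s.jpg" % (image_id, width, height, token)

    # The image service stores the result on S3 under this exact path, so cached
    # images never touch the application stack.
    print(image_url("prod123", 200, 300, "user@example.com"))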

Now, when we generate a page with one of these dynamic images, our app stack
only has to generate a URL instead of relying on the vagaries of our file
store and image processing libraries. It's made our site more reliable, easier
to manage, and faster. If we suddenly start serving up many more images, we
could easily replace our image generation service with something higher-
performance or more reliable, since it's just a web service. Right now, it
works great.

Clearly Amazon has seen the benefits of the same approach.

~~~
jacquesm
I use a really, really dirty hack for this exact problem: I handle the
rescaling in the 404 handler.

It's the laziest possible approach and it has one huge hidden benefit, which
is that if a bot hits your pages you're not going to end up pre-generating a
bunch of scaled images that nobody is ever going to use.

A simple list of allowed sizes in the handler makes sure that someone cannot
use URL injection to generate a large series of nonsense images for sizes
that we'd normally not use.

Typically a URL looks like <http://host/resourceid_widthxheight.jpg>, where
width and height are replaced by the desired size.

I'll be the first to admit this is a nasty bit of code but I haven't found
anything else that comes close in terms of efficiency and flexibility.
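
A rough sketch of a handler along those lines, assuming Python and Pillow; the
paths, the size whitelist, and the function name are made up:

    # Called only when the scaled file is missing; returns the path to serve,
    # or None to let the 404 stand.
    import os, re
    from PIL import Image

    ALLOWED_SIZES = {(100, 100), (200, 300), (640, 480)}   # whitelist blocks URL injection
    IMAGE_ROOT = "/var/www/images"

    def handle_missing(requested_path):
        m = re.match(r"^(\w+)_(\d+)x(\d+)\.jpg$", os.path.basename(requested_path))
        if not m:
            return None                                     # not a resize URL: real 404
        resource_id, w, h = m.group(1), int(m.group(2)), int(m.group(3))
        if (w, h) not in ALLOWED_SIZES:
            return None                                     # unknown size: refuse to generate
        source = os.path.join(IMAGE_ROOT, resource_id + ".jpg")
        if not os.path.exists(source):
            return None                                     # original missing: real 404
        target = os.path.join(IMAGE_ROOT, "%s_%dx%d.jpg" % (resource_id, w, h))
        img = Image.open(source)
        img.thumbnail((w, h))                               # scale down, keeping aspect ratio
        img.save(target, "JPEG")
        return target                                       # later requests are served from disk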

~~~
thwarted
I don't know why so many people consider this to be a "dirty hack". It makes
total sense to overload ErrorDocument and generate the missing content. You
get really good efficiency in the common case of the document existing because
it is in the filesystem and execution never leaves apache core to run a CGI
except in the exceptional case of having to generate it (depending on your
usage patterns, of course).

This is even documented in the apache docs, albeit with rewrite rules, which
is more heavyweight and less efficient:
<http://httpd.apache.org/docs/2.2/rewrite/rewrite_guide_advanced.html#on-the-fly-content>
I believe that using ErrorDocument for this purpose is explicitly mentioned
somewhere in the apache docs, but I can't find it right now. I'm pretty sure
that's where I first learned of it (maybe back in the 1.3 days).

I have a feeling that if ErrorDocument had been named (or aliased to)
GenerateMissingContent, then this would be a widely accepted method of
dynamic, on-the-fly generation with filesystem caching. I've even heard a
rationale
that having a script generate the content on error and write it into the
filesystem where apache can then read it directly on later requests is
reimplementing a reverse caching proxy, and you should just stick squid in
front of your web servers (with the, ahem, additional overhead of having to
run and manage two services). This seems like a lot of effort just to avoid
using something that has "error" in the name for non-error cases.
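
For reference, the two Apache setups being compared boil down to roughly this
(the script path is hypothetical):

    # Option 1: the ErrorDocument approach -- the 404 handler generates the missing image.
    ErrorDocument 404 /cgi-bin/make-image.py

    # Option 2: the rewrite-guide approach -- rewrite to the script only when no file exists.
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^/images/(.+)$ /cgi-bin/make-image.py?path=$1 [L]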

~~~
jacquesm
If there had been a more appropriately named hook such as
'about_to_404_last_chance_to_make_content', I'm sure I would have used that
instead :)

The resulting image is indeed written to the filesystem, right next to the
non-scaled one (so when we remove the original we can remove all of them in one go without
having to visit another directory).

The really nasty trick is that when all is done I redirect the browser to the
_same_ url and this time I'm sure it will find it.

And if the source file is missing it really does 404, of course.

The only tricky situation is when you have a whole pile of people requesting
the same image at the same time and it hasn't been saved yet. The first party
to do the 'fopen' gets dibs on doing the transform; everybody else simply
redirects if the file exists, and by the time the browsers have processed the
3xx the file is ready for consumption.

On very rare occasions the race is so close that the image gets rescaled and
saved twice. I've logged the occurrence of this over many millions of rescales
and it only happened a handful of times, so it's not worth improving on.
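
A sketch of that "first fopen gets dibs" idea using an exclusive create,
assuming Python; the function and argument names are made up:

    import os

    def claim_or_redirect(target_path, make_image_bytes):
        # make_image_bytes() stands in for the actual rescale; names here are made up.
        try:
            # Atomic exclusive create: exactly one concurrent request wins the race.
            fd = os.open(target_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Someone else got dibs; just redirect to the same URL. By the time
            # the browser has processed the 3xx the file is (almost always) ready.
            return "redirect to same URL"
        with os.fdopen(fd, "wb") as f:
            f.write(make_image_bytes())   # do the transform, write it next to the original
        return "redirect to same URL"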

~~~
thwarted
_The really nasty trick is that when all is done I redirect the browser to the
same url and this time I'm sure it will find it._

This is a pretty nasty trick (seems like a risk of a redirect loop if the file
can't be written), but unnecessary. You can emit content out of the script
used for ErrorDocument the same as you can with a CGI. So you generate the
content, write it to the filesystem, emit a "Status: 200" header, which
replaces the 404 status apache was going to generate, then send a content-type
and the generated content in the response body.

Your way may be easier if you have complex cache control or expires headers
that apache is configured to add to static files but not dynamically
generated content, since the client is only ever given the file by apache
directly from the filesystem. I've never used that, though.
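
A sketch of that status-override variant as a hypothetical ErrorDocument CGI
script in Python; the directive, script name, and generate() helper are all
assumptions:

    #!/usr/bin/env python3
    # Hypothetical ErrorDocument script; assumes something along the lines of
    #   ErrorDocument 404 /cgi-bin/thumb.py
    import os, sys

    def generate(uri):
        ...                                  # build the image, write it to disk for next time
        return b"\xff\xd8\xff"               # placeholder bytes standing in for the JPEG

    # Apache exposes the originally requested path to ErrorDocument scripts as REDIRECT_URL.
    uri = os.environ.get("REDIRECT_URL", "")
    body = generate(uri)

    sys.stdout.write("Status: 200 OK\r\n")   # overrides the 404 apache was about to send
    sys.stdout.write("Content-Type: image/jpeg\r\n\r\n")
    sys.stdout.flush()
    sys.stdout.buffer.write(body)            # the generated image goes in the response body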

~~~
drusenko
we use this exact approach to generate all of our blog pages. incredibly
efficient.

the twist we use to handle race conditions is to create a lock file. if the
404 handler is called and the lock file exists, someone else is generating
that page as we speak -- it goes into a timed loop to check for the file being
available. when found, it serves it directly with the 200 header, like you
mentioned.

we have a maximum loop value (a second, i think) -- if you hit that, it means
the other process died for some reason, and never deleted the lock or the
page. in that case, generate and write the page yourself, and remove the lock.
all future requests are served off of the file system.
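
A sketch of that lock-file scheme, assuming Python; the paths, poll interval,
and generate_page() helper are made up:

    import os, time

    LOCK_TIMEOUT = 1.0    # seconds; past this, assume the other process died

    def serve_page(page_path, lock_path, generate_page):
        # If someone else is generating the page, wait for it to appear.
        if os.path.exists(lock_path):
            deadline = time.time() + LOCK_TIMEOUT
            while time.time() < deadline:
                if os.path.exists(page_path):
                    with open(page_path, "rb") as f:
                        return f.read()      # serve it with a 200, as described above
                time.sleep(0.05)
        # No lock, or the other process never delivered: do it ourselves.
        open(lock_path, "w").close()
        body = generate_page()
        with open(page_path, "wb") as f:
            f.write(body)
        os.remove(lock_path)
        return body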

one nice part of this caching mechanism is that to remove the cache, you just
rm -rf the blog directory. tada!

------
patio11
This approach might be totally obvious to many people here, but it's quick to
implement, it performs well, and it scales impressively:

1) The web server checks to see if the image exists in a file store and, if it
does, serves it directly.

2) If not, the request goes to your web application stack (Rails, etc.), which
fires off an ImageMagick process to do whatever magick the request asks for,
operating on either a base image or something which you magick up at request
time.

3) Your application twiddles its thumbs for a second or two. (This has
potentially negative consequences for other requests coming into the same
mongrel, at least in Rails. Consider doing load balancing which is aware of
the mongrel being tied up -- I think Pound does this fairly easily.)

4) After you've got the image, save it to the file store and tell the web
server "Send the user the file at this path".

5) The next request for the same URL will hit the webserver, find the file on
disk, and stream it automatically.

In my case I use a cron script every day to blow away all the images that
haven't been accessed in the last X days, but depending on your needs it might
be easier to just persist them indefinitely. (I use GIFs to give users a live
preview of the PDF they're essentially editing, and generate one every second,
so if I didn't do this I would drown in half-written bingo cards. My users go
through a couple of gigs a week 20kb at a time.)
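
A sketch of that cleanup job, assuming Python, a made-up cache directory, and
a filesystem that actually tracks access times:

    import os, time

    CACHE_DIR = "/var/cache/previews"        # hypothetical path
    MAX_AGE_DAYS = 7                         # the "X days" from above

    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for root, _dirs, files in os.walk(CACHE_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_atime < cutoff:   # not accessed recently
                os.remove(path)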

~~~
JacobAldridge
Patrick, your observations, awareness, and ability / willingness to
communicate them clearly continue to impress me. Cheers.

~~~
patio11
Thanks! Writing in English helps me take refreshing little breaks from, e.g.,
trying to figure out what the single comment left in a 3,000 line source file
written in Japanese by our Korean outsourcing team means.

(Answer: it means today will be a long, long day.)

------
mcantor
Amazon should be paying this guy for documenting their codebase for them.

~~~
martey
Because Amazon has no internal documentation about how to create product
images? Or because Amazon's terms of service explicitly prevent doing
anything useful with these images?

~~~
chrischen
Because this guy was very thorough in detailing how to manipulate Amazon's
dynamic image generator.

------
chaosmachine
See also: <http://www.gertler.com/nat/abusedimages.html>

------
joshwa
See also: <http://www.scene7.com/solutions/dynamic_imaging.asp> \- used by
many huge e-commerce sites (including my >$1B employer).

------
aarongough
IMDB uses a similar system for retrieving/resizing their images (not wholly
surprising as they are owned by Amazon)...

IMDB's format for resizing images: [server hostname]/images/M/[image
identifier]._V1._SX[width]_SY[height]_.[format]

eg: <http://ia.media-imdb.com/images/M/MV5BNzEwNzQ1NjczM15BMl5BanBnXkFtZTcwNTk3MTE1Mg@@._V1._SX1000_SY400_.jpg>

Unfortunately I didn't have any luck wringing meaning out of their image
identification strings...

I also found out rather late in the game that they use an interesting
cookie-based scheme for defeating people trying to link to their images from external
sites...

------
windsurfer
Blurring a large image is really CPU-intensive. Someone could DoS Amazon
using a blur URL.

~~~
NathanKP
It would probably cache the image, so a real DoS would require blurring
multiple images at multiple levels.

~~~
windsurfer
You could just increment the blur amount. Or set a random amount on
distributed clients.

------
bcl
For <http://www.movielandmarks.com> I ended up caching the DVD images locally
and serving them up myself. Some of the URLs served up by AWS go away, and I
don't re-request them after the initial add of a movie.

I'm going to have to read this article in more detail and see if it can clean
that up for me and move image serving back to Amazon.

If you are going to use Amazon's images, why not sign up for an associate
account and provide links back to the product pages so you can generate some
income from it?

~~~
mrkurt
You should be careful caching images and other data, since it's (or was) a TOS
violation. :)

------
albemuth
A Ruby DSL library should pop up on GitHub any moment now.

------
qeorge
phpThumb[1] + mod_rewrite is a simple way to get up and running with a similar
cache of your own. It supports ImageMagick so you've got a lot of options.

[1] <http://phpthumb.sourceforge.net/>

------
releasedatez
Thanks for sharing this. It'll help me a lot.

------
liuliu
I don't understand why Amazon has to do image cropping/resizing on the fly.
Good image resizing algorithms (cubic / seam carving) are time-consuming. Even
good algorithms (Poisson?) for image modification (adding a logo) are
computationally expensive. I just don't get why storing an image would be more
expensive than computing one.

~~~
tlrobinson
Surely they cache the results?

It seems pretty smart to me. Rather than having some external process that
tries to figure out every permutation of image that might be necessary, they
just have a simple HTTP API that always gives you exactly the image you need.

As long as they're caching the image after the first request, I don't see any
problem.

