

Batch Processing Millions and Millions of Images - GhotiFish
http://codeascraft.com/2010/07/09/batch-processing-millions-of-images/

======
jafaku
_"But at this rate it would take 170 years to resize all of those images.

But here’s the spoiler… We did it in nine days. Every single image."_

TL;DR: Computers are faster than humans.

------
martian
This is an old post and it suffers from some shiny rhetoric, but it was very
technically relevant for work I did recently.

I had a problem similar to the OP's: millions of images stored on S3 that each
needed to be rescaled to a set of four new dimensions. Etsy's work encouraged
me to dive in with the hope that I could accomplish my task with relatively
little pain.

Indeed, I wrote some multi-threaded Python that ran on a single XXL EC2
instance and ripped through every image in less than 48 hours.

Thanks to Mike, and the folks at Etsy, for posting about your experiences!
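
A minimal sketch of the kind of multi-threaded Python described above, not
martian's actual code. The four target sizes, the function names, and the
worker count are all assumptions made for illustration, and the S3 download,
pixel resize, and upload steps are left as a comment rather than guessed at.
Threads suit this shape of job because most of the time per image is likely
spent waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical bounding boxes (max width, max height); the comment only
# says "four new dimensions" and does not list them.
TARGET_SIZES = [(75, 75), (170, 135), (570, 570), (1500, 1500)]


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) to fit the box, keeping aspect ratio, never upscaling."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def plan_resizes(image):
    """Compute the output dimensions for one image.

    `image` is a (key, width, height) tuple. Real code would download the
    object from S3 here, resize it once per target, and upload the results.
    """
    key, width, height = image
    return key, [fit_within(width, height, w, h) for w, h in TARGET_SIZES]


def process_all(images, workers=16):
    """Run plan_resizes over all images on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(plan_resizes, images))
```

Calling `process_all([("a.jpg", 3000, 2000)])` returns a dict mapping each key
to its four planned output sizes.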

~~~
billmalarky
Hopefully this comment will find its way to the top instead of the snarky one
that's sitting there now.

------
blt
Why use per-image multithreading at all? Process one image per core.
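
A rough sketch of the "one image per core" approach, assuming a hypothetical
`resize_one` worker that handles a single image without any internal
threading. The function names and the `chunksize` value are illustrative
assumptions, and the actual resize call is left as a placeholder.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count():
    """One worker process per core; fall back to 1 if the count is unknown."""
    return os.cpu_count() or 1


def resize_one(path):
    """Placeholder for a single-threaded resize of one image.

    Real code would call a resizer configured to use a single thread and
    return the output path; here we only derive a hypothetical output name.
    """
    return path + ".resized"


def resize_all(paths):
    """Spread images across processes so each core works on one image at a time."""
    with ProcessPoolExecutor(max_workers=worker_count()) as pool:
        # chunksize batches paths per task to cut inter-process overhead.
        return list(pool.map(resize_one, paths, chunksize=64))
```

When run as a script, `resize_all` should be called from under an
`if __name__ == "__main__":` guard so that worker processes can be started
safely on platforms that spawn rather than fork.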

------
trevyn
This article is from 2010.

~~~
code_duck
It's also been posted to HN previously:
[https://news.ycombinator.com/item?id=1502179](https://news.ycombinator.com/item?id=1502179)

~~~
GhotiFish
OK. I don't understand HN's submission system at all.

I was looking to submit an article [1] that was an intensely fun read. Before
I did, I looked for previous submissions, and found this [2] and this [3].

Given the seriously low scores these stories had vs how good the article was,
I assumed it just got unlucky, so I posted it again.

In response to my POST, Hacker News sent me a 302 redirect to [3].

OK, that must be because the URL refers to the same story, which I figured was
some automatic thing HN does to avoid reposts. Given that the OP URL hadn't
obviously mutated in any way, I knew I didn't need to check, so I submitted
again... and nothing, that was fine.

So I have no idea how Hacker News works. I plead innocent by way of insanity.

1:
[http://www.datagenetics.com/blog/december32011/index.html](http://www.datagenetics.com/blog/december32011/index.html)
2:
[http://news.ycombinator.com/item?id=3971657](http://news.ycombinator.com/item?id=3971657)
3:
[http://news.ycombinator.com/item?id=3474763](http://news.ycombinator.com/item?id=3474763)

~~~
code_duck
Both of the HN links refer to Battleship: the Movie? I don't really
understand. Isn't this story about some guy at Etsy figuring out how to do his
job?

------
code_duck
In my experience, Etsy has a tendency to do average to questionable work and
then spend an above-average effort glorifying themselves about it.

------
dmourati
NFS, seriously? Wow, weak sauce.

~~~
packetslave
Well then, tell us about the system YOU built to batch process millions of
images that doesn't use something "weak sauce" like NFS.

In the real world, you use what works, not the flavor of the month.

~~~
dmourati
Sure, I built a system for a company specializing in image processing and
uploads/sharing called Eye-Fi.

We used MogileFS for image storage.

Resizing millions of images was a daily task.

NFS is not the right tool but is frequently mistaken for such.

~~~
packetslave
NFS can be the right solution, depending on the problem. Storage backend for
serving live images to users? Probably not. Taking uploaded images and storing
a copy (after processing), maybe. Depends on the number of IOPS you expect to
need to throw at those copies.

I haven't seen any posts from Etsy on their normal image pipeline, so it's
hard to say. It sounds like this was a one-time conversion. Their live image
serving is on Akamai, but nothing says they're back-ending the origin servers
on NFS.

I have to imagine with the amount of ops talent at Etsy, they're not using NFS
because they're too dumb to know about a better option. I'm guessing John
Allspaw knows a thing or two about how to serve images at high volume.

~~~
dmourati
The thought that Etsy had good ops people was what led to my surprise at their
relying on NFS for anything of the sort. Two mitigating facts: one, this
article is from 2010, two, this may have been a one-time thing. Still,
surprised and dismayed that they would let NFS into production.

~~~
code_duck
Etsy grew quite quickly, after having been founded by people with little
experience running a website at scale. Once an experienced and competent team
came on board, I believe they spent a lot of effort recovering from poor
decisions made by inexperienced people under duress.

