
Batch Processing Millions and Millions of Images - mattyb
http://codeascraft.etsy.com/2010/07/09/batch-processing-millions-of-images/
======
lars512
The problem of resizing all these images is an "embarrassingly parallel" one,
right? You don't care about how fast any individual image is resized, only how
fast they're resized in aggregate, and each image is a nice small chunk of
work.

The author spends time tuning the number of workers and the number of OpenMP
processes per worker for GraphicsMagick on his 16-core machines. Isn't this
type of tuning a waste of time? Even using just two cores instead of one seems
to introduce inefficiency. Wouldn't he have been better off just using 16
workers, each compiled without OpenMP so that it would run serially (and more
efficiently)?

~~~
mahmud
You really nailed it. Yes, yes and yes!

This is the sort of stupid batch task that can be done with a shell script
using all the machines in the office after hours. Cut out the middleware, cut
down the threads and fancy IPC, and hunker down with a good ole minimal number
of processes on a bunch of machines.

Heck, they might even see a big win if they use a different "server" memory
management algorithm that's more tailored to batch, not-so-responsive
applications.

~~~
gaius
Barely even a shell script; you could do it in a one-liner. Let's say you had
16000 images and 16 cores,

    $ cat filelist.txt | xargs -n 1000 -P 16 ./myconvertprog

I do "batch" compression like this all the time.

~~~
mleonhard
Thanks for pointing out the '-P' argument for xargs!

------
bockris
Several years ago I had to do something similar (tens of thousands instead of
millions, though).

Our problem was that we were making a composite image, but sometimes the
'background' image content was shifted to the left or right. (e.g. All images
were the same size with a white background, but some were made from a
different source template, so the actual content sometimes started at pixel 25
and other times at pixel 35.) To make it worse, they were all JPEGs.

I ended up writing a program to find the bounding box of the content (with a
fudge factor to account for the JPEG dithering) and identified the bad images
to be fixed by hand. It was a nice diversion from SQL and ASP.

~~~
timmorgan
What tools did you use and how long did it take?

~~~
bockris
This was in 2002 and I used Python with the PIL module. It didn't really take
all that long, less than a day for the whole project. Probably 2-3 hours of
run time on my desktop. I wasn't actually doing anything but identifying the
bad images so I wasn't pounding the disk with alternating reads/writes.
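
Something along these lines would do it (a sketch, not the original code; the
near-white threshold for the JPEG dithering fudge factor and the expected
content offset are illustrative):

    from PIL import Image

    DITHER_THRESHOLD = 240  # treat near-white pixels as background

    def content_bbox(path):
        # Grayscale, then map background-ish pixels to 0 and content to 255,
        # so getbbox() ignores the JPEG dithering around the edges.
        img = Image.open(path).convert("L")
        mask = img.point(lambda p: 255 if p < DITHER_THRESHOLD else 0)
        return mask.getbbox()  # (left, upper, right, lower) or None

    def looks_shifted(path, expected_left=25, tolerance=3):
        bbox = content_bbox(path)
        return bbox is not None and abs(bbox[0] - expected_left) > tolerance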

~~~
blasdel
PIL is fucking amazing, I've never seen anything like it for any other
language. Last year I used it to write a computer vision utility for doing
black-box UI testing on embedded medical devices, and I can't imagine having
used anything else. Right now I'm fighting RMagick on another project, I would
have set the building on fire if I'd had to deal with its bullshit on the CV
project.

It links directly with libjpeg, libpng, etc. instead of via some imagemagick
bullshit. Instead of being fucked if what you want to do doesn't line up with
an existing magick command, it gives you the tools to do it yourself, and even
pretty performantly since it can give you numpy arrays.
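
For example, hopping from a PIL image into numpy and back is about two lines,
which is what makes it workable for CV-style per-pixel work (the file name and
the math here are just illustrative):

    import numpy
    from PIL import Image

    img = Image.open("screen_capture.png").convert("L")  # 8-bit grayscale
    pixels = numpy.asarray(img)                          # 2-D uint8 array
    diff = numpy.abs(pixels.astype(int) - 128)           # arbitrary per-pixel math
    back = Image.fromarray(diff.astype(numpy.uint8))     # back to a PIL image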

------
liuliu
Call me naive, but I think for any serious processing you need to dig into
the actual underlying algorithm and implementation to make a difference. The
hardware differences between generations are huge, and different architectures
can have a big impact in terms of performance (really down to the details:
cache lines, bandwidth, and SIMD instructions; image processing is sensitive
to all of these). Tuning your custom implementation can suddenly be
worthwhile.

------
pilif
Mmh

 _"In fact, the research phase of this project took longer than the batch
processing itself. That was clearly time well spent."_

I'm not so sure about this. A sufficiently parallelized but otherwise
unresearched script could have been created in a fraction of the time. So you
would just generate the correct thumbnails on new uploads and then let the
batch job run for, say, two weeks. That naive approach would take around a
day's worth of development time.

The time you gained by doing it the naive way you then put into the rewrite of
that legacy component that prevented you from on-the-fly generating the
images.

What you have done here is, IMHO, wasted time on a legacy solution, and
wasted hardware resources storing additional pictures of which probably only a
minority will ever actually be seen anyway.

Don't get me wrong: I'm sure you had a lot of fun, and there are few things
more exciting than seeing your code perform orders of magnitude quicker than
the initial approach. But in the context of your quote, I have to disagree.

Unless there's more background you didn't tell us about.

------
callmeed
Good, detailed post. I just did about 250K images using ImageMagick and a
shell script. It was across 3 boxes and I noticed quite a performance
difference between an older and newer version of IM. It could have been
another factor, I suppose.

One thing I'm skeptical about:

 _"We found out, almost by accident, that using the previously down-sized
“large” images resulted in better quality and faster processing than starting
with the original full-size images."_

Obviously, the processing speed will be faster with a down-sized image, but I
can't see how starting with a smaller image will give you better quality.
Unless the original image was so large that downsizing caused a lot of detail
loss (in which case the image should have been sharpened).

~~~
liuliu
If you use the wrong algorithm, that can be the case. For example, when you
use bilinear interpolation to down-size an image, on a large image you can end
up interpolating between two far-away pixels, whereas if you start from a
smaller image, the pixels you sample have already been averaged
(interpolated), so you can get a more pleasing result.
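
In PIL terms, the two paths look something like this (sizes and filter choices
are illustrative; the point is only that bilinear sampling on a huge original
effectively point-samples it, while a pre-reduced intermediate has already
averaged its neighborhoods):

    from PIL import Image

    original = Image.open("original.jpg")                  # e.g. 3000x3000
    naive = original.resize((150, 150), Image.BILINEAR)    # sparse sampling

    large = original.resize((600, 600), Image.ANTIALIAS)   # pre-averaged intermediate
    better = large.resize((150, 150), Image.BILINEAR)      # smoother result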

------
patio11
If you want to make the "pick which setting is better" optimization a wee bit
better, it is about two hours of work to rig up "Which do you like: image A,
image B, or 'They're the same'?" and then simulated annealing your way to
victory. Randomize which side you place the "better" setting on, obviously.

I'd almost be tempted to spend a few more hours and hook it up to mechanical
turk rather than actually doing the classification myself. (Partially because
I get bored easily and partially because I have all the artistic appreciation
of a mole rat.)
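
A rough sketch of the randomized pairwise harness (render() and ask() are
hypothetical callbacks; ask() could be a terminal prompt or a Mechanical Turk
HIT returning "left", "right", or "same"):

    import random

    def compare_settings(images, render, setting_a, setting_b, ask):
        # Tally which setting the judge prefers, with sides randomized so the
        # "better" setting isn't always shown on the same side.
        score = {"a": 0, "b": 0, "same": 0}
        for img in images:
            flipped = random.random() < 0.5
            left, right = (setting_b, setting_a) if flipped else (setting_a, setting_b)
            answer = ask(render(img, left), render(img, right))
            if answer == "same":
                score["same"] += 1
            elif (answer == "left") != flipped:
                score["a"] += 1
            else:
                score["b"] += 1
        return score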

------
cageface
Did you investigate resizing them on the fly with some kind of caching layer?
As one of the posters notes, it seems likely that a lot of those images will
be very rarely seen, if ever.

~~~
mcfunley
Certain details of our (legacy) image system made the batch resize necessary.
The explanation would probably be tedious, but maybe Mike will cover it when
he rewrites the damn thing.

------
natch
Nice article. FWIW Perl is often called the "Swiss army chain saw," not "Swiss
army knife." Thanks for sharing your experience.

------
spidaman
Hate to be the Monday morning quarterback but hey, it's Saturday morning and
I'm stuck in a Starbucks in Benicia. So, this is a nice walk through the
optimization process but it is a fundamentally unscalable system. When the
workload triples next year, does it make sense to scale up the hardware? (No)

This system doesn't scale out without ad-hoc partitioning. I think Etsy's
approach here should have been informed by the New York Times' project a few
years ago: they converted 4 TB of scanned TIFFs (their article archive) into
PDFs on a Hadoop cluster running on EC2 and S3. Parallelizing the process
across hardware nodes is the way to scale this out, and it is exactly what
Hadoop was intended for.
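
For flavor, the Hadoop Streaming version of this job is essentially a mapper
that reads one image path per line and shells out to the converter; a minimal
Python sketch (the paths and converter command are illustrative):

    #!/usr/bin/env python
    # Streaming mapper: stdin is a list of image paths, one per line.
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        out = path.replace("/originals/", "/thumbs/")
        ret = subprocess.call(["./myconvertprog", path, out])
        # Emit a status record so failures can be grepped out of the job output.
        print("%s\t%d" % (path, ret))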

All that said, I know there are smart folks at Etsy. I'll give 'em the benefit
of the doubt; there may be a good reason not to go that route but this write
up didn't make that clear.

~~~
mikebrittain
I am informed by the NYT project. :) For practical reasons, that approach
wasn't right for us... at the time. Those details (for the sake of clarity and
sanity) didn't make it into the article.

------
DanielBMarkham
I hate to say this, but for some reason I can't help myself.

This looks like a problem for which I could write the code in 15 minutes.
Assuming Markham's Rule of Estimating (double the number and go to the next
larger unit), that's about 30 hours of work.

This is a problem just begging for a functional solution. If you've got a
language that already has a bunch of libraries, like .NET, you just wire it
up, pipeline it, and send it out to as many workers as you want. The image
tweaking, worker nitpicking stuff just isn't worth the time. If you're doing
millions of everything, then I'm sure you've already got some broker
parallelization thing going on. Shouldn't need to redo that every time. If you
want to be truly anal, do some tests to determine the fastest way to compress.
But aside from that, it's all big old hunks of immutable data splaying out
into the universe, running on as many cores as you need.

I probably missed something. These things always look simpler from the
outside.

------
ewams
Why did you do this as a batch job? Do you run batch jobs instead of resizing
them when a user uploads the pictures? Thanks for sharing.

~~~
Daniel_Newby
This was for a site redesign, after the images had already been uploaded.

