
Ask HN: Any ideas on how to thumbnail 2M images? - phprecovery
Hi,

I'm a developer at the New York Public Library and we're currently evaluating
ways to create derivatives (i.e. thumbnails) from a library of 2 million
master images.

At 5 seconds an image using a tool like ImageMagick (which might be
optimistic), it will take 115 machine days.

Any suggestions or tips to speed up the process?

Edit: Original images are scanned TIFFs, 25-30MB each.
======
samptemp
I'm a little disappointed at the answers provided here.

1) I downloaded a 49.6 MB TIFF file from here:
[http://hubblesite.org/newscenter/archive/releases/2004/32/im...](http://hubblesite.org/newscenter/archive/releases/2004/32/image/d/warn/)

2) From the OSX terminal:

$ time sips -Z 150 hs-2004-32-d-full_tif.tif

Resulting file size is 150x150px and 32KB.

Time to run process: 1.48s user 0.15s system 98% cpu 1.663 total

The superpower computer required for this incredible speed is a 3-year-old
MacBook Pro (with SSD) ;-)

UPDATE:

I created 10 copies of the image. Total size: 520.1 MB

$ time sips -Z 150 *.tif

Time to process 10 images: 14.98s user 1.44s system 94% cpu 17.384 total
(approximately 1.5 seconds of user time per image).

2000000 images * 1.663 seconds = 38 days.

Note: This is using larger images (49.6MB vs the 25-30MB images you have) and
is using a single MacBook Pro. Divide amongst a few machines and be done in a
week. GraphicsMagick could possibly be even faster.

ANOTHER UPDATE:

Since your files are smaller (25-30MB), I found a 30MB jpg sample here:
[http://sto-rvlt-01.sys.comcast.net/speedtest/random4000x4000...](http://sto-rvlt-01.sys.comcast.net/speedtest/random4000x4000.jpg)

Time to process was much faster: 0.37s user 0.04s system 97% cpu 0.418 total

2000000 images * 0.418 seconds = 9.67 days

At that speed we are talking under 10 days. On a single 3-year-old MacBook
Pro. Using built-in software. Without any optimizations. Find 5 computers in
the NY Public Library and you'll be done in 2 days.
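
For what it's worth, sips pegs only one core in the timings above, so a
single machine can likely do better still by running several copies in
parallel. A minimal sketch with xargs (the thumbs/ output directory and the 4
workers are my assumptions):

    $ mkdir -p thumbs
    $ find . -name '*.tif' -print0 | xargs -0 -P 4 -I {} sips -Z 150 {} --out thumbs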

I am not saying this is the fastest solution. I posted this because it appears
that people are over-engineering this problem or proposing solutions which
will cost a lot of money (Amazon, bandwidth, shipping hard drive fees, etc).

~~~
samptemp
To add to my above comment: a big high-five to those offering their free time
or pro bono advice to the New York Public Library! You guys really rock!

~~~
jcanyc
In terms of cost/expense, AWS is very cost-effective for this type of
workload.

If you build out the skill set and scripts to run this type of job, you can
process the images, store them on S3 and kill the running infrastructure. You
can then run this process for big and small workloads as the needs arise on
demand without having to own servers or space in a data center.

~~~
rugdobe
How long would it take to upload 2M images?

~~~
pan69
That will depend on your upstream connection speed.

------
vitovito
The NY Times did this using Amazon Web Services to process all their TIFFs
(into PDFs, but thumbnailing isn't really different). They uploaded 4TB of
data to Amazon over the internet, but today you could just send them a hard
drive and they'll copy it onto S3 for you.

NYT: [http://open.blogs.nytimes.com/2007/11/01/self-service-prorat...](http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/?_php=true&_type=blogs&_r=0)
and
[http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-...](http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/)

AWS Import/Export:
[http://aws.amazon.com/importexport/](http://aws.amazon.com/importexport/)

------
jcanyc
First, consider thumbnailing these on the fly as they're requested by the user.
But if you must mass convert them, this job must be parallelized among many
compute nodes; overly tuning the processing of each conversion probably won't
be very fruitful.

You probably want to move these images to AWS S3, run compute jobs, and
upload the resulting thumbnails back to S3. You could create AWS Simple Queue
Service messages with the S3 URL of each image, have workers pop messages,
and autoscale EC2 instances based on the depth of that queue. What's the plan
for these files after they're processed?
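
To make the queue idea concrete, each worker could be little more than a loop
around the AWS CLI. A rough sketch, not a tested implementation: the queue
URL, the nypl-masters/nypl-thumbs bucket names, and the flat .tif key names
are all hypothetical, and it assumes jq and ImageMagick are installed:

    # pop an S3 key from SQS, thumbnail it, upload the result, ack the message
    while true; do
      msg=$(aws sqs receive-message --queue-url "$QUEUE_URL" --max-number-of-messages 1)
      [ -z "$msg" ] && continue
      key=$(echo "$msg" | jq -r '.Messages[0].Body')
      handle=$(echo "$msg" | jq -r '.Messages[0].ReceiptHandle')
      aws s3 cp "s3://nypl-masters/$key" "/tmp/$key"
      convert "/tmp/$key" -thumbnail 150x150 "/tmp/${key%.tif}.jpg"
      aws s3 cp "/tmp/${key%.tif}.jpg" "s3://nypl-thumbs/${key%.tif}.jpg"
      aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$handle"
      rm -f "/tmp/$key" "/tmp/${key%.tif}.jpg"
    done

Run as many of these per instance as there are cores, and scale the instance
count on the queue depth as described.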

I am local, have deep AWS experience and have next week off if you'd like some
pro bono advice.

50TB S3: ~$1500/month
SQS: ~$0.50/million messages

------
zacman85
We would be happy to help at imgix. We currently process tens of millions of
images per day, including large images like yours. Feel free to contact me at
chris (at) imgix (dot) com. Link: [http://www.imgix.com](http://www.imgix.com)

------
orr94

        At 5 seconds an image using a tool like ImageMagick (which might be optimistic)
    

Are you sure ImageMagick would take that long? I haven't timed it, but I don't
recall thumbnail creation with ImageMagick taking that long.

Also, if you can parallelize it, it won't take 115 days.

~~~
phprecovery
The files are scanned TIFFs in the 25-30 MB range so we're thinking 5 seconds
might be on the optimistic side.

Yes, definitely planning to parallelize but we're likely limited to four
simultaneous instances so that will still take ~20-30 days.

~~~
Igglyboo
Are cloud providers an option? You could use AWS or Google Compute Engine,
set up 150 instances, and do it in ~1 day.

~~~
bilalhusain
As jenkstom pointed out, there might be bandwidth issues. I guess it'll
require uploading a few dozen TB of data.

~~~
snowwrestler
AWS has a "ship a hard drive" upload option.

------
cjbprime
Could you post at least one such image here with the thumbnail size you want?
Then we can compete on ideas using actual numbers.

------
no_future
I wrote a simple CUDA-based image thumbnailing tool with OpenCV a while back.
[https://github.com/NealP/cudathumb](https://github.com/NealP/cudathumb)
It's pretty fast but doesn't expose a good interface via the console, so if
someone would like to contribute that I would be not unpleased. Also, you
need to have OpenCV compiled with CUDA to use it (which is kind of a
nightmare). GraphicsMagick with its multithreaded OpenMP support (it should
just work if you enable it) is pretty fast if you run it on a decent CPU; it
shouldn't take nearly 115 days. If that's still not fast enough for you, you
could try to make something with the Intel Performance Primitives package (if
you're on Intel CPUs), though the cognitive load imposed by writing it might
not be worth whatever speed boost it grants.
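
For what it's worth, the GraphicsMagick route can be a one-liner. A sketch,
assuming a build with OpenMP enabled (OMP_NUM_THREADS then controls the
thread count) and a thumbs/ output directory of my own invention:

    $ mkdir -p thumbs
    $ OMP_NUM_THREADS=8 gm mogrify -output-directory thumbs -format jpg -thumbnail 150x150 *.tif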

~~~
jason_slack
oh my, I just posted asking if anyone wanted to collaborate on a CUDA program
to do this!! Thanks for making yours available on GitHub so I can learn from
it.

------
richm44
Don't fork a new process for each image; write something that uses the
ImageMagick library directly (or another library if you prefer). Don't do
them serially; use threads so you can make use of all your cores.
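
If linking the library in is too much work, a cheap approximation is one
long-lived mogrify per batch, one batch per core, so you fork once per few
thousand images instead of once per image. A sketch; the batch_*/ directories
and thumbs/ output path are assumptions:

    $ mkdir -p thumbs
    $ for d in batch_*/ ; do
    >   mogrify -path thumbs -format jpg -thumbnail 150x150 "$d"*.tif &
    > done; wait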

That said, even serially 5 seconds per image seems very slow. Are you sure
you're not hitting network latency from a remote filesystem or something? If
so do some bulk copies to get the data locally then work on the local copy.

------
brudgers
This sounds like a job that only has to be done once [at this scale]. 115
machine days is a lot of computing. It's not much human time. What really
counts as optimization?

I suspect it's mostly the quality of the access points and less the efficiency
of the algorithm making thumbnails.

I suspect that the secondary optimization is how quickly those access points
become available. In the 17 hours between the time your question was posted
and this comment, more than 12,000 images could have been thumbnailed and
possibly 12,000 additional resources made available or meta-data records
improved.

The great thing about starting with something slow that works is that there is
plenty of time to improve it [and plenty of time to decide if it really needs
improving].

If the process had started at 8am yesterday, more than 50,000 images would
have been converted by 8am Monday morning.

Now in the real world of bureaucracy, there can be more friction entailed in
obtaining a box, sticking it in the corner, and letting it run for four
months, than spending substantial human time researching and implementing a
solution that looks clever and elegant. But really, the hardware for this job,
even if purchased new is only a few hundred dollars.

------
cjbprime
Hey, if you're a Public Library, can't you just upload them all to Flickr, let
Flickr thumbnail them, and then download the thumbnails? :)

~~~
jagger27
Hey, for sure! It's only 55TB up, which should take about 5 days to upload on
a gigabit connection.

------
jenkstom
You could "rent" space by setting up by-the-hour cloud services, but you'd
probably just trade a compute problem for a bandwidth problem. You might want
to make sure you know where the bottleneck is before trying to add resources -
if it's your LAN you can save time by moving closer to the files and adding a
faster connection. But most likely it is CPU or memory.

Can you borrow some servers from a local computer store for a few weeks? It's
not outside the realm of possibility. Call them all, they can only say "no".
Or maybe some businesses in the area might have some spare resources. I'd say
leverage your status as a nonprofit organization for some goodwill help.

~~~
orr94
Frankly, even using your own workstations could help. Especially if you run it
overnight.

~~~
Igglyboo
Yea seriously, just get everyone to bring in their old laptops/computers and
whatever else they can find to do this.

Parallelization is the way to go for a task like this.

------
garethsprice
Get a benchmark for converting a single image. Use "strace -t" (or similar on
your chosen OS) to see where the bottlenecks are occurring at each stage in
the program's execution.

This is a linear time (O(n)) problem with a large set, so it's worth the
effort to shave a few milliseconds where you can as each millisecond optimized
will be multiplied 2-million-fold (about 33 minutes). Once you have an optimal
configuration for single images, test a small set, then let it loose on the
whole set. If you can shave off 2 seconds for a single image, that's 46
machine days right there.

Can you buffer the images onto a ramdisk during conversion? Guessing HDD IO
will be a large bottleneck.
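
On Linux, staging a batch through a ramdisk is two lines (a sketch; the size
and mount point are arbitrary, and it only pays off if the box has RAM to
spare):

    $ sudo mkdir -p /mnt/ramdisk
    $ sudo mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk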

Be sure to run your single image test on different images, so you don't get
false optimization positives due to various I/O caches.

What's the maximum number of images that ImageMagick will take in as a batch
list? (Guessing it's somewhat short of 2M) Whatever it is, make sure to run as
large a list as possible. There's a suggestion at
[http://www.imagemagick.org/Usage/files/#image_streams](http://www.imagemagick.org/Usage/files/#image_streams)
but it re-initializes IM each time which sounds slow (still, can put the
binaries on a ramdisk?)

You want to create a stream / "tape head" type setup where files are being
processed with minimum need to re-init the conversion program. But it looks
like IM6 doesn't support this, so with a sample set of that size you may even
want to look into coding up a simple C program using libtiff/libjpeg whose
sole job is to run the conversion as a stream, if you have access to such
skills. It may be faster than a large general-purpose tool.

Simple parallelism - create the list, split it into N (ImagickMaxNum) input
list files, and run on N workstations to cut the total time by roughly a
factor of N. True parallelism (network queue-based) may be worth exploring
using a queue system (RabbitMQ?) but don't try to write it yourself.
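
The list-splitting half of that is trivially scriptable. A sketch assuming
GNU split, 8 nodes, and hypothetical /masters and thumbs/ paths:

    $ find /masters -name '*.tif' > all.txt
    $ split -n l/8 all.txt chunk_    # 8 chunks, split on line boundaries
    $ # then on each node, against its own chunk file:
    $ while read -r f; do
    >   convert "$f" -thumbnail 150x150 "thumbs/$(basename "${f%.tif}").jpg"
    > done < chunk_aa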

There may be situations where it makes sense to access the files via a
filename mask if you can rename them (img_0 -> img_2000000) so you don't have
to store and parse the file list and can use a simple increment counter.

Hope this helps! I'm no optimization guru and the above is very top-of-mind,
but I enjoy these large problem sets. I'm also in NYC, would love to help out
the NYPL and would volunteer some free time to do so (I need a useful side
project). PM me if you'd like to talk further!

EDIT: Is this the wrong problem to tackle entirely? Can you convert the set on
demand and cache-as-you-go? I.e. if it's book covers that people are browsing,
the first user may wait 5 seconds for an image, but that's not _awful_... Is
there a particular reason to want to create precached derivatives?

------
Someone1234
If you do them one at a time it is 5 seconds, and the majority of that 5
seconds is likely IO wait time. You should invest in an SSD, and should also
look at running multiple conversions concurrently so that the IO on the SSD
is nearly saturated.

Did you see this also:
[http://www.graphicsmagick.org/index.html](http://www.graphicsmagick.org/index.html)

They claim to be faster than ImageMagick.

~~~
morpher
Although having the data already on an SSD would surely be faster, I don't see
how it would be faster to copy from HDD to SSD and then process as opposed to
just processing from the HDD.

------
jscheel
Use VIPS
([http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS](http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS))
for large TIFF conversion. It should handle them much, much faster. Also, I
think there was a write-up on highscalability.com about instagram, or
pinterest, or someone resizing a ton of images a while ago.
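
VIPS ships a purpose-built CLI for exactly this. A sketch (the 150px size and
thumbs/%s.jpg output pattern are my assumptions):

    $ vipsthumbnail -s 150 -o thumbs/%s.jpg *.tif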

------
exelib
My question is: do you need all thumbnails at once? I set up an image album
for my child's kindergarten with more than 17GB (>4000) of JPEGs. I used the
nginx web server to resize on the fly and then let nginx cache the results.
In a VM with 4 cores of an i7-920 it thumbnails 2-6 images per second. After
the first access, images are served from the cache.
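
For reference, the resizing half of that setup is only a few lines of nginx
config with the image_filter module. A sketch; the paths are hypothetical,
and note that image_filter handles JPEG/GIF/PNG but not TIFF, so the OP's
masters would need converting or a different resizer behind the cache:

    # requires nginx built --with-http_image_filter_module;
    # put a proxy_cache layer in front to keep repeat hits cheap
    location /thumbs/ {
        alias /data/images/;            # hypothetical path to the originals
        image_filter resize 150 150;
        image_filter_buffer 32M;        # must exceed the largest source image
    }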

------
percept
In RubyLand I used to use ImageScience
([http://docs.seattlerb.org/ImageScience.html](http://docs.seattlerb.org/ImageScience.html)),
which is apparently based on FreeImage:

[http://freeimage.sourceforge.net/](http://freeimage.sourceforge.net/)

------
eabraham
A few months ago I did benchmarks on ImageMagick and similar libraries for a
Rails application I was working on. I discovered VIPS, which offered a
significant performance improvement over ImageMagick.

[https://github.com/jcupitt/ruby-vips](https://github.com/jcupitt/ruby-vips)

------
pixl8ed
[http://codeascraft.com/2010/07/09/batch-processing-millions-...](http://codeascraft.com/2010/07/09/batch-processing-millions-of-images/)
is a writeup of a similar task at Etsy to resize 135 million images.

------
keviv
A simple solution: Use Gearman to queue the jobs and use supervisord to run
multiple instances of the worker to process these jobs. Should do the trick.
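
For what it's worth, Gearman's bundled CLI gets you most of the way before
writing any worker code. A rough sketch (the "thumb" function name and paths
are made up; the worker reads the job payload, a file path, on stdin):

    $ # worker, run one per core under supervisord:
    $ gearman -w -f thumb -- sh -c 'read -r f; convert "$f" -thumbnail 150x150 "${f%.tif}.jpg"'
    $ # client, queue a background job per image:
    $ gearman -f thumb -b /masters/img_000001.tif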

------
virmundi
Out of curiosity, do you need them all built? Could you thumbnail as they are
requested? So the first request is slow but you cache for the future?

------
jason_slack
Just as an interesting experiment, would anyone want to collaborate on a CUDA
program to do this? I'm learning to utilize it now with C++.

------
LarryMade2
gThumb; GNOME file menu - create thumbnails...

