

How to process a million songs in 20 minutes - brianwhitman
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/

======
joebo
It would be interesting to know how long it takes to run locally on a single
instance.

~~~
hogu
Agreed. Actually, I want to know how long it takes to run on, say, an i7.
I've gotten pretty poor performance out of those Amazon small instances when
doing a lot of CPU-oriented tasks.

~~~
thirdhaf
Let's pessimistically assume you have to read all the information associated
with each file from disk. I suspect this sort of query is I/O-bound, since the
processing is minimal. Assuming we can mostly avoid fragmentation on disk, we
could feasibly read data at 100 MB/s with a consumer-level magnetic drive. The
article says the data set is 300 GB, so we could feasibly get this done in
about an hour.
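
A quick back-of-the-envelope check in Python (both inputs are the guesses
above, not measurements):

    data_gb = 300        # dataset size quoted in the article
    read_mb_s = 100.0    # optimistic consumer magnetic-drive throughput
    seconds = data_gb * 1024 / read_mb_s
    print(seconds / 60)  # ~51 minutes of pure sequential reading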

------
KirinDave
So... MapReduce? Kinda figured that when you had six zeros after your first
digit.

This looks like a fun project, but I can't help but feel like Hadoop
experience reports are a little late to the party at this point. Is there
anyone out there who doesn't immediately think MapReduce when they see numbers
at this scale? If anything, the tool is _overused_, not neglected.

~~~
slewis
It's a well-written tutorial on using MapReduce to process a specific large
music dataset. Now someone can jump right in and start writing their own
processing code without worrying about all the boring details.

Maybe the headline is a bit misleading, but I find the post interesting and
valuable.

------
dvcat
Can anyone clarify whether song data is dense? If it is, I am not even sure
MapReduce is the right paradigm to use, mainly because you will eventually
get to a situation where transfer time overwhelms compute time.

~~~
brianwhitman
the EN song data is dense in the sense that there are far more "columns" than
rows in almost any bulk analysis -- the average song unpacks to ~2000
segments, each with ~30 coefficients, plus global features.

however, in paul's case here he's really just using MR as a quick way to do a
parallel computation on many machines. There's no reduce step; it's just
taking a single average from each individual song, not correlating anything
or using any inter-song statistics.
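
For anyone curious what that map-only pattern looks like, here's a minimal
mrjob-style sketch (not Paul's actual code; the input format and field
handling are hypothetical):

    from mrjob.job import MRJob

    class AverageSong(MRJob):
        # Map-only job: no reducer is defined, so each song is processed
        # independently and nothing is correlated across songs.

        def mapper(self, _, line):
            # hypothetical input: track_id <TAB> comma-separated numbers
            track_id, values = line.split('\t')
            nums = [float(v) for v in values.split(',')]
            yield track_id, sum(nums) / len(nums)

    if __name__ == '__main__':
        AverageSong.run()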

------
revorad
Can you dynamically adjust the number of EC2 instances to optimise for
processing time or price?

~~~
plamere
yes, the number and size of EC2 instances can be set by switches on the
command line when you launch the job.
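
Something along these lines, using mrjob's EMR runner (flag names as of
mrjob circa 2011; exact names may differ across versions, and my_job.py is a
placeholder):

    python my_job.py -r emr --num-ec2-instances 100 \
        --ec2-instance-type m1.small input.dat > output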

~~~
revorad
I saw that but I mean how do you know you need 100 instances? Is there a way
to estimate the optimum number or at least set boundaries, such as max price
or max processing time?

Do you get back info such as how long each instance took to do its job?

~~~
rb2k_
Since it scales close to linearly, you can just calculate it yourself.

And since you pay per minute on EC2, you can either choose to have 10
instances calculate for a minute or 1 instance for 10 minutes. Either way,
you pay for 10 minutes of processing time.

P.S. Simplified, but probably in a reasonable ballpark.

~~~
mattj
It's actually rounded up to complete hours, so 10 instances for a minute is
10x the price of 1 instance for 10 minutes.

For fast iterative development this isn't an issue -- you can reuse the job
flow (the instances you spun up), so you can launch the job several times
over the course of that hour.
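
In other words (a sketch; the $0.085/hr small-instance rate is a hypothetical
stand-in, and EMR adds its own surcharge on top of EC2):

    import math

    def emr_cost(instances, minutes, hourly_rate=0.085):
        # each instance is billed in whole hours, rounded up
        return instances * math.ceil(minutes / 60) * hourly_rate

    print(emr_cost(10, 1))   # 10 instance-hours billed for 1 minute of work
    print(emr_cost(1, 10))   #  1 instance-hour billed for 10 minutes of work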

