
MapReduce for the Masses: Zero to Hadoop in 5 minutes with Common Crawl - Aloisius
http://www.commoncrawl.org/mapreduce-for-the-masses/
======
alexro
Outside of Google, Facebook, and the other top-10 internet businesses, can
someone share a realistic example of employing M/R in a typical corporate
environment?

I'd be really interested in moving forward with the tech, but for the time
being I have to stay where I am - in my primitive cubicle cell.

~~~
patio11
Most of the Fortune 500s have enough data to theoretically have something to
throw M/R at, though whether they'd gain from any particular project is
anyone's guess.

e.g. At a national bank, determine whether distance to the nearest branch or
ATM correlates with deposit frequency or average customer relationship value.
Your inputs are a) 10 billion timestamped transactions, b) 50 million
accounts, c) 200 million addresses and the dates on which they entered
service, and d) a list of 25,000 branch/ATM locations and the dates they
entered service.

This is _fairly_ straightforward to describe as a map/reduce job. You could do
it on one machine with a few nested loops and some elbow grease, too, but the
mucky mucks might want an answer this quarter.
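
To give a flavor of the shape of that job, here's a minimal Hadoop Streaming
sketch in Python. Everything specific is invented for illustration: the
tab-separated input of pre-joined transaction/address records, the field
order, and the branches.csv side file (the 25,000 branch/ATM locations are
small enough to ship to every mapper via the distributed cache).

    #!/usr/bin/env python
    # mapper.py - hypothetical sketch; assumes stdin lines of
    # account_id <TAB> cust_lat <TAB> cust_lon <TAB> tx_type <TAB> amount
    import sys
    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two lat/lon points, in km.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    # branches.csv: one "lat,lon" per line, shipped alongside the mapper.
    branches = [tuple(map(float, line.split(',')))
                for line in open('branches.csv')]

    for line in sys.stdin:
        account, lat, lon, tx_type, amount = line.rstrip('\n').split('\t')
        if tx_type != 'deposit':
            continue
        km = min(haversine_km(float(lat), float(lon), blat, blon)
                 for blat, blon in branches)
        # Key on a zero-padded distance bucket so the framework's
        # lexicographic sort groups buckets in numeric order.
        print('%04d\t%s' % (int(km), amount))

    #!/usr/bin/env python
    # reducer.py - counts deposits and sums amounts per distance bucket.
    import sys

    bucket, count, total = None, 0, 0.0
    for line in sys.stdin:
        key, amount = line.rstrip('\n').split('\t')
        if key != bucket:
            if bucket is not None:
                print('%s\t%d\t%.2f' % (bucket, count, total))
            bucket, count, total = key, 0, 0.0
        count += 1
        total += float(amount)
    if bucket is not None:
        print('%s\t%d\t%.2f' % (bucket, count, total))

A second pass (or a join against the accounts file) would turn the raw
deposit counts into per-account frequency, but the partition-then-aggregate
shape is all there is to it.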

I know the feeling, though: I keep wanting to try it, but haven't been able to
find a good excuse in my own business yet.

~~~
Radim
I doubt such companies would put their data "in the cloud".

And if they don't, the cost of supporting this "straightforward" Hadoop
infrastructure, in terms of hardware, engineering, and support, is so massive
that the little elbow grease for a simple I-know-what-it-does solution may
well be worth it.

In other words, I share alexro's concerns. If you're buying into the M/R hype
and processing your blog logs in the cloud, that's one thing. But legitimate
business use cases are probably not as common as people may expect/hope.

~~~
ajessup
Hadoop is not a cloud solution - most organizations deploy it on their own
infrastructure. Some folks (e.g. Amazon AWS) offer a "hosted" version.

~~~
Radim
Sure, but that's not what this article is about ("5 min setup, for the
masses").

------
hspencer77
This is a good article... I'd be interested in seeing this done using AppScale
([http://code.google.com/p/appscale/wiki/MapReduce_API_Documen...](http://code.google.com/p/appscale/wiki/MapReduce_API_Documentation))
and Eucalyptus.

~~~
ssalevan
That's an interesting idea, and I dig using a fully open stack; we'll consider
adding it to our next howto!

------
metaobject
Does anyone know of an application of M/R to non-text data like image data or
time series data? I'm trying to think about how to process a huge set of 3D
atmospheric data where we are looking for geographic areas that have certain
favorable time series statistics. We have the data stored in time series order
for each pixel (where a pixel is a 4KM x 4KM area on Earth) and we compute
stats for random pixels and try to find optimal combinations of N
pixels/locations (where N is a runtime setting).

~~~
ahalan
MapReduce is applicable wherever you can partition the data and process each
part independently of others.

I used Hadoop/HBase for EEG time-series analysis, looking for certain
oscillation patterns (basically classic time-series classification), and it
was an embarrassingly parallel problem:

Map:

1. Partition the data into fixed segments (either temporal, say 1-hour chunks,
or location-based, say 10x10 blocks of pixels). Alternatively you can use a
'sliding window' and extract features as you go. In some cases you can use
symbolic representation/piecewise approximation to reduce dimensionality, as
in iSAX: <http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html>, "sketches" as
described here: [http://www.amazon.com/High-Performance-Discovery-Time-
Techni...](http://www.amazon.com/High-Performance-Discovery-Time-
Techniques/dp/0387008578) or some other time-series segmentation techniques:
<http://scholar.google.com/scholar?q=time+series+segmentation>

2. Extract features for each segment (either linear statistics/moments or
non-linear signatures:
[http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity...](http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity/index.html)
). The most difficult part here has nothing to do with MapReduce: it's
deciding which features carry the most information. I found the ID3 criterion
helpful:
<http://en.wikipedia.org/wiki/ID3_algorithm>, also see
[http://www.quora.com/Time-Series/What-are-some-time-
series-c...](http://www.quora.com/Time-Series/What-are-some-time-series-
classification-methods) and
[http://scholar.google.com/scholar?hl=en&as_sdt=0,33&...](http://scholar.google.com/scholar?hl=en&as_sdt=0,33&q=time+series+dimensionality+reduction)

Reduce:

3. Aggregate the results into a hash-table where the keys are segment
signatures/features/fingerprints, and the values are arrays of pointers to
corresponding segments. (Based on its size, this table can either sit on a
single machine or be distributed across multiple HDFS nodes.)

Essentially you do time-series clustering at the Reduce stage, with each
'basket' in the hash-table containing a group of similar segments. It can be
used as an index for similarity or range searches (for fast in-memory
retrieval you can use HBase, which sits on top of HDFS). You can also have
multiple indices for different feature sets.
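
Here's a minimal sketch of steps 1-3 in Hadoop Streaming style Python. The
input format (a pixel id followed by its comma-separated series), the window
size, and the crude quantized mean/variance 'signature' are all placeholders
for the real segmentation and feature extraction described above:

    #!/usr/bin/env python
    # Illustrative only: one script acting as mapper or reducer
    # depending on its first argument.
    import sys

    WINDOW = 256  # step 1: fixed-size segments

    def signature(seg):
        # Step 2: features per segment. Quantized mean/variance is a
        # stand-in for iSAX symbols, sketches, or nonlinear statistics.
        n = len(seg)
        mean = sum(seg) / n
        var = sum((x - mean) ** 2 for x in seg) / n
        return '%d:%d' % (round(mean), round(var))

    def mapper():
        for line in sys.stdin:
            pixel_id, series = line.rstrip('\n').split('\t')
            values = [float(v) for v in series.split(',')]
            for start in range(0, len(values) - WINDOW + 1, WINDOW):
                # key = feature signature, value = pointer to the segment
                print('%s\t%s:%d' % (signature(values[start:start + WINDOW]),
                                     pixel_id, start))

    def reducer():
        # Step 3: one output line per 'basket' of similar segments.
        sig, pointers = None, []
        for line in sys.stdin:
            key, ptr = line.rstrip('\n').split('\t')
            if key != sig:
                if sig is not None:
                    print('%s\t%s' % (sig, ','.join(pointers)))
                sig, pointers = key, []
            pointers.append(ptr)
        if sig is not None:
            print('%s\t%s' % (sig, ','.join(pointers)))

    if __name__ == '__main__':
        mapper() if sys.argv[1] == 'map' else reducer()

The resulting signature -> [segment pointers] table is exactly the index
described above.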

-----

The hard part is problem decomposition, i.e. dividing the work into
independent units, replacing one big nested loop/sigma over the entire dataset
with smaller loops that can run in parallel on parts of the dataset. Once
you've done that, MapReduce is just a natural way to execute the job and
aggregate the results.

------
CurtHagenlocher
It would be nice if there were an estimate for how much it costs to run the
sample code.

EDIT: Apparently, I can't read. :(

~~~
ssalevan
Hey Curt, most of my own runs, using the default of 2 small VMs, resulted in 3
normalized hours of usage, which equated to around 25 cents per run.

~~~
Radim
that's for the crawl sample, not the entire 4TB index, right?

how much data was that?

~~~
ssalevan
That was just for the crawl sample, yes, and it was approximately 100 MB of
data, though you can specify as much as you'd prefer.

The cool thing about running this job inside Elastic MapReduce right now is
that you can get at the S3 data for free; accessing it from outside has a
cost, though both options are pretty reasonable. Right now, you can analyze
the entire dataset for around $150, and if you build a good enough algorithm
you'll be able to get a lot of good information back.

We're working to index this information so you can process it even more
inexpensively, so stay tuned for more updates!

~~~
mat_kelcey
How is the $150 broken down?

