MapReduce for the Masses: Zero to Hadoop in 5 minutes with Common Crawl (commoncrawl.org)
105 points by Aloisius 852 days ago | comments


alexro 852 days ago | link

Outside of Google, Facebook, and the other top-10 internet businesses, can someone share a realistic example of employing M/R in a typical corporate environment?

I'd be really interested in moving forward with the tech, but for the time being I have to stay where I am - in my primitive cubicle cell.

-----

patio11 852 days ago | link

Most of the Fortune 500s have enough data to theoretically have something to throw M/R at, though whether they'd gain from any particular project is anyone's guess.

e.g. At a national bank, determine whether distance to the nearest branch or ATM correlates with deposit frequency or average customer relationship value. Your inputs are a) 10 billion timestamped transactions, b) 50 million accounts, c) 200 million addresses and the dates they began and ended service, and d) a list of 25,000 branch/ATM locations and the dates they entered service.

This is fairly straightforward to describe as a map/reduce job. You could do it on one machine with a few nested loops and some elbow grease, too, but the mucky mucks might want an answer this quarter.
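
To make that concrete, here is a minimal Hadoop Streaming sketch (Python) of the join step. The field names and input layout are hypothetical, and the distance to the nearest branch is assumed to be precomputed per account - this is an illustration of the decomposition, not a real bank schema.

    #!/usr/bin/env python
    # mapper.py - hypothetical sketch. Assumes two tab-separated record
    # types are mixed on stdin:
    #   TXN   <account_id>  <timestamp>  <txn_type>
    #   ACCT  <account_id>  <km_to_nearest_branch>   (distance precomputed)
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if parts[0] == "TXN" and parts[3] == "deposit":
            print("%s\tD" % parts[1])                  # one deposit for this account
        elif parts[0] == "ACCT":
            print("%s\tKM:%s" % (parts[1], parts[2]))  # distance record for this account

    #!/usr/bin/env python
    # reducer.py - reduce-side join of deposits with distance per account;
    # Hadoop Streaming delivers all lines for a given key contiguously.
    import sys

    def flush(acct, km, deposits):
        if acct is not None and km is not None:
            bucket = int(float(km))                # 1 km-wide distance buckets
            print("%d\t%d" % (bucket, deposits))   # a second pass averages per bucket

    acct, km, deposits = None, None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != acct:
            flush(acct, km, deposits)
            acct, km, deposits = key, None, 0
        if value.startswith("KM:"):
            km = value[3:]
        else:
            deposits += 1
    flush(acct, km, deposits)

Correlating bucketed distance against deposit counts (or relationship value) is then a tiny second job, or even a local script over the reducer output.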

I know the feeling, though: I keep wanting to try it, but haven't been able to find a good excuse in my own business yet.

-----

Radim 852 days ago | link

I doubt such companies would put their data "in the cloud".

And if they don't, the cost of supporting this "straightforward" Hadoop infrastructure - in terms of hardware, engineering, and support - is so massive that the little elbow grease for a simple I-know-what-it-does solution may well be worth it.

In other words, I share alexro's concerns. If you're buying into the M/R hype and processing your blog logs in the cloud, that's one thing. But legitimate business use-cases are probably not as common as people may expect/hope.

-----

ajessup 852 days ago | link

Hadoop is not a cloud solution - most organizations deploy it on their own infrastructure. Some folks (e.g. Amazon AWS) offer a "hosted" version.

-----

Radim 852 days ago | link

Sure, but that's not what this article is about ("5 min setup, for the masses").

-----

nl 852 days ago | link

And if they don't, the cost of supporting this "straightforward" Hadoop infrastructure - in terms of hardware, engineering, and support - is so massive that the little elbow grease for a simple I-know-what-it-does solution may well be worth it.

That's not true. A small cluster and support from a company like Cloudera costs much less than a new Oracle install.

-----

ericlavigne 851 days ago | link

The mucky mucks can have their answer today with a random subset of that data: 10,000 accounts and the corresponding addresses and transactions. We'll need map-reduce tomorrow if they want precision in the last decimal point.

-----

rmc 849 days ago | link

You might be able to do it with a random subset, but your accuracy would not be as good. For companies with this much data, a 0.1% increase (or decrease) can cost millions. If you use a random subset, you won't be able to accurately find this teeny little variation.

-----

ajessup 852 days ago | link

I think any corporate data warehouse is an area where Hadoop+Pig/Hive could add a huge amount of value. Here's why. Conventional data warehouses are built by running ETL transforms against existing sources, where you carefully select which attributes you want to transform in. These transforms are tricky to set up and even trickier to change if you decide you need a new piece of data (which happens every day in big BI implementations). And if the source system isn't permanently preserving the data (e.g. log files), then pulling in data retrospectively often isn't possible.

With an HDFS cluster you can cost-effectively dump undifferentiated data from existing sources into a "data lake" without worrying about complex and highly selective transforms, and then use tools like Pig and Hive to do ad-hoc interrogations of the data.

Most data warehouse implementations fail to a large extent because of the ETL problem - Hadoop could help solve that in a big way.
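
Pig and Hive are the natural tools for those ad-hoc questions; purely to illustrate the schema-on-read idea with plain Hadoop Streaming, here's a hypothetical mapper over raw JSON events that were dumped to HDFS without any upfront transform (the field names are invented):

    #!/usr/bin/env python
    # mapper.py - an ad-hoc question asked long after the data was dumped:
    # the "coupon_code" field was never modelled in the warehouse schema,
    # but it is still sitting in the raw dump, so we can count it now.
    import json
    import sys

    for line in sys.stdin:
        try:
            event = json.loads(line)
        except ValueError:
            continue                    # tolerate junk lines in the "data lake"
        if event.get("type") == "checkout":
            print("%s\t1" % event.get("coupon_code", "NONE"))

The reducer is the usual sum-per-key pattern, and in practice you'd express the whole thing as a couple of lines of Pig or Hive instead.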

Further reading (no connection to me) http://www.pentaho.com/hadoop/

-----

grammr 852 days ago | link

Many people mistakenly assume that MapReduce is only useful when dealing with Google- or Facebook-scale datasets. Nowadays, MapReduce is simply a convenient data access pattern, particularly when your data is spread across multiple machines. For example, Riak employs the MapReduce paradigm as its "primary querying and data-processing system." [0]

Clearly, Riak does not expect each and every query to run for hours and involve TBs of data. MapReduce is just a straightforward way to process distributed data.

[0] http://wiki.basho.com/MapReduce.html

-----

seiji 852 days ago | link

Read http://infolab.stanford.edu/~ullman/mmds.html a few times and you'll know more about manipulating large data sets than most people.

If you're just looking for a quick example: imagine you want to parse 6,000 webserver logs (each being 100+GB) to determine the most popular pages, referrers and user agents. The map phase extracts all fields you want and the reduce phase counts all the extracted unique values.
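
In Hadoop Streaming terms that's roughly the following - a sketch assuming Apache combined log format:

    #!/usr/bin/env python
    # mapper.py - pull out page, referrer and user agent from each log line.
    import sys

    for line in sys.stdin:
        try:
            fields = line.split('"')
            request, referrer, agent = fields[1], fields[3], fields[5]
            page = request.split(" ")[1]
        except IndexError:
            continue                          # skip malformed lines
        # key = everything before the first tab, so identical values group together
        print("page %s\t1" % page)
        print("referrer %s\t1" % referrer)
        print("agent %s\t1" % agent)

    #!/usr/bin/env python
    # reducer.py - sum the 1s per (field, value) key; sort the output to rank.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").rsplit("\t", 1)
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

Run it with the hadoop-streaming jar, pointing -mapper and -reducer at the two scripts.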

"mapreduce" is about processing at scale. You can always do the same thing on a smaller set of data locally (and usually with just sed, awk, sort, and uniq).

-----

nl 852 days ago | link

The NY Times, back in 2007, shared how they generated PDFs using Map/Reduce (on Amazon infrastructure)[1].

Hadoop has a long, long list of companies using it[2]. Cloudera has a similar list[3].

[1] http://open.blogs.nytimes.com/2007/11/01/self-service-prorat...

[2] http://wiki.apache.org/hadoop/PoweredBy

[3] http://www.cloudera.com/customers/

-----

pork 852 days ago | link

If you have tons of data (logs, for example), on the order of tens of terabytes or more, and want the benefit of SQL in addition to MR to query and crunch that data offline, you can use something like Hive, which runs on top of Hadoop, gives you all the power and familiarity of SQL, and lets you "drop down" to MR if you need the extra power.

-----

nl 852 days ago | link

Splunk is killing it in the "Map/Reduce for enterprise" market. See http://www.splunk.com/industries

-----

jchrisa 852 days ago | link

Map/reduce can be used for garden-variety stuff. For instance, my blog index is the result of a map/reduce index of recent posts (in this case those tagged html5): http://bit.ly/s7NHYM

-----

CurtHagenlocher 852 days ago | link

It would be nice if there were an estimate for how much it costs to run the sample code.

EDIT: Apparently, I can't read. :(

-----

Aloisius 852 days ago | link

The blog post says it costs something like 25 cents.

-----

ssalevan 852 days ago | link

Hey Curt, most of my own runs, using the default of 2 small VMs, resulted in 3 normalized hours of usage, which equated to around 25 cents per run.

-----

Radim 852 days ago | link

that's for the crawl sample, not the entire 4TB index, right?

how much data was that?

-----

ssalevan 852 days ago | link

That was just for the crawl sample, yes, and was approximately 100 MB of data, though you can specify as much as you'd prefer.

The cool thing about running this job inside Elastic MapReduce right now is that you can get at the S3 data for free, and at a pretty reasonable cost from outside of it. Right now, you can analyze the entire dataset for around $150, and if you build a good enough algorithm you'll be able to get a lot of good information back.

We're working to index this information so you can process it even more inexpensively, so stay tuned for more updates!

-----

mat_kelcey 852 days ago | link

How is the $150 broken down?

-----

metaobject 852 days ago | link

Does anyone know of an application of M/R to non-text data like image data or time series data? I'm trying to think about how to process a huge set of 3D atmospheric data where we are looking for geographic areas that have certain favorable time series statistics. We have the data stored in time series order for each pixel (where a pixel is a 4 km x 4 km area on Earth), and we compute stats for random pixels and try to find optimal combinations of N pixels/locations (where N is a runtime setting).

-----

ahalan 852 days ago | link

MapReduce is applicable wherever you can partition the data and process each part independently of others.

I used Hadoop/HBase for EEG time-series analysis, looking for certain oscillation patterns (basically classic time-series classification), and it was an embarrassingly parallel problem:

Map:

1. Partition the data into fixed segments (either temporal, say 1-hour chunks, or location-based, say 10x10 blocks of pixels). Alternatively you can use a 'sliding window' and extract features as you go. In some cases you can use symbolic representation/piecewise approximation to reduce dimensionality, as in iSAX: http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html , "sketches" as described here: http://www.amazon.com/High-Performance-Discovery-Time-Techni... , or some other time-series segmentation techniques: http://scholar.google.com/scholar?q=time+series+segmentation

2. Extract features for each segment (either linear statistics/moments or non-linear signatures: http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity... ). The most difficult part here has nothing to do with MapReduce: it's deciding which features carry the most information. I found the ID3 criterion helpful: http://en.wikipedia.org/wiki/ID3_algorithm, also see http://www.quora.com/Time-Series/What-are-some-time-series-c... and http://scholar.google.com/scholar?hl=en&as_sdt=0,33&...

Reduce:

3. Aggregate the results into a hash table where the keys are segment signatures/features/fingerprints and the values are arrays of pointers to the corresponding segments. (Depending on its size, this table can either sit on a single machine or be distributed across multiple HDFS nodes.)

Essentially you do time-series clustering at the Reduce stage, with each 'basket' in the hash table containing a group of similar segments. It can be used as an index for similarity or range searches (for fast in-memory retrieval you can use HBase, which sits on top of HDFS). You can also have multiple indices for different feature sets.


The hard part is problem decomposition, i.e. dividing the work into independent units - replacing one big nested loop/sigma over the entire dataset with smaller loops that can run in parallel on parts of the dataset. Once you've done that, MapReduce is just a natural way to execute the job and aggregate the results (rough sketch below).
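
A stripped-down Hadoop Streaming sketch of that skeleton - the real feature extraction is replaced by a crude mean/variance fingerprint, and the input layout (one segment per line) is an assumption:

    #!/usr/bin/env python
    # mapper.py - assumes each input line is "<segment_id>\t<comma-separated samples>".
    import sys

    def fingerprint(samples):
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / n
        # quantise each feature into coarse bins -> a short signature string
        return "m%d_v%d" % (int(mean // 10), int(var // 100))

    for line in sys.stdin:
        seg_id, raw = line.rstrip("\n").split("\t", 1)
        samples = [float(x) for x in raw.split(",")]
        print("%s\t%s" % (fingerprint(samples), seg_id))

    #!/usr/bin/env python
    # reducer.py - builds the signature -> [segments] table described above
    # (keys arrive sorted and contiguous, so a simple group-by loop suffices).
    import sys

    current, segs = None, []
    for line in sys.stdin:
        sig, seg_id = line.rstrip("\n").split("\t", 1)
        if sig != current:
            if current is not None:
                print("%s\t%s" % (current, ",".join(segs)))
            current, segs = sig, []
        segs.append(seg_id)
    if current is not None:
        print("%s\t%s" % (current, ",".join(segs)))

Swap fingerprint() for whatever feature set actually discriminates your oscillation patterns; the plumbing stays the same.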

-----

grncdr 852 days ago | link

Don't know of an application specifically like that, but I'd love to hear more - what kind of dimensionality does the per-pixel data have? (Is it just RGB * N time series values?) There's nothing very text-specific about Hadoop, just a lot of text-oriented examples.

(e-mail is on my profile)

-----

hspencer77 852 days ago | link

This is a good article... I'd be interested in seeing this done using AppScale (http://code.google.com/p/appscale/wiki/MapReduce_API_Documen...) and Eucalyptus.

-----

ssalevan 852 days ago | link

That's an interesting idea, and I dig using a fully open stack; we'll consider adding it into our next howto!

-----



