Map Reduce: A simple introduction (2010) (ksat.me)
117 points by awjr on May 2, 2014 | 36 comments



Map Reduce seems very interesting, but every example I have seen explains it in terms of counting the frequency of words in documents. I would love to have someone explain it with an actual business example. I can't think of many real-world uses where counting the frequency of words would matter to most businesses (besides maybe some analysis of log files).


Think of the SQL GROUP BY functionality as an example. You have a giant terabyte-scale text file that is a CSV of observations (say the raw US census results). You want to group people by some facet (say zipcode and household income) and compute some complex, arbitrary function for each group.

Your map function scans the rows and outputs the key-value pair (zipcode + "," + income, CSV line). All the CSV lines in a group go to the same instance of the reduce function, where you can run any code you like (compute averages, do deep learning, etc.). The output is the result you want for each group.
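To make that concrete, here is a minimal single-machine sketch in Python. The column positions and the little run() driver are purely illustrative assumptions; a real job would hand map_fn and reduce_fn to the framework and let it do the shuffle.

    import csv
    from collections import defaultdict
    from statistics import mean

    def map_fn(line):
        # Hypothetical column layout: zipcode in column 3, household income in column 7.
        row = next(csv.reader([line]))
        yield row[3] + "," + row[7], line

    def reduce_fn(key, lines):
        # Every CSV line in the group arrives here; run any per-group code you like.
        sizes = [int(next(csv.reader([l]))[5]) for l in lines]  # column 5: household size (also hypothetical)
        return key, mean(sizes)

    def run(lines):
        # Toy stand-in for the framework's shuffle: group map output by key, then reduce.
        groups = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                groups[key].append(value)
        return [reduce_fn(k, v) for k, v in groups.items()]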

This is a pretty simple example, but it does demonstrate where the power of MapReduce comes from -- arbitrary code in the map and reduce steps, with restricted one-way communication allowed from the mapper to the reducer. It also should help you understand the glib "you can implement SQL on mapreduce" comment below, which is what Apache Hive does.


I agree. I hate this example. Not sure why it's so ubiquitous. Think of it like this: each record in your dataset goes through the "map" phase. This is really just a categorization. You look at each record and assign it, or some parts of it, a "key" based on what it contains, and send those KV pairs back out into the ether. The framework will then automatically group those by key and send them, in no guaranteed order and in parallel, to your "reduce" phase. You get a key, and a list of all of the data points that were assigned that key. You do some computation on them, and return the result. When all of this is done, the process is complete.

I think what makes this example so confusing (as well as the other common Hadoop example where you do a partial count in the mapper: "hello": 3, "goodbye": 2) is that the keys mean nothing to the final result (unless you really wanted to know the frequency of each word) - they are only used to shard the work.
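For what it's worth, that partial count (in-mapper combining, or a separate combiner step) is only a few lines; the function names below are just placeholders:

    from collections import Counter

    def map_with_partial_count(document):
        # Instead of emitting ("hello", 1) for every occurrence, emit one
        # partial count per word per document, e.g. ("hello", 3), ("goodbye", 2).
        for word, n in Counter(document.split()).items():
            yield word, n

    def reduce_counts(word, partial_counts):
        # The framework groups all partial counts for a word; just sum them.
        return word, sum(partial_counts)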

Sorry I don't have a real-world example (crunching log files, as you mentioned, is a common one, but it's really just the same thing: counting frequencies).


Here's a tiny twist on WordCount that many, many companies use MR for (often their first foot in the water): you have many gigabytes or terabytes of web access logs and need to bill customers via some API scheme like /api/3001?x=1&y=2 where "3001" there is the customer ID. You add some regex and now you're doing "WordCount" on the customer ID (per day, month, etc).
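In map/reduce terms that might look roughly like the sketch below; the log-line format and the regex are made-up assumptions, since the exact fields depend on your access-log layout:

    import re

    # Hypothetical access-log line: '2014-05-02T10:15:00Z GET /api/3001?x=1&y=2 200'
    API_CALL = re.compile(r"/api/(\d+)\?")

    def map_fn(log_line):
        match = API_CALL.search(log_line)
        if match:
            day = log_line[:10]           # '2014-05-02', assuming an ISO timestamp prefix
            customer_id = match.group(1)  # '3001'
            yield (customer_id, day), 1   # same shape as WordCount, keyed per customer per day

    def reduce_fn(key, ones):
        return key, sum(ones)             # billable API calls for that customer on that day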


>I can't think of many real world uses where counting the frequency of words would matter to most businesses.

The idea is that it's the simple "hello world" of teaching mapreduce.

Likewise, when teaching the syntax of a new programming language, literally displaying the phrase "hello world" is not useful for explaining real-world business uses. But since everybody presumably already knows what "hello world" means, they can ignore that string and instead pay attention to the surrounding syntax (printf, WriteLine, println, puts, echo, etc.) of whatever new programming language they're trying to learn.

Since counting words is very easy to do without mapreduce (using dictionaries or associative arrays) and it also doesn't require any particular business domain knowledge, people can ignore it and just concentrate on the structure of setting up mapreduce.
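For comparison, the single-machine version really is just a dictionary, which is why the word counting itself carries no business content worth studying:

    from collections import defaultdict

    def word_count(text):
        counts = defaultdict(int)
        for word in text.split():
            counts[word] += 1
        return counts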

In that context, the "uselessness" of counting word frequencies makes it easier to isolate the learning of mapreduce.


I've used it for calculating Value at Risk on a portfolio consisting of thousands of books and millions of positions.

With the books spread over a distributed, in-memory data grid of 20 or so nodes, the system sent a command object to each node to do the calculations in parallel in the same process context as the position data. The data was partitioned so that all positions for each book were in the same node. When the calculations were complete, the reduce process consolidated the results.

VaR requires storing interim results to roll the calculation up the hierarchy of books, but that was straightforward with this approach.
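As a rough sketch of the shape of that roll-up, assuming a historical-simulation VaR where each position carries a vector of per-scenario P&Ls (the names and the 99% confidence level here are just illustrative, not the actual production code):

    import numpy as np

    def map_book(book_id, position_pnl_vectors):
        # Runs on the node that already holds the book's positions:
        # sum the per-scenario P&L vectors of every position in the book.
        return book_id, np.sum(position_pnl_vectors, axis=0)

    def reduce_portfolio(book_pnl_vectors, confidence=0.99):
        # Per-book P&L vectors are additive, so rolling up the book hierarchy
        # is just another sum; VaR is then a quantile of the combined losses.
        total_pnl = np.sum(list(book_pnl_vectors), axis=0)
        return -np.quantile(total_pnl, 1 - confidence)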

Any time you can parallelize the calculations, map-reduce is worth considering.

Incidentally, the core idea (like all good software ideas) was present in Lisp decades before Google popularized it. The words map and reduce are even in the language.
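The same two names survive outside Lisp as well; a trivial single-process illustration using Python's built-ins (with none of the distribution, of course):

    from functools import reduce

    words = ["to", "be", "or", "not", "to", "be"]
    pairs = map(lambda w: (w, 1), words)                   # "map" phase in miniature
    total = reduce(lambda acc, kv: acc + kv[1], pairs, 0)  # "reduce" phase: 6 words in all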


Part of the difficulty is that there are very few open source _applications_ (vs frameworks) built using MapReduce available to study. The only ones I know of are:

     https://github.com/snowplow/snowplow
     https://github.com/PredictionIO/PredictionIO
We (Snowplow) use MapReduce primarily to:

1. Scale our event enrichment process horizontally - raw events come in, we validate them, enrich them (IP -> geo etc), store them. With MapReduce, we just throw more boxes at the enrichment process for larger users (we enrich 200m events in ~90 mins on 6 x c3.2xlarges, spot cost of $0.58)

2. Do easy recomputations across a user's full history of raw events - e.g. if we add a new enrichment or a user's business logic changes, we can rerun over their full history going back to 2012

Hope this helps!


Let's use a simple data mining example with logs. You want to count the number of users by region visiting your website to understand where your users are coming from.

In your log files you have the IP which you can use to geolocate those users.

These counts can be calculated in parallel.

We do this by first "mapping" each line of a log file: extract the IP, geolocate it, and emit a key-value pair keyed by region.

Reduce's goal is to calculate an answer based on these keys.

Since the output of map is keyed by region, reduce can simply collect a count of where our users are coming from, region by region.

Map: map the raw input to some key-value space.

Reduce: reduce the output of the map to some aggregate result; think of this like the SQL aggregation functions.
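Putting those two steps into a minimal sketch (the geo lookup here is a made-up prefix table; a real job would use something like a GeoIP database):

    # Made-up prefix table standing in for a real GeoIP lookup.
    GEO = {"203.0.113.": "AU", "198.51.100.": "US", "192.0.2.": "EU"}

    def map_fn(log_line):
        ip = log_line.split()[0]   # assumes the IP is the first field of the log line
        region = next((r for prefix, r in GEO.items() if ip.startswith(prefix)), "unknown")
        yield region, 1            # key = region, value = one visit

    def reduce_fn(region, ones):
        return region, sum(ones)   # visits per region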

Hope that helps!


Think about it this way: you can't have 1 baby with 9 pregnant women in a month, but you can have 9 babies in 9 months with 9 pregnant women.

In this analogy, the pregnant woman is the node and the baby is the processed data (of course).

MapReduce is all about dividing work into tasks as small as they can be and then executing those tasks in parallel on several nodes. Most examples count words because it's very straightforward to explain: you can give one page of a book to each of N people, have them count the words on their page, and afterwards merge the per-page sums into M counters and sum them up again.


A very simple (or simplified) example could be: say you have just one table - a transaction table from a local deli shop with a volume of around 500 transactions/day - and all you are doing with it is producing a report of how many items of each type (sugar/salt/bread) are sold. Say you produced this report on a daily, monthly and yearly basis. You decided to use, say, just a spreadsheet, which is sufficient to store this data as well as to run the calculations using pivot tables. This works for a year.

The shop is now more popular, and a bit bigger too. The volume increases to, say, 2000 transactions a day, and you say, well, let's use MySQL to store this. You are still comfortably generating the report at the end of the year using simple SQL queries.

Now suddenly, they decide to go really big: to expand and open more stores across the city or state, say about 500 stores. They also expand the items in the stores from just a few hundred to a few thousand. They project the transactions across all stores to be in the range of 1,000,000 a day, on average containing about 10 items each. They also want more reports on which products are doing well or badly, what the buying habits are, what the trend over Thanksgiving is, and year-to-year comparisons on various metrics. And the MySQL solution no longer works - storing 10,000,000 rows on a daily basis and running those SQL queries is turning out to be practically impossible. In a year, you now have 3,650,000,000 rows that need to be joined with 100,000 items. Yes, you can add more space and more resources to the machine, but running all those queries now takes hours or days instead of seconds.

This is the point where Hadoop/Map-Reduce comes to the rescue. You now have a cluster of, say, 5 machines, each having, say, 64 GB of RAM and 1 PB of storage, but still costing just around $25,000.

Since you are not familiar with Java or Map-Reduce, and/or don't have time to learn them, you decide to use Hive - an important tool of the Hadoop ecosystem, among others - which still lets you access and process the data in the familiar SQL-query way, while generating map-reduce jobs on your behalf on the nodes. These jobs split the processing across different nodes, bring the required outputs together, may run through other map-reduce jobs if required, and ultimately give you the results that you can use to produce those reports - in a reasonable time, of course.

This is a very simplified but real-life business use case.


I could not agree more. That is why I published an introductory example of MapReduce that is still simple enough for educational purposes yet of a more real-world nature.

glennengstrand.info/analytics/oss

Here you will see a blog post, paper, and GitHub repo of some MapReduce jobs that take openly available San Francisco crime data and load it into an OLAP cube.


Here you go, I wrote this a little while ago: http://tech.hulu.com/blog/2014/04/10/beaconspec/

The first part covers MapReduce, the rest you can skip.


In the real world, people rarely use raw MapReduce; they use a high-level wrapper that provides a friendlier, SQL-like interface, such as Hive and, to a lesser extent, Pig.


You can implement SQL on top of MapReduce.


The earliest search engines could be described as doing pretty much exactly this kind of map-reduce.


At first I thought you were making fun of the CEO with the misspellings. Then I continued reading and realized you just don't use spell check. ;)

Interesting article, thanks.


This is a good explanation and I would love to share it, but the poor spelling and grammar make it un-shareable.


To be honest, I don't think (and I'm going off the author's name) that English is his first language. Personally, I posted this to HN as it succinctly gave me a quick overview of MR.

I try not to judge an article/blog post on the quality of the grammar when the ideas it is trying to convey are solid and useful.

There are some exceptionally fantastic minds out there that are able to succinctly explain sometimes difficult subjects.

I appreciate them for at least trying, and if they explained it in a language that is not their first language, all the more respect to them.


Perfectionist much?

While I can agree that pointing out the spelling and/or grammar mistakes could be constructive, calling the article "un-shareable" because of them seems terribly over-the-top.


Unless you're already familiar with the material, spelling and grammar lend credibility to the content of the article. Like a code smell, sloppy writing makes you ask "are you sure you know what you're doing?"


While I agree that numerous spelling and grammar mistakes can indicate a lack of maturity in writing skills, I also know that some of the most brilliant people I've ever worked with were absolutely terrible at spelling. One guy in particular was an amazing programmer. He produced libraries of code that were fast, efficient, and easy to read and understand. They were also very well documented. I know, because I went through the documentation and fixed the dozens and dozens of spelling errors. If I didn't know otherwise, I'd have thought he made a game out of how poorly he spelled words.

I can appreciate numerous spelling/grammar errors making you analyze an article with a bit more scrutiny. However, I can't think of a single example of an article with those kinds of errors where I couldn't figure out from the content whether I thought the author really knew what they were talking about.

At any rate, I think there is a big difference between an article needing a little bit more scrutiny, and the article being "unsharable".


Just my personal opinion, no disrespect intended. I have personally authored posts that needed a bit more time on the proofreading table and I did worry that people would question my authority on the subject due to grammatical errors in the article.


Fair enough.

Just to be clear, I don't want to give the impression that I don't think spelling/grammar matter (even for blog posts). I just think it is easy to get so pedantic about it that you place far too much weight on them.

I have my own set of pet peeves, and am probably guilty of allowing the violation of one of them to taint my view of an article too quickly and too often.

And of course, there are always those articles that are so bad grammatically that it looks like a first-grader wrote them (but I don't think that's the kind of article we were discussing).


The only place I noticed spelling errors was in the email from the CEO... and spelling errors there seemed to fit.


I love the punchline. Besides the ridiculous requirements and the deadline, the way it is dealt with upon completion is way too close for comfort; I may have worked at some point in the distant past for that particular boss.


Am I right in thinking MapReduce seems to be going out of fashion somewhat, as even Google themselves have moved towards using something they term MillWheel, which is a stream-based processing system?

I have seen one paper on a streaming MapReduce solution though.


Wow. I worked on Millwheel as a summer intern the summer before last. At the time it was a team of about 11 people. I'm honestly pretty surprised to see this comment as I thought it was just a small internal research project.

Have you seen any references to it in the wild other than the Google Research paper?


Oh, maybe I'm wrong; I really thought I saw something that said it was used for the index creation. I'm just having a look over the papers I've read.

They do definitely seem to have switched from MapReduce though at least - http://www.theregister.co.uk/2010/09/09/google_caffeine_expl...


Hadoop is busy replacing MapReduce with Tez. MapReduce was a great first implementation, but we need something that performs better.


This is a fantastic explanation. Anyone got any favourite uses of this, or other resources of similar clarity to share?


I'm a big fan of (and occasional poster to) Reddit's Explain Like I'm Five, where you can ask questions and people are supposed to answer them in simple, accessible terms (if not in the literal "Little Timmy" 5-year-old sort of speech).

You can find it at http://www.reddit.com/r/explainlikeimfive


I really like this one:

http://www.michaelnielsen.org/ddi/how-the-bitcoin-protocol-a...

He designs his own cryptocoin to explain the choices made by bitcoin and how it works.


So this is essentially analogous to division of labour efficiencies. Or am I missing something?


It's more a technique/architecture for parallelizing highly parallelizable tasks.


The many typos in the text make it almost difficult to read...


Great explanation, quite cheeky as well :)



