

Hadoop Reaches 1.0 - imalolz
http://hadoop.apache.org/common/docs/r1.0.0/releasenotes.html

======
digitalsushi
Hadoop reaches 1.0 and my understanding of how to use it is still in
development.

Does anyone have a high-level resource on how MapReduce works for mediocre
programmers like myself who are late to the game? I know she's not ready to
have my babies, but surely I could get to know her a little, maybe just be
friends? I grabbed a pre-made Hadoop virtual machine the other month and was
so far over my head that I had to run away and regroup.

In general I have some very unoptimized problems that MapReduce probably isn't
the right shoe for, but I'd love to explain to my boss _why_ it's the wrong
shoe. And learning about it might be a great start down that path.

~~~
LeafStorm
A good introduction to MapReduce is probably CouchDB, where you use it for
database views instead of SQL-style queries. The basic concepts are:

- The "Map" phase takes a key/value pair of input and produces as many other
key/value pairs of output as it wants. This can be zero, it can be one, or it
can be over 9000. Each Map over a piece of input data operates in isolation.

- The "Reduce" phase takes a bunch of values with the same (or similar,
depending on how it's invoked) keys and reduces them down into one value.

A good example: say you have a bunch of documents like this:

    
    
        {"type": "post",
         "text": "...",
         "tags": ["couchdb", "databases", "js"]}
    

And you want to find all the tags and how many posts have each tag. First,
you have a map phase:

    
    
        function (doc) {
          if (doc.type === "post") {
            doc.tags.forEach(function (tag) {
              emit(tag, 1);
            });
          }
        }
    

In this case, it filters out all the documents that aren't posts. It then
emits a `(tag, 1)` pair for each tag on the post. You may end up with a pair
set that looks like:

    
    
        ("c", 1)
        ("couchdb", 1)
        ("databases", 1)
        ("databases", 1)
        ("databases", 1)
        ("js", 1)
        ("js", 1)
        ("mongodb", 1)
        ("redis", 1)
    

Then, your reduce phase may look like:

    
    
        function (keys, values, rereduce) {
          // sum() is a built-in helper in CouchDB's JavaScript view server.
          // Summing works unchanged on a rereduce pass, since the partial
          // sums are themselves just values to add up.
          return sum(values);
        }
    

Though the kinds of results you get out of it depend on how you invoke it. If
you just reduce the whole dataset, for example, you get:

    
    
        (null, 9)
    

Because that's the sum of the values from _all_ the pairs. On the other hand,
running it in group mode will reduce each key separately, so you get this:

    
    
        ("c", 1)
        ("couchdb", 1)
        ("databases", 3)
        ("js", 2)
        ("mongodb", 1)
        ("redis", 1)
    

Since three pairs were keyed "databases" and each carried the value 1, the
reduced value for "databases" is 3. You're not limited to summing - any
operation that aggregates multiple values and can be grouped by key will work
just as well.
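
Here's a minimal, self-contained sketch of the same pipeline in plain
JavaScript (Node.js rather than CouchDB, so pairs are collected by hand
instead of via emit(), and the sample documents are made up for
illustration):

    
    
        // Sample documents (hypothetical, for illustration only).
        var docs = [
          {type: "post", tags: ["couchdb", "databases", "js"]},
          {type: "post", tags: ["databases", "js"]},
          {type: "post", tags: ["c", "databases", "mongodb", "redis"]},
          {type: "comment", text: "..."}  // not a post; the map skips it
        ];
        
        // Map phase: emit a (tag, 1) pair for every tag on every post.
        var pairs = [];
        docs.forEach(function (doc) {
          if (doc.type === "post") {
            doc.tags.forEach(function (tag) {
              pairs.push([tag, 1]);
            });
          }
        });
        
        // Group-mode reduce: sum the values for each distinct key.
        var counts = {};
        pairs.forEach(function (pair) {
          var tag = pair[0], n = pair[1];
          counts[tag] = (counts[tag] || 0) + n;
        });
        
        console.log(counts);
        // { couchdb: 1, databases: 3, js: 2, c: 1, mongodb: 1, redis: 1 }
    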

Like you said, there are problems that this doesn't work for. But for the
problems it _does_ work for, it's very computationally efficient and fun.

~~~
mun2mun
I have a question. I have read somewhere that map-reduce can leverage
parallelism. So if I map a function over an array, every element can be
processed in parallel, because the elements have no dependency on each other.
But how does reduce leverage parallelism? As far as I understand, the output
of the reduce function depends on the previous value.

~~~
scott_s
In principle, reductions can often be staged, since operations like summing
are associative and have no ordering requirements. Imagine a tree of
reductions. But you are correct: the reduce phase is what will limit
parallelism. If you have a cheap map operation but a really expensive
reduction, you may not see much scalability. (Where "scalable" is a way of
saying "performance improves as available hardware increases, because more of
the parallelism inherent in the application is exploitable.")

------
rberdeen
Hadoop versioning has always been a little confusing to me:

* 0.23.0: 11 November, 2011

* 0.22.0: 10 December, 2011

Now we have 1.0, but it's based on 0.20, not any of the more recent releases?

The 1.0 release notes are pretty useless -- they're just a list of issues. Is
there a summary anywhere?

------
akg
For those interested, here is a pretty good discussion on HN comparing
different NoSQL DBs: <http://news.ycombinator.com/item?id=2052852>

------
firemanx
Awesome! Now if we can just get HBase to update its prereqs and bump its
version, I can have some symmetry in my life!

On a more serious note - is anyone using HDFS for something like what the
WebHDFS stuff was designed for? We're looking at HDFS right now for an event
store mechanism, but it appears to be pretty large-file / stream oriented,
and I'm wondering how it will stack up if we want to do something that
involves files much smaller than, say, 64MB.

~~~
stingraycharles
If you're willing to write some Java, you can probably write your own
InputFormat (the class whose getSplits() method decides how Hadoop carves
input into tasks) and control how your small files are packed into tasks. Be
aware: you might end up either having a lot of trouble getting the 'optimal'
splits, or losing one of Hadoop's major advantages, data (computation)
locality. For example, if you combine 10 smaller files into a single task and
you have 10 different DataNodes, chances are small that all 10 files are
stored on the machine that's performing the MapReduce task.

One thing to note, though: HDFS is indeed very stream oriented. It works in
blocks of 64 MB (by default), and only sends data upstream when you either
close a file or a full block is available to be written. So if your server
crashes with 63 MB of unflushed data, you'll have lost all 63 MB. That was
one of the big caveats we had to work around for the problems we solve with
Hadoop.

~~~
tlipcon
This isn't quite true - data is streamed from the client through a pipeline
made up of all of the replicas as it's written. It's true you'll lose data if
you crash in the middle of a block, _unless_ you call the sync() function,
which makes sure the data has been fully replicated to all of the nodes.

~~~
stingraycharles
Hadoop only writes a block from a client to a DataNode when a whole block is
available. This is to minimize the number of open connections on the
DataNodes (it can take a long time for the client to generate 64 MB of data,
while distributing the block over the replicas takes a relatively short
time).

For more information about this, see
<http://hadoop.apache.org/common/docs/current/hdfs_design.html#Staging> and
<http://hadoop.apache.org/common/docs/current/hdfs_design.html#Replication+Pipelining>

------
CatDaaaady
It was already production ready, in my opinion. I think this release is more
of a "polish" thing, since some people are hesitant to run "0.20" code in
prod.

~~~
knappster
Agreed.

Working with Hadoop a few years ago was a pain in the ass; what really made
it ready (at least for me) was the packaging done by Cloudera.

~~~
rvs
There's now an Apache effort to produce a fully packaged, validated, and
deployable stack of Hadoop components. The project is called Apache Bigtop
(incubating), and its relationship with Cloudera's CDH is like the
relationship between Debian and Ubuntu. We make it super easy for folks to
deploy the released versions of the Bigtop distribution, either via packages:
<http://bit.ly/rHpybV> or VMs: <http://bit.ly/tBGmNt>

------
nchuhoai
I agree with an earlier comment. Big Data, Hadoop, etc. are keywords that are
supposed to get big in 2012; however, as a regular web dev, it's hard for me
to grasp what they can do unless you have gigantic data stores.

------
imalolz
Congrats on the milestone to those involved - it's great to have something
like this available to everyone for free.

On a side note, and not to take anything away from the H-team, I'm pretty
curious how it compares to Google's GFS and the rest of their distributed
computing stack (MR, Chubby, etc.). It would be sweet if Google released some
or all of these someday.

------
paraschopra
Can someone describe the differences from the previous version? Or does this
just mean Hadoop is now "production ready"?

~~~
tlipcon
The 1.0.0 release was formerly known as the 0.20.205.1 release -- i.e., just
bugfixes since 0.20.205.

Hadoop's been "production ready" for years - there are hundreds of companies
running it in business-critical applications. But some people want to see
"1.0" before they move to production :) So we recently decided to call it
1.0, so that the version numbering matches the maturity Hadoop has already
achieved.

-Todd (Hadoop PMC)

~~~
jshen
Have they figured out which API they are using? You know, the old deprecated
one vs. the new one, which, the last time I used Hadoop, was missing features
that forced me to use the old API - even though they had @deprecated all over
the old API.

~~~
tlipcon
Both APIs are available and will continue to be available for the foreseeable
future.

-Todd

~~~
jshen
My point is that I would assume a 1.0 release would have a clear "right way".
If I'm starting a fresh project, is the new API the right one?

