
How to Learn Hadoop for Free - jdwittenauer
http://www.johnwittenauer.net/how-to-learn-hadoop-for-free/
======
atemerev
A better idea: work as a consultant, get a Hadoop-related assignment, accept
it, learn on the go and swear A LOT in the process, still deliver on time.

You got paid AND you got to know some Hadoop! (Worked for me; YMMV)

------
web64
Is Hadoop/MapReduce still as relevant now as it was a few years ago? What
stack would you set up today for a standard Big Data processing system?

~~~
StreamBright
Hadoop certified engineer here. I think Hadoop is losing its popularity or,
from a different point of view, it has saturated 90% of its potential market
and is having trouble entering other markets. The biggest challenges are
operational stability and performance, and the lack of understanding from the
Hadoop companies about the performance characteristics of their systems. On
top of that, there are always two versions of everything (Tez vs Impala, ORC vs
Parquet, etc.) because HWX (Hortonworks) and Cloudera cannot really work
together in an open-source fashion. On top of everything, there are better
products on the market for the different Hadoop use cases. The following list
is incomplete: Alluxio, Apache Beam, Apache Kudu. These systems are trying to
address some of the
aforementioned shortcomings of Hadoop. There are other products like PrestoDB
that take a slightly different approach to a particular problem (accessing
data via a SQL-like interface), mix in some extra goodness (in-memory
caching), and deliver an entirely different customer experience. If you
leave Hadoop land you can also play with Spark or Storm (depending on your use
case). Now that Facebook uses Spark, there is a good chance that an average user
won't run into scaling issues with it. I left out products from vendors
that target the same customers as Hadoop vendors on purpose. There are plenty
of closed source solutions that will leave Hadoop in the dust in almost every
aspect of big data processing (performance, security, UI, stability,
availability, etc.).

~~~
jdwittenauer
I think the term "Hadoop" is becoming almost meaningless. It seems to now be
more of a pointer referencing a basket of distributed processing technologies
that run on YARN/HDFS. Agree completely about there being multiple technologies
to solve every problem; that's one of the most confusing parts to learn.

My own perspective is that there are lots of businesses that haven't yet
needed the capabilities provided by a platform like Hadoop, but they likely
will in the future. So the market may be saturated based on current needs but
that market will continue to expand. Whether it's Hadoop (YARN/HDFS/etc.) that
wins that market share or some other stack like Spark/Mesos remains to be
seen.

~~~
bsg75
> It seems to now be more of a pointer referencing a basket of distributed
> processing technologies that run on YARN/HDFS

You reference the MapR distribution for their training material, and it's
interesting that their version of HDFS is a reimplementation in C++ (MapR-FS).
It's part of the reason I settled on MapR to use tools like Apache Drill,
because the filesystem becomes usable to non-Hadoop tools via NFS (e.g. Awk).

Given the shift in some categories away from MapReduce to other approaches,
could Hadoop eventually just become a collection of distributed filesystems
and job schedulers?

------
dandermotj
Probably the one tool missing from this list is Impala, which is essentially
Hive's successor. Uses the same metastore and runs an order of magnitude
faster. Almost the same flavor of SQL too.

~~~
jdwittenauer
Agree that Impala would fit well on this list. They didn't have any training
on it, presumably because it's a Cloudera-led technology, but my understanding
is it's very popular. Not sure that it truly replaces Hive/Tez though. I think
they each excel at certain types of workloads.

