

Bored in grad school? Learn Hadoop - Aloisius
http://commoncrawl.org/learn-hadoop-and-get-a-paper-published/

======
cwhittle
While learning Hadoop is a good idea, if you're bored in grad school, you're
doing it wrong.

~~~
achompas
Agreed 100%. Grad school is a time to explore all kinds of CS fields (or
various topics within a field you're interested in). Right now I wish I had
28-hour days to mess around with GPU computing, compiler design, language
theory, and random machine learning projects.

EDIT: it's also worth pointing out that Hadoop is not the 'be all, end all'
framework for large-scale data analysis. Many problems do not fit the
map-reduce paradigm and are better suited to other frameworks (I'm looking
into GraphLab once I get free time, for example).

~~~
srean
Hahaha I hear you. "Me too"s are frowned upon here, but I could not resist.
There are so many things that I want to learn that I wish sleep was just
optional.

I know Hadoop is the poster child of all things good, but its API really makes
me fall asleep. Not a big fan. Add to that the fact that Google's
implementation is (or at least used to be) 4 to 6 times faster for similarly
sized processing jobs and more fun to code in (the latter may be entirely
subjective). The funny part is that Google's clusters then were made of weaker
machines! I don't know how it is now.

EDIT: It is indeed in C++, but that alone cannot fully explain the discrepancy.
Java can be slower, but it shouldn't be that much slower. I am sure caching,
memory footprint and, as you said, data access latencies and overall design
play a big role. If I remember correctly, UIUC has an open source MapReduce
framework written in C++, and they claim a similar speedup over Hadoop.

~~~
tumanian
The Hadoop API is a pain. Well, it is fine when doing vanilla MapReduce, but
once one steps away from that, things get ugly real fast with the zoo of Jobs,
Tasks, Contexts, two mapred/mapreduce APIs which do the same thing, plus
Hadoop 0.23-specific calls. I haven't touched Hadoop 1.0 yet, though I doubt
they have cleaned up all the mess. I hope they will streamline the API by 2.0.

------
LisaG
Hi

I am from Common Crawl. Apologies for the site being down! Too much traffic
from HN :) We're working on getting it back up. The Google cache below has all
the contents, so please refer to it for the moment. Here's an excerpt from the
beginning:

Learn Hadoop and get a paper published

We’re looking for students who want to try out the Hadoop platform and get a
technical report published. Hadoop’s version of MapReduce will undoubtedly
come in handy in your future research, and Hadoop is a fun platform to get to
know. Common Crawl, a nonprofit organization with a mission to build and
maintain an open crawl of the web that is accessible to everyone, has a huge
repository of open data – about 5 billion web pages – and documentation to
help you learn these too

[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://commoncrawl.org/learn-hadoop-and-get-a-paper-published/&hl=en&prmd=imvns&strip=1)

------
groundshop
It's the start of the summer, and I'm in the 2nd year of my MS-CS. I began a
simulation last night at 9PM that's still running (usually a 14 hour
turnaround). I just got in, sat at my desk and thought "I'm bored, guess I'll
check HN" and THIS is what I see.

~~~
malloc47
I've been authoring a post on solving interview questions in map/reduce
between experimental runs myself, so I guess I already hit the "bored graduate
student" point too...

~~~
oacgnol
I remember having to resist the urge to map/reduce everything in interview
questions :P. Somehow, I don't think interviewers want to hear how you'd use a
cannon to solve a tiny problem.

~~~
eli_gottlieb
Why use MapReduce when you can use the real `map` and `fold`?
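
For anyone who hasn't seen the plain functional version, here is a minimal
sketch in Python of a word count done with the built-in `map` and
`functools.reduce` (Python's fold). The names `count_words` and `merge` are
made up for illustration, and this runs on a single machine, so it only shows
the shape of the idea, not what Hadoop does across a cluster.

```python
from functools import reduce


def count_words(lines):
    """Count word occurrences using map and a left fold (hypothetical helper)."""
    # "map" step: turn every line into a list of (word, 1) pairs.
    pairs_per_line = map(lambda line: [(w, 1) for w in line.split()], lines)

    # "fold" step: merge each list of pairs into one running dict of counts.
    def merge(counts, pairs):
        for word, n in pairs:
            counts[word] = counts.get(word, 0) + n
        return counts

    # Start the fold from a fresh empty dict.
    return reduce(merge, pairs_per_line, {})


print(count_words(["to be or not to be"]))
# prints {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The fold here is sequential, which is exactly the part a framework like Hadoop
replaces with a distributed shuffle and reduce.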

~~~
finalword
because they're stupid?

Hurray for Java, the most verbose and dog-slow language ever invented.

------
migpwr
I apologize for the off-topic comment but I am hoping a few of the folks on
here who are familiar with Hadoop can help me with a small career decision.

I'm fortunate enough to be up for two systems positions with companies in my
area, and one of them is part of a group maintaining a Hadoop cluster. I've
never maintained Hadoop infrastructure before so I'm wondering if it's worth
the "career capital" investment. There don't appear to be too many openings
that I know of for Hadoop sysadmins, and while I don't expect to lose any
skills working on it, I wonder if this platform is realistically expected to
grow and become something I could build a valuable niche set of skills
around...

Any help would be appreciated...

~~~
oacgnol
I would say go for it. Hadoop is becoming more of a norm in industry and I can
tell you from personal experience that having prior hands-on Hadoop work makes
your resume pop a little more.

Even better is if you're able to muck around in core Hadoop versus abstracted
management layers like Cloudera. The things you have to learn in maintaining a
Hadoop cluster, like cloud systems management, tuning, and running jobs, make
for valuable experience, even if niche.

------
RobAtticus
Perhaps I'm just jaded from NSDI a few weeks ago, but I'm tired of papers
about Hadoop and systems like it. If you're going to do work with Hadoop, at
least start comparing what you're doing with other work that improves Hadoop.
There have been a bunch of papers about improving the scheduling or the shuffle
phase or whatever, but they all compare to vanilla Hadoop and not each other.

------
Bootvis
Down for me, can someone paste the contents?

~~~
brucehart
I was able to find a cached version here:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://commoncrawl.org/learn-hadoop-and-get-a-paper-published/&hl=en&prmd=imvns&strip=1)

------
LisaG
Site is back up. Thanks for your patience!

