Bored in grad school? Learn Hadoop (commoncrawl.org)
52 points by Aloisius on May 9, 2012 | 21 comments



While learning Hadoop is a good idea, if you're bored in grad school, you're doing it wrong.


Agreed 100%. Grad school is a time to explore all kinds of CS fields (or various topics within a field you're interested in). Right now I wish I had 28-hour days to mess around with GPU computing, compiler design, language theory, and random machine learning projects.

EDIT: It's also worth pointing out that Hadoop is not the be-all, end-all framework for large-scale data analysis. Many problems do not fit the map-reduce paradigm and are better suited to other frameworks (I'm looking into GraphLab once I get some free time, for example).


Hahaha I hear you. "Me too"s are frowned upon here, but I could not resist. There are so many things that I want to learn that I wish sleep was just optional.

I know Hadoop is the poster child of all things good, but its API really makes me fall asleep. Not a big fan. Add to that the fact that Google's implementation is (or at least used to be) 4-6 times faster for similar-sized processing jobs and more fun to code in (the latter may be entirely subjective). The funny part is that Google's clusters back then were made of weaker machines! I don't know how it is now.

EDIT: It is indeed in C++, but that alone cannot fully explain the discrepancy. Java can be slower, but it shouldn't be that much slower. I am sure caching, memory footprint, and, as you said, data access latencies and overall design play a big role. If I remember correctly, UIUC has an open-source MapReduce framework written in C++, and they claim a similar speedup over Hadoop.


The Hadoop API is a pain. Well, it is fine when doing vanilla map-reduce, but once you step away from that, things get ugly real fast with the zoo of Jobs, Tasks, Contexts, the two mapred/mapreduce APIs that do the same thing, plus Hadoop 0.23-specific calls. I haven't touched Hadoop 1.0 yet, though I doubt they have cleaned up all the mess. I hope they streamline the API by 2.0.
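To give a flavor of the boilerplate, here is roughly what the canonical word-count job looks like against the newer org.apache.hadoop.mapreduce API. This is a sketch from memory (exact details vary across releases), and it is the vanilla case, before any of the Jobs/Tasks/Contexts zoo shows up:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: all of this just to wire a mapper and a reducer together.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");  // Job.getInstance(conf) in later versions
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

And that's before touching custom partitioners, input formats, or counters.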


I haven't used Google's software. Hadoop is in Java. If Google's implementation is in C or C++, that could explain some of the performance gap.

4-6 times is crazy though. You sure it just wasn't due to data access latencies?


You're right, but I imagine most grad students will inevitably go through a "doing it wrong" phase. Mine lasted close to a year.


Hi

I am from Common Crawl. Apologies for the site being down! Too much traffic from HN :) We're working on getting it back up. The Google cache below has all the contents, so please use that for the moment. Here's the excerpted beginning:

Learn Hadoop and get a paper published

We're looking for students who want to try out the Hadoop platform and get a technical report published. Hadoop's version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these too...

http://webcache.googleusercontent.com/search?q=cache:http://...


It's the start of summer and I'm in the 2nd year of my MS-CS. I started a simulation last night at 9 PM that's still running (usually a 14-hour turnaround). I just got in, sat at my desk, and thought "I'm bored, guess I'll check HN", and THIS is what I see.


I've been writing a post on solving interview questions in map/reduce between experimental runs myself, so I guess I've already hit the "bored graduate student" point too...


I remember having to resist the urge to map/reduce everything in interview questions :P. Somehow, I don't think interviewers want to hear how you'd use a cannon to solve a tiny problem.


Why use MapReduce when you can use the real `map` and `fold`?
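For comparison, the same word count is a few lines of ordinary in-memory map and fold. This sketch uses Java 8 streams just to stay in the thread's language (any functional language's map/fold would do), and it is of course only fair until the data stops fitting on one machine:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class InMemoryWordCount {
      public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "the quick brown fox",
            "the lazy dog");

        // map: split each line into words; fold: count occurrences per word.
        Map<String, Long> counts = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(counts);  // e.g. {the=2, quick=1, brown=1, fox=1, lazy=1, dog=1}
      }
    }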


Because they're stupid?

Hurray for Java, the most verbose and dog-slow language ever invented.


I apologize for the off-topic comment but I am hoping a few of the folks on here who are familiar with Hadoop can help me with a small career decision.

I'm fortunate enough to be up for two systems positions with companies in my area, and one of them is part of a group maintaining a Hadoop cluster. I've never maintained Hadoop infrastructure before, so I'm wondering if it's worth the "career capital" investment. There don't appear to be many openings for Hadoop sysadmins that I know of, and while I don't expect to lose any skills working on it, I wonder whether the platform is realistically expected to grow and become something I could build a valuable niche set of skills around...

Any help would be appreciated...


I would say go for it. Hadoop is becoming more of a norm in industry and I can tell you from personal experience that having prior hands-on Hadoop work makes your resume pop a little more.

Even better is if you're able to muck around in core Hadoop versus abstracted management layers like Cloudera. The things you have to learn while maintaining a Hadoop cluster (cloud systems management, tuning, running jobs, and so on) make for valuable experience, even if it's niche.


Definitely the right direction:

http://www.itjobswatch.co.uk/jobs/uk/hadoop.do

http://www.pcworld.com/businesscenter/article/255142/idc_exp...

http://www.mckinsey.com/Insights/MGI/Research/Technology_and...

Although I'd second the point that getting your hands dirty with the Hadoop core is likely more valuable than straight sysadmin work.


I don't see exploding demand for this. However, the experience will likely translate to other domains and make you better for it. I'm not a sysadmin, but I set up a small Hadoop cluster once... and wow, talk about a learning experience!


Perhaps I'm just jaded from NSDI a few weeks ago, but I'm tired of papers about Hadoop and systems like it. If you're going to do work with Hadoop, at least start comparing what you're doing with other work that improves Hadoop. There have been a bunch of papers about improving the scheduling or the shuffle phase or whatever, but they all compare against vanilla Hadoop and not against each other.


Down for me, can someone paste the contents?


I was able to find a cached version here: http://webcache.googleusercontent.com/search?q=cache:http://...


Down for me too. I can't see a cached version in Google either.


Site is back up. Thanks for your patience!



