

Ask HN: Jobs/skills required in big data - vishal_bhavsar

Hello,

As someone who is interested in data, I am considering diving into the world
of 'big data'.

I'm looking to see if someone can direct/help me in figuring out the main
skills required to get a job, create a company, or work at a startup in this
area.
======
paulsutter
Be a good developer and love working with data. Java is the most common
language of the open source big data tools. If you're a statistician learn to
code, and if you're a developer you should have a good data intuition.

You might work for a while with a consulting firm like Think Big Analytics to
get some solid experience with the major tools.

At Quantcast, we process 10 petabytes a day but few people we hired had a big
data background. We actually told the sourcers to stop using Hadoop as a
search keyword because so many people add it to their resumes, whether or not
they know how to work with data.

~~~
vishal_bhavsar
Interesting... thank you for your response.

------
philgo20
I would start by looking at the job descriptions at hot "big data" startups
to see what skills and technologies they list. That should get you started
and help you avoid learning the wrong stuff.

Hadoop?

------
mindcrime
You could do worse than beginning to learn Hadoop, but the Big Data world is
more than just Hadoop. Some other technologies you might want to look into,
depending on which aspect of this you are most interested in:

Statistics / Probability - the theory behind a lot of the analysis that goes
on.

R - data analysis using R often goes hand in hand with the "big data" stuff.

Mahout - a machine learning library, built (mostly) on top of Hadoop.

Incanter - an R-like environment written in Clojure.

Clojure itself would not be a bad idea.

Learn about a couple of message passing / message queuing systems: 0MQ,
HornetQ, Kafka, etc.

Streaming computation / real-time (or near-real-time) map/reduce: Storm, S4, etc.

On a semi-related note, log collection tools like Flume and Scribe.

A bit about RPC systems and data protocols like Thrift, Protocol Buffers,
Etch, etc. wouldn't hurt.

Knowing about some distributed filesystems would be good: HDFS, and maybe a
touch of knowledge about some of the "old school" ones like PVFS.

On the subject of "old school", there is still a place for MPI and OpenMP for
solving some types of problems on large clusters. Learning MPI is never a bad
thing.

Learning about some modern NoSQL data stores like HBase, Cassandra, MongoDB,
etc. would be good.

Basic Linux sysadmin skills, of course.

Knowing about virtualization technologies: Xen, VirtualBox, KVM, whatever.

Automated deployment / configuration management / remote command execution
tools: Puppet, Chef, cfengine, Fabric, etc.

~~~
vishal_bhavsar
Wow... thank you for that response.

I do have some stats background: I did advanced econometrics at the
university level, and we used R as the main program there, so I'm familiar on
that end.

I'm looking to see if I can learn more about it, because I do enjoy analyzing
data, but I have relatively little coding/programming experience.

I looked into it today; I'm still a little lost, but I have a bit more
information now. I'm trying to weigh the pros and cons of diving into the
field at this point in my life, as I am not sure how long it would take to
learn enough to become good at it and get a good job/start a company.

~~~
mindcrime
Cool. To elaborate a bit more... while "Big Data" is more than just Hadoop,
I'd say that Hadoop is, for now, the Elephant In The Room (no pun intended)
when it comes to Big Data processing. If we take that as a truism, I'd say:

If you want to get into the programming side of this stuff, you would be well
served to spend some time learning Java. I know Java isn't considered the
"coolest" language to know anymore, but if you want to work with Hadoop using
the native APIs, well... it's written in Java. Now, that's not to say that you
can't use Hadoop to distribute work in other languages (say, Python), but if
you want to master Hadoop, some Java background will help, IMO.
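To make the Python route concrete: Hadoop Streaming lets any program that
reads lines on stdin and writes tab-separated key/value lines to stdout act
as a mapper or reducer. Here's a minimal word-count sketch under that
contract (the script name `wc.py` and exact invocation are illustrative, not
from this thread):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    """Sum counts per word. Input must be sorted by key, which
    Hadoop's shuffle phase guarantees between map and reduce."""
    parsed = (line.strip().split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Under Hadoop Streaming this script would run as both phases,
    # roughly: hadoop jar hadoop-streaming.jar \
    #   -mapper "wc.py map" -reducer "wc.py reduce" -input ... -output ...
    stage = reducer if (len(sys.argv) > 1 and sys.argv[1] == "reduce") else mapper
    for out in stage(sys.stdin):
        print(out)
```

You can emulate the whole pipeline locally with
`cat input.txt | python wc.py map | sort | python wc.py reduce`, which is a
decent way to get the map/shuffle/reduce model into your head before touching
a real cluster.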

Of course, one nice thing about Hadoop is that you don't _have_ to write
directly to the map/reduce API in Java. There are layers built on top of it
that abstract some of that away; you get things like Pig, Cascading, etc.
that make Hadoop easier to use. But knowing some Java will help if you want to
hack on Hadoop itself, or if you need to write directly to the API for some
reason.

If you have a good stats background but not much programming, I'd recommend
starting with Java + Hadoop and then growing outwards from there, pulling in
other technologies (see above) as you go.

