
A programmer’s guide to big data: tools to know - mwetzler
http://gigaom.com/data/a-programmers-guide-to-big-data-12-tools-to-know/
======
briandoll
Here's another list, sans product hype:

Unix command-line tools like awk and grep. Set operation commands are
essential too (<http://www.catonmat.net/blog/set-operations-in-unix-shell/>)

Ruby/Python/Perl for more complex massaging and wrangling

Excel (yes, really) for quick stats and graphs, great 1st step in
understanding what you have

D3.js for visualization

I've used R in the past, but I found I was trying to squeeze data into R
models. D3 is pretty hard to grasp initially (I'm very much still learning)
but I'm finding that it's helping me think through how to visualize the data I
have, rather than just forcing it into one of a few standard charts.
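
The set-operations trick from that link is worth spelling out, since it covers a surprising amount of day-to-day wrangling. A minimal sketch with `comm` and `sort` (file names and contents are made up):

```shell
# Two toy "datasets", one value per line, already sorted (comm requires sorted input).
printf 'alice\nbob\ncarol\n' > a.txt
printf 'bob\ncarol\ndave\n'  > b.txt

comm -12 a.txt b.txt   # lines in both:   bob, carol
comm -23 a.txt b.txt   # lines only in a: alice
sort -u a.txt b.txt    # union:           alice, bob, carol, dave
```

Intersection, difference, and union of two extracts in three one-liners, no database required.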

~~~
dxbydt
This isn't big data at all. This is the stuff you'd use AFTER you've processed
your big data and reduced it to a small-data result, on the order of a few
hundred MB. To process the big data in the first place, on the order of a few
PB, you'd typically use HDFS and, at the programmatic level, an API like
Cascading or, better yet, Scalding. So you end up needing only that last item,
D3.js, with everything else existing as Scalding source.

~~~
saosebastiao
This is where "Big Data" becomes about as useful as a penis measuring contest.
The vast majority of business data analysis happens at MS Excel levels of
data. I have done analysis on several GBs of data using R, and those datasets
are easily in the 99th percentile of data size in my field. For datasets
bigger than that, SQL is still pretty damn good. And by the time SQL starts to
break down in usefulness, you are at a level where you can start sampling and
still get perfectly usable results in almost all use cases.
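
(And sampling doesn't even need a database. A uniform row sample of a flat file is one command; a toy sketch, file name made up, assuming GNU `shuf` is available:)

```shell
# Toy file standing in for a big extract (hypothetical).
seq 1 1000 > rows.txt

# Uniform random sample of 10 rows; for a quick look at the data this
# often beats waiting on a full scan of the whole thing.
shuf -n 10 rows.txt
```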

In my experience, "Big Data" is a misnomer. What it really means is fast data.
You have a repeatable calculation that runs perfectly fine on an Oracle
cluster in 3 hours, but you want it done in 20 minutes. That is what Big Data
is really used for. Anything else is just
marketing hype.

~~~
bravura
No. Big data has a formal definition, but one that most people ignore:

When the size of the data becomes part of the problem.

(An O'Reilly author said this once, I forget his name.)

For example, physicists in the '80s who had tens of MBs of data had a big data
problem.

Nowadays, this typically means out-of-core data sets.

~~~
saosebastiao
That definition is but one of many supposed formal definitions. And it isn't a
particularly good one, because it makes "Big Data" a term that can only be
quantified per person. To the guy with the Marketing degree from Ho-Hum
College, anything bigger than what fits in an Excel spreadsheet is "Big Data".
Ask Doug Cutting, and "Big Data" is in the hundreds of petabytes.

I don't buy the out-of-core definition either. I work with a data warehouse
holding petabytes of information...it certainly doesn't fit in memory. I don't
use Hive, Pig, Cascading, etc. (okay, sometimes I use Cascalog, but not as a
strategy for dealing with large amounts of data). I use SQL. And it works
perfectly fine. But if you ask any of the people out there talking about "Big
Data", an SQL database doesn't fit the definition. Hell, I have processed a
200GB CSV file using nothing more than GAWK. Nobody calls GAWK a big data
tool.
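
(The pattern that makes that feasible is worth spelling out: awk streams the file a line at a time and keeps only the running aggregates in memory, so memory use is constant no matter how big the file is. A toy sketch with made-up file and column layout; plain POSIX awk works too, same one-liner whether the CSV is 3 lines or 200GB:)

```shell
# Toy sales.csv (hypothetical): region,amount per line.
printf 'east,10\nwest,20\neast,5\n' > sales.csv

# Sum column 2 grouped by column 1; sort only to make output order stable.
awk -F, '{ t[$1] += $2 } END { for (k in t) print k, t[k] }' sales.csv | sort
# east 15
# west 20
```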

Face it. "Big Data" is a buzzword for CIOs that read magazines for CIOs but
still need to find an engineer to set up their email on their iPad.

~~~
EwanToo
Precisely: the term "big data" is about selling stuff to CIOs.

Nobody at Yahoo said "we need Hadoop to deal with our Big Data problem"; it
was simply a "very large amount of data on a relatively limited budget"
problem, and plenty of very large companies are happy using Teradata or
Netezza to manage petabytes of information.

The new set of tools are often brilliant, but the problems that they solve are
almost all not new.

------
hayksaakian
I still don't understand big data.

Is it machine learning + analytics?

~~~
TallGuyShort
Although I believe "big" usually refers to the size of the dataset, I
originally heard the term used in the same sense as "Big Oil" - and the
implication was that as companies collected massive amounts of data and
figured out how to profit from it, "data" was going to become the next high-
value commodity.

edit: Although to answer your actual question, it's a bit like "cloud
computing" - it probably refers to very scalable systems with an emphasis on
reading, writing, and processing large amounts of data - but really it's a bit
of a marketing term :)

~~~
mnicole
To branch off of your comment, which is the best description of it: I think it
should be noted that for the past two years this has been the most-nominated
topic across all of our events for both information and security officers
(the company I work for empowers F1000 C-suite executives to come up with
relevant material for their peers to discuss annually). Companies are
inundated with everything from web analytics to the feedback and trends
they're seeing on social media to sales numbers and any other quantifiable
data, and they're trying to tie it all together so they can make changes or
create new opportunities in their markets.

~~~
hayksaakian
Oh OK, that makes more sense.

Big data is actually the problem then, not the solution?

Meaning there's too much disparate data, and now we can try to bring it
together.

~~~
mnicole
Yup!

------
elchief
I don't get it. Why not just learn Hive (it's SQL), or use Python with Hadoop?
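
Hadoop streaming in particular is approachable because the mapper/reducer contract is just stdin and stdout, so you can dry-run the logic locally with shell pipes standing in for the shuffle. A word-count sketch (file name made up; `sort` plays the role of the shuffle, `uniq -c` the reducer):

```shell
printf 'big data big hype\n' > input.txt

# map: emit one key per line; shuffle: sort; reduce: count runs of equal keys.
tr -s ' ' '\n' < input.txt | sort | uniq -c | awk '{ print $2, $1 }'
# big 2
# data 1
# hype 1
```

The same mapper and reducer scripts, unchanged, are what you'd hand to the streaming jar to run across a cluster.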

