

Let’s Talk Your Data, Not Big Data - bsg75
http://www.wired.com/insights/2012/11/lets-talk-your-data-not-big-data/

======
tumanian
A former Yahoo big data engineer here. At some point I was bombarded with
enquiries from startups looking for a "Big Data Guy". The conversation would
go like this:

    
    
      - (Me) So, how much data are we talking about?
      - (them) 50GB
      - Per hour? 
      - No, that's our dataset so far. 
      - (pause)
      - (pause)
      (FIN)

~~~
kinleyd
Nice anecdote. :)

------
dbecker
The wild confusion around "big data" stems in part from the fact that many
people use it to mean something unrelated to data size.

I was recently in a back-and-forth on Twitter about this. Some people argued
that "big data" refers to the complexity of the analysis or the value of the
insight, rather than the size of the data.

Kaggle CEO Anthony Goldbloom advocated for the definition "too big to fit in
an Excel spreadsheet."

I advocated for the definition "large enough that storage and manipulation
become part of the challenge (in addition to the analysis)."

The phrase has taken on so many definitions as to become meaningless.

~~~
bravura
I argue for your definition:

"Big data is when the size of the data becomes part of the problem."

Mike Loukides (O'Reilly) mentioned to me that Roger Magoulas (O'Reilly) was
the first he heard using that definition.

According to this definition, physicists in the '80s were doing big data.

~~~
xradionut
People in science and commerce know data issues; they've been dealing with
the practice, analysis, and theory for years.

O'Reilly's just pimpin' the term "Big Data" to sell books. They did the same
with "Web 2.0", Java, and the initial internet boom.

~~~
dbecker
O'Reilly may be pimping the term to developers, but that's nothing compared
to how Cloudera has pimped "Big Data" to the enterprise.

------
stdbrouw
There's a (potentially apocryphal) story that the term "monkey patch" comes
from someone mishearing the then-current term "guerrilla patch" as the
similar-sounding "gorilla patch".

Big data seems like a very similar story to me. It has nothing to do with the
size of the data, nor even with the complexity of the analysis; it simply
means "a movement that wants to do more with more kinds of data." That's
nowhere near what big data used to mean, but outside of technical circles
we're past the point where that even matters.

------
supercanuck
I just read this Wired article, and now I'm wondering whether I just read an
advertisement for Chartio.

~~~
smoyer
Yes .... Yes you did (and I'm downvoting the parent for that very reason).

P.S. If I could downvote twice, I'd give it a second one for turning a
sentence into an article.

------
bsg75
"The paper concludes by advising analysts to not go through the Hadoop hoops
until your data size passes standard hard drive limits (currently around 1
Terabyte) or at least reasonable memory limits (512 GB)."

Is a TB really considered Big currently?

~~~
juiceandjuice
Unless you've got a ridiculous RAID array, there are plenty of times when
processing anything over 100GB without a cluster is going to suck.
Sequentially reading 1TB of data is going to take a few hours even with a
RAID array, or be very expensive with SSDs.
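
A quick back-of-envelope sketch in Python (the throughput figures are my
assumptions for roughly that era of hardware, not measurements):

    # Rough sequential-read times for 1 TB at assumed throughputs.
    TB = 10**12  # bytes
    
    for label, mb_per_s in [("single HDD", 120),
                            ("4-disk RAID 0", 480),
                            ("SATA SSD", 500)]:
        hours = TB / (mb_per_s * 10**6) / 3600
        print(f"{label}: ~{hours:.1f} h")
    
    # single HDD: ~2.3 h; 4-disk RAID 0: ~0.6 h; SATA SSD: ~0.6 h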

I maintain a database that's over 1TB, and Oracle handles it very well. The
trick is that under absolutely no circumstances should you ever need to do a
full table scan, because the tables aren't designed for that.
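
The principle, as a toy Python sketch (a dict stands in for a real index;
purely illustrative, nothing like Oracle internals):

    # An index means a lookup touches one row; a full scan touches them all.
    rows = [{"id": i, "value": i * 2} for i in range(1_000_000)]
    by_id = {row["id"]: row for row in rows}  # the "index"
    
    one_row = by_id[123_456]                  # indexed access: O(1)
    big = [r for r in rows if r["value"] > 1_900_000]  # full scan: O(n)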

So I'd argue that a monolithic database is okay up to 10TB, even with slower
disks, as long as you never need to touch more than 10% of it. If you need to
touch 100% of the data 100% of the time, I'd say anything over 100GB is too
big for one machine.

The reality is that it just depends. There are times when you're going to
want a Hadoop cluster for only 16GB of data, and times when a single database
is going to be fine with 10TB of data.

------
sauravc
This pretty much sums up most Hadoop projects I've seen at startups:
<http://wavii.files.wordpress.com/2011/12/hadoop_too_big.jpg>

(Originally from this Wavii Engineering blog post
<http://blog.wavii.com/2011/12/29/your-mileage-may-vary/> )

------
rdtsc
I have my own definition of big data, or maybe this is the next thing after
big data: "streaming data".

Data being generated non-stop, at a high enough rate that it doesn't make
sense to store it. You can only analyse it, extract relevant statistics or
some features, and move on.

Storing it just means putting it in a huge buffer: as new data comes in, the
old data falls off the end.
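
A minimal sketch of the idea in Python (the window size and the statistics
kept are illustrative assumptions, not a real pipeline):

    from collections import deque
    
    class StreamStats:
        """Keep O(1) running stats plus a bounded buffer of recent data."""
    
        def __init__(self, window=10_000):      # window size is an assumption
            self.recent = deque(maxlen=window)  # old data falls off the end
            self.count = 0
            self.total = 0.0
    
        def consume(self, value):
            # Extract what we need, then let the raw value expire.
            self.count += 1
            self.total += value
            self.recent.append(value)
    
        @property
        def mean(self):
            return self.total / self.count if self.count else 0.0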

In some situations, where products are up 24/7 across multiple time zones,
there is no time for offline batch processing. By the time one batch has
finished, there is a newer, possibly bigger batch waiting, and so on.

------
besquared
At Yammer we used to have a phrase that analytics was "domain knowledge +
counting". Lately we've moved to "analytics is workflow", which better
captures the deep integration of data, analysis, and business decisions that
we have now. I still think the first phrase matters a lot. You don't
necessarily need fancy technology: if you can count, and you know everything
about your business and what makes it valuable, then you can start building
real insights.

------
mecredis
The only definition of big data I've found remotely insightful is "too large
to process with one machine."

Note (as the example in the Wired article indicates) that the converse isn't
true: just because you are using multiple machines to process data doesn't
mean it's big.

