

Ask HN: How big is Big Data? - xtacy

There's a lot of buzz around Hadoop and Big Data, but I really wonder how big is Big Data?  How do big data startups measure this?<p>In terms of total storage, is it a few Terabytes for most enterprises?
======
christopheraden
I've had a similar question before. I have heard of as few as 60,000
observations being considered "big data" [1], yet at my company, we generate
about 60 million pharmacy claims every 3 months, and no one here calls it big
data. In terms of storage, it's on the order of a couple hundred terabytes for
all our data. This is considered small enough that we can query it with
traditional SQL.

Big Data, the experts say, is more about the novel way you analyze data,
relative to the difficulty of the problem. A speaker at PyCon who was talking
about algorithms and data structures for handling genetic data had a term that
I like quite a bit better: "Data of Unusual Size" (C. Titus Brown at MSU was
the speaker).

"Big Data" is a really big buzzword right now, but the term is overused and
often does not convey the meaning it's supposed to. "Big" is a relative term.
What makes data "big" is the novelty of the volume relative to how much used
to be handled (in the sumo case [1], they had never dealt with so much data
before).

As for Hadoop, you'd want to use it when it's no longer feasible to keep your
data stored in an RDBMS, when speed becomes an issue, or when you want your
schema to be more flexible than an RDBMS allows. If you are not concerned with
strict reliability guarantees (an RDBMS makes the safety of the data a
paramount priority--read the Wikipedia page on ACID [2] to see these
guarantees), there are plenty of reasons to choose Hadoop [3].
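To make the Hadoop point concrete, here is a minimal sketch of the MapReduce pattern Hadoop implements, using hypothetical claim records (the `pharmacy_id` field and the records themselves are made up for illustration). Hadoop's value is that it distributes these two phases across many machines; the logic itself stays this small:

```python
# Sketch of MapReduce: map each record to (key, value) pairs, then
# reduce all values that share a key. Hadoop runs these two phases
# in parallel across a cluster; here we run them in-process.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Emit (pharmacy_id, 1) for each claim record.
    for rec in records:
        yield rec["pharmacy_id"], 1

def reduce_phase(pairs):
    # Group pairs by key (after sorting, as Hadoop's shuffle does)
    # and sum the counts per key.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(v for _, v in group)

claims = [{"pharmacy_id": "A"}, {"pharmacy_id": "B"}, {"pharmacy_id": "A"}]
print(dict(reduce_phase(map_phase(claims))))  # {'A': 2, 'B': 1}
```

The same mapper and reducer could be fed to Hadoop Streaming unchanged in spirit; only the plumbing differs.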

[1]: <http://www.wired.com/wiredenterprise/2013/03/big-data/>

[2]: <http://en.wikipedia.org/wiki/ACID>

[3]: <http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/>

~~~
Joyfield
I would say that "a couple hundred terabytes" IS pretty big.

~~~
christopheraden
The types of queries we run against it don't require real-time results, and we
do a pretty heavy amount of subsetting. By the time it reaches the point where
we do numerical summaries and statistics, the largest set I've worked with
here was around 30GB. Most times it's around 5-10GB.
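That subset-then-analyze workflow can be sketched roughly like this (the schema, drug names, and dates are all hypothetical, and SQLite stands in for the actual warehouse): the heavy filtering happens in the query, so only a small slice ever reaches the statistics step.

```python
# Sketch: a strict warehouse filter reduces many claim rows to a
# small subset before any numerical summary is computed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER, drug TEXT, filled TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", [
    (1, "statin", "2013-01-15", 42.0),
    (2, "statin", "2012-06-01", 18.5),   # outside the recent window
    (3, "insulin", "2013-02-20", 77.0),  # wrong drug class
    (4, "statin", "2013-03-02", 55.0),
])

# Strict filter: one drug class, most recent quarter only.
rows = conn.execute(
    "SELECT amount FROM claims "
    "WHERE drug = 'statin' AND filled >= '2013-01-01' "
    "ORDER BY filled"
).fetchall()
amounts = [a for (a,) in rows]
print(len(amounts), sum(amounts) / len(amounts))  # prints: 2 48.5
```

Scale the table up by nine orders of magnitude and the shape of the job is the same: the query does the work, the summary fits in memory.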

~~~
xtacy
5-10GB seems small for analysis. Is this sampled across your historical
records, or just the most recent? What's the turnaround time for stats today,
and what's your pain point? (Slow IO? Lack of high level programming
frameworks? Or something else?)

~~~
christopheraden
Most of the work I do involves recent data--cycles are six months at most and
3 months on average. With 3 months of data sifted through a pretty strict
filter, it's not surprising that hundreds of terabytes of claims accumulated
over the years get filtered down to a few GB.

Start to finish on jobs is a few hours (though it can be a few days if the
filter is less strict or the task is more complicated), including pulling the
data from the warehouse.

Without a doubt, the bottleneck of the process is the data warehouse query.
I'm sure a more distributed database (it's DB2--I've been more pleased with
Teradata's speed) could make the queries faster, but things are slow to
change. Second to the query, working with a remote server (I'm in CA--the
server's in Minnesota) adds latency whenever there's something I need to pull
to the local machine. The actual SAS code (is it still "high level" if the
syntax models Fortran? I kid, I kid.) takes a negligible amount of time
compared to the query.

------
maclee
<http://www.youtube.com/watch?v=B27SpLOOhWw>

Look at the video above; it will give you a good idea. I'm not going to go
into examples, as the video gives some good ones. A terabyte of data is
nothing these days; I see terabyte-scale databases within companies on a
regular basis in my job. The size does not matter; the three big components of
"Big Data" are: multiple sources of information in multiple formats, volume of
data, and the rate of ingest of new incoming data. The basic idea is to be
able to process all of the incoming data within your company and get some kind
of intelligent information out of it that you can use.

------
chris_dcosta
In my experience--and I have worked in this field since 2001--"Big Data" is
the answer to the problem of poorly implemented reporting models. The excuse
for bad performance has always been "the size of the dataset," not the real
culprit: a lack of technical knowledge about how to build an appropriate,
performant model. The size of the dataset is almost irrelevant when it's done
right, but when it's done wrong, even a small (100,000-record) dataset can be
sold as "the problem."

Business people love to chase a holy grail, this being yet another one.

------
brudgers
_"If a program manipulates a large amount of data, it does so in a small
number of ways."_ --Alan Perlis

"Big Data" has an operational definition: it's relative to current technology.
Less than two decades ago, a terabyte was big enough that Microsoft created
TerraServer as a technology demo [<http://en.wikipedia.org/wiki/TerraServer-
USA>]. TerraServer would dwarf the big data of the time when Perlis wrote
Epigram 4. Today, TerraServer is dwarfed by YouTube.

------
ibudiallo
I don't know if I'm alone in this, but I have this question.

What is Big Data? I hear it a lot from non-technical people, so my first
thought was that it is a company. After doing a little research, I am starting
to think it is a concept. Sometimes it sounds like a marketing term. What is
it?

~~~
mcintyre1994
Pretty much a marketing term. It's basically the processing of data of an
undetermined size, deemed to be "big". You might consider Google or Facebook
as companies that process lots of data, so you might describe their operations
as "big data". It is pretty much just a marketing term now, though--"big" is
insanely relative.

------
webnrrd2k
I think of it in more human terms--it has more to do with _time to process_
than actual size. I think data becomes _big_ when it takes longer than I'd
like to answer the questions I want answered. As a corollary, the longer it
takes, the "bigger" it becomes.

------
johnward
Not as big as extreme data.

