

You don't have big data - liz_mongohq
http://blog.mongohq.com/you-dont-have-big-data/

======
leoc
> “Big Data” data tends to be cold data, that is, data that you aren’t
> actively accessing and, apart from analyzing it, probably never will.
> In fact, apart from analysis, it could be regarded as frozen. It may be fed
> with fresh rapidly cooling records and the cooling records analyzed for
> up-to-date analysis, but the “Big Data” pool should be at least
> conceptually separated from the live data; mingling the two’s
> requirements can easily end up in an unsatisfactory
> lowest-common-capability situation where neither is optimal.

So in other words, “Big Data” is what used to be known, somewhat less sexily,
as “data warehousing”?

~~~
delluminatus
So the author is suggesting.

I think that the two are different, though. The difference: _scale_. Data
warehousing solutions simply can't scale to "big data". Many companies have a
"data warehouse" that is usually a single (albeit very beefy) server running
SQL Server. But "big data" is the kind of data that can't be supported by a
single server or by a traditional database. It's the difference between
warehousing activity in a network of 10k employees vs. warehousing activity
on a website with a million daily uniques.

That's my impression, anyway -- "data warehousing" is dominated by the RDBMS,
but "big data" is dominated by highly scalable, distributed databases.

~~~
ironchef
I'm going to somewhat disagree. Traditional data warehouses (teradata for
example) can handle well into the petabytes. I think Ebay's main TD cluster
was around 15 or 20 PB last I knew.

I think you're closer to the mark with the single server vs. N servers
distinction. Every data warehouse I've worked on in the past decade or so has
been multi-server.

~~~
delluminatus
Well, I think Teradata probably falls under the big data umbrella, given that
it's basically a distributed database, right? I was thinking more about just
plain Oracle or SQL Server or what have you, which can scale pretty big, big
enough for 99% of companies, but not "big", and are notoriously difficult to
cluster/distribute.

~~~
ironchef
Fair enough. I think the reason people struggle with classifying this stuff is
because there are a lot of gray areas and it's difficult to verbalize all the
details meaningfully. One could say "Oh...RDBMSes!"...but teradata is still
relational...it's just distributed. "Oh! OLAP vs OLTP!"...but it's not really
that either. One _could_ use teradata for a couple gig data mart just fine
(and argue that it's not big data). That's why I tend to fall back on the
three or four Vs that typify "big data" (volume, velocity, variety,
veracity).

------
jph
> “Big Data” data tends to be cold data

My experience is the opposite: big data tends to be a lot of rapid streaming
data, near-real-time, with inputs from many sources and sensors. This
overwhelms traditional databases.

For us, big data means we want stream filtering, heuristic sampling, map
reduce, and the like.
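
To give a concrete flavor of the sampling piece: reservoir sampling keeps a
fixed-size uniform sample no matter how fast the stream arrives. A minimal
sketch in Python (the event source and parsing are made up for illustration):

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k records from a stream of
        # unknown length (Algorithm R).
        sample = []
        for i, record in enumerate(stream):
            if i < k:
                sample.append(record)
            else:
                # Replace an existing element with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    sample[j] = record
        return sample

    # Hypothetical usage: keep 1,000 events out of millions streaming in.
    # events = (parse(line) for line in sensor_feed)
    # subset = reservoir_sample(events, 1000)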

~~~
oijaf888
How often do you go over the raw data, in a manner that aggregates couldn't
help you with, for data from 6+ months ago?

~~~
jph
Never. "Big Data" for this use case is entirely about velocity, not volume. I
suppose a better catchphrase would be "Fast Data". :)

~~~
glesica
Yes, I think "streaming data" and "big data" really are different things, and
I like your term "Fast Data". From a lower-level perspective they involve very
different challenges.

If you need to store the data (Twitter-like data) then storage becomes your
primary concern. Analysis can be done later and in less-than-real time. But if
your data are coming in so fast that keeping up is the problem, then storage
isn't even a consideration (you generally just don't store it) and analysis
becomes the challenging part: you need to aggregate to be able to update your
metrics on the fly.
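
A rough sketch of what I mean by aggregating on the fly (field names are
hypothetical; this is just to show the shape of it): each event is folded
into running counters as it arrives and the raw event is thrown away.

    from collections import defaultdict

    # Running aggregates, updated per event; the raw events are never stored.
    counts = defaultdict(int)
    latency_totals = defaultdict(float)

    def update_metrics(event):
        # Fold one incoming event into the running aggregates.
        key = event["endpoint"]                      # hypothetical field
        counts[key] += 1
        latency_totals[key] += event["latency_ms"]   # hypothetical field

    def mean_latency(key):
        return latency_totals[key] / counts[key] if counts[key] else 0.0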

So they really aren't even overlapping problems outside of the fact that they
both deal with lots of data.

------
ironchef
The three traits typically used in discussing big data are volume (size of the
data), velocity (speed of data entry / exit... or, as you call it, how "hot"
the data is), and variety (number of different sources / schemas, etc.)...
with veracity (trust in the data) sometimes added as a fourth. Is DJ/Liz's
assertion that if you don't have volume, you're not dealing with big data?

I guess my response would be... who cares?

While the techniques and infrastructure for high volume can be different from
those for high velocity (near-time / real-time analytics is different from the
more batch-based, whole-hog analytics associated with high volumes), they are
often related. Similarly, there are often commonalities in dealing with the
various Vs, which may just differ in scale.

Would they argue that one shouldn't be looking at using chef/ansible/salt,
etc. as long as you're dealing with a handful of machines? If so, I would
think most would argue that laying a good baseline so you're ready when growth
occurs is a good thing to do. From a "big data" perspective, would they then
argue that one doesn't need an ETL/ELT pipeline process? That one shouldn't
think about how to deal with the high volume? That would seem ... less than
optimal.

------
yawz
We, as an industry, feel the need to invent catchy names for existing things
so that we can sell/market our products. Big Data, Cloud, SOA, (and the worst
of all) Web 2.0, etc. Then we get to debate forever trying to fit definitions
to those names... _sigh_ There's a saying where I come from: _"The village
fool threw a stone into a well; forty wise men couldn't pull it out."_

~~~
sabbatic13
That's what I was thinking. As I read, I kept thinking "but you're just making
up terms and definitions and pretending that there's some objective standard."
Even for fairly standard tech terminology, usage drifts all over the place.

------
jandrewrogers
Translating this blog post:

"MongoDB can't handle Big Data so we are going to redefine Big Data for our
own convenience as cheerleaders of MongoDB and then assert that Big Data, per
our new definition, is really not that important."

The blog post makes quite a few dubious assertions, apparently for the sole
purpose of justifying MongoDB's inadequacies as a large-scale data platform.

~~~
mrkurt
We don't dispute the importance of big data (whatever the definition). What we
tend to find, though, is that customers want to optimize for big data problems
they'll never have. You _can_ run a Mongo cluster for 100s of TB of data (is
that big data?), but it means making application and schema compromises that
most people don't need to make. This post, more than anything, is to help our
customers think properly about their data problems.

So I don't think your translation is correct. I actually think it's pretty far
off. We're probably some of the most cynical MongoDB users you'll meet.

(disclosure: I'm one of the founders of MongoHQ)

~~~
jandrewrogers
Being required to make application and schema compromises in order to scale to
tens or hundreds of terabytes is a symptom of the inadequacy I was referring
to. It is not a property of databases generally, it is a property of MongoDB.

I get the argument that customers should not over-engineer their database
systems but in other databases a lot of that "over-engineering" comes almost
for free in terms of user effort.

Also, for many types of analytics, there really isn't a concept of "cold"
data. A single query should be able to access data inserted milliseconds ago
and data inserted a month ago as though it were in the same table. A lot of
"real-time" analytics work this way. This does not need to be done purely in-
memory if the storage engine is designed well. The old OLTP/OLAP dichotomy of
the 1990s has been slowly fading for a long time.

~~~
mrkurt
Just to clarify, I meant "compromise" in a broader sense. Scaling databases
requires continuous compromise, usually of flexibility. There's nothing unique
to Mongo about this. In the relational world (actually, with Mongo too) you
end up denormalizing data, which creates complexity. You may also give up
joins, secondary indexes, constraints, etc.

Some DBs handle this by requiring compromise at the very beginning, which
makes sense when you're going to have a huge amount of data from the get go.

There's no DB on the planet that lets you go up into the TB range without
having to make some compromises (either up front or down the road).

~~~
jandrewrogers
You do have to give up the transaction-theoretic elements once you get into
the hundreds of nodes. Or at least, you will notice the sub-linear behavior in
the scaling. Complex updates across multiple records will show some
performance limitations.

For things like joins, query selectivity on multiple columns, etc., not so much.
You don't even need secondary indexing or denormalization to do things like
graph analysis or polygon searches on a table (in the same query even) at
scale. All of the access method related operations can scale very efficiently
to massively parallel systems if you use the appropriate data structures and
algorithms.

MongoDB uses an approximation of the correct algorithms for gigabyte scale
systems. Those algorithms are just wildly inappropriate when you start talking
about terabyte scale systems. I have no investment in MongoDB negative or
positive, but like all databases it is going to be lousy outside of the
implicit scope supported by the design and architecture. In the specific case
of MongoDB, and as someone that has designed their share of database engines,
the internals are not designed to support non-small databases to any
significant extent.

And honestly, a terabyte is a pretty trivial database these days. That is the
kind of thing you run on a single server with ease. Smoothly scaling that to
dozens of nodes as though it was a single system is something you can buy. I
really don't understand the assertion that scaling to 10TB is difficult or
requires anything different than scaling to 10GB. That is demonstrably untrue.

------
shawnee_
There's probably a parallel somewhere here between Jevons Paradox and this
idea of what big data is all about. Jevons Paradox was the observation that
the consumption of coal increased, rather than decreased, after steam engines
came along (the original expectation was that steam engines using less coal
would lead to less coal consumption overall). But what happened (obvious to us
in the 21st century) was that steam engines made _transportation_ itself more
energy-efficient, which created an increase in demand for fuel, of which coal
was the primary type available at the time.

In other words: the less costly a useful task becomes, the more we tend to do
it "just because".

Seems like the author is trying to say: storing a lot of data vertically "just
because" does not big data make. But the author does not fully explain why
this is significant.

It is significant, though, because Jevons Paradox would predict that
eventually companies _will_ want to access so-called "frozen" data more
frequently and will want to make that frequent access as cost-effective (and
payload-efficient) as possible, which distributed NoSQL-type DBs do very well.

~~~
dredmorbius
A college professor of mine observed something quite similar: he'd been in
administration and had returned to teach for a final year before retiring. His
observation: whenever you build a computer system, you've got to design it for
much more than the scale you think you'll need, because people will come up
with uses for it which you hadn't anticipated.

Granted, this was the 1980s when mainframe / centralized computing was big
(client-server was still a few years off), and the desktop revolution was just
getting underway.

------
izzydata
What exactly is the author's point? Is he just upset about the terminology
usage? Is he implying that he has "big-data" like it is some kind of pissing
contest? He never even explains what kind of scale he considers to be "big-
data". He just says everything is lots of data and nothing is big data at all.

Also his usage of childish memes makes me unable to take it seriously.

~~~
YZF
I think the point is that people often use the wrong tools thinking they have
"big data" and therefore they must use the same techniques someone like Google
uses. The difference is most likely that Google has many orders of magnitude
more data and does orders of magnitude more processing on it.

I think it is useful to have some rules of thumb as to when you need to apply
more exotic techniques and tools vs. something where simple stuff works. So in
that sense asking do you have "big data" or not can be useful...

~~~
im3w1l
So let's skip the discussion about how much data is _a lot_.

Around which dataset sizes are which methods appropriate?

------
gtrubetskoy
To understand "big data", you need to read this book (it's free)
[http://infolab.stanford.edu/~ullman/mmds.html](http://infolab.stanford.edu/~ullman/mmds.html).
Anything else anyone blogs about regarding the definition of big, cold, or
whatever buzzwords (especially on a mongo-related site) is water under the
bridge.

------
exelius
So "big data" is just log data that has no apparent use? I would tend to
disagree: you can perform "big data" operations on the same data sets you
perform traditional analytics, you're just looking for different things.

Really, why don't we put down the pitchforks and call "big data" what it
really is: data mining. You can perform data mining on small sets and large
sets. Nearly infinitely large sets, given the tools to manage that data. "Big
data" is just a buzzword that doesn't mean anything more than data mining
(which includes machine learning, AI methods, etc) on a really large data set.

------
blauwbilgorgel
Data size is relative to company size. If I work with a 120GB clicklog file
for a company with 10 employees, where no other employee has the tools or
know-how to work with such a dataset, then that data is (treated as) big data.
In the hands of Google, Yahoo, MS or Facebook it would probably look like a
floppy.

No, you probably don't need a Hadoop cluster to work with a 120GB file. With
Python and Pandas you could probably run through it on a budget laptop. But
data will always be relative to your company's size and current know-how. In
the most banal way: data can be big data because some manager can't open it in
Excel.
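
For the 120GB case, something as simple as chunked reads usually does it. A
rough sketch (the file and column names are invented for the example):

    import pandas as pd

    # Stream the click log through memory in chunks instead of loading it
    # all at once; only the needed column is read.
    clicks_per_page = {}
    for chunk in pd.read_csv("clicklog.csv", chunksize=1000000,
                             usecols=["page"]):
        for page, n in chunk["page"].value_counts().items():
            clicks_per_page[page] = clicks_per_page.get(page, 0) + n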

------
freehunter
I've done work for fairly small regional companies who feel that now that they
have an eCommerce site, it means they have Big Data and they now need to spend
a ton of money to manage this Big Data, when all they really have is worthless
data and no understanding of what they could gain if they were collecting the
right data.

Just because you have a lot of data and you don't know how to put it together
doesn't mean you have Big Data. More often than not (in my experience), it
means you have useless data. Take that money and hire someone who actually
understands the right metrics to capture.

------
e12e
I tend to subscribe to the ideas that Joyent has around their Manta product --
it's about "data gravity" and "data velocity". There was an article submitted
earlier to HN that didn't get many votes -- and no comments -- but is well
worth a read:

"Building a Black Swan: Disrupting NetApp, EMC, Amazon’s S3 … maybe all
BigData": [https://medium.com/money-
banking/b8427c23bf0f](https://medium.com/money-banking/b8427c23bf0f)

[https://news.ycombinator.com/item?id=6028334](https://news.ycombinator.com/item?id=6028334)

Also worth skimming is the blog post introducing Manta:

"Hello, Manta: Bringing Unix to Big Data": [http://www.joyent.com/blog/hello-
manta-bringing-unix-to-big-...](http://www.joyent.com/blog/hello-manta-
bringing-unix-to-big-data)

And, (more closely) related to the "you don't have big data", I found this
little post about Manta also interesting:
[http://building.wanelo.com/post/54110156963/a-cost-
effective...](http://building.wanelo.com/post/54110156963/a-cost-effective-
approach-to-scaling-event-based-data)

(as an example of how keeping things simple (here: simple log files) can
combine to provide the scalability needed to collect "BigData" -- and then
the Manta architecture can help with turning that data into information)

(No affiliation with Joyent/Manta -- but they do seem to have a pretty good
product concept/idea -- and even if you're not using Manta or Joyent, building
on their ideas and architecture seems like something that might be useful for
smaller installations too.)

------
josh2600
The article's described origin of Big Data ignores the original big data
problem: insurance, specifically actuarial tables. The origins of big data are
actually far older than 1990, going back to the first mass computing systems
of the old days, which were literally aisles of women working in offices
computing actuarial tables[0].

One could even go further back to the 16th century (and earlier if you're
really adventurous), when the idea of life statistics applying to groups (but
not individuals) was first explained.

In short, 1990 is sort of an arbitrary date and does not accurately reflect
the origin of Big Data. We have wanted to record the sum total of humanity for
as long as we have had the ability to record things; it's simply becoming more
reasonable to attempt today.

[0][http://www.officemuseum.com/1907_Actuarial_Division_Metropol...](http://www.officemuseum.com/1907_Actuarial_Division_Metropolitan_Life_Insurance_Co_NYC.jpg)

------
YZF
Big data is also a moving target.

We can look at size and we can look at rate (read or write).

One size classification is: data we can store in memory vs. on drive. I
wouldn't refer to anything <64GB as "big" data. If you can store it on one
machine it probably shouldn't be considered big from a size perspective (so
let's say 4TB as some sort of threshold).

More generally, if you need more than a single server to store or do real-time
processing of your data I'd say it qualifies as big. Otherwise probably not.

Another factor people often don't look at is efficiency. Yes, if everything
is a JSON string with a lot of static annotations, we can make a small amount
of data look very big. You need to look at a more information-theoretic
"size". The same applies to processing... yes, you can spend many CPU cycles
parsing incoming HTTP requests, but that doesn't mean the data is inherently
big.

------
atomic77
There needs to be a time-dependent definition of "Big Data", because this term
is frequently abused and results in confusion all around.

In my view there has always been "big" data. As others have pointed out, none
of this is really any different from the data warehousing days. How much data
you can store is a function of how much money you are willing to spend, and
all that has changed is the amount of disk storage you get per dollar.

I would propose a definition as follows:

S_t > N * H_t

Where S_t is "big" data when it is greater than some constant # of hard drives
N and H is the size of the average consumer HDD at time t.

So if we assume that today the average HDD is 2TB and we define C as say 200,
big data is 400TB. 10 years ago "big data" would have been say 20TB. Simple.
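
Or, as a couple of lines of code (just restating the numbers above):

    def big_data_threshold_tb(avg_hdd_tb, n=200):
        # "Big" means more data than N average consumer hard drives hold.
        return n * avg_hdd_tb

    big_data_threshold_tb(2.0)   # today, ~2TB drives    -> 400 TB
    big_data_threshold_tb(0.1)   # ~10 years ago, ~100GB -> 20 TB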

~~~
nstott
Whenever I see an algebraic formula written like that, I have this odd
compulsion to try and pronounce it as if it were English.

~~~
atomic77
I read the underscores like a "u"... so I guess a mnemonic for this formula
could be SuT 'N HuT... Sutton Hut?!

Maybe some other letters would make the mnemonic more interesting :)

------
elchief
8-core procs are dirt cheap. Dual-socket mobos are cheap. A quarter terabyte
of RAM is a couple grand. A terabyte drive costs fifty bucks. A 1000-core GPU
card is a couple hundred.

Until you max out a box with those specs, you don't need big data.

~~~
pfarrell
1 terabyte is definitely small data. There's only one company consistently
doing big data and that's, of course, the goog. I like the dynamic definition
that big data is only what is not easily handled by existing tools. The tools
of the past few years (Hive, Cascalog, Elasticsearch, etc.) and some of the
emerging ones (Shark, Impala, Druid, etc.) have radically raised the bar for
what problems smaller shops can handle.

We're handling a very modest data stream (1,000 messages/sec) with just a
few people. That's thanks to the awesome tools coming out of this fruitful
(and overhyped) industry.

------
lucb1e
I tell everyone that it's not big data that they're handling, but it's like
speaking to deaf people. It's even in radio spots, "handle and organize big
data with our tools."

If you want to refer to real big data, perhaps just switch to the term "50PB
of data" or whatever order of magnitude it is.

------
PaulHoule
Not if you use mongo

