

How I came to love big data - ejpastorino
http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence

======
meritt
Would love to see if indexes and a sane schema were used for the RDBMS case.
I've built extremely large reporting databases (Dimensional Modeling
techniques from Kimball) that perform exceedingly well for very ad hoc
queries. If your query patterns are even somewhat predictable and occur
frequently, it's far better to have a properly structured and indexed
database than to use the "let's analyze every single data element on every
single query!" approach that is implicit in Hadoop and MR.

Not to mention the massive cost-savings from using the right technology with a
small footprint versus using a brute-force approach and a large cluster of
machines.
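
For a toy illustration of the point (a minimal sketch with sqlite3; the
events table and its columns are hypothetical, not an actual schema):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_time TEXT, publisher TEXT,"
                 " clicks INTEGER, revenue REAL)")

    # With an index on the predicate column, the database touches only the
    # rows in range; without it, every query is a full table scan -- the
    # same work MR re-does from scratch on every job.
    conn.execute("CREATE INDEX idx_events_time ON events (event_time)")

    rows = conn.execute(
        "SELECT publisher, SUM(clicks), SUM(revenue) FROM events"
        " WHERE event_time BETWEEN '2011-01-03 11:00' AND '2011-01-03 13:00'"
        " GROUP BY publisher").fetchall()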

~~~
ironchef
"exceedingly well for very adhoc queries" vs "If your query patterns are even
somewhat predictable and occur frequently"

Aren't those ... largely opposite? To say "very adhoc" i would anticipate that
as largely meaning "not predictable". Also...can you perhaps quantify "very
large"? How many terabytes? I've done a decent amount of work with the "new
school" OLAP approaches (hadoop / mapreduce, etc.) and found them to work
quite well especially in certain cases such as time series (think weblog
analysis) where sequential scanning is a simplistic approach.

~~~
meritt
In a dimensional modeling approach you need to identify ahead of time which
elements/attributes users will query on. This still supports ad hoc queries
-- "Show me minutely clicks & revenue from 11am-1pm on Mondays in 2011,
except holidays, for publishers (A,B) against advertisers (A,B,C)". As long
as you define the grain of your data, adding new attributes to dimensions is
very easy and flexible.
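
To make that concrete, here's a minimal sketch of such a query against a
hypothetical star schema (sqlite3 for illustration; all table and column
names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Grain: one fact row per publisher/advertiser/minute.
    conn.executescript("""
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY,
                               day_of_week TEXT, year INTEGER,
                               is_holiday INTEGER);
        CREATE TABLE fact_clicks (date_key INTEGER, minute_of_day INTEGER,
                                  publisher TEXT, advertiser TEXT,
                                  clicks INTEGER, revenue REAL);
    """)

    # Minutely clicks & revenue, 11am-1pm on 2011 Mondays excluding
    # holidays, for publishers (A,B) against advertisers (A,B,C).
    query = """
        SELECT f.minute_of_day, SUM(f.clicks), SUM(f.revenue)
        FROM fact_clicks f
        JOIN dim_date d ON d.date_key = f.date_key
        WHERE d.year = 2011 AND d.day_of_week = 'Monday'
          AND d.is_holiday = 0
          AND f.minute_of_day BETWEEN 660 AND 780  -- 11:00 to 13:00
          AND f.publisher IN ('A', 'B')
          AND f.advertiser IN ('A', 'B', 'C')
        GROUP BY f.minute_of_day
    """
    print(conn.execute(query).fetchall())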

I guess I'm comparing it against cases better suited to brute force, where
someone is analyzing a log file for genuinely random, one-time questions.
"Show me hits to this particular resource from IP addresses which match this
pattern, where the user-agent contains Safari, the response time is larger
than 300ms, and the response size is less than 100KB!" While you could fit
this data easily into a DM, you'd need to plan ahead for that sort of
querying. If it's an infrequent need, it makes more sense to process the log
sequentially (even if it's across 10,000 machines in a Hadoop cluster).
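
For contrast, that one-off filter is just a sequential scan -- essentially
what each MR mapper runs over its slice of the log. A sketch, where the log
format, field order, and filename are all assumptions:

    import re

    ip_pattern = re.compile(r"^10\.0\.")  # stand-in for "this pattern"

    def matches(line):
        # Assumed tab-separated fields: ip, resource, user_agent,
        # response time (ms), response size (bytes).
        ip, resource, user_agent, resp_ms, resp_bytes = \
            line.rstrip("\n").split("\t")
        return (resource == "/some/resource"
                and ip_pattern.match(ip) is not None
                and "Safari" in user_agent
                and int(resp_ms) > 300
                and int(resp_bytes) < 100_000)

    with open("access.log") as f:
        print(sum(1 for line in f if matches(line)))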

When I left the company, our Greenplum cluster (so a bit of both worlds: an
RDBMS cluster that automatically parallelizes queries across multiple nodes
and aggregates the results) was around 500TB. This approach was scaled up
from a single MySQL instance, though, which was seeing around 5 million new
rows per day for one particular business channel.

I'm not suggesting that "new school" approaches don't work, or that they
aren't fast. What I am suggesting is this: MR is a very naive approach that
is only "fast" because it executes the problem in parallel across many
nodes. If you have datasets which are going to be queried often in similar
ways, you should take advantage of the past ~30 years of innovation in
RDBMSs instead of masking the difficulty by throwing a lot of CPU (and
therefore money) at the problem and solving it in the most inefficient
manner possible. It pains me to see people coming up with overly complex
"solutions" to basic OLAP needs on Hadoop-based or even NoSQL platforms
instead of simply using the right tool for the job.

That said, for the one-off cases where it doesn't make sense to build out a
schema and an ETL pipeline and to manage a database because the need is very
niche or one-time: that's where the real value of Hadoop/MR comes into play.

------
zachrose
Naive question: What does analyzing big data sets get you that sampling
doesn't?

~~~
Almaviva
Sampling lowers your confidence resolution, period. When you're testing
hypotheses, the biggest constraint can be that the effect you're looking for
is smaller than the resolution your confidence intervals give you. Improving
that resolution, even by a little, can be worth a lot.
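
Concretely, the standard error of an estimate shrinks like 1/sqrt(n), so a
confidence interval's width does too. A quick sketch for a proportion:

    import math

    # 95% CI half-width for a proportion p estimated from n samples is
    # roughly 1.96 * sqrt(p * (1 - p) / n).
    p = 0.05
    for n in (10_000, 1_000_000, 100_000_000):
        half_width = 1.96 * math.sqrt(p * (1 - p) / n)
        print(f"n={n:>11,}  CI = {p} +/- {half_width:.5f}")

    # A 1% sample vs. the full data set means a 10x wider interval --
    # enough to hide a small effect entirely.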

~~~
tel
Because you stated something absolutely, I feel the need to round off the
edge. Sampling can increase your confidence resolution if it lets you
integrate signals from more data sources using a larger model that would be
infeasible without sampling.

~~~
disgruntledphd2
I think the difference is in the aims. With traditional statistics, you're
trying to estimate some quantity of interest in the population, while with
"big data", you're typically trying to make predictions for individual users.
While this can be done with traditional statistics (in fact, the predict
method in R does exactly that) it becomes easier to match participants on what
books they might like if you have data for what books everyone in your
population likes rather than just a sample.
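
A sketch of the same distinction in Python (scikit-learn standing in for
R's predict; the data here is made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))  # per-user features (fabricated)
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

    model = LinearRegression().fit(X, y)
    print(model.coef_)           # estimation: population-level quantities
    print(model.predict(X[:5]))  # prediction: an output per individual user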

Now, whether or not the inferential premises of statistics hold up on website
data (and population data) that typically is neither random nor
representative, that's another story.

------
zwass
I'm confused by the assertion that Hive was "much slower than using MySQL
with the same dataset." The author makes this claim, and then provides a
table that shows Hive performing ~50% better than MySQL on a variety of
datasets (none of which really flexes Hadoop's muscles, since nothing goes
beyond single-digit GB).

Regardless, Impala sounds like it could be pretty sweet!

------
xradionut
"These aren’t scientific benchmarks by any means (nothing’s been especially
tuned or optimized)..."

I had to smile when I read that. Working with data, optimization or redesign
can sometimes yield significant performance gains. (Especially when
reworking some of my colleagues' queries or code...)

