
Don't use Hadoop when your data isn't that big - gcoleman
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
======
w_t_payne
Hooray! Some sense at last.

I have worked for at least 3 different employers that claimed to be using "Big
Data". Only one of them was really telling the truth.

All of them wanted to feel like they were doing something special.

The sad thing is, they were all special, each in their own particular way, but
none of what made each company magic and special had anything to do with the
size of the data that they were handling.

Hadoop was required in exactly zero of these cases.

Funnily enough, after I left one of them, they started to build lots of
Hadoop-based systems for reasons which, as far as I could fathom, had more to
do with the resumes of the engineers involved than the actual technical merits
of the case.

Sad, but 'tis the way of the world.

~~~
harrytuttle
Yes I've worked for a couple of outfits that did this over the years. It's
usually down to who sold them the solution though. It's worst if the
salesperson is external.

The funniest has to be the engineering company which I used to work with (not
for, thank fuck) back in the 90s (when 100GB was big data!). They managed to
bag 2x full 42U racks full of HP N-Class HPUX UNIX kit, PDU systems and
separate disk arrays for just over £1,000,000 supplied with one full time UNIX
monkey to keep the plates spinning.

Turns out they had only 10 CAD/engineering staff and about 8GB of data in an
Oracle DB which a single Sun Ultra desktop (<£10,000) could have handled
without a problem at the time.

~~~
davidmr
After I moved from academia/gov't labs to private industry, I was absolutely
flabbergasted at the waste. Many companies literally just buy whatever their
vendors throw at them to solve a problem, and if a reasonably intelligent
techie spent just an hour or two thinking through the problem, they'd realize
that the solution was either far less complicated than the proposed kit or far
more; either way, what's proposed is almost never right.

Another thing I haven't seen much of is acceptance testing. This is actually
more surprising to me than the reliance on vendors for problem solving. It's
an absolute no-brainer to say "I have problem X, which will be solved by
solution Y, and the success will be measured by the set of tests Z. Unless the
above happens, we will not pay the bill and you will take the hardware back."
The vendors looked at me like I was from Mars when I started suggesting these.

I have not since complained about government waste in research computing
budgets.

~~~
jacques_chester
Outsourcing your own teams means there are no in-house experts to perform the
oversight role you've described. Thus the principal-agent problem is amplified.

------
mightybyte
I agree with the general thrust of this article. But hadoop isn't just for
scaling up the absolute size of the data set. It is also useful for scaling up
the absolute amount of CPU power you can throw at a problem. If I have 1 GB
data set, but the computations that I need to do on that data set are complex
enough that it would take a single machine a long time to do them, then hadoop
is still useful. I gain tremendously by being able to fire up 100 extra large
EC2 servers and run my computation much more quickly than I could with SQL or
Python on a single machine.

Now some might counter this point with the observation others have made here
that using hadoop imposes a ~10x slowdown. But even then, my 100 EC2 servers
will get the job done 10x faster. Running a job in 1 hour with hadoop is MUCH
better than running the same job in 10 hours without it, especially when
you're doing data analysis and you need to iterate rapidly.

So there is a point where using hadoop is not productive. But that limit is
not 5 TB and depends on a lot more variables. Over simplification makes for
catchy blog posts, but is rarely the way to make good engineering decisions.

~~~
pjbringer
In that case, your IO problem is easy, because it's small with regard to CPU
time. You can get away with putting the whole data in sql databases, and/or
making multiple copies of your data. Then you can use as many workers as you
want, with usually simple partitioning logic.
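
To illustrate, the partitioning logic really can be a few lines - a rough
Python sketch, where `user_id` is just a stand-in for whatever key your
records have:

    import hashlib

    NUM_WORKERS = 100

    def shard_for(key):
        # Deterministically map a record key to one of the workers.
        digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_WORKERS

    def partition(records):
        # Each worker then processes exactly one of these shards.
        shards = [[] for _ in range(NUM_WORKERS)]
        for record in records:
            shards[shard_for(record["user_id"])].append(record)
        return shards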

~~~
lgieron
Originally I did just that, but ultimately decided to move to Hadoop. When
combined with Amazon EMR, launching an arbitrarily large cluster is just a few
clicks. You can then monitor progress, have robust cluster-wide error handling,
and your data gets nicely merged into output files in S3 (not so easy with the
home-baked solution).

~~~
vosper
We've had a lot of success with EMR as well - we have an hourly Pig job that
produces data for our analytics database. It's not a particularly complex
script, but our traffic volume is unpredictable so it's reassuring to know
that we can add resources to a slow job and have it finish faster.

The downside of EMR is that it can be fairly expensive once you start needing
the beefy machines. We're lucky that we can afford to have our analytics
delayed an hour or two and can thus run on Spot instances (except for the
Master node). When we move to a streaming architecture I'm not sure EMR will
still be competitive, since we won't be able to have those machines go away on
us.

Edit: clarity.

------
bhauer
The point of the article that resonates with me is how frequently a technology
that is poorly fit with a problem domain is selected because of conventional
wisdom rather than data.

Related, it is remarkable how we developers routinely cite Knuth's advice
about premature optimization to justify our decision when the shoe fits, and
then turn around and flatly ignore the advice when it doesn't fit.

Selecting Hadoop before you have a specific and concrete need for it--or see
that need approaching rapidly on the horizon--is in my experience often and
surprisingly coupled with a disdain for other performance characteristics
(because Knuth!). The developers prematurely selecting Hadoop as their data
management platform will routinely be the same developers who believe it's
reasonable for a web application with modest functionality to require dozens
of application nodes to service concurrent request load measured in the mere
thousands. The sad thing here being that application platforms and frameworks
are not all that dissimilar; today, selecting something with higher
performance in the application tier is relatively low-friction from a learning
and effort perspective. But it's often not done. Meanwhile, selecting Hadoop
on the data tier is a substantially different paradigm versus the alternative
(as the article points out), so you have some debt incoming once you make that
decision. And yet, this is done often enough for many of us to recognize the
problem.

In my experience, for a modest web application, it's better to focus resources
and effort on avoiding common and stupid mistakes that lead to performance and
scale pain. Selecting Hadoop too early doesn't really do a whole lot to move
the performance needle for a modest web application.

Trouble is, many web businesses are blind to the fact that they are a modest
concern and not the next Facebook.

~~~
superuser2
> the same developers who believe it's reasonable for a web application with
> modest functionality to require dozens of application nodes to service
> concurrent request load measured in the mere thousands

I'm assuming this is a dig at Rails and Django? What are you suggesting
instead?

~~~
bhauer
Just about anything on the JVM (Dropwizard, Play, Finagle, Scalatra, Rest-
Express, Rest-Easy, Compojure, Unfiltered, Jersey, Vert.x, Spark), Go
(Gorilla, Beego, Revel), Lua (Lapis), Haskell, Erlang...

~~~
superuser2
Are any of these environments actually proven in more than a handful of real-
world production environments?

I'd certainly like to get into alternatives but my impression was that none of
these are battle-tested the way Rails and Django are.

~~~
jake_morrison
There is a reason that the largest Internet companies are still running their
infrastructure on Java and C++. It scales and operations people know how to
manage it.

Rails is the classic example of optimizing for developer ease of use instead
of performance. Which probably makes sense for startups, but scaling is a pain
in the ass.

Openresty is great for infrastructure level solutions, e.g. request routing
and authentication. It is used by some of the largest Internet companies in
China.

The benchmarks at
[http://www.techempower.com/benchmarks/](http://www.techempower.com/benchmarks/)
are a reminder of how poorly some popular frameworks perform. And how poorly
cloud performs vs relatively cheap dedicated hardware.

------
hackula1
In 99% of the cases I have seen where people are working with tables in the
5+ TB range for analysis, there is some obvious way to compress the data that
they have overlooked. Most analysts find some way to aggregate a
dataset once, then do actual work on that aggregated dataset, rather than the
raw data. In geospatial analytics, for example, a trillion records can be
aggregated down to census blocks/block groups so you only have a few million
records to deal with. The initial aggregation often takes several days, but
after that you can calculate most things in a few seconds with reasonable
hardware.
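
As a sketch of what that one-off aggregation pass can look like (pandas, with
made-up column names `block_id` and `value`, reading the raw file in chunks so
it never has to fit in memory):

    import pandas as pd

    totals = None
    for chunk in pd.read_csv("points.csv", chunksize=5_000_000):
        partial = chunk.groupby("block_id")["value"].agg(["sum", "count"])
        totals = partial if totals is None else totals.add(partial, fill_value=0)

    totals["mean"] = totals["sum"] / totals["count"]
    totals.to_csv("blocks.csv")  # a few million rows; cheap to re-query later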

~~~
pge
In addition to compression, let's not forget sampling. For a lot of problems a
random subset of the data will give you a statistically meaningful answer with
sufficient precision. It seems like the rise of "big data" has led to the
assumption that all queries have to be run against the entire dataset.
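
For instance, a single pass of reservoir sampling gives a uniform random
subset without ever holding the full dataset in memory (rough sketch, file
name made up):

    import random

    def reservoir_sample(stream, k=100000):
        # Keep a uniform random sample of k items from a stream of unknown length.
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = item
        return sample

    with open("events.log") as f:
        subset = reservoir_sample(f)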

~~~
disgruntledphd2
True. The issue with "Big Data" is that sometimes, especially when you need to
produce personalised recommendations, the sampling doesn't cut it (or at least
produces sub-optimal results).

------
davidmr
While I couldn't agree more with the general point of the article, I have some
small additional comments.

Just as a bit of background, I think that Chris would very much agree that I
am not the intended recipient of this advice, and so my comments probably
aren't keeping in the spirit of the article. I've spent the last 10+ years
exclusively in very large HPC environments where the average size of the
problem set is somewhere between 500TB and 10PB, and usually much closer to
the latter than the former.

I think that, for the types of problems Chris mentions, hadoop is as silly a
solution for small data sets as he claims, and for the large map-reduce
problem set (divide and conquer using simple arithmetic) of 5TB+, he's clearly
in the right. Periodically I peruse job postings to see what is out there, and
I'm personally ashamed at what many people call "big data". But just because
your problem set doesn't fit the traditional model of big data (incidentally,
I'm having trouble thinking of a canonical example of big data; perhaps genome
sequencing? astronomical survey data?), that doesn't mean that a) hadoop is
not the right solution, or b) that it's best done on a box with a hard drive
and a postgres install, pandas/scipy, whatever.

Take for example a 4TB data set. By definition it fits on a single 4TB hard
drive, but if your problem involves reading the entire set of data and not
just the indexes of a well-ordered schema, you're still going to have a bad
time if you want it done quickly. And if you have a parameterization model
that requires each permutation to be applied across the entire sequence of the
data, rather than chunks you can load into memory and then move on from,
you're going to have a really bad time.

We'll be generous and say the drive can do 150MB/s. A single run through the
data at maximum efficiency will cost you ~8 hours. Since we've restricted
ourselves to a single box, we're also not going to be able to keep the data in
memory for subsequent calculations/simulations/whatever.
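
(The arithmetic, if you want to check it:)

    data_bytes = 4 * 1024**4       # 4 TB
    read_rate = 150 * 1024**2      # 150 MB/s sequential
    print(data_bytes / read_rate / 3600)   # ~7.8 hours per full pass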

I suppose all of this is to say that the amount of parallelization a problem
requires isn't determined only by the size of the problem set, as the article
mostly suggests, but also by the inherent CPU and IO characteristics of the
problem. Some small problems are great for large-scale
map-reduce clusters, some huge problems are horrible for even bigger-scale
map-reduce clusters (think fluid dynamics or something that requires each
subdivision of the problem space to communicate with its neighbors).

I've had a quote printed on my door for years: Supercomputers are an expensive
tool for turning CPU-bound problems into IO-bound problems.

~~~
jedbrown
"Big Data starts at 1.5 PB [because that's what fits in memory on Blue
Waters]" \-- Bill Gropp

On today's top HPC installations, it currently takes about an hour to read or
write the contents of memory from/to global storage. If the workflow has broad
dependencies (PDEs are especially bad, but lots of network analysis also fits
the bill), it's much better to use more parallelism to run for a shorter
period of time with the working set in memory. If you can't fit your working
set in memory on the largest machine available, chances are you should either
find time on a bigger machine or you can't afford to do the analysis. (The
largest scientific allocations are in the 10s of millions of core hours per
year, which would be burned in a few reads and writes on a million-core
machine.)

Also note that MapReduce is not very expressive. When IBM's Watson team was
getting started, some people suggested using MapReduce/Hadoop, but the team
quickly concluded it was way too slow/constraining and instead opted for an
in-memory database with MPI for communication.

~~~
w_t_payne
1.5 PB is pretty huge for most people outside of supercomputing. The biggest
dataset I ever worked with fell some way short of that (it was approaching
1PB, I think, although we were not too sure exactly how much data we had, and
it might have been slightly over, depending on how you measured). Even
handling small(er) working sets could be an organisational challenge, since we
only had a small compute cluster (& associated infrastructure) to work with.

~~~
dekhn
"Supercomputing" has never really been about big data. I used to work in
supercomputing (NERSC) and have used most of the major machines in the field
in the past. The supercomputer centers claim they're moving lots of data
around, but if you look closely, it's almost entirely message passing during a
calculation. And in most cases, the calculations they are doing are just
wasted resources- simulating a protein over one long trajectory.
Supercomputers, as compared to high throughput clusters, are just wasted
money.

~~~
davidmr
I wrote a different reply but deleted it. I'm curious about something: if, as
you claim, supercomputers are wasted money, why are there so many of them?
Have all the world's top supercomputing sites somehow colluded to convince all
the world's largest governments that they're useful?

~~~
dekhn
At this point, large capability supercomputers that solve problems we don't
need solved are mostly kept around as trophy pieces with a few "show me"
calculations. Note that when the Chinese obtained the leadership in TOP500 a
lot of people got worried (as they did when Japan did it with the Earth
Simulator). And so the US may spend a bunch of money to get on top of that
list again. Big deal. It's LINPACK. Fortunately, TOP500 should eventually
introduce better codes to benchmark.

I'm sure there are scientists who put their all into getting their code to run
on a super computer, then apply for a pittance of time and sit in a queue for
weeks, but they're not publishing as many (or as interesting) papers as the
ones who are pulling out their credit card and building mini-supercomputers on
Amazon that rely on conventional interconnects and better algorithms and data
processing tools.

Don't assume I'm ignorant about what supercomputers are used for. I used to
work on supercomputers; that includes writing, running and evaluating codes,
and selecting proposals on some of the largest machines in the world (at the
time).

But these new approaches to doing science and engineering on large computer
systems have _obsoleted_ conventional supercomputing for all but an extremely
limited set of computational problems. And that set of problems becomes
smaller when clever computer scientists figure out smarter ways to run things
on cheaper architectures. For example, page rank is a classic eigenvector
problem; you can solve it by building a big matrix on your supercomputer and
doing the appropriate calculations, using 50+ years of numerical optimization
(but not really very good support for nodes failing between checkpoints).
However, you can _also_ implement it as an iterative mapreduce. The mapreduce
checkpoints every little bit of map work, and along the way, handles those
failures quite well. It can also handle data sets larger than the sum of RAM
on the machines quite well.

Guess which one works better operationally, scales to a larger data set size,
and ports to a lot of architectures cheaply?

~~~
tfgg
As far as I understand, you're saying that high-speed interconnects are
largely a waste of time? (and apparently that doing protein MD is a waste of
time, but that's a scientific question that I'm going to disagree with you
on). I can imagine that might be the case for some simulations, but how do I do,
for example, a parallel FFT without significant communication?

I personally see it more as there's a pretty limited set of computational
problems which are low-communication, often requiring rather gross
approximations, even with the "clever computer scientists" looking at the
problem, and they're often not the problems scientists actually want to solve,
and so dismissing supercomputing is rather ridiculous.

~~~
dekhn
High speed interconnects are not worth the extra cost. They just compensate
for poor programming.

With regards to protein MD, no, I'm not saying protein MD is a waste of time.
In fact, I run the Exacycle program at Google which has run the largest MD
simulations (many milliseconds) ever done, and the results are quite good. You
would never have gotten results as good as ours on a supercomputer- even the
world's largest, with the most highly scaling MD codes.

I speak from experience- I used to code for supercomputers, in fact working on
protein (and DNA/RNA, my interest) structural dynamics, with some of the
leading MD codes.

I helped port a better approach to Google's infrastructure some time ago. It's
not a clever hack or gross approximation:
[https://simtk.org/home/msmbuilder](https://simtk.org/home/msmbuilder) it's a
distinctly better way of modelling protein dynamics than running long single
trajectories on supercomputers.

~~~
tfgg
> High speed interconnects are not worth the extra cost. They just compensate
> for poor programming.

Sorry, that's complete rubbish. Some problems just can't be made
embarrassingly parallel or near so. You might be able to swap a problem for a
similarish problem with better characteristics, but that's not the same as
being a "poor programmer" since you're now using a different model rather than
optimizing the implementation of the existing one. You still haven't explained
how I do a parallel FFT without high speed interconnects -- I suspect the
answer is "don't". I've tried running the stuff I do (DFT) over a 10GigE
network on my cluster and it ran at about 10% of the speed of an Infiniband-
enabled calculation. There are methods for getting better scaling, but they
all (as I said) involve rather crass approximations which you don't always
want to do. Those methods will probably increasingly be used more on large
supercomputers due to their superior scaling characteristics, even with their
nice high-speed interconnects, but you're still making a sacrifice in
accuracy.

It might be possible to get good low-communication scaling for some models, as
you were in the case of your sampling system (which typically parallelizes
nicely, but only if the individual sampling jobs can fit on a single node),
but you can't extrapolate that to assume that everyone can. As I said,
exchanging a poorly-scaling model for a different, non-equivalent well-scaling
model is a scientific question with tradeoffs, not a programming one.

~~~
dekhn
Are you sure you need to do an FFT?

Are you sure there is not a network friendly FFT algorithm?

I used to think very differently about computing before I read Google's MR,
BigTable, and GFS papers. After joining Google and working on problems like
this, I can assure you that FFTs can indeed scale quite nicely.

note that infiniband isn't a high speed interconnect. it's a cheap commodity
interconnect- in fact, per port, it can be cheaper than 10GbE. but lots of
effort has been put into making the MPI libraries work efficiently over it,
compared to 10GbE, so codes run more efficiently (as you observed).

DFT seems like the next thing that's going to not need supercomputers. I see
some nice new algorithms coming on line designed for Amazon cloud
infrastructure.

~~~
xtreme
> note that infiniband isn't a high speed interconnect. it's a cheap commodity
> interconnect- in fact, per port, it can be cheaper than 10GbE.

Why do you think that an interconnect can't be both cheap and high-speed? As
you said, there has been a lot of work to make MPI efficient on infiniband,
but that has been possible because infiniband offers much lower (~10 times)
latency than 10GbE, not because no one has optimized MPI for ethernet. In
fact, standards like RoCE and iWarp have been devised by Ethernet working
group to compete with infiniband on this particular metric.

~~~
dekhn
actually, wire latencies for 10GbE are much better than you think- they are
basically the same.

Anyway, my point was that infiniband is not a high speed interconnect, and
it's cheaper than 10GbE. The challenge is to deliver scaled infiniband, which
is far harder than scaled ethernet.

------
beagle3
Indeed.

A rule of thumb that I've inferred from many installations is: just
introducing Hadoop makes everything 10 times slower AND more expensive than an
efficient tool set (e.g. pandas).

So it only makes sense to start hadooping when you are getting close to the
limit of what you can pandas - everything you do before that is a horrible
waste of resources.

And when you do get there - often, a slightly smarter distribution among
servers while staying with e.g. pandas will let you keep scaling up without
introducing that divide-by-10 factor in productivity. Although it might be
unavoidable at some point.
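
As a rough sketch of what I mean by a "slightly smarter distribution" - the
same pandas job run independently per shard, here with local processes but
just as easily one shard per server (paths and column names are hypothetical):

    import pandas as pd
    from multiprocessing import Pool

    def summarise(path):
        # The same pandas job, run independently on one shard of the data.
        df = pd.read_csv(path)
        return df.groupby("key")["value"].sum()

    if __name__ == "__main__":
        shards = ["events/part-%04d.csv" % i for i in range(32)]
        with Pool(8) as pool:       # or: one shard per server, merged afterwards
            partials = pool.map(summarise, shards)
        total = pd.concat(partials).groupby(level=0).sum()
        total.to_csv("totals.csv")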

~~~
yummyfajitas
There is a huge gap between Pandas and Hadoop. That gap is well served by a
$2-3k Postgres box with 32-64GB ram and 5-10TB of disk.

The main reason to switch to Hadoop at the point when Pandas fails is because
you expect to scale past that gap fairly quickly.

------
rdtsc
"We don't have big data" or "our data is rather small" \-- said no dev team
ever.

"Big data" is like "cloud" it is a cool label everyone applying to their
system. Just like OO was in its time. Well once they applied the label they
feel they need to live up to it so well "we gotta use what big data companies
use" and they pick Hadoop. I've heard hadoop used when MySQL, SQLite or even
flat files would have worked.

~~~
e12e
I did say that my former employer. Former.

~~~
e12e
*at my former employer.

------
mattjaynes
Novelty Driven Development (NDD)

Chris points out a great example of NDD here with Hadoop.

I do a lot of client work and I see this mistake CONSTANTLY. So often in fact,
that I recently wrote up a story to illustrate the problem. Rather than use a
tech example, I use a restaurant and plumbing to drive the point home. When
the same scenario is put into the context of something more concrete like
physical plumbing, it shows how ridiculous NDD really is.

[http://devopsu.com/blog/boring-systems-build-badass-
business...](http://devopsu.com/blog/boring-systems-build-badass-businesses/)

~~~
StavrosK
That blog's background is 2.6 MB, perhaps you should compress it a bit.

Also, I think it's a bit disingenuous to compare a complex system built to
various specific needs versus a simpler one that doesn't address those needs
at all and just say "their only difference is that one worked and the other
didn't".

Obviously, if the only criterion was that it should just work, everyone would
go with the simpler system. The more complex system probably had some things
going for it, too, otherwise nobody would choose it.

~~~
mattjaynes
"Obviously, if the only criterion was that it should just work, everyone would
go with the simpler system."

It's obvious to you and me, but to many people it's not obvious at all. You
only have to look at the _very_ large deadpool of companies that made these
mistakes and died because of it.

People optimize prematurely, scale prematurely, make ego based decisions, care
more about interesting tech than making the business grow, etc etc. That's
really the whole point of the parent article - people make bad decisions
because of the allure of some "cool" but complex and unnecessary tool.

Also, thanks for the note on the background - I'll fix that :)

------
eksith
"A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and
stick it in a desktop computer or server. Then install Postgres on it."

Done! Although with more drives and a backup server. Right now, we're pushing
15TB with no loss in performance.

~~~
daxelrod
Do you use standard off-the-shelf consumer hard drives at those prices?

Most companies I've worked for have shelled out quite a bit more for
"enterprise" class hard drives. I've always struggled to understand what these
bring to the table, and my understanding is that it's some combination of
greater reliability and a service agreement.

It's always seemed to me, in my software-centric naivete, that it would be
more cost-effective to get off-the-shelf drives, RAID them for redundancy to
increase uptime, and make regular backups to increase data longevity.

~~~
Sanddancer
For the most part, you're right. Enterprise hard drives tend to come with
better agreements, especially for situations where data protection standards
make it prohibitive to send a dead drive back to RMA it; often they'll just
say to send back the controller board or the like. Oddly, however, one of the
big design features for enterprise hard drives is that they'll give up faster
if there's a mis-read, etc. Because you're striping your data over a large set
of spindles, it's often faster to just throw your hands up and work from the
parity disk until you can diagnose the issue. This is also why most good sized
arrays will have hot-spares, so you can begin the rebuild while you figure out
if the drive was just having a temporary glitch or not.

~~~
caw
> it's often faster to just throw your hands up and work from the parity disk
> until you can diagnose the issue.

One of the advantages of enterprise drives is how long the drive spends error
checking. Consumer drives spend considerably longer when they detect an error,
and that can cause the RAID controller to drop the drive as malfunctioning.
Then it has to rebuild the array or operate on a hot spare.

There used to be a firmware setting you could toggle on the WD Caviar Black
drives to effectively turn them into the RE (enterprise) drives. They've since
made some design changes so that's no longer possible.

------
jamii
The paper introducing GraphChi has dozens of examples of GraphChi on a Mac
Mini running rings around 10-100 machine Hadoop clusters -
[http://graphlab.org/graphchi/](http://graphlab.org/graphchi/)

I worked on a project earlier this year that was envisioned as a hadoop setup.
The end result was a 200loc python script that runs as a daily batch job -
[https://github.com/jamii/springer-
recommendations](https://github.com/jamii/springer-recommendations)

I'm tempted to setup a business selling 'medium-data' solutions, for people
who think they need hadoop.

------
alanctgardner2
I'm always torn by these headlines: yes, many organizations lack the size of
data required to take advantage of Hadoop. Few of the articles really bother
explaining the advantages of Hadoop, and how what you're doing really moves
the break-even point in terms of data size:

\- 3x replication: if the data needs to be retained long-term, slapping it on
one hard drive isn't going to cut it. This is pretty poor justification by
itself, but it's nice to have.

\- working set: if you only pull 1GB out of your data set for your
computations, it makes sense to pull data from a database and run Python
locally. If you need to run a batch job across your full, multi-TB data set
every day then Hadoop starts looking more attractive

\- data growth: a company may only have 10GB of data now, but how much do they
expect to have in a year? It's important to forecast how much data you'll
accumulate in the future. Especially if you want to throw all your
logs/clickstreams/whatever into storage.

So, if you're expecting explosive growth, you want to hang on to every piece
of data ever, or you're going to do a lot of computation across the whole
dataset, it makes sense to adopt Hadoop even if your dataset isn't 'big' to
start.

As for this article, the author undersells MapReduce a bit. Human-written MR
jobs can jam a lot of work into those two operations (and a free sort, which
is often useful). Using a tool like Crunch can turn really complicated jobs
into one or two phases of MR. Once Tez is widely available people won't even
write MR anymore; they'll all likely write in a 'high-level' language and compile
it down to Tez.

~~~
Sanddancer
You bring up interesting points about needing to analyze future
requirements, but at the same time, I feel you're underselling the things you
can do with a modern SQL cluster.

* Replication: Disks are cheap, data can go on as many drives and servers as you need. Master-slave replication is pretty damn bulletproof these days, and multi-master isn't as terrible as it was even a few years ago.

* Working set: All servers can have the entire working set. Additionally, features like foreign data wrappers in postgresql and federated tables in mysql mean that you can query your sharded databases, and still get aggregate results back.

* Data growth: SQL servers were Big Data before Big Data was cool. Terabyte storage clusters are no problem for an SQL database engine, with some entities having clusters in the petabyte range.

Databases are a very fast moving target of late, with competition from many
fronts meaning that what was true even last year may not be true today.
Additionally, with things like Postgresql's foreign data wrappers, you can
choose the best data processing engine for the job -- keep the data on the
databases for the nice bits of SQL, but still be able to throw it towards a
hadoop cluster at a whim if needed. Using the right tool for the job is
important, and equally important is keeping up with what tools are out there,
because this is a rapidly evolving area of tech. Analyzing each job
separately, instead of relying on any mantra, is what keeps you flexible
enough to keep up.

~~~
alanctgardner2
Preface: "Big Data" is a stupid label, and I wish it would die. You have no
idea how much time I spend explaining to non-technical people that comparing
Hadoop and Riak is like comparing a tractor to a helicopter.

You can definitely put lots of data in a RDBMS, and you can get it back very
quickly. I'm not advocating for Hadoop for problems where a database - be it
sharded, vertically scaled, whatever - will do; I know you can scale
relational to petabytes of data.

Hadoop isn't supposed to be a tool for _having_ a lot of data, it's a tool for
_doing things_ with a lot of data. I've done lots of weird computation on
Hadoop: OCR of time-series images is a good example. This is general purpose,
parallel computation which takes advantage of Hadoop's scheduling
infrastructure and the notion of data locality - some of the data lives on
every compute node, and can be accessed with very low latency.

To put it another way, databases will get faster and more scalable, but
there's still a need for ETL when moving between different data models. Some
people use Hadoop just for ETL, particularly when ingesting semi-structured
data, because it's good at computation, but only OK at storing the finished,
structured data.

I'm going to try not to be offended by your closing line; I was trying to
point out a particular dimension of considering whether an application is
suited to Hadoop - the computation component. This is often ignored by people
who treat Hadoop like yet another database, when that's really a complete
mischaracterization. I certainly don't advocate 'relying on any mantra', but
instead considering all solutions and selecting the most appropriate tool.

------
noelwelsh
In these discussions it is mandatory to quote this paper:
[http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...](http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf):

"We completely agree that Hadoop on a cluster is the right solution for jobs
where the input data is multi-terabyte or larger. However, in this position
paper we ask if this is the right path for general purpose data analytics?
Evidence suggests that many MapReduce-like jobs process relatively small input
data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is
now technically and financially feasible to have servers with 100s GB of DRAM.
We therefore ask, should we be scaling by using single machines with very
large memories rather than clusters? We conjecture that, in terms of hardware
and programmer time, this may be a better option for the majority of data
processing jobs."

Their data is based on Hadoop jobs running at Yahoo, Facebook, and Microsoft
-- companies most would agree do have real Big Data -- and they find the
median job size is <14GB.

~~~
zenbowman
Even if all jobs take less than 14GB of data as input, if the sum total of
your data is far greater, it is easier to use Hadoop as a unified single
filesystem for storing data than to rely on multiple different machines that
you manage by hand. Even if you treat Hadoop simply as a multi-machine
filesystem, it automates the job of tracking which files are where, which you
inevitably have to manage by hand if you store your data on multiple machines
sans some kind of filesystem interface.

------
PaulHoule
A study of jobs submitted to the Yahoo! cluster showed that the median job
involved 12GB of data.

There's really nothing wrong with that at all, because, broken into 64MB
blocks, that 12GB can be processed in parallel, which means turning an answer
around really quickly, say 30 seconds or so. Usually the work can
be scheduled on machines that already have the necessary input, so the network
cost is low.

Now, it might not be worth it for one hacker to build a Hadoop cluster to do
that one job, but if you have a departmental-wide or company-wide cluster you
can just submit your jobs, get quick answers, and let somebody else sysadmin.

Sure the M/R model is limited, but it's a powerful model that is simple to
program. You can write unit tests for Mappers and Reducers that don't involve
initializing Hadoop at all, and THAT speeds up development.
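
(For instance, with Hadoop Streaming a mapper and reducer are just plain
functions, so the tests are ordinary asserts that never touch Hadoop - word
count here is only a stand-in example:)

    def mapper(line):
        # Emit (word, 1) for every word on an input line.
        return [(word.lower(), 1) for word in line.split()]

    def reducer(word, counts):
        # Sum all counts emitted for one word.
        return (word, sum(counts))

    assert mapper("the quick the") == [("the", 1), ("quick", 1), ("the", 1)]
    assert reducer("the", [1, 1, 5]) == ("the", 7)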

Yes, it is easy to translate SQL jobs to M/R, but M/R can do things that SQL
can't do. For instance, an arbitrary CPU or internet intensive job can be
easily embedded in the map or in the reduce, so you can do parameter scans
over ray tracing or crack codes or whatever.

I built my own Map/Reduce framework optimized for SMP machines and ultimately
had my 'shuffle' implementation break with increasing input size. At that
point I switched to Hadoop because I didn't plan to have time to deal with
scalability problems.

[https://github.com/paulhoule/infovore/wiki](https://github.com/paulhoule/infovore/wiki)

With cloud provisioning, you can run a Hadoop cluster for as little as 7.5
cents, so it's a very sane answer for how to get weekly batch jobs done.

~~~
cmarschner
Looking at the median is not very interesting, since jobs in these
environments are always heavily skewed. You have those 5-10% jobs that are
several orders of magnitude beyond those 13GB, and those are the ones you run
the cluster for.

~~~
PaulHoule
Of course, but the ability to run small jobs and get a quick turnaround can be
transformational in the sense that it lets you try things out and "fail
faster"

------
mrcactu5
I remember starting to grasp what "big data" meant when I had a phone
interview with Twitter.

@ Imagine you have some numbers spread over some computers -- too many to fit
in one computer. Find the median.

▪ Uhh, sort them?

@ Can you find the median on a single computer without sorting them?

▪ :-(

@ We'll call you back tomorrow.

I was promptly rejected, but it set the tone for my later studies.
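
(For the curious, the usual answer turns out to be a binary search over the
value range, where each machine only ever reports counts; `count_below` here
is a stand-in for whatever query you would send to each machine:)

    def distributed_median(machines, lo, hi, total):
        # Median of integers spread across machines, without sorting or moving
        # the data: binary-search the answer in the value range [lo, hi].
        target = total // 2                    # 0-based rank of the median
        while lo < hi:
            mid = (lo + hi) // 2
            below = sum(m.count_below(mid + 1) for m in machines)  # values <= mid
            if below <= target:
                lo = mid + 1
            else:
                hi = mid
        return lo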

The criterion for Big Data seems to be that it fits on thousands of computers,
perhaps several TB or a PB. Then I had to think of some examples:

* A million YouTube Videos

* All the tweets in the US in the past 15 minutes

* All US tax records

I still think the map-reduce philosophy is really cool. And I know at that
scale there are special counting algorithms (like Bloom Filters) that may lead
to some improvements at the GB or MB scales.

~~~
t1m
Twitter often gets singled out for its big dataness, though the latest
numbers I've seen are only about 400 million tweets per day. Even allowing
1K/tweet this is a rather manageable 400GB uncompressed.

15 minutes of US tweets would fit on your phone :-)

~~~
cdavid
I am pretty sure a tweet is more than 1KB: the message itself is 140
characters, not 140 bytes, and you have lots of metadata around a tweet. I
would expect easily one order of magnitude more per tweet.

------
gfodor
The author has a point, but dataset-size is one dimension of several you need
to consider when making the choice to use Hadoop.

A local python script is great, but what if it takes 2 or 3 hours to run? Now
you need to set up a server to run python scripts. What if the data is
generated somewhere that would have high locality to a hadoop cluster? Now you
need to pull that data down to your laptop to run your job. What if there are
a dozen people running similar jobs? Now your python script server is a highly
stressed single point of failure. What if the data is growing 100% month-over-
month? Your python scripts are going in the trash soon since they were not
written in a way that can be easily translated to map-reduce, and hadoop-sized
workloads are inevitable.

The next step up is a centralized database, but in my experience running your
own (large, highly used) database is a whole lot harder than just throwing
files on S3 and spinning up hadoop clusters on EC2 if you have people that can
write pig jobs.

A solution like elastic map reduce removes a lot of practical problems such as
data distribution, resource management, and system operations beyond the fact
that it makes it possible to easily run jobs over terabytes of data at a time.

------
geertj
Amen! Finally someone talking sense. Apart from being hyped, Hadoop is also a
wet dream for VCs that want to have "exposure" to "big data".

Hadoop is a solution for cases where you have multiple petabytes of data, with
queries that need to touch a significant portion of your data. Roughly
speaking in this case your execution time will scale with the number of nodes
in your cluster. Classic example is creating the inverted word list for a
search engine.
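
The shape of that computation, as a toy Python sketch with no Hadoop involved:

    from collections import defaultdict

    def map_doc(doc_id, text):
        # Map: emit (word, doc_id) for every word in the document.
        return [(word.lower(), doc_id) for word in text.split()]

    def reduce_word(word, doc_ids):
        # Reduce: the posting list for one word.
        return (word, sorted(set(doc_ids)))

    docs = {1: "big data is big", 2: "small data"}
    grouped = defaultdict(list)
    for doc_id, text in docs.items():          # the shuffle, done by hand
        for word, d in map_doc(doc_id, text):
            grouped[word].append(d)

    index = dict(reduce_word(w, ids) for w, ids in grouped.items())
    # {'big': [1], 'data': [1, 2], 'is': [1], 'small': [2]}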

For most other use cases, including all cases where you can index your data,
you do not need Hadoop.

------
ivanprado
You are right that the proper tool should be used for each particular
problem. And the Hadoop world is harder than single-machine systems (like
pandas). So you shouldn't use Hadoop if you can do the job with simpler
systems.

But I have something to add. Hadoop is not only introducing new techniques for
distributed storage and computation. Hadoop is also proposing a methodological
change in the way a data project is approached.

I'm not talking only about doing some analytics over the data, but about
building an entire data-driven system. A good example would be building a
vertical search engine, for example for classified ads around the world. You
can try to build the system just using a database and some workers dealing
with the data, but soon you'll find a lot of problems managing the system.

Hadoop provides all the storage and processing power that you want (it is a
matter of money). So what if you build your system in a way where you always
recompute everything from the raw input data? That can be seen as something
stupid: why do that if you can run the system with fewer resources?

The answer is that with this approach you can:

\- Be tolerant of human failure. If somebody introduces a bug in the system,
you just have to fix the code and relaunch the computation. That is not
possible with stateful systems, like those based on doing updates over a
database.

\- Be very agile in developing evolutions. Changing the whole system is not
traumatic, as you just have to change the code and relaunch the process with
the new code, without much impact on the system. That is not something simple
in database-backed systems.

The following page shows how a vertical search engine would be built using
Hadoop and what its advantages would be:
[http://www.datasalt.com/2011/10/scalable-vertical-search-
eng...](http://www.datasalt.com/2011/10/scalable-vertical-search-engine-with-
hadoop/)

~~~
jacques_chester
> _That is not possible with stateful systems, like those based in doing
> updates over a database._

Could you elaborate? This sounds like the NoSQL argument that relational
databases are "not agile", which usually means relational databases complain
that you have records that won't logically fit the changes you just made.

------
bitL
I couldn't disagree more with some of the statements in the article.

Hadoop is not a database! It's a parallel computing platform for MapReduce-
style problems that could preserve locality. If your problem fits this, Hadoop
absolutely rocks. If your problem is different then please use another tool.
If your problem deals for example with high-resolution geographic or LiDAR
data that can be easily processed independently (and that would easily give
you a few petabytes each scan/flight so you can't stuff it into GPU), Hadoop
is about the only open thing you can use to process them reliably (imagine
having Earth surface data with the resolution of 1cm and the need to prepare
multiple levels of detail, simplify geometry, perform object recognition,
identify roads etc.). Even when your data is smaller, if your problem fits
the MapReduce model well, Hadoop is a pretty convenient way to be future-proof
while enjoying already mature application infrastructure. Why would you even
bother working with toys that put everything into memory and then fail
miserably in production (HW failure, need to reseed data after crash etc.)?

I worked in a company that routinely processed these kinds of data; usually
people accepting the thinking from this article hit a wall someday in
production, couldn't guarantee reliability and ended up writing endless hacks
for their algorithms that didn't scale when it was needed and became
frustrating bottlenecks for everyone.

Yes, I also saw some ridiculous uses of Hadoop (a database that had no chance
of ever growing past 20M records, problems not fitting MapReduce that needed
custom messaging between jobs, etc.). Just use your reason properly: whatever
has the potential to involve large data in the future, do it with Hadoop or
any appropriate system that supports your algorithmic model (S4, Kafka,
OrientDB, Storm, etc.) straight away.

Make your software future-proof now or you'll have to rewrite it from
scratch when you're under huge pressure. Don't become complacent with
what you "know" now.

~~~
dalke
> "Hadoop is not a database!"

Nor does the essay claim that Hadoop is a database.

> "If your problem fits this, Hadoop absolutely rocks"

The essay points out that most systems which do use Hadoop don't actually fit
the Hadoop model, and that other 'mature application infrastructures' would be
more effective than Hadoop. You misinterpreted it to mean the converse.

It also agrees with you that there are cases where "Hadoop might be a good
option". It says "The only benefit to using Hadoop is scaling", and that it
might be appropriate for >5TB data sets. Your 1PB example is of course larger
than 5TB.

So I don't think you actually disagree with it.

> "Why would you even bother working with toys that put everything into memory
> and then fail miserably in production"

Thank you for calling my software a "toy" and disdaining my field of research.
Do not presume that the needs of your field hold for all others.

Hadoop isn't useful for what I'm interested in, which is to support
interactive search of ~2 million chemical structures. This requires a sub-100
ms query time. Hadoop, last I checked, was lousy for soft real-time work like
this.

You actually say "Hadoop or any appropriate system that supports your
algorithmic model".

My actual solution was the old-fashioned way: a combination of improved
algorithms, multithreading, better data locality, and a bit of chip-specific
assembly. The result is about 100x faster than the previous widely used tool,
and gives me the sub-second search times I want.

Moreover, it scales well. I use the new code as part of the inner loop in a
clustering task; what once took a week on a machine cluster is now being done
on a single node in an afternoon.

Had I taken your suggestion I would have invested in more hardware, which
would have been the wrong solution for my needs. Also, data size in my field
doubles every 5-10 years, which is much slower than the rate of computer
performance.

~~~
bitL
Sorry, I didn't mean to offend you in any way!

First, I really don't like it when somebody compares Hadoop with SQL by
default; this is a widespread confusion - they are completely different
beasts. In fact, an extension called Hive gives Hadoop + HTable an SQL-like
syntax.

However, Hadoop is a parallel, batch-processing platform. Hadoop is slow-
responding; you can't even talk about latency because a single task takes a
lot of time even to setup/execute. It's completely inappropriate for real-time
low-latency computations. For those, the in-memory, GPGPU, streams are much
better. However, if you have a large dataset whose loading times onto a single
machine may be very long - imagine loading 1TB to memory from a network drive
if your computer can handle it - you might be better off by partitioning your
data across thousands of smaller nodes (e.g. ARM microservers) and performing
computations on each of them independently. This way you don't need to
transfer a lot of data, each smaller local dataset is loaded very fast (= you
preserve locality), you find a balance between being CPU-bound and IO-bound and
likely finish your computations much faster than on a single computer with
huge memory but limited bus.

Hadoop's tragedy is that it is now an established platform that is supported
by large companies which are mostly driven by technologically clueless people,
trying to put it everywhere as it is "in vogue" to do so. Google abandoned
the MapReduce model a few years ago but the industry didn't seem to notice.

Think about your case for chemical structures - is there any part of your
algorithm that needs to be computed only occasionally but involves a lot of
data? Why not offload it to Hadoop to prepare a digest of that data, which
you can then use in your real-time algorithm? That's actually a pretty common
usage scenario - Hadoop handles the rough mining, extracting the precious
stuff from crude data, assembling it to a form that is refined and can be used
by other parts of system that have completely different requirements, for
example low-latency interactions.

~~~
dalke
> "I really don't like if somebody by default compares Hadoop with SQL"

The first comparison was to Scala, and then to SQL. The main comparison was to
SQL and a Python script. While it does talk more about SQL than other
solutions, I don't see that as a default comparison.

> "For those, the in-memory, GPGPU, streams are much better"

I'll go off on a bit of a philosophical tangent here. Yes, GPGPUs can be more
effective for the task I'm doing, since I'm actually memory bound. However,
GPGPUs require dedicated hardware, while the code I work on is effective even
on laptops with limited GPUs, including web servers.

Ideologically, I prefer to enable single-person developers, and scientists who
do not have much training in hardware and network administration. For that
situation, GPGPUs are not "much better", because so long as the performance
is fast enough, it doesn't need to be faster. And 100 ms is "fast enough."

> "This way you don't need to transfer a lot of data"

True. But to point out, I used the traditional approach of developing a file
format which can be memory-mapped directly to my internal data structures, and
a search algorithm which doesn't need to search the entire data set before
loading it. These optimizations aren't synergistic with how I understand how
Hadoop works.
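
(As a generic illustration of that kind of memory-mapping - not the actual
format discussed above, just a sketch with numpy and a made-up record layout:)

    import numpy as np

    # Hypothetical fixed-width records: a 64-bit id plus a 1024-bit
    # fingerprint packed into 128 bytes. The file maps straight onto this
    # dtype, and the OS only pages in what the search actually touches.
    record = np.dtype([("id", "<u8"), ("fp", "u1", 128)])
    data = np.memmap("fingerprints.bin", dtype=record, mode="r")

    query = data["fp"][0]                                 # pretend query
    scores = np.unpackbits(data["fp"] & query, axis=1).sum(axis=1)
    best_id = data["id"][scores.argmax()]                 # crude best match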

> Hadoop's tragedy ... mostly driven by technologically clueless people

The author of the essay agrees with you. Here's the P.P.S.:

"I don’t intend to hate on Hadoop. I use Hadoop regularly for jobs I probably
couldn’t easily handle with other tools. ... Hadoop is a fine tool, it makes
certain tradeoffs to target certain specific use cases. The only point I’m
pushing here is to think carefully rather than just running Hadoop on The
Cloud in order to handle your 500mb of Big Data at an Enterprise Scale."

This is why I don't think you actually disagree with the author.

> Think about your case for chemical structures - is there any part of your
> algorithm that needs to be computed only occasionally but it's a lot of data
> to process?

Certainly, but there are two other important facets to that. 1) updates occur
weekly, a full rebuild on a single core only takes 12 hours, and those 12
hours aren't critical, and 2) the deltas are relatively small and data from
previous builds can be reused, so incremental updates should take about 30
minutes - I'll be working on that code in the next couple of weeks.

It's easier to have a cron job trigger a command-line program every week than
to set up a Hadoop server.

------
jdk
Back in college in '97 or so, in our databases class on the first day, the
professor asked, "Who's worked with databases before?" A bunch of hands went
up. "Oh, sorry, let me rephrase, who's worked with databases larger than a few
dozen gigs?" Only one or two hands remained up. "If it's smaller than that,
just save yourself the effort and use a flat-file instead."

15 years later and it's the same thing, plus a few orders of magnitude.

~~~
corresation
Interesting, but it isn't really similar advice. Indeed, it sounds like
absolutely _terrible_ advice (unless there is some context that is missing).

~~~
AndrewDucker
I suspect it should have been "Who's worked with databases too big to fit
entirely in memory".

Because if it all fits then you don't have to worry too much about performance
overheads. If it doesn't then you need to think about how you're going to
avoid table-scans and the like.

~~~
dalke
Many programs use SQLite in part because its ACID properties make it an
excellent way to save system state.
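
A minimal example of that pattern - an atomic, durable write of a bit of
program state (table and values are made up):

    import sqlite3

    con = sqlite3.connect("state.db")
    con.execute("CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, value TEXT)")

    with con:  # commits atomically on success, rolls back on error
        con.execute("INSERT OR REPLACE INTO state VALUES (?, ?)",
                    ("last_run", "2013-09-17"))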

A lot of people use databases to implement persistent user state on top of
stateless HTTP, even if the data itself is small enough to fit into memory.

This latter use was known even in 1997. For example, the book "Database Backed
Web Sites: The Thinking Person's Guide to Web Publishing" was published on
Jan. 1 of that year.

So even when that advice was offered, it was wrong.

------
zenbowman
For a lot of people, what the author says is absolutely right, but I think a
lot of the comments here suggesting that only a handful of institutions are
solving Hadoop-scale problems are simply inaccurate.

Yes, there are companies that are trying to appear more attractive by using
Hadoop, but there are plenty of cases where Hadoop is replacing ad-hoc file
storage on multiple machines.

Its primary use is as a large-scale filesystem, so if you are running up
against problems storing and analyzing data on a single box, and you feel that
the amount of data you have will continue to accelerate, it is a good option
for file storage. It doesn't replace your database, it complements it, and
there's work being done to allow large-scale databases on top of Hadoop,
although the existing ones aren't mature yet. But there's a lot of
institutions that are taking on problems that a single-box setup cannot
handle.

And MapReduce isn't a bad programming model, but it should be thought of as
the assembly language of Hadoop. If you are solving a particular problem on
Hadoop, writing a DSL for it is the way to go, or see if one of the existing
DSLs fits your needs (HIVE, Pig, etc).

------
dbecker
I'd love to understand how the Hadoop hype and marketing team generated so
much unwarranted interest in Hadoop.

I'm witnessing a feeding frenzy for Hadoop talent in situations where there's
absolutely no need for Hadoop, and I can't recall anything like this for any
other software.

~~~
meshko
I was with you until "and I can't recall anything like this for any other
software."

~~~
dbecker
Just for my curiosity, what other software had the same level of unwarranted
demand and hype?

~~~
stefanve
I don't think it's about software per se, but more about the idea of Big Data.
We have seen many ideas come and go; most of the time they start as a good
idea, but a good idea is not something you should use for everything.

From the top of my head

Ajax (very handy but a lot of times misused and misunderstood, or just DHTML)

No SQL (can be handy, but for some people it is like a religion)

XML (why use CSV when you are able to use XML)

Cloud (Oh by cloud you mean the internet?)

and many more :)

------
bborud
Hehe, most of the Hadoop installations I've seen chew through data amounts
that I used to process 10 times as quickly using dirty, rotten Perl scripts,
sort, cat and flat files :)

------
lmm
So the OP _claimed Hadoop skills_, the interviewer _asked him to use Hadoop_,
gave him a small example problem. He then _didn't use Hadoop_, and thinks
there's something wrong with the interviewer for objecting to this?

Interview problems are sometimes kinda artificial, no shit. Given the
impracticality of giving every candidate the kind of dataset Hadoop would be
needed for, how would the OP suggest an employer test for Hadoop skills?

~~~
Mikeb85
Thank you for saying this. Exactly what I was thinking when I read the
article...

------
mindcrime
On a related note, I find it amusing that people seem to think that "Big Data
== Hadoop". In actuality, there are plenty of other approaches to scaling out
clusters to handle large jobs, including MPI and OpenMP, as well as BSP (Bulk
Synchronous Parallel)[3] frameworks.

[1]:
[http://en.wikipedia.org/wiki/Message_Passing_Interface](http://en.wikipedia.org/wiki/Message_Passing_Interface)

[2]:
[http://en.wikipedia.org/wiki/OpenMP](http://en.wikipedia.org/wiki/OpenMP)

[3]:
[http://en.wikipedia.org/wiki/Bulk_synchronous_parallel](http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)

~~~
gfodor
Why do you find this amusing?

~~~
mindcrime
It's basically what w_t_payne said. It's amusing (in a sense) because it
betrays such a lack of understanding of what's actually going on, and such a
willingness to simply accept the "received wisdom" and not question it and
think independently.

~~~
dekhn
Some of us have backgrounds in MPI, BSP and other distributed computing
paradigms. I switched to high throughput computing and I can tell you that I
would never go back to MPI. People who use MPI almost always use it as a
crutch to avoid solving the hard problem that would make their computation a
lot easier.

As an example, I used to write MPI code for molecular dynamics simulators. The
goal was to use a large supercomputer to scale the computation up to longer
trajectories. But then, with cheap Linux boxes, it made more sense to just run
many simulations in parallel, because low-latency MPI-class networks cost much
more than the machines. We and others made the switch to approaches that
didn't need MPI: we ran many sims in parallel and did only loosely coupled
exchanges of data. In fact the coupling was so loose we would run for weeks,
dumping output into a large storage system and processing the data with
MapReduce.

When I look back at the people who still do the MPI simulations, they are
doing dinky stuff and can't even process the data they generate, because they
spend all their time scaling a code with MPI that didn't need to be scaled
with MPI.

~~~
mindcrime
Sure, and I'm not saying "MPI > Hadoop" or anything. Just pointing out that
Hadoop is not the only game in town, and is hardly the only way to deal with
"big data". I'm a Hadoop fan myself, but I also did a lot of MPI stuff in the
past, and I believe there are still scenarios where MPI makes a lot of sense.
I have less experience with OpenMP, but I think anybody planning a "big data"
project would be well served to at least go out and study up on _all_ of
Hadoop (and other M/R platforms), MPI, OpenMP, BSP, etc., before simply
choosing Hadoop because it's the framework du jour.

~~~
gfodor
Oftentimes choosing the framework du jour is the best choice just _because_
it's the framework du jour. Support, training, books, an active ecosystem, a
rich base of experienced developers to hire from, corporations incentivized to
fund further development, etc., all act as a hedge against a slight technical
mismatch between your requirements and the technology, compared with more
esoteric alternatives.

~~~
mindcrime
But what you're talking about here is still a _considered_ decision, based on
actual analysis and thought. It's not choosing a framework _only_ because it's
the framework du jour, but because of the second order effects of it being so.

------
rozim
See 'Nobody ever got fired for using Hadoop on a cluster'
[http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...](http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf)

------
jacquesm
Spot on. I recently audited a project that was using an over-the-top technical
solution for a problem that would, with only minor nuts-and-bolts work, have
fit easily on a single machine instead of on a cluster, and it would have run
much faster too. Demonstrating this made the case and they've since happily
converted. You can buy off-the-shelf machines with 256 GB of RAM at reasonable
(for large values of reasonable) cost, with IO speeds to match if you equip
them with SSDs.

Big Data to me means tens of terabytes at a minimum, and what big data means
changes over time, so today's big data will fit on the laptop of the day after
tomorrow.

------
paul_f
This can be a confusing topic. Hadoop is several things: a NoSQL data store,
MapReduce, and a global file system. NoSQL and MapReduce can be quite
valuable even on a single server; CouchDB runs on Android, for example.

If you don't need a global file system, use MariaDB, CouchDB or Mongo
depending on your use case.

------
naiquevin
I have no experience with Hadoop at all, and this may be slightly off topic,
but it reminds me of a post titled "Taco Bell programming"[1]. After reading
it I started learning and using Unix tools and commands much more than before,
instead of writing silly Python scripts for almost anything that needed
automation.

[1]:
[http://web.archive.org/web/20110220110013/http://teddziuba.c...](http://web.archive.org/web/20110220110013/http://teddziuba.com/2010/10/taco-bell-programming.html)

------
sitkack
Hadoop is the problem, MapReduce is not the problem. Having used both Hadoop
and Disco, I can say that Disco has been by far a net positive on every
project I used it on. And the overhead of coding in Disco vs. single-node is
about an extra 30 minutes: you can start with a working single-node version
and go multi-node without much effort.

[http://discoproject.org](http://discoproject.org)

Hadoop on the other hand is a huge, massive pain in the ass. And I am a Hadoop
consultant. I recommend that most customers NOT use it.

~~~
travisoliphant
This is 100% true. Hadoop gets way too much attention given the other useful
solutions that exist out there. I have known people to use Disco successfully
on clusters of several hundred nodes. You can also use Disco with IPython
parallel much more easily.

This is why we include it in Linux versions of Anaconda.

------
lgieron
The article focuses on data sizes and completely ignores per-row computation
time requirements. In my case, our dataset is just 1-2 TB, but we need
hundreds of cores to process it within a reasonable timeframe, hence Hadoop.

------
aaron695
Why is the title "Don't use Hadoop when your data isn't that big"

When the article is "Don't use Hadoop - your data isn't that big"

Two totally different points?

And I'm sure it was correct to begin with.

------
deathflute
Well-written article. I think most people who do not have a background in data
are unaware of the various options out there and fall for the marketing behind
Hadoop-like tools.

I would urge people doing analytics to take a look at kdb+ from Kx. Unless you
have ridiculously large amounts of data (> 200 TB), I can bet that you would be
better off with kdb. The only downside is that it costs a lot of money, which
is a pity.

~~~
pwang
You'd have to hire a team of people that can write good K. Since they're so
highly in demand in the niche area of finance, you'll be paying a very pretty
penny for them.

~~~
deathflute
I have been part of a team which used kdb heavily and this is not really true.
The best part of kdb is that it requires almost no administration and at the
end of the day becomes just another programming environment/tool for you to
use. Hiring a team of people who just do kdb is only done by banks or other
unimaginative large teams.

------
karl_gluck
I love how whoever mods this site now doesn't even have to follow their own
rule about not editorializing headlines.

Mods: Don't be hypocrites. If you're going to enforce your "only use the
source title" trash on us, follow it yourself.

Original title: "Don't use Hadoop - your data isn't that big"

Mod-invented title: "Don't use Hadoop when your data isn't that big"

~~~
adambard
I've also been sort of feeling like titles get rewritten just for the sake of
being rewritten these days. I understand the need to combat sensationalism,
but this sort of edit really takes away the writer's voice.

------
acidity
I have been looking into building a simple recommendation engine (I have at
most a million data rows) using Python. I looked into Crab
([https://github.com/muricoca/crab](https://github.com/muricoca/crab)) and it
seems not to have been updated for 2 years.

Any suggestions for libraries, or should I just use basic numpy/scipy and
implement the algorithms?

~~~
w_t_payne
Just implement the damn algorithms. Anything else is just another library to
learn, and another tool to babysit.
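(Concretely, "just implement it" can be a few lines of plain numpy; a rough
sketch with a toy ratings matrix and item-based cosine similarity, nothing to
do with Crab's API:)

    import numpy as np

    # rows = users, columns = items; 0 means "not rated" (toy data)
    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 0, 5, 4]], dtype=float)

    # cosine similarity between item columns
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0
    sim = (ratings.T @ ratings) / np.outer(norms, norms)

    # score items for user 0 as a similarity-weighted sum of their ratings
    user = ratings[0]
    scores = sim @ user
    scores[user > 0] = -np.inf               # don't re-recommend rated items
    print("recommend item", int(np.argmax(scores)))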

~~~
edraferi
Agreed. Take the Machine Learning class on Coursera if you need an
introduction to them.

~~~
acidity
Okay. I have already read about the algorithms and even implemented them
crudely while studying them.

Just not a big fan of NIH. But if that's the best way in this case, let it be
:)

------
jimbokun
Does anyone use Hadoop for job management?

We have millions of XML documents in a document database. Many of the
questions we want to ask about those documents can be answered through the
native database querying capabilities.

But there are always questions falling outside the scope of the query
capabilities, that could be answered by a simple map function applied to each
document, with a reduce to combine the results.

Seems like a pain to always query for the documents you want to process, find
some place to store them on disk, then run a program locally to get the
result, versus writing a map and reduce job and pointing it at the documents
in the database (this document store has a Hadoop integration API). Hadoop
also seems to have a lot of nice job frameworks, monitoring tools and APIs for
tracking job progress.

Anyone have a similar situation where you used Hadoop just to get job
management, tracking, and flexibility in performing data analysis tasks? Are
there easier ways to accomplish this goal?
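(For concreteness, the local, ad-hoc version I keep writing looks roughly like
this; the dump location and the per-document question are hypothetical:)

    # dump query results to disk, map over each document, reduce into one answer
    import glob
    import xml.etree.ElementTree as ET
    from collections import Counter

    def map_doc(path):
        # hypothetical per-document question: count elements by tag name
        return Counter(el.tag for el in ET.parse(path).iter())

    totals = Counter()                        # the "reduce": merge the counters
    for path in glob.glob("dump/*.xml"):      # hypothetical local dump of documents
        totals.update(map_doc(path))

    print(totals.most_common(20))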

~~~
Chico75
The company I work for creates a product which may help you:
[http://www.syncsort.com/en/Data-Integration/Products/DMX-h/D...](http://www.syncsort.com/en/Data-Integration/Products/DMX-h/DMX-h-Overview)

------
rcavezza
I understand and agree with the author's main point that many companies that
use big data technologies do not need them.

I do not agree that the tools are inferior to SQL. Hive is really close to SQL
and Pig is extremely powerful. I would take a look at a few of the recent
updates to these tools before declaring them inferior to SQL.

~~~
ssalbiz
Disclaimer: I've worked with and submitted a few odd patches to Hive. I have
not worked with Pig directly.

I think you ought to consider what you mean by inferior here carefully. If you
mean 'Hive QL can capture many of the common semantics of SQL' then sure.

If you mean just about anything else, you're wrong. The performance and
reliability characteristics of Hive/Hadoop are vastly different from, and very
easily inferior to, a MySQL or Postgres setup for small to mid-size datasets.

(That doesn't even get into ease of use. Anyone who's ever dealt with Hive's
dreaded 'error: return code: -9' can attest to how maddening Hive can be to
use.)

------
ianstallings
I used a big data option on my last project because marketing expectations
were gigantic and the hype around the project was also enormous. Looking back
it was a poor choice, because the expectations never panned out and we could
have saved some time and effort using a more traditional and well-known SQL
database like PostgreSQL. Before that I had a fairly large project with ~1M
user profiles running with no problems on PostgreSQL. I think it would have
handled the latest project with ease, and could have been sharded and scaled
when needed to handle the growth it's seeing now.

But marketing was insisting we use _big data_ because "regular" databases
couldn't handle such enormous possibilities. I'll never believe that nonsense
again. It wasn't a terrible ending but it was more hassle than it was worth
IMHO. At least I got some resume material out of it..

------
aheilbut
And even if it is, consider using Spark/Shark instead.

------
sologoub
This goes back to the old adage "the right tool for the job".

As @davidmr points out, there are jobs on smaller data sets that can still
benefit from the distributed nature of HPC.

That said, my own experience with startups echoes much more what the OP writes:
Python scripts and CSV processing have saved me days of headaches in
resource-constrained environments. I was able to quickly produce analysis and
make crucial scaling decisions using data that would otherwise have taken days
of engineering resources to produce. I happened to have the right tool for
that job handy, and it worked out great.

You really have to think things through before restricting yourself to any
specific direction.

------
iblaine
I guess I haven't been around enough small startups (fewer than 100 people),
because I hardly get the sense that people are haphazardly spinning up Hadoop
clusters. People generally pick the right tools for the right jobs. This is
particularly true in the data warehousing world.

That being said, I can see problems where people pick Hadoop without knowing
how it's going to integrate into their systems 1-3 years down the road.
Particularly with cloud computing these days, you can easily bring large
complex systems online with little effort. It's cool and scary at the same
time.

------
agibsonccc
Very good points. I liked learning how to use Hadoop academically, but it's
not an end-all-be-all tool.

If you want to do MapReduce, it's perfectly reasonable to just use something
smaller-scale on multiple cores for your data processing.

Another thing is real-time processing: something like Storm
([http://storm-project.net/](http://storm-project.net/)), or even
parallelism-based systems like Akka on the JVM, or Go, will give you adequate
performance. Hadoop has a lot of overhead not only in operations but also in
job startup.
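(A minimal sketch of that smaller-scale option: fan the map step out over local
cores with the standard library and reduce in-process. The file name and the
word-count map are just placeholders:)

    from multiprocessing import Pool
    from collections import Counter
    from functools import reduce

    def map_chunk(lines):
        # map step: count words in one chunk of lines
        c = Counter()
        for line in lines:
            c.update(line.split())
        return c

    if __name__ == "__main__":
        lines = open("input.txt").read().splitlines()   # hypothetical input
        chunks = [lines[i::8] for i in range(8)]        # 8 roughly equal chunks
        with Pool() as pool:
            partials = pool.map(map_chunk, chunks)      # run maps in parallel
        total = reduce(lambda a, b: a + b, partials, Counter())  # reduce step
        print(total.most_common(10))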

------
capkutay
This isn't directly related, but is Hadoop the only Java-based solution for
parallel computing? I've seen some examples of people attempting to do Java
MPI[0] again. It seems like dealing with the GC and memory management in
general has been an issue when trying to do high-performance computing in
Java, especially at a distributed scale.

0: [http://blogs.cisco.com/performance/mpi-and-java-redux/](http://blogs.cisco.com/performance/mpi-and-java-redux/)

------
calinet6
My company's data is that big, thank you very much.

But if yours isn't, sure, don't use a system designed for handling
astronomical data. Should be common sense, probably isn't.

------
CurtMonash
While the point in the headline is fine, the supporting reasoning is in places
dubious.

Don't use SQL for anything over 5 TB? Huh? You can put a lot more data than
that on a node with a nice open source columnar analytic DBMS, and of course
there are a lot of MPP relational analytic DBMS as well.

SQL on Hadoop requiring full table scans? Well, that's what Impala is for.
Hadapt is more mature than Impala. Stinger is coming on, and is open source.

Nice marketing line, however.

------
liranz
So true! I would 'up' it twice if I could.

Too many startups go over to Hadoop/NoSQL solutions before the overhead is
justified. SQL for most of the data, with a bit of Redis and numpy for
background processing, will take you much further than most people assume.

It's fun to think you must have DynamoDB, Hadoop or a Cassandra backend, but
in real life you'd be better off investing in more features (or analytics!).

------
fuziontech
I agree with this guy's main point about using the right tool for the job, but
he hates on Hadoop waaay too much. He even overlooks how simple you can make
building out MapReduce jobs in Python using something like MRJob, or just
using Hive if SQL is really your fancy. Hadoop has its place, and as with any
tool, it can be the hammer that makes everything look like a nail.

------
nfa_backward
The author is missing a big gap between 5 TB and 1 PB. For most workloads, I
would not look to Hadoop at the 5 TB+ scale of data. I would first look at
Impala or Redshift.

[http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...](http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/)

------
leif
More than 5 TB doesn't mean you need Hadoop either; it means you need a better
storage system like TokuDB/TokuMX or LevelDB. These technologies can index the
data so that you can run selective queries instead of reading all the data for
every query like Hadoop would, and they can compress it so that you can keep
everything on one disk a while longer.
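(The "selective query instead of a full scan" point is easy to see even with
sqlite3 from the Python stdlib standing in for TokuDB/LevelDB; the schema here
is hypothetical:)

    import sqlite3

    db = sqlite3.connect("events.db")                    # hypothetical dataset
    db.execute("CREATE TABLE IF NOT EXISTS events "
               "(user_id TEXT, ts INTEGER, payload TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_user ON events(user_id)")

    # the index means this touches only one user's rows, not the whole table
    rows = db.execute(
        "SELECT ts, payload FROM events WHERE user_id = ? ORDER BY ts LIMIT 100",
        ("u123",),
    ).fetchall()
    print(len(rows))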

------
antonmks
A big advantage of Hadoop is that you do not need to pay license fees, which
in the case of relational databases can reach into the tens of thousands of
dollars.

The big disadvantage is that Hadoop is two orders of magnitude slower than
relational databases. Also, Hadoop clusters are not what one would call a
"green" solution. More like a terrible waste of computing resources.

~~~
yummyfajitas
Postgres is free.

------
fiatmoney
Hadoop / MapReduce was invented for situations where the data is being
generated on the machines (e.g., via a distributed web crawl). If you're not
generating the data in situ and have to ETL it anyway, it makes just as much
sense to load the data onto one Monster Box with a terabyte of RAM and 48 CPU
cores. You massively save on complexity.

------
CmonDev
There is no problem with CV-driven development.

~~~
debacle
I think that's disingenuous. CV-driven development leads to really messy
nightmares, in my experience, for the same reason that a machete is a poor
tool for heart surgery.

~~~
CmonDev
Not hiring programmers older than 40 is also disingenuous. You are right of
course, but it does solve one problem: your employability. Nobody cares about
your perfect VisualBasic architecture.

------
TeeWEE
I sort of agree. However, when you think you will grow into the TB range, you
might as well do it in Hadoop right away.

We are using Hive and HiveQL and have SQL-like queries which generate the
correct output. The result is that we don't have to hassle with the Hadoop
mappers and reducers, and we can write our "queries" in a human-readable
fashion.

~~~
vonmoltke
Why would Hadoop automatically be the right tool for the job in this case? In
fact, when is any tool automatically the right tool for the job?

I understand the desire not to do unnecessary work and to plan ahead. However,
something like Hadoop is very heavyweight and requires a significant change to
the way you write your software. It isn't easy to undo if Hadoop turns out to
be the wrong direction. Thus, if you start there you are more likely to stay
there, even if "there" is not appropriate. On the other hand, starting with a
much simpler system and adapting it as your needs change may add a little more
work over the course of the project, but it saves a buttload of unnecessary
technical debt.

~~~
TeeWEE
I do agree with you. I prefer to keep it simple too, and not to plan ahead too
much, per the KISS principle. I'm not saying Hadoop is the solution all the
time; I'm just saying that sometimes it makes sense.

------
krosaen
Great points in this article, though using something like Cascalog makes
Hadoop suck a lot less, e.g. composable, more complex queries closer in power
to SQL. It wouldn't be quite as crazy to use on smaller datasets, be it just
for fun or to prove you are ready if/when your dataset grows large enough.

------
progx
Thanks! As always: use the right tool for the job. But many customers don't
understand this simple rule.

~~~
w_t_payne
The people who sign the cheques tend to be managers, who tend to be the sort
of person who got to where they are because they are busy and aggressive, and
as a result tend to be burdened with a packed, stressful, attention-depleting
schedule.

Regardless of the intelligence they were born with, as a result of all the
testosterone-fuelled busyness they lack the cognitive resources to think
deeply or in detail about the decisions they are making (and, ironically, as a
result they also lack the cognitive resources to realise just how
intellectually compromised they are).

The upshot of all this stress and fatigue is that we end up with decisions
dominated by groupthink, buzzwords and the sales pitch of the latest vendor to
stick his shoe in the door. Oh, and by whatever HBR and McKinsey said last
week.

------
lemmsjid
While there is a point to be made here, this article does not make it. Or
perhaps it goes too far in attempting to make it, to the point where I feel
like it might tip people in the wrong direction.

The point of the article is taken if:

A) Your data is not large.

B) You aren't creating large intermediary datasets with the data.

C) You aren't running an increasingly large number of analysis jobs on the
data.

D) Your computational overhead is small.

E) Your memory overhead is small (this requires an asterisk, because some
tasks that require extreme amounts of memory will not work well in Hadoop and
should be brought outside).

F) You don't need or want a system to track the increasingly large number of
analysis jobs you're running.

G) You can guarantee you won't outgrow A, B, C, which would force you to
rewrite all your code.

G is especially difficult because it's hard to predict. F is always
underestimated at the beginning of a project and bites you later. Yes, you can
write analysis scripts, but what happens when there are a hundred of them,
written by different developers? Time to write a job-tracking system, with
counters, retries, notification, etc. Like Hadoop.

To expand on D and E: there are workloads that are relatively straightforward
across terabytes of data, and there are workloads that are expensive over
gigabytes of data (especially those involving the creation of intermediate
indices, which is where MapReduce itself speeds things up considerably,
especially if done in parallel).

Also, in its critique of Hadoop the article obsesses over MapReduce (in a way,
conflating Hadoop and MapReduce, just as it conflates 'SQL' with 'a SQL
database'), ignoring the increasingly powerful tools that can be used, such as
Hive, Pig, Cascading, etc. Do those tools beat a SQL database in flexibility?
The answer is that the question is not really relevant. If you already
understand the nature of your data, and you've gone through the very difficult
act of designing a normalized schema that fits what you need, then you're in a
good place. If you have a chunk of data whose potential has not yet been
unlocked, or to which writes happen too quickly to justify the live indexing
implied by a database, then Hadoop is an essential tool. They really sit next
to one another.

None of this is to knock writing analysis scripts against local data. I do
that all the time. In fact often I'll ship data from HDFS to the local system
so I can write and run a script. I just think it's important at a company to
make sure your people have access to good tools so there aren't hurdles in
front of them, and when it comes to data analysis I've come to the opinion
that you really want to have a Hadoop cluster set up next to your SQL
databases and your other tooling, because it will become useful in sometimes
unpredictable ways.

Yes, if there are a few hundred megabytes in front of you and you need to
analyze them, then write a script, and were I interviewing someone for a job I
would not hesitate to accept a script that solves a data analysis task, so
clearly the people the author interacted with were somewhat myopic. But most
companies require that an ecosystem be built to handle the increasing
complexity that will ensue over the years. And Hadoop is a huge bootstrap to
that ecosystem, regardless of data size.

~~~
bjelkeman-again
Thanks. That put my thoughts on screen rather succinctly for me.

------
akanet
Perhaps this is the beginning of the resurgence of the "Small Data Expert"

------
trimbo
Another chance to plug my favorite command-line small-data toolset, Crush
Tools. I used it often at Groupon.

[https://code.google.com/p/crush-tools/](https://code.google.com/p/crush-tools/)

------
samspenc
Great article. I would say 1 TB is really the absolute minimum at which you
should start considering Hadoop; for anything less, use Python or variants
(for flat files) or MySQL or Postgres (for relational data).

------
rburhum
You would be surprised how much processing 400 GB of geodata sometimes needs:
7 days on a normal machine to do some non-trivial analysis of OpenStreetMap
data. Hadoop can reduce that to hours.

------
sgt101
Key point: Hadoop storage is bloody cheap compared with a SAN, as in roughly
$1k/TB vs. $15k/TB for enterprise storage, and it costs a _pittance_ per year
compared with licensing costs.

Only problem is backing the bugger up.

------
angeladur
Yes, the distinction is important: is it a "Big Data" problem or a "Big" data
problem?

------
mufumbo
Was that a job interview? You probably shouldn't use your favorite super-
scripting language in a job interview that was asking for a more robust
answer. They were maybe looking for someone to transform their hacky platform
into something more robust.

------
richardlblair
Actually, it is. Thanks for the blanket statement, though.

