Don't use Hadoop when your data isn't that big (chrisstucchio.com)
711 points by gcoleman 1074 days ago | 227 comments

Hooray! Some sense at last.

I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.

All of them wanted to feel like they were doing something special.

The sad thing is, they were all special, each in their own particular way, but none of what made each company magic and special had anything to do with the size of the data that they were handling.

Hadoop was required in exactly zero of these cases.

Funnily enough, after I left one of them, they started to build lots of Hadoop-based systems for reasons which, as far as I could fathom, had more to do with the resumes of the engineers involved than the actual technical merits of the case.

Sad, but 'tis the way of the world.

There are principal-agent problems in a lot of companies.

CEOs want to run bigger companies, even if it's not best for shareholders.

Managers want to run large mini-empires, even if it's not best for the corporation.

Engineers like to use the latest tools, even when they're not best for the problems at hand.

Solving this requires well-respected leadership that knows enough about the details, and knows which battles are worth fighting.

Yes I've worked for a couple of outfits that did this over the years. It's usually down to who sold them the solution though. It's worst if the salesperson is external.

The funniest has to be the engineering company which I used to work with (not for thank fuck) back in the 90's (when 100Gb was big data!). They managed to bag 2x full 42U racks full of HP N-Class HPUX UNIX kit, PDU systems and separate disk arrays for just over £1,000,000 supplied with one full time UNIX monkey to keep the plates spinning.

Turns out they had only 10 CAD/engineering staff and about 8Gb of data in an Oracle DB which a single Sun Ultra desktop (<£10,000) could have handled without a problem at the time.

After I moved from academia/gov't labs to private industry, I was absolutely flabbergasted at the waste. Many companies literally just buy whatever their vendors throw at them to solve a problem, and if a reasonably intelligent techie spent just an hour or two thinking through the problem, they'd realize that the solution was either far less complicated than the proposed kit or far more; either way, what's proposed is almost never right.

Another thing I haven't seen much of is acceptance testing. This is actually more surprising to me than the reliance on vendors for problem solving. It's an absolute no-brainer to say "I have problem X, which will be solved by solution Y, and the success will be measured by the set of tests Z. Unless the above happens, we will not pay the bill and you will take the hardware back." The vendors looked at me like I was from Mars when I started suggesting these.

I have not since complained about government waste in research computing budgets.

Outsourcing your own teams means there are no in-house experts to perform the oversight role you've described. Thus the principal-agent problem is amplified.

Yeah... it looks like my employer is going down the same route with an EMC Isilon NAS system.

We really just want high throughput for linear scans; possibly with a small amount of parallelism, which would argue for a fairly simple solution with as little caching as possible (ideally none whatsoever) ... but I cannot even tell (from the website & the literature that I can access) how these things are supposed to operate -- which is a worrying sign in itself.

Previously, I got good performance out of a SAN using a handful of NexSan SataBeasts -- which I would have recommended to use again, if anybody had bothered to ask.


Isilon is nice - expensive (it is EMC), but it gives good performance and is reliable.

When in the 90's? I'd expect it would have taken the full 2x 42U racks to keep the whole dataset in memory at one time. Not to say it's required, but that doesn't sound totally crazy for the time period.

Also, having an order of magnitude more hardware than you ought to need is totally understandable when hardware capability is increasing by an order of magnitude faster than hardware replacement cycles. It's when you're off by several orders of magnitude that it starts to become suspect.

1998. It had 4Gb of RAM in each node which wasn't that scary. Most of it was UPS and disk.

I had a Sun 1000E [1] for 3 years as a workstation from 1996. 8x 50MHz CPU, 2Gb of RAM, 16Gb disk array. It was made in '93 and was approx 12U for the unit and array. Things weren't that big in the 90's. 80's yes...

[1] http://en.m.wikipedia.org/wiki/Sun4d

Agreed, but I'm not optimistic that the view will become widespread.

There was a phrase I learned long ago: "if it fits into cache it's not a supercomputing problem."

This essay promotes the related view: "If it fits on a desktop, it's not Big Data."

BTW, you might like Greg Wilson's 2008 essay "HPC Considered Harmful". I can't find the slides anymore, but there's a lovely interview with him about this at http://itc.conversationsnetwork.org/shows/detail3682.html .

Just because it's easy to lose track, I just checked, and "Stereotypical Grandma walks into Best Buy and buys their best desktop because she doesn't know any better" "(and probably got bamboozled by the sales guy)" ends up with Grandma having 16GB of RAM for $900.

Tricky catch 22. You want the kind of employees that are interested in what they do enough to want to learn new skills and try new exciting technologies. But, you want them to actually use something boring.

There's nothing boring about pandas

It's even more exciting mixing pandas with the IPython notebook.

heh. Until you actually have to repeat something! It's quite useful for iterative refinements like tweaking pyplot graphics, but it's all too easy to lose track of what versions of which blocks executed to create the current state.

Maybe I just need more discipline..

Walking on a highwire is also exciting. :)

Until you run out of memory.

That's when the fun starts!

that hardly makes it boring... it's just inappropriate for some kinds of task

I agree with the general thrust of this article. But hadoop isn't just for scaling up the absolute size of the data set. It is also useful for scaling up the absolute amount of CPU power you can throw at a problem. If I have 1 GB data set, but the computations that I need to do on that data set are complex enough that it would take a single machine a long time to do them, then hadoop is still useful. I gain tremendously by being able to fire up 100 extra large EC2 servers and run my computation much more quickly than I could with SQL or Python on a single machine.

Now some might counter this point with the observation others have made here that using hadoop imposes a ~10x slowdown. But even then, my 100 EC2 servers will get the job done 10x faster. Running a job in 1 hour with hadoop is MUCH better than running the same job in 10 hours without it, especially when you're doing data analysis and you need to iterate rapidly.

So there is a point where using hadoop is not productive. But that limit is not 5 TB and depends on a lot more variables. Oversimplification makes for catchy blog posts, but is rarely the way to make good engineering decisions.
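For anyone who wants to plug in their own numbers, here is a back-of-envelope sketch of the trade-off described above. The function name and the perfectly-linear-scaling model are assumptions of mine, not anything Hadoop provides; real clusters scale worse than this, so treat the result as a best case.

```python
def cluster_wall_time(single_machine_hours, nodes, overhead_factor):
    """Estimate wall-clock hours for a job spread over a cluster.

    Assumes perfect linear scaling across nodes (optimistic) and a constant
    framework overhead multiplier, such as the ~10x figure quoted above.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return single_machine_hours * overhead_factor / nodes


if __name__ == "__main__":
    # A 10-hour single-machine job on 100 nodes with a 10x framework overhead.
    print(cluster_wall_time(10, 100, 10))  # 1.0
```

The break-even point under this model is simply nodes == overhead_factor; below that, the single machine wins.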

In that case, your IO problem is easy, because it's small with regard to CPU time. You can get away with putting the whole data in sql databases, and/or making multiple copies of your data. Then you can use as many workers as you want, with usually simple partitioning logic.

Originally I did just that, but ultimately decided to move to Hadoop. When combined with Amazon EMR, launching an arbitrarily large cluster is just a few clicks. You can then monitor progress, have robust cluster-wide error handling, and your data gets nicely merged into output files in S3 (not so easy with the home-baked solution).

We've had a lot of success with EMR as well - we have an hourly Pig job that produces data for our analytics database. It's not a particularly complex script, but our traffic volume is unpredictable so it's reassuring to know that we can add resources to a slow job and have it finish faster.

The downside of EMR is that it can be fairly expensive once you start needing the beefy machines. We're lucky that we can afford to have our analytics delayed an hour or two and can thus run on Spot instances (except for the Master node). When we move to a streaming architecture I'm not sure EMR will still be competitive, since we won't be able to have those machines go away on us.

Edit: clarity.

If you only look at CPU time, then yes, maybe you could do that. But there are many more factors at play.

I can understand this use case. e.g. you have some 100 parameter financial instrument that can only be valued with monte-carlo methods. You could have 1 GB of data, but many core years of computation.

In this case, is Hadoop actually a good tool? When you have such a small amount of data, it is unlikely you will ever be faced with a "reduce" challenge, and thus the problem is better solved with a simple queue rather than a hadoop cluster.
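A rough sketch of what the "simple queue" approach can look like for a small-data, CPU-heavy Monte Carlo job. This is a toy, not a valuation model: estimating pi stands in for the expensive computation, the function names are made up for illustration, and a local process pool stands in for whatever task queue or scheduler you would use across machines.

```python
import random
from multiprocessing import Pool


def simulate_chunk(args):
    """Run one independent batch of Monte Carlo trials and return the hit count.

    Toy stand-in for an expensive valuation: counts random points in the
    unit square that fall inside the quarter circle.
    """
    seed, trials = args
    rng = random.Random(seed)  # per-task seed keeps tasks independent and repeatable
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(hit_counts, trials_per_task):
    """Combine per-task results; this sum is the entire 'reduce' step."""
    total_trials = len(hit_counts) * trials_per_task
    return 4.0 * sum(hit_counts) / total_trials


if __name__ == "__main__":
    trials_per_task = 100_000
    tasks = [(seed, trials_per_task) for seed in range(32)]
    with Pool() as pool:  # one worker per local core by default
        hit_counts = pool.map(simulate_chunk, tasks)
    print(estimate_pi(hit_counts, trials_per_task))
```

Each task needs only a seed and a trial count, and the results are a handful of integers, which is why there is no real shuffle or reduce problem to solve here.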

This assumes that it's easier to manage 100 servers using Hadoop than, say, using the same 100 EC2 servers with a normal grid scheduler, task queue, or even something like GNU parallel.

It's orders of magnitude easier to start 100 servers and reuse basic Unix skills than it is to setup and manage Hadoop on the same infrastructure.

does your grid scheduler checkpoint intermediate work automatically, so that node failures have small impact?

Yes, although you have control over that kind of thing: this is a problem HPC groups have been tackling for decades so there's a ton of prior art as well as options for whatever makes sense for your application (e.g. if you're doing some sort of shared memory simulation you have to restart every instance, not just the one which failed).

We use Hadoop/Hive/EMR for a sub-TB dataset for exactly this reason. The flip side is that I've found the ecosystem to be pretty flakey (specifically Hive). Debugging code you wrote yourself is pretty different to working your way through failures on a distributed system.

The point of the article that resonates with me is how frequently a technology that is poorly fit with a problem domain is selected because of conventional wisdom rather than data.

Related, it is remarkable how we developers routinely cite Knuth's advice about premature optimization to justify our decision when the shoe fits, and then turn around and flatly ignore the advice when it doesn't fit.

Selecting Hadoop before you have a specific and concrete need for it--or see that need approaching rapidly on the horizon--is in my experience often and surprisingly coupled with a disdain for other performance characteristics (because Knuth!). The developers prematurely selecting Hadoop as their data management platform will routinely be the same developers who believe it's reasonable for a web application with modest functionality to require dozens of application nodes to service concurrent request load measured in the mere thousands. The sad thing here being that application platforms and frameworks are not all that dissimilar; today, selecting something with higher performance in the application tier is relatively low-friction from a learning and effort perspective. But it's often not done. Meanwhile, selecting Hadoop on the data tier is a substantially different paradigm versus the alternative (as the article points out), so you have some debt incoming once you make that decision. And yet, this is done often enough for many of us to recognize the problem.

In my experience, for a modest web application, it's better to focus resources and effort on avoiding common and stupid mistakes that lead to performance and scale pain. Selecting Hadoop too early doesn't really do a whole lot to move the performance needle for a modest web application.

Trouble is, many web businesses are blind to the fact that they are a modest concern and not the next Facebook.

>Selecting Hadoop before you have a specific and concrete need for it--or see that need approaching rapidly on the horizon

the great achievement of Hadoop isn't performance, throughput, etc... (and in many cases it can actually be worse than alternatives).

The great achievement is that it is an easily (i.e. by orders of magnitude) accessible (i.e. easily installable, configurable, scalable, supportable, programmable) cluster for the "mere mortal" enterprise. Before Hadoop, any clusterable software (where even 2 nodes, laughable by Hadoop standards, were already priced at the cluster level) would cost an arm and a leg, and 4 or 8 nodes would be considered a huge cluster. Hadoop gives an enterprise (not CERN :) a chance to taste unconstrained (loose use of the term) distributed computing, and the next generation of software is already coming - Impala, etc...

And obligatory - "nobody gets fired for using Hadoop".

"Related, it is remarkable how we developers routinely cite Knuth's advice about premature optimization to justify our decision when the shoe fits, and then turn around and flatly ignore the advice when it doesn't fit."

It is especially bad when people use Knuth's advice to forgo the conception part where you think about what technologies you will use and why.

> the same developers who believe it's reasonable for a web application with modest functionality to require dozens of application nodes to service concurrent request load measured in the mere thousands

I'm assuming this is a dig at Rails and Django? What are you suggesting instead?

Just about anything on the JVM (Dropwizard, Play, Finagle, Scalatra, Rest-Express, Rest-Easy, Compojure, Unfiltered, Jersey, Vert.x, Spark), Go (Gorilla, Beego, Revel), Lua (Lapis), Haskell, Erlang...

Are any of these environments actually proven in more than a handful of real-world production environments?

I'd certainly like to get into alternatives but my impression was that none of these are battle-tested the way Rails and Django are.

There is a reason that the largest Internet companies are still running their infrastructure on Java and C++. It scales and operations people know how to manage it.

Rails is the classic example of optimizing for developer ease of use instead of performance. Which probably makes sense for startups, but scaling is a pain in the ass.

Openresty is great for infrastructure level solutions, e.g. request routing and authentication. It is used by some of the largest Internet companies in China.

The benchmarks at http://www.techempower.com/benchmarks/ are a reminder of how poorly some popular frameworks perform. And how poorly cloud performs vs relatively cheap dedicated hardware.

I assume you mean battle-tested from a security standpoint.

I suppose you'd need to make that judgment call yourself. In my opinion, many of the options I listed have seen sufficient production usage that I consider them roughly as secure as any mainstream framework. For example, Play and Scalatra are used at Netflix, BBC, LinkedIn, etc.

Many expressly avoid notoriously dangerous (and frankly alarming) deserialize-user-input-and-execute patterns, which is in my experience the most common form of framework vulnerability.

As for performance, scalability, and reliability under pressure, that is precisely the point I was making earlier: higher-performance platforms combined with frameworks designed for performance and scalability are targeting grace under pressure more so than legacy frameworks. Higher performance gives you greater peace of mind with availability, responsiveness, and user experience.

Eh, I just got bitten by a Python framework which could only do about 400 rps of "Hello World" on my laptop. Pure python could be pushed to 14000 with gevent and 4 processes, while a pure Java servlet easily got 100krps, async java around 140krps, haskell 200k with 4 workers. Out of the box Java servlet was better than fastcgi nginx. So I don't see what you're getting at.

Did you happen to try with C?

99% of the cases I have seen where people have been working with tables that are in the 5+ TB range for analysis, there is some obvious way to compress the data that they have overlooked. Most analysts find some way to aggregate a dataset once, then do actual work on that aggregated dataset, rather than the raw data. In geospatial analytics, for example, a trillion records can be aggregated down to census blocks/block groups so you only have a few million records to deal with. The initial aggregation often takes several days, but after that you can calculate most things in a few seconds with reasonable hardware.
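A minimal sketch of the aggregate-once pattern, using only the standard library. The record layout (a block identifier plus one numeric value) and the function names are assumptions for illustration; a real geospatial pipeline would first have to map coordinates to census blocks.

```python
from collections import defaultdict


def aggregate_by_block(records):
    """Collapse raw (block_id, value) records into a per-block (count, sum).

    `records` can be any iterable, including a generator that streams rows
    from disk, so the raw data never has to fit in memory.
    """
    totals = defaultdict(lambda: [0, 0.0])  # block_id -> [count, sum]
    for block_id, value in records:
        entry = totals[block_id]
        entry[0] += 1
        entry[1] += value
    return {block: (count, total) for block, (count, total) in totals.items()}


def block_mean(summary, block_id):
    """Answer a query from the small summary instead of rescanning raw data."""
    count, total = summary[block_id]
    return total / count


if __name__ == "__main__":
    raw = [("block-A", 2.0), ("block-B", 10.0), ("block-A", 4.0)]
    summary = aggregate_by_block(raw)
    print(block_mean(summary, "block-A"))  # 3.0
```

The expensive pass happens once; every later question that can be answered from counts and sums runs against the summary.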

In addition to compression, let's not forget sampling. For a lot of problems a random subset of the data will give you a statistically meaningful answer with sufficient precision. It seems like the rise of "big data" has led to the assumption that all queries have to be run against the entire dataset.
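As a small illustration of the sampling point, here is a sketch that estimates a mean from a simple random sample and reports a standard error alongside it. The function name is hypothetical, and it assumes the quantity of interest is a plain average; as noted in the reply below, this does not help for per-user results such as recommendations.

```python
import math
import random


def sample_mean_with_error(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, standard_error). The finite-population correction is
    ignored, so the reported error is slightly conservative.
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    return mean, math.sqrt(variance / sample_size)


if __name__ == "__main__":
    population = list(range(1_000_000))  # true mean is 499999.5
    estimate, stderr = sample_mean_with_error(population, 10_000)
    print(estimate, stderr)
```

Here 1% of the rows is enough to pin the mean down to within a fraction of a percent, and the standard error tells you whether you need a larger sample.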

True. The issue with "Big Data" is that sometimes, especially when you need to produce personalised recommendations, the sampling doesn't cut it (or at least produces sub-optimal results).

Exactly - encoding is the key.

I just gave a presentation last week at the SF Python Meetup about how AdRoll uses a single server to query terabytes of compressed data with sub-minute latencies, using a system that is implemented in Python:


This is probably a thousand times less resource intensive than using Hadoop for the same queries.

Compression and out-of-core are not nearly discussed enough in the "big data" circles. The key challenge is efficiently using bandwidth-limited channels, whether it's a network link between nodes, the SATA bus between disk and RAM, or the memory channel between RAM and L3.

It's also why we are both building compression into the native storage format for Blaze, and why Blaze is designed to run out-of-core from the start.
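A minimal sketch of the out-of-core idea, using only the Python standard library: the data stays gzip-compressed and is decompressed incrementally while scanning, so memory use stays flat regardless of dataset size. This is not how Blaze or the AdRoll system works (neither is described in enough detail here); the helper names are hypothetical and the in-memory buffer stands in for a file on disk.

```python
import gzip
import io


def write_compressed(values):
    """Build a small gzip-compressed, newline-delimited dataset in memory."""
    buffer = io.BytesIO()
    with gzip.open(buffer, "wt", encoding="ascii") as handle:
        for value in values:
            handle.write(f"{value}\n")
    buffer.seek(0)  # rewind so the buffer can be read from the start
    return buffer


def streaming_sum(compressed):
    """Sum one integer per line, decompressing incrementally while reading."""
    total = 0
    with gzip.open(compressed, "rt", encoding="ascii") as handle:
        for line in handle:  # only a small window is held in memory at once
            total += int(line)
    return total


if __name__ == "__main__":
    data = write_compressed(range(100_000))
    print(len(data.getvalue()), "compressed bytes")
    print(streaming_sum(data))
```

Because fewer bytes cross the slow channel (disk or network), a scan like this can be faster than reading the uncompressed data, at the cost of some CPU for decompression.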


In the cases where I've had to use Hadoop, this was exactly what happened. We received a huge volume of data daily, which was aggregated by a nightly Hadoop job into a much more manageable amount of data for us to analyze with python/pandas.

I've found compression to be one of the big benefits of Hive / Impala; we are able to load and manipulate compressed data very quickly compared with our friends using an RDBMS who have to uncompress and then import the data. This shortens the cycle time on some analytics challenges by days (typically we get data in the 3-8TB range) where we can answer simple questions in a few hours (and senior management lose focus immediately afterwards).

While I couldn't agree more with the general point of the article, I have some small additional comments.

Just as a bit of background, I think that Chris would very much agree that I am not the intended recipient of this advice, and so my comments probably aren't keeping in the spirit of the article. I've spent the last 10+ years exclusively in very large HPC environments where the average size of the problem set is somewhere between 500TB and 10PB, and usually much closer to the latter than the former.

I think that, for the types of problems Chris mentions, for small data sets, hadoop is as silly a solution as he claims, and for the large map-reduce problem set (divide and conquer using simple arithmetic) of 5TB+, he's clearly in the right. Periodically I peruse job postings to see what is out there, and I'm personally ashamed at what many people call "big data". But just because your problem set doesn't fit the traditional model of big data (incidentally, I'm having trouble thinking of a canonical example of big data. perhaps genome sequencing? astronomical survey data?) doesn't mean that a) hadoop is not the right solution, or b) it's best done on a box with a hard drive and a postgres install, pandas/scipy, whatever.

We'll be generous and say that a 4TB drive can do 150MB/s. A single run through the data at maximum efficiency will cost you ~8 hours. Since we've restricted ourselves to a single box, we're also not going to be able to keep the data in memory for subsequent calculations/simulations/whatever.

Take for example a 4TB data set. By definition it fits on a 4TB hard drive, but if your problem involves reading the entire set of data and not just the indexes of a well-ordered schema, you're still going to have a bad time if you want it done quickly. And if you have a parameterization model that requires each permutation to be applied through the entire sequence of the data, rather than to chunks you can load into memory and then move on from, you're going to have a really bad time.
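The scan-time arithmetic for the 4TB case can be checked with a few lines. This is a sketch under the stated assumptions only: decimal units (1 TB = 10^12 bytes), a sustained 150MB/s sequential read, and no caching between passes. The function name and the `passes` parameter are mine.

```python
TB = 10**12  # decimal terabyte, as drive vendors count
MB = 10**6


def full_scan_hours(dataset_bytes, throughput_bytes_per_sec, passes=1):
    """Hours needed to read the whole dataset sequentially `passes` times."""
    seconds = passes * dataset_bytes / throughput_bytes_per_sec
    return seconds / 3600


if __name__ == "__main__":
    # One pass over 4TB at 150MB/s.
    print(round(full_scan_hours(4 * TB, 150 * MB), 1))  # 7.4
    # A parameter sweep where each of 10 settings needs its own full pass.
    print(round(full_scan_hours(4 * TB, 150 * MB, passes=10), 1))  # 74.1
```

One pass is a long afternoon; a modest parameter sweep is already three days on a single spindle, which is the point being made about IO characteristics mattering more than raw size.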

I suppose all of this is to say that the amount of required parallelization of a problem isn't necessarily related to the size of the problem set as is mentioned most in the article, but also the inherent CPU and IO characteristics of the problem. Some small problems are great for large-scale map-reduce clusters, some huge problems are horrible for even bigger-scale map-reduce clusters (think fluid dynamics or something that requires each subdivision of the problem space to communicate with its neighbors).

I've had a quote printed on my door for years: Supercomputers are an expensive tool for turning CPU-bound problems into IO-bound problems.

"Big Data starts at 1.5 PB [because that's what fits in memory on Blue Waters]" -- Bill Gropp

On today's top HPC installations, it currently takes about an hour to read or write the contents of memory from/to global storage. If the workflow has broad dependencies (PDEs are especially bad, but lots of network analysis also fits the bill), it's much better to use more parallelism to run for a shorter period of time with the working set in memory. If you can't fit your working set in memory on the largest machine available, chances are you should either find time on a bigger machine or you can't afford to do the analysis. (The largest scientific allocations are in the 10s of millions of core hours per year, which would be burned in a few reads and writes on a million-core machine.)

Also note that MapReduce is not very expressive. When IBM's Watson team was getting started, some people suggested using MapReduce/Hadoop, but the team quickly concluded it was way too slow/constraining and instead opted for an in-memory database with MPI for communication.

1.5 PB is pretty huge for most people outside of supercomputing. The biggest dataset I ever worked with fell some way short of that (It was approaching 1PB, I think, although we were not too sure how much data we had exactly, and it might have been slightly over, depending on how you measured). Even handling small(er) working sets could be a challenge in organisation, since we only had a small compute cluster (& associated infrastructure) to work with.

"Supercomputing" has never really been about big data. I used to work in supercomputing (NERSC) and have used most of the major machines in the field in the past. The supercomputer centers claim they're moving lots of data around, but if you look closely, it's almost entirely message passing during a calculation. And in most cases, the calculations they are doing are just wasted resources- simulating a protein over one long trajectory. Supercomputers, as compared to high throughput clusters, are just wasted money.

It's not a waste of resources, it's just a different approach to solving a problem. Hadoop / "big data" clusters make the problem harder to solve (and probably even restrict the types of problems that can be solved) in exchange for cheap hardware. Supercomputers give engineers the ability to solve problems in a traditional manner, while moving the costs over to the hardware.

Nobody solves problems on supercomputers "in the traditional manner". You have to completely rewrite your application around message passing.

On the other hand, Hadoop also serves as a lot more than a computing platform for "big data". The OP probably never worked in a mid/big sized company where sharing data is a big issue. HDFS can easily replace a bunch of high maintenance and under performing NFSes. Actually, I think someone should write a Samba clone for HDFS.

The next huge thing in the Hadoop ecosystem is Hive. It's like having an unlimited Postgres server that hundreds of people can work on at the same time. Hive's got its quirks and developers need to be educated on how to write the Hive jobs properly, but it gives a lot more back by allowing joins between a lot of datasets that might not be able to fit in one single machine.

There is a reason why Hadoop is so popular, especially in enterprises. I think that's why Cloudera got such a high valuation.

There's a connector between Hadoop and the Gluster network file systems [1], giving a unified view of a large, distributed, POSIX compatible filesystem, which can be served up through Samba. We vastly prefer it to NFS, which we find to be far more flaky, slower, and when NFS fails it fails in really really painful ways. (Though looking again, I'm not sure if you meant NFS or NAS...)

[1] http://gluster.org/community/documentation/index.php/Hadoop

I wrote a different reply but deleted it. I'm curious about something: if, as you claim, supercomputers are wasted money, why are there so many of them? Have all the world's top supercomputing sites somehow colluded to convince all the world's largest governments that they're useful?

At this point, large capability supercomputers that solve problems we don't need solved are mostly kept around as trophy pieces with a few "show me" calculations. Note that when the Chinese obtained the leadership in TOP500 a lot of people got worried (as they did when Japan did it with the Earth Simulator). And so the US may spend a bunch of money to get on top of that list again. Big deal. It's LINPACK. Fortunately, TOP500 should eventually introduce better codes to benchmark.

I'm sure there are scientists who put their all into getting their code to run on a supercomputer, then apply for a pittance of time and sit in a queue for weeks, but they're not publishing as many (or as interesting) papers as the ones who are pulling out their credit card and building mini-supercomputers on Amazon that rely on conventional interconnects and better algorithms and data processing tools.

Don't assume I'm ignorant about what supercomputers are used for. I used to work on supercomputers; that includes writing, running and evaluating codes, and selecting proposals on some of the largest machines in the world (at the time).

But these new approaches to doing science and engineering on large computer systems have obsoleted conventional supercomputing for all but an extremely limited set of computational problems. And that set of problems becomes smaller when clever computer scientists figure out smarter ways to run things on cheaper architectures. For example, page rank is a classic eigenvector problem; you can solve it by building a big matrix on your supercomputer and doing the appropriate calculations, using 50+ years of numerical optimization (but not really very good support for nodes failing between checkpoints). However, you can also implement it as an iterative mapreduce. The mapreduce checkpoints every little bit of map work, and along the way, handles those failures quite well. It can also handle data sets larger than the sum of RAM on the machines quite well.

Guess which one works better operationally, scales to a larger data set size, and ports to a lot of architectures cheaply?
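For readers who have not seen PageRank written in map/reduce form, here is a single-process sketch of the iteration described above. It only illustrates the shape of the computation: there is no distribution, checkpointing, or failure handling here, the function names are made up, and it assumes every page has at least one out-link (no dangling nodes).

```python
from collections import defaultdict


def map_contributions(node, rank, out_links):
    """Map step: a page splits its current rank evenly among its out-links.

    `node` is the input key; it is unused here but kept to mirror the usual
    (key, value) shape of a map function.
    """
    share = rank / len(out_links)
    return [(target, share) for target in out_links]


def reduce_rank(contributions, num_nodes, damping=0.85):
    """Reduce step: combine a page's incoming contributions into its new rank."""
    return (1.0 - damping) / num_nodes + damping * sum(contributions)


def pagerank(graph, iterations=50, damping=0.85):
    """Run a fixed number of map/reduce rounds over an adjacency dict."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # the "shuffle": group values by key
        for node, out_links in graph.items():
            for target, share in map_contributions(node, ranks[node], out_links):
                grouped[target].append(share)
        ranks = {node: reduce_rank(grouped[node], n, damping) for node in graph}
    return ranks


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(web).items()):
        print(page, round(rank, 3))
```

Each round touches every edge once and keeps only one rank per page between rounds, which is what makes it easy to checkpoint and to spill to disk in a real MapReduce setting.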

As far as I understand, you're saying that high-speed interconnects are largely a waste of time? (and apparently that doing protein MD is a waste of time, but that's a scientific question that I'm going to disagree with you on). I can imagine that might be the case for some simulations, but how do I do, for example, a parallel FFT without significant communication?

I personally see it more as there's a pretty limited set of computational problems which are low-communication, often requiring rather gross approximations, even with the "clever computer scientists" looking at the problem, and they're often not the problems scientists actually want to solve, and so dismissing supercomputing is rather ridiculous.

High speed interconnects are not worth the extra cost. They just compensate for poor programming.

With regards to protein MD, no, I'm not saying protein MD is a waste of time. In fact, I run the Exacycle program at Google which has run the largest MD simulations (many milliseconds) ever done, and the results are quite good. You would never have gotten results as good as ours on a supercomputer- even the world's largest, with the most highly scaling MD codes.

I speak from experience- I used to code for supercomputers, in fact working on protein (and DNA/RNA, my interest) structural dynamics, with some of the leading MD codes.

I helped port a better approach to Google's infrastructure some time ago. It's not a clever hack or gross approximation: https://simtk.org/home/msmbuilder it's a distinctly better way of modelling protein dynamics than running long single trajectories on supercomputers.

> High speed interconnects are not worth the extra cost. They just compensate for poor programming.

Sorry, that's complete rubbish. Some problems just can't be made embarrassingly parallel or near so. You might be able to swap a problem for a similarish problem with better characteristics, but that's not the same as being a "poor programmer" since you're now using a different model rather than optimizing the implementation of the existing one. You still haven't explained how I do a parallel FFT without high speed interconnects -- I suspect the answer is "don't". I've tried running the stuff I do (DFT) over a 10GigE network on my cluster and it ran at about 10% of the speed of an Infiniband-enabled calculation. There are methods for getting better scaling, but they all (as I said) involve rather crass approximations which you don't always want to do. Those methods will probably increasingly be used more on large supercomputers due to their superior scaling characteristics, even with their nice high-speed interconnects, but you're still making a sacrifice in accuracy.

It might be possible to get good low-communication scaling for some models, as you were in the case of your sampling system (which typically parallelizes nicely, but only if the individual sampling jobs can fit on a single node), but you can't extrapolate that to assume that everyone can. As I said, exchanging a poorly-scaling model for a different, non-equivalent well-scaling model is a scientific question with tradeoffs, not a programming one.

Are you sure you need to do an FFT?

Are you sure there is not a network friendly FFT algorithm?

I used to think very differently about computing before I read Google's MR, BigTable, and GFS papers. After joining Google and working on problems like this, I can assure you that FFTs can indeed scale quite nicely.

note that infiniband isn't a high speed interconnect. it's a cheap commodity interconnect- in fact, per port, it can be cheaper than 10GB. but lots of effort has been put into making the MPI libraries work efficiently over it, compared to 10GbE, so codes run more efficiently (as you observed).

DFT seems like the next thing that's going to not need supercomputers. I see some nice new algorithms coming on line designed for Amazon cloud infrastructure.

> note that infiniband isn't a high speed interconnect. it's a cheap commodity interconnect- in fact, per port, it can be cheaper than 10GB.

Why do you think that an interconnect can't be both cheap and high-speed? As you said, there has been a lot of work to make MPI efficient on infiniband, but that has been possible because infiniband offers much lower (~10 times) latency than 10GbE, not because no one has optimized MPI for ethernet. In fact, standards like RoCE and iWarp have been devised by Ethernet working group to compete with infiniband on this particular metric.

actually, wire latencies for 10GbE are much better than you think- they are basically the same.

Anyway, my point was that infiniband is not a high speed interconnect, and it's cheaper than 10GbE. The challenge is to deliver scaled infiniband, which is far harder than scaled ethernet.

I'm not disagreeing with you, but the argument of "people are using it and the government keeps paying for them" is hardly convincing.

It's not just the government that buys these things. Oil and gas, heavy manufacturing, automotive, genomics, financial, etc. all have them. The poster I was replying to appeared to be under the gross misconception that all parallel problems were more effectively solved with high throughput clusters.

Can you point to a genomics company that uses a supercomputer? And really claims that its unique capabilities are necessary?

Remember, the human genome was assembled on two different architectures: a Dec Alpha cluster with large memory with fast network (for the day) and a bunch of cheap linux machines with slow network. It turns out you didn't actually need the former, although the people at Celera insisted you did.

Genetics is normally very very data parallel. And often the ideal case for throughput computing. But there are rare times when doing modelling that a lot of chatter is required over interconnects. But even here people normally just get 4-6TB RAM machines instead of having lots of machines.

yes, and the people buying those machines often tell me that their science is limited by how much RAM they could buy in a single machine.

Which is poppycock.

Oh, you could make this run using MPI etc... no problem. But then all you are really doing is going from high density to low density ram. And that is just not as large a saving as you would think it is. Because you are expanding rapidly the number of cpu's and racks you need to compensate for the less dense RAM. A good sequencing lab does not need more RAM than that per assembly. If you have a reference even less.

Plus the science is hard enough that throwing multinode programming in the middle is tricky. Remember most of this code is written by a PhD student on his first real programming job. It's hard enough to get it to run on something other than the PhD student's laptop, never mind robustly parallelize this across nodes.

PDEs are kind compared to graphic codes. Molecular dynamics problems that are characterized by PDEs are some of the largest scaling codes - ever.

I work in particle astrophysics data processing and management, mostly with gamma ray events for Fermi gamma ray space-telescope. Our total data size after five years is somewhere around 2 PB, but the data set usually used for analysis runs about 200 TB, not really that large. AFAIK, none of the 100+ collaborators use Hadoop. Mostly they use software I develop to run batch jobs in parallel on one of several clusters.

LSST, which I also am starting to work on, will be different, as the raw data alone will approach 4TB a day.

CERN collects about 25 PB/year.

For some reason, wrapping my mind around this is like trying to fathom the Grand Canyon or the scale of the universe.

You and me both. It's made even worse when you learn that the first-level trigger removes 90% of the input stream before passing it to 2,000 computers, which in turn select only 0.2% of that data for storage and further analysis.


Fermilab in Batavia takes data from CERN over a 40Gb/s optical link (CMS Tier1, CERN CMS is Tier0); when I was there several years ago, we were staging the data to ~500TB of spinning disk in Nexans and cold storage were 2-3 Storagetek tape libraries the size of school busses.

Computing and particle physics is where awesome meet.

CERN and its associated teams (CMS and ATLAS) are awesome. They are the only outfit outside of Google that seems to be handling data management and computing properly.

Not to disappoint you, but I could tell you horror stories about my time working on the CMS team at Fermilab. Horrible, horrible stories.

I hope the CMS and ATLAS teams onsite at CERN were much better.

I would be curious to hear some of these stories.

I'll have to put a blog post together about it. Extremely poor planning, no accountability, a boss who was verbally and emotionally abusive to a co-worker (the reason I quit after only a year there and went back to the private sector).

ATLAS guy here. I agree with @toomuchtodo, we're stuck in the stone age. It's basically just batch processing and lots of copying data around for no reason.

You could say the same thing about google's backend (batch processing and copying data), but it seems to be working for them as well.

In the future, the Square Kilometre Array is forecast to require 300-1500 PB/year of archiving, somewhere between 10 PB/hr and 1 EB/day of raw data. (That's 3-12 TB/s...)
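For anyone who wants to check the "3-12 TB/s" conversion, here is a quick sketch of the arithmetic. It assumes decimal units (1 PB = 10^15 bytes) and only uses the raw-data figures quoted above; the variable names are just for illustration.

```python
# Convert the quoted SKA raw-data rates into TB/s (decimal units assumed).
TB = 10**12
PB = 10**15
EB = 10**18

low_rate = 10 * PB / 3600       # 10 PB per hour, in bytes per second
high_rate = 1 * EB / 86_400     # 1 EB per day, in bytes per second

print(f"{low_rate / TB:.1f} to {high_rate / TB:.1f} TB/s")  # 2.8 to 11.6 TB/s
```

So the parenthetical is a rounded version of roughly 2.8 to 11.6 TB/s.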

> We'll be generous and say that 4TB can do 150MB/s. A single run through the data at maximum efficiency will cost you ~8 hours. Since we've restricted ourselves to a single box, we're also not going to be able to keep the data in memory for subsequent calculations/simulations/whatever.

I agree with your overall point, but of course the actual limits for a fairly cost effective single box are fairly high these days. E.g. from my preferred (UK) vendor, 4x PCIe based 960GB SSD's (each card is really 4 SSD's + controllers) adds up to an additional $7500. We often get 800MB/sec with a single card.

Your point still holds - the actual thresholds are just likely to be quite a bit higher than what the performance of a single 4TB hd might imply.
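To put rough numbers on that, here is a back-of-envelope sketch of a single full pass over 4TB at the two throughput figures mentioned in this thread. It assumes decimal units and a sustained sequential read; `scan_hours` is a made-up helper, and real workloads will rarely hit the peak rate.

```python
# Back-of-envelope scan times; throughput figures are the ones quoted in the thread.
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

def scan_hours(dataset_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to read the whole dataset once at a sustained rate."""
    return dataset_bytes / bytes_per_second / 3600

hdd_hours = scan_hours(4 * TB, 150 * MB)  # single 4TB hard drive at 150MB/s
ssd_hours = scan_hours(4 * TB, 800 * MB)  # one PCIe SSD card at 800MB/s

print(f"HDD: {hdd_hours:.1f} h, SSD card: {ssd_hours:.1f} h")  # HDD: 7.4 h, SSD card: 1.4 h
```

Striping reads across several cards should cut the second figure further, though I have not measured that.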

Tack on 40+ cores, and 1TB RAM, and the cost to purchase a server like that is somewhere around the $60k mark in the UK (without shopping around or building it yourself), but that adds up to "only" about $2k/month in leasing costs. That makes a single server solution viable for quite a lot larger jobs than people often think.

Do you mean leasing, as in renting the box, and having it on-site? 2k/month sounds awfully cheap for a 64k box? Even if you rent it for 3 years without upgrading, that hardly seems worth it to the leasing company? Or maybe there is something going on with the margins here that I'm missing?

So many of these problems are IO bound to begin with. In fact, I don't think I have ever had a problem that wasn't ultimately IO bound. Hence the need to spend more time paying attention to the network infrastructure, and, ultimately, removing as much of it as possible from in between your storage and your compute infrastructure.


A rule of thumb that I've inferred from many installations is: Just introducing Hadoop makes everything 10 times slower AND more expensive than an efficient tool set (e.g. pandas).

So it only makes sense to start hadooping when you are getting close to the limit of what you can pandas - everything you do before that is a horrible waste of resources.

And when you do get there - often, a slightly smarter distribution among servers and staying with e.g. pandas, will let you keep scaling up without introducing the /10 factor in productivity. Although, it might be unavoidable at some point.

There is a huge gap between Pandas and Hadoop. That gap is well served by a $2-3k Postgres box with 32-64GB ram and 5-10TB of disk.

The main reason to switch to Hadoop at the point when Pandas fails is because you expect to scale past that gap fairly quickly.

"We don't have big data" or "our data is rather small" -- said no dev team ever.

"Big data" is like "cloud": it is a cool label everyone is applying to their system. Just like OO was in its time. Well, once they've applied the label they feel they need to live up to it, so it's "we gotta use what big data companies use" and they pick Hadoop. I've heard of Hadoop being used when MySQL, SQLite or even flat files would have worked.

Also, although every new project starts out tiny, everyone _hopes_ to become really big some day. And to make that clear to everyone on the team, the architecture is immediately designed to handle the wildest success imaginable. Which ends up costing so much to implement that the first paying customer never comes...

Sad but true.

I did say that my former employer. Former.

*at my former employer.

Novelty Driven Development (NDD)

Chris points out a great example of NDD here with Hadoop.

I do a lot of client work and I see this mistake CONSTANTLY. So often in fact, that I recently wrote up a story to illustrate the problem. Rather than use a tech example, I use a restaurant and plumbing to drive the point home. When the same scenario is put into the context of something more concrete like physical plumbing, it shows how ridiculous NDD really is.


That blog's background is 2.6 MB, perhaps you should compress it a bit.

Also, I think it's a bit disingenuous to compare a complex system built to various specific needs versus a simpler one that doesn't address those needs at all and just say "their only difference is that one worked and the other didn't".

Obviously, if the only criterion was that it should just work, everyone would go with the simpler system. The more complex system probably had some things going for it, too, otherwise nobody would choose it.

"Obviously, if the only criterion was that it should just work, everyone would go with the simpler system."

It's obvious to you and me, but to many people it's not obvious at all. You only have to look at the very large deadpool of companies that made these mistakes and died because of it.

People optimize prematurely, scale prematurely, make ego based decisions, care more about interesting tech than making the business grow, etc etc. That's really the whole point of the parent article - people make bad decisions because of the allure of some "cool" but complex and unnecessary tool.

Also, thanks for the note on the background - I'll fix that :)

I get your point of course, but indoor plumbing is orders of magnitude more complicated than a well and outhouses. Replacing wells with fountains and eliminating outhouses are driven by population density and public health concerns -- but once available they are preferable to the alternative, despite their complexity.

actually in the developing world people prefer cellphones to indoor plumbing. i expect this to continue to be the case.

"A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it."

Done! Although with more drives and a backup server. Right now, we're pushing 15Tb with no loss in performance.

Do you use standard off-the-shelf consumer hard drives at those prices?

Most companies I've worked for have shelled out quite a bit more for "enterprise" class hard drives. I've always struggled to understand what these bring to the table, and my understanding is that it's some combination of greater reliability and a service agreement.

It's always seemed to me, in my software-centric naivete, that it would be more cost-effective to get off-the-shelf drives, RAID them for redundancy to increase uptime, and make regular backups to increase data longevity.

Hard drives are tricky little beasts.

On the one hand, cheap disks are much more prone to start operating slowly when they begin to fail, instead of transitioning directly from working to not working (or setting themselves as failed).

On the other hand, RAID write speeds are roughly that of the slowest drive in the set. The more drives you have, the higher the probability of a malfunctioning hard drive making the whole thing go slow. Slow drives mean I/O bottlenecks, rendering the whole server unusable.

Pair both things and you have actually increased your effective downtime instead of decreasing it. Now use expensive disks (that work more reliably and either work or don't with a higher probability) and just calculate whether the price increase is lower than the cost of the downtime/slowdowns when running cheap disks.

Oftentimes the price premium is worth it.

For the most part, you're right. Enterprise hard drives tend to come with better agreements, especially for situations where data protection standards make it prohibitive to send a dead drive back to RMA it; often they'll just say to send back the controller board or the like. Oddly, however, one of the big design features for enterprise hard drives is that they'll give up faster if there's a mis-read, etc. Because you're striping your data over a large set of spindles, it's often faster to just throw your hands up and work from the parity disk until you can diagnose the issue. This is also why most good sized arrays will have hot-spares, so you can begin the rebuild while you figure out if the drive was just having a temporary glitch or not.

> it's often faster to just throw your hands up and work from the parity disk until you can diagnose the issue.

One of the advantages of enterprise drives is how long the drive spends error checking. Consumer drives spend considerably longer when they detect an error, and that can cause the RAID controller to drop the drive as malfunctioning. Then it has to rebuild the array or operate on a hot spare.

There used to be a firmware setting you could toggle on the WD Caviar Black drives to effectively turn them into the RE (enterprise) drives. They've since made some design changes so that's no longer possible.

One of the most important things is that enterprise drives are much more likely to be telling you the truth when they tell you your data is written to disk. Even if it hasn't actually hit the platter, they'll have a battery or cap backup power supply that allows them to write anything still pending to disk before they run out of power. Consumer drives often lie horribly about whether data is truly durable.

Not really an issue for building a reporting/aggregation database.

But certainly something to think about when processing realtime financial transactions.

That's totally understandable. Honestly, we haven't seen any appreciable difference in performance or failure rates between "enterprise" class drives and "consumer" drives (usually WD). Matching RPMs and cache sizes, there's just no way to justify the extra cost for them.

The one area we didn't penny-pinch on was the tape backup. Glacially slow compared to HDDs, obviously, but absolutely necessary.

MTTF should be (and probably is) higher. Perhaps some other things like reallocated sectors count can be higher without affecting performance/stability much.

But from what I've seen, the most important thing is the service agreement, i.e. having new disk ready the next day without any questions asked (or money spent for a new drive).

> or money spent for a new drive

To put it other way - expected vs. unexpected expenses.

You would probably want some RAID + Backups, (and a very generous amount of RAM there) but this is ok.

In fact, I wouldn't doubt that for several "big data" users even SQLite would be enough.

I migrated from MongoDB to SQLite for a project a few years ago, it was much more performant:


I think the data size ended up being a GB or so.

I wouldn't doubt that either.

We do use RAID 6 (I think) + tape backups and the backup server itself is a hot standby. If you're curious, this is the chassis : http://www.supermicro.com/products/chassis/3U/836/SC836A-R12...

The motherboard is also Supermicro running dual AMD Opterons ( single processor on the backup server ) with 256Gb RAM.

From my experience SQLite is great when only one process with no parallelism needs to write to the DB and where no other process will frequently overlap reads with writes. However, for CPU-heavy computations that need multiple cores to compute and write to the DB, PostgreSQL is a much better fit for me even for fairly small data-sets. And the ease of use and setup of PostgreSQL is not that much more than SQLite.

As a data guy that uses both, there is a higher level of expertise required to properly maintain PostgreSQL. SQLite is something that I can call instantly from Python without much forethought.
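To illustrate the "without much forethought" part, this is roughly all it takes with the `sqlite3` module from the Python standard library. The table and column names are invented for the example, and the in-memory database is just to keep it self-contained.

```python
import sqlite3

# In-memory database: nothing to install, configure, or administer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(1, 9.5), (1, 0.5), (2, 3.0)],
)
conn.commit()

# A simple aggregation, the kind of query a reporting job would run.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)  # [(1, 10.0), (2, 3.0)]
conn.close()
```

Swapping `":memory:"` for a file path gives a persistent single-file database, which is usually where the single-writer caveat mentioned above starts to matter.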

What is the price / performance tradeoff for using SSD in these applications nowadays?

The paper introducing graphchi has dozens of examples of graphchi on a mac mini running rings around 10-100 machine hadoop clusters - http://graphlab.org/graphchi/

I worked on a project earlier this year that was envisioned as a hadoop setup. The end result was a 200loc python script that runs as a daily batch job - https://github.com/jamii/springer-recommendations

I'm tempted to setup a business selling 'medium-data' solutions, for people who think they need hadoop.

I'm always torn by these headlines: yes, many organizations lack the size of data required to take advantage of Hadoop. Few of the articles really bother explaining the advantages of Hadoop, and how what you're doing really moves the break-even point in terms of data size:

- 3x replication: if the data needs to be retained long-term, slapping it on one hard drive isn't going to cut it. This is pretty poor justification by itself, but it's nice to have.

- working set: if you only pull 1GB out of your data set for your computations, it makes sense to pull data from a database and run Python locally. If you need to run a batch job across your full, multi-TB data set every day then Hadoop starts looking more attractive

- data growth: a company may only have 10GB of data now, but how much do they expect to have in a year? It's important to forecast how much data you'll accumulate in the future. Especially if you want to throw all your logs/clickstreams/whatever into storage.

So, if you're expecting explosive growth, you want to hang on to every piece of data ever, or you're going to do a lot of computation across the whole dataset, it makes sense to adopt Hadoop even if your dataset isn't 'big' to start.

As for this article, the author undersells MapReduce a bit. Human-written MR jobs can jam a lot of work into those two operations (and a free sort, which is often useful). Using a tool like Crunch can turn really complicated jobs into one or two phases of MR. Once Tez is widely available people won't even write MR anymore, they'll all likely write a 'high-level' language and compile it down to Tez.

You bring interesting points that you may need to analyze the future requirements, but at the same time, I feel you're underselling the things you can do with a modern SQL cluster.

* Replication: Disks are cheap, data can go on as many drives and servers as you need. Master-slave replication is pretty damn bulletproof these days, and multi-master isn't as terrible as it was even a few years ago.

* Working set: All servers can have the entire working set. Additionally, features like foreign data wrappers in postgresql and federated tables in mysql mean that you can query your sharded databases, and still get aggregate results back.

* Data growth: SQL servers were Big Data before Big Data was cool. Terabyte storage clusters are no problem for an SQL database engine, with some entities having clusters in the petabyte range.

Databases are a very fast moving target of late, with competition from many fronts meaning that what was true even last year may not be true today. Additionally, with things like Postgresql's foreign data wrappers, you can choose the best data processing engine for the job -- keep the data on the databases for the nice bits of SQL, but still be able to throw it towards a hadoop cluster at a whim if needed. Using the right tool for the job is important, and equally important is keeping up with what tools are out there, because this is a rapidly evolving area of tech, and analyzing each job separately, instead of relying on any mantra, is what's needed to keep one flexible to keep up.

Preface: "Big Data" is a stupid label, and I wish it would die. You have no idea how much time I spend explaining to non-technical people that comparing Hadoop and Riak is like comparing a tractor to a helicopter.

You can definitely put lots of data in a RDBMS, and you can get it back very quickly. I'm not advocating for Hadoop for problems where a database - be it sharded, vertically scaled, whatever - will do; I know you can scale relational to petabytes of data.

Hadoop isn't supposed to be a tool for having a lot of data, it's a tool for doing things with a lot of data. I've done lots of weird computation on Hadoop: OCR of time-series images is a good example. This is general purpose, parallel computation which takes advantage of Hadoop's scheduling infrastructure and the notion of data locality - some of the data lives on every compute node, and can be accessed with very low latency.

To put it another way, databases will get faster and more scalable, but there's still a need for ETL when moving between different data models. Some people use Hadoop just for ETL, particularly when ingesting semi-structured data, because it's good at computation, but only OK at storing the finished, structured data.

I'm going to try not to be offended by your closing line; I was trying to point out a particular dimension of considering whether an application is suited to Hadoop - the computation component. This is often ignored by people who treat Hadoop like yet another database, when that's really a complete mischaracterization. I certainly don't advocate 'relying on any mantra', but instead considering all solutions and selecting the most appropriate tool.

In these discussions it is mandatory to quote this paper: http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...:

"We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask if this is the right path for general purpose data analytics? Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s GB of DRAM. We therefore ask, should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs."

Their data is based on Hadoop jobs running at Yahoo, Facebook, and Microsoft -- companies most would agree do have real Big Data -- and they find the median job size is <14GB.

Even if all jobs take less than 14GB data as input, but the sum total of your data is far greater, it is easier to use Hadoop as a unified single filesystem for storing data rather than relying on multiple different machines that you manage by hand. Even if you treat Hadoop simply as a multi-machine filesystem, it automates the job of managing which files are where, that you inevitably have to manage by hand if you go with storing your data on multiple machines sans some kind of filesystem interface.

A study of jobs submitted to the Yahoo! cluster showed that the median job involved 12GB of data.

There's really nothing wrong with that at all, because breaking on 64MB blocks, that 12GB can be processed in parallel, which means turning an answer around really quick on that 12GB, say 30 seconds or so. Usually the work can be scheduled on machines that already have the necessary input, so the network cost is low.
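As a rough sketch of where the parallelism comes from: with 64MB blocks, a 12GB input splits into a couple of hundred pieces that can be worked on at once. The sketch assumes binary units and roughly one map task per block; the helper name is made up.

```python
import math

MiB = 2**20
GiB = 2**30

def num_blocks(input_bytes: int, block_bytes: int = 64 * MiB) -> int:
    """Number of fixed-size blocks needed to hold the input."""
    return math.ceil(input_bytes / block_bytes)

blocks = num_blocks(12 * GiB)
print(blocks)  # 192
```

Whether all 192 pieces actually run at the same time depends on how many task slots the cluster has free.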

Now, it might not be worth it for one hacker to build a Hadoop cluster to do that one job, but if you have a departmental-wide or company-wide cluster you can just submit your jobs, get quick answers, and let somebody else sysadmin.

Sure the M/R model is limited, but it's a powerful model that is simple to program. You can write unit tests for Mappers and Reducers that don't involve initializing Hadoop at all, and THAT speeds up development.
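A minimal sketch of what that looks like, using plain Python functions rather than any real Hadoop API. The word-count job and all function names are made up for illustration; `run_local` is a tiny in-process stand-in for the shuffle, not how Hadoop implements it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reducer: sum all counts seen for one word."""
    return word, sum(counts)

def run_local(lines: List[str]) -> Dict[str, int]:
    """Group mapper output by key in memory, then apply the reducer."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))

# Unit tests that never touch a cluster.
assert list(map_words("Big data big")) == [("big", 1), ("data", 1), ("big", 1)]
assert reduce_counts("big", [1, 1]) == ("big", 2)
assert run_local(["big data", "Big deal"]) == {"big": 2, "data": 1, "deal": 1}
```

Because the mapper and reducer are pure functions, the same tests keep passing no matter what eventually runs them at scale.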

Yes, it is easy to translate SQL jobs to M/R, but M/R can do things that SQL can't do. For instance, an arbitrary CPU or internet intensive job can be easily embedded in the map or in the reduce, so you can do parameter scans over ray tracing or crack codes or whatever.

I built my own Map/Reduce framework optimized for SMP machines and ultimately had my 'shuffle' implementation break with increasing input size. At that point I switched to Hadoop because I didn't plan to have time to deal with scalability problems.


With cloud provisioning, you can run a Hadoop cluster for as little as 7.5 cents, so it's a very sane answer for how to get weekly batch jobs done.

Looking at the median is not very interesting, since jobs in these environments are always heavily skewed. You have those 5-10% jobs that are several orders of magnitude beyond those 13GB, and those are the ones you run the cluster for.

Of course, but the ability to run small jobs and get a quick turnaround can be transformational in the sense that it lets you try things out and "fail faster"

I remember starting to grasp what "big data" meant when I had a phone interview with Twitter.

@ Imagine you have some numbers spread over some computers -- too many to fit in one computer. Find the median.

▪ Uhh, sort them?

@ Can you find the median on a single computer without sorting them?

▪ :-(

@ We'll call you back tomorrow.

I was promptly rejected, but it set the tone for my later studies.

The criterion for Big Data seems to be that it fits on thousands of computers, perhaps several TB or a PB. Then I had to think of some examples:

* A million YouTube Videos

* All the tweets in the US in the past 15 minutes

* All US tax records

I still think the map-reduce philosophy is really cool. And I know at that scale there are special counting algorithms (like Bloom Filters) that may lead to some improvements at the GB or MB scales.
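For readers who have not met one, a Bloom filter is strictly an approximate set-membership structure rather than a counter: it can say "definitely not seen" or "probably seen". Here is a minimal sketch, assuming SHA-256 with a seed prefix as the hash family and one byte per bit for readability. The class and its parameters are illustrative, not any particular library's API.

```python
import hashlib
from typing import Iterator

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)
print(seen.might_contain("alice"))  # True: added items are never reported absent
```

The memory used is fixed by `num_bits` no matter how many items are added, at the price of a false-positive rate that grows as the filter fills up.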

For reference, finding the median is an O(n) problem on a single computer, and sorting is an O(n log n) solution.

An O(n) algorithm is Quick select. It is basically like quick sort but you only recurse on the side of the pivot that contains the median.
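A minimal sketch of that idea in Python, assuming a random pivot and returning the lower median for even-length input (both are choices made for this example). Note that the O(n) here is the expected running time; a guaranteed worst-case linear bound needs something like median-of-medians pivot selection.

```python
import random
from typing import List

def quickselect(values: List[float], k: int) -> float:
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(lower):
            items = lower                        # answer is left of the pivot
        elif k < len(lower) + len(equal):
            return pivot                         # the pivot is the answer
        else:
            k -= len(lower) + len(equal)         # answer is right of the pivot
            items = [x for x in items if x > pivot]

def median(values: List[float]) -> float:
    """Lower median for even-length input (an arbitrary choice for this sketch)."""
    return quickselect(values, (len(values) - 1) // 2)

print(median([7, 1, 5, 3, 9]))  # 5
```

Only one side of each partition is ever revisited, which is what separates this from a full quicksort.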


Twitter often gets singled out for its big dataness, though the latest numbers I've seen are only about 400 million tweets per day. Even allowing 1K/tweet this is a rather manageable 400GB uncompressed.

15 minutes of US tweets would fit on your phone :-)
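The arithmetic behind that, as a quick sketch. It takes the 400 million tweets/day and 1K/tweet figures above at face value, assumes decimal units and an even spread over the day, and treats the worldwide figure as an upper bound for the US.

```python
GB = 10**9

def daily_bytes(tweets_per_day: int, bytes_per_tweet: int) -> int:
    """Total uncompressed bytes per day at a fixed size per tweet."""
    return tweets_per_day * bytes_per_tweet

per_day = daily_bytes(400_000_000, 1_000)
per_15_min = per_day / (24 * 4)  # 96 fifteen-minute windows in a day

print(per_day / GB, round(per_15_min / GB, 1))  # 400.0 4.2
```

Changing `bytes_per_tweet` scales both numbers linearly, so the conclusion is only as good as that per-tweet estimate.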

I am pretty sure a tweet is more than 1kb: the message itself is 140 characters, not 140 bytes, and you have lots of metadata around a tweet. I would easily expect one order of magnitude more / tweet.

There's another important dimension: the fan-out factor when a person with many followers tweets.

That second question is kind of dirty. Pretty much all algorithms to find the median will perform a partial sort. Without any sorting at all, the only thing I can think of is some kind of statistical approximation based on sampling.

Here's an interesting probabilistic way to sample up to the median: http://blog.aggregateknowledge.com/2013/09/16/sketch-of-the-...

The author has a point, but dataset-size is one dimension of several you need to consider when making the choice to use Hadoop.

A local python script is great, but what if it takes 2 or 3 hours to run? Now you need to set up a server to run python scripts. What if the data is generated somewhere that would have high locality to a hadoop cluster? Now you need to pull that data down to your laptop to run your job. What if there are a dozen people running similar jobs? Now your python script server is a highly stressed single point of failure. What if the data is growing 100% month-over-month? Your python scripts are going in the trash soon since they were not written in a way that can be easily translated to map-reduce and hadoop-sized workloads are inevitable.

The next step up is a centralized database, but in my experience running your own (large, highly used) database is a whole lot harder than just throwing files on S3 and spinning up hadoop clusters on EC2 if you have people that can write pig jobs.

A solution like elastic map reduce removes a lot of practical problems such as data distribution, resource management, and system operations beyond the fact that it makes it possible to easily run jobs over terabytes of data at a time.

Amen! Finally someone talking sense. Apart from being a hype, Hadoop is also a wet dream for VCs that want to have "exposure" to "big data".

Hadoop is a solution for cases where you have multiple petabytes of data, with queries that need to touch a significant portion of your data. Roughly speaking, in this case your execution time will scale with the number of nodes in your cluster. The classic example is creating the inverted word list for a search engine.
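For anyone unfamiliar with the term, an inverted word list maps each word to the documents containing it. Here is a single-machine sketch with made-up documents; on a cluster the inner loop would be the map step and the per-word grouping the reduce step, but nothing here is Hadoop-specific.

```python
from collections import defaultdict
from typing import Dict, List

def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():   # "map": emit (word, doc_id)
            index[word].add(doc_id)         # "reduce": collect ids per word
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "hadoop scales out", 2: "postgres scales up"}
index = build_inverted_index(docs)
print(index["scales"])  # [1, 2]
```

Every document has to be read in full to build the index, which is why this job touches the whole dataset.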

For most other use cases, including all cases where you can index your data, you do not need Hadoop.

You are right that the proper tool should be used for each particular problem. And the Hadoop world is harder than single machine systems (like pandas). So, you shouldn't use Hadoop if you can do the job with simpler systems.

But I have something to add. Hadoop is not only introducing new techniques for distributed storage and computation. Hadoop is also proposing a methodological change in the way a data project is approached.

I'm not talking only about doing some analytic over the data, but building an entire data driven system. A good example would be the case of building a vertical search engine, for example for classified ads around the world. You can try to build the system just using a database and some workers dealing with the data. But soon you'll find a lot of problems for managing the system.

Hadoop provides you all the storage and processing power that you want (it is a matter of money). What if you build your system in a way where you always recompute everything from the raw input data? That can be seen as something stupid: why do that if you can run the system with fewer resources?

The answer is that with this approach you can:

- Being human fail-tolerant. If somebody introduces a bug in the system, you just have to fix the code, and relaunch the computation. That is not possible with stateful systems, like those based in doing updates over a database.

- Being very agile in developing evolutions. Change the whole system is not traumatic, as you just have to change the code and relaunch the process with the new code without much impact in the system. That is not something simple in database backed systems.

The following page shows how a vertical search engine would be built using Hadoop and what would it be its advantages: http://www.datasalt.com/2011/10/scalable-vertical-search-eng...

> That is not possible with stateful systems, like those based in doing updates over a database.

Could you elaborate? This sounds like the NoSQL argument that relational databases are "not agile", which usually means relational databases complain that you have records that won't logically fit the changes you just made.

I couldn't disagree more with some of the statements in the article.

Hadoop is not a database! It's a parallel computing platform for MapReduce-style problems that could preserve locality. If your problem fits this, Hadoop absolutely rocks. If your problem is different then please use another tool. If your problem deals for example with high-resolution geographic or LiDAR data that can be easily processed independently (and that would easily give you a few petabytes each scan/flight so you can't stuff it into GPU), Hadoop is about the only open thing you can use to process them reliably (imagine having Earth surface data with the resolution of 1cm and the need to prepare multiple levels of detail, simplify geometry, perform object recognition, identify roads etc.). Even when your data is smaller if your problem fits in the MapReduce model well, Hadoop is a pretty convenient way to be future-proof while enjoying already mature application infrastructure. Why would you even bother working with toys that put everything into memory and then fail miserably in production (HW failure, need to reseed data after crash etc.)?

I worked in a company that routinely processed these kinds of data; usually people accepting the thinking from this article hit a wall someday in production, couldn't guarantee reliability and ended up writing endless hacks for their algorithms that didn't scale when it was needed and became frustrating bottlenecks for everyone.

Yes, I also saw some ridiculous uses of Hadoop (a database that never had a chance of growing over 20M records, problems not fitting MapReduce that needed custom messaging between jobs, etc.). Just use your reason properly: whatever has the potential to handle large data in the future, do it with Hadoop or any appropriate system that supports your algorithmic model (S4, Kafka, OrientDB, Storm etc.) straight away.

Make your software future-proof now or you'll have to rewrite it from scratch when you are under huge pressure. Don't become complacent with what you "know" now.

> "Hadoop is not a database!"

Nor does the essay claim that Hadoop is a database.

> "If your problem fits this, Hadoop absolutely rocks"

The essay points out that most systems which do use Hadoop don't actually fit the Hadoop model, and that other 'mature application infrastructures' would be more effective than Hadoop. You misinterpreted it to mean the converse.

It also agrees with you that there are cases where "Hadoop might be a good option". It says "The only benefit to using Hadoop is scaling", and that it might be appropriate for >5TB data sets. Your 1PB example is of course larger than 5TB.

So I don't think you actually disagree with it.

> "Why would you even bother working with toys that put everything into memory and then fail miserably in production"

Thank you for calling my software a "toy" and disdaining my field of research. Do not presume that the needs of your field hold for all others.

Hadoop isn't useful for what I'm interested in, which is to support interactive search of ~2 million chemical structures. This requires a sub-100 ms query time. Hadoop, last I checked, was lousy for soft real-time work like this.

You actually say "Hadoop or any appropriate system that supports your algorithmic model".

My actual solution was the old-fashioned way: a combination of improved algorithms, multithreading, better data locality, and a bit of chip-specific assembly. The result is about 100x faster than the previous widely used tool, and gives me the sub-second search times I want.

Moreover, it scales well. I use the new code as part of the inner loop in a clustering task; what once took a week on a machine cluster is now done on a single node in an afternoon.

Had I taken your suggestion I would have invested in more hardware, which would have been the wrong solution for my needs. Also, data size in my field doubles every 5-10 years, which is much slower than the rate of computer performance.

Sorry, I didn't mean to offend you in any way!

First, I really don't like if somebody by default compares Hadoop with SQL and this is a widespread confusion - they are completely different beasts; in fact an extension called Hive gives Hadoop + HTable an SQL-like syntax.

However Hadoop is a parallel, batch-processing platform. Hadoop is slow-responding, you can't even talk about latency because a single task takes a lot of time even to set up and execute. It's completely inappropriate for real-time low-latency computations. For those, the in-memory, GPGPU, streams are much better. However, if you have a large dataset whose loading times onto a single machine may be very long - imagine loading 1TB to memory from a network drive if your computer can handle it - you might be better off partitioning your data across thousands of smaller nodes (e.g. ARM microservers) and performing computations on each of them independently. This way you don't need to transfer a lot of data, each smaller local dataset is loaded very fast (= you preserve locality), you find a balance between being CPU-bound and IO-bound and likely finish your computations much faster than on a single computer with huge memory but a limited bus.

Hadoop's tragedy is that it is now an established platform that is supported by large companies which are mostly driven by technologically clueless people, trying to put it everywhere as it is "in vogue" to do so. Google abandoned the MapReduce model a few years ago but the industry didn't seem to notice.

Think about your case for chemical structures - is there any part of your algorithm that needs to be computed only occasionally but it's a lot of data to process? Why not offload it to Hadoop for preparing digest from these data which you can use in your real-time algorithm? That's actually pretty common usage scenario - Hadoop handles the rough mining, extracting the precious stuff from crude data, assembling it to a form that is refined and can be used by other parts of system that have completely different requirements, for example low-latency interactions.

> "I really don't like if somebody by default compares Hadoop with SQL"

The first comparison was to Scala, and then to SQL. The main comparison was to SQL and a Python script. While it does talk more about SQL than other solutions, I don't see that as a default comparison.

> "For those, the in-memory, GPGPU, streams are much better"

I'll go off on a bit of a philosophical tangent here. Yes, GPGPUs can be more effective for the task I'm doing, since I'm actually memory bound. However, GPGPUs require dedicated hardware, while the code I work on is effective even on laptops with limited GPUs, including web servers.

Ideologically, I prefer to enable single-person developers, and scientists who do not have much training in hardware and network administration. For that situation, GPGPUs are not "much better", because so long as the performance is fast enough, it doesn't need to be faster. And 100 ms is "fast enough."

> "This way you don't need to transfer a lot of data"

True. But to point out, I used the traditional approach of developing a file format which can be memory-mapped directly to my internal data structures, and a search algorithm which doesn't need to search the entire data set before loading it. These optimizations don't fit well with how I understand Hadoop to work.
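For illustration, a minimal sketch of the memory-mapping idea in Python. The record layout (a 4-byte id plus a 64-bit fingerprint), the file name, and the helper names are all hypothetical; this is not the actual format described above, just one way such a file could be scanned without loading it up front.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: 4-byte unsigned id + 8-byte unsigned fingerprint.
RECORD = struct.Struct("<IQ")

def write_records(path, records):
    """Write (id, fingerprint) pairs as packed fixed-width records."""
    with open(path, "wb") as f:
        for rec_id, fingerprint in records:
            f.write(RECORD.pack(rec_id, fingerprint))

def find_first(path, wanted_bits):
    """Return the id of the first record whose fingerprint has all wanted bits set.

    The file is memory-mapped, so the OS pages data in on demand and the
    scan can stop early without reading the rest of the file.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, len(mm), RECORD.size):
            rec_id, fingerprint = RECORD.unpack_from(mm, offset)
            if fingerprint & wanted_bits == wanted_bits:
                return rec_id
    return None

path = os.path.join(tempfile.mkdtemp(), "fingerprints.bin")
write_records(path, [(1, 0b0011), (2, 0b0110), (3, 0b1110)])
print(find_first(path, 0b0110))
```

Because the records are fixed-width, record N sits at byte offset N * RECORD.size, which is what lets the on-disk layout double as the in-memory one.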

> Hadoop's tragedy ... mostly driven by technologically clueless people

The author of the essay agrees with you. Here's the P.P.S.:

"I don’t intend to hate on Hadoop. I use Hadoop regularly for jobs I probably couldn’t easily handle with other tools. ... Hadoop is a fine tool, it makes certain tradeoffs to target certain specific use cases. The only point I’m pushing here is to think carefully rather than just running Hadoop on The Cloud in order to handle your 500mb of Big Data at an Enterprise Scale."

This is why I don't think you actually disagree with the author.

> Think about your case for chemical structures - is there any part of your algorithm that needs to be computed only occasionally but it's a lot of data to process?

Certainly, but there are two other important facets to that. 1) updates occur weekly, a full rebuild on a single core only takes 12 hours, and those 12 hours aren't critical, and 2) the deltas are relatively small and data from previous builds can be reused, so incremental updates should take about 30 minutes - I'll be working on that code in the next couple of weeks.

It's easier to have a cron job trigger a command-line program every week than to set up a Hadoop server.

Back in college in '97 or so, in our databases class on the first day, the professor asked, "Who's worked with databases before?" A bunch of hands went up. "Oh, sorry, let me rephrase, who's worked with databases larger than a few dozen gigs?" Only one or two hands remained up. "If it's smaller than that, just save yourself the effort and use a flat-file instead."

15 years later and it's the same thing, plus a few orders of magnitude.

Aside from the ACID properties others mentioned, if you're using a relational database (and most people are), there may be non-performance related benefits. Some data is inherently relational, and it can be easier to manage it using relational abstractions such as SQL.

That is perhaps the worst advice I've ever heard. I spent a summer some years ago writing what, in abstract, were hand-coded SELECT and INSERT statements for CSVs. It was a waste of time; the number of bugs was ridiculous and the data integrity was attained by brute force. Today I'd use sqlite and/or postgres.

What horrible advice. Hope you dropped the class.

Would have been hilarious to listen to the lectures about normalization and ACID topics... assuming there were any.

"Cod Normal Form? Never heard of it. I like my fish sticks made of haddock anyway."

Come to think of it, I've worked with guys who apparently learned everything they know about databases from that prof's database class, unfortunately.

Interesting, but it isn't really similar advice. Indeed, it sounds like absolutely terrible advice (unless there is some context that is missing).

I suspect it should have been "Who's worked with databases too big to fit entirely in memory".

Because if it all fits then you don't have to worry too much about performance overheads. If it doesn't then you need to think about how you're going to avoid table-scans and the like.

Many programs use SQLite in part because its ACID properties make it an excellent way to save system state.
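A small sketch of what that can look like with Python's built-in sqlite3 module. The table name, the keys, and the `save_state` helper are made up for the example, and an in-memory database stands in for a real state file.

```python
import sqlite3

def save_state(conn, updates):
    """Apply all updates atomically: either every key is written or none is."""
    # Using the connection as a context manager commits on success
    # and rolls back if an exception escapes the block.
    with conn:
        for key, value in updates:
            if value is None:
                raise ValueError(f"refusing to store empty value for {key!r}")
            conn.execute(
                "INSERT OR REPLACE INTO state(key, value) VALUES (?, ?)",
                (key, value),
            )

conn = sqlite3.connect(":memory:")  # a file path would make the state persistent
conn.execute("CREATE TABLE state (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

save_state(conn, [("window", "800x600"), ("last_file", "notes.txt")])

try:
    # The second update is invalid, so the first one is rolled back too.
    save_state(conn, [("window", "1024x768"), ("last_file", None)])
except ValueError:
    pass

print(dict(conn.execute("SELECT key, value FROM state")))
```

The failed second call leaves the earlier state untouched, which is the property that is tedious to get right with hand-rolled flat files.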

A lot of people use databases to implement persistent user state on top of stateless HTTP, even if the data itself is small enough to fit into memory.

This latter use was known even in 1997. For example, the book "Database Backed Web Sites: The Thinking Person's Guide to Web Publishing" was published on Jan. 1 of that year.

So even when that advice was offered, it was wrong.

I remember buying a pretty kick-ass server at the time and it had, if I recall correctly, 196MB of memory.

However I would still strongly contest the analogy even if the database did fit in memory (the author may be off by magnitudes themselves, as in 1997 a 2GB database was a pretty substantial, unwieldy thing for most people) -- doing the simple steps of putting your data in a database instantly enables enormous flexibility in the use of that data at very little cost or overhead, with better to enormously better performance than the average person is going to yield with a flat file.

This situation (Hadoop for big data), in contrast, is about throwing away a lot of flexibility, and paying a large performance price, to add big-data scale out flexibility. It is, in many ways, the opposite situation.

Yes, it is terrible advice. Just like this article.

Repeats all the usual myths about Hadoop... that it's just about MapReduce, that it doesn't support indexes (hint: Hive has a CREATE INDEX command, guess what it does?) and then adds some of its own.

After seeing this, plus an article advocating web frontend programming in C, I'm starting to think this place is going downhill fast.

For a lot of people, what the author says is absolutely right, but I think a lot of the comments here suggesting that only a handful of institutions are solving Hadoop-scale problems is simply inaccurate.

Yes, there are companies that are trying to appear more attractive by using Hadoop, but there are plenty of cases where Hadoop is replacing ad-hoc file storage on multiple machines.

Its primary use is as a large scale filesystem, so if you are running up against problems storing and analyzing data on a single box, and you feel that the amount of data you have will continue to accelerate, it is a good option for file storage. It doesn't replace your database, it complements it, and there's work being done to allow large-scale databases on top of Hadoop, although the existing ones aren't mature yet. But there are a lot of institutions that are taking on problems that a single-box setup cannot handle.

And MapReduce isn't a bad programming model, but it should be thought of as the assembly language of Hadoop. If you are solving a particular problem on Hadoop, writing a DSL for it is the way to go, or see if one of the existing DSLs fits your needs (HIVE, Pig, etc).

I'd love to understand how the Hadoop hype and marketing team generated so much unwarranted interest in Hadoop.

I'm witnessing a feeding frenzy for Hadoop talent in situations where there's absolutely no need for Hadoop, and I can't recall anything like this for any other software.

I think it goes like this:

Certain very popular companies that everyone wants to emulate, who have truly enormous needs in the initial data crunching (e.g., ETL) department, were running into bottlenecks related to raw I/O bandwidth. They hit on a "let the mountain come to Mohammed" insight that helped them get past that problem, so that they could do their ETL jobs in less time and ultimately keep their workhorse databases (e.g., web search indexes) better-fed.

Simultaneously, a whole lot of people who were not having the same problems, but who want to believe that they are like companies who have those problems (because who doesn't want to be Google?) started also running into problems with handling large amounts of data. Unfortunately, the most popular mistake in applying Feynman's Algorithm[1] is to skip the first step. Rather than investigating their problems and recognizing that the issue was poor tooling or inefficient implementations and they weren't actually coming anywhere close to any true limits of the kind that the companies that came up with Big Data were trying to get around, they instead just went, "Hey, X company that we look up to is also having problems that look cosmetically similar to ours, and they use Y technology - let's give that a try!" and proceeded to dive straight into constructing bamboo control towers and coconut radios without ever looking back.

After that, well, I think it's a tragedy of mostly-rational behavior. Managers don't understand these technologies well enough to take programmers' advice skeptically, so they have to listen to their engineers. Engineers want to make their CV's look nice and impress their managers, so they've got every reason to come up with excuses to use $HOT_NEW_TOY. As usual, everyone individually acting in accordance with their rational self-interest is not the same thing as everyone collectively acting in a way that produces ideal results for the parties involved.

[1] http://c2.com/cgi/wiki?FeynmanAlgorithm

I was with you until "and I can't recall anything like this for any other software."

Just for my curiosity, what other software had the same level of unwarranted demand and hype?

I don't think it's about software per se, but more about the idea of Big Data. We have seen many ideas come and go. Most of the time they were good ideas, but a good idea is not something you should use for everything.

Off the top of my head

Ajax (very handy but lot of times misused and misunderstood or just DHTML)

No SQL (can be handy, but for some people it is like a religion)

XML (why use CSV when you are able to use XML)

Cloud (Oh by cloud you mean the internet?)

and many more :)

Java around 1998. All bugs a thing of the past! Write once run anywhere! Applets taking over the world!

Er, Cloud?

Hehe, most of the Hadoop installations I've seen chew through data amounts that I used to process 10 times as quickly using dirty, rotten Perl scripts, sort, cat and flat files :)

So the OP claimed Hadoop skills, the interviewer asked him to use Hadoop and gave him a small example problem. He then didn't use Hadoop, and thinks there's something wrong with the interviewer for objecting to this?

Interview problems are sometimes kinda artificial, no shit. Given the impracticality of giving every candidate the kind of dataset Hadoop would be needed for, how would the OP suggest an employer test for Hadoop skills?

Thank you for saying this. Exactly what I was thinking when I read the article...

The author suggested that this was all their data - and thus the entire scope of the problem, rather than a test sample that, while inefficient, would show the interviewee's skill.

On a related note, I find it amusing that people seem to think that "Big Data == Hadoop". In actuality, there are plenty of other approaches to scaling out clusters to handle large jobs, including MPI[1] and OpenMP[2], as well as BSP (Bulk Synchronous Parallel)[3] frameworks.

[1]: http://en.wikipedia.org/wiki/Message_Passing_Interface

[2]: http://en.wikipedia.org/wiki/OpenMP

[3]: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel

Isn't YARN / Giraffe BSP ?

Giraph was definitely inspired by BSP and resembles it to some degree. I'm not an expert on Giraph, so I'm not sure exactly how closely related they are.

I don't know much about YARN, so can't really speak to that.

Doh - got the spelling wrong!

Hmm I should go off and get to the bottom of this!

Why do you find this amusing?

I find it amusing that people are so sheep-like in the way that the herd mentality just takes over. To be fair, I am not claiming any sort of immunity from this effect, I am as guilty of it as the next person ... but it is quite funny in a sad way when you are able to step back and recognise it.

It's basically what w_t_payne said. It's amusing (in a sense) because it betrays such a lack of understanding of what's actually going on, and such a willingness to simply accept the "received wisdom" and not question it and think independently.

Some of us have backgrounds in MPI, BSP and other distributed computing paradigms. I switched to high throughput computing and I can tell you that I would never go back to MPI. People who use MPI almost always use it as a crutch to avoid solving the hard problem that would make their computation a lot easier.

As an example I used to write MPI code for molecular dynamics simulators. The goal was to use a large supercomputer to scale the computation up to longer trajectories. But then, with cheap linux boxes it made more sense to just run many simulations in parallel, because low latency MPI class networks cost much more than the machines. We and others made the switch to other approaches which didn't need MPI- we ran many sims in parallel and did only loosely coupled exchanges of data. In fact the coupling was so loose we would run for weeks, dumping output in a large storage system and processing the data with MapReduce.

When I look back at the people who still do the MPI simulations- they are doing dinky stuff and can't even process the data they generate because they spend all their time scaling a code using MPI that didn't need to be scaled with MPI.

Sure, and I'm not saying "MPI > Hadoop" or anything. Just pointing out that Hadoop is not the only game in town, and is hardly the only way to deal with "big data". I'm a Hadoop fan myself, but I also did a lot of MPI stuff in the past, and I believe there are still scenarios where MPI makes a lot of sense. I have less experience with OpenMP, but I think anybody planning a "big data" project would be well served to at least go out, and study up on all of Hadoop (and other M/R platforms), MPI, OpenMP, BSP, etc., before simply choosing Hadoop because it's the framework du jour.

Often times choosing the framework du jour is the best choice just because it's the framework du jour. Support, training, books, an active ecosystem, a rich base of developers to hire from with experience, corporations incentivized to fund further development, etc, etc all act as a hedge against "slight technical mismatch between our requirements and the technology vs more esoteric ones."

But what you're talking about here is still a considered decision, based on actual analysis and thought. It's not choosing a framework only because it's the framework du jour, but because of the second order effects of it being so.

See 'Nobody ever got fired for using Hadoop on a cluster' http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...

Spot on. I recently audited a project that was using an over the top technical solution for a problem that would - with only minor nuts and bolts work - have fit easily on a single machine instead of on a cluster, and it would have run much faster too. Demonstrating this made the case and they've since happily converted. You can buy off-the-shelf machines with 256G of ram at reasonable (for large values of reasonable) cost with IO speeds to match if you equip them with SSDs.

Big Data to me means at a minimum tens of terabytes, and what big data means changes over time, so today's big data will fit on the laptop of the day after tomorrow.

This can be a confusing topic. Hadoop is several things: a No-SQL data store, map-reduce and a global file system. No-SQL and Map-Reduce can be quite valuable, even on a single server. CouchDB runs on Android, for example.

If you don't need a global file system, use MariaDB, CouchDB or Mongo depending on your use-case.

I have no experience with hadoop at all and it may be slightly off topic but this reminds me of a post titled "Taco bell programming"[1], after reading which I started learning and using Unix tools and commands much more than before instead of writing silly Python scripts for almost anything that needed automation.

[1]: http://web.archive.org/web/20110220110013/http://teddziuba.c...

Hadoop is the problem, MapReduce is not the problem. Having used both Hadoop and Disco, I can say that Disco is by far a net positive on all projects I used it on. And the overhead to coding it in Disco vs single node is about an extra 30 minutes. You can start with working single node and go multinode w/o much effort.


Hadoop on the other hand is a huge, massive pain in the ass. And I am a Hadoop consultant. I recommend that most customers NOT use it.

This is 100% true. Hadoop gets way too much attention given the other useful solutions that exist out there. I have known people to use Disco successfully on clusters of several hundred nodes. You can also integrate Disco with IPython parallel much more easily.

This is why we include it in Linux versions of Anaconda.

Agreed, Disco is really great because you can easily get access to the file descriptor which has your data, then you can load it whichever way you want.

Cool project, thanks for the link.

The article focuses on data sizes and completely ignores per-row computation time requirements. In my case, our dataset is just 1-2 TB, but we need hundreds of cores to process it within a reasonable timeframe - hence Hadoop.

Why is the title "Don't use Hadoop when your data isn't that big"

When the article is "Don't use Hadoop - your data isn't that big"

Two totally different points?

And I'm sure it was correct to begin with.

Well written article. I think most people who do not have a background in data are unaware of the various options out there and fall for the marketing behind Hadoop-like tools.

I would urge people doing analytics to take a look at kdb+ from kx. Unless you have ridiculously large amounts of data (> 200 TB), I can bet that you would be better off with kdb. The only downside is that it costs a lot of money, which is a pity.

You'd have to hire a team of people that can write good K. Since they're so highly in demand in the niche area of finance, you'll be paying a very pretty penny for them.

I have been part of a team which used kdb heavily and this is not really true. The best part of kdb is that it requires almost no administration and at the end of the day becomes just another programming environment/tool for you to use. Hiring a team of people who just do kdb is only done by banks or other unimaginative large teams.

I love how whoever mods this site now doesn't even have to follow their own rule about not editorializing headlines.

Mods: Don't be hypocrites. If you're going to enforce your "only use the source title" trash on us, follow it yourself.

Original title: "Don't use Hadoop - your data isn't that big"

Mod-invented title: "Don't use Hadoop when your data isn't that big"

I've also been sort of feeling like titles get rewritten just for the sake of being rewritten these days. I understand the need to combat sensationalism, but this sort of edit really takes away the writer's voice.

I have been looking into building a simple recommendation engine (I have at most a million data rows) using Python. I looked into Crab (https://github.com/muricoca/crab) and it seems not to have been updated for 2 years.

Any suggestions for libraries or just use basic numpy/scipy and implement the algorithms?

Just implement the damn algorithms. Anything else is just another library to learn, and another tool to babysit.

Agreed. Take the Machine Learning class on Coursera if you need an introduction to them.

Okay. I already have read about the algorithms and even implemented them crudely while studying them.

Just not a big fan of NIH. But if in this case that's the best way - let it be :)

Does anyone use Hadoop for job management?

We have millions of XML documents in a document database. Many of the questions we want to ask about those documents can be answered through the native database querying capabilities.

But there are always questions falling outside the scope of the query capabilities, that could be answered by a simple map function applied to each document, with a reduce to combine the results.

Seems like a pain to always query for the documents you want to process, find some place to store them on disk, then run a program locally to get the result, vs. writing a map and reduce job then pointing it at the documents in the database to run against (this document store has a Hadoop integration API). Hadoop also seems to have a lot of nice job frameworks and monitoring tools and APIs to track job progress.

Does anyone have a similar situation where you used Hadoop just to get job management and tracking and flexibility in performing data analysis tasks? Are there easier ways to accomplish this goal?
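For a document count that fits on one machine, the map-then-reduce described above can be sketched with only the Python standard library. The sample documents, the `<tag>` element, and the function names are invented for illustration; a real job would stream documents out of the database rather than use an in-memory list, and this sketch gives none of the job tracking that Hadoop provides.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import reduce

def map_doc(xml_text):
    """Map step: count the <tag> values found in one document."""
    root = ET.fromstring(xml_text)
    return Counter(el.text for el in root.iter("tag"))

def combine(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

# Stand-in for documents fetched from the document database.
docs = [
    "<doc><tag>hadoop</tag><tag>sql</tag></doc>",
    "<doc><tag>sql</tag></doc>",
    "<doc><tag>hadoop</tag><tag>sql</tag></doc>",
]

totals = reduce(combine, map(map_doc, docs), Counter())
print(totals.most_common())
```

Since `map` is lazy, only one parsed document is held at a time, so the same shape works when `docs` is a generator over query results.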

The company I work for creates a product which may help you : http://www.syncsort.com/en/Data-Integration/Products/DMX-h/D...

I understand and agree with the author's main point that many companies that use big data do not need to use these technologies.

I do not agree that the tools are inferior to SQL. Hive is really close to SQL and Pig is extremely powerful. I would take a look at a few of the recent updates to these tools before declaring them inferior to SQL.

Disclaimer: I've worked with and submitted a few odd patches to Hive. I have not worked with Pig directly.

I think you ought to consider what you mean by inferior here carefully. If you mean 'Hive QL can capture many of the common semantics of SQL' then sure.

If you mean just about anything else, you're wrong. The performance and reliability considerations of Hive/Hadoop are vastly different and very easily inferior to a mysql or postgres setup for small to mid-size datasets.

(That doesn't even get into ease-of-usage. Anyone who's ever dealt with Hive's dreaded 'error: return code: -9' can attest to how maddening Hive can be to use).

There's a key nuance missing: SQL is mature, while even core Hadoop is struggling to get there. You can simply install Postgres, MySQL, etc. in a couple minutes and start working on your data (i.e. the actual work) and not spend hours dealing with … mixed quality … documentation, extensive configuration on multiple nodes, and writing code to provide what are built-in features in most databases. For anything not set in stone with massive data volumes, that overhead adds up quickly.

The other cost is interactivity: Hive takes a LONG time to return results compared to a SQL database even if you don't have massive data. If an analyst is working on a query interactively this is a significant impediment, particularly given the gap in ease of monitoring and performance optimization advice. Again, if you have enough data Hadoop might still be worth it but, particularly in the post-SSD big memory era, the time-to-correct result gap is massive.

I used a big data option on my last project because marketing expectations were gigantic and the hype around the project was also enormous. Looking back it was a poor choice because the expectations never panned out and we could have saved some time and effort using a more traditional and well known SQL database like PostgreSQL. Before that I had a fairly large project with ~1M user profiles running with no problems on PostgreSQL. I think it would have handled the latest project with ease and could be sharded and scaled when needed to handle the growth it's seeing now.

But marketing was insisting we use big data because "regular" databases couldn't handle such enormous possibilities. I'll never believe that nonsense again. It wasn't a terrible ending but it was more hassle than it was worth IMHO. At least I got some resume material out of it.

And even if it is, consider using Spark/Shark instead.

This goes back to the old adage "the right tool for the job".

As @davidmr points out, there are jobs on smaller data sets that can still benefit from the distributed nature of HPC.

That said, my own experience with startups echoes much more what OP writes - Python scripts and CSV processing have saved me days of headaches in resource constrained environments. I was able to quickly produce analysis and make crucial scaling decisions using data that would have taken days of engineering resources to produce. I happened to have the right tool for that job handy, and it worked out great.

You really have to think things through before restricting yourself to any specific direction.

I guess I haven't been around enough small startups (fewer than 100 people) because I hardly get the sense that people are haphazardly spinning up Hadoop clusters. People generally pick the right tools for the right jobs. This is particularly true in the data warehousing world.

That being said I can see problems where people pick hadoop without knowing how it's going to integrate into their systems 1-3 years down the road. Particularly with cloud computing these days, you can easily bring large complex systems online with little effort. It's cool and scary at the same time.

Very good points. I liked learning how to use hadoop academically, but it's not an end all be all tool.

If you want to do map reduce, it's perfectly reasonable just to use something smaller scale on multiple cores to do some kind of data processing.
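As a rough sketch of that smaller-scale approach, here is a word count split across local cores with Python's multiprocessing module. The chunking scheme and all function names are illustrative assumptions, not a prescribed design.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_words(lines):
    """Map step: count words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def map_reduce(chunks, mapper, reducer, initial, map_fn=map):
    """Apply mapper to every chunk via map_fn, then fold the partial results."""
    return reduce(reducer, map_fn(mapper, chunks), initial)

if __name__ == "__main__":
    lines = ["big data", "small data", "data fits in RAM", "big RAM"]
    chunks = chunked(lines, 2)
    # Pool.map has the same call shape as the built-in map, so it drops in directly.
    with Pool(processes=2) as pool:
        totals = map_reduce(chunks, count_words, merge, Counter(), map_fn=pool.map)
    print(totals.most_common(2))
```

Leaving `map_fn` at its default runs the same job serially, which is handy for debugging before spreading it over processes.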

Another thing is real-time processing: using something like Storm (http://storm-project.net/), or even parallelism-based systems like Akka on the JVM or Go, will allow you to have adequate performance. Hadoop has a lot of overhead, not only in operations but also in job startup.

This isn't directly related, but is hadoop the only java-based solution to parallel computing? I've seen some examples of people attempting to do work in java mpi[0] again. It seems like dealing with the gc and memory management in general has been an issue when trying to do high performance computing in java, especially at a distributed scale.


My company's data is that big, thank you very much.

But if yours isn't, sure, don't use a system designed for handling astronomical data. Should be common sense, probably isn't.

While the point in the headline is fine, the supporting reasoning is in places dubious.

Don't use SQL for anything over 5 TB? Huh? You can put a lot more data than that on a node with a nice open source columnar analytic DBMS, and of course there are a lot of MPP relational analytic DBMS as well.

SQL on Hadoop requiring full table scans? Well, that's what Impala is for. Hadapt is more mature than Impala. Stinger is coming on, and is open source.

Nice marketing line, however.

So true! I would 'up' it twice if I could.

Too many startups go over to Hadoop/no-sql solutions before the overhead is indeed justified. SQL for most of the data with a bit of Redis and numpy for background processing will take you much further than most people assume.

It's fun to think you must have DynamoDB, Hadoop or a Cassandra backend, but in real life -- you better invest in more features (or analytics!)

I agree with this guy's main point of using the right tool for the job, but he hates on Hadoop waaay too much. Even overlooking how simple you can make building out mapreduce jobs in python using something like MRJob, or just using Hive if SQL is really your fancy. Hadoop has its place, and as with any tool, can be the hammer that makes everything look like a nail.

The author is missing a big gap between 5TB - 1PB. For most workloads, I would not look to Hadoop at the 5TB+ scale of data. I would first look at Impala or Redshift.


More than 5TB doesn't mean you need Hadoop either, it means you need a better storage system like tokudb/tokumx or leveldb. These technologies can index the data so that you can run selective queries instead of reading all the data for every query like Hadoop would, and they can compress it so that you can keep everything on one disk a while longer.
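
To make the "selective queries instead of reading everything" point concrete, here is a small sketch. It uses SQLite from the Python standard library as a stand-in, not TokuDB or LevelDB, and the table, column and index names are invented for the example. It asks the query planner how it would run the same query before and after an index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        ((i % 1000, f"event-{i}") for i in range(10000)),
    )
    sql = "SELECT payload FROM events WHERE user_id = ?"

    # Without an index, the only option is to read every row.
    before = query_plan(conn, sql, (42,))

    # With an index on the filter column, the planner can jump to matching rows.
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    after = query_plan(conn, sql, (42,))

    matches = conn.execute(sql, (42,)).fetchall()
    conn.close()
    return before, after, len(matches)


if __name__ == "__main__":
    before, after, n = demo()
    print("before:", before)
    print("after: ", after)
    print("rows:  ", n)
```

The exact plan wording varies between SQLite versions, but the first plan should be a scan of the table and the second should name the index.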

The big advantage of Hadoop is that you do not need to pay license fees, which in the case of relational databases can reach into the tens of thousands of dollars.

The big disadvantage is that Hadoop is two orders of magnitude slower than relational databases. Also, Hadoop clusters are not what one would call a "green" solution. More like a terrible waste of computing resources.

Postgres is free.

Hadoop / MapReduce was invented for situations where the data is being generated on the machines (eg, via a distributed web crawl). If you're not generating the data in situ and have to ETL it anyway, it makes just as much sense to load the data onto one Monster Box with a terabyte of RAM and 48 CPU cores. You massively save on complexity.

There is no problem with CV-driven development.

I think that's disingenuous. CV-driven development leads to really messy nightmares, in my experience, for the same reason that a machete is a poor tool for heart surgery.

Not hiring programmers older than 40 is also disingenuous. You are right, of course, but it does solve one problem: your employability. Nobody cares about your perfect VisualBasic architecture.

I sort of agree. However, when you think you will grow into the TB range, you might just as well do it in Hadoop right away.

We are using Hive and HiveQL, and have SQL-like queries which generate the correct output. The result is that we don't have to hassle with the Hadoop mappers and reducers, and we can write our "queries" in a human-readable fashion.

Why would Hadoop automatically be the right tool for the job in this case? In fact, when is any tool automatically the right tool for the job?

I understand the desire not to do unnecessary work and to plan ahead. However, something like Hadoop is very heavyweight and requires a significant change to the way you write your software. It isn't something easy to undo if Hadoop turns out to be the wrong direction. Thus, if you start there you are more likely to stay there, even if "there" is not appropriate. On the other hand, starting with a much simpler system and adapting it as your needs change may add a little more work over the course of the project, but saves a buttload of unnecessary technical debt.

I do agree with you. I prefer to keep it simple too, and not plan ahead too much, according to the KISS principle. I'm not saying Hadoop is the solution all the time. I'm just saying that sometimes it makes sense.

Great points in this article, though using something like Cascalog makes Hadoop suck a lot less, e.g. composable, more complex queries closer in power to SQL. It wouldn't be quite as crazy to use on smaller datasets, be it just for fun or to prove you are ready if/when your dataset grows large enough.

Thx! As always: use the right tool for the job. But many customers don't understand this simple rule.

The people who sign the cheques tend to be managers, who tend to be the sort of person who got to where they are because they are busy and aggressive, and as a result tend to be burdened with a packed, stressful, attention-depleting schedule.

Regardless of the intelligence that they were born with, as a result of all the testosterone-fuelled busyness they lack the cognitive resources to think deeply or in detail about the decisions that they are making (and, ironically, as a result also lack the cognitive resources to realise just how intellectually compromised they are).

The upshot of all of this stress and fatigue is that we end up with decisions that are dominated by groupthink, buzzwords and the sales pitch of the latest vendor to stick his shoe in the door. Oh, and by whatever HBR & McKinsey said last week.

While there is a point to be made here, this article does not make it. Or perhaps it goes too far in attempting to make it, to the point where I feel like it might tip people in the wrong direction.

The point of the article is taken if:

A) Your data is not large.

B) You aren't creating large intermediary datasets with the data.

C) You aren't running an increasingly large number of analysis jobs on the data.

D) Your computational overhead is small.

E) Your memory overhead is small (this requires an asterisk, because some tasks that require extreme amounts of memory will not work well in Hadoop and should be brought outside).

F) You don't need or want a system to track the increasingly large number of analysis jobs you're running.

G) You can guarantee you won't outgrow A, B, C, which would force you to rewrite all your code.

G is especially difficult because it's hard to predict. F is always underestimated at the beginning of a project and bites you later. Yes you can write analysis scripts--what happens when there are a hundred of them, written by different developers? Time to write a job tracking system, with counters, retries, notification, etc. Like Hadoop.
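
As a rough illustration of how quickly that bookkeeping piles up, here is a minimal sketch of a home-grown job runner with retries, counters and a notification hook. Everything in it (run_jobs, the counter names, the demo jobs) is hypothetical and not modelled on Hadoop's actual job tracker.

```python
import time
from collections import Counter


def run_jobs(jobs, max_retries=2, delay_seconds=0.0, notify=print):
    """Run each named job, retrying on failure, and return simple counters."""
    counters = Counter()
    for name, job in jobs.items():
        # One initial attempt plus up to max_retries retries.
        for attempt in range(1, max_retries + 2):
            try:
                job()
                counters["succeeded"] += 1
                break
            except Exception as exc:
                counters["failed_attempts"] += 1
                if attempt > max_retries:
                    counters["failed"] += 1
                    notify(f"job {name!r} gave up after {attempt} attempts: {exc}")
                else:
                    time.sleep(delay_seconds)
    return counters


def demo():
    """Run one job that fails once and one that always fails."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 2:
            raise RuntimeError("transient failure")

    def broken():
        raise ValueError("always fails")

    messages = []
    counters = run_jobs(
        {"flaky": flaky, "broken": broken},
        max_retries=1,
        notify=messages.append,
    )
    return counters, messages


if __name__ == "__main__":
    counters, messages = demo()
    print(dict(counters))
    print(messages)
```

This covers maybe a tenth of what you end up wanting (scheduling, dependencies, logs, a UI), which is the point being made above.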

To further D and E, there are workloads that are relatively straightforward across terabytes of data, and there are workloads that are expensive over gigabytes of data (especially those involving the creation of intermediate indices, which is where MR itself speeds things up considerably, esp. if done in parallel).

Also, in a critique of Hadoop the article obsesses over MapReduce (in a way, conflating Hadoop and MapReduce, just as it conflates 'SQL' with a 'SQL database'), ignoring the increasingly powerful tools that can be used, such as Hive, Pig, Cascading, etc. Do those tools beat a SQL database in flexibility? The answer is that the question is not really relevant. If you already understand the nature of your data, and you've gone through the very difficult act of designing a normalized schema that fits what you need, then you're in a good place. If you have a chunk of data in which the potential has not yet been unlocked, or in which the act of writing to it happens too quickly to justify the live indexing implied by a database, then Hadoop is an essential tool. They really sit next to one another.

None of this is to knock writing analysis scripts against local data. I do that all the time. In fact often I'll ship data from HDFS to the local system so I can write and run a script. I just think it's important at a company to make sure your people have access to good tools so there aren't hurdles in front of them, and when it comes to data analysis I've come to the opinion that you really want to have a Hadoop cluster set up next to your SQL databases and your other tooling, because it will become useful in sometimes unpredictable ways.

Yes, if there are a few hundred megabytes in front of you and you need to analyze them, then write a script--and were I interviewing someone for a job I would not hesitate to accept a script that solves a data analysis task, so clearly the people being interacted with by the author were somewhat myopic. But most companies require that an ecosystem be built to handle the increasing complexity that will ensue over the years. And Hadoop is a huge bootstrap to that ecosystem, regardless of data size.

Thanks. That put my thoughts on screen rather succinctly for me.

Perhaps this is the beginning of the resurgence of the "Small Data Expert"

Another chance to plug for my favorite command-line small data toolset, Crush Tools. I used it often at Groupon.


Great article. I would say 1 TB is really the absolute minimum at which you should start considering Hadoop. For anything less, use Python or variants (for flat files) or MySQL or PostgreSQL (for relational data).

You would be surprised to find out how much processing 400GB of geodata sometimes needs: 7 days on a normal machine to do some non-trivial analysis of OpenStreetMap data. Hadoop can reduce that to hours instead.

Key point: Hadoop storage is bloody cheap compared with a SAN, as in $1k/TB vs $15k/TB for enterprise storage, and it costs a pittance per year compared with licensing costs.

Only problem is backing the bugger up.

Yes the distinction is important - Is it a "Big Data" problem or is it a "Big" data problem

Was that a job interview? You probably shouldn't use your favorite super-scripting language in a job interview that was asking for a more robust answer. Maybe they were looking for someone to transform their hacky platform into something more robust.

Actually, it is. Thanks for the blanket statement, though.
