I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.
All of them wanted to feel like they were doing something special.
The sad thing is, they were all special, each in their own particular way, but none of what made each company magic and special had anything to do with the size of the data that they were handling.
Hadoop was required in exactly zero of these cases.
Funnily enough, after I left one of them, they started to build lots of Hadoop-based systems for reasons which, as far as I could fathom, had more to do with the resumes of the engineers involved than the actual technical merits of the case.
Sad, but 'tis the way of the world.
CEOs want to run bigger companies, even if it's not best for shareholders.
Managers want to run large mini-empires, even if it's not best for the corporation.
Engineers like to use the latest tools, even when they're not best for the problems at hand.
Solving this requires well-respected leadership that knows enough about the details and knows which battles are worth fighting.
The funniest has to be the engineering company which I used to work with (not for, thank fuck) back in the 90's (when 100Gb was big data!). They managed to bag two full 42U racks of HP N-Class HPUX UNIX kit, PDU systems and separate disk arrays for just over £1,000,000, supplied with one full-time UNIX monkey to keep the plates spinning.
Turns out they had only 10 CAD/engineering staff and about 8Gb of data in an Oracle DB which a single Sun Ultra desktop (<£10,000) could have handled without a problem at the time.
Another thing I haven't seen much of is acceptance testing. This is actually more surprising to me than the reliance on vendors for problem solving. It's an absolute no-brainer to say "I have problem X, which will be solved by solution Y, and the success will be measured by the set of tests Z. Unless the above happens, we will not pay the bill and you will take the hardware back." The vendors looked at me like I was from Mars when I started suggesting these.
I have not since complained about government waste in research computing budgets.
We really just want high throughput for linear scans, possibly with a small amount of parallelism, which would argue for a fairly simple solution with as little caching as possible (ideally none whatsoever) ... but I cannot even tell (from the website & the literature that I can access) how these things are supposed to operate -- which is a worrying sign in itself.
Previously, I got good performance out of a SAN using a handful of NexSan SataBeasts -- which I would have recommended to use again, if anybody had bothered to ask.
Also, having an order of magnitude more hardware than you ought to need is totally understandable when hardware capability is increasing by an order of magnitude faster than hardware replacement cycles. It's when you're off by several orders of magnitude that it starts to become suspect.
I had a Sun 1000E for 3 years as a workstation from 1996. 8x 50MHz CPUs, 2Gb of RAM, 16Gb disk array. It was made in '93 and was approx 12U for the unit and array. Things weren't that big in the 90's. 80's yes...
There was a phrase I learned long ago: "if it fits into cache it's not a supercomputing problem."
This essay promotes the related view: "If it fits on a desktop, it's not Big Data."
BTW, you might like Greg Wilson's 2008 essay "HPC Considered Harmful". I can't find the slides anymore, but there's a lovely interview with him about this at http://itc.conversationsnetwork.org/shows/detail3682.html .
Maybe I just need more discipline..
Now some might counter this point with the observation others have made here that using hadoop imposes a ~10x slowdown. But even then, my 100 EC2 servers will get the job done 10x faster. Running a job in 1 hour with hadoop is MUCH better than running the same job in 10 hours without it, especially when you're doing data analysis and you need to iterate rapidly.
So there is a point where using hadoop is not productive. But that limit is not 5 TB and depends on a lot more variables. Over simplification makes for catchy blog posts, but is rarely the way to make good engineering decisions.
The downside of EMR is that it can be fairly expensive once you start needing the beefy machines. We're lucky that we can afford to have our analytics delayed an hour or two and can thus run on Spot instances (except for the Master node). When we move to a streaming architecture I'm not sure EMR will still be competitive, since we won't be able to have those machines go away on us.
In this case, is Hadoop actually a good tool? When you have such a small amount of data, it is unlikely you will ever be faced with a "reduce" challenge, and thus the problem is better solved with a simple queue rather than a hadoop cluster.
It's orders of magnitude easier to start 100 servers and reuse basic Unix skills than it is to setup and manage Hadoop on the same infrastructure.
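For what it's worth, a minimal sketch of the "simple queue" approach using only the standard library; the input file and the process_record function are hypothetical stand-ins for the real workload:

    import multiprocessing as mp

    def process_record(line):
        return len(line.split(','))          # placeholder "work": count fields

    def worker(in_q, out_q):
        for line in iter(in_q.get, None):    # None is the shutdown sentinel
            out_q.put(process_record(line))

    if __name__ == '__main__':
        in_q, out_q = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
        for w in workers:
            w.start()
        n = 0
        with open('records.csv') as f:       # hypothetical input file
            for line in f:
                in_q.put(line)
                n += 1
        for _ in workers:
            in_q.put(None)                   # one sentinel per worker
        results = [out_q.get() for _ in range(n)]
        for w in workers:
            w.join()
        print(sum(results))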
Related, it is remarkable how we developers routinely cite Knuth's advice about premature optimization to justify our decision when the shoe fits, and then turn around and flatly ignore the advice when it doesn't fit.
Selecting Hadoop before you have a specific and concrete need for it--or see that need approaching rapidly on the horizon--is in my experience often and surprisingly coupled with a disdain for other performance characteristics (because Knuth!). The developers prematurely selecting Hadoop as their data management platform will routinely be the same developers who believe it's reasonable for a web application with modest functionality to require dozens of application nodes to service concurrent request load measured in the mere thousands. The sad thing here being that application platforms and frameworks are not all that dissimilar; today, selecting something with higher performance in the application tier is relatively low-friction from a learning and effort perspective. But it's often not done. Meanwhile, selecting Hadoop on the data tier is a substantially different paradigm versus the alternative (as the article points out), so you have some debt incoming once you make that decision. And yet, this is done often enough for many of us to recognize the problem.
In my experience, for a modest web application, it's better to focus resources and effort on avoiding common and stupid mistakes that lead to performance and scale pain. Selecting Hadoop too early doesn't really do a whole lot to move the performance needle for a modest web application.
Trouble is, many web businesses are blind to the fact that they are a modest concern and not the next Facebook.
The great achievement of Hadoop isn't performance, throughput, etc. (and in many cases it can actually be worse than alternatives).
The great achievement is that it is an accessible (easily installable, configurable, scalable, supportable, programmable) cluster for the "mere mortal" enterprise -- easier by orders of magnitude. Before Hadoop, any clusterable software (where even 2 nodes, laughable by Hadoop standards, were already priced at the cluster level) would cost an arm and a leg, and 4 or 8 nodes would be considered a huge cluster. Hadoop gives an enterprise (not CERN :) a chance to taste unconstrained (loose use of the term) distributed computing, and the next generation of software is already coming - Impala, etc...
And obligatory - "nobody gets fired for using Hadoop".
It is especially bad when people use Knuth's advice to forgo the conception part, where you think about what technologies you will use and why.
I'm assuming this is a dig at Rails and Django? What are you suggesting instead?
I'd certainly like to get into alternatives but my impression was that none of these battle-tested the way Rails and Django are.
Rails is the classic example of optimizing for developer ease of use instead of performance. Which probably makes sense for startups, but scaling is a pain in the ass.
Openresty is great for infrastructure level solutions, e.g. request routing and authentication. It is used by some of the largest Internet companies in China.
The benchmarks at are a reminder of how poorly some popular frameworks perform, and how poorly cloud performs vs relatively cheap dedicated hardware.
I suppose you'd need to make that judgment call yourself. In my opinion, many of the options I listed have seen sufficient production usage that I consider them roughly as secure as any mainstream framework. For example, Play and Scalatra are used at Netflix, BBC, LinkedIn, etc.
Many expressly avoid notoriously dangerous (and frankly alarming) deserialize-user-input-and-execute patterns, which is in my experience the most common form of framework vulnerability.
As for performance, scalability, and reliability under pressure, that is precisely the point I was making earlier: higher-performance platforms combined with frameworks designed for performance and scalability are targeting grace under pressure more so than legacy frameworks. Higher performance gives you greater peace of mind with availability, responsiveness, and user experience.
I just gave a presentation last week at the SF Python Meetup about how AdRoll uses a single server to query terabytes of compressed data with sub-minute latencies, using a system that is implemented in Python:
This is probably a thousand times less resource intensive than using Hadoop for the same queries.
It's also why we are both building compression into the native storage format for Blaze, and why Blaze is designed to run out-of-core from the start.
Just as a bit of background, I think that Chris would very much agree that I am not the intended recipient of this advice, and so my comments probably aren't keeping in the spirit of the article. I've spent the last 10+ years exclusively in very large HPC environments where the average size of the problem set is somewhere between 500TB and 10PB, and usually much closer to the latter than the former.
I think that, for the types of problems Chris mentions on small data sets, hadoop is as silly a solution as he claims, and for the large map-reduce problem sets (divide and conquer using simple arithmetic) of 5TB+, he's clearly in the right. Periodically I peruse job postings to see what is out there, and I'm personally ashamed at what many people call "big data", but just because your problem set doesn't fit the traditional model of big data (incidentally, I'm having trouble thinking of a canonical example of big data -- perhaps genome sequencing? astronomical survey data?), doesn't mean that a) hadoop is not the right solution, or that b) it's best done on a box with a hard drive and a postgres install, pandas/scipy, whatever.
Take for example a 4TB data set. It is defined such that it fits on a 4TB hard drive, but if your problem involves reading the entire set of data rather than just the indexes of a well-ordered schema, you're still going to have a bad time if you want it done quickly. The same goes if you have a parameterization model that requires each permutation to be applied across the entire sequence of the data, rather than chunks you can load into memory and then move on from.
We'll be generous and say that a 4TB drive can do 150MB/s. A single run through the data at maximum efficiency will cost you ~8 hours. Since we've restricted ourselves to a single box, we're also not going to be able to keep the data in memory for subsequent calculations/simulations/whatever.
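The back-of-the-envelope arithmetic, for reference:

    # Sequential scan time for a 4TB data set at ~150MB/s sustained read.
    data_bytes = 4 * 10**12
    throughput = 150 * 10**6          # bytes per second
    hours = data_bytes / throughput / 3600
    print(round(hours, 1))            # ~7.4 hours per full pass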
I suppose all of this is to say that the amount of parallelization a problem requires isn't related only to the size of the problem set, as the article mostly emphasizes, but also to the inherent CPU and IO characteristics of the problem. Some small problems are great for large-scale map-reduce clusters, and some huge problems are horrible for even bigger-scale map-reduce clusters (think fluid dynamics or anything that requires each subdivision of the problem space to communicate with its neighbors).
I've had a quote printed on my door for years: Supercomputers are an expensive tool for turning CPU-bound problems into IO-bound problems.
On today's top HPC installations, it currently takes about an hour to read or write the contents of memory from/to global storage. If the workflow has broad dependencies (PDEs are especially bad, but lots of network analysis also fits the bill), it's much better to use more parallelism to run for a shorter period of time with the working set in memory. If you can't fit your working set in memory on the largest machine available, chances are you should either find time on a bigger machine or you can't afford to do the analysis. (The largest scientific allocations are in the 10s of millions of core hours per year, which would be burned in a few reads and writes on a million-core machine.)
Also note that MapReduce is not very expressive. When IBM's Watson team was getting started, some people suggested using MapReduce/Hadoop, but the team quickly concluded it was way too slow/constraining and instead opted for an in-memory database with MPI for communication.
The next huge thing in the Hadoop ecosystem is Hive. It's like having an unlimited Postgres server that hundreds of people can work on at the same time. Hive's got its quirks and developers need to be educated on how to write the Hive jobs properly, but it gives a lot more back by allowing joins between a lot of datasets that might not be able to fit in one single machine.
There is a reason why Hadoop is so popular, especially in enterprises. I think that's why Cloudera got such a high valuation.
I'm sure there are scientists who put their all into getting their code to run on a super computer, then apply for a pittance of time and sit in a queue for weeks, but they're not publishing as many (or as interesting) papers as the ones who are pulling out their credit card and building mini-supercomputers on Amazon that rely on conventional interconnects and better algorithms and data processing tools.
Don't assume I'm ignorant about what supercomputers are used for. I used to work on supercomputers; that includes writing, running and evaluating codes, and selecting proposals on some of the largest machines in the world (at the time).
But these new approaches to doing science and engineering on large computer systems have obsoleted conventional supercomputing for all but an extremely limited set of computational problems. And that set of problems becomes smaller when clever computer scientists figure out smarter ways to run things on cheaper architectures. For example, page rank is a classic eigenvector problem; you can solve it by building a big matrix on your supercomputer and doing the appropriate calculations, using 50+ years of numerical optimization (but not really very good support for nodes failing between checkpoints). However, you can also implement it as an iterative mapreduce. The mapreduce checkpoints every little bit of map work, and along the way, handles those failures quite well. It can also handle data sets larger than the sum of RAM on the machines quite well.
Guess which one works better operationally, scales to a larger data set size, and ports to a lot of architectures cheaply?
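To make the mapreduce formulation concrete, here is a toy sketch (not anyone's production code): each pass maps a node's rank out along its links and reduces the contributions per target node. The tiny in-memory adjacency dict is invented purely to show the shape of the computation; a real job would shard it across machines.

    from collections import defaultdict

    # Toy link graph: node -> outgoing links (stand-in for real crawl data).
    links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
    rank = {n: 1.0 / len(links) for n in links}
    damping = 0.85

    for _ in range(20):                      # each pass is one map + reduce
        contrib = defaultdict(float)
        for node, outs in links.items():     # "map": emit (target, share) pairs
            for target in outs:
                contrib[target] += rank[node] / len(outs)
        rank = {n: (1 - damping) / len(links) + damping * contrib[n]
                for n in links}              # "reduce": sum contributions per node

    print(rank)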
I personally see it more as there's a pretty limited set of computational problems which are low-communication, often requiring rather gross approximations, even with the "clever computer scientists" looking at the problem, and they're often not the problems scientists actually want to solve, and so dismissing supercomputing is rather ridiculous.
With regards to protein MD, no, I'm not saying protein MD is a waste of time. In fact, I run the Exacycle program at Google which has run the largest MD simulations (many milliseconds) ever done, and the results are quite good. You would never have gotten results as good as ours on a supercomputer- even the world's largest, with the most highly scaling MD codes.
I speak from experience- I used to code for supercomputers, in fact working on protein (and DNA/RNA, my interest) structural dynamics, with some of the leading MD codes.
I helped port a better approach to Google's infrastructure some time ago. It's not a clever hack or gross approximation: it's a distinctly better way of modelling protein dynamics than running long single trajectories on supercomputers.
Sorry, that's complete rubbish. Some problems just can't be made embarrassingly parallel or near so. You might be able to swap a problem for a similarish problem with better characteristics, but that's not the same as being a "poor programmer" since you're now using a different model rather than optimizing the implementation of the existing one. You still haven't explained how I do a parallel FFT without high speed interconnects -- I suspect the answer is "don't". I've tried running the stuff I do (DFT) over a 10GigE network on my cluster and it ran at about 10% of the speed of an Infiniband-enabled calculation. There are methods for getting better scaling, but they all (as I said) involve rather crass approximations which you don't always want to do. Those methods will probably increasingly be used more on large supercomputers due to their superior scaling characteristics, even with their nice high-speed interconnects, but you're still making a sacrifice in accuracy.
It might be possible to get good low-communication scaling for some models, as you were in the case of your sampling system (which typically parallelizes nicely, but only if the individual sampling jobs can fit on a single node), but you can't extrapolate that to assume that everyone can. As I said, exchanging a poorly-scaling model for a different, non-equivalent well-scaling model is a scientific question with tradeoffs, not a programming one.
Are you sure there is not a network friendly FFT algorithm?
I used to think very differently about computing before I read Google's MR, BigTable, and GFS papers. After joining Google and working on problems like this, I can assure you that FFTs can indeed scale quite nicely.
Note that infiniband isn't a high speed interconnect. It's a cheap commodity interconnect -- in fact, per port, it can be cheaper than 10GbE. But lots of effort has been put into making the MPI libraries work efficiently over it, compared to 10GbE, so codes run more efficiently (as you observed).
DFT seems like the next thing that's going to not need supercomputers. I see some nice new algorithms coming on line designed for Amazon cloud infrastructure.
Why do you think that an interconnect can't be both cheap and high-speed? As you said, there has been a lot of work to make MPI efficient on infiniband, but that has been possible because infiniband offers much lower (~10 times) latency than 10GbE, not because no one has optimized MPI for ethernet. In fact, standards like RoCE and iWarp have been devised by the Ethernet working group to compete with infiniband on this particular metric.
Anyway, my point was that infiniband is not a high speed interconnect, and it's cheaper than 10GbE. The challenge is to deliver scaled infiniband, which is far harder than scaled ethernet.
Remember, the human genome was assembled on two different architectures: a Dec Alpha cluster with large memory with fast network (for the day) and a bunch of cheap linux machines with slow network. It turns out you didn't actually need the former, although the people at Celera insisted you did.
Which is poppycock.
Plus the science is hard enough that throwing multinode programming into the middle is tricky. Remember, most of this code is written by a PhD student on his first real programming job. It's hard enough to go from getting it to run on something other than the PhD student's laptop, never mind robustly parallelizing it across nodes.
LSST, which I also am starting to work on, will be different, as the raw data alone will approach 4TB a day.
Computing and particle physics is where awesome meet.
I hope the CMS and ATLAS teams onsite at CERN were much better.
I agree with your overall point, but of course the actual limits for a fairly cost effective single box are fairly high these days. E.g. from my preferred (UK) vendor, 4x PCIe based 960GB SSD's (each card is really 4 SSD's + controllers) adds up to an additional $7500. We often get 800MB/sec with a single card.
Your point still holds - the actual thresholds are just likely to be quite a bit higher than what the performance of a single 4TB hd might imply.
Tack on 40+ cores, and 1TB RAM, and the cost to purchase a server like that is somewhere around the $60k mark in the UK (without shopping around or building it yourself), but that adds up to "only" about $2k/month in leasing costs. That makes a single server solution viable for quite a lot larger jobs than people often think.
A rule of thumb that I've inferred from many installations: just introducing Hadoop makes everything 10 times slower AND more expensive than an efficient tool set (e.g. pandas).
So it only makes sense to start hadooping when you are getting close to the limit of what you can do with pandas - everything you do before that is a horrible waste of resources.
And when you do get there - often, a slightly smarter distribution among servers while staying with e.g. pandas will let you keep scaling up without taking the 10x hit to productivity. Although it might be unavoidable at some point.
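For what it's worth, a sketch of how far plain pandas on one box can stretch before that point, using chunked reads so the full file never has to fit in memory (events.csv and its columns are made up for illustration):

    import pandas as pd

    # Stream a large CSV in 1M-row chunks and keep only a small running aggregate.
    totals = None
    for chunk in pd.read_csv('events.csv', chunksize=1_000_000):
        part = chunk.groupby('user_id')['amount'].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)

    print(totals.sort_values(ascending=False).head(10))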
The main reason to switch to Hadoop at the point when Pandas fails is because you expect to scale past that gap fairly quickly.
"Big data" is like "cloud" it is a cool label everyone applying to their system. Just like OO was in its time. Well once they applied the label they feel they need to live up to it so well "we gotta use what big data companies use" and they pick Hadoop. I've heard hadoop used when MySQL, SQLite or even flat files would have worked.
Chris points out a great example of NDD here with Hadoop.
I do a lot of client work and I see this mistake CONSTANTLY. So often in fact, that I recently wrote up a story to illustrate the problem. Rather than use a tech example, I use a restaurant and plumbing to drive the point home. When the same scenario is put into the context of something more concrete like physical plumbing, it shows how ridiculous NDD really is.
Also, I think it's a bit disingenuous to compare a complex system built to various specific needs versus a simpler one that doesn't address those needs at all and just say "their only difference is that one worked and the other didn't".
Obviously, if the only criterion was that it should just work, everyone would go with the simpler system. The more complex system probably had some things going for it, too, otherwise nobody would choose it.
It's obvious to you and me, but to many people it's not obvious at all. You only have to look at the very large deadpool of companies that made these mistakes and died because of it.
People optimize prematurely, scale prematurely, make ego based decisions, care more about interesting tech than making the business grow, etc etc. That's really the whole point of the parent article - people make bad decisions because of the allure of some "cool" but complex and unnecessary tool.
Also, thanks for the note on the background - I'll fix that :)
Done! Although with more drives and a backup server. Right now, we're pushing 15Tb with no loss in performance.
Most companies I've worked for have shelled out quite a bit more for "enterprise" class hard drives. I've always struggled to understand what these bring to the table, and my understanding is that it's some combination of greater reliability and a service agreement.
It's always seemed to me, in my software-centric naivete, that it would be more cost-effective to get off-the-shelf drives, RAID them for redundancy to increase uptime, and make regular backups to increase data longevity.
On the one hand, cheap disks are much more prone to start operating slowly when they begin to fail, instead of transitioning directly from working to not working (or setting themselves as failed).
On the other hand, RAID write speeds are roughly that of the slowest drive in the set. The more drives you have, the higher the probability of a malfunctioning hard drive making the whole thing go slow. Slow drives mean I/O bottlenecks, rendering the whole server unusable.
Pair both things and you have actually increased your effective downtime instead of decreasing it. Now use expensive disks (that work more reliably, and with higher probability either work or fail outright) and just calculate whether the price increase is lower than the cost of the downtime/slowdowns when running cheap disks.
Oftentimes the price premium is worth it.
One of the advantages of enterprise drives is how long the drive spends error checking. Consumer drives spend considerably longer when they detect an error, and that can cause the RAID controller to drop the drive as malfunctioning. Then it has to rebuild the array or operate on the hot spare.
There used to be a firmware setting you could toggle on the WD Caviar Black drives to effectively turn them into the RE (enterprise) drives. They've since made some design changes so that's no longer possible.
But certainly something to think about when processing realtime financial transactions.
The one area we didn't penny-pinch on was the tape backup. Glacially slow compared to HDDs, obviously, but absolutely necessary.
But from what I've seen, the most important thing is the service agreement, i.e. having a new disk ready the next day without any questions asked (or money spent on a new drive).
To put it other way - expected vs. unexpected expenses.
In fact, I wouldn't doubt that for several "big data" users even SQLite would be enough.
I think the data size ended up being a GB or so.
We do use RAID 6 (I think) + tape backups and the backup server itself is a hot standby. If you're curious, this is the chassis : http://www.supermicro.com/products/chassis/3U/836/SC836A-R12...
The motherboard is also Supermicro running dual AMD Opterons ( single processor on the backup server ) with 256Gb RAM.
I worked on a project earlier this year that was envisioned as a hadoop setup. The end result was a 200loc python script that runs as a daily batch job - https://github.com/jamii/springer-recommendations
I'm tempted to setup a business selling 'medium-data' solutions, for people who think they need hadoop.
- 3x replication: if the data needs to be retained long-term, slapping it on one hard drive isn't going to cut it. This is pretty poor justification by itself, but it's nice to have.
- working set: if you only pull 1GB out of your data set for your computations, it makes sense to pull data from a database and run Python locally (see the sketch below). If you need to run a batch job across your full, multi-TB data set every day then Hadoop starts looking more attractive
- data growth: a company may only have 10GB of data now, but how much do they expect to have in a year? It's important to forecast how much data you'll accumulate in the future. Especially if you want to throw all your logs/clickstreams/whatever into storage.
So, if you're expecting explosive growth, you want to hang on to every piece of data ever, or you're going to do a lot of computation across the whole dataset, it makes sense to adopt Hadoop even if your dataset isn't 'big' to start.
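On the working-set point above, a sketch of the "pull the small slice you need and analyse it locally" pattern; the database file, table, and column names are invented for illustration:

    import sqlite3
    import pandas as pd

    # Pull only the ~1GB working set out of the full store, then analyse locally.
    conn = sqlite3.connect('warehouse.db')          # hypothetical local snapshot
    df = pd.read_sql_query(
        "SELECT user_id, amount FROM purchases WHERE day >= '2013-09-01'", conn)
    print(df.groupby('user_id')['amount'].mean().describe())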
As for this article, the author undersells MapReduce a bit. Human-written MR jobs can jam a lot of work into those two operations (and a free sort, which is often useful). Using a tool like Crunch can turn really complicated jobs into one or two phases of MR. Once Tez is widely available people won't even write MR anymore, they'll all likely write a 'high-level' language and compile it down to Tez.
* Replication: Disks are cheap, data can go on as many drives and servers as you need. Master-slave replication is pretty damn bulletproof these days, and multi-master isn't as terrible as it was even a few years ago.
* Working set: All servers can have the entire working set. Additionally, features like foreign data wrappers in postgresql and federated tables in mysql mean that you can query your sharded databases, and still get aggregate results back.
* Data growth: SQL servers were Big Data before Big Data was cool. Terabyte storage clusters are no problem for an SQL database engine, with some entities having clusters in the petabyte range.
Databases are a very fast moving target of late, with competition from many fronts meaning that what was true even last year may not be true today. Additionally, with things like Postgresql's foreign data wrappers, you can choose the best data processing engine for the job -- keep the data on the databases for the nice bits of SQL, but still be able to throw it towards a hadoop cluster at a whim if needed. Using the right tool for the job is important, and equally important is keeping up with what tools are out there, because this is a rapidly evolving area of tech, and analyzing each job separately, instead of relying on any mantra, is what's needed to keep one flexible to keep up.
You can definitely put lots of data in a RDBMS, and you can get it back very quickly. I'm not advocating for Hadoop for problems where a database - be it sharded, vertically scaled, whatever - will do; I know you can scale relational to petabytes of data.
Hadoop isn't supposed to be a tool for having a lot of data, it's a tool for doing things with a lot of data. I've done lots of weird computation on Hadoop: OCR of time-series images is a good example. This is general purpose, parallel computation which takes advantage of Hadoop's scheduling infrastructure and the notion of data locality - some of the data lives on every compute node, and can be accessed with very low latency.
To put it another way, databases will get faster and more scalable, but there's still a need for ETL when moving between different data models. Some people use Hadoop just for ETL, particularly when ingesting semi-structured data, because it's good at computation, but only OK at storing the finished, structured data.
I'm going to try not to be offended by your closing line; I was trying to point out a particular dimension of considering whether an application is suited to Hadoop - the computation component. This is often ignored by people who treat Hadoop like yet another database, when that's really a complete mischaracterization. I certainly don't advocate 'relying on any mantra', but instead considering all solutions and selecting the most appropriate tool.
"We completely agree that Hadoop on a cluster is the
right solution for jobs where the input data is multi-terabyte
or larger. However, in this position paper we ask if this is
the right path for general purpose data analytics? Evidence
suggests that many MapReduce-like jobs process relatively
small input data sets (less than 14 GB). Memory has reached
a GB/$ ratio such that it is now technically and ﬁnancially
feasible to have servers with 100s GB of DRAM. We therefore ask, should we be scaling by using single machines with
very large memories rather than clusters? We conjecture that,
in terms of hardware and programmer time, this may be a
better option for the majority of data processing jobs."
Their data is based on Hadoop jobs running at Yahoo, Facebook, and Microsoft -- companies most would agree do have real Big Data -- and they find the median job size is <14GB.
There's really nothing wrong with that at all, because, broken into 64MB blocks, that 12GB can be processed in parallel, which means turning an answer around really quickly on that 12GB, say 30 seconds or so. Usually the work can be scheduled on machines that already have the necessary input, so the network cost is low.
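The arithmetic behind that claim, roughly:

    # A 12GB input split into 64MB HDFS blocks.
    blocks = 12 * 1024 // 64       # ~192 map tasks, all runnable in parallel
    print(blocks)
    # At ~100MB/s local reads, each 64MB split is read in under a second,
    # so wall-clock time is dominated by scheduling overhead, not I/O.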
Now, it might not be worth it for one hacker to build a Hadoop cluster to do that one job, but if you have a departmental-wide or company-wide cluster you can just submit your jobs, get quick answers, and let somebody else sysadmin.
Sure the M/R model is limited, but it's a powerful model that is simple to program. You can write unit tests for Mappers and Reducers that don't involve initializing Hadoop at all, and THAT speeds up development.
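To illustrate that point, here is the shape of it in plain Python (real Hadoop jobs would be Java or streaming scripts; this is a logic-only sketch): the mapper and reducer are just functions, so they can be unit-tested with no cluster in sight.

    # Word-count mapper/reducer written as plain functions.
    def mapper(line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(word, counts):
        return word, sum(counts)

    # A unit test needs no Hadoop at all:
    def test_wordcount():
        pairs = list(mapper("the quick the fox"))
        assert pairs.count(("the", 1)) == 2
        assert reducer("the", [1, 1]) == ("the", 2)

    test_wordcount()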
Yes, it is easy to translate SQL jobs to M/R, but M/R can do things that SQL can't do. For instance, an arbitrary CPU or internet intensive job can be easily embedded in the map or in the reduce, so you can do parameter scans over ray tracing or crack codes or whatever.
I built my own Map/Reduce framework optimized for SMP machines and ultimately had my 'shuffle' implementation break with increasing input size. At that point I switched to Hadoop because I didn't plan to have time to deal with scalability problems.
With cloud provisioning, you can run a Hadoop cluster for as little as 7.5 cents, so it's a very sane answer for how to get weekly batch jobs done.
Interviewer: "Imagine you have some numbers spread over some computers -- too many to fit on one computer. Find the median."
Me: "Uhh, sort them?"
Interviewer: "Can you find the median on a single computer without sorting them?"
Interviewer: "We'll call you back tomorrow."
I was promptly rejected, but it set the tone for my later studies.
The criterion for Big Data seems to be that it fits on thousands of computers, perhaps several TB or a PB. Then I had to think of some examples:
* A million YouTube Videos
* All the tweets in the US in the past 15 minutes
* All US tax records
I still think the map-reduce philosophy is really cool. And I know at that scale there are special counting algorithms (like Bloom Filters) that may lead to some improvements at the GB or MB scales.
An expected-O(n) algorithm is quickselect. It is basically like quicksort, but you only recurse on the side of the pivot that contains the median.
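A minimal single-machine quickselect sketch (the distributed version the interviewer presumably wanted is harder, but the core recursion looks like this):

    import random

    def quickselect(xs, k):
        """Return the k-th smallest element of xs (0-based), expected O(n)."""
        pivot = random.choice(xs)
        lo = [x for x in xs if x < pivot]
        hi = [x for x in xs if x > pivot]
        eq = len(xs) - len(lo) - len(hi)
        if k < len(lo):
            return quickselect(lo, k)
        if k < len(lo) + eq:
            return pivot
        return quickselect(hi, k - len(lo) - eq)

    nums = [7, 1, 5, 3, 9, 2, 8]
    print(quickselect(nums, len(nums) // 2))   # median -> 5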
15 minutes of US tweets would fit on your phone :-)
A local python script is great, but what if it takes 2 or 3 hours to run? Now you need to set up a server to run python scripts. What if the data is generated somewhere that would have high locality to a hadoop cluster? Now you need to pull that data down to your laptop to run your job. What if there are a dozen people running similar jobs? Now your python script server is a highly stressed single point of failure. What if the data is growing 100% month-over-month? Your python scripts are going in the trash soon since they were not written in a way that can be easily translated to map-reduce and hadoop-sized workloads are inevitable.
The next step up is a centralized database, but in my experience running your own (large, highly used) database is a whole lot harder than just throwing files on S3 and spinning up hadoop clusters on EC2 if you have people that can write pig jobs.
A solution like elastic map reduce removes a lot of practical problems such as data distribution, resource management, and system operations beyond the fact that it makes it possible to easily run jobs over terabytes of data at a time.
Hadoop is a solution for cases where you have multiple petabytes of data, with queries that need to touch a significant portion of your data. Roughly speaking, in this case your execution time will scale with the number of nodes in your cluster. The classic example is creating the inverted word list for a search engine.
For most other use cases, including all cases where you can index your data, you do not need Hadoop.
But I have something to add. Hadoop is not only introducing new techniques for distributed storage and computation. Hadoop is also proposing a methodological change in the way a data project is approached.
I'm not talking only about doing some analytics over the data, but building an entire data-driven system. A good example would be building a vertical search engine, for example for classified ads around the world. You can try to build the system just using a database and some workers dealing with the data. But soon you'll find a lot of problems managing the system.
Hadoop provides you all the storage and processing power that you want (it is a matter of money). So what if you build your system in a way where you always recompute everything from the raw input data? That can be seen as something stupid: why do that if you can run the system with fewer resources?
The answer is that with this approach you can:
- Be tolerant of human failure. If somebody introduces a bug in the system, you just have to fix the code and relaunch the computation. That is not possible with stateful systems, like those based on doing updates over a database.
- Be very agile in developing evolutions. Changing the whole system is not traumatic, as you just have to change the code and relaunch the process with the new code, without much impact on the system. That is not something simple in database-backed systems.
The following page shows how a vertical search engine would be built using Hadoop and what would it be its advantages: http://www.datasalt.com/2011/10/scalable-vertical-search-eng...
Could you elaborate? This sounds like the NoSQL argument that relational databases are "not agile", which usually means relational databases complain that you have records that won't logically fit the changes you just made.
Hadoop is not a database! It's a parallel computing platform for MapReduce-style problems that could preserve locality. If your problem fits this, Hadoop absolutely rocks. If your problem is different then please use another tool. If your problem deals for example with high-resolution geographic or LiDAR data that can be easily processed independently (and that would easily give you a few petabytes each scan/flight so you can't stuff it into GPU), Hadoop is about the only open thing you can use to process them reliably (imagine having Earth surface data with the resolution of 1cm and the need to prepare multiple levels of detail, simplify geometry, perform object recognition, identify roads etc.). Even when your data is smaller if your problem fits in the MapReduce model well, Hadoop is a pretty convenient way to be future-proof while enjoying already mature application infrastructure. Why would you even bother working with toys that put everything into memory and then fail miserably in production (HW failure, need to reseed data after crash etc.)?
I worked in a company that routinely processed these kinds of data; usually people who accepted the thinking from this article hit a wall someday in production, couldn't guarantee reliability, and ended up writing endless hacks for algorithms that didn't scale when it was needed and became frustrating bottlenecks for everyone.
Yes, I also saw some ridiculous uses of Hadoop (a database that didn't have a chance of ever growing over 20M records, problems not fitting MapReduce that needed custom messaging between jobs, etc.). Just use your reason properly: whatever has the potential to handle large data in the future, do it with Hadoop or any appropriate system that supports your algorithmic model (S4, Kafka, OrientDB, Storm etc.) straight away.
Make your software future-proof now or you'll have to rewrite it from scratch when you are under huge pressure. Don't become complacent with what you "know" now.
Nor does the essay claim that Hadoop is a database.
> "If your problem fits this, Hadoop absolutely rocks"
The essay points out that most systems which do use Hadoop don't actually fit the Hadoop model, and that other 'mature application infrastructures' would be more effective than Hadoop. You misinterpreted it to mean the converse.
It also agrees with you that there are cases where "Hadoop might be a good option". It says "The only benefit to using Hadoop is scaling", and that it might be appropriate for >5TB data sets. Your 1PB example is of course larger than 5TB.
So I don't think you actually disagree with it.
> "Why would you even bother working with toys that put everything into memory and then fail miserably in production"
Thank you for calling my software a "toy" and disdaining my field of research. Do not presume that the needs of your field hold for all others.
Hadoop isn't useful for what I'm interested in, which is to support interactive search of ~2 million chemical structures. This requires a sub-100 ms query time. Hadoop, last I checked, was lousy for soft real-time work like this.
You actually say "Hadoop or any appropriate system that supports your algorithmic model".
My actual solution was the old-fashioned way: a combination of improved algorithms, multithreading, better data locality, and a bit of chip-specific assembly. The result is about 100x faster than the previous widely used tool, and gives me the sub-second search times I want.
Moreover, it scales well. I use the new code as part of the inner loop in a clustering task, what once took a week on a machine cluster is now being done on a single node in an afternoon.
Had I taken your suggestion I would have invested in more hardware, which would have been the wrong solution for my needs. Also, data size in my field doubles every 5-10 years, which is much slower than the rate of computer performance.
First, I really don't like if somebody by default compares Hadoop with SQL and this is a widespread confusion - they are completely different beasts; in fact an extension called Hive gives Hadoop + HTable an SQL-like syntax.
However, Hadoop is a parallel, batch-processing platform. Hadoop is slow-responding; you can't even talk about latency because a single task takes a lot of time even to set up and execute. It's completely inappropriate for real-time low-latency computations. For those, in-memory, GPGPU, and stream systems are much better. However, if you have a large dataset whose loading times onto a single machine may be very long - imagine loading 1TB to memory from a network drive, if your computer can handle it - you might be better off partitioning your data across thousands of smaller nodes (e.g. ARM microservers) and performing computations on each of them independently. This way you don't need to transfer a lot of data, each smaller local dataset is loaded very fast (= you preserve locality), you find balance between being CPU-bound and IO-bound, and likely finish your computations much faster than on a single computer with huge memory but a limited bus.
Hadoop's tragedy is that it is now an established platform that is supported by large companies which are mostly driven by technologically clueless people, trying to put it everywhere as it is "in vogue" to do so. Google abandoned MapReduce model a few years ago but industry didn't seem to notice.
Think about your case for chemical structures - is there any part of your algorithm that needs to be computed only occasionally but it's a lot of data to process? Why not offload it to Hadoop for preparing digest from these data which you can use in your real-time algorithm? That's actually pretty common usage scenario - Hadoop handles the rough mining, extracting the precious stuff from crude data, assembling it to a form that is refined and can be used by other parts of system that have completely different requirements, for example low-latency interactions.
The first comparison was to Scala, and then to SQL. The main comparison was to SQL and a Python script. While it does talk more about SQL than other solutions, I don't see that as a default comparison.
> "For those, the in-memory, GPGPU, streams are much better"
I'll go off on a bit of philosophical tangent here. Yes, GPGPUs can be more effective for the task I'm doing, since I'm actually memory bound. However, GPGPUs require dedicated hardware, while the code I work is effective even on laptops with limited GPUs, including web servers.
Ideologically, I prefer to enable single-person developers, and scientists who do not have much training in hardware and network administration. For that situation, GPGPUs are not "much better", because so long as the performance is fast enough, it doesn't need to be faster. And 100 ms is "fast enough."
> "This way you don't need to transfer a lot of data"
True. But to point out, I used the traditional approach of developing a file format which can be memory-mapped directly to my internal data structures, and a search algorithm which doesn't need to search the entire data set before loading it. These optimizations aren't synergistic with how I understand how Hadoop works.
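A rough sketch of the memory-mapping idea (not the author's actual file format): store fixed-width records so numpy can map the file and slice it without reading the whole thing. The file name and fingerprint width are invented, and a real implementation would process rows in chunks rather than all at once.

    import numpy as np

    # Hypothetical fixed-width fingerprint store: one 1024-bit fingerprint per row.
    N_BITS = 1024
    fingerprints = np.memmap('fingerprints.bin', dtype=np.uint8, mode='r')
    fingerprints = fingerprints.reshape(-1, N_BITS // 8)

    # Popcount-based similarity against one query row; only mapped pages are touched.
    query = fingerprints[0]
    popcounts = np.unpackbits(fingerprints & query, axis=1).sum(axis=1)
    print(popcounts[:10])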
> Hadoop's tragedy ... mostly driven by technologically clueless people
The author of the essay agrees with you. Here's the P.P.S.:
"I don’t intend to hate on Hadoop. I use Hadoop regularly for jobs I probably couldn’t easily handle with other tools. ... Hadoop is a fine tool, it makes certain tradeoffs to target certain specific use cases. The only point I’m pushing here is to think carefully rather than just running Hadoop on The Cloud in order to handle your 500mb of Big Data at an Enterprise Scale."
This is why I don't think you actually disagree with the author.
> Think about your case for chemical structures - is there any part of your algorithm that needs to be computed only occasionally but it's a lot of data to process?
Certainly, but there are two other important facets to that. 1) updates occur weekly, a full rebuild on a single core only takes 12 hours, and those 12 hours aren't critical, and 2) the deltas are relatively small and data from previous builds can be reused, so incremental updates should take about 30 minutes - I'll be working on that code in the next couple of weeks.
It's easier to have a cron job trigger a command-line program every week than to set up a Hadoop server.
15 years later and it's the same thing, plus a few orders of magnitude.
"Cod Normal Form? Never heard of it. I like my fish sticks made of haddock anyway."
Come to think of it, I've worked with guys who apparently learned everything they know about databases from that prof's database class, unfortunately.
Because if it all fits then you don't have to worry too much about performance overheads. If it doesn't then you need to think about how you're going to avoid table-scans and the like.
A lot of people use databases to implement persistent user state on top of stateless HTTP, even if the data itself is small enough to fit into memory.
This latter use was known even in 1997. For example, the book "Database Backed Web Sites: The Thinking Person's Guide to Web Publishing" was published on Jan. 1 of that year.
So even when that advice was offered, it was wrong.
However I would still strongly contest the analogy even if the database did fit in memory (the author may be off by magnitudes themselves, as in 1997 a 2GB database was a pretty substantial, unwieldy thing for most people) -- doing the simple steps of putting your data in a database instantly enables enormous flexibility in the use of that data at very little cost or overhead, with better to enormously better performance than the average person is going to yield with a flat file.
This situation (Hadoop for big data), in contrast, is about throwing away a lot of flexibility, and paying a large performance price, to add big-data scale out flexibility. It is, in many ways, the opposite situation.
Repeats all the usual myths about Hadoop... that it's just about MapReduce, that it doesn't support indexes (hint: Hive has a CREATE INDEX command, guess what it does?) and then adds some of its own.
After seeing this, plus an article advocating web frontend programming in C, I'm starting to think this place is going downhill fast.
Yes, there are companies that are trying to appear more attractive by using Hadoop, but there are plenty of cases where Hadoop is replacing ad-hoc file storage on multiple machines.
Its primary use is as a large scale filesystem, so if you are running up against problems storing and analyzing data on a single box, and you feel that the amount of data you have will continue to accelerate, it is a good option for file storage. It doesn't replace your database, it complements it, and there's work being done to allow large-scale databases on top of Hadoop, although the existing ones aren't mature yet. But there are a lot of institutions that are taking on problems that a single-box setup cannot handle.
And MapReduce isn't a bad programming model, but it should be thought of as the assembly language of Hadoop. If you are solving a particular problem on Hadoop, writing a DSL for it is the way to go, or see if one of the existing DSLs fits your needs (HIVE, Pig, etc).
I'm witnessing a feeding frenzy for Hadoop talent in situations where there's absolutely no need for Hadoop, and I can't recall anything like this for any other software.
Certain very popular companies that everyone wants to emulate, who have truly enormous needs in the initial data crunching (e.g., ETL) department, were running into bottlenecks related to raw I/O bandwidth. They hit on a "let the mountain come to Mohammed" insight that helped them get past that problem, so that they could do their ETL jobs in less time and ultimately keep their workhorse databases (e.g., web search indexes) better-fed.
Simultaneously, a whole lot of people who were not having the same problems, but who want to believe that they are like companies who have those problems (because who doesn't want to be Google?) started also running into problems with handling large amounts of data. Unfortunately, the most popular mistake in applying Feynman's Algorithm is to skip the first step. Rather than investigating their problems and recognizing that the issue was poor tooling or inefficient implementations and they weren't actually coming anywhere close to any true limits of the kind that the companies that came up with Big Data were trying to get around, they instead just went, "Hey, X company that we look up to is also having problems that look cosmetically similar to ours, and they use Y technology - let's give that a try!" and proceeded to dive straight into constructing bamboo control towers and coconut radios without ever looking back.
After that, well, I think it's a tragedy of mostly-rational behavior. Managers don't understand these technologies well enough to take programmers' advice skeptically, so they have to listen to their engineers. Engineers want to make their CV's look nice and impress their managers, so they've got every reason to come up with excuses to use $HOT_NEW_TOY. As usual, everyone individually acting in accordance with their rational self-interest is not the same thing as everyone collectively acting in a way that produces ideal results for the parties involved.
Off the top of my head:
Ajax (very handy but lot of times misused and misunderstood or just DHTML)
No SQL (can be handy, but for some people it is like a religion)
XML (why use CSV when you are able to use XML)
Cloud (Oh by cloud you mean the internet?)
and many more :)
Interview problems are sometimes kinda artificial, no shit. Given the impracticality of giving every candidate the kind of dataset Hadoop would be needed for, how would the OP suggest an employer test for Hadoop skills?
I don't know much about YARN, so can't really speak to that.
Hmm I should go off and get to the bottom of this!
As an example I used to write MPI code for molecular dynamics simulators. The goal was to use a large supercomputer to scale the computation up to longer trajectories. But then, with cheap linux boxes it made more sense to just run many simulations in parallel, because low latency MPI class networks cost much more than the machines. We and others made the switch to other approaches which didn't need MPI- we ran many sims in parallel and did only loosely coupled exchanges of data. In fact the coupling was so loose we would run for weeks, dumping output in a large storage system and processing the data with MapReduce.
When I look back at the people who still do the MPI simulations- they are doing dinky stuff and can't even process the data they generate because they spend all their time scaling a code using MPI that didn't need to be scaled with MPI.
Big Data to me means at a minimum 10's of T, and what big data means changes over time, so todays big data will fit the laptop of the day after tomorrow.
If you don't need a global file system, use MariaDB, CouchDB or Mongo depending on your use-case.
Hadoop on the other hand is a huge, massive pain in the ass. And I am a Hadoop consultant. I recommend that most customers NOT use it.
This is why we include it in Linux versions of Anaconda.
I would urge people doing analytics to take a look at kdb+ from kx. Unless you have ridiculously large amounts of data(> 200 TB), I can bet that you would be better off with kdb. The only downside is that it costs a lot of money which is a pity.
Mods: Don't be hypocrites. If you're going to enforce your "only use the source title" trash on us, follow it yourself.
Original title: "Don't use Hadoop - your data isn't that big"
Mod-invented title: "Don't use Hadoop when your data isn't that big"
Any suggestions for libraries or just use basic numpy/scipy and implement the algorithms?
Just not a big fan of NIH. But if in this case that's the best way - let it be :)
We have millions of XML documents in a document database. Many of the questions we want to ask about those documents can be answered through the native database querying capabilities.
But there are always questions falling outside the scope of the query capabilities, that could be answered by a simple map function applied to each document, with a reduce to combine the results.
Seems like a pain to always query for the documents you want to process, find some place to store them on disk, then run a program locally to get the result, vs. writing a map and reduce job then pointing it at the documents in the database to run against (this document store has a Hadoop integration API). Hadoop also seems to have a lot of nice job frameworks and monitoring tools and APIs to track job progress.
Anyone have similar situation where you used Hadoop just to get job management and tracking and flexibility in performing data analysis tasks? Are there easier ways to accomplish this goal?
I do not agree that the tools are inferior to SQL. Hive is really close to SQL and Pig is extremely powerful. I would take a look at a few of the recent updates to these tools before declaring them inferior to SQL.
I think you ought to consider what you mean by inferior here carefully. If you mean 'Hive QL can capture many of the common semantics of SQL' then sure.
If you mean just about anything else, you're wrong. The performance and reliability considerations of Hive/Hadoop are vastly different and very easily inferior to a mysql or postgres setup for small to mid-size datasets.
(That doesn't even get into ease-of-usage. Anyone who's ever dealt with Hive's dreaded 'error: return code: -9' can attest to how maddening Hive can be to use).
The other cost is interactivity: Hive takes a LONG time to return results compared to a SQL database even if you don't have massive data. If an analyst is working on a query interactively this is a significant impediment, particularly given the gap in ease of monitoring and performance optimization advice. Again, if you have enough data Hadoop might still be worth it but, particularly in the post-SSD big memory era, the time-to-correct result gap is massive.
But marketing was insisting we use big data because "regular" databases couldn't handle such enormous possibilities. I'll never believe that nonsense again. It wasn't a terrible ending but it was more hassle than it was worth IMHO. At least I got some resume material out of it..
As @davidmr points out, there are jobs on smaller data sets that can still benefit from the distributed nature of HPC.
That said, my own experience with startups echoes much more what OP writes - python scripts and csv processing have saved me days of headaches in resource constrained environments. I was able to quickly produce analysis and make crucial scaling decisions using data that would have taken days of engineering resources to produce. I happened to have the right tool for that job handy, and it worked out great.
You really have to think things through before restricting yourself to any specific direction.
That being said I can see problems where people pick hadoop without knowing how it's going to integrate into their systems 1-3 years down the road. Particularly with cloud computing these days, you can easily bring large complex systems online with little effort. It's cool and scary at the same time.
If you want to do map reduce, it's perfectly reasonable just to use something smaller scale on multiple cores to do some kind of data processing.
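For instance, a single-box "map reduce" over multiple cores needs nothing beyond the standard library; the log file names and the word-count workload below are just placeholders:

    from multiprocessing import Pool
    from collections import Counter
    from functools import reduce

    def count_words(path):                 # the "map" step, one file per task
        with open(path) as f:
            return Counter(f.read().split())

    if __name__ == '__main__':
        files = ['logs/part-00.txt', 'logs/part-01.txt']   # hypothetical inputs
        with Pool() as pool:
            partials = pool.map(count_words, files)
        total = reduce(lambda a, b: a + b, partials, Counter())  # the "reduce" step
        print(total.most_common(5))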
Another thing is real time processing, using something like storm (http://storm-project.net/) or even the parallelism based systems like Akka on the JVM or Go will allow you to have adequate performance. Hadoop has a lot of overhead in not only operations but also job startup.
But if yours isn't, sure, don't use a system designed for handling astronomical data. Should be common sense, probably isn't.
Don't use SQL for anything over 5 TB? Huh? You can put a lot more data than that on a node with a nice open source columnar analytic DBMS, and of course there are a lot of MPP relational analytic DBMS as well.
SQL on Hadoop requiring full table scans? Well, that's what Impala is for. Hadapt is more mature than Impala. Stinger is coming on, and is open source.
Nice marketing line, however.
Too many startups go over to Hadoop/no-sql solutions before the overhead is indeed justified.
SQL for most of the data with a bit of Redis and numpy for background processing will take you much further than most people assume.
It's fun to think you must have DynamoDB, Hadoop or a Cassandra backend, but in real life -- you better invest in more features (or analytics!)
The big disadvantage is that Hadoop is two orders of magnitude slower than relational databases. Also, Hadoop clusters are not what one would call a "green" solution. More like a terrible waste of computing resources.
We are using Hive and HiveQL and have SQL-like queries which generate the correct output. The result is that we don't have to hassle with the hadoop mappers and reducers, and we can write our "queries" in a human readable fashion.
I understand the desire not to do unnecessary work and to plan ahead. However, something like Hadoop is very heavyweight and requires a significant change to the way you write your software. It isn't something easy to undo if Hadoop turns out to be the wrong direction. Thus, if you start there you are more likely to stay there, even if "there" is not appropriate. On the other hand, starting with a much simpler system and adapting it as your needs change may add a little more work over the course of the project, but avoids a buttload of unnecessary technical debt.
Regardless of the intelligence that they were born with, as a result of all the testosterone-fuelled busyness they lack the cognitive resources to think deeply or in detail about the decisions that they are making (and, ironically, as a result they also lack the cognitive resources to realise just how intellectually compromised they are).
The upshot of all of this stress and fatigue is that we end up with decisions that are dominated by groupthink; buzzwords and the sales pitch of the latest vendor to stick his shoe in the door. Oh, and by whatever HBR & McKinsey said last week.
The point of the article is taken if:
A) Your data is not large.
B) You aren't creating large intermediary datasets with the data.
C) You aren't running an increasingly large number of analysis jobs on the data.
D) Your computational overhead is small.
E) Your memory overhead is small (this requires an asterisk, because some tasks that require extreme amounts of memory will not work well in hadoop and should be brought outside)
F) You don't need or want a system to track the increasingly large number of analysis jobs you're running.
G) You can guarantee you won't outgrow A, B, C which will force you to rewrite all your code.
G is especially difficult because it's hard to predict. F is always underestimated at the beginning of a project and bites you later. Yes you can write analysis scripts--what happens when there are a hundred of them, written by different developers? Time to write a job tracking system, with counters, retries, notification, etc. Like Hadoop.
To further D and E, there are workloads that are relatively straightforward across terabytes of data, and there are workloads that are expensive over gigabytes of data (especially those involving the creation of intermediate indices, which is where MR itself speeds things up considerably, esp if done in parallel).
Also, in a critique of Hadoop the article obsesses over MapReduce (in a way, conflating Hadoop and MapReduce, just as it conflates 'SQL' with a 'SQL database'), ignoring the increasingly powerful tools that can be used, such as Hive, Pig, Cascading, etc. Do those tools beat a SQL database in flexibility? The answer is that the question is not really relevant. If you already understand the nature of your data, and you've gone through the very difficult act of designing a normalized schema that fits what you need, then you're in a good place. If you have a chunk of data in which the potential has not yet been unlocked, or in which writes happen too quickly to justify the live indexing implied by a database, then Hadoop is an essential tool. They really sit next to one another.
None of this is to knock writing analysis scripts against local data. I do that all the time. In fact often I'll ship data from HDFS to the local system so I can write and run a script. I just think it's important at a company to make sure your people have access to good tools so there aren't hurdles in front of them, and when it comes to data analysis I've come to the opinion that you really want to have a Hadoop cluster set up next to your SQL databases and your other tooling, because it will become useful in sometimes unpredictable ways.
Yes, if there are a few hundred megabytes in front of you and you need to analyze them, then write a script -- and were I interviewing someone for a job I would not hesitate to accept a script that solves a data analysis task, so clearly the people the author was dealing with were somewhat myopic. But most companies require that an ecosystem be built to handle the increasing complexity that will ensue over the years. And Hadoop is a huge bootstrap to that ecosystem, regardless of data size.
Only problem is backing the bugger up.
When the article is "Don't use Hadoop - your data isn't that big"
Two totally different points?
And I'm sure it was correct to begin with.