The author mentions this at the end, almost as a footnote, but in my experience it's usually the motivating reason for IT to use Hadoop. It's about processing throughput, not just absolute data size. A lot of processing jobs can't use indexes (except for lookup tables), so a 4TB PostgreSQL db with a schema for OLTP workloads doesn't solve the issue.
A 4TB hard drive might be cheap at $150, but at 100MB/sec it takes 11+ hours to read. A 10-node big data cluster, however, can have each node scan its local 400GB shard, and the batch job gets done in about an hour. Even newer 10TB mechanical hard drives at 150MB/sec (with multi-disk RAID to saturate a SATA3 interface) aren't going to fundamentally alter this throughput equation. On the other hand, a 4TB SSD at 1GB/sec might be the sweet spot if the processing isn't CPU-constrained. But then again, new technology moves the goal posts, and people will invent new workloads that require a 100-node cluster with a 10TB SSD in each node to solve bigger problems.
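The scan-time arithmetic above fits in a few lines; a quick back-of-envelope sketch (the sizes and rates are the ones quoted in the comment, and the 1GB/sec SSD is hypothetical):

```python
# back-of-envelope sequential-scan times (decimal units: 1TB = 10^12 bytes)
TB = 10**12
GB = 10**9
MB = 10**6

def scan_hours(size_bytes, rate_bytes_per_sec):
    return size_bytes / rate_bytes_per_sec / 3600

single_disk = scan_hours(4 * TB, 100 * MB)  # one 4TB drive at 100MB/sec
cluster = scan_hours(400 * GB, 100 * MB)    # each node's 400GB shard, scanned in parallel
ssd = scan_hours(4 * TB, 1 * GB)            # hypothetical 4TB SSD at 1GB/sec

print(round(single_disk, 1), round(cluster, 1), round(ssd, 1))  # 11.1 1.1 1.1
```

The point being that parallelism and faster media attack the same bottleneck from different directions.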
I don't have applications needing Hadoop right now, but in cases where (for example) I had a ton of data in S3 and needed to make some passes through it then Elastic MapReduce was appropriate and inexpensive.
Also, there are good machine learning libraries available for both Hadoop and Spark, and the overhead of using them can make sense just to have headroom for larger data sets without recoding.
What are the cases where an RDBMS can't use indexes on fact tables?
WHERE street_address LIKE '%MAIN%ST%'
WHERE xml_blob LIKE '%<deprecated_feature_node_name%'
to find documents that need to be updated when you remove that feature, etc.
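The reason those predicates defeat an index: a B-tree can only seek when the pattern has a fixed prefix, so a leading % forces a full scan. A quick SQLite sketch (table and column names made up for illustration) shows the planner's choice either way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, street_address TEXT)")
con.execute("CREATE INDEX idx_street ON addresses(street_address)")

# a fixed value (or fixed prefix) lets the planner seek into the B-tree...
plan_eq = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM addresses WHERE street_address = '123 MAIN ST'"
).fetchall()

# ...but a leading wildcard gives the B-tree nothing to seek on
plan_like = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM addresses WHERE street_address LIKE '%MAIN%ST%'"
).fetchall()

print(plan_eq[0][3])    # SEARCH ... idx_street ... (index seek)
print(plan_like[0][3])  # SCAN ... (every row gets read)
```

The same logic applies to any B-tree-indexed RDBMS, which is why these queries turn into full scans on fact tables.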
"Just use Mongo" or "just use the latest pgsql and convert to JSON" or "just do XYZ" is often a less practical answer than "just query/analyze the database you already have".
Also worth a read, "Command-line tools can be 235x faster than your Hadoop cluster (2014)": http://aadrake.com/command-line-tools-can-be-235x-faster-tha... | https://news.ycombinator.com/item?id=8908462
Let's pretend, though, that we woke up this morning and had a 600MB dataset.
We'd still have hundreds of jobs running on that 600MB dataset. Most of them would ideally run as often as possible; we'd only schedule them periodically because of performance. Some of them would explode the 600MB dataset into several gigs. Most of them would copy the dataset several times over. Some of them would be memory intensive. Some would be CPU intensive. Some would be disk intensive. Many of those jobs emit counters, logs, and metrics, which we want to track over time. We'd want to track the overall resource utilization of those jobs. We'd want particular jobs to be killed automatically if they use too many resources. If the jobs are in danger of taking up too many resources in the system, we'd want them to be automatically queued and scheduled piecemeal. We'd want the jobs to be written in a variety of frameworks targeted to the particular task at hand.
Given those hundreds of jobs competing for scarce resources, we'd want to not have to completely rewrite the system if the 600MB dataset became a 1GB dataset, or a 1TB dataset.
Sure, we could put together a multi-server ssh/bash job-scheduling system, but in the end that's kind of what Hadoop is, except it has already addressed a lot of problems that you don't even know you're going to have up front.
If you're doing enough processing that you can't do it on one beefy database server then sure, the concerns that go with that are similar to when you have enough data that you can't handle it on one beefy database server. But most people with 600mb of data don't want or need to do multiple cpu-hours worth of processing on it.
(1) Company xyz is having problems generating reports that pull from an SQL database. They take too long; IT says it's because the data is too large and their hardware is old. Obviously this is big data! Wrong: rewriting the SQL queries and indexing some fields solved the problem. Like, beginner-level stuff. Nobody had even looked at the code or the structure of the database when initially trying to solve the problem.
(2) Company xyz can't load a dataset into memory to perform some computations on it; it's too large. Obviously it needs better hardware or a big data solution. Millions of rows and many columns, OK. But only 6 of those columns were even used by the processing job, and only a fraction of the rows were relevant. There was no need to fit the entire dataset into memory in the first place; problem solved.
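That second fix is just streaming: read row by row, keep only the columns the job uses, filter as you go. A minimal sketch with the stdlib csv module (the column names and the in-memory sample file are made up for illustration):

```python
import csv
import io

# stand-in for a huge CSV on disk: many columns, most irrelevant to the job
raw = io.StringIO(
    "id,region,price,color,weight,notes\n"
    "1,east,10.0,red,2.0,a\n"
    "2,west,20.0,blue,3.0,b\n"
    "3,east,5.0,red,1.0,c\n"
)

# stream one row at a time instead of loading the whole file,
# touching only the column and rows the computation actually needs
total = 0.0
for row in csv.DictReader(raw):
    if row["region"] == "east":      # only a fraction of rows are relevant
        total += float(row["price"]) # only one column is needed

print(total)  # 15.0
```

Peak memory is one row, regardless of file size.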
I could go on. Lots of companies just lack the skills and knowledge to do things right, and have no idea what's going on under the hood or how things work. So naturally, they are drawn to the marketing buzz surrounding big data, cloud computing, data warehousing, machine learning, etc etc etc.
My only hope is that the data itself is the thing that's not going away, and that this forces us to think about it in a logical way (eventually devs/web devs will start doing data manipulation etc. the right way; I guess you're pretty familiar with the approach I have in mind).
Meanwhile, a small minority of people were like "Guys? GUYS! You can just use flat files and in-memory data structures for that. Why bother with a database when 3 hashtables and a list will do the same thing several orders of magnitude faster." They had outsized accomplishments relative to their shrillness, though: Yahoo, Google, PlentyOfFish, Mailinator, ViaWeb, Hacker News were all built primarily with in-memory data structures and flat files.
These things run in cycles. The meta-lesson is to let your problem dictate your technology choice instead of having your technology choice dictate your problem, and build for the problem you have now rather than worry about the problems that other people have. For many apps, a hashtable is still the right solution to get a v0.1 off the ground, and it will scale to hundreds of thousands of users.
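For flavor, the "in-memory structures + flat files" v0.1 can be as small as this (a hypothetical sketch, not how any of the sites above actually persist data): the dict is the whole datastore, and durability is a periodic atomic snapshot.

```python
import json
import os
import tempfile

users = {}  # user_id -> profile dict; every read and write hits RAM

def save(store, path):
    # write-then-rename so a crash mid-write never corrupts the snapshot
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(store, f)
    os.replace(tmp, path)

def load(path):
    with open(path) as f:
        return json.load(f)

users["alice"] = {"karma": 42}
path = os.path.join(tempfile.gettempdir(), "users.json")
save(users, path)
print(load(path))  # {'alice': {'karma': 42}}
```

Snapshotting on a timer (or on shutdown) is exactly the "occasional data loss" trade-off the next comment complains about.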
Hashtable or flat files, huh? Hard to believe this is going to be useful for a non-trivial app as a 'data storage engine' (so to speak).
You will have to implement everything yourself:
- concurrency - several users updating the same entry
- data consistency - returning the data as of the start of the user's 'select'; what about others who updated the entry but haven't committed yet?
- security - fine-grained access to specific data entries (what can be extracted by whom), plus auditing (who did what, at what time, etc.)
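Just the first bullet is already non-trivial: two "users" bumping the same entry will lose updates unless you serialize them yourself. A toy sketch of the DIY version (an RDBMS does this bookkeeping for you):

```python
import threading

# shared "table" with one entry; the lock is our hand-rolled concurrency control
store = {"counter": 0}
lock = threading.Lock()

def bump(n):
    for _ in range(n):
        with lock:  # without this, read-modify-write races can drop updates
            store["counter"] += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store["counter"])  # 40000 with the lock; often less without it
```

And this is only in-process locking; multi-process access, crash recovery, and MVCC-style consistency are each another layer you'd be writing yourself.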
More code most of the time means more bugs.
Not everyone is Google, Amazon, or Facebook, with the resources to create good custom solutions and then support and improve them.
For most cases, a simple RDBMS will provide you with enough 'standard' functionality (think security, concurrency, consistency, etc.) that you don't have to reinvent the wheel and code these yourself from scratch.
But in the end, it's again the nature of the problem (whether you'll need security, concurrency, or consistency at all). It's nice to know what the database as a tool gives you, and I don't think that's the case nowadays - too many Java/.NET (and other) developers have no idea what they can get out of a database. And I really hate it when someone spends several weeks writing something that could have been done in a few hours using a specific database feature.
Sure, you won't be able to run on more than one core, and this won't work if your data exceeds RAM. And I wouldn't do this for anything mission-critical; it's better for the types of free or cheap services where users will tolerate occasional downtime or data loss. But if you never hit the disk and never need to make external connections to databases or the network, you can easily do upwards of 10K req/sec on a single core these days, even in a scripting language like Python. Assume you've got 10,000 simultaneous actives (more than e.g. r/thebutton and most websites, and usually translating to around ~1M registered users); that's one req/user/sec, which is pretty generous engagement.
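The back-of-envelope behind those numbers (the ~1% actives-to-registered ratio is the comment's rule of thumb, not a measured figure):

```python
reqs_per_sec = 10_000                          # claimed single-core ceiling
simultaneous_actives = 10_000
registered_users = simultaneous_actives * 100  # assume ~1% of users online at once

per_user_rate = reqs_per_sec / simultaneous_actives
print(per_user_rate, registered_users)  # 1.0 1000000
```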
Hacker News is built with an architecture like this. There have been a couple of massive failures related to it (like when PG caused a crash loop by live-editing the site's code and corrupting the saved data in the process), but by and large it seems to work pretty well.
It really does achieve 10K simple requests/transactions per second on a single core (or a $5/month VPS). The software support for it could be better, though. I should really release some code that helps with things like executing a function in another process, or verifying that some code executes atomically.
Unfortunately, Spark doesn't scale well on large datasets (10TB+). Sure, it's possible (and has been done), but right now there are too many rough edges for it to be a better choice than Scalding/Cascading for data processing at scale. Most of this boils down to fine-tuning certain Spark parameters, which is a pain when you're dealing with long-running, resource-intensive workflows.
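For a sense of what "fine-tuning certain Spark parameters" means in practice, these are the kinds of knobs involved (real Spark property names; the values are illustrative, not recommendations):

```
# spark-defaults.conf
spark.executor.memory         8g
spark.executor.cores          4
spark.sql.shuffle.partitions  2000
spark.memory.fraction         0.6
```

Getting these wrong on a long-running 10TB+ workflow typically means shuffle spills or executor OOMs hours into the job, which is exactly the pain being described.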
Personally, I'm finding myself increasingly burned out on "frameworks" that appeal to out-of-touch corporate decision-makers (who are attracted to the idea of a packaged "product" that solves everything) but tend to be overfeatured while under-delivering (especially if time to learn how to use the framework properly is included) when it counts. I'm much more interested in the modularity that you see in, say, the Haskell community where "framework" means something that would be a microframework anywhere else.