Times change, but in the sense of adding additional systems with new requirements.
The existing systems have not gone away, and are not going anywhere. In fact there are more of them around than ever. I certainly still want my bank to run a system that has ACID transactions. Likewise my mobile phone's billing system, utilities bill, etc.
Even if you look past all that, RAC and ASE will run into limits because they're still relational, which is not a truly scalable approach to data modeling. Oracle RAC, for instance, will only scale to 100 nodes. That's a lot of dough for a 100-node limit.
So there it is. SQL doesn't scale (effectively or efficiently).
I think you just re-iterated the author's reason for writing perfectly.
Instead of saying 'SQL' (the query language) doesn't scale, perhaps you can say full ACID doesn't scale cheaply.
We spent a bunch of time looking at large scale distribution at Berkeley for the Mariposa project. The short version: if you want to break ACID then you have to take the semantics of the data into consideration. Transactional financial data needs full ACID, many other types of data have looser needs (status updates, etc) and those looser needs can be taken advantage of.
As it turns out, even common DB's like Oracle, Postgres and MySQL can be used differently to get much cheaper scale for certain types of data. See the FriendFeed article on MySQL as a docdb (also the Friendly ruby mapping on github).
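For anyone who hasn't read the FriendFeed article, the core pattern is simple enough to sketch. Here's a rough, hypothetical version of the idea using Python's stdlib sqlite3 as a stand-in for MySQL: the relational engine stores one opaque JSON blob per row, keyed by id, and the schema never changes when the documents do.

```python
import json
import sqlite3

# Hypothetical sketch of the FriendFeed-style pattern: use a relational
# engine as a dumb document store by keeping one JSON blob per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")

def put(entity_id, doc):
    # Upsert the whole document as an opaque blob; no per-field columns.
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(doc)),
    )

def get(entity_id):
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

put("u1", {"name": "alice", "status": "hello"})
print(get("u1"))
```

Adding a field to a document here needs no ALTER TABLE, which is the "cheaper scale for certain types of data" trade-off: you give up in-database querying of individual fields in exchange for schema flexibility.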
To paraphrase the author:
"ACID scales just fine if you need it (and if you need it you most likely have budget for it). If you don't need full ACID, there are plenty of other cost effective solutions you can run on low bandwidth disks/memory limited CPUs. Mix and match to fit your needs.
Please, enough with the religion."
It seems like every NEXT BIG THING comes in two parts.
1) The provocation: "Throw away design and do everything with testing"; "Never use SQL again, just key-values"; "Objects are worthless!"
2) The sensible fall-back position you can use when challenged: "Tests are useful"; "You might not need full ACID"; "We can benefit from shorter inheritance trees"...
I suppose every new idea needs a way to make a splash...
Thesis (established view): SQL/ACID solves all your data problems
Antithesis: Nobody needs SQL/ACID, let's throw it all out
Synthesis: By carefully considering data integrity constraints we can find a more optimal data management solution for a particular problem (alternatively: let's go to the pub!)
Bones heal faster than internet battles :)
We built most of our data center with cheap hardware. It is quite possible that we splurged on a load balancer and some other funky networking gadgets; I don't know. We did, however, build quite a lot of custom software to abstract away the fact that we have split our data among dozens of machines. This is essentially the work of two people (one hardware, one software). Finding the people who can build such a system is difficult, to put it mildly, but once you find them, you don't need deep pockets.
Reddit is another example of a site that runs on SQL. Part of the problem with the "SQL won't scale" argument is that essentially all of the live queries function like key-value lookups (or lookups against a very small number of rows). If you're tempted to do something insane like service a request with a join, then you're doing it wrong and you need to cache.
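The "cache instead of join" advice is the classic cache-aside pattern. A minimal sketch, with a plain dict standing in for memcached and another dict standing in for the database (both hypothetical stand-ins, obviously):

```python
# Cache-aside sketch: serve hot lookups from an in-process dict
# (standing in for memcached) and hit the database only on a miss.
cache = {}

def fetch_user(user_id, db):
    if user_id in cache:          # cache hit: no DB round trip at all
        return cache[user_id]
    row = db.get(user_id)         # cache miss: one key-style lookup
    cache[user_id] = row          # populate so the next request is a hit
    return row

db = {"42": {"name": "alice"}}
fetch_user("42", db)              # miss, goes to db
fetch_user("42", db)              # hit, served from cache
```

The point of the comment above is that once requests look like this, the storage engine behind `db.get` barely matters.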
Lookups by index will have roughly the same performance characteristics no matter what. It doesn't matter if you're using a Python dict, an index in MySQL, or a key-value store: a hash lookup is O(1) and a B-tree probe is O(log n), and in practice both are effectively constant time.
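To make the comparison concrete, here's a toy sketch of the two lookup shapes: a hash index (a dict) and a sorted-key probe via `bisect`, which is roughly how a B-tree leaf search behaves. The data and names are invented for illustration.

```python
import bisect

# Two index shapes over the same keys: hash (dict) vs sorted probe (bisect).
keys = list(range(100_000))
hash_index = {k: f"row-{k}" for k in keys}   # ~O(1) per lookup
sorted_keys = keys                            # already sorted, ~O(log n) probe

def btree_like_lookup(k):
    i = bisect.bisect_left(sorted_keys, k)
    if i < len(sorted_keys) and sorted_keys[i] == k:
        return f"row-{sorted_keys[i]}"
    return None

print(hash_index[12345])          # hash path
print(btree_like_lookup(12345))   # sorted-index path
```

Either way, a single-key lookup touches a tiny, bounded amount of data, which is why the backing store is interchangeable for this access pattern.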
You might just want to say "cheaply".
(Didn't we go through this with the XML hype? Wasn't it the "simple" golden nail? Have you checked out the "Simple" landscape of SOAP and W[TF]S-*?)
We're not living in the past; it's more like Back to the Future. (Just wait for it: "No-NoSQL"...)
I mean, the fact is that there are a lot of servers out there even today that are running SQL databases simply as dumb data stores. It's not the most efficient coding, but it happens, at least partially because MySQL is available on every shared host and their brother.
I have believed for a very long time that you cannot be a good OO developer unless you really understand data modelling (which includes relational data modelling). There are often good reasons to break the "rules" of data normalisation (which is pretty much what NoSQL data stores do) but you shouldn't do so unless you know why you're doing this.
IMO, most data is more interrelated than people think that it is, and at least having a good model of how the data is related will make your entire design better.
XML is simple. The problem was (and is) misapplication in areas it doesn't belong.
I'm sure that will happen with no-sql. No biggie.
Instead, kernel-based virtualization such as BSD Jails, Linux-VServer or OpenVZ is the right approach. This type of virtualization comes without any performance penalties! And by using hard links, you can avoid wasting disk space, too.
The benefits are better hardware abstraction, i.e. easier movement to other machines, improved backup possibilities, etc. In fact, the BSD manual has recommended the use of jails for years, if not decades.
I'd say that this kind of virtualization is still underused, while inappropriate kinds of virtualization start to become overused. Maybe this is due to "successful" marketing of VMware. They promote the right techniques, but the wrong technology.
Because you can host the VMWare instance on a SAN and not lose it when the machine dies, which allows cheaper machines.
Complete machine failover is easier.
Sadly, pointy-haired bosses have started putting developers on VMs, which is exactly the opposite use case: all people loading their machines at roughly the same time (e.g. I've heard people claim that in Java a third of the time is spent compiling; if you have more than 3 devs...). A fail.
I know this sounds parochial, and if you're a close-to-no-money start-up then it might as well be Greek, but there is a reason for the business tier in a 3-tier architecture: it translates transactional data into your data structures... Done well, it can scale like a mutha.
I do sort of appreciate the approach of NoSQL though. Map/Reduce is a concept missing from RDBMS and it seems like it could be incredibly valuable for certain problems. There are probably other concepts that could be applied as well. At some point I see the two camps sort of merging back together. ACID is just too important.
If your tool requires people who don't exist, it might be otherwise great but you've got problems.
I'm not even a NoSqlista by any means; it just seems ironic that you are making a good case for NoSQL - you just need programmers, not DBAs.
Edit: Or, now the programmer is the DBA, and he still requires the skillset of a DBA.
There are 253 million Internet users in China alone. What happens when your site needs to go from 0.001% of them using it simultaneously to 1%? Within a month?
"Of course if you index poorly or create some horrendous joins"
Which in the Twitter and Facebook cases is exactly what they have to do on many of their requests. As I've personally found out, relying on a database to do a join across a social network graph is a recipe for disaster. One day you'll be woken up because your database's query planner decided to switch from using a hash join to doing a full table scan against 10s of millions of users on every request. Then, you'll be left either trying to tweak the query plan back to working order, or actually doing what you should've done in the first place: architect around message queues and custom daemons more suited to your query load.
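The "architect around message queues and custom daemons" approach usually means fan-out-on-write: instead of joining the follower graph against the message table on every read, you push each new message into precomputed per-follower timelines when it's written. A toy sketch (all the names and the 800-entry cap are made up for illustration):

```python
from collections import defaultdict, deque

# Fan-out-on-write sketch: reads become a single timeline fetch,
# and the social-graph "join" happens once, at write time.
followers = {"alice": ["bob", "carol"]}             # who follows whom
timelines = defaultdict(lambda: deque(maxlen=800))  # bounded per-user inbox

def post(author, message):
    # Write amplification: one post becomes N timeline inserts.
    for reader in followers.get(author, []):
        timelines[reader].appendleft((author, message))

post("alice", "no joins were harmed")
print(timelines["bob"][0])
```

The trade is exactly the one described above: writes get more expensive and the "join" logic moves into your own daemons, but reads stop depending on a query planner's mood.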
"Even with billions upon billions of help tickets."
At 50 million tweets a day, Twitter would hit 18 billion tweets within a year. Good luck architecting a database system to handle that kind of load. That is, one in which the database system is serving all of the requests (including Twitter streams) and isn't just being used for data warehousing.
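The arithmetic behind that figure is just:

```python
# Back-of-envelope check of the "18 billion within a year" claim.
tweets_per_day = 50_000_000
per_year = tweets_per_day * 365
assert per_year == 18_250_000_000   # ~18 billion rows in year one
```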
"Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need"
The disconnect here is this guy's needs are not the needs of a lot of us who are actively looking at alternatives. He is simply not familiar with the problem domain.
So, bringing up Twitter or Facebook really doesn't make a good case against RDBMSes as a good tool in the toolbox -- they've got a very unique set of needs that don't apply to a lot of the rest of the world. So, of course, SQL isn't the best solution when you're dealing with trillions of rows of data and don't really want to spend hundreds of millions of dollars on the infrastructure required to guarantee that you never go down and never lose a tweet.
And keep in mind, RDBMSs helped them get to the point where they could enjoy these problems; Twitter probably wouldn't exist in all its current glory if they had spent a year building it to be 'scalable' before launching.
I think the reason that a lot of people end up hating RDBMS and SQL is because of one-or-more of (a) their only experience is with MySQL, which really isn't that awesome; (b) they've been burned by bad schema design; or, (c) they don't really get relational algebra or set theory.
For an example of 'bad schema design', I once worked at a company that had indices on nearly every column of their DB, even though almost none of these ever got queried. There was one database table with five indices on three columns, and of course this was the table that logged every single HTTP request processed by the front end. Including API calls. Did I mention that this table was never queried by any part of the application?
It was a poor design decision, and sure enough, it completely torpedoed performance. But the problem wasn't the RDBMS, because it did exactly what it was told to do, no matter how asinine.
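The write cost of superfluous indices is easy to demonstrate. A rough sketch using Python's stdlib sqlite3 (the table shapes and row counts are invented; exact timings will vary by machine, the point is only the relative difference): every INSERT into the over-indexed table must also update all five indices.

```python
import sqlite3
import time

# Compare bulk-insert cost into an unindexed log table vs one with
# five indices on three columns, like the table described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain (a INT, b TEXT, c TEXT)")
conn.execute("CREATE TABLE indexed (a INT, b TEXT, c TEXT)")
for i in range(5):
    conn.execute(f"CREATE INDEX idx{i} ON indexed (a, b, c)")

rows = [(i, f"/path/{i}", "GET") for i in range(20_000)]
for table in ("plain", "indexed"):
    t0 = time.perf_counter()
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    print(table, round(time.perf_counter() - t0, 3), "s")
```

A write-heavy, never-queried table is the worst possible home for indices: all maintenance cost, zero lookup benefit.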
So, in short, RDBMS aren't the solution to all problems, but they do solve a lot of problems adequately. NoSQL databases also serve an important role in the toolbox, but are much more narrowly-focused.
However, the original author's point basically boiled down to: if you define scalability as the problems you can scale an RDBMS to solve, RDBMS systems are scalable. I'm not big on arguing the finer points of someone's tautology.
The particulars of a situation determine the scalability of a solution. For a lot of us working at web scale or on interesting new problems, an RDBMS won't scale. Sometimes it won't scale within the constraints we have, but sometimes it won't scale because we won't be able to build the system we're trying to build. His example of a company-internal billing system really only served to highlight the disconnect between the crowd following along well-trod ground, and the people out front doing innovative work.
As for the "innovative work" versus "well-trod ground," there are still businesses who need better, more innovative solutions to well-trod problems, and I, for one, am not willing to ignore money on the table. The problem I'm working on works well with a combination RDBMS/key-value system for different pieces of the puzzle.
I'm curious: let's say you have a Twitter-scale app that must satisfy consistency and reliability as primary requirements. Is there really a NoSQL solution that can take you there without (effectively) raising the costs to the point a scalable (money-no-object) SQL solution would reach? (Kinda like how the difficult-to-extract North Sea oil became economically viable once oil prices broke through a certain ceiling?)
With the advent of cheap, reliable, available commodity hardware and network access, previously difficult data problems are solvable. SQL doesn't make sense for all of those problems.
Also, the definition of scalability seems fairly clear: how many times can you square your capacity before you need to re-architect?
Say that his internal help ticket tracking system was built for IBM, one of the largest corporations out there with 300,000 employees. 300k users is tiny for a consumer app. We had more than that when I was volunteering for a Harry Potter fanfiction website. Even if he was working at the largest company on earth by employee count (Wal-Mart), he'd still have fewer users than we had for Harry Potter fanfiction. And usually employees don't submit more than one or two help tickets a day, while Harry Potter fans tend to view a forum thread every minute or so.
It really hits home how consumer data processing has changed the game for data management. When I was working in the financial industry, we dealt with about 50GB of data/day coming off the exchanges. I thought that was a lot. But at Google, there's terabytes per day - at least two orders of magnitude more - and the total volume of financial transactions is basically rounding error on the data we handle.
It makes sense that with this exponential explosion in data, we'd need different techniques to handle it. Quite likely, RDBMSes do scale for the scale he's talking about. But a bunch of industries have opened up within the last ten years that require several orders of magnitude more data, and it's naive to think that just because it works for a corporate help desk or POS system, it'll work for a system that logs every page view and every action for millions of users.
Active users, not the total number of users in the user table. People grossly overestimate the scale of most web properties, where the number of active users is far lower than you likely imagine.
>But a bunch of industries have opened up within the last ten years that require several orders of magnitude more data, and it's naive to think that just because it works for a corporate help desk or POS system, it'll work for a system that logs every page view and every action for millions of users.
Strangely it says nothing of the sort. Yet here, again, you've used Google as the example. How many Googles are there? How much does that apply to about 99.999% of people who deal with databases?
Yet it always appears as the example.
There's tens of thousands of alternative search engines out there, see this for a sample from '07:
Most of them fail, but there's a lot of people trying to solve Big Data problems on a shoestring budget. That's part of the disconnect: us NoSQL folks are excited by having cheap solutions to problems we wouldn't be able to afford to tackle otherwise, whilst SQL folks are shaking their heads at the mess we'll have to clean up if our prototypes do become successful.
I truly dislike the NoSQL stance of "never SQL;" it has its place, and its place isn't necessarily at the twitters or the facebooks of today. SQL scales very well with datasets that make sense for RDBMS. CRUD style applications. Core business apps. Data that doesn't necessarily need to be mined furiously. Trying to shoehorn, say, a high-volume message system or intensely self-referential (graph) dataset into an RDBMS is a recipe for disaster, however. Many of the performance issues people see with RDBMS seem to stem from this, I believe.
If your app is hugely real-time and data-driven, and the datamining (if necessary) can be offloaded to cron jobs, a K/V store is great. And it can scale very quickly and relatively cheaply.
If you're doing something hugely relational that, if loaded into a SQL server would require an immense number of self-joins (I'm looking at you, graph analysis) that must be done in real-time and not offloaded to a cron job, a graph database is probably the way to go. They're harder to scale to a huge amount of data, but certain datamining tasks are made much easier - and don't require distributed map-reduce execution. Scaling will get much easier once some systems come out that use k-means (or similar) to cluster and shard the data. That kind of smart scaling would be nigh impossible on either a KV store or a traditional RDBMS. Google gets away with it with BigTable because they can throw so much cheap iron at it - a truly brute force solution. The same solution that needs to be taken with RDBMS when you shoehorn datasets into it that don't make sense.
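The self-join problem is worth making concrete. In a graph store, a "friends of friends" query is just a bounded breadth-first traversal over adjacency lists; the SQL equivalent is one self-join per hop. A toy sketch (graph and names invented for illustration):

```python
from collections import deque

# Adjacency-list traversal: each extra hop is one more BFS level,
# where SQL would need one more self-join.
graph = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}

def within_hops(start, hops):
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue                      # don't expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(within_hops("a", 2))
```

The traversal's cost follows the neighborhood you actually touch, not the size of an intermediate join product, which is the whole appeal for graph-shaped workloads.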
Emil Eifrem (Neo4J) said it best in his presentation ( http://nosql.mypopescu.com/post/342947902/presentation-graph... ): NoSQL doesn't mean Never SQL, it just means Not Only SQL.
You generally have to partition your data horizontally and thus give up many of the features that SQL has to offer: ACID transactions, unique keys, auto-increment primary keys, etc.
Then you have to come up with your own solutions to replace those features: eventual consistency, UUID keys, map/reduce, etc. And these happen to be exactly the kind of features that many NoSQL databases can give you out-of-the-box.
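Two of those replacements are small enough to sketch. Assuming a fixed shard count (a simplification; real systems use consistent hashing or lookup tables to allow resharding), UUIDs replace auto-increment keys because they need no central sequence, and a hash of the key routes each row to its shard:

```python
import hashlib
import uuid

# Hypothetical sketch: UUID keys + hash-based shard routing.
N_SHARDS = 8

def new_key():
    # Globally unique without coordinating with any other node.
    return uuid.uuid4().hex

def shard_for(key):
    # Stable hash of the key -> shard index. md5 here only for a
    # deterministic, well-mixed digest, not for security.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SHARDS

k = new_key()
print(k, "->", "shard", shard_for(k))
```

This is exactly the plumbing a sharded-MySQL shop ends up writing by hand, and that many NoSQL stores ship with.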
There are plenty of databases that will partition data without giving up any SQL features, but they cost money.
They also either rely on a single huge SAN for storage (single point of failure + expensive as hell) like Oracle RAC, or they require specialized gear like infiniband to reduce intra-node latency like Exadata (starting price: seven figures) or they're analytics databases that are designed for huge queries with latencies to match like Vertica, ParAccel, etc. (Think minutes between data being loaded and being available to query.)
I'll take NoSQL, thanks.
As for the high-latency analytics databases (Vertica, Greenplum et al), I don't see much market for them either. Their big advantage over Hadoop was claimed to be the ability of non-programmer analysts to use them (via SQL), but Hive (which now even has JDBC drivers for it, allowing it to work with existing OLAP tools) solves that problem as well.
It's a classic design hack -- redefining your goal as an easier problem.
Name one please. It seems you are either fundamentally mistaken about what CAP implies or are constraining the "solution" to a clustered system that is effectively a single RDBMS hiding behind lots of tightly-coupled components.
I'd love to see PostgreSQL get more attention, as I feel they have scaling up and scaling out handled fairly well, whereas MySQL/InnoDB has a hard time even scaling up (which is why the Drizzle project even exists).
When you really dig deep down into each and every article on this subject, whether for or against NOSQL, the most important (and yet unstated) fact is this:
It isn't that RDBMS systems scale or don't scale, or that NOSQL systems scale or don't scale, it's that any solution which prioritizes (just for example) consistency and availability is not going to scale effectively for a problem that instead prioritizes availability and partition tolerance.
I am willing to bet that any time a problem and a given solution don't align on the two attributes they've respectively prioritized from CAP, there will be a claim that the solution "doesn't scale". The reality is just that the solution wasn't applicable to the problem at hand. If instead one evaluates solutions which match the problem's CAP priorities, the solutions will scale effectively (modulo their individual pros and cons relative to the other options within the evaluated CAP-priority-matching set of possible solutions, of course.)
Brilliant. I've forwarded this to my team.
We make a tax solution and I've been dealing with vague "we should use NoSQL" comments from a few of the less capable members of the team.
If his team members read all the way to the comments it's going to be very awkward tomorrow at work.
Care to buy some classified material?
It doesn't support transactions and ACID most of the time. There's no "we pay $xxxM for the support and blame you for everything" company behind NoSQL products. And it's not the same workload as you'd expect from a kv/document store used as a webpage backend.
One of the few serious "nosql" databases for enterprises like that is Berkeley DB - it's got what they need. I'm not sure why he wrote that blog post... it just stated the obvious, but in the form of a rant.
The funny thing though is - Berkeley DB is exactly what NoSQL is about... and it is used for local reliable storage in many big enterprises. Replication, logging, transactions, etc. - and it's just a kv-store really.
Sqlite has columns and can search based on them. It can also save you a couple of lines on manual joins. Does it provide something more?
(although tbh, I'd take Tokyo Tyrant over both of them - has columns, writer lock + server-side scripts instead of transactions, same model of replication as BDB)
Ad-hoc SQL queries on the sqlite database are also a huge win. Much better than defining your own data structures and then tools to read/write them.
I don't know what you mean by "proper" transactions. SQLite has transactions for DML.
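For what it's worth, SQLite's DML transactions are easy to demonstrate with Python's stdlib sqlite3 binding, which uses the connection as a transaction context manager (the table and error here are invented for the demo):

```python
import sqlite3

# SQLite does have real DML transactions: an exception inside the
# `with conn:` block rolls back everything the block wrote.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INT)")
try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("INSERT INTO t VALUES (1)")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # insert undone
```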
It is possible to honestly disagree on a technical issue, you know. And it is good professional and personal practice to only accuse someone of lying when you are quite sure that they are deliberately telling you something they know to be false...
I've been dealing with vague "we should use NoSQL" comments from a few of the less capable members of the team.
This appears to be a common attitude of development managers in the corporate world: that anyone who starts suggesting anything vaguely "newfangled" is surely a naïve novice... rather than being good at picking up and investigating new technologies.
The only valid point made seems to be that vertical scaling of a RDBMS can be a multi-million dollar exercise.
> If you lose a Status Update, or several thousand of them, it will likely go unnoticed.
What? If Facebook lost half of their photos, or if Twitter lost a few thousand tweets, there'd be riots in the streets. Okay, maybe not quite that much unrest, but still.
A SQL database may be difficult to scale, but it is something that can be largely encapsulated and outsourced. If Amazon RDS, or some other product, handles the hardware and software configuration, then the developers can just focus on the application portion of it.
This isn't to say that scalability is guaranteed, it's still important to optimize queries and the data structure. Also, there are problems where NoSQL is simply the better and/or cheaper option.
But if these services can encapsulate a lot of the difficult part of scaling SQL, it still makes SQL a very attractive and powerful option for most(?) problems.
My point is, I think there is an opportunity providing enterprise-level SQL scalability on a per-use basis. It won't replace NoSQL systems, because there are some problems where they're clearly better, but it could be done and provide relatively-affordable, scalable SQL access to startups.
Azure might come close, but I don't know what performance it can provide.
It won't. It's just hosted MySQL.
Even if they are that rare startup that has heavyweights on both sides of the data/process divide, the second issue is that intelligent schema design (IMO) does not fit the agile model quite that well.
It's really more of an economic issue than a technological one: it is quite possible that a poorly designed and implemented SQL solution (see above) cannot scale. Obviously the problem is not SQL or RDBMSs. But when you have a very smart, but young and not yet seasoned, tech team who hasn't seen it all, do you think they will blame themselves, or lament that "SQL can't scale"?
Ergo, some not-by-choice SQL users only scratch the surface and use common features: INSERT, UPDATE, DELETE, SELECT, also known as CRUD.
Ergo, it becomes easy (although wrong) to conclude that SQL isn't doing very complicated work behind the scenes and that it's just another overcomplicated dinosaur POS like the Windows operating system, one that remains popular only because it's a standard.
Thus, a lot of undue hate gets directed at SQL with little attention paid to the subtleties of what it does extremely well and where it behaves poorly.
I think "SQL isn't scalable" is in the same league as "Java isn't concurrent". It can be, if you have learned the necessary skills and are willing to deal with a bit of pain. Is Clojure astronomically better for concurrency? Sure, but people can and do scale with SQL databases, and they do this in large part because there are problems for which SQL is the appropriate solution.
However, it also points out much lower-end hardware solutions that cost under $10k but perform much better than the largest EC2 instance, for I/O.
ETA: This is why I tend to roll my eyes at the notion of "commodity" hardware. The article's low-end array is 400MB/s, but rolling one's own can yield over twice that for the same or lower price tag. All this well before reaching the unscalable cliff of enterprise pricing.
The main difference is that, even in startup companies, you have a supported solution and can call someone at 2 am when your storage dies and expect to find replacement parts and support. Good luck trying to drive to Fry's and buy replacement hard drives for a server that was built 2 years ago by someone that no longer works there.
This may not be the case at a startup outside of a technology hub and is almost certainly not the case where nobody in the company is competent with hardware.
Otherwise, a failure isn't just easy to correct but cheap, too. It's so cheap that keeping cold spares around is a no-brainer, unlike with the "gold-plated" products.
The other danger is that the "engineer" who comes out to hand-deliver and install the replacement part gets it wrong. If he pulls the working, rather than failed, part, there's not much consequence for him personally or the vendor, unlike with startup founders or even staff.
In fact, you can even engineer it with enough hot spares to ride out its useful life. 4% AFR for your 88 disks? Just add 8 hot spares. Worried about cables or the controller card? Double up. Both together raise a $9k, 1 gigaBYTE/second (4GB/s peak) array to a whopping $10k.
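The spare sizing above is easy to sanity-check. Modeling disk failures as Poisson is an assumption (so is picking a horizon for "useful life"), but it gives a quick read on how far 8 spares stretch against a 4% AFR on 88 disks:

```python
import math

# Back-of-envelope spare sizing: 88 disks at 4% annualized failure rate.
disks, afr, spares = 88, 0.04, 8
lam_year = disks * afr   # expected failures per year (~3.5)

def p_at_most(k, lam):
    # Poisson CDF: probability of seeing at most k failures.
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

print(f"expected failures/year: {lam_year:.2f}")
print(f"P(<= {spares} failures in 1 year):  {p_at_most(spares, lam_year):.3f}")
print(f"P(<= {spares} failures in 3 years): {p_at_most(spares, 3 * lam_year):.3f}")
```

One year is comfortably covered; over a full multi-year life you'd want to top the spares back up as they get consumed, which the cheap-cold-spares point below makes trivial.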
Even just doubling everything is likely to cost far less than a year of tarnished-bronze support from a big vendor.
 By which I mean assembling discrete consumer parts, not soldering or anything lower level.
 Personally, I prefer to refer to "enterprise" targeted pricing as "hookers and blow," but it is, admittedly, without a catchy adjective.
 Yes, I've had this happen. I've also watched a colleague pull the wrong drive out of an array, against his better judgment, at the insistence of the vendor's phone support.
What's that? You can't buy that exact drive so now your homemade RAID 5 is running in degraded mode and you hope it will stay up long enough to copy your data off onto another system? Sucks to be you, you tried to save a few bucks and got burned.
In the enterprise, we pay big bucks because we want to KNOW that we can call an 800 number and get an exact replacement hard drive, even if they stopped selling them 3 years ago.
With arrays I build, I don't have onerous constraints like requiring identical size. Moreover, if I'm not already retiring disks at the 3-year mark, I'm very much remiss in my duties.
*Sucks to be you, you tried to save a few bucks and got burned.*
We're not talking about a few bucks. We're talking hundreds of thousands of dollars. That's enough to pay a salary for those 3 years as well as having replaced with something less than a couple generations old.
 I assume that's what you meant, since true geometry is all but impossible to detect on modern drives.