Dear NoSQL: "SQL-isn't-scalable" is a lie (yafla.com)
138 points by portman on Mar 3, 2010 | 102 comments

This is a very reasoned, well-written article. While I like the idea that the NoSQL movement is questioning some core database assumptions, I have often been uncomfortable with the meme that "SQL doesn't scale". Scalability is always a function of your circumstance. This article makes a good point that the group of people for which SQL doesn't scale need another option (internet startups with little cash and massive server load). While that case gets a lot of attention around here, it's important that people not try to extrapolate that experience to other areas of IT.

What the NoSQL "movement" is doing is questioning some basic axioms of architecture and operations that may no longer hold in the modern era. There are many aspects to data, how it is used, how it should be stored & accessed, and what is or is not important within the set of available data. The RDBMS cabal set a particular set of standards (ACID) that were a good match for a lot of the early "big data" systems and the computing infrastructure of the time, but times change. As more people are bringing new data sets, new uses and access patterns for this data, and the computing infrastructure is changing from large servers to swarms of small systems, the options for storing and accessing this data are also changing. I don't know many in the NoSQL crowd who are actually opposed to an RDBMS as a solution, but it is not necessarily the first choice for any problem; most of the people in this group seem to start by considering the data and the application and then picking and choosing the characteristics of the data system that will be used to meet these needs. The massive and highly distributed systems get a lot of attention (and they are almost always NoSQL systems) but that does not mean that alternative data systems do not have a place up and down the IT stack, nor does it mean that an RDBMS is not a good option for some situations.

The RDBMS cabal set a particular set of standards (ACID) that were a good match for a lot of the early "big data" systems and the computing infrastructure of the time, but times change

Times change, but in the sense of adding additional systems with new requirements.

The existing systems have not gone away, and are not going anywhere. In fact there are more than ever of them around. I certainly still want my bank to run a system that has ACID transactions. Likewise my mobile phone's billing system, utilities bill, etc.

"The RDBMS cabal set a particular set of standards (ACID)" What's the problem really? ACID or SQL? They are independent of each other. You can have ACID without SQL (BDB) or SQL without ACID (MySQL 3.X). Which of the two is the problem?

The real crux of the issue isn't that SQL won't scale, it's that it won't scale efficiently or effectively. SQL scales, but you need expensive, exotic hardware (InfiniBand switches, SAN storage) that requires niche engineering skills just to get up and going. By the time a startup-type company purchases all of this hardware and has figured out how to use it, the window of opportunity will have closed. In addition, you're then pretty much locked into support from the vendor for performance issues. With solutions like Cassandra, you've got the source and can solve your own problems, which, to a company rich in software engineering talent, is exactly what they want.

Even if you look past all that, RAC and ASE will run into limits as they're still relational, which is not a truly scalable approach to data modeling. There are real limits to their scalability as well. Oracle RAC will only scale to 100 nodes. That's a lot of dough for a 100 node limit.

So there it is. SQL doesn't scale (effectively or efficiently).

"So there it is. SQL doesn't scale (effectively or efficiently)"

I think you just re-iterated the author's reason for writing perfectly.

Instead of saying 'SQL' (the query language) doesn't scale, perhaps you can say full ACID doesn't scale cheaply.

We spent a bunch of time looking at large scale distribution at Berkeley for the Mariposa project. The short version: if you want to break ACID then you have to take the semantics of the data into consideration. Transactional financial data needs full ACID, many other types of data have looser needs (status updates, etc) and those looser needs can be taken advantage of.

As it turns out, even common DBs like Oracle, Postgres and MySQL can be used differently to get much cheaper scale for certain types of data. See the FriendFeed article on MySQL as a docdb (also the Friendly ruby mapping on github).
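
A minimal sketch of the "SQL as a document store" idea, for anyone who hasn't seen it. Assumptions: SQLite stands in for MySQL (only because it ships with Python), JSON is used for serialization, and the table, column and function names are invented for illustration rather than taken from the FriendFeed article. Documents live in one opaque column, and a separate narrow table serves as an index on whichever attribute you want to query by.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# One opaque blob per entity; the database never looks inside 'body'.
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
# A separate, narrow table acts as the index on one chosen attribute.
conn.execute("CREATE TABLE index_user_id (user_id TEXT, entity_id TEXT)")
conn.execute("CREATE INDEX idx_user ON index_user_id (user_id)")


def put(doc: dict) -> str:
    """Store a document and return its application-generated id."""
    entity_id = uuid.uuid4().hex
    with conn:  # blob and index row are written in one transaction
        conn.execute(
            "INSERT INTO entities (id, body) VALUES (?, ?)",
            (entity_id, json.dumps(doc)),
        )
        if "user_id" in doc:
            conn.execute(
                "INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
                (doc["user_id"], entity_id),
            )
    return entity_id


def get(entity_id: str):
    """Fetch one document by id, or None if it does not exist."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def find_by_user(user_id: str) -> list:
    """Look up ids in the index table, then fetch each document by key."""
    ids = conn.execute(
        "SELECT entity_id FROM index_user_id WHERE user_id = ?", (user_id,)
    ).fetchall()
    return [get(entity_id) for (entity_id,) in ids]
```

Adding a new queryable attribute means adding another index table, not altering the main table, which is where the cheaper scaling is supposed to come from.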

To paraphrase the author:

"ACID scales just fine if you need it (and if you need it you most likely have budget for it). If you don't need full ACID, there are plenty of other cost effective solutions you can run on low bandwidth disks/memory limited CPUs. Mix and match to fit your needs.

Please, enough with the religion."

You're arguing semantics. NoSQL = No"ACID-Compliant Relational Database with Normalized Table Design"

Uh, that is a bit of a leap...

It seems like every NEXT BIG THING comes in two parts.

1) The provocation: "Throw away design and do everything with testing" - "never use SQL again, just key-values" - "Objects are Worthless!"

2) The sensible fall-back position you can use when challenged: "Tests are useful" - "You might not need full ACID" - "We can benefit from shorter inheritance trees"...

I suppose every new idea needs a way to make a splash...

This is what is known as the triad of thesis, antithesis and synthesis. A lot of progress works this way. [http://en.wikipedia.org/wiki/Thesis,_antithesis,_synthesis]

Thesis (established view): SQL/ACID solves all your data problems

Antithesis: Nobody needs SQL/ACID, let's throw it all out

Synthesis: By carefully considering data integrity constraints we can find a more optimal data management solution for a particular problem (alternatively: let's go to the pub!)

Excellent plan. Alcohol always helps people get over arguments.

Bones heal faster than internet battles :)

This is simply not true. I work for a major internet company that has millions of users. We use SQL for all of our data (images and SWFs excepted, of course). We do all logging to SQL tables. We use key-value cache servers for hot, non-relational data, but we don't have a lot of that.

We built most of our data center with cheap hardware. It is quite possible that we splurged on a load balancer and some other funky networking gadgets; I don't know. We did, however build quite a lot of custom software to abstract away the fact that we have split our data among dozens of machines. This is essentially the work of two people (one hardware, one software). Finding the people who can build such a system is difficult, to put it mildly, but once you find them, you don't need deep pockets.

Reddit is another example of a site that runs on SQL. Part of the problem with the "SQL won't scale" argument is that essentially all of the live queries function like key-value lookups (or lookups against a very small number of rows). If you're tempted to do something insane like service a request with a join, then you're doing it wrong and you need to cache.

Lookups by index will have the same performance characteristics no matter what. It doesn't matter if you're using a Python dict, an index in MySQL, or a key-value store.
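
A rough way to check this for yourself, as a sketch only: the snippet below runs the same point lookup against a Python dict and against an indexed column in SQLite. SQLite is standing in for MySQL purely because it ships with Python, and the table name `kv` and the helper functions are made up for illustration. Strictly speaking a hash lookup and a B-tree search are not the same complexity class, but both touch a handful of entries rather than the whole data set, which is the point being made.

```python
import sqlite3

# Hypothetical data set: 10,000 key/value pairs.
rows = [(f"user:{i}", f"profile-{i}") for i in range(10_000)]

# 1) Plain in-memory hash table.
as_dict = dict(rows)

# 2) SQLite table; the PRIMARY KEY gives us an index on k.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO kv (k, v) VALUES (?, ?)", rows)


def lookup_dict(key):
    return as_dict.get(key)


def lookup_sql(key):
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


def plan_for_lookup():
    # Ask SQLite how it would run the query; the human-readable
    # description is the last column of each plan row.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT v FROM kv WHERE k = ?", ("user:42",)
    ).fetchall()
    return " ".join(r[-1] for r in plan)


print(lookup_dict("user:42"), lookup_sql("user:42"))
print(plan_for_lookup())
```

The query plan should report an index search rather than a table scan; if it ever reports a scan, that is the "doing it wrong" case from the comment above.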


> So there it is. SQL doesn't scale (effectively or efficiently).

You might just want to say "cheaply".

I wonder how many people in the NoSQL and "SQL doesn't scale" crowd have either never met a truly competent, much less good, DBA (trust me, they are very, very rare) or decided it could not scale because they applied their programmatic and procedural logic to a tool that operates in a very different (set-based) paradigm.

I've always thought good DBAs were the rarest thing in the industry. And the most valuable.

"Good" DBAs share digs with santa and the tooth fairy, but competent and opinionated DBAs are not as hard to find as you might think. Just whisper 'NoSQL' and they seem to crawl out of the woodwork :)

Then that's another win for many NoSQL databases -- they often operate in a way that's much more natural for developers.

Let's wait a few years until all the bells and whistles are bolted on (in an ad hoc, product-specific manner, until someone invents ... a formal query language ;) in layers and tiers that take a NoSQL data store to a full-blown database which requires management.

(Didn't we go through this with the XML hype? Wasn't it the "simple" golden nail? Have you checked out the "Simple" landscape of SOAP and W[TF]S-*?)

We're not living in the past; it's more like back to the future. (Just wait for it, and "No-NoSql" ...)

It's another case of "right tool for the job". If you're just looking for a data store, NoSQL may be right up your alley. If you need a query language, you may want SQL.

I mean, the fact is that there are a lot of servers out there even today that are running SQL databases to be used simply as dumb data stores. It's not the most efficient coding, but it happens, at least partially because MySQL is available on every shared host and their brothers (wait, what?).

NoSQL datastores are the new object databases without the compile-time shackles. (This already makes them better than ODBMS failures; they're not as hard to adapt to new programming languages or frameworks.)

I have believed for a very long time that you cannot be a good OO developer unless you really understand data modelling (which includes relational data modelling). There are often good reasons to break the "rules" of data normalisation (which is pretty much what NoSQL data stores do) but you shouldn't do so unless you know why you're doing this.

IMO, most data is more interrelated than people think that it is, and at least having a good model of how the data is related will make your entire design better.

Just because some people can fuck up a good thing doesn't mean it isn't a good thing.

XML is simple. The problem was (and is) misapplication in areas it doesn't belong.

I'm sure that will happen with no-sql. No biggie.

Talk about mis-application; I know this is off topic but the over use of virtual machines everywhere is starting to trip me out. Why in the world would you buy a brand new machine to run your production web site and then put VMWare on it and run one machine instance?? I could understand if you buy a big powerful machine and carve it into multiple smaller machines all serving separate and unique purposes - that is a common and good use case for it but people aren't just doing that, they are taking it overboard.

Using virtualization has a lot of benefits even when you have just one virtual instance. In your example, the real overkill is VMware, not virtualization per se!

Instead, kernel-based virtualization such as BSD Jails, Linux-VServer or OpenVZ is the right approach. This type of virtualization comes without any performance penalties! And by using hard links, you can avoid wasting disk space, too.

The benefits are better hardware abstraction, i.e. easier movement to other machines, improved backup possibilities, etc. In fact, the BSD manual has recommended the use of jails for years, if not decades.

I'd say that this kind of virtualization is still underused, while inappropriate kinds of virtualization start to become overused. Maybe this is due to "successful" marketing of VMware. They promote the right techniques, but the wrong technology.

Because you can run the exact same machine instance on other hardware with VMWare, which makes you much more flexible in purchasing hardware.

Because you can host the VMWare instance on a SAN and not lose it when the machine dies, which allows cheaper machines.

Complete machine failover is easier.

It allows for easy live migrations for maintenance, and allows us to build VMs that are happily 100% oblivious of the underlying hardware. While most of our servers are split into multiple VMs, we consistently deploy new hardware with VMs even if they're intended for a single function simply because of consistency and ease of maintenance.

Virtualization is good in e.g. cloud scenarios where the usage is spread out to different machines through out the day. Rather than have 10 machines averaging 5% utilization at all times you can have 1 machine with e.g. 60% utilization all the time. A win.

Sadly, pointy-haired bosses have started putting developers on VMs, which is exactly the opposite use case: all people loading their machines at roughly the same time (e.g. I've heard people claim that in Java 1/3rd of time is spent compiling. If you have more than 3 devs....). A fail.

The biggest SQL problems I've seen the last few years have been ORM using developers using the database as a data structure store, rather than a database. For some reason, the act of getting data in and out of a database was the "hard problem" that had to be solved.

I know this sounds parochial, and if you're a close-to-no-money start-up then it might as well be Greek, but there is a reason for the business tier in a 3-tier architecture: it translates transactional data into your data structures... Done well it can scale like a mutha.

I do sort of appreciate the approach of NoSQL though. Map/Reduce is a concept missing from RDBMS and it seems like it could be incredibly valuable for certain problems. There are probably other concepts that could be applied as well. At some point I see the two camps sort of merging back together. ACID is just too important.

DBAs don't scale!

If your tool requires people who don't exist, it might be otherwise great but you've got problems.

I'm not even a NoSqlista by any means; it just seems ironic that you are making a good case for NoSQL - you just need programmers, not DBAs.

I'm not sure where people are getting this idea from. DBA != "Guy who works with relational databases." If you have a sufficiently large amount of data and/or a sufficiently high amount of requests for that data, you will eventually need someone to manage your data(base| store) and the accompanying cluster.

Edit: Or, now the programmer is the DBA, and he still requires the skillset of a DBA.

"Such a platform can yield very satisfactory performance for tens or hundreds of thousands of active users"

There are 253 million Internet users in China alone. What happens when your site needs to scale from 0.001% of them using the site simultaneously to needing to scale to 1% of them using the site simultaneously? Within a month?

"Of course if you index poorly or create some horrendous joins"

Which in the Twitter and Facebook cases is exactly what they have to do on many of their requests. As I've personally found out, relying on a database to do a join across a social network graph is a recipe for disaster. One day you'll be woken up because your database's query planner decided to switch from using a hash join to doing a full table scan against 10s of millions of users on every request. Then, you'll be left either trying to tweak the query plan back to working order, or actually doing what you should've done in the first place: architect around message queues and custom daemons more suited to your query load.

"Even with billions upon billions of help tickets."

At 50 million tweets a day, Twitter would hit 18 billion tweets within a year. Good luck architecting a database system to handle that kind of load. That is, one in which the database system is serving all of the requests (including Twitter streams) and isn't just being used for data warehousing.
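
For anyone who wants to check the back-of-the-envelope numbers in this comment, here is a quick sketch. The inputs are the figures quoted above; the per-second rate is a derived average that ignores peaks, so treat it as a floor rather than a sizing number.

```python
# Figures quoted in the comment above.
TWEETS_PER_DAY = 50_000_000
INTERNET_USERS_CHINA = 253_000_000

tweets_per_year = TWEETS_PER_DAY * 365
avg_tweets_per_second = TWEETS_PER_DAY / (24 * 60 * 60)

# 0.001% versus 1% of those users on the site at once
# (integer arithmetic to avoid float rounding noise).
concurrent_low = INTERNET_USERS_CHINA // 100_000
concurrent_high = INTERNET_USERS_CHINA // 100
growth_factor = concurrent_high // concurrent_low

print(f"{tweets_per_year:,} tweets per year")
print(f"about {avg_tweets_per_second:.0f} tweets per second on average")
print(f"{concurrent_low:,} -> {concurrent_high:,} simultaneous users "
      f"({growth_factor}x)")
```

So "18 billion within a year" holds up (18.25 billion), and the China scenario is a thousandfold jump in simultaneous users.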

"Such a solution — even on a stodgy old RDBMS — is scalable far beyond any real world need"

The disconnect here is this guy's needs are not the needs of a lot of us who are actively looking at alternatives. He is simply not familiar with the problem domain.

I don't think the original poster was making the argument that Twitter should run fine on a SQL database; in fact, I think he seemed to indicate the opposite. Namely, that large, nominally non-relational datasets that can afford to lose a little data here and there, or at the very least just take awhile to save it, are really what you need for serving up a big, fresh pile o' Social Networking.

So, bringing up Twitter or Facebook really doesn't make a good case against RDBMSes as a good tool in the toolbox -- they've got a very unique set of needs that don't apply to a lot of the rest of the world. So, of course, SQL isn't the best solution when you're dealing with trillions of rows of data, and don't really want to spend hundreds of millions of dollars on the infrastructure required to guarantee that you never go down, and never lose a tweet.

And keep in mind, RDBMS helped them get to the point where they could enjoy these problems; Twitter probably wouldn't exist in all its current glory if they spent a year building it to be 'scalable' before launching.

I think the reason that a lot of people end up hating RDBMS and SQL is because of one-or-more of (a) their only experience is with MySQL, which really isn't that awesome; (b) they've been burned by bad schema design; or, (c) they don't really get relational algebra or set theory.

For an example of 'bad schema design', I once worked at a company that had indices on nearly every column of their DB, even though almost none of these ever got queried. There was one database table with five indices on three columns, and of course this was the table that logged every single HTTP request processed by the front end. Including API calls. Did I mention that this table was never queried by any part of the application?

It was a poor design decision, and sure enough, it completely torpedoed performance. But the problem wasn't the RDBMS, because it did exactly what it was told to do, no matter how asinine.

So, in short, RDBMS aren't the solution to all problems, but they do solve a lot of problems adequately. NoSQL databases also serve an important role in the toolbox, but are much more narrowly-focused.

I'd also suggest that even in a social networking space, there's multiple types of data and some of it should be stored in an ACID environment (which probably means SQL). You may not care if you can't save a few thousand (or million) status updates to your data store immediately, but you need to care a lot more about your customer profile data (e.g., account, password, etc.) and changes there should be as ACID as possible. Your advertising or subscriber billing models should probably be ACID-backed as well. In both of these cases, that probably means SQL.

You make some good points.

However, the original author's point basically boiled down to: if you define scalability as the problems you can scale an RDBMS to solve, RDBMS systems are scalable. I'm not big on arguing the finer points of someone's tautology.

The particulars of a situation determine the scalability of a solution. For a lot of us working at web scale or on interesting new problems, an RDBMS won't scale. Sometimes it won't scale within the constraints we have, but sometimes it won't scale because we won't be able to build the system we're trying to build. His example of a company-internal billing system really only served to highlight the disconnect between the crowd following along well-trod ground, and the people out front doing innovative work.

No. The author provides counterexamples to the argument, "RDBMSes don't scale."

As for the "innovative work" versus "well-trod ground," there are still businesses who need better, more innovative solutions to well-trod problems, and I, for one, am not willing to ignore money on the table. The problem I'm working on works well with a combination RDBMS/key-value system for different pieces of the puzzle.

"[SQL ideal for when] 'Data consistency and reliability is a primary concern'."

I'm curious: let's say you have a Twitter-scale app that must satisfy consistency and reliability as a primary requirement. Is there really a NoSQL solution that can take you there without (effectively) raising the costs to the point that a scalable (money not an object) SQL solution would provide? (Kinda like how the difficult to extract North Sea oil became economically viable once oil prices crashed through a certain ceiling?)

The essence of NoSQL is that it gets its scalability by giving up consistency and reliability. Trying to run NYSE or Visa on NoSQL is pointless.

Well, "reliability" can be sliced in a couple of different ways since that term can cover both the A & P in the CAP options and it can also mean the elimination of single points of failure and an architecture that can degrade gracefully when components fail. Some NoSQL systems let you select the mix of consistency and reliability you need at a rather fine-grained level -- one thing that does distinguish these systems from the traditional RDBMS is that you are almost never in an all or nothing situation regarding any particular part of the data space unless you explicitly want to create that choice to enable other options.

NoSQL gets scalability by giving up a huge feature set, and, yes, cross-object consistency. I question the lower reliability argument though. Dynamo, one of the original NoSQLs, allows each instance to specify how reliable they want writes to be. Most of the NoSQLs allow you to tune the reliability factor, just as MySQL and PostgreSQL allow you to do.
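
The tunable reliability described here is usually expressed, in Dynamo-style systems, with three numbers: N replicas per key, W acknowledgements required before a write counts as successful, and R replicas consulted on a read. The sketch below is not any particular product's API (the class and method names are made up); it only captures the arithmetic. When R + W > N, every read quorum overlaps every write quorum, so a read reaches at least one replica that acknowledged the latest successful write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    n: int  # replicas holding each key
    w: int  # acks required before a write counts as successful
    r: int  # replicas consulted on a read

    def __post_init__(self):
        if not (1 <= self.w <= self.n and 1 <= self.r <= self.n):
            raise ValueError("need 1 <= w <= n and 1 <= r <= n")

    def read_overlaps_write(self) -> bool:
        # Read and write sets share at least one replica.
        return self.r + self.w > self.n

    def write_failures_tolerated(self) -> int:
        # Replicas that can be down while writes still succeed.
        return self.n - self.w

    def read_failures_tolerated(self) -> int:
        # Replicas that can be down while reads still succeed.
        return self.n - self.r


strict = QuorumConfig(n=3, w=2, r=2)  # overlapping quorums
loose = QuorumConfig(n=3, w=1, r=1)   # fast, but reads may be stale
print(strict.read_overlaps_write(), loose.read_overlaps_write())
```

The trade is visible in the two configurations: the loose one survives two replica failures on either path but can return stale data, while the strict one survives only one failure and always overlaps.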

The disconnect here is that only a tiny percentage of sites on the Internet have the requirements of Twitter and Facebook. They are outliers. Using them as an example here is ridiculous.

You don't need to have Twitter's userbase to have Twitter's data problem. Look at someone like FlightCaster -- they're using a NoSQL database to handle a huge amount of data to be useful to even their first user.

With the advent of cheap, reliable, available commodity hardware and network access, previously difficult data problems are solvable. SQL doesn't make sense for all of those problems.

Those are the problems I'm dealing with at the moment, so they're the ones that came to mind.

This is a perfect response: thank you.

Also, the definition of scalability seems fairly clear: how many times can you square your capacity before you need to re-architect?

I thought his example really telling.

Say that his internal help ticket tracking system was built for IBM, one of the largest corporations out there with 300,000 employees. 300k users is tiny for a consumer app. We had more than that when I was volunteering for a Harry Potter fanfiction website. Even if he was working at the largest company on earth by employee count (Wal-Mart), he'd still have fewer users than we had for Harry Potter fanfiction. And usually employees don't submit more than one or two help tickets a day, while Harry Potter fans tend to view a forum thread every minute or so.

It really hits home how consumer data processing has changed the game for data management. When I was working in the financial industry, we dealt with about 50GB of data/day coming off the exchanges. I thought that was a lot. But at Google, there's terabytes per day - at least two orders of magnitude more - and the total volume of financial transactions is basically rounding error on the data we handle.

It makes sense that with this exponential explosion in data, we'd need different techniques to handle it. Quite likely, RDBMSes do scale for the scale he's talking about. But a bunch of industries have opened up within the last ten years that require several orders of magnitude more data, and it's naive to think that just because it works for a corporate help desk or POS system, it'll work for a system that logs every page view and every action for millions of users.

>300k users is tiny for a consumer app.

Active users, not the total number of users in the user table. People grossly overestimate the scale of most web properties, where the number of active users is far lower than you likely imagine.

>But a bunch of industries have opened up within the last ten years that require several orders of magnitude more data, and it's naive to think that just because it works for a corporate help desk or POS system, it'll work for a system that logs every page view and every action for millions of users.

Strangely it says nothing of the sort. Yet here, again, you've used Google as the example. How many Googles are there? How much does that apply to about 99.999% of people who deal with databases?

Yet it always appears as the example.

> How many Googles are there?

There's tens of thousands of alternative search engines out there, see this for a sample from '07:


Most of them fail, but there's a lot of people trying to solve Big Data problems on a shoestring budget. That's part of the disconnect: us NoSQL folks are excited by having cheap solutions to problems we wouldn't be able to afford to tackle otherwise, whilst SQL folks are shaking their heads at the mess we'll have to clean up if our prototypes do become successful.

While SQL can scale, I think this argument is a little spurious.

I truly dislike the NoSQL stance of "never SQL;" it has its place, and its place isn't necessarily at the twitters or the facebooks of today. SQL scales very well with datasets that make sense for RDBMS. CRUD style applications. Core business apps. Data that doesn't necessarily need to be mined furiously. Trying to shoehorn, say, a high-volume message system or intensely self-referential (graph) dataset into an RDBMS is a recipe for disaster, however. Many of the performance issues people see with RDBMS seem to stem from this, I believe.

If your app is hugely real-time datadriven and the datamining (if necessary) can be offloaded to cron jobs, a K/V store is great. And can scale very quickly and relatively cheaply.

If you're doing something hugely relational that, if loaded into a SQL server would require an immense number of self-joins (I'm looking at you, graph analysis) that must be done in real-time and not offloaded to a cron job, a graph database is probably the way to go. They're harder to scale to a huge amount of data, but certain datamining tasks are made much easier - and don't require distributed map-reduce execution. Scaling will get much easier once some systems come out that use k-means (or similar) to cluster and shard the data. That kind of smart scaling would be nigh impossible on either a KV store or a traditional RDBMS. Google gets away with it with BigTable because they can throw so much cheap iron at it - a truly brute force solution. The same solution that needs to be taken with RDBMS when you shoehorn datasets into it that don't make sense.

Emil Eifrem (Neo4J) said it best in his presentation ( http://nosql.mypopescu.com/post/342947902/presentation-graph... ): NoSQL doesn't mean Never SQL, it just means Not Only SQL.

I don't think NoSQL people usually claim that SQL isn't scalable, just that it's unnecessarily complicated to scale.

You generally have to partition your data horizontally and thus give up many of the features that SQL has to offer: ACID transactions, unique keys, auto-increment primary keys, etc.

Then you have to come up with your own solutions to replace those features: eventual consistency, UUID keys, map/reduce, etc. And these happen to be exactly the kind of features that many NoSQL databases can give you out-of-the-box.
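
To make the replacement concrete, here is a minimal sketch of the two simplest pieces: application-generated UUID keys instead of auto-increment, and hash-based routing of a key to a shard. The shard names are placeholders and the function names are invented; a real setup would map them to connections.

```python
import hashlib
import uuid

# Hypothetical shard names; in practice these would be connection strings.
SHARDS = ["db0", "db1", "db2", "db3"]


def new_key() -> str:
    # Generated by the application, so no shard has to hand out
    # auto-increment IDs or coordinate with the others.
    return uuid.uuid4().hex


def shard_for(key: str, shards=SHARDS) -> str:
    # Use a stable hash: Python's built-in hash() for strings is
    # randomized per process, so it cannot be used for routing.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


key = new_key()
print(key, "->", shard_for(key))
```

Note that plain modulo routing remaps most keys whenever a shard is added or removed; consistent hashing is the usual way to limit that, and it is one of the things the NoSQL stores tend to ship with.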

You generally have to partition your data horizontally and thus give up many of the features that SQL has to offer

There are plenty of databases that will partition data without giving up any SQL features, but they cost money.

> There are plenty of databases that will partition data without giving up any SQL features, but they cost money.

They also either rely on a single huge SAN for storage (single point of failure + expensive as hell) like Oracle RAC, or they require specialized gear like infiniband to reduce intra-node latency like Exadata (starting price: seven figures) or they're analytics databases that are designed for huge queries with latencies to match like Vertica, ParAccel, etc. (Think minutes between data being loaded and being available to query.)

I'll take NoSQL, thanks.

Afaik Exadata wasn't even originally meant for OLTP. Seems like another case of a high-latency analytics/warehousing system being marketed as a "distributed database". They're now claiming that they can get OLTP grade performance with SSDs on Exadata, but I don't buy it. The promotion of Exadata is ironic, given that I remember one of their engineers claiming (on his personal weblog) not too long ago that OLTP on top of shared nothing was impossible.

As for the high-latency analytics databases (Vertica, Greenplum et al), I don't see much market for them either. Their big advantage over Hadoop was claimed to be the ability of non-programmer analysts to use them (via SQL), but Hive (which now even has JDBC drivers for it, allowing it to work with existing OLAP tools) solves that problem as well.

Why would the need for "specialized gear like infiniband to reduce intra-node latency" be limited to parallel databases? (I assume you mean "inter-node"...)

This whole discussion is about parallel databases since that's the only way to scale beyond the performance of one machine.

Well, replace "parallel databases" with your favorite term for the parallel databases that fall outside NoSQL (VoltDB, Exadata, sharded MySQL, etc). My point being that the alleged need for high-speed interconnects is orthogonal to SQL vs. NoSQL.

But it's not. Because SQL databases (strictly speaking any requiring strong consistency... which is mostly RDBMSes) are highly latency sensitive, where NoSQL databases like Cassandra design around that by saying "hey, you could not see the most recent write for a few ms, unless you request a higher consistency level." And most apps are fine with that. As a bonus you get multi-datacenter replication with basically the same code, another place most RDBMSes are weak.

It's a classic design hack -- redefining your goal as an easier problem.

Oh, I see what you're saying. Yes, the interconnect is orthogonal (although you could argue that strong consistency requires more complex protocols like 2PC so interconnect latency becomes critical).

Consistency, Availability, Partition tolerance. NoSQL stores usually sacrifice consistency, and instead settle for eventual consistency. SQL (i.e. RDBMS) stores, with their usual emphasis on transactions, must necessarily sacrifice something else. SQL that doesn't hew to a hard line on consistency and transactions doesn't really have all the features of SQL. This is the distinction that matters most, IMHO, in the NoSQL strand of thinking.

I admit that the SQL vendors (besides MySQL) made a mistake by putting ACID above scalability; that's clearly not always the right choice. However, CAP still allows a SQL database that is scalable, consistent, and available.

No, that's exactly what CAP doesn't allow. Unless by scalable you mean non-horizontal scaling. In which case yes, but we already knew that big machines make things fast.

> CAP still allows a SQL database that is scalable, consistent, and available

Name one please. It seems you are either fundamentally mistaken about what CAP implies or are constraining the "solution" to a clustered system that is effectively a single RDBMS hiding behind lots of tightly-coupled components.

A tightly-coupled (whatever that means) cluster sounds like a perfectly legitimate way to scale to me.

And what if that datacentre goes down? And what if you want reasonable (<50ms) latencies in different parts of the world?

Except for postgresql.

It certainly seems that the NoSQL movement is largely fueled by two facts: 1. MySQL sucks 2. Oracle is expensive

I'd love to see Postgresql get more attention, as I feel that they have scaling up and scaling out handled fairly well, whereas MySQL/InnoDB has a hard time even scaling up (which is why the Drizzle project even exists).

Does Postgres partition across a cluster? That's what we're talking about here.

Yes. This is what Skype does.

I'm not even so sure that they are unnecessarily complicated to scale, or that you should need or have to replace the features you list. I am sure, however, that everything you list ends up needing to happen if the solution chosen doesn't apply well to the problem at hand.

When you really dig deep down into each and every article on this subject, whether for or against NOSQL, the most important (and yet unstated) fact is this:

It isn't that RDBMS systems scale or don't scale, or that NOSQL systems scale or don't scale, it's that any solution which prioritizes (just for example) consistency and availability is not going to scale effectively for a problem that instead prioritizes availability and partition tolerance.

I am willing to bet that any time a problem and a given solution don't align on the two attributes they've respectively prioritized from CAP, there will be a claim that the solution "doesn't scale". The reality is just that the solution wasn't applicable to the problem at hand. If instead one evaluates solutions which match the problem's CAP priorities, the solutions will scale effectively (modulo their individual pros and cons relative to the other options within the evaluated CAP-priority-matching set of possible solutions, of course.)

Was anybody else dumbstruck by the article's first comment by this bright chap, "Jeff"?

Brilliant. I've forwarded this to my team.

We make a tax solution and I've been dealing with vague "we should use NoSQL" comments from a few of the less capable members of the team.

If his team members read all the way to the comments it's going to be very awkward tomorrow at work.

Well if some of the dumbshits I work with at the White House read this I'll be in trouble too but I don't think that's going to happen.

Care to buy some classified material?

Reading this article was a bit annoying. Right from the start: "I work in the financial industry [...] I worked in the insurance, telecommunication, and power generation industries." I was thinking only - you're not even supposed to look at nosql from that perspective - there's nothing for you there... just go away.

It doesn't support transactions and acid most of the time. There's no "we pay $xxxM for the support and blame you for everything" company in nosql products. It's not the same workload as you'd expect from kv/document-store used as a webpage backend.

One of the few serious "nosql" databases for enterprises like that is Berkeley DB - it's got what they need. I'm not sure why he wrote that blog post... it just stated the obvious, but in the form of a rant.

The funny thing though is - Berkeley DB is exactly what NoSQL is about... and it is used for local reliable storage in many big enterprises. Replication, logging, transactions, etc. - and it's just a kv-store really.

I'm not sure why, these days, anyone would choose Berkeley DB over something like sqlite.

BDB has proper transactions, provides the same type-safety as sqlite (i.e. none), doesn't have to go through abstractions like JDBC (overkill for sqlite), is backed by Oracle, has a pure Java implementation, and has replication.

Sqlite has columns and can search based on them. It can also save you a couple of lines on manual joins. Does it provide something more?

(although tbh, I'd take Tokyo Tyrant over both of them - has columns, writer lock + server-side scripts instead of transactions, same model of replication as BDB)

The berkeley interfaces that come out of the box with e.g. Python don't have any of the advanced features that the sqlite interface has, e.g. concurrent access. Any advanced usage of Berkeley DB is far more complex than sqlite, e.g. try opening a database so that multiple users can read/write it. With sqlite, no extra steps are necessary if you have an occasional separate process that needs to access the database (obviously it's not built for efficient concurrent access).

Ad-hoc SQL queries on the sqlite database are also a huge win. Much better than defining your own data structures and then tools to read/write them.

I don't know what you mean by "proper" transactions. Sqlite has transactions for DML.
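
As a small illustration of both points (DML transactions and ad-hoc queries), here is a sketch using Python's standard `sqlite3` module. The `accounts` table, the `transfer` and `balances` helpers, and the sample data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 100), ("bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return False if the transfer was rolled back."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit to dst is undone too.
        return False


def balances(conn):
    # An ad-hoc query: no custom file format or reader tool needed.
    return conn.execute(
        "SELECT name, balance FROM accounts ORDER BY name"
    ).fetchall()
```

An overdraft such as `transfer(conn, "alice", "bob", 500)` fails the `CHECK` on the second `UPDATE`, and the rollback also discards the first one, so neither balance changes.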

I wish that people would understand that people who disagree with them are not 'liars' but are instead people who disagree with them.

It is possible to honestly disagree on a technical issue, you know. And it is good professional and personal practice to only accuse someone of lying when you are quite sure that they are deliberately telling you something they know to be false...

I like the article, but the first comment left on it is telling:

I've been dealing with vague "we should use NoSQL" comments from a few of the less capable members of the team.

This appears to be a common attitude of development managers in the corporate world: that anyone who starts suggesting anything vaguely "new fangled" is surely a naïve novice, rather than someone who is good at picking up and investigating new technologies.

They're dealing with taxes (ie, money) and the comments are vague, so it could just as well be that the comments are coming from ungrounded n00bs who get blown about by the winds of every passing fad. I'd expect someone good at actually figuring out new technologies to be able to come up with some rather non-vague reasons to use them.

Unfortunately, managers who aren't defensive about new technologies end up working with XML databases! Not all new technologies are better and certainly the benefits of NoSQL are up for debate.

XML databases are exactly the sort of glossy new technology that stuck-in-the-mud managers did pick up on, mostly because they respond more to glossy vendor pamphlets, sales calls, and trade show pitches than to the grassroots findings of their underlings.

My favorite part of the rant is the suggestion that the nosql alternative to spending millions on a ginormous RDBMS was throwing away throughput by using Amazon AWS. I guess any argument becomes easier to make when you define the other side by its incompetence.

The only valid point made seems to be that vertical scaling of a RDBMS can be a multi-million dollar exercise.

Unfortunately, what you call incompetence a lot of other people call best practice.

If throughput and transaction/analysis speed is a requirement then AWS is not the answer -- anyone who suggests otherwise has never used it for anything larger than a toy dataset. I am currently migrating a large data analysis system (20+ TB running over about 500 EC2 hadoop workers) to a dedicated cluster because the internal EC2 latency reached a tipping point in our analysis runs. If you have the dough to spend on a big vertical system you can spend half as much on a dedicated cluster running a NoSQL solution and probably meet the required spec. The original article was comparing a Porsche Cayenne to a bicycle; the bicycle works for some problems and in certain cases it (or a fleet of them) can solve the problem better, but it was a dishonest strawman comparison for the subject at hand.

I was really with this article, right up to this:

> If you lose a Status Update, or several thousand of them, it will likely go unnoticed.

What? If Facebook lost half of their photos, or if Twitter lost a few thousand tweets, there'd be riots in the streets. Okay, maybe not quite that much unrest, but still.

I've run a few medium sized sites (with traffic most people here would drool over) and I would say that people are much more forgiving of slow pages than lost data. Losing a few forum posts would cause riots in the streets.

I'm curious how services like Amazon's RDS will change this perception.

A SQL database may be difficult to scale, but it is something that can be largely encapsulated and outsourced. If Amazon RDS, or some other product, handles the hardware and software configuration, then the developers can just focus on the application portion of it.

This isn't to say that scalability is guaranteed, it's still important to optimize queries and the data structure. Also, there are problems where NoSQL is simply the better and/or cheaper option.

But if these services can encapsulate a lot of the difficult part of scaling SQL, it still makes SQL a very attractive and powerful option for most(?) problems.

RDS doesn't scale at all. The performance of RDS won't exceed that of MySQL running on a "Quadruple Extra Large" instance (because that's what RDS is).

Ok, but how about a service taking whatever scalable SQL systems the financial or pharmaceutical industry use, even Oracle, Postgres, or Microsoft SQL server, and then selling it as a service similar to RDS?

My point is, I think there is an opportunity providing enterprise-level SQL scalability on a per-use basis. It won't replace NoSQL systems, because there are some problems where they're clearly better, but it could be done and provide relatively-affordable, scalable SQL access to startups.

Enterprisey databases are just too expensive for the HN crowd; making them into a service can't fix that.

Azure might come close, but I don't know what performance it can provide.

Azure doesn't run a lot of SQL Server functionality.

> I'm curious how services like Amazon's RDS will change this perception.

It won't. It's just hosted MySQL.

One complaint that I have with SQL databases that you don't often hear elsewhere is that they are very complex. People often use them where a much simpler solution would suffice because they are what people are used to using. All else equal, we should choose the simple solution over the complex one, because it is less likely to fail and easier to extend.

The word "butthurt" comes to mind when thinking of this article. ;D

maybe SQL scales, ACID probably doesn't

I would wager that SQL, and ACID, will scale larger than many startups will ever actually need.

True, but my sincere hope is to outgrow Postgres someday.

I think the problem is that in a startup, the leads are typical code slingers who are very good with programming, competent with SQL, and manage basic DBA (hand held by documentation).

Even if they are that rare startup that has heavyweights on both sides of the data/process divide, the second issue is that intelligent schema design (imo) does not fit the agile model all that well.

It's really more of an economic issue than a technological one: it is quite possible that the poorly designed and implemented SQL solution (see above) cannot scale. Obviously the problem is not SQL or RDBMSes. But when you have a very smart, but young and not yet seasoned, tech team who hasn't seen it all, do you think they will blame themselves or lament that "SQL can't scale"?

SQL has, by modern standards, a shitty API. The DSL is an ugly mess and it's extremely difficult to reason about performance, especially when joins are involved.

Ergo, some not-by-choice SQL users only scratch the surface and use common features: INSERT, UPDATE, DELETE, SELECT, also known as CRUD.

Ergo, it becomes easy (although wrong) to conclude that SQL isn't doing very complicated work behind the scenes and that it's just another overcomplicated dinosaur POS like the Windows operating system, one that remains popular only because it's a standard.

Thus, a lot of undue hate gets directed at SQL with little attention paid to the subtleties of what it does extremely well and where it behaves poorly.

I think "SQL isn't scalable" is in the same league as "Java isn't concurrent". It can be, if you have learned the necessary skills and are willing to deal with a bit of pain. Is Clojure astronomically better for concurrency? Sure, but people can and do scale with SQL databases, and they do this in large part because there are problems for which SQL is the appropriate solution.

Why are SQL and RDBMSes conflated? Couldn't one database speak both SQL and something else if it wanted to? In the same way that Clojure and Java both end up as JVM bytecode, couldn't SQL just become another layer on the encoding stack, above something simpler?

tl;dr: "sql does so scale! if you throw $millions of hardware at it." yawn.

The article points out that it has already scaled with hundreds of kilobucks or even millions.

However, it also points out much lower-end hardware solutions that cost under $10k but perform much better than the largest EC2 instance, for I/O.

ETA: This is why I tend to roll my eyes at the notion of "commodity" hardware. The article's low-end array is 400MB/s, but rolling one's own can yield over twice that for the same or lower price tag. All this well before reaching the unscalable cliff of enterprise pricing.

Yes, but there's a big difference between a 400MB/sec. supported storage array for < $10K, and a garage built "roll your own" storage array that can do 800MB/sec. for half the price.

The main difference is that, even in startup companies, you have a supported solution and can call someone at 2 am when your storage dies and expect to find replacement parts and support. Good luck trying to drive to Fry's and buy replacement hard drives for a server that was built 2 years ago by someone that no longer works there.

Yeah, right. I have worked on storage systems that run into the tens of megabucks and even with pricey "4 hour" support we were often SOL when we actually needed the company to live up to its claims. The difference with the garage-built system is that I sometimes have components sitting around my desk or that can be pillaged from a VP's desktop to repair the system -- try doing that when the specialized disk controller on your gold-plated solution goes tits-up.

Indeed, this is another benefit, or, rather, a host of benefits, to using cheap, commodity parts.

This may not be the case at a startup outside of a technology hub and is almost certainly not the case where nobody in the company is competent with hardware[1].

Otherwise, a failure isn't just easy to correct but cheap, too. It's so cheap that keeping cold spares around is a no-brainer, unlike with the "gold-plated"[2] products.

The other danger is that the "engineer" who comes out to hand-deliver and install the replacement part gets it wrong. If he pulls the working, rather than failed, part[3], there's not much consequence for him personally or the vendor, unlike with startup founders or even staff.

In fact, you can even engineer it with enough hot spares to ride out its useful life. 4% AFR for your 88 disks? Just add 8 hot spares. Worried about cables or the controller card? Double up. Both together raise a $9k 1 gigaBYTE/second (4GB/s peak) array to a whopping $10k.
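
The arithmetic behind those numbers can be sketched as follows. This is a back-of-the-envelope model, assuming a constant annualized failure rate, independent failures, and spares that do not fail while idle; the function names are hypothetical.

```python
def expected_failures(disks, afr, years=1.0):
    """Expected number of disk failures over the given period."""
    return disks * afr * years


def years_covered(spares, disks, afr):
    """How many years of expected failures the spares can absorb."""
    return spares / expected_failures(disks, afr)


def relative_cost_increase(base_cost, new_cost):
    """Fractional price increase from adding spares and redundancy."""
    return (new_cost - base_cost) / base_cost


per_year = expected_failures(88, 0.04)           # about 3.5 failures a year
coverage = years_covered(8, 88, 0.04)            # a bit over 2 years
overhead = relative_cost_increase(9_000, 10_000) # about 11%
```

So under these assumptions, 8 hot spares absorb a bit more than two years of expected failures for roughly an 11% increase in price. Whether that spans the array's useful life depends on how long the disks stay in service.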

Even just doubling everything is likely to cost far less than a year of tarnished-bronze support from a big vendor.

[1] By which I mean assembling discrete consumer parts, not soldering or anything lower level.

[2] Personally, I prefer to refer to "enterprise" targeted pricing as "hookers and blow," but it is, admittedly, without a catchy adjective.

[3] Yes, I've had this happen. I've also watched a colleague pull the wrong drive out of an array, against his better judgment, at the insistence of the vendor's phone support.

Totally incorrect. Ok, go ahead and build me a home built storage array with 1TB SATA drives off the shelf. Then, 3 years from now, when one of your drives fails and you don't have any spares, try to find a new one that matches the exact geometry of the existing one.

What's that? You can't buy that exact drive so now your homemade RAID 5 is running in degraded mode and you hope it will stay up long enough to copy your data off onto another system? Sucks to be you, you tried to save a few bucks and got burned.

In the enterprise, we pay big bucks because we want to KNOW that we can call an 800 number and get an exact replacement hard drive, even if they stopped selling them 3 years ago.

> Then, 3 years from now, when one of your drives fails and you don't have any spares, try to find a new one that matches the exact geometry of the existing one.

With arrays I build, I don't have onerous constraints like requiring identical size[1]. Moreover, if I'm not already retiring disks at the 3 year mark, I'm very much remiss in my duties.

> Sucks to be you, you tried to save a few bucks and got burned.

We're not talking about a few bucks. We're talking hundreds of thousands of dollars. That's enough to pay a salary for those 3 years as well as having replaced with something less than a couple generations old.

[1] I assume that's what you mean, since true geometry is all but impossible to detect on modern drives.
