It boils down to this: do you need access patterns that would not be well served by structured, tabular data? Relational databases are (or should be) extremely efficient at accessing data that can be arranged neatly into tables and read using the equivalent of pointer math. They suck, though, when the data is stored outside the table (as large text fields, blobs, and so forth would be). The more heterogeneous and variable the data, the worse performance gets. A schemaless database trades away efficiency with tabular data -- often precluding efficient runtime joins and, consequently, the multiple orthogonal access paths that normalization provides -- in exchange for more efficient generalized access.
As a thought experiment, imagine a data quagmire where, in order to make the data fit a SQL table, every cell in every row would need to be varchar(17767) and might contain multiple character-delimited values (or not -- each record can be unique), and every row can have an arbitrary number of cells. That's what schemaless data can, and often does, look like -- and something like Notes or CouchDB can work with it comfortably, with an efficiency that a relational representation of the same data cannot match.
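To make the quagmire concrete, here is a minimal sketch (plain Python dicts standing in for schemaless documents -- the field names are invented for illustration). Each record has its own shape, which is exactly what a fixed-column SQL table handles badly:

```python
# Hypothetical records: no two share a schema. Forcing these into one SQL
# table means a giant varchar cell per field, many of them NULL or
# character-delimited -- the "quagmire" described above.
records = [
    {"name": "widget-a", "tags": ["red", "blue"], "weight": 12},
    {"name": "widget-b", "notes": "hand-entered free text", "dims": [3, 4, 5]},
    {"serial": "X-99"},  # shares no fields with the others at all
]

# A document store accepts each record as-is; the "schema" is per-record.
all_fields = set().union(*(set(r) for r in records))
print(sorted(all_fields))  # every column a single SQL table would need
```

The union of fields across records is the column set a relational table would have to carry for every row, used or not.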
One thing I don't yet understand is the big revolutionary deal about the NoSQL systems vs. storing one's hierarchical documents in blobs in an SQL database or BDB. At the end of the day, I'm not sure that there is any magic - B-trees are B-trees, joins are joins, and disk seeks are slow.
1. You're right -- it's not just about which can do reads/writes better. It's not apples to apples; it's apples to oranges.
2. Look at the types of reads Cassandra can do versus the types MySQL can do: MySQL blows Cassandra out of the water in terms of flexibility (at a cost, obviously).
MySQL simply has far more flexibility. Cassandra has a ways to go.
3. In Cassandra you essentially have to hack your data model to fit the data structure to make things work, and if you decide one day that you need to read things differently, it's not always easy. You have to massively denormalize. (But hey, disks are cheap, as they say.)
In a nutshell, use the best tool for the job. Cassandra happens to just fit lots of use cases so it's worth looking at. I don't think it would be best for a company to port everything over, MySQL is still very good at many things and the flexibility in reads is worth the cost.
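The "massively denormalize" point in (3) can be sketched with plain dicts standing in for Cassandra column families (the names and read patterns here are invented for illustration): to serve two queries without joins, every write fans out to two read-optimized copies.

```python
# Hedged sketch: one denormalized copy per read pattern, the way a
# Cassandra data model is typically hacked to fit its queries.
events_by_user = {}   # read path 1: "all events for user X"
events_by_day = {}    # read path 2: "all events on day Y"

def record_event(user, day, payload):
    # every write goes to each read-optimized copy -- disks are cheap
    events_by_user.setdefault(user, []).append((day, payload))
    events_by_day.setdefault(day, []).append((user, payload))

record_event("alice", "2010-03-19", "login")
record_event("alice", "2010-03-20", "post")
record_event("bob", "2010-03-19", "login")

print(len(events_by_user["alice"]))       # both alice events, no join
print(len(events_by_day["2010-03-19"]))   # both events for that day
```

Adding a third read pattern later means backfilling a third copy of all the data, which is the "not always easy" part.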
You shouldn't wait for someone to tell you it's going to work for your use case. It really matters what queries you end up asking your data store: Cassandra really shines for simple ones and can be hacked to handle complex ones. You need to just dive into the docs, and http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-mode... is a great resource.
FYI: I love what Cassandra does and I think it's the best of the NoSQL options out there.
I'd say that the differences between types of NoSQL databases preclude any attempt to choose a winner. It makes no more sense than trying to pick the better of NoSQL and RDBMS.
I think this illustrates much of the problem with the NoSQL moniker - it seems like an attempt to group a bunch of completely different data storage systems by arbitrarily excluding a single one.
Exactly, and I didn't feel the OP was saying anything but that. What I thought he was trying to get across was that there are situations where NoSQL is the no-brainer superior choice, and by the sounds of it, it was a response to seasoned SQL DBAs criticizing the NoSQL option.
I work at your typical large corporation writing typical applications on top of SQL databases all the time. SQL databases are incredibly flexible and perform excellently, and I'd need a VERY good reason not to use one on a project. For me, 49 times out of 50 they are more than sufficient.
But, at least in my industry, there are cases where it's just the wrong technology for certain special requirements. I was AMAZED how poorly Oracle performed in one particular use case (real-time aggregation of a relatively large hierarchical database), despite MANY hours devoted to indexing and tuning, with every table pinned in memory. So our solution was to write a kind of in-memory database, which outperformed Oracle by a factor of well over 1000. Of course this is not surprising.
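For flavor, here is a minimal sketch of the kind of real-time hierarchical rollup described above (the tree and values are invented): keep the hierarchy in memory and aggregate leaves up the tree in one recursive pass, instead of repeatedly self-joining a (parent, child) table.

```python
# Toy hierarchy: internal nodes are dicts, leaves are numbers.
tree = {
    "company": {
        "east": {"ny": 10, "boston": 5},
        "west": {"sf": 7},
    },
}

def rollup(node):
    # An internal node's total is the sum of its children's totals;
    # a leaf is its own total. One pass, no joins, no disk seeks.
    if isinstance(node, dict):
        return sum(rollup(child) for child in node.values())
    return node

print(rollup(tree))  # 22
```

The same aggregation in SQL over an adjacency-list table typically means recursive queries or repeated self-joins, which is where the 1000x gap can come from.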
What is surprising, and I think a large motivation for the OP, is how ignorant some of even the most skilled SQL DBAs are of what's happening in the real world. About a year ago I was talking with one of the most senior database people at our company, and the topic of high-performance databases came up. I made some reference to how, if you are dealing with huge performance requirements, of course you wouldn't use something like Oracle (I think maybe we were actually talking about Google). And he made some reference to Google using Oracle, and I said, basically, no they don't, and then he said he knows they do because ~he knows someone that works there as an Oracle DBA~. Of course, Google almost undoubtedly runs lots of Oracle internally, but he was under the impression that Google is run on Oracle. And this is a guy who is smart (always relative, of course), and who really knows what he's talking about when it comes to Oracle. The thing is, though, there is this whole other world that he is entirely oblivious to.
And then there are just your standard DBA "professionals", who are just covering their asses. I personally consider myself largely uninformed when it comes to SQL Server and Oracle, but it's rare that I come across a SQL DBA who knows more about the internal system tables than I do, or how to script actions (SQL Server DBAs mostly), or write triggers, or do anything even remotely advanced. In the large corporations I work in, I'd estimate at least 50% of DBAs don't know anything about system tables beyond what they learned in their certification course. They've likely never used them in their work. A lot of them don't even know they exist. Their own profession is largely a mystery to them.
tl;dr: SQL DBAs aren't very well informed about what's happening outside their bubble.
Out of interest, what kind of hierarchical database was this?
Something like this?
Even more unfortunate is that virtually all these installed systems lack the performance capacity or advanced searchability to adequately mine this growing hoard of data. Administrators, under capex constraints, do not allocate resources for secondary systems that would duplicate data for mining purposes while alleviating strain on the principal production system. The no-budget problem will lead in-house programmers to build these research systems on top of open-source NoSQL solutions. There, the technology will prove itself.
Additionally, "NoSQL" comes in different flavors. Generally, all of them forgo the Consistency in CAP for Availability and Partition tolerance, which is fine for many use cases - just not primary medical data acquisition use cases. As the field matures, programmers and system designers will learn how to make this work better, to the point where one day NoSQL systems may be used as the primary repository for medical data. However, that day has not come. For instance, Riak lets you tweak knobs to favor certain aspects of the CAP theorem at different times while in production (specifically the w and dw parameters, http://blog.basho.com/2010/03/19/schema-design-in-riak---int...). But having just started working with Riak in the last month or two, I would still only use it as an analytics tool exposing my medical record data to m/r jobs at this point. And before jbellis smacks me: I think Cassandra is awesome and I'm looking forward to spending some time with it, but I'm still not putting my med app data in Cassandra just yet as a primary data store.
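A toy model of the w knob mentioned above (this is not the Riak API, just an illustration of the semantics): a write against an N-replica bucket is reported successful once `w` replicas acknowledge it, so raising w leans toward consistency and lowering it toward availability.

```python
def quorum_write(replica_acks, n=3, w=2):
    """Return True if enough replicas acknowledged the write.

    replica_acks: booleans, one per replica, True if that replica
    durably accepted the write. n and w mirror the tunables discussed
    above; values here are illustrative defaults.
    """
    assert len(replica_acks) == n
    return sum(replica_acks) >= w

print(quorum_write([True, True, False]))        # 2 of 3 acks meets w=2
print(quorum_write([True, False, False]))       # only 1 ack: write fails
print(quorum_write([True, False, False], w=1))  # w=1: sloppy but available
```

The per-request nature of these knobs is the interesting part: the same store can take analytics writes with w=1 and more careful writes with w=n.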
/Disclaimer. I work for a major University Medical Center and write business web applications in this area./
The Correlation database is a NoSQL database management system (DBMS) that is data model independent and designed to efficiently handle unplanned, ad hoc queries in an analytical system environment. Unlike relational database management systems or column-oriented databases, a correlation database uses a value-based storage (VBS) architecture in which each unique data value is stored only once and an auto-generated indexing system maintains the context for all values. Queries are performed using natural language instead of SQL.
Learn more at: www.datainnovationsgroup.com
The flexibility of the NoSQL model, coupled with the exposure of your data to the m/r paradigm, is the big win from my vantage point. Almost every NoSQL solution exposes your data via bindings in virtually every programming language, allowing almost any programmer to leverage NoSQL, whereas today you kinda have to have experience with data warehousing. Cost-wise, what you would spend on licensing can be invested in hardware, or more specifically in talent, instead.
Secondary data reuse as a concept is highly unstructured. Sure your primary systems capture data in specific formats but your analysis can take you in all kinds of directions. Being able to slice and dice without having to pre-define will be huge.
flame on? maybe you don't understand because your startup doesn't have a lot of traffic?
you even admit that you don't really understand how the nosql dbs even work!
I don't think people building those systems are supposed to talk about them...
Others are less secretive, but still want to protect the details of their schema. Properly and carefully developing a schema for a major system can be a big project that reveals a lot about your business, and can give an upstart competitor an advantage.
Whether a business chooses to talk about their setup depends on a variety of factors, including their business model. However, keeping such things private is often security through obscurity (http://en.wikipedia.org/wiki/Security_through_obscurity).
I don't think people building those systems are supposed to talk about them...
In a former job I worked on systems similar to what was described and my former employer would be very unhappy with me if I revealed so much as their hardware setup much less schema. I am not supposed to talk about such things, and there is an NDA that says so....
Whether it is legal for an authorized person from the company to discuss those matters is a separate matter.
Also, I must point out that this is not security through obscurity. Security through obscurity cannot be relied on, I agree. But in this case, it is a matter of preventing your competitors from knowing what you are doing.
You know your competitors can develop the same thing you did in time, but you want to make sure they have to spend that time rather than being "inspired by" reading over your source code or even stealing it entirely. In some competitive environments, even just knowing what your competitors are or are not capable of at that moment can be a huge advantage.
NoSQL and RDBMS are further apart than that. There are cases where it makes sense to use both for different parts of your system. Sure, you can force one to do the job of the other, but that shows you don't know when to pick one over the other (or were otherwise forced to do so...), not that they really target the same use cases.
It is really impossible to argue for something based on the fact that people who are supposed to be very smart are doing it. The only way to support arguments is by showing facts...
Shameless aside, as I don't want to post a message just to say this: isn't HN too slow lately? I'm at the point where I visit the site less often than I used to, because I don't want to experience the delay.
Facebook has absolutely insane sparse matrices to handle. They handle enormous volumes of traffic querying very specific (read: not cacheable between users) datasets. Moreover, they've already invested mind-boggling amounts of capital in their stack. The same goes for Amazon with Dynamo. These people operate on scales that startups like us can't even comprehend, and they've found it worthwhile to write their own datastores for those scenarios. Moreover, their use of those databases has apparently contributed to their success. That, to me, is strongly suggestive evidence.
That and HA/fault-tolerance is a no-brainer; Cassandra's scaling characteristics rock the socks off of any SQL DB I've used. The consistency tradeoff is well worth it for some use cases.
Compare them to StackOverflow, which, by recent evidence, has about 10% of the traffic of Digg. They're running a very straightforward RDBMS configuration on rather pedestrian hardware. If Digg has a 50-node cluster (for example), StackOverflow should require at least a 5-node cluster.
So if Digg is 10x the size, they would need (assuming perfect scaling):
3.33 Ghz quad core x 20, 480 GB RAM, 60 drive RAID 10
Oh, but you can't get anything more than 8 sockets in x86, and Windows only runs on x86 now. So, assuming you switch away from Windows, you'll need a Sun or IBM box for that.
Sun's kit is only dual-core and the processors aren't as fast (either per GHz or in clock rate), so here is the 64-CPU model you'd need, and it already comes with 64 disks:
"For a 64-processor gorilla, 2.4 GHz SPARCVI dual-core chips, 6 MB of on-chip L2 cache, 128 GB of memory and a 64 x 73 GB SAS drive raise the price tag to $10,100,320."
And don't forget you'll probably want to upgrade the 128GB of memory it comes with, and you'll really need two of these (the other for failover).
PlentyOfFish.com is, according to these posts:
512 GB of RAM, 32 CPU’s, SQLServer 2008 and Windows 2008
$10 Million? Try ~$100,000. (Granted, the article you linked was from 2007).
My company spends >$10k on fricking meetings to discuss whether they should spend $20k on a server (as well as other technical details they are unqualified to be discussing). Of course, anyone who actually knows anything about it is not welcome at these meetings. :)
So still a factor of 2.5 away from our hypothetical 80 core requirement, and not a refutation of GP's claim that x86 maxes out at 8 sockets.
That is not how capacity planning works at all.
One big problem I see in these comparisons is that when a NoSQL person claims their box is processing 5,000 req/sec, what does that mean? Are they denormalizing so much that it's equivalent to 500-1,000 req/sec on an RDBMS?
Another thing: when Digg was starting, their type of site was very novel. There wasn't much out there that approached the scale and growth they experienced. I'm sure that StackOverflow was designed with scaling in mind.
1) I'm much less likely to vote an answer/question up/down on SO than I am at Reddit. On SO, if I'm not asking or answering, I'm rarely causing any writes to the data store. On Reddit, I vote on most of what I look at. I could see this having a huge impact.
2) Obviously Reddit can do some caching, but I think SO can cache much larger pieces of data. As far as I know, everyone who goes to the main page of SO sees the exact same list of questions. On Reddit, each subreddit's top items can be cached, but they are mixed differently for each user.
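The caching asymmetry in (2) can be sketched as cache-key granularity (the rendering and cache shapes here are invented stand-ins, not either site's actual code): SO's front page is one global cache entry, while Reddit can only cache per-subreddit fragments and must assemble a different mix per user.

```python
cache = {}

def render(items):
    return "|".join(items)

def so_front_page():
    # One global cache key: every visitor gets the identical page.
    if "front" not in cache:
        cache["front"] = render(["q1", "q2"])
    return cache["front"]

def reddit_front_page(user_subs):
    # Per-subreddit top lists are cacheable, but the assembled page
    # depends on the user's subscriptions, so it can't be shared.
    items = []
    for sub in user_subs:
        if ("top", sub) not in cache:
            cache[("top", sub)] = ["%s-top" % sub]
        items += cache[("top", sub)]
    return render(items)

print(so_front_page() == so_front_page())                                 # same page for all
print(reddit_front_page(["pics"]) == reddit_front_page(["pics", "wtf"]))  # differs per user
```

One whole-page cache hit versus N fragment hits plus a merge per request is a meaningful difference at scale.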
Facebook, on the other hand, is incredibly complex because of all the interactions between users, not to mention the data is, if I recall correctly, stored in geographically disparate locations throughout the world. I don't have a link, but the shit that happens behind the scenes when you log on to your Facebook account is wild.
There's been a strong undercurrent of posts which basically consist of ad hominem "non-SQL databases are only for idiots who can't figure out how to manage an RDBMS properly". Pointing out that there are people who very definitely are not idiots and who can manage an RDBMS quite effectively, but who feel non-SQL databases are still appropriate for their use cases is, so far as I'm concerned, an acceptable rebuttal to that.
I've noticed that loading my comment and submission history is really slow, but loading everything else is as snappy as it's ever been.
I have a feeling that the front-page and comments get heavily cached, while our comment and submission history does not.
I read that as the obligatory appeal to authority that seems to impress some people. The rest of the post however was extremely interesting and likely as fact-filled as it gets when it comes to these SQL vs. NoSQL arguments.
I wonder what he thinks a "monster db server" is, and considering he included the DBA in the price, is this the price per year, or what?
Having recently set up a dual E5620 with 48GB of RAM and 8 SSD drives (160GB each) with a 3ware controller, all for just shy of $10K USD, I guess my understanding of "monster" is quite different. For $13K USD the same server would have 96GB of RAM.
If you don't need a 50 node cluster because your RDBMS is pulling down big numbers, then you don't multiply the cost of the RDBMS solution by 50 either.
The numbers posted here are pretty reasonable. 37Signals spending $7,500 on disks isn't outrageous. That's less than the cost of a single developer integrating a different solution over a few months. How long has Digg been working on this transition and how many employees did it require? They've probably spent a fortune. Just not on hardware.
On the other hand, NoSQL and object databases allow a programmer to just stuff data into a data store without worrying about a cohesive data model. If we consider mission-critical data that multiple departments of a large organization would want to see in multiple forms, we can find many ways that the "just save this array of values" approach produces serious problems.
But there are many applications where these problems don't appear. Digg seems like it could get away with NoSQL. A health-record site seems like it could not, since it will ultimately want ACID-and-beyond in its data model.
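The "just save this array of values" problem in miniature (the record shapes and field names are invented for illustration): two departments stuff records into the same store with different shapes, and every later cross-department query has to reconcile the variants itself.

```python
# Two producers, no shared schema -- exactly what a schemaless store permits.
store = [
    {"patient": "p1", "charges": [100, 250]},   # billing's shape
    {"patient_id": "p1", "visit_cost": 100},    # clinic's shape
]

def total_charges(records):
    # Every consumer re-implements this reconciliation logic, and
    # nothing stops the two shapes from describing the same money twice.
    total = 0
    for r in records:
        total += sum(r.get("charges", []))
        total += r.get("visit_cost", 0)
    return total

print(total_charges(store))
```

In an RDBMS, a shared schema plus constraints would force these two departments to agree up front; here the disagreement surfaces only at query time.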
The relational model is a fantastic model of data independent of application. It can even be a great model for an application using the same data in different ways.
But this approach clearly has a cost. In a way, the question is: is this an application with a company built around it, or a company with an application built around it? Digg and Google are applications with companies built around them. Here the RDBMS model doesn't make sense.
One more thing I'd add is I have no clue who these upset DBAs are and who is thinking Stump & Co. are dumb. Everyone making these sql/no-sql blog posts seems like they're starting a war with made-up enemies.
You can take a system like Cassandra and treat data very much in a normalized way which would reduce performance. You can take a system like MySQL and completely denormalize your data which would increase performance.
Any test where one set of data is normalized and one isn't is not a fair test.
Also, denormalization can be a big deal. Unless you have some sophisticated code managing it for you, you're buying performance at the cost of data storage management complexity. Now you have to manage many instances of data X. But there is a benefit in that you avoid crazy joins.
I think both have concepts to learn from each other. For example, in order to use a NoSQL option effectively, you end up implementing your own concept of indexes, something very easily done for you in RDBMSes.
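A minimal sketch of the hand-rolled index idea (plain dicts standing in for a key-value store; the user/email example is invented): the primary copy is keyed by id, and every write also has to maintain a lookup table by email -- the bookkeeping an RDBMS's CREATE INDEX does for you automatically.

```python
users = {}           # primary copy: id -> record
users_by_email = {}  # secondary "index" we must keep in sync ourselves

def put_user(uid, record):
    # On update, un-index the stale email before indexing the new one;
    # forgetting this step is the classic hand-rolled-index bug.
    old = users.get(uid)
    if old is not None:
        users_by_email.pop(old["email"], None)
    users[uid] = record
    users_by_email[record["email"]] = uid

put_user(1, {"email": "a@example.com"})
put_user(1, {"email": "b@example.com"})  # email changed

print(users_by_email.get("a@example.com"))  # stale entry was removed
print(users_by_email["b@example.com"])      # new entry points at user 1
```

Multiply this by every query pattern you need, and add the question of what happens if the process dies between the two writes, and you have a feel for what the RDBMS was quietly doing for you.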
I liked the way he personalized his argument to his own deployment situation rather than making generalizations. I also liked to hear about his experience with Cassandra (5 minutes to clone a hot node and have it balanced and in production).
(slides: http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=no-sql... )
Also, I can say rotational disks may not provide the economics that make RDBMSs seem attractive -- but FusionIO cards have really changed that. And I didn't just read the datasheet and get a nerd boner. I watched the queries from 8 beefy physical database boxes (that were getting hammered) combined onto one physical box that was identical in all ways except it had a FusionIO card. It handled 8x the number of queries with ease and could have taken a lot more punishment. Yes, the cards are expensive, but in the scheme of getting rid of 7 servers it actually saved a significant amount of money.
This is why people still use MySQL even for projects that aren't suitable for RDBMS. I use hosted MySQL at dreamhost and don't have to bother with anything except my app and data. It just works and is free with the web hosting package. Is there anything out there that comes close? I don't mind $1/month for 1GB of data. $25 for 2GB is not worth it.
Why not host it yourself? Getting a server like MongoDB deployed is trivial.
For instance, Heroku and MongoHQ are both using Amazon Web Services, so it wouldn't increase the latency to swap a MongoHQ database instance for a Heroku one.
"Shocked by the incredibly poor database performance described on the Digg technology blog, baffled that they cast it as demonstrative of performance issues with RDBMS’ in general, I was motivated to create a simile of their database problem."
The central question here isn't so much the maximum performance you can get out of an RDBMS, or how it compares to a NoSQL solution, but how Digg is getting such terrible performance out of their RDBMS design! The numbers just don't add up.
This article is just a bunch of straw men that avoid the main issue. And arguing that $7,500 is too much for a serious web SaaS vendor to spend is just comical.
"Has anyone ran benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously."
My answer to this point is that IO contention can be vastly reduced in MySQL (and probably handled even better in Postgres, I bet) with some tweaking of settings and lots of memory. Memory is pretty cheap these days, so stuffing a server full of RAM is really not a bad option.
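For a concrete flavor of "tweaking of settings", here is a hedged my.cnf fragment (the values are placeholders sized for a RAM-heavy box, not recommendations for any particular workload) showing the main InnoDB knobs that trade memory against IO:

```ini
[mysqld]
# Keep the working set in RAM; this is the single biggest IO lever.
innodb_buffer_pool_size = 48G
# Larger redo logs smooth out write bursts into fewer, bigger flushes.
innodb_log_file_size = 1G
# Skip the OS page cache to avoid double-buffering hot pages.
innodb_flush_method = O_DIRECT
```

With the buffer pool sized to hold the hot data, most reads never touch disk, which is exactly the "stuff the server full of RAM" strategy described above.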
This may be part of the problem, actually. 100 tables to serve posts with attached comments? Um.
Something about this seems broken. Why would it be inherently "nicer" to spend money on a service as you use it than on a product that you get to keep?
However, you can buy servers through a leasing company to get this benefit; you don't have to use EC2.
(Cube farms aren't necessarily cheaper than nice permanent construction, but you have to amortise tax deductions on the latter over 39 years).
NoSQL is also faster to develop and prototype with, since you only need to understand JSON dictionaries.
So your shit ships faster, and scales cheaper.
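The prototyping claim in miniature (an in-memory list standing in for a MongoDB-style collection; the field names are invented): with a document store there is no schema-migration step, because a new field is just a new dict key.

```python
posts = []  # the "collection"

posts.append({"title": "hello", "body": "first post"})
# A later record grows a field -- no ALTER TABLE, no migration script.
posts.append({"title": "update", "body": "more", "tags": ["meta"]})

tagged = [p for p in posts if "tags" in p]
print(len(tagged))  # only the record that has the new field
```

The flip side, as other commenters note, is that every reader now has to handle records both with and without the new field.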