His whole concept of "SQL doesn't scale" is the typical crap I always hear from people that are either not database experts, are using the wrong database technologies, or don't know what they're doing. More than likely a combination of all three.
And just because you can create an object model, and a simplistic data model, does not make you an architect of large, scalable database systems.
I have, over the past 4 years alone, built Oracle-based, fully scalable databases that handle over 25 million users daily.
Go read up on their RAC architecture, and the "shared everything" implementation.
Sure, it's expensive, but it works great.
For instance, if you play sports games from the world's largest video game corporation, all of your online transactions (achievements, etc) go through exactly such a system that I spent a year architecting and implementing.
If you do any online banking in Canada, or with some of the larger banks in the US, that, too, is on my resume.
I find that this type of FUD comes about from people that aren't good at designing and implementing large databases, or can't afford the technology that can pull it off, so they slam the technology rather than accept that they, themselves, are the ones lacking.
Most of them tend to come from the typical LAMP/SlashDot crowd that only have experience with the minor technologies.
Those of us that do it for a living, using the right technologies, seem to have no problems whatsoever scaling SQL.
Anyway, having built both transactional systems (trading algorithms) and big social systems (delicious) I think the main issue is: The right data structure and tool for the right job.
Banking is a radically different problem than internet-scale social software (which I assume is what they're talking about when they say "doesn't scale"). Access patterns are different, read and write loads are different, etc.
The main issue here is that a lot of social software wants something like a fast per-user data store, plus something like a distributed inverted index for finding things globally.
There's NO great reason for one user's data to be in the same table as someone else's. I guess it was handy for stuff like calculating the average number of tags, etc.
Instead, you want to have better control over data locality and the like, given your access patterns.
Now, personally, I would use an actual SQL engine for a single-machine persistent store, and build a distribution layer on top of that. Concurrency is hard, etc.
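To make that concrete, here's a toy sketch of such a distribution layer, using SQLite as a stand-in for the single-machine SQL engine (the table and key names are made up, and all the hard parts - replication, failover, rebalancing - are waved away):

```python
import sqlite3
import zlib

class ShardedStore:
    """Toy distribution layer over single-machine SQL engines.

    Each "node" is a separate SQLite database; a row is routed to a node
    by hashing its user key, so one user's data always lives together.
    A real deployment would add replication, failover, and rebalancing.
    """

    def __init__(self, num_shards=4):
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for conn in self.shards:
            conn.execute("CREATE TABLE bookmarks (user TEXT, url TEXT)")

    def _shard_for(self, user):
        # crc32 is stable across processes, unlike Python's built-in hash().
        return self.shards[zlib.crc32(user.encode()) % len(self.shards)]

    def add(self, user, url):
        conn = self._shard_for(user)
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO bookmarks VALUES (?, ?)", (user, url))

    def urls_for(self, user):
        conn = self._shard_for(user)
        return [r[0] for r in conn.execute(
            "SELECT url FROM bookmarks WHERE user = ?", (user,))]
```

The point is that the SQL engine still does what it's good at (durability, concurrency, indexing) on each node, and the routing layer stays dumb.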
But assuming RAC is the right solution to all problems is probably not a good idea. I've seen this go terribly awry.
SQL wins at things that B-trees and hash indexes are good at; but a lot of things are better served by other ways of organizing data.
But there's a huge problem -- AppEngine succeeds at seamless multi-tenant truly-distributed clustered hosting thanks to BigTable. Heroku needs to support standard Rails apps, so Postgres is the best they can do, and it's a huge hole in their offering.
You just can't make Postgres (or Oracle) scale on an ideal horizontal the same way you can distribute IP, DNS, HTTP proxying, HTTP serving, memcache, message queues, or bigtable. You can't expose Postgres as an ideal service that just keeps up with what you throw at it.
However, as I said in the first comment to the blog post, if you define an interface using stored procedures (a pattern familiar to many Oracle DBAs), then PL/Proxy (http://pgfoundry.org/projects/plproxy/) lets you do hash-based partitioning in a way that's more or less transparent to the end-user of a DB, again assuming that it has a defined interface. The PL/Proxy installs form a 'bus' between the DB and the user.
Self-promotion: I'm currently working on hacking PL/Proxy into something that can be used to auto-scale a Postgres cluster on-demand, which has defined interfaces as a prerequisite, along with some other things. The end-goal is to do exactly what you say: to scale out Postgres like any other internet service on an ideal horizontal. Link: http://code.google.com/p/hotrepart
(Hum, I'm gonna get accused of spamming for repeating myself so often :-)
However, under my re-qualified assertion :-) for large, complicated commits, the logic would either have to be at the database level for it to be in the same transaction, or a solution using temporary tables could be put together for more convoluted calls.
Neither of these is elegant, I'll grant you, but the two basic approaches - longer transactions over logic closely coupled to the datastore, or staged writes - are what most DBAs on high-end databases end up doing anyway, and would probably be reproduced in some form or another in any ACID system, no? Either way you'd still have a database that scales.
We design apps to treat the datasource as a network service, and have no problems load balancing DB connections across our database cluster, and adding new DB nodes as required.
The biggest problem I've seen is the relative lack of support for Oracle in something like Rails, or non-Oracle app servers in general. I've had to custom write some connection pool failure detection for a few apps to deal with cluster failure (something that comes out of the box with the Oracle app server, but we wanted to use JBoss and GlassFish), so while it's not perfect, it's most definitely doable.
Just because his particular architecture or offering doesn't fit well with typical database deployment architectures, doesn't mean that "SQL doesn't scale".
Google's datastore promises that if you write your app for their platform and it works on the small scale, it will scale out without problems. Datastore latency is constant regardless of how many records you have there. How will you scale a join between two tables, each partitioned across multiple servers, transparently to the app?
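For what it's worth, here's roughly what that cross-shard join ends up looking like once it gets pushed up into the application layer (a naive sketch with made-up row data, not anything Google actually does internally):

```python
def scatter_gather_join(user_shards, order_shards, key="user_id"):
    """Naive cross-shard hash join done in the app layer.

    This is the work a single-node SQL engine normally hides from you,
    and why "transparent" distributed joins are hard to offer as a
    seamless service: every shard must be queried and the results merged.
    """
    users = {}                         # build side: key -> user row
    for shard in user_shards:
        for row in shard:
            users[row[key]] = row
    joined = []                        # probe side: match rows on the key
    for shard in order_shards:
        for row in shard:
            if row[key] in users:
                joined.append({**users[row[key]], **row})
    return joined
```

Note everything the sketch ignores: paging, retries, shards that time out, and the memory cost of materializing the build side.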
It would seem that the problems of a bank with a large, already established userbase and a lucrative, stable business model are very different from those of a start-up with no tested business model, which may never become popular enough to need to scale and may not survive.
The original article mentions nothing about cost or the ability (or lack thereof) of any company to afford it.
He made a flat out statement that said SQL DOESN'T SCALE, which is wrong.
There are other methods for dealing with initial growth and the cost constraints... no need to blow your wad on Oracle out of the gate.
But then I imagine a lot of non-Oracle types aren't even aware that Oracle can be very flexible with their licensing for startups. You can, for example, lease/rent your Oracle licenses on a monthly basis to help get the most out of your cash flow.
I've been involved with a number of startups where we did our initial rollout on Postgres, but ensured that the application architecture allowed for us to fairly easily swap in a larger, more scalable solution if/when needed.
For that matter, in my opinion, too many people and startups tend to over-engineer their initial product offerings, making them too complex and worrying more about nonexistent problems (like Google-sized scaling) rather than on solid features and business process. But that's fodder for another thread, I'm sure.
Now let's get to the facts: while perhaps Oracle might have nice features, that doesn't mean that SQL is the best we can come up with, which is one of the primary points of the article (and one that seems completely unaddressed here). In a sense he brings to attention a common occurrence throughout human history wherein we reject change for comfort, and these comments are doing nothing more than supporting that. Oracle is, for the most part, entirely too expensive for most businesses, and even the businesses that are large enough to adopt it aren't making the profit they could with more affordable technology. You should also note that, while banks do have strict requirements on data handling, they really serve very small user bases when compared to the likes of Google, Amazon, eBay, Facebook and so on. Sure, the requirements are different for each of these, but ultimately your argument holds no water against these infrastructures, which are inherently not SQL and, at the same time, seem to be much more accepted as innovators of scalable technologies for the future...
One big problem is that real innovation comes at the expense of backwards compatibility, which would involve making a lot of changes. I can relate, since massive changes in most cases imply bugs, which is a very unsettling position for some of these companies, but it doesn't mean there isn't a problem. Sure, banks and other long-established companies would rather shell out the cash to support their legacy ideals than innovate new solutions, but he's right that we've spent many, many man-years trying to port a dated philosophy to an age that requires more scalability at lower cost. Let's also note that banks haven't been proving their practices to be economically sound lately, so what is their input worth in this matter? Other than large boatloads of cash for Oracle, that is.
And I didn't rationalize anything with a "job history", but rather with large systems that actually have been built and are working.
I'd be very interested to see just how many people in this discussion actually have experience building large, scalable systems.
The original article made an asinine, generic statement without any context, or mention of cost, and I said it was silly, and pointed to the obvious (and easiest) reason why it was silly, and that was Oracle.
All of a sudden a bunch of people started making statements and assumptions about the scenario the article was probably talking about, that weren't actually made in the article, to discount Oracle as an option. Cost, commodity hardware, etc., etc. Then people started to point out other large websites that actually HAVE scaled, without the use of Oracle, as if that disproves Oracle's abilities... and yet it totally disproves the statement of the original article.
If you want to get into some context-specific details of why certain specific SQL technologies don't scale well (or at all), at a certain price-point, then that's a whole other discussion that I'd be happy to enter into, and would probably agree with.
It's also interesting to note that most of the "DB technologies" being used to scale those sites aren't DBs at all, but rather various levels of data caching employed to reduce the load on the databases, and they're only applicable because of the generally read-only, non-transactional nature of social sites.
The whole reason I brought up banking sites in the first place is that they are one of the few, more obvious scenarios where most of your end-user interactions are actually hitting the database in real-time, and all data must be current and consistent. There is no real option for caching to save your ass, except at the DB layer itself, via such mechanisms as Oracle's Cache Fusion technology.
Social sites generally don't have any of those real-time, consistent constraints, and are therefore much easier to scale larger, because the nature of the site and the data allows for so much more technology to be used in front of the database.
The plain and simple fact of the matter is that building a large, scalable system is hard work. It requires that you analyze and design ALL aspects of the entire system to scale, not just the database. (Network, caching servers, application, database, hardware, etc).
The example you have given has nothing to do with the technologies, and everything to do with the application requirements.
An Internet Banking site has the need for real-time, centralized, transactional updates. There is no real option to cache a lot of stuff, or to delay the distribution of updates, or to shard/replicate data for reads, etc. It also has the need for real-time transactional replication, sophisticated auditing, global fault-tolerance, etc., etc. It's also a much more transactional site than something like Facebook.
Facebook's application requirements allow for a totally different set of tools to be used in a totally different manner. The dynamics of the site also have a lot to do with how it can be built. For example, Facebook is, for the most part, a read-only, fairly static site. That makes things a HELL of a lot easier to build out with their choice of tech. Same goes for SlashDot.
There are a ton of ways to build out something to the scale of Facebook, Slashdot, or LinkedIn using nothing but open source tools/technologies.
You cannot, however, build a realistic, large scale internet banking site using those same technologies. The only option is something like Oracle.
That's why a truly scalable system is so much more than just bolting on a DB to make it scale. It's about mapping the proper technology and tools to the business requirements, and figuring out how to deal with scaling issues before you even start to write a line of code.
Or, even better, abstract the database layer (Hibernate, etc), so that you can drop in more sophisticated DB technologies as you need to.
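As a sketch of what that abstraction buys you (a hypothetical minimal interface, not Hibernate itself): the application codes against the interface and never sees SQL, so the backend can be swapped later without touching application code.

```python
import json
import sqlite3

class UserStore:
    """Minimal persistence interface the app codes against.
    Swap in a bigger implementation when/if you outgrow this one."""
    def save(self, user_id, data):
        raise NotImplementedError
    def load(self, user_id):
        raise NotImplementedError

class SqliteUserStore(UserStore):
    """Starter backend: a single SQLite file (in-memory here)."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, doc TEXT)")

    def save(self, user_id, data):
        with self.conn:  # transactional: commit or roll back
            self.conn.execute(
                "INSERT OR REPLACE INTO users VALUES (?, ?)",
                (user_id, json.dumps(data)))

    def load(self, user_id):
        row = self.conn.execute(
            "SELECT doc FROM users WHERE id = ?", (user_id,)).fetchone()
        return json.loads(row[0]) if row else None
```

A later `OracleUserStore` (or anything else) just has to honor the same two methods.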
Just because something is free (MySQL), doesn't mean it sucks. Likewise, just because something costs money (Oracle), doesn't mean it sucks.
But there are many reasons (some of them even technical) why lots of companies have no problem paying stupid money for Oracle.
Facebook's actually one of the more dynamic sites out there. I'd bet that the average Facebook user makes many more updates to their Facebook profile than they make financial transactions.
The real difference in requirements is that Facebook can - and does - drop updates on the floor. If your friend throws a sheep at you and you don't get it on your news feed - oh well, Facebook never claims the news feed is exhaustive anyway. But if somebody writes you a check for $10,000 and they see that you cashed it and you don't see it in your account, that's a problem.
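The difference in contracts can be caricatured in a few lines (hypothetical classes, obviously not Facebook's or a bank's actual code):

```python
from collections import deque

class NewsFeed:
    """Best effort: under load, the oldest items silently fall off.
    The product never promised the feed was exhaustive."""
    def __init__(self, capacity=3):
        self.items = deque(maxlen=capacity)

    def push(self, item):
        self.items.append(item)      # may evict an old item; that's fine

class Ledger:
    """Must not lose writes: every entry is kept and the balance
    reflects all of them."""
    def __init__(self):
        self.entries = []

    def post(self, amount):
        self.entries.append(amount)  # losing one of these is a $10,000 problem
        return sum(self.entries)
```

One contract permits shedding load, which is exactly what makes caching and eventual consistency viable; the other doesn't.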
Facebook has the ability to do a TON of edge caching, with very few (relatively speaking) operations having to go to the database to perform an actual write operation.
It's the DB writes that really kill performance and limit your caching strategies and gains.
I find it strange that everyone trotted out their preformed opinions for this piece, but no one comments on the Cassandra article, which actually proposes a solution... although it requires you to learn something, how horrible.
The nice thing about Hive, though, is that you're not limited to just SQL. You can still analyze your files on HDFS any whichaway - with Pig, with your own MapReduce jobs, whatever. Personally, I look forward to Apache Pig getting SQL, and being able to run SQL queries on any intermediate state of a Pig script.
I don't think that anyone is really complaining about SQL. SQL is a swell query language for certain kinds of data. They're complaining about static schemas in relational dbs and having to store objects via SQL - reasonable complaints.
Actually I suspect Mr. Wiggins is really only complaining about problems with MySQL and PostgreSQL. He would be less inaccurate if he admitted as much rather than making sweeping uninformed claims about SQL and relational systems.
We use Dell 2950s for our DB nodes in our Oracle clusters. The biggest one right now is 12 nodes.
And if it scales, but costs too much (a relative concept at best), then it somehow doesn't scale any more?
Of course cost matters, and it comes down to what your application requirements are, and what your revenue model is, and your risk management requirements are, etc., etc.
If you want to talk technology only, it's a no-brainer.
If you want to talk cost and context-appropriate implementation of technology, then provide a detailed context, not generalizations.
Again, you're not going to see an online bank use CouchDB as their back-end, and likewise, you're not going to see a relatively free service use stupidly expensive technology.
But yes - if it's too expensive to do, that means it doesn't scale well. Since when does money not matter in everything? The entire point of all of this is to use commodity PCs to achieve linear scalability cost-effectively, and to escape relational structure for data ill-suited to relational schemas.
I don't think anyone is suggesting an online bank should not use existing commercial databases and SQL.
Regarding the importance of money, look at the existing infrastructure in financial institutions. They spend millions of dollars a year on mainframes from IBM. The cost of Oracle compared to this is relatively low.
We're talking about two entirely different markets and applications of "databases" here. The claim of SQL db's not scaling is not true.
I feel like the propaganda here is that because an RDBMS doesn't scale to YouTube or Google size, it sucks - and that's not true. As if SQL is a waste of time because at some point you're going to need to shard your database.
Look, at that kind of scale, you're going to have problems with any solution to any problem. Handling that kind of scale is going to be expensive no matter what solution you implement, whether it's map/reduce or flat files or some other solution.
But deciding to build a system from the beginning on something non-relational because someday you may have to accommodate that kind of scale is an example of premature optimization. The vast majority of features you get with SQL are going to outweigh the limitations of noSQL.
I've worked on some pretty high scale systems built on SQL and yes, there are problems, but there's just something irrational going on here and it's off-putting. It's like we are throwing out the baby with the bath water or something.
DB2, Teradata and Oracle users regularly tackle problems 100x larger than MySQL can handle.
It's not that I don't believe that Oracle can scale better than MySQL, it's just that I haven't seen any convincing data that it can scale so much better that it's worth the cost. I'm no expert, so maybe the data exists. I just haven't seen it.
Seeing the amount of ugly hacks people are willing to come up with and employ and features they are willing to cut, just to handle trivial loads, kinda makes me think that MySQL can only be considered free if your time is worthless.
Note also that he says "SQL database" and not "relational database." The whole article reeks of inexperience.
The retail price (I know discounts can be negotiated, but for a point of reference...) for MS SQL Server Standard edition is $5000, enterprise edition is $25000. I have never had to research Oracle prices but I understand they run even higher.
I worked as an MS SQL Server DBA for a while for a mid-sized company, and I wrote and employed some "ugly hacks" to emulate some of the Enterprise features because management at the time was unwilling/unable to pay for Enterprise Edition.
Google is one of those excellent examples of a non-SQL datastore that works just fine, and seems to blow the socks off anything the competition has come up with to date.
It is not a point of whether or not it is expensive. Scaling (nearly) always has expense associated with it. The issue is how much expense, and with some applications it is significantly less with one of the "no-SQL" solutions.
> But deciding to build a system from the beginning on something non-relational because someday you may have to accommodate that kind of scale is an example of premature optimization. The vast majority of features you get with SQL are going to outweigh the limitations of noSQL.
You are making it sound as though a relational database is the correct choice barring any scaling. Perhaps you have not yet thoroughly evaluated some of the alternatives out there, because in many applications, there would be no step back from *SQL.
This whole post could be summed up as "ACID doesn't scale", which has been proven. Consistency, Availability or Partition Tolerance; pick two (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1...).
Good introduction to non-ACID databases:
The mere fact that he's talking "RAID" instead of SAN speaks volumes. (No pun intended). Any storage engineer worth their salt would be rolling their eyes right now.
The doc you linked to was authored in 2002, about the same time that Oracle's RAC was introduced (late 2001). Some of his citations are from the 80's. THE 80'S!
DB technology has come a LONG way in 8 years, and that paper is no longer valid, unless you're talking about some basic, "minor"/open source db technologies.
If you want to say "open source DB technologies have problems scaling", then go right ahead, and I'll agree.
Just don't mind those of us who continue to build large, scalable systems, using the proper DB technologies, that disprove that "sql doesn't scale" generalization.
Go take a look at Adam Leventhal's work with the Fishworks stuff.
Specifically, go check out the Sun Storage 7310.
It will scale HUGE, and is nowhere near the stupid cost of NetApp or EMC or the other major vendors.
I'd love to see how a bunch of commodity PC's will scale to 100TB, and still be manageable, and have anywhere near the same feature sets.
Again, if you're railing against something as ubiquitous as a SAN, I'm not sure there's anything I can say to change your mind.
Not that I'm really here to change your mind.
Again, the original focus was about SQL not scaling, and you seem to be fixated on the cost of that scaling.
100T is 666 spindles when you go RAID10 with 300G drives (common size in the SAS/FC area).
If you go SATA then it'll fit on 200 spindles using 1T drives. I'll stick with the latter variant for now because I'm too lazy to look up Sun's pricing for SAS spindles, for my comparison below.
So, 200 spindles amounts to roughly 15 hosts (throwing in a few spares for good measure). The whole setup will comfortably fit into one rack, including the FibreChannel machinery and other fluff that you'll likely want.
Thus from the hardware side this is trivially manageable; 100T is just not a lot of data nowadays.
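A quick sanity check of the spindle arithmetic above, using the same round figures:

```python
def spindles(usable_tb, drive_tb, raid10=True):
    """Drives needed to expose a given usable capacity.
    RAID 10 mirrors every drive, so raw capacity is double the usable,
    hence double the spindle count. Round figures, no hot spares."""
    mirror_factor = 2 if raid10 else 1
    return int(usable_tb / drive_tb) * mirror_factor
```

With 300G SAS/FC drives that gives the quoted ~666 spindles for 100T, and with 1T SATA drives it gives 200.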
On the software side it's up to your creativity and mostly depends on what you actually need to store. I've seen people set up commodity Postgres clusters that way, as well as fancier things like HDFS, GFS or homegrown storage layers. And it worked.
And, regardless of the indeed relatively sane pricing of the Sun (formerly StorageTek) products, the bottom line is what makes the difference.
In figures, for a 100T SAN on the 7310 you're looking at something like $50k for one head, plus around $75k for three trays. We're in the $125k ballpark, hardware-only. And I'm being rather optimistic here: This setup actually holds only 96T and the head is maxed out (3 trays max per controller/head). That means your next upgrade will incur another $25k markup for the next head, good thing you didn't ask for 150T...
Squeezing the same amount of storage into 15 min-spec supermicro pizzaboxes I arrive at roughly $3000 per node, including spindles. A good buyer will get them cheaper. This commodity cluster sets us back only ~$45k in hardware.
That's 1/3 the price of the 7310 solution, being optimistic on the Sun and pessimistic on the commodity side.
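Spelling out the arithmetic (all figures are the ballpark, hardware-only prices quoted above):

```python
# Ballpark hardware-only figures from the comparison above.
san_head = 50_000                      # one 7310 head (maxed out at 3 trays)
san_trays = 3 * 25_000                 # three trays, ~96T usable
san_total = san_head + san_trays       # $125k for the SAN route

commodity_total = 15 * 3_000           # 15 min-spec pizzaboxes, spindles included

ratio = commodity_total / san_total    # roughly a third of the SAN price
```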
That kind of difference makes up for a lot of development effort for the custom solution - most of which is a one-time investment anyway and yields better flexibility in the long run. It's also the reason why most of the big boys don't use off-the-shelf SANs for their primary storage.
Allowing for redundancy (RAID 5 on six disks for data, and mirrored boot drives), we ended up with 6.5TB of storage at about $3500 (that includes an i7 proc, 12GB of RAM, and a fancy controller). My labor added some more on top of the $3500.
Mostly, we've been happy. It is more sensitive to heat issues than our HP G5s, and of course, generates more of that heat. If I were to go with any kind of density, I'd need a much more robust cooling solution.
Wrt cooling, for actual servers (i.e. not storage nodes) we've had good success with Sun XFires, the 4100+ range. They are very well built and the markup over the SuperMicro junk is minimal (around 15% last time I checked).
Among the niceties is a nice array of hot-swap fans - something that SuperMicro still doesn't seem to deliver in their popular chassis.
I'd also consider the XFires for storage nodes if they'd take 3.5" drives, but afaik all their low-end models max out at eight 2.5" slots. That's just not a worthwhile density compared to the larger SuperMicro tins (which go up to 30 x 3.5" now, I think).
The 2.5" drives are good, and seem to be becoming typical, because they run much cooler. We've got some G5s in the same closet with just as many drives, and they don't run nearly as warm.
But of course, if you're updating the data often, you wouldn't use Hadoop (or if you were, you'd run HBase or some such on top of it). But there are many use cases at that scale where you only write once. And in those cases, from my perspective, it's much nicer to scale on commodity hardware than on big iron.
Open-sourcing is the last step in the technology lifecycle, after the technology has become widely understood and commoditized. When people can still make money off something, they will. What's the incentive for them to give it away?
And if you insist on being cheap, don't be surprised when it turns out that your free database was indeed some cheap stuff which doesn't come fully featured.
> Don't be surprised when it turns out that your free database was indeed some cheap stuff which doesn't come fully featured.
But if being "fully featured" is the goal, then it would be very surprising indeed if what you considered a "competitor" was not, in fact, fully featured. It would not, then, by definition, be a competitor. Or, at least, it would not be worth calling version 1.0 yet.
To refine that: Oracle currently has no FOSS competitors, because there are no FOSS databases that are trying to compete with Oracle. They may be trying to take parts of Oracle's market share, but this is a different thing—optimizing their fit for a situation where Oracle itself is a bad fit.
When I think artificial scarcity, I think of something like De Beers, which has warehouses full of diamonds they're hoarding. Whereas with software, there actually is a scarcity of great programmers. There are only so many of them.
With software, you know... it takes time and energy to support it. It costs money to write help documentation, pay for servers where people can download it. All that stuff costs money.
Is open source anti-capitalism?
Is the artificial scarcity coming from the thought that, because software is just 1's and 0's, it can be copied almost infinitely? It takes a lot of time and energy to build something that is worth copying. What about all the research and development and risk that went into building it? How are those risk takers and R&D staffs going to generate a return on that investment without charging for the product they create?
To put it another way, the "default state" of a naturally scarce commodity, when it is simply produced and then discarded, is still high-value. It can be resold at auction, or traded in a market, and will be fought over even in a state of legal and moral anarchy. The default state of an artificially scarce commodity, however, is value asymptotically approaching zero; without some sort of agency to "prop up" its value, it is worthless, and no secondary party will assign it value unless it is forced to as part of a larger deal (e.g. accepting the legal code to be a citizen of a country.)
Your question ("is open source anti-capitalism?") is just a matter of equivocation. Open source fits just fine with capitalism, but it is not the capitalism you would first imagine (i.e. that of the US); it is instead "true", or laissez-faire, capitalism.
In laissez-faire capitalism, there can be no artificially scarce commodities; they are simply a market inefficiency to be eliminated, along with those producing them. A true capitalism would destroy the value in all "information products" - movies, books, music, games, and, yes, software. In a true capitalism, value is just "what people are willing to pay"; you don't deserve an ROI just because you worked hard, you only earn money if people feel that your commodity has value to them. In such a market, the only software that could exist is that which was produced for motives other than profit - open source software - or produced as a means to an end of profit, e.g. software that makes a business process more efficient, rather than software that is a product in and of itself.
And now you see why we do not use such a capitalism ;) Though some "creatives" would survive, whether as on-the-whole consultants or by donations from fans (see http://pc.ign.com/articles/967/967564p1.html), the majority would not. In reality, a majority of people desire to keep these creatives around and producing, even at the cost of large market inefficiencies. Realize thus that copyright, more than anything, is a form of socialism, in that it redistributes wealth to those we think deserve a "fair share" for their efforts. Sort of mixes up the arguments most people have on the subject: Open Source is anti-socialist ;)
Good customer service and a deep understanding of specific technical issues are not trivially reproducible.
From that, I gather that we can't use the traditional economics of supply and demand to let a market decide the value of software. Because the denominator in the equation - demand divided by supply - is essentially infinite, it doesn't matter what value demand has in the numerator: anything divided by infinity is zero.
But my initial impression of this model is that it is doomed to failure - doomed because the artificially scarce product being produced itself requires the consumption of naturally scarce products that are subject to the laws of supply and demand. The developers that produce the software, for example, need to eat, need to live in a house, need to consume energy to drive to work, etc...
So how do we find a balance? Do we find a new way to value software or will software just go away? Will software be relegated to a charitable organization like in the article you linked to, which was quite interesting by the way, so thanks for that. Will the only valuable software be that software which is paid for in advance to be developed at the risk of the consumer? "If you pay me, I'll build it, otherwise, I won't" sort of scenario?
It's like the software only has value if it doesn't exist. If the mechanism to produce the software exists only in the minds of a scarce few who can implement the solution or who have the ability to control access to it, perhaps through SaaS or some other monthly subscription mechanism like WoW or battle.net.
Perhaps we can value software based on how much additional revenue it helps you generate or how much savings it generates through optimizations or automations.
How are we going to strike that balance? Is the gold rush for software over? If there is no carrot, who will run those wheels and invest in our future? Who will take the time to solve the problems before the problems arise? Take Oracle vs. MySQL. Oracle has solved a lot of the problems MySQL has. Over time, as people contribute to the MySQL code base, those problems can be resolved, but consumers of MySQL have to wait for someone else to implement the solution, or they have to pay for a computer programmer to find a work around or implement some solution outside MySQL.
It feels like it will slow us down. Corporations like Oracle and Microsoft will survive a little while longer, but if MySQL ever actually does become as good as Oracle, then people will stop paying for Oracle. If people stop paying for Oracle, Oracle can't hire the best and the brightest.
Well, I appreciate the answer, I do understand better, but that just leaves me with more questions, so I'm thinking out loud.
All I can think of right now is that software will all move to aaS models or be embedded in physical devices so that we can attach a natural scarcity to them. There are only so many factories that can produce microchips. The article you linked to suggested selling plastic figurines with the software, which is a similar idea. It harkens back to the age of dongles.
My former company tried to do this. People balk at this. "Why should I pay when there's 'free?'"
First, kernel developers like Linus, Ted, Randy, etc. aren't really GNU folks.
But of course, designing DBs in this new way makes most of the SQL features unnecessary in most scenarios... and you still pay their overhead. And there are a number of other limits once complex designs are off the table; for instance, there is no way to get the data back in its natural ordering (the order you pushed it in, or the reverse)...
This is where KV stores start to be interesting as a real-world alternative, not just as a way to learn the paradigm of scalability. But again, if you don't trust KV stores, it is entirely possible to use an SQL database in a more conservative and scalable way.
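A minimal sketch of that "conservative SQL" idea, using SQLite as a stand-in (the table layout and helper names here are made up for illustration): one opaque-value table, no joins, no cross-row constraints, so rows could later live on any shard, while an autoincrement id keeps the natural insertion ordering that plain KV stores lose.

```python
import json
import sqlite3

# One flat key/value table; the autoincrement id preserves insert order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " k TEXT UNIQUE, v TEXT)")

def put(key, value):
    # Values are opaque JSON blobs; the database never looks inside them.
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
                 (key, json.dumps(value)))

def get(key):
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

def scan_in_insert_order(reverse=False):
    # The one thing a bare KV store can't give you back: natural ordering.
    order = "DESC" if reverse else "ASC"
    return [(k, json.loads(v)) for k, v in
            conn.execute(f"SELECT k, v FROM kv ORDER BY id {order}")]
```

Nothing here uses SQL beyond what a KV store offers, except the ORDER BY on the surrogate id, which is exactly the "natural ordering" feature mentioned above.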
I find this claim laughable. If the databases Mr. Wiggins chooses to work with fail to meet his scalability requirements, perhaps he should consider different databases. Scalable SQL databases have been around for over 3 decades.
Companies such as Teradata have long offered SQL systems which meet the classic definitions of scalability and availability.
I think TimothyFitz's reply above was accurate. Mr. Wiggins' article would be better titled something like "ACID databases have scalability problems, especially the cheap ones startups use," but then, like http://news.ycombinator.com/item?id=690653 , it wouldn't get much response.
I do not believe this is usually true, for two reasons. One may be nitpicking, but 'where to get the data' is not part of the business logic: it's pure application logic, dependent on your solution to the problem. In that sense, his argument is wrong. The other reason definitely isn't nitpicking: you can solve the problem by adding a layer between your DAOs and the databases that handles the 'where to get the data' question. So yes, it requires some programming, but it is still transparent to the business logic. It does not require an invasive change in your application, and I think he is grossly exaggerating this point.
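A rough sketch of what that extra layer could look like (class and method names are invented for illustration): the router owns the 'where to get the data' question, and the DAOs above it never learn the topology.

```python
import hashlib

class ShardRouter:
    """Hypothetical layer between the DAOs and the databases:
    callers ask for a connection by entity key, nothing more."""

    def __init__(self, connections):
        # One DB-API connection (or connection pool) per shard.
        self.connections = connections

    def for_key(self, key):
        # Stable hash, so the same key always routes to the same shard.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.connections[int(digest, 16) % len(self.connections)]

class UserDAO:
    """Business-facing code; it never sees shard topology."""

    def __init__(self, router):
        self.router = router

    def load(self, user_id):
        db = self.router.for_key(user_id)  # the 'where' question, answered here
        return db.execute("SELECT * FROM users WHERE id = ?", (user_id,))
```

Re-sharding then means changing only ShardRouter, not the business logic, which is the non-invasive property claimed above.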
Software Engineering saves the day!
As a junior member of Hacker News, I demand an education when I'm downvoted! :)
Most of your code is data-handling code on some fundamental level, so it's pretty inevitable that it's going to matter to your application how that data is stored and what sorts of operations you can do on it, and that's especially true if you care about performance (which you almost always do past a certain scale). So even if you have a nice abstraction of your query layer, the kinds of questions you can ask the database efficiently depend heavily on the underlying storage mechanism, and so your application logic has to be built so it only asks the right kinds of questions.
Even abstracting away the differences between, say, Oracle and MS SQLServer and MySQL is difficult enough, because of the different capabilities and performance characteristics. When you do that, you basically end up coding to the lowest common denominator, which can often limit what your application does.
Trying to come up with an abstraction that can encapsulate the difference between a row- versus column-oriented database, or between a relational database and some other kind of storage, is pretty much a losing proposition: they're too fundamentally different in terms of what kind of data you can store, how you can store it, how you can query it, what kinds of transactional guarantees you get, and what operations are fast and which ones are slow.
So you really do kind of have to take your best shot at it, choose an approach, and if you choose wrong and have to change, it's just going to hurt. A lot. Good encapsulation and abstraction will ease some of the pain, but it's more like drinking whiskey before your leg gets sawed off than it is like general anesthesia.
That simply isn't true.
For me, thousands of transactions per second and tens of terabytes of data on a single database is normal. It's unremarkable, it's everyday, it's what we do, we have done it for years. And I know of installations handling 10x that. It's only people whose only "experience" is websites that whinge about how RDBMS can't handle their tiny datasets.
When hundreds of companies and thousands of the brightest programmers and sysadmins have been trying to solve a problem for twenty years and still haven’t managed to come up with an obvious solution that everyone adopts, that says to me the problem is unsolvable.
I can appreciate the point the author makes about partitioning schemes requiring heavy integration with the business logic, but I disagree with the claim that sharding doesn't work.
It's probably more accurate to say that sharding only works well if you design your database very carefully, or just get lucky about how your data model maps to sharding schemes. A bad sharded database can probably hobble your app, but a good one does get you remarkably close to true horizontal scaling.
In my experience, at least.
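To make the "design your database very carefully" point concrete, here is a toy sketch (dicts standing in for databases, names invented) of a shard mapping that fits the data model: everything for one user lives on one shard, so per-user operations scale horizontally, while cross-user queries are where the scheme leaks.

```python
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for real databases

def shard_for(user_id):
    # The careful design choice: co-locate all of a user's rows.
    return shards[user_id % NUM_SHARDS]

def save_event(user_id, event):
    # Single-user writes touch exactly one shard.
    shard_for(user_id).setdefault(user_id, []).append(event)

def user_events(user_id):
    # Single-user reads are one shard, one lookup: this is the path
    # that gets you "remarkably close" to horizontal scaling.
    return shard_for(user_id).get(user_id, [])

def all_events():
    # Cross-user queries degrade to a fan-out over every shard plus a
    # merge, which is where the caveat above bites.
    merged = []
    for shard in shards:
        for events in shard.values():
            merged.extend(events)
    return merged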
Then there's the whole distributed transaction thing, which is so painful that some people just live with wrong answers instead.
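For readers who haven't hit this: the standard answer is two-phase commit, and a toy sketch shows why it hurts (this is an illustrative model, not any real transaction manager; it ignores durability, timeouts, and coordinator failure, which are exactly the painful parts).

```python
class Participant:
    """Toy resource manager for a two-phase commit sketch."""

    def __init__(self):
        self.staged = None
        self.value = None

    def prepare(self, value):
        # Phase 1: vote yes by staging the write (durably, in real life).
        self.staged = value
        return True

    def commit(self):
        # Phase 2: make the staged write visible.
        self.value = self.staged

    def abort(self):
        self.staged = None

def two_phase_commit(participants, value):
    # Coordinator: commit only if every participant votes yes; any "no"
    # (or, in reality, a timeout) must abort everyone. Between prepare
    # and commit, every participant is blocked holding locks, and a dead
    # coordinator leaves them stuck, which is why some people prefer to
    # live with occasionally wrong answers.
    if all(p.prepare(value) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

group = [Participant(), Participant(), Participant()]
committed = two_phase_commit(group, "balance=100")
```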
Also, going back to square one isn't really an option; shards usually come out of a system that has grown, not a ground-up design. Ultimately I still support the distributed model myself, but a sharded model does support relational data much better than a distributed DB (at least out of the box).
Well, perhaps that's as difficult as solving the "SQL scalability that 20 years of brilliant programmers haven't solved"?
...because he declines to attempt to reframe the question.
[ or maybe he's leaving us wanting more: like his next post? ]
Anybody up for reformulating the question?
(the other part is made up of file system scalability issues).
Once you get past a certain level these are non-trivial problems, and anybody out there that is busy solving them has my interest.
1. Mentions RAID, not SAN
2. Mentions MySQL and only MySQL (with the exception of PostgreSQL once).
3. Mentions Master-slave replication as a killer scalability feature.
If you start sharding you may get write gains, but you have to work very hard to keep things consistent (depending), and you may have to duplicate shards to make them highly available ($$$). Oh, and later down the track your schema might change in ways your sharding scheme just isn't flexible enough to deal with, and depending on who you are, that may be too much of a risk.
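The inflexibility risk is easy to see with a naive modulo scheme (a deliberately simple illustration, not a recommendation): change the shard count and most keys change homes, so a "small" growth step implies a mass migration.

```python
def shard_for(key, num_shards):
    # Naive placement: key modulo shard count.
    return key % num_shards

# With 4 shards, key 10 lives on shard 2...
before = shard_for(10, 4)
# ...grow to 5 shards and the same key now maps to shard 0,
after = shard_for(10, 5)
# and in general the bulk of existing keys must be migrated before
# reads find their data again.
moved = sum(1 for k in range(1000) if shard_for(k, 4) != shard_for(k, 5))
```

Here 800 of 1000 keys move for a single added shard; schemes like consistent hashing exist precisely to shrink that number, but they add their own complexity.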
Besides, the issue is really with availability, consistency and performance. It is very hard to scale all three of these together, and even your cash-cow solutions will hit their limits (although some of their limits are quite high :))