I'm sure schema-less datastores are a huge win for your MVP release when it's all greenfield development, but from my days working for enterprises, it seems like you're just begging for data inconsistencies to sneak into your data.
Although, in the enterprise, data actually lives longer than 6 months--by which time I suppose most start ups are hoping to have been bought out.
(Yeah, I'm being snarky; none of this is targeted at bu.mp, they obviously understand pros/cons of schemas, having used pbuffers and mongo, I'm more just talking about how any datastore that's not relational these days touts the lack of a schema as an obvious win.)
Yeah -- that's one of the huge benefits of marking a field as ``required'' in a protobuf. The ability to enforce a contract prevents a ton of unexpected and incomplete data making it on disk (and also, e.g., across the wire to clients). Having strict types represented in the serialization format is also handy; when one pulls out an int32 from a protobuf it's going to be an int32, and not an integer that somehow found its way into being a string.
Could you elaborate more on how you have used Protobuffs, as I'm not sure I fully understand. I've previously used Riak in Erlang and Ruby projects, so am fairly familiar with how it works. It exposes a HTTP and Protobuffs API which allows you to store objects of arbitrary types (JSON, Erlang binary terms, Images, Word Documents, etc). From the sounds of it you are serialising a Protobuffs packet and sending this as the content of the object. Why did you choose this, over say JSON, which MongoDB uses?
Data inconsistencies will always sneak in unless you're vigilant.
SQL data stores provide a way to limit certain kinds of inconsistency, but a) I rarely see a system that uses all of that power, and b) there are plenty of inconsistencies that you can't prevent with standard SQL features.
Personally, I'm ok with schema-less stores in the same way I'm ok with saving files on disk. I don't expect my filesystem to enforce application-level file format quality. I just expect it to store things and give them back when I ask. That doesn't mean I don't care about data integrity, it just means I solve the problem somewhere else in the system.
I was reading along and nodding my head until I got to the 1000 line haskell program that handles issues stemming from a lack of consistency.
I'm not exactly a SQL fanboy, but maybe ACID is kinda useful in situations like this and having to write your own application land 1000 liners for stuff that got solved in SQL land decades ago isn't the best use of time?
Indeed, these problems were solved years ago, for single servers. Conflict resolution in distributed systems is a significantly more complex problem, and invariably requires tradeoffs specific to the app. Those 1000 lines are likely declarations like "Merge changes to this list via set union", "Merge changes to this set using 2P-Set CRDTs", and so forth.
The RDBMS space has been addressing how to do this with distributed systems for quite some time as well (a couple of decades at least), and at least the more sophisticated systems tend to support a fairly broad set of approaches to addressing this problem (and means of expressing those choices quite succinctly).
While these are all great products and can provide highly available layers on top of MySQL, only MySQL cluster satisfies rtdsc's requirements, and I have my doubts about its behavior around clocks. In general, the MySQL HA space prefers CP to AP.
"So how do you do what they did with MySQL. Can you build an always available multi-master cluster that is available, partition tolerant and eventually consistent?"
Obviously in the article they picked Riak. So yes Riak and other Dynamo-type database have those properties. I asked if the properties of Riak+custom eventual consistency code can be achieved with MySQL (as one of the above comments was asking people to maybe not jump to all these fancy and exotic solution when MySQL is available)
Aha. I interpreted "what they did with MySQL" as "What the MySQL team did by writing a successful database", but you meant "what the Bump team did with their application, were it to be on top of MySQL." Got it.
I sorta get that, but it's not like Postgres/MySQL/etc are operational nightmares. It's just that going from one barely proven DB to another barely proven DB in a migration that involves a boatload of application side code seems heavy handed from the outside.
it's not like Postgres/MySQL/etc are operational nightmares
Every database is an operational nightmare. Just wait until MySQL segfaults every time it gets more than 4 SSL connections at once, or takes four hours to warm up its inno cache. Or Cassandra compaction locks up a node so hard its peers think its down, turning your cluster into a recursive tire fire of GC. Or a Riak node stops merging bitcask and consumes all your disk in a matter of hours. Or Mongo ...
My point is that every database comes with terrifying failure modes. There is no magic bullet yet. You have to pick the system which fits your data and application.
You don't seem to have any experience with solid, mature RDBMSs like PostgreSQL or Oracle or SQL Server. MySQL sucks in so many ways and all the other names you mentioned are not RDBMSs. People running large numbers of (for example) MS SQL Servers will tell you that they are ridiculously robust and mature and do NOT randomly crash in the middle of the night. And even when your hardware lets you down your hot failover machine is ready to take over without skipping a beat.
Magic bullet? These systems are the closest you can find in the world of software.
<sigh> Or when the Oracle SCN advances too quickly, causing random systems to refuse connections or crash outright. Just because software is large, supported, and mature doesn't mean it is free of serious design flaws.
[Edit:] Don't get me wrong, Oracle's DBs are serious business; an incredible feat of engineering. If I could afford them I'd probably use them more often. But everything in this business is a cost tradeoff--in licenses, in dev time, in risk.
Aware Postgres is free: it's an excellent database and I've used it successfully (like most of the other tools I've mentioned.) It also has spectacular failure modes; though the only ones I've personally encountered were around vacuuming.
I'll agree they are not operational nightmares, but now that we're set up with Riak we can do things that I'm pretty sure your Postgres/MySQL/etc. setup cannot.
1. Add individual nodes to the cluster and have it automatically rebalance over the new nodes.
2. Have it function without intervention in the face of node failure and network partitions.
3. Query any node in the cluster for any data point (even if it doesn't have that data locally).
I'm sure there's other things I'm missing, but the point made by timdoug is the key one. We're at a scale now where it's worth trading up-front application code for reliability and stability down the line.
Yes, Riak gives you certain flexibilities that make certain things easier. If a node dies, you generally don't have to worry about anything. Stuff Just Works. (Well, in the Riak case, this is true until you realize that you have to do some dancing around the issue that it doesn't automatically re-replicate your data to other nodes and relies instead on read repair. This puts a certain pressure on node loss situations that I find is very similar to traditional RDBMS.)
But of your list, I have done all of these things in a MySQL system and for a comparable 1000 lines of code.
1. We implemented a system that tracks which servers are part of a "role" and the weights assigned to that relationship. When we put in a new server, it would start at a small weight until it warmed up and could fully join the pool. Misbehaving machines were set to weight=0 and received no traffic.
2. Node failure is easy given the above: set weight=0. This assumes a master/slave setup with many slaves. If you lose a master, it's slightly more complicated but you can do slave promotion easily enough: it's well enough understood. (And if you use a Doozer/Zookeeper central config/locking system, all of your users get notified of the change in milliseconds. It's very reliable.)
Network partitions are hard to deal with for most applications more so than most databases. It is worth noting that in Riak, if you partition off nodes, you might not have all data available. Sure, eventual consistency means that you can still write to the nodes and be assured that eventually the data will get through, but this is a very explicit tradeoff you make. "My data may be entirely unavailable for reading, but I can still write something to somewhere." IMO it's a rare application that can continue to run without being able to read real data from the database.
3. In a master/slave MySQL environment you would be reading from the slaves anyway unless your tolerance for data freshness is such that you cannot allow yourself to read slightly stale data. I.e., payment systems for banks or things that would fit better in a master/master environment. Since the slaves are all in sync, broadly speaking, you can read from any of them. (But you should use the weighted random as mentioned in the first point.)
Please also note that I am not trying to knock Riak. It's neat, it's cool, it does a great job. It's just a different system with a different set of priorities and tradeoffs that may or may not work in your particular application. :)
But to say that it can do things the others can't is incorrect. Riak requires you to have lots of organizational knowledge about siblings and conflict resolution in addition to the operational knowledge you need. A similar MySQL system requires you to have a different set of knowledge -- how to design schemas, how to handle replication, load balancing, etc.
Is one inherently better than the other? I don't think so. :)
Overall I enjoyed reading the article and the specific tradeoffs you had to consider when comparing Riak to Mongo for your specific use cases. I'm curious if you had any problems w/r/t the 3 items you highlighted above when using Mongo, as none of those were mentioned in the linked article.
(these points are for MongoDB 2.0+)
1. Adding a new shard in Mongo will cause the data to be automatically rebalanced in the background. No application-level intervention required.
2. Node failure (primary in a RS) is handled without intervention by the rest of the nodes in the cluster. Network partitions can be handled depending on what the value set for 'w' is (in practice it's not an isolated problem with a specific solution).
3. Using mongos abstracts the need to know what node any data is on as each mongos keeps an updated hash of the shard keys. Queries will be sent directly to the node with the requested data.
Can't reply any deeper, figure this is relevant...
Databases are, in my experience and those of my peers, the aspect of a system most likely to crash and burn. On the surface, it looks like regular old bugs--every project has them. But there's a reason databases are so hard: they're the perfect storm of unreliable systems, bound together with complex and unpredictable code to try to present a coherent abstraction. Specifically:
1. DB's need to do disk IO for persistence, which pits them against one of the slowest, most-likely-to-fail components of modern hardware.
2. They must maintain massive, complex caches: one of the hardest problems in computer science in general.
3. They must be highly concurrent, both for performance on multicore architectures and to service multiple clients.
4. Query optimization is a dark art involving multiple algorithms and frankly scary heuristics; the supporting data structures can be quite complex in their own regard.
5. They need to send results over the network, an even higher latency and less reliable system.
6. They must present some order of events: whether ACID-transactional, partially ordered with vclocks, etc.
Almost all these problems interact with each other: a logically optimal query plan can end up fighting the disk or network; concurrency makes data structures harder, networks destroy assumptions about event ordering, caches can break the operating system's assumptions, etc.
Moreover, the logical role of a database puts it in a position to destroy other services: when it fails, shit hits the fan, everywhere. Recovery is hard: you have to replay logs, spool up caches, hand off data, resolve conflicts. Yes, even in mostly-CP systems like MSSQL. All of these operations are places where queues can overflow, networks can fail, disks can thrash, etc.
Furthermore, the quest for reliability can involve distributing databases over multiple hosts or geographically disparate regions. This distribution causes huge performance and concurrency penalties: we're limited by the speed of light for starters, and furthermore by the CAP theorem. These limits will not be broken barring a revolution in physics; they can only be worked around.
The only realistic approaches (speaking broadly) are CP (most traditional DBs) and AP (under research). CP is well-explored, AP less so. I expect the AP space to evolve considerably over the next ten years.
First, I'm going to quibble with #5. The networks databases typically must communicate over are typically much lower latency and are at the very least far less subject to spontaneous hardware failure (human error is another matter that I'll concede is certainly debatable as that has a lot to do with the complexity of your storage and network systems...).
From elsewhere (lost the link) you wrote:
> My point is that every database comes with terrifying failure modes. There is no magic bullet yet. You have to pick the system which fits your data and application.
I'm totally with you in this part. "State" is the magic from whence most (all?) software bugs seem to stem. This has lead to emergent design time behaviour such that as few of the components as possible manage the authoritative logical state of our systems. It lead to it being desirable for those few remaining stateful components to be as operationally resilient in the face of adversity as conceivable. These forces have conspired to ensure herculean efforts to make such systems resilient and perfect, almost as much as they have conspired to ensure those components present the most impressive catastrophic failure scenarios. ;-)
While on one hand I completely agree that one needs to select the system which best fits your data and application, the maturity of more established solutions ensures they've been shaped by a far more robust set of different contexts. They've crossed off their list far more "fixing corner case X is now our top priority" moments than those who have come since. They really can better address a much broader range of the needs of at least applications that have been developed to date. By comparison most "NoSQL" stores have advantages in a fairly narrow set of contexts. Particularly if you are a startup and really don't know what your needs will be tomorrow, the odds favour going with as high a quality RDBMs as you can get your hands on.
I really do think far too few startups consider the reality that the thought process should probably be, "Are the core technical problems that stem from our long term product goals fundamentally not well suited to solutions available with RDBMS's?", and only asking about alternatives if the answer is no.
Another way to look at is is, from 1-6, only #4 could be argued as not being applicable to a file server (I'd throw in SAN too, but those are so often lumped in with RDBM's, so let's skip that one)? And really, #4 is both simpler and harder for filesystems: they have a far more trivial interface to contend with... but consequently have a far more difficult time determining/anticipating query access patterns for selecting optimal data layout, data structures and algorithms to maximize IO efficiency.
Why don't we worry as much about choosing a filesystem to suit our app as we do a database?
First, I'm going to quibble with #5. The networks databases typically must communicate over are typically much lower latency and are at the very least far less subject to spontaneous hardware failure (human error is another matter that I'll concede is certainly debatable as that has a lot to do with the complexity of your storage and network systems...).
I really wish this were true man! For many shops it is, but under load, even local networks will break down in strange ways. All it takes is dropping the right syn packet and your connect latency can shoot up to 3 seconds. It's also common to have a database system with 0.1+ light-second distances between nodes, which places a pretty firm lower bound on synchronization time.
I'm totally with you in this part. "State" is the magic from whence most (all?) software bugs seem to stem.
Absolutely agree; databases are simply the most complicated state monads we use. I actually count filesystems as databases: they're largely opaque key-value stores. Node-networked filesystems like coda, nfs, afs, polyserve, whatever that microsoft FS was... those are exactly databases, just optimized for a different use case and with better access to the kernel. SANS are a little less tricky, at least in terms of synchronization, but subject to similar constraints.
While on one hand I completely agree that one needs to select the system which best fits your data and application, the maturity of more established solutions ensures they've been shaped by a far more robust set of different contexts. They've crossed off their list far more "fixing corner case X is now our top priority" moments than those who have come since. They really can better address a much broader range of the needs of at least applications that have been developed to date. By comparison most "NoSQL" stores have advantages in a fairly narrow set of contexts.
You're totally right; traditional RDBMSs have benefited from exhaustive research and practical refinement. Strangely, most other old database technologies have languished. Object and graph stores seem to have been left behind, except for a few companies with specific requirements (linkedin, twitter, google, FB, amazon, etc). AP systems in general are also poorly understood, even in theory: the ongoing research into CRDTs being a prime example.
So to some extent the relatively new codebase of, say, Riak contributes to its horrible failures (endless handoff stalls, ring corruption, node crashes, etc). Those will improve with time; I've watched Basho address dozens of issues in Riak and it's improved tremendously over the past three years. The fundamental dynamo+vclock model is well understood; no major logical concerns there.
Now... the AP space seems pretty wide open; I would not at all be surprised to see, say, distributed eventually-consistent CRDT-based, partial-schema relational, kv, or graph DBs arise in the next ten years. Lots of exciting research going on. :)
Why don't we worry as much about choosing a filesystem to suit our app as we do a database?
If I had to guess, it's because non-esoteric local filesystems offer pretty straightforward capabilities; no relational algebra, no eventual consistency, etc. If you have access to ZFS it's a no-brainer. If you don't, there are decent heuristics for selecting ext4 vs XFS vs JFS et al, and the options to configure them (along with kernel tweaks). The failure modes for filesystems are also simpler.
That said, I agree that more people should select their FS carefully. :)
This doesn't really answer any of the questions I asked you but from what you wrote here you have to agree that going with a tried-and-true RDBMS that's been around for decades is a much better prospect than choosing one of the new, unproven NoSQL products.
Sorry, was trying to address underlying concerns rather than specific software.
you have to agree that going with a tried-and-true RDBMS that's been around for decades is a much better prospect than choosing one of the new, unproven NoSQL products.
I disagree. Like I said, CP vs AP are two very different strategies for addressing mutable shared state, and there are no AP RDMBSs that I'm aware of. If your system dynamics are fundamentally AP, no number of Oracle Exadata racks will do the job. If your dynamics are fundamentally CP, Riak, SimpleDB, and Cassandra will make your life a living hell. Few applications are all-AP or all-CP, which is why most teams I know use a mixture of databases.
Do you mean the physical server crashing? You should have a hot spare replicated machine for that. If you mean the RDBMS failing randomly at night, that really doesn't happen on mature systems like PostgreSQL or MS SQL Server. I mean it could happen but it's as rare as hen's teeth.
Every time I hear someone make a claim like this I have to wonder how much experience they actually have with these systems in a demanding environment.
Relational databases, all relational databases, have a disturbing tendency to fall over suddenly under load with little advance warning. They don't have a pleasant gradual failure mode. Instead some point of contention goes from 99% of capacity (at which point it takes very little load) to 101% of capacity (at which point everything falls apart).
If you've never experienced this, then I'm pretty confident that you've never scaled a database to its limit.
Which isn't that surprising. Most companies don't have sufficient load to make their database break a sweat. But once you've encountered the region where problems happen, life gets much more difficult.
I've been using SQL Server 5 hours a day for almost 15 years now and yes, I've taken databases to the limits of the hardware many times. We've upgraded our SQL Servers' hardware many times due to increasing load. We've never had a scaling problem that we could not immediately and easily resolve.
If your server falls apart it's usually because of a single bad query and with SQL Server it's super easy to determine which query is at fault and kill it, at which point everything is immediately back to normal with NO LOST DATA. You can even set limits to how much resources a query can consume if you want.
As for the 'no warning' thing, all you need to do is monitor your server. It should not run at 99% CPU or IO capacity at peak times! If you know the limits of your hardware it's really not difficult to monitor the actual usage and plan your upgrades accordingly.
It doesn't matter how bad things get you can rest assured that you'll end up with a consistent database once the dust settles. You can even do database restores up to an arbitrary point in time if you need to! We've fucked up in every way imaginable but we've never lost any data let alone an entire database. I have nothing but praise for SQL Server.
Where I first hit this is in Oracle. Which unexpectedly locked up at a million dynamic pages/hour served out of the database, over lock contention. Ever hit a lock contention issue?
That is the scaling problem that gives no warning. It is humming along fine with reasonable load, and suddenly is falling over. Sure, you can identify the problem query if you have good monitoring, but you can't just kill one instance because there are another 100 coming along in the next second...
Having dived into the guts of those failures, and having talked with expert DBAs for multiple databases, that failure mode is endemic to databases and nobody has figured out how to catch it with monitoring. (At least that was the state of the art not many years ago.)
Yes, there are ways to tune it and to scale it out horizontally. However you never know if you will need to.
If you've got a reporting database, as opposed to a transactional one, scaling is much more straightforward to predict and handle. From your reference to a single problem query, I strongly suspect that that is what you dealt with.
I like SQL engines for moderate data sets that fit nicely on one machine and well within the normal performance envelope. But even there I will often have to try a few different incantations and cross my fingers that one of them will perform reasonably because that's easier than trying to figure out what that 1 MLOC engine is up to. And I don't know anybody who does very large MySQL setups without a lot more hassle than that.
For some things I'd much rather deal with 1KLOC that I had to write myself than the 1 MLOC that I'm scared to even start digging through.
It's a stable, well understood DB vs an immature, not well understood DB AND 1KLOC to deal with not being consistent.
To be clear, I'm not saying any given DB is the OneTrueWay, just that people seem to be a bit cavalier in regards to some of this crap and chasing the newest shiny thing while rediscovering why some of the braindamage in those 1MLOC was put there in the first place.
How do you run your stable, well understood DB that probably uses thread locks and shared memory, across a cluster of 10 machines?
> just that people seem to be a bit cavalier in regards to some of this crap and chasing the newest shiny thing while rediscovering why some of the braindamage in those 1MLOC was put there in the first place.
But usually when people need to scale they need to scale, they usually know it. Since only successful companies need to scale, they probably know a thing or two about their domain and specific dataset.
It used to be that when you wanted to scale a database you used to be one of the large companies out there and your manager went and played golf with an Oracle salesman and you ended up with Oracle all of the sudden. That's what I call "golfware".
Small companies that all of the sudden had to handle 100ks of thousands of client connections was not very common.
So I think don't have a choice but to be cavalier about this crap. They either end up with an overpriced blade server that still has on single lock below all those expensive blades or you have to think of a distributed solution (or you just give up and move out of the way and let others eat your lunch).
Oh, I agree that some people are being a bit cavalier about the new shiny. These folks are migrating away from MongoDB for a reason.
On the other hand, even if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much), that still leaves half that isn't.
The only way we'll find out which part matters is for people to try different approaches. So I fully support experimentation like this. If we don't rediscover for ourselves the good parts in the tools we use, then we're just stuck honoring the traditions of our heavily bearded ancestors.
Sure. But I think the problem space for databases has changed quite a bit in ways that aren't true for TCP, and are only partly true for C.
My dad was writing code at the time, and he saw the big benefit as allowing developers to manage larger amounts of data on disk (10s of megabytes!) without a lot of the manual shenanigans and heavy wizardry in laying out the data on disk and finding it again. Plus, the industry thought the obvious future was languages like COBOL, what with their friendly English-like syntax and ability for non-programmers to get things done directly.
So little of that is true anymore. For a lot of things that use databases, you're expected to have enough RAM for your data. We don't distribute and shard because we can't fit enough spinning rust in a box; we do it because we're limited on RAM or CPU. A lot more people have CS degrees, the field is much better understood, and developers get a lot more practice because hardware is now approximately free. And nobody really thinks the world needs another English-like language so that managers can build their own queries.
TCP, on the other hand, is solving pretty similar problems: the pipes have gotten faster and fatter, but surprisingly little has changed in the essentials.
C is somewhere in between. A small fraction of developers working these days spend much time coding in plain C, and many of its notions are irrelevant to most development work.
But unlike SQL databases, you could ignore C if you wanted to; there were other mainstream choices. That wasn't true for SQL until recently; the only question was Oracle or something else. I'm very glad that has changed.
My apologies for the rant here - it's not directed specifically at you, but toward a general attitude I see on HN and in the tech community.
I was more commenting on your phrase "if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much)".
There has been a TON of academic and industrial research on SQL databases since they were invented in the 70s. Calling them 70s tech is akin to calling LISP 50s tech. The basic ingredients haven't changed much (sets in SQL and lists in LISP), but the techniques on top have evolved by leaps and bounds.
To your point here - there are plenty of companies that have way more data in their databases than RAM available. The early users of Hadoop, etc. were primarily constrained by disk I/O on one machine, rather than constrained by RAM or CPU on one machine. It is certainly convenient that a distributed architecture can solve both sets of problems if the problem space lends itself to distributed computation.
I'm not a huge defender of SQL - I think it has some serious problems - one fundamental problem is lack of strong support for ordered data, and it can be a huge pain to distribute. I agree that having some options with distributed K/V stores is really nice, but you have to admit that much of it hasn't yet been proven in the field.
I, for one, DO think that the world needs something like an english-like language so that "managers" can write their own queries. Honestly, the roles of programmer and data analyst are often wholly different. I think it's a huge kludge that the only people capable of manipulating massive datasets are people with CS degrees or the equivalent. Programmers suck at statistics, and while they're generally smart folks, they often don't have the business/domain expertise to ask the right questions of their data. Software is about enabling people to solve problems faster - why should we be only allowing people with the right kind of academic background to solve a whole class of problems.
Finally, saying that you doubt SQL is relevant to building modern systems is borderline irresponsible. Experimentation with new tools is good - but you have to also keep in mind that people were smart way back in the 70s, too, and that their work may be perfectly relevant to the problems you're trying to solve today.
There has indeed been a ton of research on SQL databases. But still, Stonebraker, one of their pioneers, said that they should all be thrown out:
"We conclude that the current RDBMS code lines, while
attempting to be a “one size fits all” solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of “from scratch” specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow’s requirements, not continue to push code lines and architectures designed for yesterday’s needs." -- http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf
I'm sure a lot of their intellectual work is indeed something I could learn from. But SQL databases are an artifact of a particular moment in technological and cultural history. The thing I really want to learn about isn't the residue of their thoughts as interpreted through 30 years of old code, it's their original insights and how to transfer those to today's world.
Hadoop is a fine example of Stonebraker's point. The original sweet spot of relational databases was easy queries across volumes of data too large to fit in RAM. But Google realized early on that they could do orders of magnitude better with a different approach.
I agree that these new approaches haven't been fully proven in the field, but I'm certainly glad that people are trying.
As a side note, I think the right way to solve the pseudo-English query language problem is by making the query language actual English. If you have a programmer and an analyst or other business person sit next to one another, you get much better results than either one working alone.
Stonebraker's research and commercial ventures the last several years have been focused on building specialized variants of existing database systems. Vertica (C-Store), VoltDB (H-Store), Streambase (Aurora), and SciDB are all specialized DBMS systems designed to overcome the one size fits all nature of things.
Regardless, there's always going to be a balance between specialized systems and platforms, but my point is that we should be willing to trust the platforms that have proven themselves, avoid reinventing the wheel (poorly), and not be too quick to throw them out in favor of the new shiny.
I agree that programmer/analyst working together is a terrific pair, but the beauty of software is that we live in a world where we can have 1 programmer write a platform on which 10 programmers build 10 systems that 1000 users use to get their jobs done and make society that much more efficient.
Oh, I trust the current stable platforms to be the current stable platforms. My beef isn't with people who use them. It's with people who don't know how to do anything else, which was a plague on our industry for at least a decade. At least the people who get burnt on the latest fad will make new and interesting mistakes.
I agree that when we can find ways to let users serve themselves, that's best for everybody. I just don't think universal pseudo-English query languages are the way to do that, because the hard part isn't learning a little syntax, it's learning what the syntax represents in terms of machine operations.
Once the programmer and the analyst have found something stable enough to automate, by all means automate it. Reports, report builders, DSLs, data dumps, specialized analytic tools: all great in the right conditions. But people have been trying to build pseudo-English PHB-friendly tools for decades with near zero uptake from that audience. I think there's a reason for that.
>On the other hand, even if half of that 1 MLOC is still relevant to new ways of building systems (and given that SQL databases are a 70s tech, I doubt it's that much)
SQL is not a technology in the sense of a specific artifact (say, a PDP-11), it's a design based on relational algebra, a formal specification for relational data.
A specific RDBMS implementation might "age", but Math do not age. For as long as we have relational data (data with relations to each other), the relational algebra will be the best, and formally proven correct, way to model it. Period.
The same holds true for every other "tech" that is based on Computed Science. All those technologies are older than the '70 and will be used FOREVER: garbage collection, hash maps, linked lists, regular expressions, b-trees, etc etc, ...
Even specific artifacts remain relevant: TCP/IP, C, UNIX, windowing UIs, etc etc...
The chant that "X is best, period" is a religious notion, not a practical one. You're welcome to worship whatever you please, but for those of us who are here to have a real-world impact, "best" is defined in terms of utility for a particular situation.
Even one of the great RDBMS pioneers, Stonebraker, agrees that RDBMSes are an artifact of a particular era in technology and commerce and should be thrown out and done over:
It is material as a measure of complexity when it becomes necessary for a developer to understand that complexity. Database tuning is a black art to many because databases are very complicated.
It turns out that having less code does in fact fix some issues. Consider Prevayler, for example, an object persistence layer that provides full ACID guarantees in something like 3 KLOC. It's also radically faster. It has a number of limitations (e.g., data must fit in RAM, no query language) but if you're ok with that, it's great: a Prevayler system is radically easier to reason about and optimize than something 300x as complicated.
Also, your 15 MLOC for Linux is a bit of a red herring. Look at this (somewhat old but presumably representative) breakdown here:
70% of the code is in drivers and arch. The kernel itself is circa 1% of the total, which at that point was 75KLOC. I think that they've been so disciplined in keeping it small is part of what has made Linux such a success.
The way I look at it is that a database, much like Linux, is a platform. I very seldom look at the source code for either for day to day programming.
As popular platforms they have in common that they are very well tested through everyday use, and are likely to operate as documented for ordinary configurations.
When you can rely on the correct operation of the system, the code of the underlying implementation is irrelevant. What you care about is how well the system supports your requirements, and what performance you can get by tweaking the available knobs.
Contrast this with your average 1000 line script. It has simplicity on its side, but when something breaks, that script is a suspect, and the source code of your DB probably isn't.
> Consider Prevayler, for example [...]
I'm not really sure what you're saying here. That an in-process in-memory object persistence framework without indexing can be faster than a heavy-duty relational database? That's not just "less code", that's "less features". Or "different features", at any rate; they're not the same species. I'm just going to assume what you mean to say is, "Not everyone needs a relational database".
> Also, your 15 MLOC for Linux is a bit of a red herring. [...]
All the more reason that the LOC count by itself is a meaningless metric.
Sure, databases are a platform all their own. Like any platform, as long as you are operating well within the expected envelope, they work as advertised. When you get near the edge, though, you really need to understand how they work. As we are seeing with the rise of all sorts of RDBMS alternatives, a lot of people are getting near a lot of different edges.
The ability to understand how something works is a function of complexity. LOC is correlated with complexity, so it's a good rough metric. If you have a better one, please offer it. But otherwise I'll stand by my original point, which is that the guy bitching about 1000 lines of consistency code is ignoring the much larger amount of code used in alternative approaches.
What I'm saying with Prevayler's example is that if you don't need all the features of a database, then the extra complexity is a drag on what you're trying to get done. Less features means less code means less work to master.
> All the more reason that the LOC count by itself is a meaningless metric.
Yes, you throwing in a bullshit number is definitely proof that all numbers are bullshit. Bravo.
> The Linux kernel has over 15 million lines of code, people normally don't hold it against it.
They would if all the Linux kernel did was play Tetris for example. The point was that here is 1000 lines of what someone thinks is awkward code to deal with eventual consistency vs 1M lines of code to deal with consistency in another case. If consistency for a particular application can be dealt with in 1000 lines, you should usually go for that instead of for the millions of lines solution.
Think of it the other way. They have availability and partition tolerance from Riak and they can handle eventual consistency with 1000 more lines. Now imagine you have MySQL and you have to make run in a multi-master distributed mode, how many lines of code would you need to handle 2 of the CAPs and then haven an application specific way to handle an incomplete (or untimely third part)? I bet it would be more than 1000 lines...
It is true that the most important thing is a good design, which will hopefully get you good performance with minimal and maintainable code.
However, in my experience, there are almost always additional optimizations that can be done after you have implemented your basic design. Things like "this part could make smarter choices with a more complicated heuristic", "we could use a faster datastructure here, though it requires a lot of bookkeeping", or "We could cut a lot of computation here with an ugly hack that cuts through the abstraction".
Of course, more code makes it harder to change the structure of the program, so it's the classic trade-off of maintainability versus optimization.
A good example of this, besides databases, is CPUs.
Modern CPUs use loads of silicon on complex optimization tricks; out-of-order execution, register renaming, prefetchers, cache snooping. And all that "bloat" is actually making it faster. You can't make a super-fast CPU by removing all the cruft to get a minimal design. (Or rather, you can make it faster for certain cases, but it would be slower at doing almost anything useful.)
>Wtf? Since when did bloat make code "secure and perfomant" ?
WTF? Since when one reads the phrase "Some problems are just hard, and you'll want as much code as is necessary to make it secure and performant." and deduces (who knows by what logic) that the guy means _bloat_ and not _necessary_ code (error checking, code for handling corner cases, etc)?
Not to mention that bloat is a silly term used by non-programmers to mean "this program is large" or "I don't use that feature, so it must weight down the program needlessly".
That is, people who don't understand that features they don't make use of (e.g the full text search capability of MySQL) are not even loaded from disk by the OS in the first place, or that most of the size of a large program like Office is not comliled code but assets (graphical etc).
Why do mysql and PHP apologists think "people have managed to succeed despite deliberately making things more difficult for themselves" is a compelling argument? Mysql and PHP didn't make them succeed, or even help them succeed.
It proves that the technologies are capable products at the most massive scales. Do you have evidence to support that Facebook, Tumblr, or Etsy would have been better off had they chosen different technologies (of course not)? Or that they made things more difficult for themselves? How would Facebook be improved by PostgreSQL? At scale, data is so massively partitioned that the fact that you don't have windowing functions is utterly irrelevant. I hate PHP as much as the next guy, but the facts are that it gets the job done. And that, in business, is what matters. The angst on here about MySQL is unfounded and is largely a symptom of groupthink.
I said capable, not awesome, or cutting edge, or the only thing you'll ever need, or the best thing since sliced bread. Capable is very different from your obviously flawed analogy. Windows 3.1 and the original Macintosh were successful at the time because they were capable products at the time which have evolved into the good, but not flawless products that exist today.
The groupthink MySQL/PHP hate here implies these are terrible products which only a moron with bad taste would use, which is demonstrably false when you look at the choices made by those using them.
I don't see any conflict between the notion that PHP is a turd of a language and that it's perfectly adequate for building something major. If you're willing to spend years holding your nose. Or if you have little aesthetic sense. Or aren't experienced with other languages and don't know any better.
As far as I'm concerned, PHP is the Bud Light of programming languages. Popular and perfectly adequate for a large audience, but definitely not a sign of taste and discernment.
Thanks, papsosouid. Facts are much more useful to discussions like this. It still doesn't change the fact that PHP is capable of running the most trafficked sites in the world (side note: I don't like PHP, either. I just hate haters who can't see the world in shades of gray). Having heard Facebook engineers speak about MySQL many times, I have not seen any inclination that they are dissatisfied with MySQL as a platform choice. (And please, no one reference on that utterly BS article published on gigaom.com)
You aren't supporting your case at all. I said "just because you can accomplish something with bad software, doesn't mean you accomplished it because of that software". And you responded with "it is possible to build big things with bad software". Yeah, I know. That doesn't make the bad software good though, which was my point.
well, yes and no. When you violate Consistency in SQL, your write fails. If it's a rare race condition, then the error probably just bubbles up through your application as an exception. Perhaps, if resolving conflicts was not something that we avoided but something that we baked into our application design, then we would be more likely to write code that handled it gracefully.
It its entertaining to see those weekly stories about NoSQL disasters. Hopefully one or two people will learn one or two things in three process. Let's try one: don't judge technologies on their sex appeal: the SQL old lady will take better care of your data than these young siliconed dolls.
The tests were done using Basho's own benchmarking tool, with the partitioned sequential integer key generator, and 250 byte values. I tried adjusting the ring_size (1024 and 128), and tried adjusting the LevelDB cache_size etc and it didn't help.
Be aware of the poor write throughput if you are going to use it.
That's strange, that doesn't look like a normal graph to me, it looks like a cache or queue of some sort is backed up. Did you try to use dtrace / iosnoop / iostat etc to see what might be the bottleneck?
For average commodity hardware I found something like 400 reqs/s/node was normalish, even sustained. Yours looks like about 2 minutes in it dies. Come to think of it, could you have your open file descriptors limited in the OS settings? That looks just like pattern I'd expect to see from that.
Might be unrelated but common pitfalls I had were:
- Using the HTTP proto. Protobuf is way faster.
- You can tweak the r and w values to get less read and write consensus when you can afford to, depending on the task and data.
- ulimit open file descriptors might be too low.
In any case, if you were to do a short writeup, I'm sure the basho guys at the mailing list would be interested.
So what you are saying is I was right. Thank you. People who report a bug and give less than half of a day for someone to investigate has never dealt with a vendor like oracle or IBM. This tells me you haven't had a data problem before and based on your willingness to give up so quickly leads me to believe you won't end up with data problems that this article is talking about anyway.
Ha. I've had and have plenty of data problems. After 2 days of making adjustments as per Basho's suggestions to try and improve the write throughput, I moved on. You seem to be making a lot of judgments and assumptions about that decision based on very little information. I guess this is troll food.
Riak loves random read/writes, spinny discs do not, try things out with a SSD sometime and watch things go from a shoddy XXX ops/sec to XXXX(X) ops/sec.
As a simple remark on this, I've gotten 1000+ ops/sec on a single machine operating as 3 nodes (equating to about 3000 ops/sec per node) when using an SSD and a measly 150 ops/sec with a spinny disc in the same setup (equating to about 450 ops/sec per node)
I had the same experience about throughput being a bit sub-par. For me it was a test on a single macbook pro with a regular 2.5" hdd.
Which client did you use to write to riak? protobuf or http? Also: which language? did you use threading? Did you enable search?
Well, for the benchmark, I was using Basho's benchmarking tool which is erlang, and I was testing with protobuf. I had 5 concurrent clients running for the benchmark, but also tried with more and less, and got about the same results.
I find these kinds of stories interesting, but without some feel for the size of the data, they're not very useful/practical.
I've heard of Bump, and used it once or twice, but I don't actually know how big or popular it is. If we're talking about a database for a few million users, only a tiny percentage of which are actively "bumping" at any time, it's really hard for me to imagine this is an interesting scaling problem.
Ex. If I just read an article about a "data migration" who's scale is something a traditional DBMS would yawn at, the newsworthiness would have to be re-evaluated.
Even at 90 million users, with anything approaching a reasonable level of activity, we're not talking about serious data.
90 million rows of denormalized data isn't a big deal, and if I had to guess, their ops per second is probably no higher than what a dedicated single, or maybe a small master-slave postgres deployment could handle.
Again, something a DBA would yawn at.
And I say this as someone who scaled up an API for a service that plugged into multiple ad networks concurrently for a total of billions of impressions per month with a high level of reliability. Using NoSQL and an RDBMS combined.
People who want to preach the NoSQL message should probably have some actual experience. Otherwise, it just makes very viable NoSQL solutions look really bad.
I'm not sure what exactly qualifies as respectable scale, but the Mongo master was running out of space and IO capacity with 24 SSDs and 90 million user records, and was replaced by a sixteen-node Riak cluster.
I'll happily share any other statistics you're interested in.
Edit: the Riak cluster actually contains lots of other data (communications, object metadata, etc.); we didn't need sixteen boxes for the user records.
90 million users is a great datapoint, yes! In my book that's more than respectable.
The only other stat that I'm curious about is the total size of the DB. Certainly databases with tens of millions of records can be held completely in RAM these days... but that also depends on how big each record is.
All-told the users database when we started migration was about 600 GB on disk, so not the most easily stored database in RAM, but not impossible if you get enough large machines.
In fact, we use Redis a lot at Bump, although almost exclusively for queueing and transient state, and not as a persistent database. For a period of time we did store long-lasting metadata in Redis, and as we became more popular instead of throwing engineering effort at the problem we threw more memory, culminating with a handful of boxes with 192 GB of RAM each. We've since moved that entire database to Riak. :)
I have decided on wanting to use riak as well. I was wondering if anyone had examples of how they used it with their data model?
For example this article mentions "With appropriate logic (set unions, timestamps, etc) it is easy to resolve these conflicts" however timestamps are not an adequate way to do this due to distributed systems having partial ordering. The magicd may be serialising all requests to riak to mitigate this (essentially using the time reference of magicd) in which case they're losing out on the distributed nature of riak (magicd becomes a single point of failure / bottleneck).
Insight into how others have approached this would be awesome.
There are a several ways to approach this. The simplest is to just take last-write-wins, which is the only option some distributed databases give you. For cases where this isn't ideal, you resolve write-conflicts in a couple ways.
One way is to write domain-specific logic that knows how
to resolve your values. For example, your models might
have some state that only happen-after another state,
so conflicts of this nature resolve to the 'later' state.
Another approach is to use data-structures or a library
designed for this, like CRDTs. Some resources below:
Unless I'm missing something, I would assume they run magicd on all servers that run the application. Thus Riak's degree of redundancy is independent of magicd's degree of redundancy since each instance of magicd can communicate to the entire Riak pool.
Yep! This is exactly how it works. Each app node runs a magicd which connects to an haproxy instance on localhost (connected to every machine in the database cluster), so when a Riak node goes down we don't miss a beat.
Random thought on proto buffers:
OP is advocating using the "required" modifier for fields and touting it as an advantage in comparison to JSON.
I would move the field value verification logic to the client, because it can cause backwards compatibility problems if you un-require it.
We experienced some significant difficulties with sharding; the automatic methods of doing so only seemed to shard a single-digit percentage of our data. We've also encountered some wildly unexpected issues with master/slave replication and related nomination procedures.
You're right that this post is vague with regard to those details; they would be a good candidate for a future blog post, but the desired takeaway from this one is that we're quite pleased with the performance and scalability that Riak provides.