A particular advantage of MySQL over the current crop of NoSQL systems or a home-grown system is there's a lot of people who understand how MySQL fails and how to tune it. It's predictable. I suspect that's part of why it had such a long life in AdWords.
● Very high commit latency - 50-100ms
● Reads take 5-10ms - much slower than MySQL
● High throughput
(And that's reads-with-no-joins, since they are not supported either.)
"With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS."
In other words: sure, you can build a distributed, transactional system by making all writes synchronous... but most (all?) NoSQL systems prefer to optimize for low-latency operations over transactionality. As an author of one of those systems, I think that low latency is a more useful property in general, but the fact remains that it's a tradeoff, and it's a shame that the F1 authors deliberately gloss over that.
> The strong consistency properties of F1 and its storage system come at the cost of higher write latencies compared to MySQL. Having successfully migrated a rich customer-facing application suite at the heart of Google’s ad business to F1, with no downtime, we will describe how we restructured schema and applications to largely hide this increased latency from external users. The distributed nature of F1 also allows it to scale easily and to support significantly higher throughput for batch workloads than a traditional RDBMS.
Only an idiot would assume that F1/Megastore is a drop-in replacement for MySQL (or Cassandra, jbellis!), but these are the people who invented BigTable and battle-tested it long before writing the papers, so they know where their priorities lie nowadays. On the other hand, I am highly curious about Spanner, the successor to BigTable, which powers F1.
Am I missing something? The very next slide seems to describe how they support one-to-many joins by flattening the table hierarchy. (They also claim to allow "joins to external sources", but unfortunately there are no details given for that feature.)
Making it run, and then run well, inside a single organization with that organization's workload is a 3x multiplier.
Turning it into something that can be sold to a large number of customers is at least another 3-10x multiplier.
One of the reasons Google is not releasing it is that they have no incentive to invest that 3-10x multiplier.
edit: oddly, it appears that the "full-text" pdf that the ACM has is just the abstract. So I can't get at the full paper, either. Sorry about that.
Imagine being able to see Google's search engine source code. That'd be truly something.
Unfortunately market forces are still pretty much in control of much of the world's most interesting software.
The reason for this: almost all single purpose software is a complete disaster area. (And a massive proportion of other stuff isn't much better). As long as it works within certain parameters then there really is no need to fix it up.
Avoiding ORMs that assume a chatty connection makes pragmatic sense, and they note the code-complexity trade-off of having to explicitly fetch the data up front.
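To make the trade-off concrete, here's a toy sketch (all names invented, not from the F1 paper) contrasting the chatty N+1 access pattern with an explicit up-front batch fetch. With per-read latencies of 5-10ms, the round-trip count is what dominates:

```python
from dataclasses import dataclass


@dataclass
class Customer:
    campaign_ids: list


class FakeOrm:
    """Stand-in data layer that counts round trips, so the latency cost
    of a chatty access pattern is visible without a real database."""

    def __init__(self):
        self.round_trips = 0
        self.campaigns = {1: "c1", 2: "c2", 3: "c3"}

    def fetch_customer(self, customer_id):
        self.round_trips += 1
        return Customer(campaign_ids=list(self.campaigns))

    def fetch_campaign(self, cid):
        self.round_trips += 1  # one network round trip per child row
        return self.campaigns[cid]

    def fetch_campaigns(self, cids):
        self.round_trips += 1  # one batched query for all children
        return [self.campaigns[c] for c in cids]


def chatty(orm, customer_id):
    # N+1 pattern: a lazy ORM issues one query per campaign.
    customer = orm.fetch_customer(customer_id)
    return [orm.fetch_campaign(c) for c in customer.campaign_ids]


def batched(orm, customer_id):
    # Explicit up-front fetch: two round trips regardless of child count.
    customer = orm.fetch_customer(customer_id)
    return orm.fetch_campaigns(customer.campaign_ids)
```

At 100 campaigns and ~5-10ms per read, the chatty path burns half a second to a second in pure latency; the batched path stays at two round trips.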
I like that they basically added aggregates to the SQL schema, I'd been hoping a RDBMS would come out with that feature as a way to transparently handle sharding.
I'm pleasantly surprised they allow (AFAICT) transactions/joins across aggregates, albeit with 2PC.
I would have been just fine accepting transactions/joins only within aggregates, as even if the application code has to handle coordination across aggregates in an eventually consistent manner, as long as each aggregate itself is consistent, I think programmers can still easily deal with/reason about that sort of model.
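A toy sketch of that aggregate idea, with an invented modulo placement rule (F1/Spanner actually range-shard child rows interleaved under the root Customer row): as long as every child row carries its root key, one customer's whole hierarchy lands on one shard, so intra-aggregate transactions and joins never cross machines.

```python
NUM_SHARDS = 8  # toy constant; real systems split ranges dynamically


def shard_for(root_key: int) -> int:
    # Invented placement rule: modulo hashing on the root (customer) key.
    return root_key % NUM_SHARDS


def place(rows):
    """rows: iterable of (customer_id, table, pk) tuples.

    Every child row is placed by its root customer_id, so a customer's
    entire hierarchy (campaigns, ad groups, ...) shares a single shard.
    """
    placement = {}
    for customer_id, table, pk in rows:
        placement.setdefault(shard_for(customer_id), []).append((table, pk))
    return placement
```

Cross-aggregate work still has to touch multiple shards, which is exactly where the 2PC cost shows up.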
If I were a VC, I'd give $40 million to a company building this product over MongoDB. Enterprises (who spend money on licenses/support much more than startups do) historically want features like schemas and transactions (i.e. there is a reason relational databases have been popular for so long).
The same trade-off also happens with transactions: if all the WRITEs are for the same customerID, it will be fast with almost no performance hit compared to issuing the same commands without transactions. They also support transactions across customerIDs, which involves an expensive 2-phase commit (2PC) round across the machines (across the Paxos quorums, really), which will be much slower.
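The commit-path decision described above can be sketched like this (names and shapes are mine, not F1's actual API): writes that all share one root customerID can commit with a single quorum round, while anything spanning roots falls back to two-phase commit across the participating groups.

```python
def plan_commit(writes):
    """writes: list of (customer_id, mutation) pairs.

    Returns which commit protocol the batch would need: a single-group
    commit when every write shares one root key, else two-phase commit
    across all participating groups.
    """
    roots = {customer_id for customer_id, _ in writes}
    if len(roots) == 1:
        return "single-group"  # fast path: one quorum round
    return "2pc"               # slow path: prepare + commit across groups
```

The point of the hierarchical schema is to make the fast path the common case: if your application keys everything by customer, almost every transaction stays single-group.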
Disclaimer: I wrote a distributed database with similar design decisions.
There is actually quite a list of things that are trying to make this work. DBShards, Clustrix, Xeround, and I am sure that isn't all of them.
What's interesting to me is that they moved off MySQL entirely; Google's engineers are very smart about code. I wonder if Google looked into writing a storage engine and then realized how awful MySQL's code is.
Also the fact that Oracle owns MySQL probably gave them an extra impetus to get off of it immediately.
Hierarchical databases have been around for a long, long time. The concepts look similar to some ancient legacy system that I worked on a decade ago. Doesn't sound incredibly novel.
Cassandra's query language, CQL, is not really comparable since it only supports such a small subset of SQL. Also, Cassandra uses eventual consistency in place of doing distributed transactions.
Getting access as an end user is a huge PITA; it's mostly intended for ad agencies that manage lots of customer campaigns on their behalf.
Nice to see that unlike the NoSQL amateur hour (startups using flimsy non-dbs where they don't belong), Google, for its core and crucial business, uses a) relational-like features, b) a strictly enforced schema, c) transactions.
There are places where those stores belong; they are just few. And no, ease of development and "SQL is hard and has an impedance mismatch with my programming language, man" is not a reason to use them in situations where they don't belong.
Seriously, there are many parallel RDBMS products out there that scale pretty well. Just none of them are open source. Why?
NoSQL is just a brand name for a loosely defined set of engineering trade-offs. It's easier to achieve the scalability and fault tolerance needed for a lot of modern Internet systems if you don't also require ACID and the huge weight of features and old usage patterns that the market expects of a relational (which they read as "SQL") database. If somebody wants to pour $10-50 million into making an open source NoSQL-scale RDBMS, great, but some projects are too complex for any but the most robust open source communities to maintain, never mind create.
Depending on the project, NoSQL solutions can relax both of those dimensions. I typically refer to "NoSQL" as non-relational databases.
If you don't like NoSQL solutions, or they don't fit your project's needs, don't use them. There's no "scam" here.
If you think Slony was bad, then Bucardo will give you nightmares.
It's based on a brittle conglomerate of elaborate triggers - and it's effectively undocumented.
With something as complex as multi-master replication I'd think long and hard before trusting my data to a bolt-on hack like that.