F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business (research.google.com)
201 points by pwpwp on May 30, 2012 | 56 comments

I worked on the AdWords MySQL system while at Google around 2003-2005. We put a lot of work into it and with the right partitioning and caching strategies MySQL held up pretty well. I'm a little surprised it took so long for Google to migrate away from it, given all the systems expertise the company has. But most of Google's engineering efforts are for enormously large data where a little data loss and slow consistency is OK. AdWords is more like a traditional ACID application than, say, a search engine.

A particular advantage of MySQL over the current crop of NoSQL systems or a home-grown system is there's a lot of people who understand how MySQL fails and how to tune it. It's predictable. I suspect that's part of why it had such a long life in AdWords.

I'm surprised nobody's picked up on the main tradeoff here: very high write latencies (and high read latencies too, just not as spectacularly). From slide 9:

Performance:
● Very high commit latency - 50-100ms
● Reads take 5-10ms - much slower than MySQL
● High throughput

(And that's reads-with-no-joins, since they are not supported either.)

I don't think that has much to do with F1; any system with synchronous cross-datacenter replication will have that kind of write latency. Presumably if you ran in one datacenter it would be lower latency and just as scalable.

I think it's a fair thing to bring up, given that the F1 abstract comes across as claiming your-cake-and-eating-it-too:

"With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS."

In other words: sure, you can build a distributed, transactional system by making all writes synchronous... but most (all?) NoSQL systems prefer to optimize for low-latency operations over transactionality. As an author of one of those systems, I think that low latency is a more useful property in general, but the fact remains that it's a tradeoff, and it's a shame that the F1 authors deliberately gloss over that.

Deliberately gloss over it? Second paragraph of the abstract:

> The strong consistency properties of F1 and its storage system come at the cost of higher write latencies compared to MySQL. Having successfully migrated a rich customer-facing application suite at the heart of Google’s ad business to F1, with no downtime, we will describe how we restructured schema and applications to largely hide this increased latency from external users. The distributed nature of F1 also allows it to scale easily and to support significantly higher throughput for batch workloads than a traditional RDBMS.

F1 is a successor to Megastore, including with respect to its high write/read latencies. From bits leaked here and there by Google, one can safely assume they put high value on cross-datacenter consistency nowadays, even if it means hurting write latencies this much -- I recommend this video from Google I/O 2011: http://www.youtube.com/watch?v=rgQm1KEIIuc (from minute 35:40 on). Megastore and F1 present abysmal latencies because they synchronously write to at least 3 datacenters before acknowledging the client, via Paxos or 2PC.

Only an idiot would assume that F1/Megastore is a drop-in replacement for MySQL (or Cassandra, jbellis!), but those guys invented BigTable, and battle-tested it, long before writing the papers, so they know where their priorities lie nowadays. On the other hand, I am highly curious about Spanner, the successor to BigTable, that powers F1.
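As a back-of-the-envelope illustration of why synchronous cross-datacenter replication forces commit latencies like these (the RTT numbers below are made up, not from the talk): a Paxos-style commit needs acknowledgements from a majority of replicas, so it completes only when the ack from the quorum-closing replica arrives.

```python
def quorum_commit_latency(rtts_ms):
    """A commit needs acks from a majority of replicas, so it finishes
    when the (floor(n/2)+1)-th fastest ack arrives, regardless of how
    fast the nearest replicas are."""
    quorum = len(rtts_ms) // 2 + 1
    return sorted(rtts_ms)[quorum - 1]

# Hypothetical round-trip times from the leader: one local replica
# plus four remote datacenters.
replica_rtts_ms = [2, 45, 70, 90, 110]

print(quorum_commit_latency(replica_rtts_ms))  # -> 70
```

The local replica's 2ms ack buys nothing: the commit is gated by the third-fastest datacenter, which is why even a healthy deployment sits in the tens of milliseconds.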

> (And that's reads-with-no-joins, since they are not supported either.)

Am I missing something? The very next slide seems to describe how they support one-to-many joins by flattening the table hierarchy. (They also claim to allow "joins to external sources", but unfortunately there are no details given for that feature.)

You're right, I missed the part where "can't do cross-shard transactions or joins" was describing the old mysql infrastructure, not F1.

::sigh:: Would love to see this thing in detail and work on it. Frustrating that some of the most promising and interesting technologies at Google (e.g. BigTable) come out only in whitepaper or spec form. I can see several business reasons for them to do so, but still... frustrating.

From experience (and MMM), let's say writing the software takes X mandays of effort.

Making it run, and then run well, inside a single organization with that organization's workload is a 3x multiplier.

Turning it into something that can be sold to a large number of customers is at least another 3-10x multiplier.

One of the reasons Google is not releasing it is because they have no incentive to invest that 3-10x multiplier.

You might be interested to learn about Analytics, AppEngine, Predictions API, Developer Storage, and I/O 2012 is a few weeks away...

To be fair, a SIGMOD paper is much more detailed than a white paper. But perhaps you said that because the Google page only links to the presentation from the conference, not the paper itself. You can find that here: http://dl.acm.org/citation.cfm?id=2213954

edit: oddly, it appears that the "full-text" pdf that the ACM has is just the abstract. So I can't get at the full paper, either. Sorry about that.

If you check http://www.sigmod.org/2012/, it appears this was an "industrial presentation" and not a full paper submission.

It's an industrial presentation. As such, there's no paper about it, just the talk given at SIGMOD and the slides. When Google unveiled Megastore, F1's predecessor, it did so as an industrial presentation too. Years, and many improvements, later, they published a paper on Megastore.

Most of the awesome software running the WWW is running in tightly sealed compartments, hidden away from our sight.

Imagine being able to see Google's search engine source code. That'd be truly something.

Unfortunately market forces are still pretty much in control of much of the world's most interesting software.

I've not seen it, but I'd be willing to bet their search engine code is a mess.

The reason for this: almost all single-purpose software is a complete disaster area. (And a massive proportion of other stuff isn't much better.) As long as it works within certain parameters, there really is no need to fix it up.

Except that Google's algorithms are constantly being tweaked and improved. Crap code is only left alone when it won't be touched anymore. If the code being crap makes your day-to-day work hard, the best bet is to refactor (given that you have the money and time, which Google has).

Well, wait 5 years and someone will have a working copy (Apache Foundation or someone else) :( oh well

And another 5 years for the even-remotely-production-usable 2.0 release.

and another 5 years to get hardware running with the performance characteristics of the original software.

Looks pretty awesome.

Avoiding ORMs that assume a chatty connection makes pragmatic sense, and they note the code complexity trade-off of having to explicitly fetch the data up-front.
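The trade-off can be sketched roughly like this (in-memory stand-ins for remote tables, not F1's actual client API): a chatty ORM issues one query per parent row (the classic N+1 pattern), while the up-front style fetches each table once and joins client-side.

```python
# Hypothetical in-memory "tables" standing in for remote queries.
campaigns = [{"id": 1, "customer": 7}, {"id": 2, "customer": 7}]
ads = [{"campaign": 1, "text": "a"}, {"campaign": 1, "text": "b"},
       {"campaign": 2, "text": "c"}]

def chatty(campaigns, ads):
    """N+1 pattern: one lookup per campaign, each a network round trip."""
    round_trips = 1  # initial fetch of the campaigns table
    result = []
    for c in campaigns:
        round_trips += 1  # per-campaign fetch of its ads
        result.append((c["id"], [a for a in ads if a["campaign"] == c["id"]]))
    return result, round_trips

def upfront(campaigns, ads):
    """Fetch both tables once, join client-side: 2 round trips total."""
    by_campaign = {}
    for a in ads:
        by_campaign.setdefault(a["campaign"], []).append(a)
    return [(c["id"], by_campaign.get(c["id"], [])) for c in campaigns], 2

print(chatty(campaigns, ads)[1])   # -> 3 round trips
print(upfront(campaigns, ads)[1])  # -> 2 round trips
```

With 50-100ms commits and 5-10ms reads, the per-row round trips of the chatty style are exactly what you can't afford, which is presumably why their ORM forces the up-front shape.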

I like that they basically added aggregates to the SQL schema; I'd been hoping an RDBMS would come out with that feature as a way to transparently handle sharding.

I'm pleasantly surprised they allow (AFAICT) transactions/joins across aggregates, albeit with 2PC.

I would have been just fine accepting transactions/joins only within aggregates: even if the application code has to handle coordination across aggregates in an eventually consistent manner, as long as each aggregate itself is consistent, I think programmers can still easily reason about that sort of model.

If I were a VC, I'd give $40 million to a company building this product over MongoDB. Enterprises (who spend money on licenses/support much more than startups do) historically want features like schemas and transactions (i.e. there is a reason relational databases have been popular for so long).

I think the most interesting things about F1 are the hierarchical storage model and a client-side programming model that's not dependent on SQL. SQL was originally intended as an analyst-level query language -- not a database API -- and that's the role it seems to play in F1.

I might be reading this wrong, but from the slides it seems like JOINs are only fast when they are on the same machine. Their own ORM library avoids using JOINs altogether. Doesn't that defeat the point of using an RDBMS?

If you look at the example tables, what they're doing is sharding based on customerID, so sections of those tables with the same customerID "stick" together and are guaranteed to be stored on the same machine(s) (plural because of replication). Their workload is dominated by JOINs that do not mix customerIDs, so the JOIN runs on a single machine, inside the database process, which is fast. They do support JOINs across customerIDs, but that probably involves the client (or a proxy process) fetching rows and merging/sorting them, i.e. it happens outside the server process, probably on another machine, and is much slower.

The same trade-off also happens with transactions: if all the WRITEs are for the same customerID, it will be fast with almost no performance hit compared to issuing the same commands without transactions. They also support transactions across customerIDs, which involves an expensive 2-phase commit (2PC) round across the machines (across the Paxos quorums, really), which will be much slower.

Disclaimer: I wrote a distributed database with similar design decisions.
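A minimal sketch of that routing decision (illustrative only; the names and the modulo scheme are my own, not F1's implementation): every row carries the root customer ID, the shard is a deterministic function of it, and a transaction needs cross-shard coordination (2PC) only when it touches more than one shard.

```python
NUM_SHARDS = 8

def shard_for(customer_id: int) -> int:
    """Every row keyed by this customer lands on the same shard,
    so single-customer joins and transactions stay on one machine."""
    return customer_id % NUM_SHARDS

def needs_2pc(customer_ids) -> bool:
    """A transaction pays for two-phase commit only if the customers
    it touches map to more than one shard."""
    return len({shard_for(c) for c in customer_ids}) > 1

print(needs_2pc([42, 42]))  # -> False: one customer, one shard, fast path
print(needs_2pc([42, 43]))  # -> True: two shards, expensive 2PC round
```

The whole design bet is that the fast single-shard path covers the overwhelming majority of the ad workload's queries.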

Is there something similar for any open-source RDBMS, i.e. sharding all tables by the same customer_id?

VoltDB makes some similar tradeoffs that allow single-shard SQL to continue to work as expected.

There is actually quite a list of things that are trying to make this work. DBShards, Clustrix, Xeround, and I am sure that isn't all of them.

Yeah, that's what I thought too. I wasn't sure how much the devs could do from a brief look at the PDF, and it felt like they got vague at that point.

I've been watching Akiban build this for a while, and they are planning on releasing an open source version of something similar in a month.

What's interesting to me is that they moved away from MySQL; Google's engineers are very smart about code. I wonder if Google looked into writing a storage engine and then realized how awful MySQL's code is.

Also the fact that Oracle owns MySQL probably gave them an extra impetus to get off of it immediately.

My favourite quote is from the summary page: “In short, we made our database scale, and didn't lose any key database features along the way.”

Except that it "Can't do cross-shard transactions or joins". It either must have some really intelligent way to distribute the shards, or you would be losing some key database features along the way.

That quote is from the slide "Our Legacy DB: Sharded MySQL", not describing F1.

Am I missing something here? This sounds a lot like IMS.

Hierarchical databases have been around for a long, long time. The concepts look similar to some ancient legacy system that I worked on a decade ago. Doesn't sound incredibly novel.

Their goal with the hierarchy is to create "table groups", i.e. [regions of] groups of tables that stick together in terms of _sharding_, achieved by looking at the primary key (PK) of the root table, which, if you notice, is shared by all its descendants.
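One way to picture that (an illustrative sketch, not F1's actual storage format): prefix every child row's key with the root table's PK, and a plain sort of all keys then physically clusters each customer's rows, across tables, next to each other.

```python
# Rows from three hierarchy levels; each key begins with the root
# (customer) PK, then the child PKs deeper in the hierarchy.
rows = [
    ("Customer", (2,)),
    ("Campaign", (1, 10)),
    ("Customer", (1,)),
    ("AdGroup",  (1, 10, 100)),
    ("Campaign", (2, 20)),
]

# Sorting by key groups all of customer 1's rows (Customer, Campaign,
# AdGroup) before any of customer 2's, so a range scan per customer
# hits one contiguous region -- and one shard.
clustered = sorted(rows, key=lambda r: r[1])
for table, key in clustered:
    print(key, table)
```

Since sharding is by contiguous key ranges over this ordering, a whole customer's subtree can never straddle a shard boundary, which is what makes the single-customer joins cheap.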

I gotta wonder how Clustrix compares to this. Also is this similar to using a SQL layer with Cassandra.

Clustrix relies on an Infiniband cluster interconnect, so the latency is several orders of magnitude smaller. This makes distributed transactions across shards much quicker at the cost of requiring more expensive hardware.

Cassandra's query language, CQL, is not really comparable since it only supports such a small subset of SQL. Also, Cassandra uses eventual consistency in place of doing distributed transactions.

It would be nice to add this to App Engine so people outside Google could try it. There is an issue on Google Code for it.

wha, a SOAP API?


Getting access as an end user is a huge PITA; it's mostly intended for ad agencies that manage lots of customer campaigns.

>Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing, and is built on top of a highly distributed storage system that scales on standard hardware in Google data centers.

Nice to see that unlike the NoSQL amateur hour (startups using flimsy non-dbs where they don't belong []), Google, for its core and crucial business, uses a) relational-like features, b) a strictly enforced schema, c) transactions.

[] There are places where these belong; they just are few. And, no, ease of development and "SQL is hard and has an impedance mismatch with my programming language, man" is not a reason to use them in situations where they don't belong.

Your asterisks have messed up your text formatting. Use a dagger or a number instead.

Yeah, thanks. I keep forgetting that _asterisk_ is used for italics in this markup.

NoSQL's a scam. Film at 11.

Seriously, there are many parallel RDBMS products out there that scale pretty well. Just none of them are open source. Why?

Why don't we have vehicles with the speed of airplanes, the low capital and operational cost of motorcycles, the cargo capacity of trains, the cabin space of luxury yachts?

NoSQL is just a brand name for a loosely defined set of engineering trade-offs. It's easier to achieve the scalability and fault tolerance needed for a lot of modern Internet systems if you don't also require ACID and the huge weight of features and old usage patterns that the market expects of a relational (which they read as "SQL") database. If somebody wants to pour $10-50 million into making an open source NoSQL-scale RDBMS, great, but some projects are too complex for any but the most robust open source communities to maintain, never mind create.

That is one of the clearest explanations of this term. Maybe NoACID would have been more representative.

There are really two dimensions: relational and ACID. The relational aspect is the R in RDBMS: if you have relational data, you can interact with it in mathematically defined ways. The ACID part is the MS: in the past, when we wanted to store and ask questions of our data, we also wanted it managed such that we had guarantees on it.

Depending on the project, NoSQL solutions can relax both of those dimensions. I typically refer to "NoSQL" as non-relational databases.

The opposite of ACID is BASE.

Looks like that's what Oracle has been working on lately: http://www.oracle.com/technetwork/products/nosqldb/overview/...

How do you get scam from this paragraph?

"With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in “NoSQL” systems with the usability, familiarity, and transactional guarantees expected from an RDBMS."

If you don't like NoSQL solutions, or they don't fit your project's needs, don't use them. There's no "scam" here.

The scam is that some people said NewSQL (including F1) is impossible and thus everyone will have to use NoSQL.

Google's web index is NoSQL, called BigTable, which Cassandra cloned.

Bucardo - multimaster postgres replication under BSD license


I actually looked at Bucardo a while back and found it not fit for general usage.

If you thought slony was bad, Bucardo will give you nightmares.

It's based on a brittle conglomerate of elaborate triggers - and it's effectively undocumented.

With something as complex as multi-master replication I'd think long and hard before trusting my data to a bolt-on hack like that.

Haven't tried it. I'm using postgres on a project right now and was looking at this as an option if I need to replicate my DB based purely on a description of its features (that was dumb, I know). I guess I'll have to rethink that now. Thanks for the heads-up!

But this is a NoNoSQL product...

could you elaborate more?
