Zanzibar: Consistent, Global Authorization System

argd678 · on June 8, 2019

The distinguishing feature I see compared to other systems is the ACL ordering and consistency, which is indeed difficult to do at scale. Looks like Spanner is doing most of the heavy lifting, a great use case for the database.

usaar333 · on June 9, 2019

Well even more broadly it is how generalizable it is, while still providing ordering guarantees (though not necessarily perfect ones.. see my long sibling post)

Using Windows-style ACEs for ACLs is also perfectly scalable and consistent (and more performant), so long as users don't end up in too many groups and objects only inherit ACLs from objects on the same shard. It's just nowhere near as generalizable as Zanzibar, which allows much more complex dependencies.

There are always tradeoffs! But this is the best system I've seen for general ACL evaluation against objects that have not been recently updated.

argd678 · on June 9, 2019

I’ve been part of similarly generalized ACL systems and it’s pretty straightforward and very similar to Zanzibar. Though we didn’t need n ACLs and could assume the list wasn’t too long, so we didn’t need a tree. If we did, then we’d have ended up in a similar place as Zanzibar I believe, there are a limited number of ways to solve that problem.

cryptonector · on June 9, 2019

If you don't have negative ACL entries then ordering is not important.

usaar333 · on June 9, 2019

GP means ordering with respect to time for snapshot reads, which is essential for correctness.

(You might be thinking of ordering ACEs in the ACL which isn't even a concept in Zanzibar)

iampims · on June 8, 2019

“Zanzibar scales to trillions of access control lists and millions of authorization requests per second to support services used by billions of people. It has maintained 95th-percentile latency of less than 10 milliseconds and availability of greater than 99.999% over 3 years of production use”

Impressive!

the-rc · on June 8, 2019

"This caching, along with aggressive pooling of read requests, allows Zanzibar to issue only 20 million read RPCs per second to Spanner." ("Only")

I'm surprised by all the numbers they give out: latency, regions, operation counts, even servers. The typical Google paper omits numbers on the Y axis of its most interesting graphs. Or it says "more than a billion", which makes people think "2B", when the actual number might be closer to 10B or even higher.

star-trek-fleet · on June 8, 2019

Kind of needed in the cloud war, where many actually question whether the mysterious Google supremacy in global infrastructure is really true.

wbl · on June 8, 2019

Throwing servers at the problem is less impressive than thinking very hard and solving it with fewer.

star-trek-fleet · on June 8, 2019

If your conclusion is "throwing servers at problems" after years of reading the papers about Google infrastructure, you probably are a non infrastructure guy.

A serious conclusion should be that all this infrastructure enables application devs and researchers alike "to throw servers at problems". And this work is exactly the opposite: they spent years, sometimes even decades, thinking through the details and figuring out the most effective and efficient way of utilizing the servers.

paulryanrogers · on June 9, 2019

With the demise of Moore's Law and physics education improving I imagine efficiency of the machines will eventually overtake developer time concerns.

okket · on June 8, 2019

It's not servers. It's who will have the best submarine cables. A game in which Apple is not participating btw. Not even with an Elon style moon, err, low earth orbit shot.

https://blog.apnic.net/2019/04/03/the-future-of-undersea-int...

threeseed · on June 9, 2019

Apple has no need for undersea cables.

It doesn't matter if iCloud is slow or not since most of the interactions with it are in the background. And all of its content, e.g. App Store or Apple Music, is cached by CDNs which are hosted in pretty much every country.

dgemm · on June 8, 2019

A bit simplistic, no? There is more than one important factor.

vajaya · on June 8, 2019

There's a physical limit to what algorithms and tricks can do. Their numbers are incredible though.

YjSe2GMQ · on June 8, 2019

And there's much new fun to be had when you have this many servers around. For example - once you start shuffling around tens of petabytes a day you quickly notice that bit flips are very real. Computers do what we tell them to do with incredibly high probability, but it is always below 1.

wbl · on June 9, 2019

Which is exactly my point: no one cares about the absolute number each cloud has. They care about the capacities.

wallflower · on June 8, 2019

> There's also a story behind that project name. That is not the original project name. The original project name was force-removed by my SVP. Once my hands are free again, I can explain

https://mobile.twitter.com/LeaKissner/status/113663143751427...

pronoiac · on June 8, 2019

https://twitter.com/LeaKissner/status/1136691523104280576

> Zanzibar was not the original name of the system. It was originally called "Spice". I have read Dune more times than I can count and an access control system is designed so that people can safely share things, so the project motto was "the shares must flow"

adfm · on June 8, 2019

“The people who can destroy a thing, they control it.“

delroth · on June 8, 2019

https://mobile.twitter.com/LeaKissner/status/113669152310428...

koboll · on June 8, 2019

Yeah, that's kinda understandable. Although I'm not sure how the current name fits the project any better.

Iv · on June 8, 2019

It sounds pretty appropriate. This is another SF reference: "Stand on Zanzibar", one of the earliest inspirations for cyberpunk, in which we are routinely reminded that all of humanity would be able to stand on the small island of Zanzibar, but that growth is problematic.

https://en.wikipedia.org/wiki/Stand_on_Zanzibar

kjeetgill · on June 8, 2019

For others that were confused as I was: SF here is Science Fiction not San Francisco.

dsagent · on June 8, 2019

I was hoping it was a Metal Gear Solid reference.

https://metalgear.fandom.com/wiki/Zanzibar_Land

victor106 · on June 8, 2019

What do other large (non-Google scale) to medium companies use for authorization? Can anyone recommend open source (preferably) or closed source products?

jamafu · on June 8, 2019

We use Keycloak at our place and are really happy with it.

Website: https://www.keycloak.org/

GitHub: https://github.com/keycloak/keycloak

dusty_mc_dusty · on June 9, 2019

Jup, it's quite nice!

ucarion · on June 8, 2019

https://github.com/ory/ladon is an option. Essentially, it imposes a lot of the fine-grained access control model on you, but then it's up to you to implement the actual database/business-logic layer [1] as well as the API layer to actually expose the service.

[1] You do so by implementing this interface: https://github.com/ory/ladon/blob/master/warden.go

perlgeek · on June 8, 2019

There's also Open Policy Agent (OPA) https://github.com/open-policy-agent/opa

gen220 · on June 8, 2019

We use LDAP for managing group memberships (i.e. person x is a member of `engineering` and `eng_team_y`; only members of `eng_team_y` can change the deployment status of service Z). We then define ACLs for these groups. IDK how they are enforced, but they're visible/malleable via Ansible recipes, such that the process of adding permissions for your group (or user) involves submitting a diff to said Ansible recipe and getting approval from an SRE.

In practice, we use Kerberos to obtain/distribute authorization tokens, which live for less than 24 hours. The authorization-value of these tokens is determined by the LDAP affinities of the bearer. If everything is configured correctly (which it always is, until you need new permissions / switch teams), all you have to do is auth with kerberos at the beginning of each day. We have ~200 engineers.

MrSaints · on June 9, 2019

A pretty simple, and configurable one: https://github.com/casbin/casbin

usaar333 · on June 9, 2019

Excellent paper. As someone who has worked with filesystems and ACLs, but never touched Spanner before, I have some questions for any Googler who has played with Zanzibar. (in part because full-on client systems examples are limited)

To check my understanding: Zanzibar is being optimized to handle zookies that are a bit stale (say 10s old). In this case, the indexing systems (such as Leopard) can be used to vastly accelerate query evaluation.

Questions I have (possibly missed explanations in the paper):

1. If I understand the zookie time (call it T) evaluation correctly, access questions for a given user are effectively "did a user have access to a document at or after T"? How in practice is this done with a check() API? The client/Zanzibar can certainly use the snapshots/indexes to give a True answer, but if the snapshot evaluation is false, is live data used (and if so by the client or Zanzibar itself?)? (e.g. how is the case handled of a user U just gaining access to a group G that is a member of some resource R?)

2. Related to #1, when is a user actually guaranteed to lose access to a document (at a version they previously had access to?) E.g. if a user has access to document D via group G and user is evicted from G, the protocol seems to inherently allow user to forever access D unless D is updated. In practice, is there some system (or application control) that will eventually block U from accessing D?

3. Is check latency going to be very high for documents that are being modified in real time (so zookie time is approximately now or close to now) that have complex group structures? (e.g. a document nested 6 levels deep in a folder where I have access to the folder via a group)? That is, there's nothing Zanzibar can do but "pointer chase", resulting in a large number of serial ACL checks?

4. How do clients consistently update ACLs alongside their "reverse edges"? For instance, the Zanzibar API allows me to view the members of a group (READ), but how do I consistently view which groups a user is a member of? (Leopard can cache this, but I'm not sure if this is available to clients and regardless it doesn't seem to be able to answer the question for "now" - only for a time earlier than indexed time).

Or for a more simple example, if I drag a document into a folder, how is the Zanzibar entry that D is a child of F made consistent with F's views of its children?

E.g. can you do a distributed transaction with ACL changes and client data stored in Spanner?

5. It looks like the Watch API is effectively pushing updates whenever the READ(obj) would change, not the EXPAND(object). Is this correct? How are EXPAND() changes tracked by clients? Is this even possible? (e.g. if G is a member of some resource R and U is added to G, how can a client determine U now has access to R?)

ruomingpang · on June 9, 2019

Excellent questions. We have encountered all of them in practice and have solutions for most of them, e.g., (4) requires an ACL-aware search index.

Unfortunately we don't have enough space to explain them in the paper. Please consider coming to Usenix. :-)

sa46 · on June 9, 2019

Used to be a Googler and worked on an ACL model built on top of Zanzibar. I didn't work directly on Zanzibar so listen to ruomingpang over me.

> 3. There's nothing Zanzibar can do but "pointer chase", resulting in a large number of serial ACL checks?

Zanzibar enforced a max depth and would fail if the pointer-chasing traversed too deeply. Zanzibar would also fail if it traversed too many nodes.
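
A rough sketch of that kind of bounded traversal, in Python. Everything here is an assumption made for illustration: the `graph` dictionary, the `check` signature, the limit values, and the `TraversalLimitExceeded` name are hypothetical, and this walks the graph level by level in one thread rather than fanning out the way a real service would.

```python
from typing import Dict, List, Set


class TraversalLimitExceeded(Exception):
    """Raised when the check gives up instead of chasing more pointers."""


def check(graph: Dict[str, Set[str]], obj: str, user: str,
          max_depth: int = 4, max_nodes: int = 1000) -> bool:
    """Return True if `user` is reachable from `obj` through membership edges.

    `graph` maps a node (an object or a group) to its direct members.
    The call fails loudly once either limit is exceeded.
    """
    visited: Set[str] = set()
    frontier: List[str] = [obj]
    depth = 0
    while frontier:
        if depth > max_depth:
            raise TraversalLimitExceeded(f"traversal deeper than {max_depth}")
        next_frontier: List[str] = []
        for node in frontier:
            if node in visited:
                continue
            visited.add(node)
            if len(visited) > max_nodes:
                raise TraversalLimitExceeded(f"more than {max_nodes} nodes visited")
            members = graph.get(node, set())
            if user in members:
                return True
            # Every member might itself be a group, so expand it on the next level.
            next_frontier.extend(m for m in members if m not in visited)
        frontier = next_frontier
        depth += 1
    return False
```

With a chain such as doc -> group:a -> group:b -> alice, the default limits find alice, while `max_depth=1` raises instead of returning a possibly wrong "no".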

> 4. How do clients consistently update ACLs alongside their "reverse edges"?

One of the recommended solutions was to store your full ACL (which includes a Zookie) in the same Spanner row of whatever it protected. So, if your ACL is for books, you might have:

    CREATE TABLE books (
      book SERIAL PRIMARY KEY,
      acl ZanzibarAcl
    );

Alternately, you could opt to only store the current zookie instead of the full ACL. Then the check becomes:

1. Fetch Zookie from Spanner

2. Call Zanzibar.Check with the zookie
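
To make the two steps concrete, here is a toy in-memory sketch in Python. It is not the real client API: `ToyAclService`, `BookRow`, `can_read`, the integer timestamps standing in for zookies, and the `serving_snapshot` field are all hypothetical names. The only behavior it tries to model is that a check is evaluated at a snapshot no older than the zookie the caller passes in.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class ToyAclService:
    """In-memory stand-in for the ACL service; keeps every ACL version."""
    clock: int = 0
    serving_snapshot: int = 0  # possibly stale time the service would prefer to read at
    versions: Dict[str, List[Tuple[int, FrozenSet[str]]]] = field(default_factory=dict)

    def write(self, obj: str, users: set) -> int:
        """Record a new ACL version; the returned timestamp plays the zookie role."""
        self.clock += 1
        self.versions.setdefault(obj, []).append((self.clock, frozenset(users)))
        return self.clock

    def check(self, obj: str, user: str, zookie: int) -> bool:
        """Evaluate at a snapshot that is never older than the zookie."""
        snapshot = max(zookie, self.serving_snapshot)
        allowed: FrozenSet[str] = frozenset()
        for ts, users in self.versions.get(obj, []):
            if ts <= snapshot:
                allowed = users  # versions are appended in timestamp order
        return user in allowed


@dataclass
class BookRow:
    """Application row: content plus the zookie from its latest ACL write."""
    content: str
    zookie: int


def can_read(acl: ToyAclService, books: Dict[str, BookRow],
             book_id: str, user: str) -> bool:
    row = books[book_id]                                   # 1. fetch the zookie with the row
    return acl.check(f"book:{book_id}", user, row.zookie)  # 2. check with that zookie


acl = ToyAclService()
books: Dict[str, BookRow] = {}
z1 = acl.write("book:1", {"alice", "bob"})
books["1"] = BookRow("draft", z1)
acl.serving_snapshot = z1              # the service's cheap snapshot is now z1
z2 = acl.write("book:1", {"alice"})    # bob is removed...
books["1"] = BookRow("final", z2)      # ...and the row records the new zookie
```

Because the row carries `z2`, bob is denied even though the service's preferred snapshot is still `z1`; a caller holding only the older `z1` would still see the stale "allowed" answer.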

> but how do I consistently view which groups a user is a member of?

I remember this as a large source of pain to implement. Zanzibar didn't support this use-case directly. As rpang mentioned in a sibling comment, you need an ACL-aware index. Essentially, the algorithm is:

1. Call Zanzibar.Check on all groups the user might be a part of.

There's a bunch of clever tricks you can use to prune the search space that I don't know the details of.

sb8244 · on June 9, 2019

How would you deal with questions like "provide all content accessible to a user" in a system like this? Would you watch and replicate to your own database?

ruomingpang · on June 13, 2019

You will need an ACL-aware index, which is one of the main use cases of Zanzibar.

sb8244 · on June 13, 2019

Do you get the ACL from Zanzibar to your data store using a watch?

I'm just confirming that replication is the best strategy and there's not some magic that I'm not aware of

eximius · on June 9, 2019

Semi-off topic: What is the latest and greatest in authorization mechanisms lately?

I like capability-based at an OS level, but sadly I'm not doing anything that interesting. For things like webapps, is there anything better than ACLs or role-based? Or at least any literature talking about them? Probably overkill for the application I work on, but it'd be nice to take inspiration from best practices.

ubercow · on June 10, 2019

Semi-off topic but is there a curated "best of" list for systems papers like this that anyone knows about, from Google or otherwise?

bretthardin · on June 10, 2019

If you find this, please let me know.

zeeed · on June 9, 2019

I got stuck on the first line in the abstract:

> Determining whether online users are authorized to access digital objects is central to preserving privacy.

Can someone dissect that sentence and explain why that is? I honestly fail to make the connection.

Dowwie · on June 9, 2019

Replace "digital object" with "a PDF of your checking account transactions for 2018". You want to control who can do what with that PDF. Your privacy is at stake.

zeeed · on June 9, 2019

Sure, that’s privacy in the sense of “no one can access my stuff, unauthorized”.

I struggled with the sentence cause, at the same time, creating one global centralized authentication source creates the opposite of privacy in the sense of anonymity. Certainly OT wrt the actual content of the work...

BitPolice · on June 12, 2019

I may be misunderstanding the issue you're pointing out here... but I note that while the paper/sentence talks about "authorization" you're talking about centralized "authentication."

As an authorization system Zanzibar focuses on: can agent A (identified through some means) perform action X on object Y. It isn't about deciding whether an arbitrary actor is agent A but about prescribing what actions agent A can perform against the universe of all possible objects (which likewise are referenced abstractly and not stored within the system itself).

The knowledge that A could do X on Y is information that might be disclosed (and thus entails some privacy risk)... but inherently doesn't reveal: anything about the identity of A; whether A has ever done X; or what Y's contents are or what it represents.

On the other hand, perhaps you mean that because membership in sets of users is also stored within it (via a sort of "is member of" permission) you can use that to de-anonymize who a given actor is. This might work but it assumes you can uniquely derive which agent from a set of abstract agents represents that individual and that you extrinsically know something about the person being the only person in this specific set of sets.

cryptonector · on June 9, 2019

This reminds me I need to get my authz paper published, and now sooner than later...

I've built an authz system that is built around labeled security and RBAC concepts. Basically:

  - resource owners label resources
  - the labels are really names for ACLs in a directory
  - the ACL entries grant roles to users/groups
  - roles are sets of verbs

There are unlimited verbs, and unlimited roles. There are no negative ACL entries, which means they are sets -- entry order doesn't matter. The whole thing resembles NTFS/ZFS ACLs, but without negative ACL entries, and with indirection via naming the ACLs.

ACL data gets summarized and converted to a form that makes access control evaluation fast to compute. This data then gets distributed to where it's needed.

The API consists mainly of:

  - check(subject, verb, label) -> boolean
  - query(subject, verb, label) -> list of grants
    (supports wildcarding)
  - list(subject) -> list of grants
  - grant(user-or-group, role, label)
  - revoke(user-or-group, role, label)
  - interfaces for creating verbs, roles, and labels,
    and adding/removing verbs from roles.

Note that access granting/revocation is done using roles, while access checking is done using verbs.

What's really cool about this system is that because it is simple it is composable. If you model certain attributes of subjects (e.g., whether they are on-premises, remote, in a public cloud, ...) as special subjects, then you can compose multiple check() calls to get ABAC, CORS/on-behalf-of/impersonation, MAC and DAC, SAML/OAuth-style authorization, and more. When I started all I wanted was a labeled security system. It was only later that compositions came up.

Because we built a summarized authz data distribution system first, all the systems that have data will continue to have it even in an outage -- an outage becomes just longer than usual update latencies.

check() performance is very fast, on the order of 10us to 15us, with no global locks, and this could probably be made faster.

check() essentially looks up the subject's group memberships (with the group transitive closure expanded) and the {verb, label}'s direct grantees, and checks if the intersection is empty (access denied) or not (access granted). In the common case (the grantee list is short) this requires O(N log M) comparisons, and in the worst case (the two lists are comparable in size) it requires O(N) comparisons. This means check() performance is naturally very fast when using local authz data. Using a REST service adds latency, naturally, but the REST service itself can be backended with summarized authz data, making it fast. Using local data makes the system reliable and reliably fast.
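
A minimal Python sketch of that intersection test, under stated assumptions: the summarized data is modeled as two plain dictionaries of sorted string lists (`closure` and `grants`), and the rule for choosing between binary search and a linear merge is a guess. None of these names or layouts come from the actual system.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple


def _contains(sorted_ids: Sequence[str], x: str) -> bool:
    """Binary search for x in a sorted sequence."""
    i = bisect_left(sorted_ids, x)
    return i < len(sorted_ids) and sorted_ids[i] == x


def intersects(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if two sorted sequences share at least one element."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if not small:
        return False
    # Short list vs. long list: binary-search each short entry in the long list.
    if len(small) * len(large).bit_length() < len(small) + len(large):
        return any(_contains(large, x) for x in small)
    # Comparable sizes: one linear merge pass over both lists.
    i = j = 0
    while i < len(small) and j < len(large):
        if small[i] == large[j]:
            return True
        if small[i] < large[j]:
            i += 1
        else:
            j += 1
    return False


def check(subject: str, verb: str, label: str,
          closure: Dict[str, List[str]],
          grants: Dict[Tuple[str, str], List[str]]) -> bool:
    """Access is granted iff the subject's expanded identities meet the grantees."""
    identities = closure.get(subject, [subject])   # subject plus transitive groups, sorted
    grantees = grants.get((verb, label), [])       # direct grantees, sorted
    return intersects(identities, grantees)


# Hypothetical summarized data.
closure = {"alice": ["alice", "eng", "staff"]}
grants = {("read", "wiki"): ["staff"], ("read", "payroll"): ["hr"]}
```

Here `check("alice", "read", "wiki", closure, grants)` succeeds through the `staff` group, and the payroll check fails because `hr` is not among alice's identities.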

query() does more work, but essentially amounts to a union of the subject's direct grants and a join of the subject's groups and the groups' direct grants.
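
As an illustration only, the union-plus-join could look like the Python sketch below. The `Grant` tuple, the three dictionaries, and the use of `None` as a wildcard are assumptions made for this sketch, not the real data model. Since grants name roles while queries name verbs, a grant matches when its role contains the requested verb.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Grant(NamedTuple):
    grantee: str  # user or group the role was granted to
    role: str
    label: str


def query(subject: str, verb: Optional[str], label: Optional[str],
          groups_of: Dict[str, Set[str]],
          grants_by_grantee: Dict[str, List[Grant]],
          role_verbs: Dict[str, Set[str]]) -> List[Grant]:
    """Return the grants that apply to `subject`; None acts as a wildcard."""
    # Union: the subject's own grants first, then each group's grants (the join).
    grantees = [subject, *sorted(groups_of.get(subject, set()))]
    matches: List[Grant] = []
    for grantee in grantees:
        for grant in grants_by_grantee.get(grantee, []):
            if label is not None and grant.label != label:
                continue
            if verb is not None and verb not in role_verbs.get(grant.role, set()):
                continue
            matches.append(grant)
    return matches


# Hypothetical data: groups_of is assumed to hold the expanded transitive closure.
role_verbs = {"reader": {"read"}, "editor": {"read", "write"}}
groups_of = {"alice": {"eng"}}
grants_by_grantee = {
    "alice": [Grant("alice", "reader", "wiki")],
    "eng": [Grant("eng", "editor", "repo")],
}
```

Asking for `write` with a wildcard label returns only the `editor` grant that alice inherits from `eng`; asking for `read` returns both grants.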

Special entities like "ANYONE" (akin to Authenticated Users in Windows) and "ANONYMOUS" also exist, naturally, and can be granted. These are treated like groups in the summarized authz data. We also have a "SELF" special entity which allows one to express grants to any subject who is the same as the one running the process that calls check().

galaxyLogic · on June 9, 2019

Cool. Keep us posted

1023bytes · on June 8, 2019

Why is it called Zanzibar though? I'm kind of intrigued

pronoiac · on June 8, 2019

There's another thread about naming it - https://twitter.com/LeaKissner/status/1136691523104280576

The original name was Spice, which was nixed by a higher-up; they went to Zanzibar, one of the Spice Islands.

ddebernardy · on June 9, 2019

Odd. Zanzibar is off the coast of Tanzania in East Africa. The Spice Islands (the Moluccas) are in Eastern Indonesia.

pronoiac · on June 9, 2019

I was going by the twitter thread, but I looked and found this in Wikipedia:

> the Zanzibar Archipelago, together with Tanzania's Mafia Island, are sometimes referred to locally as the "Spice Islands" (a term borrowed from the Maluku Islands of Indonesia).

https://en.wikipedia.org/wiki/Zanzibar

GMLOOKO · on June 12, 2019

sonnyblarney · on June 8, 2019

What's interesting to me here is not the ACL thing, it's how in a way 'straight forward' this all seems to be.

It's the large architecture of a fairly basic system, done, I suppose, 'professionally'.

I'm curious to know how this works organizationally. What kind of architects are involved? This system would have to interact with any number of others, so how do they do requirements gathering? Do they just 'have experience' and 'know what needs to be done' or is this something socialized with 'all the other teams'?

And how many chefs in that kitchen once the preparation starts? Because there's clearly a lot of pieces. Do they have just a few folks wire it out and then check with others? Who reviews designs for such a big thing?

Or was all of this developed organically, over time?

delroth · on June 8, 2019

Zanzibar is basically the brainchild of a Bigtable Tech Lead + a Principal Engineer from Google's security and privacy team [1]. This led to a very sound and robust original design for the system. But it also greatly evolved over time as the system scaled up and got new clients with new requirements and new workloads.

[1] https://twitter.com/LeaKissner/status/1136626971566149633

the-rc · on June 8, 2019

Especially at Google, you first see the same problem appearing and getting solved in multiple products, then someone tries to come up with a more generic solution that works for most projects and, just as importantly, can serve more traffic than the existing solutions. Having to rewrite things on a regular basis because of growth is painful, but can also be a blessing in disguise.

Who that someone is who works on the generic solution, can vary. Sometimes it's one or more of the teams already mentioned. Sometimes, like in this case, it's someone with expertise in related areas that takes the initiative. And a project of this scope invariably gets reviewed on a regular basis by senior engineers, all the way to Urs (who leads all of technical infrastructure). Shared technologies require not just headcount to design and write the systems, but also to operate them (by SREs when they're large enough), so you need to get upper management involved as well.

sonnyblarney · on June 8, 2019

This project says way more about the organization than any specific technical competence.

I'm not close to Google, but from those I know on the product side it can be 'a Gaggle' with nobody really in charge ... but I guess if you have enough self-motivated conscientious actors, and mature people, without ugly turf wars, who can have reasonable discussions, and responsible enough people in charge that can steer things in an appropriate direction ... it works.

But the fact this is an evolution and not a 'new product' is probably a prerequisite - so many smart people are hard to corral around new ideas, but if it's been done A B C times, then a 'Z' solution speaks to an engineer's sense of efficiency and it should be natural for such an org to want to do it.

I won't name names, but I worked at a large tech company that could not get 'Single Sign On' to work. It was really frustrating to think so many reasonably smart people couldn't figure that out.

We don't need genius I think just a wealth of experience and a lot of common sense.

usaar333 · on June 9, 2019

The system is actually pretty complicated and nonobvious once you consider its caching layers, heavy reliance on Spanner, assumption that ACL read times can be stale, and the various assumptions and limitations in the namespace controls.

The underlying model of role based access control (and viewing groups as just other resources with ACLs) is already well known.

shereadsthenews · on June 8, 2019

That’s how you design at this scale: keep it simple, don’t be a jackass. If the result looks complicated from the outside, you blew it.

colesantiago · on June 8, 2019

I love reading about Google's systems, but I wish I could work on those problems at scale, that is my dream really. I wonder what more systems Google has that we don't know about.

I know Borg has become what we know as k8s but surely there must be more things that Google has made internally that are not open source.

Curious about this and would like to know more about it from anyone in the trenches at Google.

rifung · on June 8, 2019

> I love reading about Google's systems, but I wish I could work on those problems at scale, that is my dream really. I wonder what more systems Google has that we don't know about.

I work for Google and I used to have this exact thought too. I think the reality is not quite as rosy, though far from bad!

You have to realize that there are hundreds of people who work on systems like this, and as a consequence, your day to day work is more or less the same as what you would do on systems of a smaller scale.

Before I joined Google I always wondered what things they did differently and what magical knowledge Googlers must have possessed. After joining I realized that while on average the engineers are definitely more capable than other places I've worked, there's no special wisdom and instead they just have more powerful primitives/tools to work with.

Of course, maybe I am mistaken and just don't know of the magic?

ngrilly · on June 8, 2019

> there's no special wisdom and instead they just have more powerful primitives/tools to work with

Reminds me of compound interest. Google operates at a scale where the company has enough brainpower to design systems like GFS/Colossus and Borg, which enable systems like Spanner, which enable systems like Zanzibar, and so on.

fierro · on June 9, 2019

this is correct

ikiris · on June 9, 2019

There's still plenty of opportunity to do things at scale and change or replace major systems entirely.

gregorygoc · on June 8, 2019

The harsh truth of working at Google is that in the end you are moving protobufs from one place to another. They have the most talented people in the world but those people still have to do some boring engineering work.

izacus · on June 8, 2019

But you can reduce any job to this can't you? Pretty much all engineering is just moving some strings around.

jjeaff · on June 8, 2019

Work in finance and you can move integers around!

have_faith · on June 8, 2019

Work in the news industry and you can flip booleans! (too subtle?)

ci5er · on June 8, 2019

> (too subtle?)

It was for me! :-)

(Of course, because it's after 5pm somewhere, I'm unfortunately already lit)

rossjudson · on June 9, 2019

Can you believe that people at Google still have to, like, eat lunch and stuff? And talk to each other? The coffee is the same damn color every day too.

Maybe there's a place, somewhere, for the purest-of-the-pure non-boringest thoughts.

yegle · on June 8, 2019

"Larry&Sergey Protobuf Moving Co."

gniv · on June 9, 2019

https://qph.fs.quoracdn.net/main-qimg-b777f994326fa91fa509ca...

unixhero · on June 8, 2019

Sandstorm.io kentonv protobufs?

kentonv · on June 9, 2019

Oh hai, you rang?

Sandstorm.io doesn't use protobuf, it uses Cap'n Proto, which was designed to replace protobuf.

Fun fact: The Zanzibar project was started in ~2011 specifically to replace my main project at the time, which was trying to solve the same problems. Apparently, some senior engineers felt letting me work on core infrastructure was too dangerous. They succeeded in turning my project into a lame duck and making me quit, which is when I then started working on Cap'n Proto and Sandstorm.io. In retrospect I'm glad it happened.

Yeah... Google is not always the most fun place to work on big infrastructure projects.

duality · on June 8, 2019

What is the right data format to move around? JSON?

adrianmonk · on June 8, 2019

The point is you're writing mostly business logic and glue. You get a server request, you transform it with some logic, call some other servers, combine the responses and run some more logic, and return a response.

The scalability and interesting work has been factored out and handed off to infrastructure teams that build stuff like this auth framework, load balancers, highly scalable databases, data center cluster management tools, etc.

Which really is the smart way to do it. To the extent that you can stand on the shoulders of giants who've basically made scalability the default, you are free to focus on what you're actually trying to build. The only downside is if all the interesting engineering challenges are already solved for you, the remainder might not be that interesting to people who enjoy engineering challenges.

Xorlev · on June 8, 2019

It's just a saying. All we do is move protos from one service to another.

JSON is definitely not the right stuff.

dmoy · on June 8, 2019

The encoding/decoding cost is painful :(

I mean in this context if you're doing that level of scale. For a lot of purposes json is totally fine.

isatty · on June 9, 2019

It's really not - compared to the wire cost/static type checks and loads of other stuff you give up.

anonygler · on June 8, 2019

The most impressive part about Google is how its emphasis on internal standards has allowed it to build some really impressive stuff.

Eg, You can do a sql join on any dataset, in any datacenter. You can turn any query into a hosted visualization.

Every test invocation is streamed to a central server and results can be shared with a url.

There’s more, but those are my two favorites.

atombender · on June 8, 2019

Is the join a Spanner query, or is there a system on top of a Spanner that federates/aggregates databases?

summerlight · on June 9, 2019

Most ad hoc analysis doesn't even need Spanner. With Dremel, you can simply define a table on a bunch of sharded files and do a SQL query on them.

fierro · on June 9, 2019

yeah the TAP/Forge testing infra is pretty gnarly

idlewords · on June 8, 2019

Wait until you get a glimpse into the exciting world of real-time ad bidding. It's every engineer's dream!

manigandham · on June 8, 2019

Adtech is interesting because of the scale, complexity, and timing required compared to many other software projects. It gets a bad look but the engineering involved is not boring.

eeZah7Ux · on June 8, 2019

Borg is very different from k8s!

xfitm3 · on June 8, 2019

The worst part about any job is politics, the extremely competitive nature of Googlers makes it a less than fun place to work.

shereadsthenews · on June 8, 2019

The competitive nature of Googlers is what makes Google a very fun place to work.

xfitm3 · on June 8, 2019

Run in the rat race for a decade or two and report back.

manigandham · on June 8, 2019

The rat race typically describes the cycle of working to live and living to work without much else going on.

Competition at work is something different, and you're always in competition with other companies and people to survive anyway.

cameronbrown · on June 8, 2019

To each their own I guess.

TheMagicHorsey · on June 8, 2019

There are open source projects you can work on. You already mentioned one: Kubernetes. But there are also others.

jamesblonde · on June 8, 2019

Colossus is their data center scale filesystem. They don't talk about it.

the-rc · on June 8, 2019

Colossus is actually the only project I can think of for which they had one of the leaders sit down with Kirk McKusick and have a chat for ACM Queue, instead of a paper. https://queue.acm.org/detail.cfm?id=1594206

jamesblonde · on June 8, 2019

And they reveal exactly zero details. I know a bit about it, but not enough to say exactly what semantics it offers to file system clients. I believe it is not POSIX-like, hence the need to layer Spanner and GCS over it.

shereadsthenews · on June 8, 2019

There's been a teensy bit more details than that, e.g. [1]. If you think about exactly the file semantics that Bigtable would require (append, pread) that's exactly what is provided. Note that Colossus and D are two separate things. Google systems can use D without Colossus and a long time ago people used Colossus without D, although today Colossus/D is implied. The presentation gives the broad strokes of how Colossus is able to bootstrap itself from Chubby. It helps if you've also read the Bigtable paper [2].

1: http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke... 2: https://static.googleusercontent.com/media/research.google.c...

the-rc · on June 9, 2019

??? The original GFS paper, which the chat references repeatedly, was clear about the semantics not being POSIX-like. The interview mentions that, too, along with stuff like snapshots. Colossus is basically the same, with increased scalability.

Boulth · on June 8, 2019

On the other hand Google engineers are vendor-locked-in using Google specific tech one way or another.

shereadsthenews · on June 8, 2019

Yeah, the whole time I was there, every time I had to use Bigtable or Spanner I always muttered to myself “I wish I could be using some free software garbage instead of this proprietary Google stuff right now.” Every Googler secretly yearns for the performance, reliability, and elegance of MySQL.

sametmax · on June 8, 2019

Uncool, given that Google only exists thanks to free software.

Google started as a C, Java and Python shop. Android was born out of the Linux kernel.

If I remember well, long ago, the initial database they used was actually an in-shop fork of MySQL.

Easy to say bad things about free software now that you have billions and hundreds of geniuses to work on your own stuff.

Actually, this kind of comment reminds me of the "linux is cancer" era of Microsoft. Funny how Google is becoming the new MS now that they've got all the markets, while MS pretends to be nice now that they are not the top dog anymore.

kjeetgill · on June 8, 2019

I agree with you both! I don't think Google (or probably even the parent) meant to disrespect or discredit the value and contributions of open source. But internal, non-open platforms aren't really "vendor lock-in" or just NIH. They're often of much higher quality (if you have the resources) simply by solving the exact problems you have directly.

Disclaimer: I've been bitter about MySQL-for-everything-ism lately too but think Java is pure heaven.

the-rc · on June 8, 2019

To be fair, quite a few engineers at Google did yearn for MySQL and a single instance at that. Not for any of the traits you list°, but because it would have let them no longer worry about HA, replication, request hedging, key hotspots, etc. It would have also meant not having a product that works when there are more than a handful of users, but that's another story. BT was a lot of work to write for and that's why Spanner evolved the way it did.

°I sense some irony

pciexpgpu · on June 8, 2019

You should add a /s to make this more obvious.

lopsidedBrain · on June 8, 2019

Um, not really. Care to elaborate? If something is available open-sourced we're typically free to use it, as long as we are abiding by the license conditions.

jchw · on June 8, 2019

Not strictly true. Most software would probably require at least some modifications to run internally but as far as I know there’s no policy preventing open source software in production, quite the opposite.

For example, here’s some information about memcache: https://www.quora.com/Does-Google-use-memcached-or-does-it-u...

There’s more, but if I can’t find a reference to them on Google Search I’ll assume it’s not my place to discuss it publicly.

Using protobufs as a base layer may seem like lock-in, but it very much is the opposite. Protobufs are surprisingly simple and maybe even elegant once you get past the ugly parts, but most importantly it decouples software from arbitrary protocols and makes it much easier to deal with changing implementations. (Not to mention the potential for rich backwards and forwards compatibility.)

manigandham · on June 8, 2019

Why? The build-or-buy trade-off is very different at Google scale and this is one of the few organizations that can build everything in-house for their specific needs.

nippoo · on June 8, 2019

As a side-note: 95th percentile latency statistics are pretty meaningless at this scale. With a million requests per second, a 95th percentile latency of 10ms still means that 50,000 requests per second are slower than that.
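
To make that arithmetic concrete, here is a minimal sketch. The function name and the flat one-million-requests-per-second figure are illustrative assumptions, not numbers from the paper's tables.

```python
def requests_slower_than_percentile(requests_per_second: float, percentile: float) -> float:
    """Requests per second whose latency falls above the given percentile."""
    return requests_per_second * (100.0 - percentile) / 100.0


# Assumed steady load of one million requests per second.
for p in (95.0, 99.0, 99.9):
    slow = requests_slower_than_percentile(1_000_000, p)
    print(f"p{p:g}: roughly {slow:,.0f} requests/s are slower than the threshold")
```

At p95 this gives the 50,000 requests per second mentioned above; higher percentiles shrink that number but never make it zero.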

jsty · on June 8, 2019

They do give p99 latencies in the table on page 10

ktta · on June 8, 2019

99 - < 20ms

99.9 - < 90ms

That is amazing.

alexeldeib · on June 8, 2019

This is absolutely incredible. Since we saw login with Apple yesterday, makes me wonder if any of the other big companies can compete with this. Curious about Facebook/Netflix/Amazon.

Netflix seems zippy, but I've never looked at the request timings, which could differ pretty dramatically from UI load times. I imagine Google also dwarfs their login scale. Would be interesting to see numbers capturing full load time from clicking the login UI to successful redirect (or however you would measure this without including the time of the page load post-login).

lclarkmichalek · on June 8, 2019

This isn't a login service, this is an ACL service. Related space, but different concerns. You wouldn't send a user's password here to find out if it's correct (authentication), you'd use this to figure out if a user can do something once you know who they are (authorization) :)

Also, generating the login page etc is often more expensive than the actual 'validate the username and password'. Getting to the server is also going to dwarf these latencies; you probably don't store all your passwords in PoPs, so you need to make the full trek to your local Google datacentre to complete a login :)

jameshart · on June 8, 2019

In fact, validate the username and password might need to be artificially slowed down to protect against side channel and credential stuffing attacks :)

alexeldeib · on June 9, 2019

Awkward. I realized it was used for authz, but for some reason I assumed it would be used for authn as well. Now I’m wondering how Google does authn...

And yeah, the second half of my comment is trying to scope down the comparison to one that is reasonably “fair”

scottlamb · on June 9, 2019

> Now I’m wondering how Google does authn...

That's my corner of Google. We haven't published anything comparable to this paper in the time I've worked on it (maybe we could—I'm pleasantly surprised to see the Zanzibar folks got approval to share qps numbers and everything) but here's a bit about how it worked back in 2006:

https://www.usenix.org/legacy/event/worlds06/tech/prelim_pap...

Some of that still applies.

fwiw, while we do our fair share of password checking, we do a _lot_ more oauth token and cookie checking. Most folks just stay signed in on both mobile and web, so no need to recheck their passwords. In contrast, session credentials get checked on every request.

aasasd · on June 8, 2019

In addition to what the neighbor comment says about authorization, an ACL is an internal service: it provides an “if (the user is allowed to X) then ...” to the business logic code. It's not a user-facing service.
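
A rough sketch of what that looks like from the business-logic side. Everything here (the tuple store, `check`, `read_document`) is a hypothetical in-memory toy, not Zanzibar's actual API, and it does no group or relation expansion.

```python
# Hypothetical relation tuples: (object, relation, user).
RELATION_TUPLES = {
    ("doc:readme", "viewer", "alice"),
    ("doc:readme", "owner", "bob"),
}


def check(user: str, relation: str, obj: str) -> bool:
    """Return True if the exact tuple exists. No inheritance is modeled here."""
    return (obj, relation, user) in RELATION_TUPLES


def read_document(user: str, obj: str) -> str:
    # `user` is assumed to be authenticated already (authn happened elsewhere);
    # this function only decides whether the action is allowed (authz).
    if check(user, "viewer", obj):
        return f"contents of {obj}"
    raise PermissionError(f"{user} may not view {obj}")
```

Note that in this toy, bob is an owner but not a viewer, so `read_document("bob", "doc:readme")` raises. A real system would typically let one relation imply another; that part is deliberately left out.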

alexeldeib · on June 9, 2019

I did assume this system handled authn as well as authz, which was a mistake.

Rapzid · on June 8, 2019

I'm not saying this is the case here, but in my experience, depending on how a system at scale is distributed, the 0.1% outside your 99.9% may be concentrated on a specific user, group of users, or group of resources. So they may be getting 100% of their requests outside your 99.9% latency.

Something interesting to think about.
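
A toy illustration of the effect, with made-up numbers: 1,000 users send 10 requests each, and one user's requests are all slow. The nearest-rank percentile helper and the latency values are assumptions for the sketch, not measurements.

```python
def nearest_rank_percentile(values, per_mille: int):
    """Nearest-rank percentile; per_mille=999 means p99.9. Integer math avoids float rounding."""
    ordered = sorted(values)
    rank = max(1, (per_mille * len(ordered) + 999) // 1000)  # ceil(per_mille * n / 1000)
    return ordered[rank - 1]


FAST_MS, SLOW_MS = 5, 100

# 999 healthy users plus one whose every request is slow.
latencies_by_user = {f"user{i}": [FAST_MS] * 10 for i in range(999)}
latencies_by_user["unlucky"] = [SLOW_MS] * 10

all_latencies = [ms for per_user in latencies_by_user.values() for ms in per_user]
global_p999 = nearest_rank_percentile(all_latencies, 999)

unlucky = latencies_by_user["unlucky"]
unlucky_slow_share = sum(ms > global_p999 for ms in unlucky) / len(unlucky)

print(f"global p99.9 = {global_p999} ms")
print(f"share of the unlucky user's requests above it = {unlucky_slow_share:.0%}")
```

The global p99.9 looks healthy because the slow requests are only 0.1% of traffic, yet every one of them belongs to the same user.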

usaar333 · on June 9, 2019

That's a great point and something pxx numbers often miss.

Zanzibar almost certainly has this type of behavior. More complex ACL structures under recent evaluations perform worse.

I'd love to see percentile data for depth of operation, but clearly the paper is limited in content size.

demarq · on June 8, 2019

Not sure how I feel about adopting a country's name for a project.

Or more to the point, I'm not sure how I would feel if every time I searched my country's name on the web this Google project appeared rather than my actual country.

i.e. Zanzibar is a national identity, not just a "spice" island

NameOfTeam · on June 9, 2019

To be clear, Zanzibar is neither a country nor a national identity. It’s a semi-autonomous region of Tanzania.

ocdtrekkie · on June 9, 2019

Countries who have the Amazon rainforest within their borders are still a little annoyed about the company. https://mashable.com/article/amazon-domain-name-icann-approv...

stingraycharles · on June 8, 2019

Am I alone in thinking that 99.999% measured availability for a service so completely in the critical path for almost everything is relatively low?

Phrased another way, when it is not available, do end users experience service disruption, and if not, how is that mitigated?

gtirloni · on June 8, 2019

I think you might be alone. 5.26 minutes of down time per year is beyond excellent for any moderately complex system.
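
For reference, the 5.26 figure falls out of a one-line calculation. This sketch assumes a 365.25-day year and treats availability as a fraction of wall-clock time, which is a simplification.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming a 365.25-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} minutes/year")
# 99.999% works out to about 5.26 minutes per year.
```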

cheez · on June 8, 2019

I have a simple system that depends on another system and I can't keep it up for a week without 15 minutes of downtime.

jacques_chester · on June 8, 2019

You may have missed how they're defining it:

> We define availability as the fraction of “qualified” RPCs the service answers successfully within latency thresholds: 5 seconds for a Safe request, and 15 seconds for a Recent request as leader re-election in Spanner may take up to 10 seconds. ... To compute availability, we aggregate success ratios over 90-day windows averaged across clusters. Figure 5 shows Zanzibar’s availability as measured by these probers. Availability has remained above 99.999% over the past 3 years of operation at Google. In other words, for every quarter, Zanzibar has less than 2 minutes of global downtime and fewer than 13 minutes when the global error ratio exceeds 10%.

Basically, they're counting by number of requests. That's fairly typical for Google, who in their SRE book point out that measuring only total outages is a poor indicator of actual user experiences. Imagine if you had an electric company that had frequent brownouts and rolling blackouts but bragged about never having a total blackout. You'd be fairly unimpressed.

Google SREs also make the point that beyond five nines, your efforts are rendered moot by reliability issues you cannot control. Mostly network issues. If you have 99.99999% reliability but the mobile data network only has 99.99%, you've wasted a lot of money on something most folks will never notice.
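
A small sketch of both ideas: request-based availability, and what happens when a very reliable service sits behind a less reliable network. The function names are made up, and multiplying layers assumes their failures are independent, which real systems only approximate.

```python
from math import prod


def request_availability(successful: int, qualified: int) -> float:
    """Availability as the fraction of qualified requests answered successfully."""
    return successful / qualified


def end_to_end_availability(*layers: float) -> float:
    """Availability seen by the user when every layer must work (independence assumed)."""
    return prod(layers)


def equivalent_downtime_minutes(availability: float, window_days: int = 90) -> float:
    """Downtime that would produce the same ratio over the window, if traffic were uniform."""
    return window_days * 24 * 60 * (1 - availability)


# Seven nines of service behind a four-nines mobile network.
combined = end_to_end_availability(0.9999999, 0.9999)
print(f"end to end: {combined:.7%}")

# Five nines over a 90-day window is roughly 1.3 minutes of equivalent downtime.
print(f"{equivalent_downtime_minutes(0.99999):.2f} minutes per quarter")
```

The combined figure is dominated by the weakest layer, which is the point about wasted effort beyond five nines.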

stingraycharles · on June 9, 2019

Got it, this makes a lot of sense. Thanks for the explanation.

bradleyjg · on June 8, 2019

Overall uptime isn't the only stat that matters here; the distribution of downtime matters too. One 15-minute outage in three years is a lot worse than 900 one-second outages over that same time period. One-second blips are a part of the web; we click refresh and move on, not even knowing whose fault it was.

GauntletWizard · on June 8, 2019

It says greater than 5 nines, and it's usually much greater - in usual times, these core services are usually at six or seven nines as measured client side. But it doesn't take long at three nines to destroy your five nine SLA.
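
A back-of-the-envelope version of that last sentence, assuming uniform traffic, a 90-day window, and zero errors outside the degraded period (all simplifying assumptions of mine):

```python
def hours_to_exhaust_budget(window_hours: float,
                            target_availability: float,
                            degraded_availability: float) -> float:
    """Hours of degraded operation that use up the whole error budget for the window."""
    error_budget = 1 - target_availability   # allowed failure fraction over the window
    burn_rate = 1 - degraded_availability    # failure fraction while degraded
    return window_hours * error_budget / burn_rate


hours = hours_to_exhaust_budget(90 * 24, 0.99999, 0.999)
print(f"{hours:.1f} hours at three nines exhausts a 90-day five-nines budget")
```

Under those assumptions, less than a day at three nines spends the entire quarter's budget.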

The other portion is client side retry logic. It's incredibly easy for developers to mark a lookup with a retry policy and timer, and one of the reasons that that latency is so low is so that even if there's a timeout, the pageview can succeed. The application code doesn't see the error at all if the retry is successful, it just takes longer. The retry code is very good and it's already known at the first rpc call where the retry should go - the connection pool maintains connections to multiple independent servers.
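
A minimal sketch of that pattern, with hypothetical names throughout. This is not Google's RPC library; the "backends" are plain callables standing in for pooled connections to independent servers.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class BackendUnavailable(Exception):
    """Raised by a backend call that timed out or failed (hypothetical error type)."""


def call_with_retry(backends: Sequence[Callable[[], T]], max_attempts: int = 2) -> T:
    """Try backends in order, moving to the next independent server after each failure."""
    if not backends:
        raise ValueError("at least one backend is required")
    last_error: Optional[BackendUnavailable] = None
    for attempt in range(max_attempts):
        backend = backends[attempt % len(backends)]
        try:
            return backend()
        except BackendUnavailable as exc:
            last_error = exc  # the caller never sees this if a later attempt succeeds
    raise BackendUnavailable(f"all {max_attempts} attempts failed") from last_error


def flaky_server() -> str:
    raise BackendUnavailable("timeout")


def healthy_server() -> str:
    return "allowed"


print(call_with_retry([flaky_server, healthy_server]))
```

If attempts really were independent, two tries at a 0.1% failure rate would fail together about once per million requests, though correlated outages make the real number worse.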

xyzzy_plugh · on June 8, 2019

You might not be alone but five nines is pretty good.

I've seen many internal-facing teams across many companies have SLOs of four nines or less. Five is pretty rare.

lclarkmichalek · on June 8, 2019

It kinda depends what availability means? That .001% unavailability might be degraded service, might be .001% of clients having a bad time across the entire year, might be 'acts of god' (i.e. broken CPUs and the like). This kind of service is also usually fairly low down on the stack, and higher level applications can usually degrade gracefully. If they couldn't, complex applications such as Google would fail to operate; there's always _something_ broken.