Maybe a dumb question on standalone authorization services: does the authorization service end up having a representation for every single object in all of the rest of your datastores? (e.g. every document, every blob of storage, every user in every org).
If so, does that become a chokepoint in a distributed microservice architecture? Or can that be avoided with an in-process or sidecar architecture in which a given microservice's objects are not separately referenced in auth persistence? If not, how do folks determine which objects to register with the auth service and which to handle independently?
A Zanzibar-style service does not need _every_ object from your DB replicated into it, but only the relationships between the objects that matter for authorizing access. Many of these relationships require little/no metadata in your DB so they can live _solely_ in Zanzibar rather than being in both your DB and Zanzibar. This is pretty great because when permissions requirements change, you can often address them by only changing the Zanzibar schema, completely avoiding a database migration.
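To make "only the relationships" concrete, here is a rough sketch of a toy in-memory tuple store. `RelationStore`, the object names, and the tuple layout are made up for illustration; this is not the API of any real Zanzibar implementation.

```python
from collections import defaultdict


class RelationStore:
    """Toy store of (object, relation) -> subjects. Hypothetical, not a real API."""

    def __init__(self):
        self._tuples = defaultdict(set)

    def write(self, obj, relation, subject):
        # A subject is either a user id string or an (object, relation)
        # pair meaning "everyone who has that relation on that object".
        self._tuples[(obj, relation)].add(subject)

    def check(self, obj, relation, user, _seen=None):
        seen = _seen if _seen is not None else set()
        if (obj, relation) in seen:  # guard against cyclic group nesting
            return False
        seen.add((obj, relation))
        for subject in self._tuples.get((obj, relation), ()):
            if subject == user:
                return True
            if isinstance(subject, tuple) and self.check(subject[0], subject[1], user, seen):
                return True
        return False


store = RelationStore()
store.write("group:eng", "member", "user:alice")
store.write("doc:readme", "viewer", ("group:eng", "member"))
store.write("doc:readme", "viewer", "user:bob")
```

Note that the store never holds the contents of `doc:readme`, only its ID and who relates to it.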
>does that become a chokepoint in a distributed microservice architecture?
It actually does the opposite because now all of your microservices can query Zanzibar at any time to get answers to authorization questions that were previously isolated to only a single application.
Full disclosure: I work on https://authzed.com (YC W21) -- a permission system as a service inspired by Zanzibar. We're also planning on doing a PapersWeLove NYC on Zanzibar in the coming months, so stay tuned!
> because now all of your microservices can query Zanzibar at any time
This sounds a bit like a chokepoint. Is the important point here that Zanzibar is distributed and therefore is a good thing to be querying from all over the system (as opposed to one centralised application)?
Contrary to microservice cargo cult, it's possible to build a relative monolith that scales infinitely. The bottleneck is the db, but if you have a schema where data is easily sharded you can scale it infinitely.
There are plenty of giant monoliths that scale fine, like Google's analytics and Gmail. If you have a database that can scale, microservices are more about isolating code between different teams than any performance advantage.
The novel aspect of the Zanzibar paper is its application of distributed systems principles to avoid such a chokepoint. This includes not only the design of the service itself, but also the consistency model used in the APIs that are consumed by applications that make many operations cacheable.
As someone who’s not the founder of an authorization provider, I’d tend to agree with you. Sure looks and sounds and quacks like a choke point!
But it’s also fundamentally hard to avoid isn’t it?
The challenge is that authn is so easy to implement statelessly, since you can verify a token anywhere you have a public key. But authz is far more complicated, since it requires an ACL list of arbitrary length along with the token. It’s not like GitHub can stuff a list of every repository I can access into my access token.
>But authz is far more complicated, since it requires an ACL list of arbitrary length along with the token. It’s not like GitHub can stuff a list of every repository I can access into my access token.
This is exactly the problem that Zanzibar solves that makes it exciting!
I've written about why giant lists of claims are not a good way to structure permission systems[0] and Zanzibar-inspired services do not function this way.
Instead they ask you to query the API server when you need to check access to an item.
All API calls return a response along with a revision.
The response will always be the same at a given revision, which means you can cache the response.
If Zanzibar disappears, your app can function so long as content is not modified, which would force you to invalidate the revision.
And that's only if you want consistency in your permission system -- a feature that not all permission systems even support.
Most applications can tolerate just using the cached response regardless and relying on eventual consistency.
All of this is also ignoring the global availability of the Zanzibar service itself which it gets from using a distributed database like Spanner and replicating into data centers in every region in the world (which is why you want someone else to run it for you).
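A minimal sketch of the revision-keyed caching described above. `FakeAuthz` and `CachingClient` are hypothetical stand-ins, not a real client library, and a plain integer plays the role of the revision.

```python
class FakeAuthz:
    """Hypothetical permission service that returns (allowed, revision)."""

    def __init__(self):
        self.revision = 1
        self.grants = set()
        self.calls = 0
        self.available = True

    def grant(self, user, resource):
        self.grants.add((user, resource))
        self.revision += 1

    def check(self, user, resource):
        if not self.available:
            raise ConnectionError("authz service unreachable")
        self.calls += 1
        return (user, resource) in self.grants, self.revision


class CachingClient:
    def __init__(self, service):
        self.service = service
        self.cache = {}  # (user, resource, revision) -> bool
        self.known_revision = None  # last revision returned by the service

    def check(self, user, resource):
        key = (user, resource, self.known_revision)
        if key in self.cache:  # answer is stable at this revision
            return self.cache[key]
        allowed, revision = self.service.check(user, resource)
        self.known_revision = revision
        self.cache[(user, resource, revision)] = allowed
        return allowed

    def invalidate(self):
        # Call when content changes and a fresh answer is required.
        self.known_revision = None


svc = FakeAuthz()
svc.grant("alice", "doc:1")
client = CachingClient(svc)
first = client.check("alice", "doc:1")
svc.available = False  # simulate the service disappearing
second = client.check("alice", "doc:1")  # served from the cache
```

After `invalidate()`, the next check has to reach the service again, which is the point where an outage would actually hurt.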
As with everything, it depends on your requirements.
Say your goal is to externalize just your authorization policies from your code. A simple implementation might look like an OPA sidecar to your services, with the policy itself being sourced from a separate control plane - this might be something as simple as a centrally-managed S3 bucket.
The service implementation provides the attributes to OPA to allow it to evaluate the authorization policy as part of the query. e.g. which groups is this user in, what document are they accessing, is this a read, write or delete operation.
If you want to externalize sourcing of the attributes as well, that becomes more complicated. Now you need your authorization framework to know that Bob is in "Accounting" or that quarter_end_results.xls is a document of type "Financial Results".
You can either go push or pull for attribute sourcing.
The push model is to have the relevant attribute universe delivered to each of the policy decision points, along with the policy itself. This improves static stability, as you reduce the number of real-time dependencies required for authorization queries but can be a serious data distribution and management problem - particularly if you need to be sure that data isn't going stale in some sidecar process somewhere for some reason.
The pull model is to have an attribute provider that you can query as necessary, probably backed with an attribute cache for sanity's sake. The problems are basically the opposite set - liveness is guaranteed but static stability is more complicated.
The methods are not equivalent: in particular, the pull model is sufficient to answer simple authorization questions like 'can X do Y to Z?' - we pull the attributes of X, Y and Z and evaluate the authorization policy.
However, if you need to answer questions like 'to which Z can X do Y?', how does that work? For simple cases you may be able to iterate over the universe of Z's asking the prior question; but it generalizes poorly.
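The difference between the two questions can be sketched like this. The attribute tables and the `policy` function are invented for illustration (this is plain Python, not OPA or Rego), reusing the Bob/Accounting example from above.

```python
# Hypothetical attribute provider data (the "pull" side).
USERS = {
    "bob": {"groups": {"Accounting"}},
    "eve": {"groups": {"Sales"}},
}
DOCS = {
    "quarter_end_results.xls": {"type": "Financial Results"},
    "lunch_menu.txt": {"type": "General"},
}


def policy(user_attrs, action, doc_attrs):
    # Made-up rule: financial results are Accounting-only,
    # everything else is read-only for everyone.
    if doc_attrs["type"] == "Financial Results":
        return "Accounting" in user_attrs["groups"]
    return action == "read"


def can(user, action, doc):
    """'Can X do Y to Z?': pull three attribute sets, evaluate once."""
    return policy(USERS[user], action, DOCS[doc])


def list_accessible(user, action):
    """'To which Z can X do Y?': naive scan over the whole universe of Z."""
    return sorted(doc for doc in DOCS if can(user, action, doc))
```

`can` costs one policy evaluation, while `list_accessible` costs one per document in existence, which is why the naive approach generalizes poorly.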
Evaluated scenario was: a company employs a director and IT staff, the director contracts a consultant, the IT staff subscribes to external services. Find out what the company pays for directly and indirectly.
I've been writing about application authorization here: https://www.osohq.com/academy/chapter-2-architecture (I'm CTO at Oso, but these guides are not Oso specific). It covers this in the later part of the guide.
Depending on your requirements, yes that's kind of what happens if you want to centralise. It can make sense for Google-scale problems where you really do need to handle the complex graph of relationships between all users and resources, and doing that in any one service is non-trivial.
In practice though, a lot of service-oriented architectures can get the same benefits by having a central user management service, and keeping most of the authorization in each service. That central service can provide information like what organizations/teams/roles etc. the user belongs to, and then the individual services can make decisions based on that data.
This is the approach I covered with the hybrid approach. With this you can still implement most complex authorization models.
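A rough sketch of that split, with everything named here (`CENTRAL_DIRECTORY`, `get_user_info`, `DocumentService`) being hypothetical: the central service only knows orgs and roles, and the service that owns the documents makes the actual decision.

```python
# Stand-in for the central user management service.
CENTRAL_DIRECTORY = {
    "alice": {"org": "acme", "roles": {"admin"}},
    "bob": {"org": "acme", "roles": {"member"}},
    "mallory": {"org": "other", "roles": {"admin"}},
}


def get_user_info(user):
    """In practice this would be a network call to the central service."""
    return CENTRAL_DIRECTORY[user]


class DocumentService:
    """Owns its documents and its own authorization rules."""

    def __init__(self):
        self.docs = {"doc:1": {"org": "acme", "owner": "bob"}}

    def can_delete(self, user, doc_id):
        info = get_user_info(user)
        doc = self.docs[doc_id]
        if info["org"] != doc["org"]:
            return False
        # Service-local rule: org admins or the document's owner may delete.
        return "admin" in info["roles"] or doc["owner"] == user


documents = DocumentService()
```

The central service never learns that `doc:1` exists, so nothing per-object has to be registered with it.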
This is a really interesting question that gets at the heart of service federation.
I don't know the answer for Zanzibar, but take a look at how AWS IAM solves it. IAM has very few strong opinions in its model of the world (essentially it divides the world into AWS account ID namespaces and AWS service names/namespaces, and there's not much detail beyond that). Everything else is handled through symbolic references (via string/wildcard matching) to principals, resources, and actions in the JSON policies, as well as variables in policy evaluation contexts (and conditions, which are predicates on the values of those variables, or parameters to customizations (policy evaluation helper procedures) provided by each service).
IAM is loosely coupled with the namespaces of the services it serves, and that allows different services to update their authz models independently with pretty much no state or model information centralized in IAM itself. This is a key, underappreciated part of what makes AWS able to move so fast.
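To illustrate the symbolic-reference idea, here is a heavily simplified sketch. It is not the real IAM evaluation logic: there are no principals, conditions, or variables, and Python's `fnmatchcase` stands in for IAM's wildcard matching. The policy contents are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: statements refer to actions and resources purely by string.
POLICY = [
    {"Effect": "Allow", "Action": "s3:Get*", "Resource": "arn:aws:s3:::reports/*"},
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::reports/secret/*"},
]


def is_allowed(policy, action, resource):
    """Default deny; any matching Deny wins over any matching Allow."""
    allowed = False
    for stmt in policy:
        matches = fnmatchcase(action, stmt["Action"]) and fnmatchcase(
            resource, stmt["Resource"]
        )
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed
```

The evaluator knows nothing about what an S3 object is. A service can add new actions or resource shapes without the evaluator changing at all.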
I can't speak for Google, but I'm working on something similar as a personal project and here is my architecture:
Each service has its own store of objects. Each store also has a directory of Metadata describing the objects contained in each service.
When you send an Auth request to a service, the service you are sending the request to looks up which service is the authority for the given object and then routes the request to that service for auth.
You can do away with the Metadata store if you offload responsibility for remembering which store to use to the user. You provide them with a cookie that tells any of your Auth servers which store contains this user's data.
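Roughly, the routing could look like the sketch below. `AuthStore`, `Router`, and the `hint` parameter (standing in for the cookie) are names invented for this example.

```python
class AuthStore:
    """One service's own store of objects and who may access them."""

    def __init__(self, name):
        self.name = name
        self.acl = {}  # object_id -> set of users

    def authorize(self, user, object_id):
        return user in self.acl.get(object_id, set())


class Router:
    """Any auth server: finds the authoritative store, then forwards."""

    def __init__(self, stores):
        self.stores = {store.name: store for store in stores}
        self.directory = {}  # metadata: object_id -> store name

    def register(self, object_id, store_name, users):
        self.directory[object_id] = store_name
        self.stores[store_name].acl[object_id] = set(users)

    def authorize(self, user, object_id, hint=None):
        # `hint` plays the role of the cookie; without it, use the directory.
        store_name = hint or self.directory.get(object_id)
        store = self.stores.get(store_name)
        if store is None:
            return False  # unknown object or store: deny
        return store.authorize(user, object_id)


router = Router([AuthStore("photos"), AuthStore("billing")])
router.register("photo:42", "photos", ["alice"])
```

With a hint, the directory lookup is skipped entirely, at the cost of trusting the client to carry the right store name.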
You can and it doesn't have to be a choke point; as far as the ACL is concerned it's just a namespace (the microservice) and an opaque ID inside that namespace.
What the Zanzibar paper describes is two big things:
(1) The auth service gives those microservices an ability to set their own inheritance rules so that you do not need to store the fullest representation of those ACLs. If you are properly targeting the DDD “bounded context” level with one bounded context per microservice, then in theory your microservice probably defines its own authorization scopes and inheritance rules between them. (A bounded context is a business-language namespace, and it is likely that your business-level users talking about “owners” in, say, the accounting context are different than the users talking about “owners” in a documentation context, or whatever you have.) Some upfront design is probably worthwhile to make the auth service handle that, rather than giving the clients each a library implementation of half of datalog and having each operation send a dozen RPCs to the auth service for each ACL check.
(2) The microservices agree on part of a protocol to allow some eventual consistency in the mix for caching: namely, the microservices agree that their domain entities will store these opaque version numbers called zookies (that the auth service generated) whenever they are modified, and hand them to the auth service when checking later ACLs. This, the paper says, gave them an ability to do things like building indexes behind the scenes to handle request load better, without sacrificing much security. Most of the ACL operations are going to not affect your one microservice over here because they happen in a different namespace or concern different objects in the same namespace: so, I need a mechanism in my auth service to tell me if I need an expensive query against the live data, or if I can use a cache as long as it's not too old.
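A toy sketch of that freshness decision. Real zookies are opaque tokens; here a plain integer version stands in for one, and `AuthService` with its snapshot is invented for illustration.

```python
class AuthService:
    """Hypothetical auth service with live data plus a cheaper, possibly stale snapshot."""

    def __init__(self):
        self.version = 0
        self.live = set()  # (object, user) pairs
        self.snapshot = frozenset()  # stands in for a precomputed index
        self.snapshot_version = 0
        self.live_reads = 0

    def write(self, obj, user):
        self.live.add((obj, user))
        self.version += 1
        return self.version  # the "zookie" the caller stores with its entity

    def refresh_snapshot(self):
        self.snapshot = frozenset(self.live)
        self.snapshot_version = self.version

    def check(self, obj, user, zookie):
        # The snapshot is usable if it is at least as fresh as the zookie.
        if self.snapshot_version >= zookie:
            return (obj, user) in self.snapshot
        self.live_reads += 1  # too old: pay for the expensive live read
        return (obj, user) in self.live


svc = AuthService()
z1 = svc.write("doc:1", "alice")
svc.refresh_snapshot()
z2 = svc.write("doc:2", "bob")  # newer than the snapshot
cached_answer = svc.check("doc:1", "alice", z1)
live_answer = svc.check("doc:2", "bob", z2)
```

Only the second check touches live data, because its zookie is newer than the snapshot.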
This is relevant only if you have zillions of objects of hundreds of types and more such types of objects are likely to emerge in future (as you launch more products/use-case).
And you have billions of users and their sharing permission models are complex and likely to keep evolving in the future with more devices, concepts of groups/family etc.
In such a scenario, doing access control in a safe and secure way that scales and evolves well to such a large base is itself a major undertaking. You want to decouple the access control metadata from the data blob storage itself so that they each can be optimally solved for their own unique challenges and they can evolve independently too.