AWS Identity service handles 400M API calls every second (amazon.com)
101 points by yagizdegirmenci on Aug 15, 2021 | 59 comments



Seems 400M is aggregate QPS worldwide. Wonder what average QPS looks like per IAM server (and the size of each server).


Also, what about peak TPS/QPS per IAM server? Averages usually flatten the graph.


Say what you want about Amazon and how they treat their developers; from the outside this looks like something really cool to work on.


Big numbers. Are there any talks or papers out there about how IAM is implemented under the hood?


Not sure on IAM, but Google put out a paper on Zanzibar.

https://research.google/pubs/pub48190/


I can confirm that IAM definitely does not work like Zanzibar - at my last job we started regularly hitting issues where various AWS components would refuse to work because an IAM policy got too big, so it's definitely not querying another API for specific permissions but somehow passes around the whole resulting policy :-|


What kind of policies were you writing? IAM is structured in a hierarchical fashion to limit policy complexity, so performance is fairly consistent at the cost of expressiveness. Zanzibar-like systems tend not to limit complexity/expressiveness, relying instead on denormalized views of the data to keep queries fast.

Disclaimer: I'm a founder of authzed (YC W21), a productized Zanzibar implementation
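For contrast, a rough sketch of how a Zanzibar-style check works (toy data and names, not Google's or our actual API): permissions are stored as relation tuples and a check walks the graph, which is why denormalized views matter for keeping queries fast.

    # Zanzibar-style relation tuples: (object, relation) -> set of subjects (toy example).
    tuples = {
        ("doc:readme", "viewer"): {"user:alice", "group:eng#member"},
        ("group:eng", "member"): {"user:bob"},
    }

    def check(obj, relation, user):
        # Is `user` related to `obj` via `relation`, directly or through a userset?
        for subject in tuples.get((obj, relation), set()):
            if subject == user:
                return True
            if "#" in subject:  # e.g. "group:eng#member": recurse into the group
                sub_obj, sub_rel = subject.split("#", 1)
                if check(sub_obj, sub_rel, user):
                    return True
        return False

    print(check("doc:readme", "viewer", "user:bob"))  # True, via group:eng#member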


What we know is that we started seeing an increasing number of "policy is too big" errors from AWS APIs. Not IAM APIs, but things like SageMaker being unable to start an EC2 instance because the networking system couldn't evaluate the IAM policy (the error messages were annoying in being both quite specific and ultimately not helpful).

The computational complexity of the policies wasn't high (in fact, they were pretty much key-value things; the only time we had something more complex was in AssumeRoleWithWebIdentity calls - btw, that API sucks, embarrassingly so - everything else was a plain list of what a role could access).
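For a sense of what I mean by "key-value things" - roughly this shape (names and ARNs made up), a flat allow-list that just kept growing until it hit the size limits:

    # Python dict standing in for the JSON policy document (hypothetical example)
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": "arn:aws:s3:::example-team-bucket/*",
            },
            # ...and a few hundred more statements like this, which is how the
            # document eventually tripped the "policy is too big" errors.
        ],
    }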


Auth0 are also working on a Zanzibar-inspired system. Some interesting notes on their exploration via: https://twitter.com/auth0lab


400M RPS is mind-boggling, even if one considers most of the data to be embarrassingly shardable. Kudos to the engineers who built it.

Otoh, it bothers me that every single service call needs to go to IAM to check for permissions. Has anyone explored other architectures/designs to circumvent centralized auth?


So three calls per minute for every human being on the planet.


Agreed, this is nothing to brag about.


Imho, the fact that this is an impressive achievement speaks to how much overhead the web costs.

"400m operations per second? Wow, that's almost 5% of the number of operations per second a typical consumer processor can do!"

I realize this is an apples to Buicks comparison, I just get bothered by how millions of anything might be impressive when we have PCs that are designed with billions of everything.


> I just get bothered by how millions of anything might be impressive when we have PCs that are designed with billions of everything.

That is a strange thing to be bothered by. Are you similarly unimpressed by a $400M lottery jackpot, since CPUs can execute billions of instructions per second?


Have AWS or its employees ever spoken about the stack they run for IAM? Sounds fascinating.


400M QPS seems mind-boggling. What other things exist at this scale?

As people have pointed out, something like Dynamo is likely FAR more complex - but my understanding is that it's also 3-4 orders of magnitude smaller.


These are staggering numbers, but my understanding is that IAM roles are wholly contained within the boundaries of an AWS account, so it seems like something that could relatively easily be partitioned horizontally with no shared state required between instances.
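A back-of-the-envelope sketch of what that could look like (purely an assumption for illustration, not how IAM actually works): route all of an account's identity data to a shard derived from the account id, so a single evaluation never needs state from another shard.

    import hashlib

    NUM_SHARDS = 1024  # made-up figure

    def shard_for(account_id: str) -> int:
        # All policies/roles for one AWS account land on the same shard.
        digest = hashlib.sha256(account_id.encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("123456789012"))  # made-up account id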


Policies can be shared across accounts with Organizations; in addition, ARNs can be set to allow roles access from other accounts.

The boundary is mainly self-contained, but there can be interactions outside it, sometimes significant ones.

Having said that, under the hood, who knows.


How often do identities or access settings change? Not often, I guess. So it seems (to me) that they are accessing almost static data at a very high speed, which is not really a surprise. Am I missing something?


Even if the response changes only very infrequently, permissions are not what you want to be serving stale responses for. Can you confidently invalidate the right caches in a system serving 400M QPS? I would bet that relatively little is cached in IAM today.


Not very stale, but there is definitely a period where the policy is unstable: e.g. if you allow some action, then for a relatively short while (usually less than a minute) you can see stale responses where the action is still blocked, sometimes interleaved with allowed access.


It depends on what exactly they're providing. Are they providing bounded staleness for certain types of reads? And is that the case for all reads, or do they also cover the new enemy problem? I.e. do they prevent people who previously had access to a resource from viewing changes to that resource?


IAM is eventually consistent... because it's caching so much. We can see changes easily take a minute to propagate.


Replication delay and stale caches are two different problems with vastly different causes


Can you elaborate more? Is it not possible to have replication delay due to stale caches?


I'm not aware of a data store which populates its replicas with data from a cache. Surely there are caches within any given data store, but the whole point of a replica is to have the data replicated (for scaling reads, redundancy in case your leader fails).

If you're using caches, you're deliberately sending old data to the replicas. I can't think of why you'd do that. I guess you could have replicas of your caches, but I'm referring specifically to data store replicas.


Is a replica not a cache of sorts?


Not really. Replicas are constantly replicating live data from your leader. All of it, in its raw form. A query against a replica isn't inherently faster than a query against the leader, the data is likely just a bit old.

A cache stores the output of an operation to avoid needing to perform the same operation again. You deliberately put data into a cache to avoid needing to create that data a second time—it won't magically appear there.
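A toy illustration of the distinction (made-up data, not any real store):

    primary = {"alice": ["s3:GetObject"]}  # source of truth
    replica = dict(primary)                # raw copy of the data, possibly slightly behind

    cache = {}

    def can_read(user):
        # Read-through cache: stores the *result* of an evaluation, and is only
        # populated when someone deliberately computes and stores that result.
        if user in cache:
            return cache[user]
        result = "s3:GetObject" in replica.get(user, [])  # the "expensive" operation
        cache[user] = result
        return result

    print(can_read("alice"))  # True: computed against the replica, then cached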


You don't always need to shoot down caches for consistency. See the Zanzibar paper for how one system maintains consistency without cache invalidation.
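The mechanism, as I read the paper, is a consistency token ("zookie") returned on writes; later checks evaluate at a snapshot at least as fresh as that token, so older cached snapshots stay usable instead of being shot down. A very simplified sketch:

    class AclStore:
        def __init__(self):
            self.version = 0
            self.snapshots = {0: set()}  # version -> set of (obj, relation, user) tuples

        def write(self, tup):
            self.version += 1
            self.snapshots[self.version] = self.snapshots[self.version - 1] | {tup}
            return self.version          # handed back to the client as a "zookie"

        def check(self, tup, zookie):
            # Evaluate at any snapshot at least as fresh as the zookie; requests
            # without a zookie can be served from older cached snapshots.
            at = max(v for v in self.snapshots if v >= zookie)
            return tup in self.snapshots[at]

    store = AclStore()
    z = store.write(("doc:readme", "viewer", "user:alice"))
    print(store.check(("doc:readme", "viewer", "user:alice"), z))  # True, no cache invalidation needed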


> Even if the response changes only very infrequently, permissions are not what you want to be serving stale responses for.

But perhaps the permission updates are slow. Like, you change a permission, wait 5 seconds, and the system tells you that they are updated across all servers.


The delay is either in replicating the data to all of the sources of truth, or invalidating the cache. Invalidate the cache too soon and you're recaching stale data. Caches only increase the delay more.


Classic Hacker News comment: 400M RPS belittled as commonplace service development. You truly underestimate how much throughput this is.


Just getting to 400M RPS (from scratch, without using AWS or a CDN to help) for even a toy app / static content is a huge accomplishment.


Sure, from a numbers standpoint it's a lot. But from a technical perspective, serving mostly static data is not that interesting.


> But from a technical perspective, serving mostly static data is not that interesting.

The difference between mostly static and completely static is where the interesting bit is. The two categories beget completely different solutions, especially when serving stale data is not acceptable, as in the case we're looking at.


Well, I guess it is to some people...


You aren't aware of what is involved in a system like this. You believe it's simple, and so you believe it would be technically uninteresting. But you are mistaken in that belief. Go read the Zanzibar paper[1] -- Google's analogue to the AWS system we're discussing -- then come back and tell me with a straight face that that is technically uninteresting. If so, I'll at least learn that we have different views on what technical interesting-ness is.

[1]: https://research.google/pubs/pub48190/


Google’s problems are interesting. Amazon’s are mostly brute force obviousness.


> you truly underestimate how high of throughput this is

I suppose this comment falls in the same category as "never underestimate the bandwidth of a truck full of tapes".

Yes, the numbers are huge. Whether it is a technical achievement is another question entirely.


A lot of the calls in the organizations I have worked for use AssumeRole and are usually associated with temporary credentials. Those are often fairly short-lived and not static.
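The usual flow looks something like this (boto3's real STS AssumeRole call; the role ARN and session name are made up):

    import boto3

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/example-app-role",
        RoleSessionName="example-session",
        DurationSeconds=900,  # deliberately short-lived
    )
    creds = resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration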


The policy documents don't change that much, I think, but there is more than that going into evaluating a policy. You've got resource + org policies, all kinds of tags, complex conditionals, per-assumption stuff, etc. Fascinating problem space.


“Accessing almost static data at a very high speed”

This is a globally distributed service with massive-throughput and ultra-low-latency requirements, where a failure brings millions of customers and a majority of online commerce as a whole to a halt. What you’re referring to is just the very tip of the iceberg.


Identities can be sharded very easily.


You seem to think that this is just a service that reads and writes data like any CRUD app.

The hard problems stem from how the system deals with failures and how it propagates writes across replicas while meeting latency and consistency SLAs. On top of that, the system needs to be built in a way that it can be maintained by many developers, each working on a small piece without knowing the ins and outs of the system as a whole. In addition, when the system fails, debugging and mitigation need to be parallelizable across many developers so that availability SLAs can be maintained. You can read about this in “Designing Data-Intensive Applications” by Martin Kleppmann, where he discusses the complexity involved in building distributed systems.


People and things need access tokens all the time. Off-the-shelf services can handle most loads - even things like AD can scale out to millions of users.

I used to work on a 20M+ user identity service that didn’t see anything like this volume, but when utilization went up it was usually just a matter of adding more replica databases and auth servers.


The scale is the problem: the assumptions you make for a 100k RPS system break for a 1M RPS system. For example, in a 10k RPS system “just shard it and add another replica” might work, but at 100x the throughput your replication strategy itself can break.

The load isn't the only issue, either: you also have availability SLAs to meet, which force you to design the system to tolerate faults at that scale or else half the internet gets taken down. Your system also has to be broken up at a granularity where many engineers can work in parallel to diagnose and mitigate an issue when your service does go down, to minimize downtime. “Scale breaks everything,” as they say.


Here's an idea: implement a static HTML page with a counter. The counter is updated every 15 minutes. You serve this HTML page 400 million times per second. It needs at least 99.99% uptime. When the counter is updated, ensure new requests have the latest value within a second.


> ensure new requests have the latest value within a second.

That's the point: nobody said permission updates have to happen within a second. They are performed infrequently, so the system can afford to make them slow.


> Q: What happens if I delete an IAM role that is associated with a running EC2 instance?
>
> A: Any application running on the instance that is using the role will be denied access immediately.

https://aws.amazon.com/iam/faqs/

The P90 value provided to us in 2017 (as a large enterprise considering AWS) was under 1 second.


Google's Zanzibar claims a P95 latency of 10 milliseconds, by the way.


What does that have to do with your claims here?


Yes - their claim has no relation to time-to-response, so static data, etc. is irrelevant.

Thought experiment: if it suddenly took 10 seconds to respond to the calls, the number of responses still equals the number of requests, so the request velocity mentioned in the headline remains the same.


The amount of work in the system at any given instant goes up with latency, however, assuming a constant request rate. That would actually make the scale of this service _more_ impressive.


And it would be equally (if not more) impressive.


It would be interesting to see how much each server can handle - that is the engineering part.


While this is an impressive technical achievement, it's worth taking a step back to ask why IAM needs to serve 400M API calls every second when AWS has between one and two million active users. How would this number change if IAM were less complex?

Edit: I understand how every machine needs to invoke IAM APIs and how temporary credentials and other uses increase the number super-linearly with every active user. Still, 400M RPS (nearly 35B requests/day) could be reduced significantly by improving the underlying object model so it scales down better. Right now, even a simple Lambda function that needs to access other AWS resources requires 3 API calls: create a policy, create a role, and connect the two.
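Roughly, in boto3 terms (names and the policy body are made up for illustration):

    import json
    import boto3

    iam = boto3.client("iam")

    # 1. Create the role the Lambda function will run as.
    iam.create_role(
        RoleName="example-lambda-role",
        AssumeRolePolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }),
    )

    # 2. Create the policy describing what the function may access.
    policy = iam.create_policy(
        PolicyName="example-lambda-policy",
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::example-bucket/*",
            }],
        }),
    )

    # 3. Connect the two.
    iam.attach_role_policy(
        RoleName="example-lambda-role",
        PolicyArn=policy["Policy"]["Arn"],
    )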


> it's worth taking a step back to ask why IAM needs to serve 400M API calls every second when AWS has between one and two million active users

The thought process that would lead to such a statement intrigues me.

IAM activity scales with total aggregate activity across all workloads that run on AWS, which I would assume scales with the total activity of the billions of people who interact with AWS directly or indirectly each day.


IAM isn't only used for users. Every API call needs to do auth/authz, including cross-service.


Just for clarity, 400M requests/second × 86,400 seconds/day is:

3.456 × 10^13

34 560 000 000 000

~35 long-scale, European billions (10^12)

~35,000 short-scale, American/English billions (10^9)



