If these keys are so important, why are they not treated with similar security measures as CA keys?
Why is it just a handful of keys shared by dozens, hundreds, or likely even thousands of servers? Where is the HSM-backed root key, per-DC intermediate, and per-server transient signing key? Where is the certificate revocation list?
Is it even possible to secure a bearer token from an attack like this? Should we go back to a nonce which we present to the originating service to securely retrieve the payload?
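To make the nonce idea concrete, here is a minimal sketch of what such an exchange could look like on the originating service's side. Everything in it is an assumption for illustration (the `ArtifactStore` name, the 30-second lifetime, the in-memory dict); it is not how any particular identity provider does it. The relying party only ever receives the random nonce and has to come back over a direct channel to redeem it once.

```python
import secrets
import time


class ArtifactStore:
    """Hypothetical in-memory store kept by the originating service."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending = {}  # nonce -> (expiry, payload)

    def issue(self, payload):
        # The nonce carries no information; it is only a lookup handle.
        nonce = secrets.token_urlsafe(32)
        self._pending[nonce] = (self._clock() + self._ttl, payload)
        return nonce

    def redeem(self, nonce):
        # pop() makes the nonce single-use: a replay finds nothing.
        entry = self._pending.pop(nonce, None)
        if entry is None:
            return None
        expiry, payload = entry
        if self._clock() > expiry:
            return None
        return payload
```

The trade-off is an extra round trip to the originating service for every login, in exchange for there being no signing key to steal in the first place.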
> If these keys are so important, why are they not treated with similar security measures as CA keys?
No incentive, right? If you mess up as a CA, you go out of business because every browser drops you. But if you're an identity provider and you have an incident, maybe 1% of your customers opens an angry support ticket and you give them 5% off their bill next month. So why bother? Are YOU going to move your 100,000 employee organization off of Microsoft because of this? No? Then why would they care at all? They won the sale and you're stuck with what you get.
Here's a list of companies that have gone out of business because they didn't take information security seriously:
<list not found>
Here's a list of billionaires who had to sell their yachts because they didn't prioritize security over other concerns:
I get what you’re saying, but there’s more nuance here. The smaller the business is, the likelier it is to fold within 1-1.5 years of a major compromise. I’m basing that on some FBI statistics published about BECs a while back.
I would posit that your first list needs a conditional for “too big to fail”.
Did you mean to reply to another comment? I don’t see the correlation between your comment and mine. I didn’t assert the US was small business friendly, just that smaller businesses are much more at risk than it seemed the parent comment acknowledged.
Which I think is an important point because by volume the majority of businesses in the US are small and mid-sized businesses.
>No incentive, right? If you mess up as a CA, you go out of business because every browser drops you.
Not going out of business is not the only possible incentive though. Security is a pretty key part of the sales pitch of every cloud provider: "We can do this better than your in-house team" they keep saying.
No one is going to move 100,000 employees off Microsoft over this, but what about signing up the next 100,000? Will it impact next quarter's growth figures?
And for the people directly responsible for security at Microsoft, breaches like these may be a career risk or at least a huge embarrassment. There will be meetings. People will get blamed. Some will feel they haven't met their own professional standards.
I'm not sure I would call it technical merit, but sentiment and reputation do play a role in negotiations. No cloud provider wants to be perceived as having a weakness in terms of security compared to its closest competitors.
There are debates about risks when large companies move to the cloud - in some industries more than in others. Opposing groups within companies as well as competitors' sales people will be looking for weak spots to make their case.
In isolation, incidents like this don't make a difference. If it looks like a pattern that would be bad for business though.
Does Microsoft have a reputation for good security, adequate security or poor security?
How has this affected sales over the years?
I believe they won the PC wars years ago and there has never been a good replacement system with enough customer incentive that met the same price point. It is this cow they continue to milk.
It's not clear to me why there aren't more signing keys. After all, the public part is exposed at a URL you can retrieve whenever you're verifying the token.
I've seen service provider implementations where the access token is exchanged for a cookie or other similar token, in which case you can't really do anything about it anymore for the lifetime of the cookie.
But if you're always verifying the access token with a reasonably short cached key list, the service provider should be able to refuse the token reasonably quickly once the key is no longer advertised.
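A rough sketch of the "short cached key list" idea, assuming a hypothetical `fetch_keys` callable that returns the currently advertised `{kid: key}` mapping (in practice that would be an HTTP fetch of the provider's key URL). The class name and the 5-minute default are made up for illustration.

```python
import time


class CachedKeySet:
    """Caches the advertised key list for at most ttl_seconds."""

    def __init__(self, fetch_keys, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch_keys = fetch_keys  # hypothetical: returns {kid: key}
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = {}
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # Copy so later changes at the source are only seen on refresh.
            self._keys = dict(self._fetch_keys())
            self._fetched_at = now

    def get(self, kid):
        # Returns None once the key is no longer advertised and the
        # cache has expired, so the caller can refuse the token.
        self._refresh_if_stale()
        return self._keys.get(kid)
```

With this shape, the worst-case window in which a withdrawn key is still accepted is the cache lifetime, which is the knob the comment above is pointing at.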
Anecdotally, I ran an experiment with Envoy to see how far the number of signing keys could scale. This was for a B2B “API Key” auth solution; we wanted user keys to be self-revocable, but still in a relatively standard JWT format for maintainability. The hypothesis was that, rather than running a whitelist or blacklist, we could improve the security signature by having 1 signing key for each JWT.
When we ran some stress tests, turned out Envoy could happily run with ~300K signing keys in its JWK Set before noticeable service degradation occurred. Even then, by bumping up the memory on the validation servers, there was a small sacrifice of a few ms per extra 100K keys.
This makes me fully agree that, for many applications, there’s probably an opportunity to vastly improve the security surface by bumping up the number of signing keys dramatically.
As long as both Keys and Signing keys define a KID, key verification is prefaced only by a hash table lookup or a tight loop through a keyset to find the appropriate Signing Key, before the slower verification procedure.
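For illustration, the lookup-then-verify order might look like the sketch below. It is not Envoy's implementation. To keep it standard-library only, it uses HMAC-SHA256 (`HS256`) as a stand-in for the asymmetric signatures a real deployment would use, and the `sign`/`verify` helper names are invented here. The point is only that the `kid` in the token header selects one key from a dict before any signature math happens, and that deleting the dict entry revokes that one token.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(kid: str, secret: bytes, claims: dict) -> str:
    header = b64url(json.dumps({"alg": "HS256", "kid": kid}).encode())
    payload = b64url(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{b64url(mac.digest())}"


def verify(token: str, keys_by_kid: dict):
    """Return the claims if the token verifies, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        sig = b64url_decode(sig_b64)
    except ValueError:
        return None
    if not isinstance(header, dict):
        return None
    # Cheap step first: hash table lookup by key ID.
    secret = keys_by_kid.get(header.get("kid"))
    if secret is None:
        return None  # unknown or revoked key
    # Slower step: the actual signature check.
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(b64url_decode(payload_b64))
```

Because the dict lookup is constant time on average, the cost of having hundreds of thousands of keys is mostly memory, which matches the stress test described above.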
I guess 640K keys ought to be enough for anybody (TM)(r) (c)
More seriously, though, I wonder how AzureAD is implemented and how hard it would be to scope keys per tenant, if not per application. If I'm not mistaken, SAML certificates are per application.
If you want 1 signing key per JWT, you would need to generate a new key pair for each JWT; wouldn’t that be too expensive? Or was the generation included in your tests?
>People don't seem to know that old saying about not putting all your eggs in one basket anymore.
Probably because by the time this saying was forged, there wasn't this big push from chickens to integrate as many features as possible and improve their functionality and QoL that's only available when you do put all the eggs in the same place.
Exactly... People just don't learn. For example currently there is a huge number of companies here in Poland that are migrating all of their cloud apps to GCP (Google cloud) from AWS. Why? Because Google has opened a region in Poland.
AWS has live customer support you can ring (if you pay for it) and frequently as a client you get an account manager you can call in case of trouble and that person has direct access to support teams. These support people can actually fix stuff for you. Back in the day I've handled lots of support cases like this.
Google on their website claims they also offer live support, but I read enough stories with headlines like "Google deleted my business overnight and there's no one to talk to" to question its usefulness. I haven't had a chance to deal with Google's GCP support yet so I don't know how good or bad it might be, but I had a couple of support cases raised for other products (play store, book publishing etc) and it was obvious people that work there can't really do anything if stuff breaks. They're there just for Google to be able to say "we too have live support" and to tell you how to do stuff in lieu of documentation. When stuff breaks... You get an email "we're escalating it _to_developers_" to never hear from them again (or you get an email every month asking you if the issue is still ongoing).
So I think it is the biggest case of "putting all your eggs in one basket" I saw in a while. If anyone has contrary experience of GCP support I'd love to hear it.
While I was always suspicious of using “Google” for any critical business purposes, I have also learnt that “Google” the search engine is different than “Google Cloud” the internal division within “Google” that runs GCP.
I have yet to see any examples of “Google Cloud” shutting down services on a whim or not providing human customer service agents.
I've not interacted with GCP support very often, but I have used GCP extensively for the past 3 years coming from an AWS background previously.
All in all it's been a good experience and it's become my favourite cloud. I find the documentation spot on for the most part, and covering things in both approachable language and diving deep where needed.
The IAM system is easy to use, GKE workload identity works great, PubSub works a dream and Big Query is amazing (though pricey!). SDKs are well documented and generally have nice APIs as well.
We've had very little reason to need support, operating a bunch of GKE clusters, VPNs to various partners, plenty of databases, buckets, message queues, etc (non-trivial setup).
I do have some minor gripes that come to mind:
- Cloud SQL not having a richer API, I'd love to be able to manage postgres permissions by IAM group membership, and grant/revoke postgres roles using the rest API instead of connecting as a postgres user (it'd make secure automation via terraform etc easier if I could lean on IAM)
- VPC peering only allowing "one hop", which necessitates proxies even when using private service connect with two GCP products in some instances (eg: cloud SQL to datastream - why can't we just peer the two Google managed VPCs together? That would also avoid the proxy in our VPC)
- On datastream, why can't I grant a Google managed service account IAM access to postgres instead of configuring a user/pass based user?
- Why can't I configure a longer token expiration for artefact registry? When developing locally I don't want to reauth npm every 30 minutes
- Occasionally missing APIs prevent automation using terraform
So like anything it's not perfect, but it also has felt like it's continuously improved over time, so I remain a happy developer.
Yep my employer is all-in on Office 365, Teams, the whole thing. I have a supply of popcorn ready for when it all comes crashing down. One thing I know is they won't blame themselves.
> One thing I know is they won't blame themselves.
Which is precisely why they went all in on Microsoft Cloud. Using an in-house stack (no matter if it's Jira, Confluence, Exchange, AD, Postfix, Exim, OpenLDAP, Samba, ...) will always lead to people blaming the C level for any outage, hack, whatever. Miss one tiny little patch and insurance won't pay.
Go for Atlassian Cloud/Microsoft/AWS/GCP? You can now deflect any blame onto the cloud provider. No personal liability, nothing. You followed industry best practices, so insurance pays out.
I’m with you in that I prefer not to use MS tech. That said, you might be creating an unhealthy dynamic for yourself by expecting it all to crash and burn. Objectively, MS tech still gets the job done.
My employer has had to invest I don't even want to know how much in an EDR (or whatever it's called nowadays) and a slew of services around this circus just to pretend their Windows systems are secure. Didn't catch a home-grown cryptolocker, though, and happily allowed it to encrypt its own files.
They mostly use web-based SaaS applications anyway, so could trivially replace Windows with an OS with a better security track record.
Plus, AzureAD always seemed kludgy. Until recently, they insisted on having SMS or phone as a second factor for password recovery. Which allowed you to reset the stronger 2nd factor used for auth. They only started supporting Fido tokens like a year or so ago for regular 2fa. Their authenticator is a joke: until recently, you had no idea what you were approving. It still doesn't support group inheritance, so if you base it on your local AD as the source of truth, you have to jump through more hoops and add more ad-hoc groups and maintain them. Good times.
Ah yes, I long for the halcyon days of IT stability… oh wait, business has always been fraught with risk and trade-offs. You get insurance, put mitigations in place, and hope when it does go bad, you can blame Microsoft anyway.
I understand your point, but the probability math works against you as the number of platforms grows, unless you have excellent data segregation and access controls. What happens in most cases is that each new platform serves as an attack vector to be exploited to gain access to all of the data.
From the point of view of the employer it's stupid not to put all your eggs in one basket. For $10 a month per account, you get email (unless you have thousands of accounts, this is already worth it), instant messages and a softphone solution, office tools that you needed to buy anyway, and centralized user management for it all. All alternatives involve either still contracting it out to a probably less competent team, or hosting it in house at significant cost. On top of that, the day it all crashes down and burns their business to the ground, insurance pays out because it was certified and wasn't an in-house solution that the insurer can point at and drag you to court with.
Assuming square coops, four small coops have literally double the attack surface compared to one big coop of equal total area. The ratio only gets worse when you divide them up further. The impact of a breach may be less for many small coops, but breaches will happen more often as well.
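The coop arithmetic checks out, taking "attack surface" to mean total fence length. A quick sketch (the function name and the area of 100 are arbitrary choices for the example):

```python
import math


def total_perimeter(total_area: float, coops: int) -> float:
    """Total fence length for `coops` equal square coops sharing `total_area`."""
    side = math.sqrt(total_area / coops)
    return coops * 4 * side


print(total_perimeter(100, 1))   # 40.0
print(total_perimeter(100, 4))   # 80.0
print(total_perimeter(100, 16))  # 160.0
```

In general n coops give n * 4 * sqrt(A / n) = 4 * sqrt(A * n), so the surface grows with the square root of the number of coops: four coops double it, sixteen quadruple it.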
Our researchers concluded that the compromised MSA key could have allowed the threat actor to forge access tokens for multiple types of Azure Active Directory applications, including every application that supports personal account authentication, such as SharePoint, Teams, OneDrive, customers’ applications that support the “login with Microsoft” functionality, and multi-tenant applications in certain conditions.
Oh wow I hadn't heard of this! Their naming debacles are an endless source of entertainment lol
When I think of a security product, I really want to think about the vendor letting people _in_ instead of keeping bad actors _out_ (entra from entrar = to go in, to come in) /s.
In Spanish, it's actually the imperative, so it makes you think of ordering someone to go in lol!
> The old public key’s certificate revealed it was issued on April 5th, 2016, and expired on April 4th, 2021, and its thumbprint matched the thumbprint of the key Microsoft listed in their latest blog post, named “Thumbprint of acquired signing key”
Am I reading this right? The key was expired? And still in use??
I have no idea how validity works for the systems these keys are used in, but I do know that in general there are two ways that key or certificate based verification handles an expired key or certificate. Anyone know which of these is the way whatever Microsoft was using these things for works?
In one of them, which is the way TLS verification works, it goes something like this when checking certificate Cn that is signed by Cn-1 which is signed by ... is signed by C0.
    time_check = now()
    for cert in Cn to C0
        if time_check < cert.valid_from || time_check > cert.valid_to
            return EXPIRED
    return NOT_EXPIRED
Each certificate's expiration is checked against the current time.
The other, which is used for code signing, goes something like this:
    time_check = now()
    for cert in Cn to C0
        if time_check < cert.valid_from || time_check > cert.valid_to
            return EXPIRED
        time_check = cert.issue_time
    return NOT_EXPIRED
Cn is checked against the current time. The rest of them are checked against the time at which they signed the next downstream certificate.
I understand why code signing works like that. It's essentially digital notarization, and you don't want your notarized documents to become no longer notarized just because the notary public you used has since stopped being a notary public and let their license expire.
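For anyone who wants to play with the difference, here are the two checks described above as runnable Python. This is only a sketch of the two models, not Microsoft's or any library's validation logic; the `Cert` fields mirror the pseudocode, and the chain is assumed to be ordered from Cn (leaf) to C0 (root).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Cert:
    valid_from: datetime
    valid_to: datetime
    issue_time: datetime  # when this cert was signed by its issuer


def valid_tls_style(chain, now):
    """Every certificate in the chain must be valid right now."""
    return all(c.valid_from <= now <= c.valid_to for c in chain)


def valid_code_signing_style(chain, now):
    """Cn is checked against now; each issuer is checked against the
    time at which it signed the next certificate downstream."""
    time_check = now
    for cert in chain:  # ordered Cn ... C0
        if not (cert.valid_from <= time_check <= cert.valid_to):
            return False
        time_check = cert.issue_time
    return True


# Example: a root valid 2016-2021 that issued a leaf in 2020.
root = Cert(datetime(2016, 4, 5), datetime(2021, 4, 4), datetime(2016, 4, 5))
leaf = Cert(datetime(2020, 1, 1), datetime(2025, 1, 1), datetime(2020, 1, 1))
now = datetime(2023, 7, 1)

print(valid_tls_style([leaf, root], now))           # False
print(valid_code_signing_style([leaf, root], now))  # True
```

The example dates are invented, apart from borrowing the 2016 to 2021 validity window quoted upthread for the root. Under the first model the expired root fails the chain; under the second it passes because the root was still valid when it signed the leaf.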
"Storm-0558 acquired an inactive MSA consumer signing key"[1]
They should have said:
"Storm-0558 acquired an EXPIRED MSA consumer signing key"
And when they said:
"a validation issue allowed this key to be trusted for signing Azure AD tokens."[1]
They should have said:
"MULTIPLE validation issues allowed this key to be trusted for signing Azure AD tokens."
And when Microsoft said:
"The actors are keenly aware of the target’s environment, logging policies, authentication requirements, policies, and procedures."
They should have said something like:
"The actors could have been aiming for a brazen hit-and-run and didn't stop for a second to think about policies and other nonsense. Alternatively, the actors were unsophisticated and completely unaware of facts like the US State Department paying us a stack more money for the privilege of being able to log mailbox access events in detail (thanks Boeing for the tip from MCAS!). Additionally, the actors appear to have neglected the fact that Chrome 92 was last updated in 2021 and therefore a fairly bad choice of user agent to use for an attack against a US department whose ICT systems should be using the latest version of web browsers. Additionally, the actors likely had no idea how US State Department mailboxes are accessed, and made a very poor guess of it by using random public data centres throughout Europe."[2][3]
Keys on an HSM are difficult to use by a team. Network attached HSMs are a thing but they have poor integration with the toolchains developers often use. Recently I was able to write my own ideal tool to help bridge the gap between toolchains and hardware secure enclaves: https://keymux.com
> Keys like this should never be used by a team. They should never touch a developer or devops workstation.
That's the point of KeyMux: keys never leave a secure enclave. In fact, KeyMux has no logic to load private keys or to perform key operations in-memory, except in the context of ephemeral TLS. The TLS layer does support using HSM keys for mTLS authentication, so you can use a security token for peer mTLS authentication to a network attached HSM.
That's the whole point--you can use all the tools built and commonly used with in-memory keys, but using enclaved keys, instead.
A follow-on idea I'd like to spend time on, if I have the chance, is to write modules for Vault and/or some KMIP implementation that integrate with the app so that authorization for key operations on a network-attached key store requires the consent of multiple parties (with that consent using local secure enclave-based authentication, of course). So, for example, if you're going to roll a release, one developer or an automated pipeline submits the request but the signing awaits confirmation from other team members.
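As a rough illustration of the multi-party consent flow (not KeyMux, Vault, or KMIP code; every name here is hypothetical), the gate in front of the key operation could be as simple as a quorum check. How each approver proves who they are is out of scope for this sketch, and `sign` is just a callable standing in for the enclave or HSM operation.

```python
class SigningRequest:
    """A pending key operation that needs `quorum` distinct approvals."""

    def __init__(self, requester, digest, approvers, quorum):
        eligible = set(approvers) - {requester}
        if quorum < 1 or quorum > len(eligible):
            raise ValueError("quorum cannot be met by eligible approvers")
        self.requester = requester
        self.digest = digest
        self.quorum = quorum
        self._eligible = eligible
        self._approved = set()

    def approve(self, who):
        # The requester is never eligible, so nobody can self-approve.
        if who not in self._eligible:
            raise PermissionError(f"{who} may not approve this request")
        self._approved.add(who)  # a set, so repeat approvals don't count twice

    def ready(self):
        return len(self._approved) >= self.quorum


def sign_if_approved(request, sign):
    """Call the (hypothetical) key-store operation only once quorum is met."""
    if not request.ready():
        raise PermissionError("not enough approvals yet")
    return sign(request.digest)
```

In the release example, the pipeline would create the request and the signature would only be produced after enough other team members have called `approve`.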
> High performance HSMs capable of handling 10k+ transactions/second are well within the price range of a well funded startup.
Microsoft knows this, and even offers HSMs as a cloud service.
HSMs simply don’t scale up to something the size of azure AD. Even if you could use 10K+ of them in a global cluster, copying keys between HSMs inherently exposes the master keys anyway. And how do you secure access to the HSMs, with another secret shared on every validation server? Turtles all the way down.
I had to draw a line so I could actually release something. But Windows and Linux support is definitely a target. In fact, the core software was originally developed on both macOS and Linux with PC/SC smart cards (both PIV and OpenPGP) and Vault as key stores, but without any GUI components--everything compiled into a set of PKCS#11 (for OpenSSH, OpenSSL ENGINE, etc) and PC/SC modules (for GnuPG). And I stayed away from macOS and Linux APIs as much as possible to ease a Windows port.
But friends and coworkers I explained and showed the idea to didn't get it conceptually (e.g. people have the idea of key rotation drilled into their head as if that's the alternative to HSMs, instead of it being a mitigation for a fundamentally broken key management ecosystem), plus most people just wanted something they could point SSH_AUTH_SOCK at, so there needed to be a daemon/menubar/taskbar service. Ultimately the hard part was modeling and building a GUI around the concept, so it would be easier to understand and use. To get something out the door I targeted my daily desktop environment, macOS. It's using Yue as the GUI toolkit, which supports Linux/Gtk, macOS, and Windows.
I took a year off of work to finally get the idea out of my head, which I had been mulling over for many years. But now I find myself in the middle of a downturn in the software engineering job market (anybody hiring or interested in investing?), so while I have work in progress to round out the macOS app features, Linux and Windows (which is really where the commercial viability exists, I think), will need to wait until I have some cashflow.
I'd like to eventually release the Linux work as open source. But FWIW if someone has a specific use case in mind and is willing to fund development, I could very quickly build and release a [non-GUI] Linux package. PKCS#11 and PC/SC Linux modules still build in the tree, and adding a TPM key store adapter alongside the other internal adapters would be relatively simple. In fact, a Linux PKCS#11 module for accessing Vault Transit Engine keys with TPM 2.0 mTLS authentication would be maybe 1-2 weeks of effort; slightly less if just straight TPM 2.0 support. Most of the implementation is already there, but polishing and testing something which can be supported long-term takes some effort.
Oh yeah. The MacOS is something magic you’ve worked out. I use my yubikey for ssh keys and I was never able to figure out how to get macOS to work for other processes (like IntelliJ) unless it was started from a shell. Then one day IntelliJ changed how that works and it hasn’t worked since.
Anyway, this is really cool. I use Windows, Linux, and Mac every day for work and having a consistent method of doing this kind of stuff sounds amazing. Keep up the good work.
One of the most popular HSMs is the Thales Luna Network HSM, which can perform 20,000 ECC operations per second [1]. Even at the size of Azure AD, Microsoft may not need a lot of HSMs for signing purposes. HSMs are not particularly easy to manage, though; maybe that is one of the reasons they are not used as much as they should be.
There are two scenarios. First: Microsoft holds the JWT signing keys in memory, and the attackers were able to get access to them by injecting code or by obtaining a memory image of such a process. Second: Microsoft actually uses HSMs but has to distribute the keys geographically, and the attackers were able to get access to the key that way.
The first scenario is more likely, but you cannot exclude the second either.
A hardware security module (HSM) is a physical computing device that safeguards and manages secrets (most importantly digital keys), performs encryption and decryption functions for digital signatures, strong authentication and other cryptographic functions. These modules traditionally come in the form of a plug-in card or an external device that attaches directly to a computer or network server. A hardware security module contains one or more secure cryptoprocessor chips.
A tamper resistant piece of hardware that does key management as well as encryption services. Essentially once a key is installed (or generated) it never sees the light of day again.
Should we just legalize the FAANGs to engage in cyber warfare/espionage against each other? Seems like then they would have to reach a minimum of best security practice instead of relying on obscurity.
They all have bug bounty programs. If you avoid causing actual damage or downloading actual customer data, and disclose what you find, it’s already legal.
It's hard to imagine many worse compromises than a foreign power getting into the email account of our representative to that power. I hope the US government is seriously rethinking the initiatives it's undertaken to move computing to major cloud providers.
Update on the summer Friday news dump: by Monday evening, it's been crickets.
And in the first 5 pages of Google news search hits for "microsoft" (before giving up), I saw only 2 mentions total. Most of the hits are PR fluff, product promotion, and stock market noise 'news'.
> The full impact of this incident is much larger than we Initially understood it to be. We believe this event will have long lasting implications on our trust of the cloud and the core components that support it, above all, the identity layer which is the basic fabric of everything we do in cloud. We must learn from it and improve.
How about not putting everything in the cloud for starters. And I think the whole problem is not even the cloud. It's the broken capitalistic economy that in the end always begets monopolies or at least a cartel of a few big companies. Because for the long tail of small businesses it does not make sense to use products and services from the big guys. So, giving more power to the market winners ends up in too-big-to-fail companies. Yet, these big companies only have to make one mistake, one slip up, and the target surface is just too big. It's not a question of if, but when such incidents will happen. Only federated services are the answer for the long term survival of society, sacrificing a part of convenience.
> It's the broke capitalist economy...always begets monopolies or at least...
In computers, "it" is mostly the "nobody ever got fired for buying IBM" side of human nature. Business leaders are not techno-geeks, dreaming of bigger and better federated services. IT is generally a major expense - both on their bottom lines, and in the amount of time & grief they have to spend learning & deciding about and fighting with computer sh*t.
Metaphor: If the weather is iffy, and they still decide to fly on Microsoft Airlines...well, they'll have plenty of company and sympathy when the flight gets canceled and they're late. Vs. if you're the maverick who takes the train - then the whole extra load of figuring out train schedules and stations and stuff is on you (vs. "everybody knows" about air travel). And anything going wrong with the train (when Microsoft Air gets there okay) makes you stick out like a steaming pile on the kitchen floor.
> IT is generally a major expense - both on their bottom lines, and in the amount of time & grief they have to spend learning & deciding about and fighting with computer sh*t.
Certainly! Yet, just when people muscle memorized Office 2003 by 2007, Microsoft pulled the rug and introduced Ribbon. And people still use .doc instead of .docx for loads of stuff!
No, companies that pass certain financial thresholds get access to capital on very different terms. Access to capital, when managed well, is an opportunity to build countless "moats" and take advantage of low-profit long-term market changes. As a small business who has to pay real bills each month or fail, there is no contest.
source-- actual small tech business experience in the USA
They are in many ways equivalent to certificate authorities' keys.
Organizations using Microsoft and Azure services should take steps to assess potential impact.
People don't seem to know that old saying about not putting all your eggs in one basket anymore.