I'm curious why they didn't look into using Kubernetes projected service account tokens for authenticating to Vault. The tokens Kubernetes issues this way are not stored in etcd, and they contain pod-specific metadata, so they are invalidated as soon as the pod dies (when validated via TokenReview). Alternatively, they can be used to authenticate with Vault directly, since they're OIDC-valid tokens.
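For reference, a projected token is requested via a `projected` volume in the pod spec; something roughly like this (the pod, service account, and audience names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # illustrative name
spec:
  serviceAccountName: example-sa
  containers:
  - name: app
    image: example/app:latest
    volumeMounts:
    - name: vault-token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: vault-token
    projected:
      sources:
      - serviceAccountToken:
          path: vault-token        # file name inside the mount
          audience: vault          # token is only valid for this audience
          expirationSeconds: 600   # short-lived; the kubelet rotates it
```

Because the token is bound to the pod's lifetime and a specific audience, a stolen copy is much less useful than a legacy service account token.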
The semantics around secrets in Kubernetes aren't nearly as robust as Vault's, so I was surprised not to see this called out more clearly (e.g. list secrets == get all keys and values). Even if you use KMS/AES encryption at rest (which they reference), that doesn't help with access control.
We don't allow read or list of secrets by any human, although of course that's not a perfect control.
I was working with Mesos (before k8s had taken the world by storm) and had a similar issue: how do services get Vault tokens without a workflow that may include storing "secrets" in configuration? What I ended up writing was a tool that a service could query, with its Mesos Task ID, to get a token. The tool would then read its own configuration, as well as the current Mesos state, to determine whether the request was valid.
Unlike the k8s solution, as I understand it, you don't need to treat the 'service account token' as a secret (reducing the attack surface of when someone steals that token _and_ also has access to Vault). This is accomplished in two ways:
1. You can determine whether a request is valid by looking at how long the service has been running. If someone steals a Task ID, but the service has already been running for 2 minutes, the Task ID is useless.
2. The Vault token is only issued once per service launch. This means that if an attacker steals the Task ID, but the token has already been given out, the Task ID is useless. If the attacker beats the service in asking for that key, then the service should raise the alarm about its key potentially being stolen. If you are even more paranoid, you could decide to invalidate all active keys, reducing the time the attacker holds a valid Vault token.
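The two rules above can be sketched as a tiny broker (this is a toy model with hypothetical names; the real tool would consult the actual Mesos state endpoint rather than in-memory bookkeeping):

```python
import time


class TokenBroker:
    """Toy model of a once-per-launch Vault token broker.

    MAX_AGE_SECONDS and the in-memory `_issued` set are illustrative;
    a real broker would validate against live Mesos task state.
    """

    MAX_AGE_SECONDS = 120  # rule 1: long-running tasks can't request tokens

    def __init__(self):
        self._issued = set()  # Task IDs that already received a token

    def request_token(self, task_id, task_start_time, now=None):
        now = time.time() if now is None else now
        if now - task_start_time > self.MAX_AGE_SECONDS:
            # rule 1: the task has been running too long; a stolen
            # Task ID is useless at this point
            raise PermissionError("task too old for token issuance")
        if task_id in self._issued:
            # rule 2: a second request means the first token may have
            # been stolen -- time to raise the alarm
            raise PermissionError("token already issued for this task")
        self._issued.add(task_id)
        return "fake-vault-token-for-" + task_id  # placeholder value
```

The interesting property is that a *second* request for the same Task ID is itself a strong signal of compromise, which a real deployment could wire up to alerting or mass revocation.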
This was largely designed years ago, with some cues I took from one of Vault's lead engineers, so I expected that the k8s integration would work similarly. I'm not too familiar with k8s, however, so there might be other constraints I'm overlooking in why Vault's k8s integration works the way it does.
I don't think there is anything with the same guarantees as the Task ID, sadly.
Asking as a university student: is this a common number of microservices to have running in production? It looks like Monzo has about 1,351 total employees. If all of them were software engineers, this would be a little less than one microservice per engineer. How do you handle code reuse and reliability among thousands of microservices? It seems like the number of possible failure states would be unthinkable.
There seems to be a bit of difference in what people mean by 'microservices' - some orgs will have a few REST collections per 'service' so you might end up with 'users', 'products', 'transactions' as your three services, and it being totally unimaginable that you'd ever break 1000.
I'd argue that's still a Service-Oriented Architecture (SOA), but I'm sure it's not anywhere near as 'micro' as what Monzo counts over a thousand of.
Thanks for replying, that does clear things up. My experience didn't really embrace RPC: we used gRPC for one small bit (less than the whole service that contained it, as I recall), but most was JSON HTTP APIs, with service boundaries pretty much just being team boundaries, though some teams (including mine) had a few services each.
I've yet to work for a company that runs over 100 microservices (or at least, not that I'm aware of). But I can tell you that having a tool like Kubernetes certainly makes it a whole lot easier to maintain this many microservices. Without a container orchestrator, I think it would be much harder.
Most of your secrets will be, or should be, just random bits, maybe in some particular format that was convenient for a particular application, e.g. a 4-digit PIN, a sixteen-character hexadecimal string, or 10 characters of A-Za-z0-9.
So for these cases there's no reason that secret should ever be on a developer's laptop. Best case, the developer made a truly random secret; maybe they (like me) keep a set of hexadecimal dice on their desk for random choices. Just as likely it's tainted: the developer ran a randomizer until out popped a string they liked, or they found one on a web site, or used the same one as in their test environment.
Either way, since what you wanted was random bits, it makes sense in most cases to have a feature that just spits the right format of random bits into the secure system without any human seeing them at all. (Not all cases: obviously a secret key you were sent by somebody else, for example an API key for their system, will have to be copied somehow.)
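Generating those formats with a CSPRNG so no human ever sees the value takes only a few lines; a sketch covering the example formats above, using Python's standard `secrets` module:

```python
import secrets
import string


def random_pin(digits=4):
    """4-digit PIN, uniform over 0000-9999."""
    return "".join(secrets.choice(string.digits) for _ in range(digits))


def random_hex(chars=16):
    """Sixteen lowercase hexadecimal characters."""
    return "".join(secrets.choice("0123456789abcdef") for _ in range(chars))


def random_alnum(chars=10):
    """10 characters of A-Za-z0-9."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(chars))
```

In practice the generation would run inside the secret store itself (Vault can generate random bytes and keys server-side), so the plaintext never touches a laptop at all.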
Even better, in cases where it's an option, is not to rely on stored secrets at all. I think Monzo's post is not worrying about this difference, but it can be critical in terms of decisions about debugging to prefer to have entirely ephemeral secrets. When a pod goes away, the ephemeral secrets that pod had vanish with it, and so you aren't storing them anywhere anyway. If they aren't stored, they can't get stolen by anybody and you've got one less thing to go wrong.
You hit on a good point. Vault has features to eliminate this security risk, if I’m understanding correctly.
The first feature is dynamic secrets: this generates an ephemeral, leased set of credentials that are unique per client. For a Kubernetes pod, it would get a unique set of DB credentials, for example. These are tied to the service account (used for auth). When the auth expires, so do the credentials (they’re dropped from the DB, and if the DB supports it we also drop connections).
The second feature is root credential rotation. To use the above feature, a user had to at some point “paste” the superuser credentials into Vault. As you pointed out, there’s a risk here. So what Vault can do is _immediately_ rotate that credential, so that after configuring Vault it is no longer valid and only Vault knows the real credential. We support this for most database backends, for example.
If you combine these two elements, you get fully ephemeral secrets that are unknown to anyone except the necessary user. There’s a lot more we could talk about, and a lot more features around this, but that’s the high-level point!
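The lease lifecycle described above can be approximated in a toy model (this is not Vault's API, just the behaviour the comment describes; all names are illustrative, and real dynamic secrets create an actual short-lived DB user):

```python
import time


class LeasedCredential:
    """Toy model of a Vault dynamic secret: unique per client, dies with its lease."""

    def __init__(self, client_id, ttl_seconds, now=None):
        self.issued_at = time.time() if now is None else now
        self.ttl = ttl_seconds
        # hypothetical per-client credentials; Vault would create a real,
        # uniquely named database user at issue time
        self.username = f"v-{client_id}-{int(self.issued_at)}"
        self.password = "generated-at-issue-time"

    def is_valid(self, now=None):
        """The credential is only usable while the lease is unexpired."""
        now = time.time() if now is None else now
        return (now - self.issued_at) < self.ttl
```

When the lease expires (or the client's auth is revoked), Vault deletes the DB user, so there is nothing durable to steal.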
Yes, thank you.
This is one of those features that once you've lived with it, you can't imagine going back. Essentially, every secret gets an automatic expiration date in the near future. This has several effects:
1. It teaches people to never hard-code secrets anywhere, because they always expire. So people will follow fairly strict credential-management rules out of sheer laziness.
2. It guarantees that you don't have stale secrets lying around in random corners of your company. So even if somebody does record a secret somewhere they shouldn't, the window of attack may only be a few hours.
We've recently been escalating, via sales, GitHub issues we see a use for, and hopefully that gets back to product management. Not everyone is able to do that, however.
The feedback from customers definitely gets back to product management, so for your case, that works.
Yep, so we have a different system for securely generated random keys. If it's a Twilio API key, it realistically has to pass through a dev laptop, and it's not that much of a big deal. If it's, e.g., an RSA key, we will generate it on an airgapped laptop, encrypt it to a public key, and then we have a Vault plugin that decrypts it and writes it into Vault. So the unencrypted data is never anywhere but the airgapped laptop or Vault.
We also try to generate keys inside of Vault where possible, and we generate a lot of certificates this way.
I know nothing about banking, but that surprises me.
The one thing keeping me from adopting isn't that it won't be secure enough. I worry _constantly_ that I'll lock myself out of my data, my infrastructure, etc.
Have others had that worry? How'd you get over it and just start using Vault? (probably through incremental, low-risk adoption first?)
- Absolutely take an incremental approach, there is a learning curve
- Use a high-availability backend. I started out with just Consul, but have since gone with Consul for HA and CockroachDB as the data store.
- Practice your disaster recovery many times before you go all in
- Use a combo of paper plus offline digital storage for the unseal keys and root tokens
- Be diligent about roles/policies
- Rely on tokens more than any other auth method (username/pass, certs)
- Don’t be afraid to use many Vault instances that can talk to a core instance for things like transit auto-unseal
All that said, I absolutely think it’s worth the investment if you have the infrastructure to back it (i.e. a cluster without a SPOF).
The PKI backend is phenomenal, and it makes a great sidecar for any app that needs auth, secrets, or general crypto stuff using the transit backend.
I really want to create a user friendly Password Manager/TOTP front end backed by vault. Someday…
On a more general note, it's taken us a long time to get comfortable with using Vault for increasingly critical things, and we are now at the point where an outage of Vault is extremely serious. But there are several components like this, and we are able to tolerate node failure, so it's somewhat acceptable.
I'm hoping you get some replies, but I'm also thinking of a low risk approach first, specifically with temporary SSH keys and temporary DB credentials.
I am not saying that Vault makes leaks impossible; it is just that I trust their team a lot more than myself when it comes to building a secret management tool.
We're hard at work on a v2 that will be self-hostable as well as more powerful, flexible, and robust across the board so that it will be able to handle just about any configuration or secrets management scenario you can throw at it, from small teams to enterprise, while maintaining a simple, it-just-works approach.
In the meantime, our app and client libraries implement true end-to-end encryption that is open source and well-documented, meaning that despite our current product being cloud-hosted, we could not access your secrets even if we wanted to.
1 - https://security.envkey.com/
So take your time, and slowly move things to it, but vault is almost certainly worth your time and energy.
Plus combined with nomad and consul and the rest of the hashi stack it's pretty easy, and hard to get wrong.
Vault is a company-wide "root account": reachable from every part of the internal network, storing all the company's secret data, and with reach into many internal and external systems like databases, where it has full rights to dynamically configure short-lived credentials. Doesn't that put too much trust in a single system?
I guess what I’m getting at is: there probably isn’t a perfect answer, just tradeoffs. And if history has taught us anything, it’s a case of “when”, not “if”, something is attacked or broken. If so, perhaps you should partition your data, including infrastructure, so as not to rely on just one Vault server for everything. Outside of that, or monitoring, the only other clear answer everyone leans on is “store it in the cloud”, under the assumption that the hardware and people processes at cloud companies will be more secure overall than anything you’d develop. Which is exactly the attack vector Vault prevents: saving permanent access tokens. Off the cuff, that’s how I see it. I’m not actually in SRE or security, so I’d welcome other opinions.
Also I love the level of openness. No other bank would do this. (If you're in the UK, Monzo is honestly amazing. I've cancelled all my other bank accounts.)
We try to make the posts accessible to our customers as well as engineers :)
Totally agree! I don't have a Monzo account yet, but I find this level of transparency extremely appealing. I know people who work for a few "traditional" banks (RBS, Santander, Lloyds), and by all accounts the IT setup is an absolute shit-show. I can kind of understand, though; they've been around so long that they have a ton* of tech debt and legacy crap to deal with.
* it's great that Monzo are innovating on the IT side, but to woo me they are going to have to do better on the business side - I want a better interest rate for my personal current account, savings accounts and cash ISAs; I want stocks & shares ISAs, with low fees; I want a better interest rate for my business account, along with support for international payments (every traditional bank handles these, even if they do take ~1.8% in interchange fees... grrr).
Also, other banks have the same kinds of outages, often, but are not as transparent or communicative as Monzo.
I've been full Monzo for the last 18 months and have never had an issue.
One of the unexpected advantages (if you can call it that) of OpenBanking integrations is that you get to see, in practice, just how often the high-street banks have problems. Or how long it takes for them to recover. Whatever has been said about politics and sausages sure applies...
I mean, good grief. Payment handling for a given bank may be down for a week and that's apparently not a cause for concern. Authorisation messages may be missing for days. Incoming queue can be offline for two days and customers just have to deal with it.
From the stats I've seen, Monzo is actually among the best performers with their reliability and recovery. (A recurring complaint on the receiving end is that their app makes it really easy to generate payments outside of the OpenBanking flow. That has generated extra work for our payment and customer service teams.)
Their outages haven't affected me at all yet - I don't think any have lasted for days.
I have a Monzo account as well and really like them, but it doesn't have an IBAN and can't accept international transfers (at least not officially and reliably). That makes it not really usable as a sole bank account for many of us.
My first thought ... so you have a script that can see prod and non-prod at the same time?
I think I may be developing paranoia.
I am going to guess this is comparing only the key part of the key-value stores. I regularly add a new configuration value to dev and then wonder for hours why pre-prod is failing.
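Comparing only the key sets, never the values, is enough to catch that class of mistake without any secret material crossing environments; a minimal sketch with plain dicts standing in for the two stores:

```python
def diff_keys(env_a, env_b):
    """Return the secret *names* present in one store but not the other.

    env_a and env_b are mappings of secret name -> value; only the
    names are compared, so no secret values leave their environment.
    """
    only_a = sorted(set(env_a) - set(env_b))
    only_b = sorted(set(env_b) - set(env_a))
    return only_a, only_b


# illustrative stores: a key added to dev but never to pre-prod
dev = {"DB_URL": "...", "NEW_FEATURE_KEY": "..."}
preprod = {"DB_URL": "..."}
missing_in_preprod, missing_in_dev = diff_keys(dev, preprod)
```

Here `missing_in_preprod` would flag `NEW_FEATURE_KEY` before a deploy fails for mysterious reasons.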
Somehow I am missing the details, and in this kind of case, details are all that matters.
Here I meant dumping staging values and then comparing prod values, as retrieved, to the staging ones. As in, only compare values in prod as they are used by a deploy script or system configuration tool. That said, if you have one server, one place where secrets are kept, then it’s probably safe enough to send non-prod secrets to servers as a way of ensuring the secrets are invalid. Meaning, you don’t actually need to know the prod secrets to test staging or known-weak secrets against prod.
That said, if in practice your secrets are randomly generated as services deploy, you’ll likely need to validate by observed behaviour rather than using hard-coded credentials. And if you’re practicing blue/green deploys, then staging might be just as production as prod...
I’m thinking of a system where the user can register her/his API keys to other third-party systems.
Then in your application when the user submits some data, you ask Vault to encrypt it before you store it in your main data store. When you need to read it, you get that value from your store and ask Vault to decrypt it.
If you have different services doing the reading/writing you can even setup your permissions so that one service can only decrypt and the other can only encrypt.
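In Vault policy terms, that split looks something like the following (the transit key name `app-data` is illustrative; `transit/encrypt/...` and `transit/decrypt/...` are the transit engine's endpoints, and both take `update` since they are POSTs):

```hcl
# policy for the writing service: may encrypt, cannot decrypt
path "transit/encrypt/app-data" {
  capabilities = ["update"]
}
```

```hcl
# policy for the reading service: may decrypt, cannot encrypt
path "transit/decrypt/app-data" {
  capabilities = ["update"]
}
```

Neither service ever holds the encryption key itself; Vault keeps the key and only performs the operation each policy allows.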
It gets better... stop building Rube Goldberg secret contraptions.
> Kubernetes stores the data in plaintext in etcd, a database where it stores all configuration data
Just finished reading the article.