Monzo: How our security team handle secrets (monzo.com)
137 points by p10jkle on Oct 11, 2019 | hide | past | favorite | 63 comments

(Disclosure: I work on Kubernetes/EKS Security at AWS)

I'm curious why they didn't look into using Kubernetes ProjectedVolumeTokens for authenticating to Vault? The tokens Kubernetes issues are not stored in etcd, and they contain pod-specific metadata so they are invalidated as soon as the pod dies (when using TokenReview). Alternatively, they can be used to directly authenticate with Vault since they're OIDC-valid tokens [1].

The semantics around secrets in Kubernetes aren't nearly as robust as Vault, so I was surprised to not see this more clearly called out (ex: list secrets == get all keys and values). Even if you use KMS/AES encryption (which they reference) that doesn't help with access control.

[1] https://www.vaultproject.io/docs/auth/jwt.html
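For context, the token projection feature surfaces these short-lived, audience-bound tokens to a pod via a projected volume. A minimal pod spec looks roughly like this (the names, audience, and TTL below are illustrative, not anything Monzo or AWS prescribes):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: example/app:latest
      volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: vault-token
  volumes:
    - name: vault-token
      projected:
        sources:
          - serviceAccountToken:
              path: vault-token       # file the token is written to
              audience: vault         # token is only valid for this audience
              expirationSeconds: 600  # short-lived; the kubelet rotates it
```

Because the token is bound to the pod and the stated audience, Vault (or any OIDC-aware verifier) can reject it once the pod is gone.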

This is on our radar, and I think they can now be used directly with the Kubernetes auth plugin, although I've not heard much about it; it's a very recent change. We could possibly have got the same functionality with the JWT plugin, with some added complexity (and no TokenReview).

We don't allow read or list of secrets by any human, although of course that's not a perfect control.

Interestingly, Googling `ProjectedVolumeTokens` yields this very post. I think that says something about its maturity.

That would be my error; it's actually `BoundServiceAccountTokenVolume` and `TokenRequestProjection`

The intersection of microservices + Vault is something I've long had to deal with - as far back as Vault 0.5.0 - so I'm a bit "surprised" that this isn't something that's turn-key with Vault+k8s today.

I was working with Mesos (before k8s had taken the world by storm) and had a similar issue: how do services get Vault tokens without a workflow that may include storing "secrets" in configuration? What I ended up writing was a tool[1] that a service could query, with its Mesos Task ID, to get a token. The tool would then read its own configuration, as well as the current Mesos state, to determine whether the request was valid.

Unlike the k8s solution, as I understand it, you don't need to treat the 'service account token' as a secret (reducing the attack surface where someone steals that token _and_ also has access to Vault). This is accomplished in two ways:

1. You can determine if a request is valid by looking at how long the service has been running. If someone steals a Task ID, but the service has already been running for 2 minutes, then the Task ID is useless.

2. The Vault token is only issued once per service launch. This means if an attacker steals the Task ID, but the token has already been given out, the Task ID is useless. If the attacker beats the service in asking for that key, then the service should raise alarm bells about its key potentially being stolen. If you are even more paranoid, you could decide to invalidate all active keys and reduce the amount of time the attacker has a valid Vault token.

This was largely designed years ago with some cues I took from one of Vault's lead engineers, so I expected that the k8s integration would work similarly. I'm not too familiar with k8s, however, so there might be other constraints I'm overlooking in why Vault's k8s integration works the way it does.

[1] https://github.com/nemosupremo/vault-gatekeeper

This sounds interesting. I think with bound service account tokens (https://github.com/kubernetes/community/pull/1460) things may improve; pods will be able to prove their identity without it being equivalent to giving away your k8s api access.

I don't think there is anything with the same guarantees as the Task ID, sadly.

> We run about 1,100 microservices written in Go

Asking as a university student: is this a common number of microservices to have running in production? It looks like monzo has about 1,351 total employees [0]. If all of them were software engineers, this would be a little less than one microservice per engineer. How do you handle code reuse and reliability among thousands of microservices? It seems like the number of possible failure states would be unthinkable.

[0] https://en.wikipedia.org/wiki/Monzo_(bank)

I don't think it's particularly common for a company of our size. They are mostly very small and handle very specific things. It's one approach, with some benefits, although it makes projects that affect all services quite tricky!

Does 1,100 microservices mean 1,100 distinct programs or 1,100 running instances of dozens or hundreds of distinct programs?

I assume they're mostly RPC, and one procedure is one 'microservice'.

There seems to be a bit of difference in what people mean by 'microservices' - some orgs will have a few REST collections per 'service' so you might end up with 'users', 'products', 'transactions' as your three services, and it being totally unimaginable that you'd ever break 1000.

I'd argue that's still a Service-Oriented Architecture (SOA), but I'm sure it's not anywhere near as 'micro' as what Monzo counts over a thousand of.

Microservices generally have 0-10 RPC handlers (we use https://github.com/monzo/typhon). We have a lot of different object types at Monzo, it goes way beyond 'transaction' or 'user' and each object type is going to be at least one service, but can be several. Services that mostly just consume off queues are often separated from services that handle requests so they can scale separately.

Oh I wasn't suggesting user/product/transaction were what Monzo would use! (Should have made that clearer, since 'transaction' is in the right domain...) Just the first things that came to my mind for the very basic sort of service splitting that I'm familiar with.

Thanks for replying, that does clear things up. My experience didn't really embrace RPC; we used gRPC for one small bit (less than the whole service that contained it, as I recall), but most was JSON HTTP APIs, with the service boundaries pretty much just being team boundaries, though some teams (incl. mine) had a few services each.

Distinct but small services each with 1-5 endpoints

The former.

In web-scale companies, it is not uncommon to have that many (including the engineer:microservice ratio). Code reuse is generally handled by standardisation of services (including language, libraries, and frameworks). Reliability is taken care of by metrics-based measurements. It is truly a wonderful world. Of course, the path to El Dorado lies in investing in a world-class platform and devops, including painless CI/CD and rollout/rollback.

Just earlier today, on another thread on HN, I read that Uber runs 1,500 microservices.

I've yet to work for a company that runs over 100 microservices (or at least, not that I'm aware of). But I can tell you having a tool like Kubernetes certainly makes it a whole lot easier to maintain this many microservices. I think without container orchestration it would be much harder to do so.

The bit where a secret gets pasted into the Very Secure system is a clear problem. Because before it was pasted, and while it was being pasted, it's not in that secure system; it's on some dev's laptop.

Most of your secrets will be (or should be) just random bits, maybe in some particular format that was convenient for a particular application, e.g. a 4-digit PIN, a sixteen-character hexadecimal string, or 10 characters of A-Za-z0-9.

So for these cases there's no reason that secret is ever on a developer's laptop. Best case, the developer made a truly random secret; maybe they (like me) keep a set of hexadecimal dice on their desk for random choices. Just as likely it's tainted: the developer ran a randomizer until out popped a string they liked, or they even found one on a web site, or used the same one as in their test environment.

Either way, since what you wanted was random bits, it makes sense in most cases (not all cases: obviously a secret key you were sent by somebody else, for example an API key for their system, will have to be copied somehow) to have a feature that just spits the right format of random bits into the secure system without any human seeing them at all.

Even better, in cases where it's an option, is not to rely on stored secrets at all. I think Monzo's post is not worrying about this difference, but it can be critical in terms of decisions about debugging to prefer to have entirely ephemeral secrets. When a pod goes away, the ephemeral secrets that pod had vanish with it, and so you aren't storing them anywhere anyway. If they aren't stored, they can't get stolen by anybody and you've got one less thing to go wrong.

(Disclaimer: founder of HashiCorp, creator of Vault)

You hit on a good point. Vault has features to eliminate this security risk, if I’m understanding correctly.

The first feature is dynamic secrets: this generates an ephemeral, leased set of credentials that are unique per client. For a Kubernetes pod, it would get a unique set of DB credentials, for example. These are tied to the service account (used for auth). When the auth expires, so do the credentials (they’re dropped from the DB, and if the DB supports it we also drop connections).

The second feature is root credential rotation. To use the above feature, a user had to at some point “paste” the superuser credentials into Vault. As you pointed out, there’s a risk here. So what Vault can do is _immediately_ rotate that credential so after configuring Vault, it is no longer valid and only Vault knows the real credential. We support this for most database backends, for example.

If you combine these two elements, you get fully ephemeral secrets that are unknown by anyone except the necessary user. There’s a lot more we can talk about and a lot more features we have around this, but this is just the high-level point!
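The two features together look roughly like this with the Vault CLI against the database secrets engine (the mount name, role name, and connection details below are illustrative, and this assumes a running Vault with a reachable PostgreSQL):

```shell
# Enable the database secrets engine and configure a connection.
# The superuser password is pasted here once...
vault secrets enable database
vault write database/config/app-db \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db:5432/app" \
    username="vault-admin" \
    password="initial-superuser-password" \
    allowed_roles="app-role"

# ...and immediately rotated, so after this only Vault knows the
# real superuser credential.
vault write -f database/rotate-root/app-db

# Define how per-client credentials are created and how long they live.
vault write database/roles/app-role \
    db_name=app-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';" \
    default_ttl="1h" max_ttl="24h"

# Each read mints a unique, leased set of DB credentials that expire
# (and are dropped from the database) with the lease.
vault read database/creds/app-role
```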

> The first feature is dynamic secrets: this generates an ephemeral, leased set of credentials that are unique per client.

Yes, thank you.

This is one of those features that once you've lived with it, you can't imagine going back. Essentially, every secret gets an automatic expiration date in the near future. This has several effects:

1. It teaches people to never hard-code secrets anywhere, because they always expire. So people will follow fairly strict credential-management rules out of sheer laziness.

2. It guarantees that you don't have stale secrets lying around in random corners of your company. So even if somebody does record a secret somewhere they shouldn't, the window of attack may only be a few hours.

As one of your commercial customers, I would give you the feedback that there are a lot of feature requests that get closed (on the GitHub Vault issue tracker) with terse and not very user-friendly reasons. This is off-putting to a lot of people.

We've recently been escalating via sales for GitHub issues we see a use for, and hopefully that gets back to product management. Not everyone is able to do that, however.

Thank you, I heard this feedback this week as well. We’re hiring for a couple of roles right now on the Vault team that will be dedicated to community management and process (and will expand more broadly). We hope this helps.

The feedback from customers definitely gets back to product management, so for your case, that works.

We are using your database credential generation features! With thousands of clients, too. Very cool, we'll write a blog post soon. The root credential rotation is genius.

> The bit where a secret gets pasted into the Very Secure system is a clear problem. Because before it was pasted, and while it was being pasted, it's not in that secure system; it's on some dev's laptop.

Yep, so we have a different system for securely generated random keys. If it's a Twilio API key, it realistically has to pass through a dev laptop, and that's not much of a big deal. If it's e.g. an RSA key, we will generate it on an airgapped laptop, encrypt it to a public key, and then we have a Vault plugin that decrypts it and writes it into Vault. So the unencrypted data is never anywhere but the airgapped laptop or Vault.

We also try to generate keys inside of Vault where possible, and we generate a lot of certificates this way.

I think there's also an implication here that secrets aren't (generated and) stored on an HSM.

I know nothing about banking, but that surprises me.

I so, so, so, badly want to use Vault everywhere.

The one thing keeping me from adopting isn't that it won't be secure enough. I worry _constantly_ that I'll lock myself out of my data, my infrastructure, etc.

Have others had that worry? How'd you get over it and just start using Vault? (probably through incremental, low-risk adoption first?)

Some quick points as a very early adopter of Vault:

- Absolutely take an incremental approach, there is a learning curve

- Use a high-availability backend. I started out with just Consul, but have since gone with Consul for HA and CockroachDB as the data store.

- Practice your disaster recovery many times before you go all in

- Use a combo of paper plus offline digital storage for the unseal keys and root tokens

- Be diligent about roles/policies

- Rely on tokens more than any other auth method (username/pass, certs)

- Don't be afraid to use many Vault instances that can talk to a core instance for things like transit auto-unseal

All that said, I absolutely think it’s worth the investment if you have the infrastructure to back it(aka a cluster without a SPOF)

The PKI backend is phenomenal, and it makes a great sidecar for any app that needs auth, secrets, or general crypto stuff using the transit backend.

I really want to create a user friendly Password Manager/TOTP front end backed by vault. Someday…

We were pretty concerned about this. It's hard to get in a situation where the data is unrecoverable, because the root keys that you distribute can be used (with a bit of reading Vault code) to directly decrypt the internal storage, which is easy to navigate.

On a more general note, it's taken us a long time to get comfortable with using Vault for increasingly critical things, and we are now at the point where it being down is extremely serious. But there are several components like this, and we are able to tolerate node failure, so it's somewhat acceptable.

The same worry. Additionally, I also wonder about the additional cognitive load I'd be placing on the dev teams who haven't used it at all.

I'm hoping you get some replies, but I'm also thinking of a low risk approach first, specifically with temporary SSH keys and temporary DB credentials.

I have the same worry — we all probably do. But I worry a lot more about building an in-house solution and having everything leak because I missed something, so Vault gives me some peace of mind there.

I am not saying that Vault makes leaks impossible; it is just that I trust their team a lot more than myself when it comes to building a secret management tool.

Vault interests me but seems to come with a lot of complexity/requires an army of devops. There's a YC backed company called Envkey[0] that looks interesting + simple, but there is no option to self host yet

[0] https://www.envkey.com/

EnvKey founder here. Thanks for the mention!

We're hard at work on a v2 that will be self-hostable as well as more powerful, flexible, and robust across the board so that it will be able to handle just about any configuration or secrets management scenario you can throw at it, from small teams to enterprise, while maintaining a simple, it-just-works approach.

In the meantime, our app and client libraries implement true end-to-end encryption that is open source and well-documented[1], meaning that despite our current product being cloud-hosted, we could not access your secrets even if we wanted to.

1 - https://security.envkey.com/

I run vault at my org, we are a devops team of 2, and we do just fine. Vault can be tricky, because a failure in vault def hurts and can impact production pretty quickly.

So take your time, and slowly move things to it, but vault is almost certainly worth your time and energy.

Plus combined with nomad and consul and the rest of the hashi stack it's pretty easy, and hard to get wrong.

I’ll plug envwarden[0] which is a simple wrapper to manage your keys as environment variables in Bitwarden (which can be self-hosted I believe).

[0] https://github.com/envwarden/envwarden

Yes, Bitwarden can be self-hosted. It is also a really nice secret management tool for humans, although it does have a CLI and API which can be used for machines.

Fair enough, Vault has a ton of features I guess most don't need. It's been cool for us, we are now using it for secrets, database credentials, and certificate management (more blog posts coming!). Next up is hopefully AWS creds management

One thing bugs me about Vault, and I rarely see it being discussed: how come a design where Vault simultaneously stores secrets AND is able to access the other systems those secrets are for is deemed good and secure?

Vault is a company-wide "root account": reachable from every part of the internal network, storing all the company's secret data, and with reach into many internal and external systems like databases, where it has full rights to dynamically configure short-lived credentials. Doesn't that put too much trust in a single system?

The real question might be if having it in one place makes monitoring and revocation easier than distributed trusted systems? Also, traditional passwords/secrets don’t expire, and if they do expire, how would you maintain trust in a distributed fashion outside of... complicated multi-part keys, something less secure like DNS, and/or more permanent tokens like the private keys used by a CA system? Somebody somewhere has to maintain a private or secret key, or you need a human to intervene. And even if you store the key on hardware, any users of the key could be compromised.

I guess what I’m getting at is—there probably isn’t a perfect answer, just tradeoffs. And if history has taught us anything, it’s a case of “when” not “if” something is attacked/broken. If so, perhaps you should partition your data, including infrastructure, to not rely on just one Vault server for everything? Outside of that, or monitoring, the only other clear answer everyone leans on is “Store it in the cloud,” under the assumption that the hardware and people processes at cloud companies will be more secure overall than anything you’d develop. Which is then the attack vector vault prevents: saving permanent access tokens. Off the cuff that’s how I see it. I’m not actually in SRE or security so I’d welcome other opinions.

To me, this represents a fatal flaw in Vault's architecture. What it does should be table stakes/standardized and operate in a distributed fashion via integration into the underlying secrets engines. I'm honestly surprised the CNCF hasn't found a project out there to challenge Vault, we desperately need alternatives.

Ha I love how they explain what "cryptography" means, as if anyone reading this wouldn't know.

Also I love the level of openness. No other bank would do this. (If you're in the UK, Monzo is honestly amazing. I've cancelled all my other bank accounts.)

> Ha I love how they explain what "cryptography" means, as if anyone reading this wouldn't know.

We try to make the posts accessible to our customers as well as engineers :)

> Also I love the level of openness. No other bank would do this

Totally agree! I don't have a Monzo account yet, but I find this level of transparency extremely appealing. I know people who work for a few "traditional" banks (RBS, Santander, Lloyd's), and by all accounts the IT setup is an absolute shit-show. I can kind of understand though; they've been around so long that they have a ton* of tech debt and legacy crap to deal with.

* it's great that Monzo are innovating on the IT side, but to woo me they are going to have to do better on the business side - I want a better interest rate for my personal current account, savings accounts and cash ISAs; I want stocks & shares ISAs, with low fees; I want a better interest rate for my business account, along with support for international payments (every traditional bank handles these, even if they do take ~1.8% in interchange fees... grrr).

Slight tangent on that offhand comment at the end: why did you cancel your other accounts? If Monzo's services fail or briefly go down (a moderate risk, since they've had a couple of outages this year which lasted days) you are screwed.

Their outages have not affected everyone and often affect a small slice of the service.

Also, other banks have the same kinds of outages, often, but are not as transparent or communicative as Monzo.

I've been full Monzo for the last 18 months and have never had an issue.

> Also, other banks have the same kinds of outages

One of the unexpected advantages (if you can call it that) of OpenBanking integrations is that you get to see, in practice, just how often the high-street banks have problems. Or how long it takes for them to recover. Whatever has been said about politics and sausages sure applies...

I mean, good grief. Payment handling for a given bank may be down for a week and that's apparently not a cause for concern. Authorisation messages may be missing for days. Incoming queue can be offline for two days and customers just have to deal with it.

From the stats I've seen, Monzo is actually among the best performers with their reliability and recovery. (A recurring complaint on the receiving end is that their app makes it really easy to generate payments outside of the OpenBanking flow. That has generated extra work for our payment and customer service teams.)

Oh I'd love to see these stats, do you have a link?

The team who handle the payments should have a post coming up soon.

I cancelled the other accounts mainly because it meant I could use the account switching service which automatically moves direct debits.

Their outages haven't affected me at all yet - I don't think any have lasted for days.

> I've cancelled all my other bank accounts.

I have a Monzo account as well and really like them, but it doesn't have an IBAN and can't accept international transfers (at least not officially and reliably). That makes it not really usable as a sole bank account for many of us.

>> We check that a secret exists in our staging environment (s101) when writing to prod, and warn if it doesn't.

My first thought... so you have a script that can see prod and non-prod at the same time?

I think I may be developing paranoia.

Theoretically, you could have a dump of staging data that your prod script could read and use. What’s key here is recognizing that staging and prod don’t have to talk to each other or be trusted, as such. Instead, what’s trusted is the deploy script, and your system is only as weak or as strong as the code used to deploy the deploy jobs. There aren’t many good answers here because you can fat-finger a security bug or data wipe, or have a bug that only presents itself in certain states. Or is created by human input. This quickly devolves into the “trusting trust” problem. Is my paranoia worse? ;-) I’d suggest monitoring all the things, and doing as much code review and fuzzing as you can afford... what normally improves your code and processes will probably help prevent security defects too.

I don't think dumping production secrets to anywhere qualifies as a good idea

I am going to guess this is comparing the key part of key value stores - I regularly add a new configuration value to dev and wonder for hours why pre-prod is failing

Somehow I am missing the details, and in this kind of case the details are all that matter.

Comparing keys is interesting, it’s a way of versioning the set of credentials you’re using.

Here I meant dumping staging values and then comparing prod values as retrieved to staging ones. As in, only compare values in prod as they are used by a deploy script or system configuration tool. That said, if you have one server, one place where secrets are kept, then it’s probably safe enough to send non-prod secrets to servers as a way of ensuring the secrets are invalid. Meaning, you don’t actually need to know the prod secrets to test staging or known weak secrets against prod.

That said, if in practice your secrets are randomly generated as services deploy, you’ll likely need to validate by observed behaviour rather than using hard-coded credentials. And if you’re practicing blue/green deploys, then staging might be just as production as prod...
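The check discussed above only needs the key names, never the values, which is what keeps it from being a prod/staging trust problem. A hypothetical sketch of that comparison (function and key names are mine, not Monzo's tooling):

```go
package main

import "fmt"

// warnMissingInStaging reports prod secret keys (names only, never
// values) that have no counterpart in staging, which is the kind of
// warning described when writing a secret to prod.
func warnMissingInStaging(staging, prod []string) []string {
	seen := make(map[string]bool, len(staging))
	for _, k := range staging {
		seen[k] = true
	}
	var missing []string
	for _, k := range prod {
		if !seen[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	staging := []string{"db-password", "twilio-api-key"}
	prod := []string{"db-password", "twilio-api-key", "new-signing-key"}
	for _, k := range warnMissingInStaging(staging, prod) {
		fmt.Printf("warning: %s exists in prod but not staging (s101)\n", k)
	}
}
```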

Slightly related, but what’s a good practice for storing secrets that need to be recovered as plain text?

I’m thinking of a system where the user can register her/his API keys to other third-party systems.

If you have the operational manpower, Vault is a good solution as well. With Vault you can create an endpoint that will encrypt/decrypt data with a key that only Vault knows. You can have a unique endpoint for each user.

Then in your application when the user submits some data, you ask Vault to encrypt it before you store it in your main data store. When you need to read it, you get that value from your store and ask Vault to decrypt it.

If you have different services doing the reading/writing, you can even set up your permissions so that one service can only decrypt and the other can only encrypt.
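That split is expressed as two Vault policies against the transit endpoints. A sketch of the idea (the key name `user-1234` and the mount path are illustrative; both endpoints take POSTs, hence the `update` capability):

```hcl
# Policy for the writer service: may encrypt with the per-user key,
# but cannot decrypt anything.
path "transit/encrypt/user-1234" {
  capabilities = ["update"]
}
```

```hcl
# Policy for the reader service: may decrypt with the per-user key,
# but cannot encrypt new data.
path "transit/decrypt/user-1234" {
  capabilities = ["update"]
}
```

Attach one policy to each service's auth role and neither side ever holds the key itself; Vault does the crypto.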

I'm a little confused; the system mentioned in this article allows services to retrieve secrets in plaintext; they're just stored encrypted at rest. You can do what you like with the retrieved secret.

One exposure/exercise rotating secrets and it is curtains for you.

It gets better... stop building Rube Goldberg secret contraptions.

Any opinions about Vault vs. AWS secret management solutions?

Vault is a lot more fully featured and generally has more granular access control than equivalent AWS products, but is tricky to run in a way that is available and secure. Obviously outsourcing that has huge value

Nitpick: k8s has supported encryption at rest for secrets for a while.


This exact link is mentioned in the article! We use it!

Nice, posted that when I came across:

> Kubernetes stores the data in plaintext in etcd, a database where it stores all configuration data

Just finished reading the article.

Related: I have seen a lot of people make the mistake of hardcoding secrets in their Android apps; please make sure you do not do that. I have a tool to check for these embedded secrets: https://android.fallible.co/
