
Monzo: How our security team handle secrets - p10jkle
https://monzo.com/blog/2019/10/11/how-our-security-team-handle-secrets
======
micah_chatt
(Disclosure: I work on Kubernetes/EKS Security at AWS)

I'm curious why they didn't look into using Kubernetes ProjectedVolumeTokens
for authenticating to Vault? The tokens Kubernetes issues are not stored in
etcd, and they contain pod-specific metadata so they are invalidated as soon
as the pod dies (when using TokenReview). Alternatively, they can be used to
directly authenticate with Vault since they're OIDC-valid tokens [1].

The semantics around secrets in Kubernetes aren't nearly as robust as Vault,
so I was surprised to not see this more clearly called out (ex: list secrets
== get all keys and values). Even if you use KMS/AES encryption (which they
reference) that doesn't help with access control.

[1]
[https://www.vaultproject.io/docs/auth/jwt.html](https://www.vaultproject.io/docs/auth/jwt.html)
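
For concreteness, wiring Vault's JWT auth method to a cluster's service account token issuer looks roughly like this (role and policy names here are made up, and this is a sketch under the assumption of a cluster with issuer discovery enabled, not a complete config):

```shell
# Enable the JWT auth method and point it at the cluster's OIDC discovery
# endpoint so Vault can validate tokens Kubernetes issues.
vault auth enable jwt
vault write auth/jwt/config \
    oidc_discovery_url="https://kubernetes.default.svc" \
    oidc_discovery_ca_pem=@ca.crt

# A role binding pods to a policy; audience and TTL keep tokens narrow.
vault write auth/jwt/role/my-app \
    role_type="jwt" \
    user_claim="sub" \
    bound_audiences="vault" \
    policies="my-app-policy" \
    ttl="15m"
```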

~~~
MrSaints
Interestingly, Googling `ProjectedVolumeTokens` yields this very post. I think
that says something about its maturity.

~~~
micah_chatt
That would be my error; it's actually `BoundServiceAccountTokenVolume` and
`TokenRequestProjection`

------
nemothekid
The intersection of microservices + Vault is something I've long had to deal
with - as far back as Vault 0.5.0 - so I'm a bit "surprised" that this isn't
turn-key with Vault+k8s today.

I was working with Mesos (before k8s had taken the world by storm), and had a
similar issue - how do services get Vault tokens without a workflow that may
include storing "secrets" in configuration? What I ended up writing was a
tool[1] that a service could query, with its Mesos Task ID, to get a token.
The tool would then read its own configuration, as well as the current Mesos
state, to determine whether the request was valid.

Unlike the k8s solution, as I understand it, you don't need to treat the
'service account token' as a secret (shrinking the attack surface to the case
where someone steals that token _and_ also has access to Vault). This is
accomplished in two ways:

1\. You can determine if a request is valid by looking at how long the service
is running. If someone steals a Task ID, but the service has already been
running for 2 minutes, then the Task ID is useless.

2\. The Vault token is only issued once per service launch. This means if an
attacker steals the Task ID, but the token has already been given out, the Task
ID is useless. If the attacker beats the service in asking for that key,
then the service should raise the alarm bells about its key potentially being
stolen. If you are even more paranoid, you could even decide to invalidate all
active keys and reduce the amount of time the attacker has a valid Vault
Token.

This was largely designed years ago, with some cues I took from one of Vault's
lead engineers, so I expected that the k8s integration would work similarly.
I'm not too familiar with k8s, however, so there might be other constraints
I'm overlooking in why Vault's k8s integration works the way it does.

[1] [https://github.com/nemosupremo/vault-gatekeeper](https://github.com/nemosupremo/vault-gatekeeper)

~~~
p10jkle
This sounds interesting. I think with bound service account tokens
([https://github.com/kubernetes/community/pull/1460](https://github.com/kubernetes/community/pull/1460))
things may improve; pods will be able to prove their identity without it being
equivalent to giving away your k8s api access.

I don't think there is anything with the same guarantees as the Task ID,
sadly.

------
Shoop
> We run about 1,100 microservices written in Go

Asking as a university student: is this a common number of microservices to
have running in production? It looks like Monzo has about 1,351 total
employees [0]. If all of them were software engineers, this would be a little
less than one microservice per engineer. How do you handle code reuse and
reliability among thousands of microservices? It seems like the number of
possible failure states would be unthinkable.

[0]
[https://en.wikipedia.org/wiki/Monzo_(bank)](https://en.wikipedia.org/wiki/Monzo_\(bank\))

~~~
p10jkle
I don't think it's particularly common for a company of our size. They are
mostly very small and handle very specific things. It's one approach, with
some benefits, although it makes projects that affect all services quite
tricky!

~~~
Shoop
Does 1,100 microservices mean 1,100 distinct programs or 1,100 running
instances of dozens or hundreds of distinct programs?

~~~
OJFord
I assume they're mostly RPC, and one procedure is one 'microservice'.

There seems to be a bit of difference in what people mean by 'microservices'
\- some orgs will have a few REST collections per 'service' so you might end
up with 'users', 'products', 'transactions' as your three services, and it
being totally unimaginable that you'd ever break 1000.

I'd argue that's still a Service-Oriented Architecture (SOA), but I'm sure
it's not anywhere near as 'micro' as what Monzo counts over a thousand of.

~~~
p10jkle
Microservices generally have 0-10 RPC handlers (we use
[https://github.com/monzo/typhon](https://github.com/monzo/typhon)). We have a
lot of different object types at Monzo; it goes way beyond 'transaction' or
'user', and each object type is going to be at least one service, but can be
several. Services that mostly just consume off queues are often separated from
services that handle requests so they can scale separately.

~~~
OJFord
Oh I wasn't suggesting user/product/transaction were what Monzo would use!
(Should have made that clearer since 'transaction' is right domain...) Just
the first things that came to my mind for very basic sort of service splitting
that I'm familiar with.

Thanks for replying, that does clear things up. My experience didn't really
embrace RPC: we used gRPC for one small bit (less than all of the service that
contained it, as I recall), but most was JSON HTTP APIs, with the service
boundaries pretty much just being team boundaries, though some teams (incl.
mine) had a few services each.

------
tialaramex
The bit where a secret gets pasted into the Very Secure system is a clear
problem: before it was pasted, and while it was being pasted, it's not in that
secure system; it's on some dev's laptop.

Most of your secrets will be (or should be) just random bits, maybe in some
particular format that was convenient for a particular application, e.g. a
4-digit PIN, a sixteen-character hexadecimal string, or 10 characters of
A-Za-z0-9.

So for these cases there's no reason that secret should ever be on a
developer's laptop. Best case, the developer made a truly random secret; maybe
they (like me) keep a set of hexadecimal dice on their desk for random
choices. Just as likely it's tainted: the developer ran a randomizer until out
popped a string they liked, or they found one on a web site, or used the same
one as in their test environment.

Either way, since what you wanted was random bits, it makes sense in most
cases (not all cases: obviously a secret key you were sent by somebody else,
for example an API key for their system, will have to be copied somehow) to
have a feature that just spits the right format of random bits into the secure
system without any human seeing them at all.
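
A minimal sketch of such a generator using Go's crypto/rand, covering the formats mentioned above (function names are mine):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"math/big"
)

// hexSecret returns n random bytes as a hex string (2n characters).
func hexSecret(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// alnumSecret returns n characters drawn uniformly from A-Za-z0-9.
func alnumSecret(n int) (string, error) {
	const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
	out := make([]byte, n)
	for i := range out {
		// rand.Int avoids modulo bias, since 62 doesn't divide 256.
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

// pinSecret returns a numeric PIN of the given number of digits,
// generated digit by digit so leading zeros are possible.
func pinSecret(digits int) (string, error) {
	out := make([]byte, digits)
	for i := range out {
		d, err := rand.Int(rand.Reader, big.NewInt(10))
		if err != nil {
			return "", err
		}
		out[i] = byte('0' + d.Int64())
	}
	return string(out), nil
}
```

The point is that the output of these functions can be written straight into the secret store, never into a terminal or clipboard.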

Even better, in cases where it's an option, is not to rely on stored secrets
at all. I think Monzo's post is not worrying about this difference, but it can
be critical in terms of decisions about debugging to prefer to have entirely
ephemeral secrets. When a pod goes away, the ephemeral secrets that pod had
vanish with it, and so you aren't storing them anywhere anyway. If they aren't
stored, they can't get stolen by anybody and you've got one less thing to go
wrong.

~~~
mitchellh
(Disclaimer: founder of HashiCorp, creator of Vault)

You hit on a good point. Vault has features to eliminate this security risk,
if I’m understanding correctly.

The first feature is dynamic secrets: this generates an ephemeral, leased set
of credentials that are unique per client. For a Kubernetes pod, it would get
a unique set of DB credentials, for example. These are tied to the service
account (used for auth). When the auth expires, so do the credentials (they’re
dropped from the DB, and if the DB supports it we also drop connections).

The second feature is root credential rotation. To use the above feature, a
user had to at some point “paste” the superuser credentials into Vault. As you
pointed out, there’s a risk here. So what Vault can do is _immediately_ rotate
that credential so after configuring Vault, it is no longer valid and only
Vault knows the real credential. We support this for most database backends,
for example.

If you combine these two elements, you get fully ephemeral secrets that are
unknown by anyone except the necessary user. There’s a lot more we can talk
about, there’s a lot more features we have around this, but this is just the
high level point!
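
Roughly, the CLI flow for a database backend looks like this (connection, role, and credential names here are hypothetical, and this is a sketch, not a complete setup):

```shell
# Configure the database secrets engine with the superuser credential...
vault secrets enable database
vault write database/config/my-postgres \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@db:5432/postgres" \
    allowed_roles="my-app" \
    username="vault-admin" \
    password="initial-password-pasted-once"

# ...then immediately rotate it, so the password pasted above no longer
# works anywhere and only Vault knows the real one.
vault write -f database/rotate-root/my-postgres

# A role defines how per-client credentials are minted and how long they live.
vault write database/roles/my-app \
    db_name=my-postgres \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}';" \
    default_ttl="1h"

# Clients then lease short-lived, unique credentials on demand.
vault read database/creds/my-app
```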

~~~
SEJeff
As one of your commercial customers, I would give you the feedback that a lot
of feature requests get closed (on the GitHub Vault issue tracker) with terse
and not very user-friendly reasons. This is off-putting to a lot of people.

We've recently been escalating via sales for GitHub issues we see a use for,
and hopefully that gets back to product management. Not everyone is able to do
that, however.

~~~
mitchellh
Thank you, I heard this feedback this week as well. We’re hiring for a couple
of roles on the Vault team right now that will be dedicated to community
management and process (and that will expand more broadly). We hope this
helps.

The feedback from customers definitely gets back to product management, so for
your case, that works.

------
atonse
I so, so, so, badly want to use Vault everywhere.

The one thing keeping me from adopting isn't that it won't be secure enough. I
worry _constantly_ that I'll lock myself out of my data, my infrastructure,
etc.

Have others had that worry? How'd you get over it and just start using Vault?
(probably through incremental, low-risk adoption first?)

~~~
kitotik
Some quick points as a very early adopter of Vault:

\- Absolutely take an incremental approach, there is a learning curve

\- Use a highly available backend. I started out with just Consul, but have
since gone with Consul for HA and CockroachDB as the data store.

\- Practice your disaster recovery many times before you go all in

\- Use a combo of paper plus offline digital storage for the unseal keys and
root tokens

\- Be diligent about roles/policies

\- Rely on tokens more than any other auth method (username/pass, certs)

\- Don’t be afraid to use many Vault instances that can talk to a core
instance for things like transit auto-unseal

All that said, I absolutely think it’s worth the investment if you have the
infrastructure to back it (aka a cluster without a SPOF).

The PKI backend is phenomenal, and it makes a great sidecar for any app that
needs auth, secrets, or general crypto stuff using the transit backend.

I really want to create a user friendly Password Manager/TOTP front end backed
by vault. Someday…

~~~
nyxcharon
Something like this? Chrome plugin:
[https://chrome.google.com/webstore/detail/kvasir/kabfjaeebjd...](https://chrome.google.com/webstore/detail/kvasir/kabfjaeebjdpgbifhipejjdebeodlbip?hl=en)

Src: [https://gitlab.com/Dreae/kvasir](https://gitlab.com/Dreae/kvasir)

------
whycombagator
Vault interests me but seems to come with a lot of complexity/requires an army
of devops. There's a YC backed company called Envkey[0] that looks interesting
+ simple, but there is no option to self host yet

[0] [https://www.envkey.com/](https://www.envkey.com/)

~~~
gingerlime
I’ll plug envwarden[0] which is a simple wrapper to manage your keys as
environment variables in Bitwarden (which can be self-hosted I believe).

[0]
[https://github.com/envwarden/envwarden](https://github.com/envwarden/envwarden)

~~~
peterloron
Yes, Bitwarden can be self-hosted. It is also a really nice secret management
tool for humans, although it does have a CLI and API which can be used for
machines.

------
rossmohax
One thing bugs me about Vault and I rarely see it being discussed: how come a
design where Vault simultaneously stores secrets AND is able to access the
other systems those secrets are for is deemed good and secure?

Vault is a company-wide "root account": reachable from every part of the
internal network, storing all the company's secret data, and with reach into
many internal and external systems like databases, where it has full rights to
dynamically configure short-lived credentials. Doesn't that put too much trust
in a single system?

~~~
lstamour
The real question might be if having it in one place makes monitoring and
revocation easier than distributed trusted systems? Also, traditional
passwords/secrets don’t expire, and if they do expire, how would you maintain
trust in a distributed fashion outside of... complicated multi-part keys,
something less secure like DNS, and/or more permanent tokens like the private
keys used by a CA system? Somebody somewhere has to maintain a private or
secret key, or you need a human to intervene. And even if you store the key on
hardware, any users of the key could be compromised.

I guess what I’m getting at is—there probably isn’t a perfect answer, just
tradeoffs. And if history has taught us anything, it’s a case of “when” not
“if” something is attacked/broken. If so, perhaps you should partition your
data, including infrastructure, to not rely on just one Vault server for
everything? Outside of that, or monitoring, the only other clear answer
everyone leans on is “Store it in the cloud,” under the assumption that the
hardware and people processes at cloud companies will be more secure overall
than anything you’d develop. Which is then the attack vector vault prevents:
saving permanent access tokens. Off the cuff that’s how I see it. I’m not
actually in SRE or security so I’d welcome other opinions.

------
IshKebab
Ha I love how they explain what "cryptography" means, as if anyone reading
this wouldn't know.

Also I love the level of openness. No other bank would do this. (If you're in
the UK, Monzo is honestly amazing. I've cancelled all my other bank accounts.)

~~~
deif
Slight tangent on that offhand comment at the end - why did you cancel your
other accounts? If Monzo's services fail or briefly go down (a somewhat medium
risk, since they've had a couple of outages this year which lasted days) you
are screwed.

~~~
yRetsyM
Their outages have not affected everyone and often affect a small slice of the
service.

Also, other banks have the same kinds of outages, often, but are not as
transparent or communicative as Monzo.

I've been full Monzo for the last 18 months and have never had an issue.

~~~
bostik
> _Also, other banks have the same kinds of outages_

One of the unexpected advantages (if you can call it that) of OpenBanking
integrations is that you get to see, in practice, just how often the
high-street banks have problems. Or how long it takes for them to recover. Whatever
has been said about politics and sausages sure applies...

I mean, good grief. Payment handling for a given bank may be down for a week
and that's apparently not a cause for concern. Authorisation messages may be
missing for days. Incoming queue can be offline for two days and customers
just have to deal with it.

From the stats I've seen, Monzo is actually among the best performers with
their reliability and recovery. (A recurring complaint on the receiving end is
that their app makes it _really_ easy to generate payments outside of the
OpenBanking flow. That has generated extra work for our payment and customer
service teams.)

~~~
sakisv
Oh I'd love to see these stats, do you have a link?

~~~
bostik
The team who handle the payments should have a post coming up soon.

------
lifeisstillgood
>> We check that a secret exists in our staging environment (s101) when
writing to prod, and warn if it doesn't.

My first thought ... so you have a script that can see prod and non prod at
the same time?

I think I may be developing paranoia.

~~~
lstamour
Theoretically, you could have a dump of staging data that your prod script
could read and use. What’s key here is recognizing that staging and prod don’t
have to talk to each other or be trusted, as such. Instead, what’s trusted is
the deploy script, and your system is only as weak or as strong as the code
used to deploy the deploy jobs. There aren’t many good answers here because
you can fat-finger a security bug or data wipe, or have a bug that only
presents itself in certain states. Or is created by human input. This quickly
devolves into the “trusting trust” problem. Is my paranoia worse? ;-) I’d
suggest monitoring all the things, and doing as much code review and fuzzing
as you can afford... what normally improves your code and processes will
probably help prevent security defects too.

~~~
lifeisstillgood
I don't think dumping production _secrets_ anywhere qualifies as a good idea.

I am going to guess this is comparing the key part of key-value stores - I
regularly add a new configuration value to dev and wonder for hours why
pre-prod is failing.

Somehow I am missing the details, and in this kind of case, details are all
that matters.
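
For what it's worth, the check the article describes only needs key names to cross the environment boundary, never values - something like this sketch (names are mine):

```go
package main

// missingInStaging reports whether a secret key being written to prod
// is absent from the set of key names seen in staging. Only key names
// are compared; no secret values leave either environment.
func missingInStaging(stagingKeys []string, prodKey string) bool {
	for _, k := range stagingKeys {
		if k == prodKey {
			return false
		}
	}
	return true
}
```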

~~~
lstamour
Comparing keys is interesting; it’s a way of versioning the set of credentials
you’re using.

Here I meant dumping staging values and then comparing prod values as
retrieved to staging ones. As in, only compare values in prod as they are used
by a deploy script or system configuration tool. That said, if you have one
server, one place where secrets are kept, then it’s probably safe enough to
send non-prod secrets to servers as a way of ensuring the secrets are invalid.
Meaning, you don’t actually need to know the prod secrets to test staging or
known weak secrets against prod.

That said, if in practice your secrets are randomly generated as services
deploy, you’ll likely need to validate by observed behaviour rather than using
hard-coded credentials. And if you’re practicing blue/green deploys, then
staging might be just as production as prod...

------
haolez
Slightly related, but what’s a good practice for storing secrets that need to
be recovered as plain text?

I’m thinking of a system where the user can register her/his API keys to other
third-party systems.

~~~
nemothekid
If you have the operational manpower, Vault is a good solution as well. With
Vault you can create an endpoint that will encrypt/decrypt data with a key
that only Vault knows. You can have a unique endpoint for each user.

Then in your application when the user submits some data, you ask Vault to
encrypt it before you store it in your main data store. When you need to read
it, you get that value from your store and ask Vault to decrypt it.

If you have different services doing the reading/writing, you can even set up
your permissions so that one service can only decrypt and the other can only
encrypt.
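
A sketch of that application-side pattern. To keep it self-contained, an in-memory AES-GCM implementation stands in for Vault's transit endpoint; in a real deployment the Encrypt/Decrypt calls would be HTTP requests to transit/encrypt/<key> and transit/decrypt/<key>, and the key would never leave Vault:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// Encrypter is what the application sees: it never touches key material.
type Encrypter interface {
	Encrypt(plaintext []byte) ([]byte, error)
	Decrypt(ciphertext []byte) ([]byte, error)
}

// inMemoryTransit is a stand-in for Vault's transit backend.
type inMemoryTransit struct {
	aead cipher.AEAD // key lives only here, never in the data store
}

func newInMemoryTransit() (*inMemoryTransit, error) {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &inMemoryTransit{aead: aead}, nil
}

func (t *inMemoryTransit) Encrypt(plaintext []byte) ([]byte, error) {
	nonce := make([]byte, t.aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so Decrypt can recover it.
	return t.aead.Seal(nonce, nonce, plaintext, nil), nil
}

func (t *inMemoryTransit) Decrypt(ciphertext []byte) ([]byte, error) {
	n := t.aead.NonceSize()
	if len(ciphertext) < n {
		return nil, errors.New("ciphertext too short")
	}
	return t.aead.Open(nil, ciphertext[:n], ciphertext[n:], nil)
}

// storeAPIKey writes only ciphertext to the main data store.
func storeAPIKey(store map[string][]byte, enc Encrypter, user string, key []byte) error {
	ct, err := enc.Encrypt(key)
	if err != nil {
		return err
	}
	store[user] = ct
	return nil
}
```

The design choice is that the data store only ever holds ciphertext, so a dump of it is useless without also compromising the encryption service.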

------
mcnichol
One exposure/exercise rotating secrets and it is curtains for you.

It gets better: stop building Rube Goldberg secret contraptions.

------
ggregoire
Any opinions about Vault vs. AWS secret management solutions?

~~~
p10jkle
Vault is a lot more fully featured and generally has more granular access
control than the equivalent AWS products, but it is tricky to run in a way
that is available and secure. Obviously, outsourcing that has huge value.

------
etxm
Nitpick: k8s has supported encryption at rest for secrets for a while.

[https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/)
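
For reference, that at-rest encryption is configured on the API server with an EncryptionConfiguration resource along these lines (the key material here is a placeholder):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}  # fallback so existing plaintext data stays readable
```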

~~~
p10jkle
This exact link is mentioned in the article! We use it!

~~~
etxm
Nice, posted that when I came across:

> Kubernetes stores the data in plaintext in etcd, a database where it stores
> all configuration data

Just finished reading the article.

------
mkagenius
Related: I have seen a lot of people make the mistake of hardcoding secrets in
their Android apps; please make sure you do not do that. I have a tool to
check for these embedded secrets:
[https://android.fallible.co/](https://android.fallible.co/)

