
Scalable and secure access with SSH - samber
https://code.facebook.com/posts/365787980419535/scalable-and-secure-access-with-ssh/
======
jorangreef
I found this really useful. It's lightweight, simple and powerful.

However, some pieces of advice in the article could prove catastrophic:

"You need to distribute ca.pub to your entire fleet. Remember, this is meant
to be public, so complete access lockdown isn't the goal."

Complete access lockdown is vital. You need to prevent the CA's public key
from being altered, corrupted or deleted across your fleet. When distributing
the CA's public key, you definitely need to use a secure transport. The CA's
public key is what your system of trust is built on. You need to make sure you
are trusting the right thing.

"Copy the latter to the CA server and get it signed. Because this is public
information, the transport isn't important. You can copy and paste, or fax it"

When copying the user's public key to the server, the transport is still
important, otherwise a MITM could swap the public key and your CA would
happily sign the wrong thing.

------
nsheridan
For (possibly) smaller scale I wrote a self-service SSH CA:
[https://github.com/nsheridan/cashier](https://github.com/nsheridan/cashier)

------
cakoose
> _We do not place SSH certificates on laptops because it is difficult to
> control everything that individual employees run on them. Even though they
> are centrally managed, laptops are more prone to vulnerabilities than the
> bastion servers. Therefore, we do not trust them with SSH private keys._

If the laptop doesn't have an SSH private key, how do you SSH to the bastion
host?

> _In your .ssh / directory, you'll see id_ecdsa and id_ecdsa.pub. Copy the
> latter to the CA server and get it signed. Because this is public
> information, the transport isn't important. You can copy and paste, or fax
> it; just don't copy id_ecdsa anywhere._

The transfer of id_ecdsa.pub doesn't need secrecy, but it does need integrity.
You don't want to accidentally sign an attacker's public key.
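
A minimal sketch of what an integrity check before signing could look like, in Python. It computes the SHA256 fingerprint of the received public key line in the style `ssh-keygen -l` prints, and compares it with a fingerprint the user confirmed over a separate channel. The function names and the out-of-band confirmation step are assumptions for illustration, not something the article describes.

```python
import base64
import hashlib


def ssh_fingerprint(pubkey_line: str) -> str:
    """Return an OpenSSH-style SHA256 fingerprint for a public key line."""
    # A public key line looks like: "<type> <base64 blob> [comment]".
    fields = pubkey_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    blob = base64.b64decode(fields[1], validate=True)
    digest = hashlib.sha256(blob).digest()
    # The digest is shown as unpadded base64 with a "SHA256:" prefix.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def safe_to_sign(received_line: str, expected_fingerprint: str) -> bool:
    """True only if the received key matches the fingerprint confirmed out of band."""
    return ssh_fingerprint(received_line) == expected_fingerprint
```

The key itself can still travel over any channel; only the short fingerprint needs a channel the attacker can't tamper with.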

~~~
timv
> If the laptop doesn't have an SSH private key, how do you SSH to the bastion
> host?

> _These hosts use centralized LDAP and Kerberos installations to share account
> information, and they require two-factor authentication to protect against
> password leakage_

It appears that you use your password + some unspecified second factor.

------
hug
I thought this was mostly a solved problem? SSSD or FreeIPA and a bunch of
LDAP servers. Cached credentials will keep you online in the case of a
transient authentication issue.

HA LDAP is pretty much a solved problem as well. Say what you want about
various Microsoft server products, Active Directory is second to none in this
regard, and is a highly flexible and extremely robust service.

Am I missing something specific about Facebook that makes this not an option?

~~~
mfdutra
We have hundreds of thousands of servers and we also run sshd inside
containers everywhere, so we probably have over a million endpoints we can SSH
into. A cache is only good for things you log in to frequently, like your
workstation or a handful of servers. We don't accept the risk of not being
able to log in to a system. Our only dependency is a signed certificate. If
things go really badly south, we can just get the private key and sign
certificates by hand.

~~~
otterley
> We don't accept the risk of not being able to log in to a system.

There are plenty of ways to mitigate the risk of LDAP server unavailability.
Besides, the document said that the bastion server uses it, so why isn't
what's good for the bastion good enough for the rest of the fleet?

~~~
jetpks
LDAP works really well until you have either too many users or too many
endpoints. Then it doesn't work at all. There's no (cost-effective) way to get
past all of the replication traffic flying all over the place. Eventually
consistent models are awful for things like SSH keys and passwords, and that's
exacerbated significantly when you add caching to the mix to ease the load on
your LDAP cluster.

Adding extra brittle moving parts like LDAP and Linux clustering to a
greenfield deployment just doesn't seem attractive when a CA is so easy to
run, and you should already have one if you're doing config management sanely.

~~~
otterley
Your premise would make sense if LDAP replication were expensive, but it
isn't. LDAP database modifications are relatively rare: you only make them
when a new user is added; a user's credentials change; or a user is deleted
(which should be never for various post-termination accounting reasons). Even
at Facebook, the change rate should be relatively low.

Also you're making an assumption about the need for consistency, when as a
practical matter there's rarely a need for it. Caching is effective and
practical for this use case and you'd have to make a very strong case that it
should be thrown out.

Finally, it is my experience that people grossly misjudge the difficulty of
securely and scalably running a CA. Most such comments come from those who
have never actually operated one.

------
crypt1d
The implementation seems quite novel, but I can't shake the feeling that it's a
bit over-engineered. The additional complexity of this over a standard
solution (e.g. FreeIPA + SSSD) may introduce other issues that need to be
mitigated.

The approach of using certificates over a cache like SSSD just shifts the
issue from cache expiration to certificate expiration. With certificates,
you are likely to set some period after which user certificates expire, and if
you have an expired certificate and the CA is down, you still cannot log in. If
you don't expire your certificates, you have a security risk.

With FreeIPA you can also map public keys to specific users and have sshd pull
authorized_keys from SSSD instead of the actual file. So if a user leaves the
system, his keys get removed. So no need to worry about old keys lingering on
your servers.

As somebody already mentioned, LDAP (which FreeIPA uses as backend) has a
battle-proven HA capability. With some sane configurations of the underlying
servers (separate physical equipment, networking, etc), you don't need to
worry about your backend mysteriously going down.

------
AceJohnny2
Tangent: _"When designing a security system, regardless of purpose or
protocol, you need to think of authentication and authorization separately.
Authentication securely verifies that the client is who it claims to be, but
does not grant any permissions. After the successful authentication,
authorization decides whether or not the client can perform a specific
action."_

These two operations are often abbreviated _"authn"_ (for autheNtication) and
_"authz"_ (for authoriZation) in security frameworks.
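
A toy sketch of that separation in Python, loosely modeled on certificate logins. Every name here (`TRUSTED_SIGNERS`, `ALLOWED_PRINCIPALS`, the dict-shaped certificate) is hypothetical and only stands in for real signature verification and real policy files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    user: str
    principals: frozenset


# Hypothetical stand-in for "the certificate was signed by a CA we trust".
TRUSTED_SIGNERS = {"example-ca"}

# Hypothetical per-account policy: which principals may log in as which account.
ALLOWED_PRINCIPALS = {
    "root": {"root-everywhere"},
    "deploy": {"deploy-web", "root-everywhere"},
}


def authenticate(cert: dict):
    """Authn: establish who the client is. Grants nothing by itself."""
    if cert.get("signer") not in TRUSTED_SIGNERS:
        return None
    return Identity(user=cert["key_id"], principals=frozenset(cert["principals"]))


def authorize(identity: Identity, account: str) -> bool:
    """Authz: decide whether an already-authenticated identity may act as `account`."""
    return bool(identity.principals & ALLOWED_PRINCIPALS.get(account, set()))
```

Keeping the two functions apart means the policy table can change without touching how identities are verified, and the reverse.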

~~~
mfdutra
In fact, we use the terms authn, authz, authnz and AAA a lot at Facebook. :-)

~~~
beagle3
When I was doing security work, AAA stood for "authorization, authentication
and audit"

~~~
otterley
The last one was also known as "accounting."

------
mcpherrinm
How do you manage what servers clients trust?

I've seen a few SSH CA projects for signing user certs, but not for the server
side certificates.

When you're running containers, if they're spawned frequently, that can mean a
significant amount of known_hosts file churning. It sounds like everyone
funnels through a small set of bastions, so you'd only need to update them
there, but I'm wondering if anyone else uses an SSH CA to solve that side of
the equation?

All the ingredients needed are in `ssh-keygen`, but that doesn't feel super
awesome to me.

Solving server identity is why I built
[https://github.com/square/sharkey](https://github.com/square/sharkey)

~~~
knorker
You've not seen that?

It's HostCertificate in sshd_config. Then in known_hosts:

@cert-authority *.example.com ssh-rsa AAAAB…

[https://www.digitalocean.com/community/tutorials/how-to-create-an-ssh-ca-to-validate-hosts-and-clients-with-ubuntu](https://www.digitalocean.com/community/tutorials/how-to-create-an-ssh-ca-to-validate-hosts-and-clients-with-ubuntu)

Edit: Oh, you mean for making them shorter-lived.

~~~
mcpherrinm
You can trust a CA easily enough. So do you just glob some shell scripts
around ssh-keygen?

Short-lived is one solution to limiting their lifetime: the other is to use a
CRL format.

Either way, you need software to manage the issuance and later revocation.

I suppose you could build this into your host imaging profile, or use config
management software. I'm just interested in what people do.

------
thenewwazoo
This looks very similar to Netflix's BLESS package[1]

[1] HN discussion at
[https://news.ycombinator.com/item?id=11746425](https://news.ycombinator.com/item?id=11746425)

~~~
spydum
Yes, I was coming to post the exact same thing.

------
ianunruh
Another promising solution that uses these principles is Teleport
([http://gravitational.com/teleport](http://gravitational.com/teleport))

------
skywhopper
If I'm reading this right, this looks like an improvement over Netflix BLESS,
IIRC. My concern with something like BLESS was that it lets you in as root
or some deploy/app user based on a cert trust chain, but doesn't seem to trace
back to the actual user who's making the changes or differentiate between
two users logged in at once. This embeds the user information in the cert.
Looks nice.

------
lox
How do they handle revocation? The problem I had with this approach in the
past was no CRL support.

~~~
mfdutra
We can use RevokedKeys for that, but in fact we normally issue short-lived
certificates. If we ever have a mass certificate leak, we'll just rotate the
entire CA.
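
A rough sketch, in Python, of why short lifetimes reduce the need for revocation lists: a leaked certificate simply stops being accepted once its window closes. The one-hour lifetime, the five-minute clock-skew allowance, and all function names are assumptions for illustration; the thread doesn't say which lifetimes are actually used.

```python
from datetime import datetime, timedelta, timezone


def issue_window(now, lifetime=timedelta(hours=1), skew=timedelta(minutes=5)):
    """Return (valid_after, valid_before) for a newly issued certificate."""
    # Backdate the start slightly so small clock differences don't reject a fresh cert.
    return now - skew, now + lifetime


def is_usable(now, valid_after, valid_before, key_id, revoked=frozenset()):
    """A cert is usable only inside its window and only if it has not been revoked."""
    if key_id in revoked:
        return False
    # Half-open interval: usable from valid_after up to, but not including, valid_before.
    return valid_after <= now < valid_before


if __name__ == "__main__":
    issued = datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc)
    after, before = issue_window(issued)
    print(is_usable(issued + timedelta(minutes=30), after, before, "alice"))
    print(is_usable(issued + timedelta(hours=2), after, before, "alice"))
```

The `revoked` set plays the role of an explicit revocation list; with short windows it only has to cover certificates that are still inside their lifetime.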

~~~
mfdutra
Also, we bind certificates to the hosts that requested them. If a certificate
and its private key move somewhere else, they will be useless.

ssh-keygen -O source-address=1.2.3.4...
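
To illustrate the effect of that option, here is a small Python sketch of the check a server could make: the certificate carries a comma-separated list of addresses or CIDR blocks, and a login is accepted only if the connecting address falls inside one of them. The function name and the example addresses are hypothetical, and this is not OpenSSH's own implementation.

```python
import ipaddress


def source_address_allows(option_value: str, client_ip: str) -> bool:
    """Check a client IP against a comma-separated list of addresses or CIDR blocks."""
    client = ipaddress.ip_address(client_ip)
    for entry in option_value.split(","):
        # A bare address such as "192.0.2.10" is treated as a single-host network.
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if client.version == network.version and client in network:
            return True
    return False
```

With a certificate bound to `192.0.2.10`, a copy of the certificate and private key used from `198.51.100.7` would be rejected, which is the "useless if moved" property described above.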

~~~
lox
Interesting, so I'm assuming this wouldn't work for staff that roam with
laptops?

~~~
mfdutra
We only do that on bastion servers. It would be a bit trickier with laptops,
but doable I guess.

~~~
lox
Makes sense. I missed that section, much clearer now, thanks. Any plans to
release more details on how the bastion hosts are configured? Seems like that
would be the complex bit to get right.

------
knorker
How anticlimactic. Nothing new or internal, just "hey here's what's in the
OpenSSH manpage: TrustedUserCAKeys and AuthorizedPrincipalsFile".

