Post-mortem and remediations for the Matrix.org security breach (matrix.org)
183 points by Arathorn 13 days ago | 64 comments





One of the failures here is that they weren't able to keep deployed software up to date for security fixes, even when those security fixes were publicly known.

They have acknowledged this in their section "Keeping patched".

However, there is one thing I think they have omitted to consider: the more they rely on third-party software that doesn't come from the server distribution they are using, the more disparate and unreliable the sources for security fixes become.

Careful choice of production software dependencies is therefore a factor. Going outside the distribution is usually unavoidable for a small number of dependencies that are central to the mission, but in general I wonder if they have any kind of policy to favour distribution-supplied dependencies over any other type.

Another way of looking at this: we already have a community that comes together to provide integrated security updates that can be automatically installed, and you already have access to it. Not using this source compromises that ability. If some software isn't available through Debian, it is usually because there is some practical difficulty in packaging it, and I argue that security maintenance difficulty arises from the same root cause.

On a similar note, I'm curious about their choice to switch from cgit to GitLab. Both are packaged in Debian, but I believe that even Debian doesn't use the packaged GitLab for Debian's own GitLab instance. Assuming that using the Debian GitLab package is therefore not practical, wouldn't cgit be better from a "receives timely security updates through the distribution" perspective?


This is an excellent point.

In the (distant) past, we tended to prefer to roll our own builds for critical services (e.g. apache, linux kernel) rather than use distribution-maintained packages. The reason was pretty much one of being control freaks: wanting to be able to patch and tweak the config precisely as it came from the developers, rather than having to work out how to coerce Debian's apache package into increasing the hardcoded accept backlog limit or whatever today's drama might happen to be.

However, this clearly comes at the expense of ease of keeping things patched and up-to-date, and one of the things we got right (albeit probably not for the right reasons at the time) when we did the initial rushed build-out of the legacy infrastructure in 2017 was to switch to using Debian packages for the majority of things.

Interestingly, cgit was not managed via Debian packages (because we customised it a bunch), and so it definitely was a security liability.

Gitlab is a different beast altogether, given it's effectively a distro in its own right, so we treat it like an OS which needs to be kept patched just like we do Debian.

For what it's worth, I think by far the hardest thing to do here is to maintain the discipline to go around keeping everything patched on a regular basis - especially for small teams who lack dedicated ops people. I don't know of a good solution here other than trying to instil the fear of God into everyone when it comes to keeping patched, and throwing more $ and people at it.

Or I guess you can do https://wiki.debian.org/UnattendedUpgrades and pray nothing breaks.
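
For anyone who hasn't set it up before, enabling it is roughly this (a minimal sketch for a stock Debian/Ubuntu box):

  # install and enable automatic security upgrades
  sudo apt install unattended-upgrades apt-listchanges
  # writes /etc/apt/apt.conf.d/20auto-upgrades with the periodic settings
  sudo dpkg-reconfigure -plow unattended-upgrades
  # dry run to see what would be upgraded
  sudo unattended-upgrade --dry-run --debug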


The Nix package manager can help keep packages that are not available for your distribution updated and customised (https://nixos.org/nix/).

In the past I used to install newer or customised versions of e.g. `git` than were available on my Ubuntu into my home directory using e.g. `./configure --prefix=$HOME/opt`. That got me the features I wanted, but of course made me miss out on security updates, and I had to remember every piece of software I had installed this way.

With nix, I can update them all in one go with `nix-env --upgrade`.

Nix also allows you to declaratively apply custom patches to "whatever the latest version is".

That way I can have the things you mentioned (e.g. a hardcoded accept backlog for Apache, hardening compile flags) without the mentioned "expense of ease of keeping things patched and up-to-date". I found that very hard to do with .deb packages.

It's not as good as just using unattended-upgrades from your main distro, because you still have to run the one `nix-env --upgrade` command every now and then, but that can be easily automated.
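
For example, a rough sketch of a cron entry's payload (adjust to taste):

  # refresh the channel and upgrade everything installed via nix-env
  nix-channel --update
  nix-env --upgrade
  # optionally clean up old generations afterwards
  nix-collect-garbage --delete-older-than 30d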


I only know Guix, not Nix, but I found it mostly harder to write package definitions for it than to backport rpms and debs, at least for requirements that aren't radically different from the base system. (That's nothing to do with Scheme, by the way.)

Then, if you're bothered about security, it's not clear that having to keep track of two different packaging systems, and the possible interactions between them, is a win.


> Or I guess you can do https://wiki.debian.org/UnattendedUpgrades and pray nothing breaks.

Better than getting compromised! I used to have a very conservative approach to changing anything, including great caution with security updates and the desire to avoid automatic security updates with a plan to carefully gate everything.

In practice though, security fixes are cherry-picked and therefore limited in scope, and outages caused by other factors are orders of magnitude more common than outages caused by security updates. Better to remain patched, in my opinion, and risk a non-security outage, than to get compromised by not applying them immediately.

A better way to mitigate the risk is to apply the CI philosophy to deployments. Every deployment component should come with a test to make sure it works in production. Add CI for that. Then automate security updates in production gated on CI having passed. If your security update fails, then it's your test that needs fixing.
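
In shell terms, the shape I mean is something like this (a hand-wavy sketch; the test harness and playbook names are invented):

  #!/bin/sh -e
  # apply pending security updates on a canary host first
  ssh canary 'sudo unattended-upgrade'
  # gate the rest of the fleet on the canary still passing its smoke tests
  ./run-smoke-tests.sh canary                                    # hypothetical test harness
  ansible-playbook -l 'all:!canary' apply-security-updates.yml   # hypothetical playbook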


fwiw, we do do https://wiki.debian.org/UnattendedUpgrades for the debian packages - I should have mentioned that in the writeup.

But there are still a few custom things running around which aren't covered by that (e.g. custom python builds with go-faster-stripe decals; security upgrades which require restarts, etc), hence needing the manual discipline for checking too. But given we need manual discipline for running & checking vuln scans anyway, not to mention hunting security advisories for deps in synapse, riot, etc, I maintain one of the hardest things here is to have the discipline to keep doing that, especially if you're in a small team and you're stressing about writing software rather than doing sysadmin.


Why not use https://github.com/liske/needrestart to automatically restart services that need restarting after security upgrades and unattended-upgrades, plus a cron job for rebooting the whole machine after kernel upgrades (or periodically)?

Shouldn't ansible do all this for you? I heard it's the recommended way for automatic updates and service restarts.

Please let me know about this as I'm interested myself.
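
To be concrete, what I have in mind is roughly this (untested sketch, assuming Debian-ish defaults):

  # restart services whose libraries were upgraded, without prompting
  sudo apt install needrestart
  echo "\$nrconf{restart} = 'a';" | sudo tee /etc/needrestart/conf.d/auto.conf
  # let unattended-upgrades reboot the box when a kernel update requires it
  echo 'Unattended-Upgrade::Automatic-Reboot "true";' | sudo tee /etc/apt/apt.conf.d/51auto-reboot
  echo 'Unattended-Upgrade::Automatic-Reboot-Time "03:00";' | sudo tee -a /etc/apt/apt.conf.d/51auto-reboot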


I wonder what's missing from Debian to automate such things since my automation experience is mainly with RHEL. (I realize it may be partly a question of effort for automation, but it sounds as if that's not the root of it.)

Debian can restart processes dependent on updated packages and issue alerts about the need to, and you can automate checking for new releases of things for which you've done package backports. That doesn't finesse reboots for kernel updates and whatever systemd forces on you now, but I assume you can at least have live kernel patching as for the RHEL systems for which I used not to get system time.


BTW have you considered using GitLab CI instead of buildkite?

Hey! Yes, although one of the reasons for going with Buildkite was that we had it working well for Synapse before we got our Gitlab up and running, and so rather than doing a Jenkins->Travis->Circle->Buildkite->Gitlab journey, we decided to give Buildkite a fair go. The team also has more experience with it. Gitlab CI could work too, though, and we've heard good things. It would be yet another function to worry about self-hosting though (if we used gitlab.matrix.org for it).

Ah, thanks for the answer. Makes sense to not change it. Please let us know if you do want to convert and need help.

That is an excellent and very helpful writeup!

I'm particularly disappointed to hear that Google doesn't provide any way to rotate the signing key for an app. Is there an issue filed with them anywhere, or any further discussion?

Some day, I hope reputable services have migrated to The Update Framework, which has been pointing out and solving these and other problems related to software updates for several years now.

https://theupdateframework.github.io/

Actually, a quick search leads to this - is it indeed possible to rotate your key, at least for Android's Pie version?

  https://www.androidpolice.com/2018/08/13/android-pie-includes-key-rotation-way-developers-change-app-signatures/

So yes, Google Play has let you rotate your key for a few years now, but a) Riot/Android was set up before that was a thing, b) it gives Google the ability to push their own updates to your app, which some of the more paranoid users might object to. So we set it back up with our own key again, but this time we will protect it with our lives...

Edit: https://developer.android.com/studio/publish/app-signing#app... is the type of key rotation i was talking about here.


actually, the mechanism described in https://www.androidpolice.com/2018/08/13/android-pie-include... sounds different to this, but given it mandates Android 9.0, we can't use that either yet. (Our minimum Android is still 4.1...)

Hi, former lead researcher at NYU for The Update Framework (TUF) here, now security engineer at Datadog taking TUF further. Planning to help PyPA apply TUF to PyPI. Happy to help answer questions, just reach out to me here, thanks.

Great post-mortem, with a candid examination of the decisions that contributed to lax security on the homeserver. While a security breach is never great, this kind of honest post-mortem improves my estimation of the chances that the matrix.org team will get things right in the future.

I applaud the decision to get rid of Jenkins.

The way Jenkins works, with each plugin being able to implement arbitrary endpoints, it is almost inevitable that it would have many security vulnerabilities.

No Jenkins masters should be exposed to the internet, ever -- and if there is really no other way, then set up a proxy in front of it with strict whitelist of allowed URLs.


Author here - hopefully the level of detail here will let others learn from our mistakes (and hopefully benefit from how we've chosen to fix them going forwards). Happy to answer any/all questions or comments.

TL;DR: keep your services patched; lock down SSH; partition your network; and there's almost never a good reason to use SSH agent forwarding.


>The attacker put an SSH key on the box, which was unfortunately exposed to the internet via a high-numbered SSH port for ease of admin by remote users, and placed a trap which waited for any user to SSH into the jenkins user, which would then hijack any available forwarded SSH keys to try to add the attacker’s SSH key

You could also fund/donate to/advocate for a better SSH agent.

I use both Pageant and ssh-agent in my home network for ease of ssh'ing into boxes, especially Unifi gear and some dev VMs. I don't think I will stop using agents, but I probably wouldn't use them at work.

Why couldn't there be an agent that required you to touch a Yubikey before it'd allow keys to be forwarded? Why couldn't you add prompting and timeouts to an agent?


Just use ProxyJump. You basically should never be using agent forwarding.

ssh-agent has prompting and you can set up a Yubikey with ssh.

The problem here was agent forwarding, which you should almost always replace with opening a new connection via ssh -J (or equivalent.)
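
For example (hostnames are placeholders):

  # jump through the bastion instead of forwarding your agent onto it
  printf '%s\n' \
    'Host internal-box' \
    '    HostName internal-box.example.com' \
    '    ProxyJump bastion.example.com' >> ~/.ssh/config
  # or as a one-off, equivalent to the config above:
  ssh -J bastion.example.com internal-box.example.com

The authentication to the inner host is tunnelled end-to-end, so nothing running on the bastion ever sees your agent or keys.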


But can I prompt every time the agent is used?

How would you know whether the agent is being used by a legitimate app or a malicious app racing with a legitimate app to steal access?

At least you only would leak a single access, and you would have a higher chance of noticing, but I can also see that if the hijack was done intermittently you might write it off as a glitch...


Yup, if you're using ssh-agent (as opposed to something like gnome-keyring), adding `AddKeysToAgent confirm` to your ssh config should cause a confirmation prompt to pop up every time anything requests use of a key from the agent.
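
Concretely (the key path is just an example):

  # require a confirmation prompt for every use of a key held by the agent
  printf '%s\n' 'Host *' '    AddKeysToAgent confirm' >> ~/.ssh/config
  # or add an individual key with the confirm constraint
  ssh-add -c ~/.ssh/id_ed25519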

Thanks so much for this write-up, really appreciate the candidness and detail.

I've learned a lot from it and will be adding some of these practices to the infrastructure that I manage.


I should probably also link to the HN thread from when this happened, which has a lot of interesting discussion about the security issues in question: https://news.ycombinator.com/item?id=19642554

Also, https://news.ycombinator.com/item?id=19643227 is an excellent tl;dr.


The writeup is interesting. As a security-conscious developer (and one with quite a lot of experience with deployments of multi-server systems) I wonder if there's a comprehensive, coherent guide to doing The Right Thing security-wise in such scenarios. Multiple interacting servers, multiple developers, continuous delivery... I think that Google's BeyondCorp approach is rather different from this (and SSH would be publicly exposed), but it has an inherent level of complexity which would be hard to cope with in a small org.

Check out https://infosec.mozilla.org/guidelines/openssh for a nice overview of best-practices on SSH.

>In terms of remediation, designing a secure build process is surprisingly hard, particularly for a geo-distributed team. What we have landed on is as follows:

>We then perform all releases from a dedicated isolated release terminal.

>We physically store the device securely.

Why didn't they go with a HSM?


This approach isn't incompatible with an HSM, as per:

> The signing keys (hardware or software) are kept exclusively on this device.

We still want to make very sure that the build environment itself hasn't been tampered with, hence keeping the build machine itself isolated too.

A much better approach would be to use reproducible builds and sign the hash of a build with a hardware key, but we didn't want to block an improved build setup on reproducibilizing everything.

Edit: we may be missing an HSM trick, though, in which case please elaborate :)


I'm not sure I understand what you mean by it being incompatible -- a HSM is a hardware device which generates and stores its keys separately from your computer's main memory, such that extracting the keys (even if the machine is compromised) should be impossible. In fact, it would eliminate the issue of signing keys being stolen from a compromised build machine.

Since Android signing keys are just PKCS #8, and GPG keys are supported by most HSMs, a HSM would definitely be usable (even if you just used an addon HSM card that you added to your "release terminal"). Unfortunately in order to safely use the HSM you'd need to re-generate your keys again from within the HSM -- which obviously is a problem on Android. In addition, HSMs are quite expensive and might be prohibitively so in your case. But I would definitely recommend looking into it if you're really stuck on doing distribution yourselves.

Reproducible builds are a useful thing separately, but using a HSM doesn't require reproducible builds -- after all signing a hash of a binary is the same as just signing the binary. The main benefit of reproducible builds is that people can independently verify that the published source code is actually what was used to build the binary (which means it's an additional layer of verification over signatures).

One question I have is how are you going to handle the case where the release terminal fails? Will you have to (painfully) rotate the keys again?


I said isn’t incompatible.

I.e. we are already using HSMs on the build server.


Ah, oops. That explains why it didn't make sense. :P

> [The attacker] placed a trap which waited for any user to SSH into the jenkins user, which would then hijack any available forwarded SSH keys to try to add the attacker’s SSH key to root@ on as many other hosts as possible.

Can the system you log into via ssh just dump your forwarded PRIVATE key? That easily?

Or was this about the ssh client on the jenkins box being patched to add malicious keys to wherever the devops ssh'd onwards from the jenkins box?


sorry, I think I could have been clearer here.

When you log into a host with SSH agent forwarding turned on, the private key data itself isn't available to the host you're logging into. However, when you try to SSH onwards from that host, agent forwarding means that the authentication handshake is forwarded through to the agent running on your client, which of course has access to your private keys.

So, even though the private key data itself isn't directly available to the host, any code running which can inspect the SSH_AUTH_SOCK environment variable of the session that just logged in can use that var to silently authenticate with other remote systems on your behalf.

If you've already found a list of candidate hosts (e.g. by inspecting ~/.ssh/known_hosts) then your malware can simply loop over the list, trying to log in as root@ (or user@) and compromising them however you like. Which is what happened here, by copying an authorized_keys2 file containing a malicious key onto the target hosts. You don't need to patch the ssh client; it's just ssh agent forwarding doing its thing. :|
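
To make it concrete, once the attacker's code is running on the box you logged into, it only needs something along these lines (illustrative only):

  # anything on the box that can read the forwarded agent socket can use it
  export SSH_AUTH_SOCK=/tmp/ssh-XXXXXX/agent.1234   # socket created for the incoming session
  ssh-add -l                        # lists the victim's key fingerprints
  ssh root@some-other-host 'id'     # authenticates with the victim's keys, no passphrase needed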


There's another scary implication. If the user has set up SSH on their own machines and added their own key as a valid key for logging in (for instance if you have two machines and you use the same SSH key for logging into everything), the attacker can log into the user's own machine and compromise it. This makes remediation even more difficult, because even after you have secured all the servers, the attacker can get back in through an affected user's machine.

A simple yet clever attack. I wonder how you'd protect against it without banning SSH forwarding, which has almost certainly saved me from (some) RSI.

> RSI

Manually typing passwords in on an attacker-controlled machine doesn't sound very safe either.


since banning ssh agent forwarding I haven’t missed it at all - ssh -J has been an almost perfect replacement for my use cases.

What sort of thing are you using ssh -A for which couldn’t be replaced by ssh -J?


> What sort of thing are you using ssh -A for which couldn’t be replaced by ssh -J?

git checkouts from private repositories, for example. HTTPS requires username/password which may or may not be checked/monitored.


Right. I covered this specifically in the writeup, because it's a use case that we have too. Our proposal is:

> If you need to regularly copy stuff from server to another (or use SSH to GitHub to check out something from a private repo), it might be better to have a specific SSH ‘deploy key’ created for this, stored server-side and only able to perform limited actions.

And this is the approach we're taking going forwards.

If the problem is that you only ever want to read from git when an admin is logged into the machine, i guess the safest bet would be to use a temporary deploy key (or temporarily copy the deploy key onto the machine until you've finished admining). Forwarding all the keys from your agent is a recipe to end up pwned like we did, however.
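
Rough sketch of the deploy-key approach (repo and host names invented):

  # on the server: a dedicated read-only deploy key instead of a forwarded agent
  ssh-keygen -t ed25519 -f ~/.ssh/deploy_key -N '' -C 'deploy key for myrepo'
  printf '%s\n' \
    'Host github-deploy' \
    '    HostName github.com' \
    '    User git' \
    '    IdentityFile ~/.ssh/deploy_key' \
    '    IdentitiesOnly yes' >> ~/.ssh/config
  # register deploy_key.pub as a read-only deploy key on the repo, then:
  git clone github-deploy:myorg/myrepo.git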


You cannot dump it but you can sign data with it.

I wonder why no one uses SSH certificate-based authentication. Yes, SSH supports certificate-signed keys to allow login. No per-user public keys need to be on the server at all.
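
For reference, the rough shape of it (paths and names are just examples):

  # on an offline CA machine: sign a user's public key, valid for 12 weeks
  ssh-keygen -s user_ca -I alice -n alice -V +12w alice_key.pub
  # on every server: trust certificates signed by the CA instead of per-user authorized_keys
  echo 'TrustedUserCAKeys /etc/ssh/user_ca.pub' | sudo tee -a /etc/ssh/sshd_config
  sudo systemctl reload ssh   # the service may be called sshd depending on the distro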

Package management and microservices are the reason for this.

TL;DR Don't let devs manage your systems.

This might be the best post-mortem I've seen. Bravo Matrix.org team, informative and inspires confidence in your process.

> SSH agent forwarding should be disabled.

> SSH should not be exposed to the general internet.

> If you need to copy files between machines, use rsync rather than scp.

Great. Just great. I still remember when SSH was described as the solution to fix telnet and rcp. And now we can't use it any more. Fan-freaking-tastic.


SSH is fine :) But agent forwarding is the biggest footgun imaginable, and scp sadly has design flaws some of which it literally inherited verbatim from rcp.

But using SSH as a shell is fine. And rewiring your fingers to type rsync rather than scp isn't too bad either - plus you get resumption etc for free :) (And yes, I appreciate the parent is being slightly tongue in cheek).

Edit: of course, if we'd been using xrsh and xrcp from XNS rather than this newfangled TCP/IP stuff none of this would probably ever have happened...


Sorry for my snarky tone. I'm dealing with an intrusion of my own right now and your writeup was actually quite helpful, so thanks for doing it.

gah, sorry to hear that - good luck!

How does using rsync instead of scp help? Isn't the default behavior of rsync to use SSH for transport, just like scp does? Thus you'd still rely on forwarding keys or another ssh authentication method.

The suggestion is to use rsync rather than scp, not ssh (which as you rightly say is the default transport for rsync).

SCP is a protocol layered on SSH, and has had a spate of security flaws recently:

* scp client improper directory name validation (CVE-2018-20685)

* scp client missing received object name validation (CVE-2019-6111)

* scp client output spoofing via the object name (CVE-2019-6109)

* scp client output spoofing via stderr (CVE-2019-6110)

And as of 8.0, OpenSSH recommends you no longer use SCP in favour of sftp or rsync, as per the security paragraph of https://www.openssh.com/txt/release-8.0:

> The scp protocol is outdated, inflexible and not readily fixed. We recommend the use of more modern protocols like sftp and rsync for file transfer instead.


Ah, I wasn't familiar with the security problems in the SCP protocol. Thanks. I had misread the recommendation on the blog post as " SSH agent forwarding is insecure, so use rsync instead of SCP", which didn't make sense.

Not to mention that rsync is a much better tool than scp in almost all respects (the only advantage scp has is that it works on all OpenSSH servers, while rsync requires you to have rsync installed on the remote end).
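
The muscle-memory swap is pretty much:

  # instead of: scp -r ./build user@host:/srv/app/
  rsync -avP ./build user@host:/srv/app/   # same transfer over ssh, but incremental and resumable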

Is SSH fine..?

mosh dev and users think no.


what sort of thing are they worried about?

I don't know if they'd describe it as "worry", but take a look!¹

>We think that Mosh's conservative design means that its attack surface compares favorably with more-complicated systems like OpenSSL and OpenSSH. Mosh's track record has so far borne this out. Ultimately, however, only time will tell when the first serious security vulnerability is discovered in Mosh—either because it was there all along or because it was added inadvertently in development. OpenSSH and OpenSSL have had more vulnerabilities, but they have also been released longer and are more prevalent.

> In one concrete respect, the Mosh protocol is more secure than SSH's: SSH relies on unauthenticated TCP to carry the contents of the secure stream. That means that an attacker can end an SSH connection with a single phony "RST" segment. By contrast, Mosh applies its security at a different layer (authenticating every datagram), so an attacker cannot end a Mosh session unless the attacker can continuously prevent packets from reaching the other side. A transient attacker can cause only a transient user-visible outage; once the attacker goes away, Mosh will resume the session.

> However, in typical usage, Mosh relies on SSH to exchange keys at the beginning of a session, so Mosh will inherit the weaknesses of SSH—at least insofar as they affect the brief SSH session that is used to set up a long-running Mosh session.

¹https://mosh.org/#techinfo


Eh, I think you misunderstood this. There are still no alternatives to SSH. And if you want to expose something to the open internet, SSH is way better than telnet and rcp.

In particular, the rsync command that they are talking about still uses SSH as the underlying transport.


mosh?

from mosh.org

> Mosh doesn't listen on network ports or authenticate users. The mosh client logs in to the server via SSH, and users present the same credentials (e.g., password, public key) as before.


Initially.

I have never used the matrix.org service, but I had heard of them previously, and from their website I could see that the word ‘security’ or ‘secure’ was used a lot.

Reading the blog post, I wonder how many security specialists this organisation really has, as they would never have allowed these fundamental errors to be made, even with the explanation that they set up their infra in a rush. Dedicated security teams would surely have fixed these basic errors.

I would advise anybody looking for ‘secure’ applications to stay away from these organisations. Who knows how many possible flaws are deeply embedded in their systems, like zero-days, memory leaks and more; they did not even have a basic security policy in place... please don’t use the word secure.


Isn't a lot of it/all of it reviewable on their github? Does that not help you make a decision on their quality?

Doesn't really help if I can just go into their system and introduce my own code into their SDKs, or just sign my own release of a build. It just makes me question how secure their build process is. Without security people, how can you claim to be secure?


