The initial idea and the C++ implementation (using TLS with X.509 certificates and explicit UDP broadcasts) were done in 2007 by another person. The redesign of the protocol (to TLS with OpenPGP keys¹ and DNS Service Discovery²) and the re-implementation in Python and C were done by me in collaboration with that person. In addition to ongoing maintenance, the relatively recent switch from TLS with OpenPGP keys to TLS with Raw Public Keys³ was done by me.
The level of understanding required is something I would expect all system administrators worth their salt to have had at the time. I would think that the best way to acquire such knowledge is to do the Linux From Scratch⁴ exercise, even though I have not done it myself.
Looks like a neat project but the intro/faq should probably be a bit more self-critical to point out weaknesses. The “nope, it’s protected by TLS” answers ignore the fact that anyone attacking this could also have attacked the PKI. If someone gets the client cert and key, they can probably fake the request to get the decryption password. I’m assuming that client key isn’t protected by a password, since then that would be the thing a user has to provide at boot time. And what about the vector where someone attacks the CA that issued the certs? Where is that stored? Can fake roots be injected by someone in possession of both machines? This may be moot if you are using self-signed certs, but of course those introduce their own management issues.
Also, I don’t really see any discussion of availability concerns. This is a system with a pretty gnarly fail-closed kill switch that could trigger on a simple network outage. That doesn’t really seem to be acknowledged and there’s no discussion of the inherent balance between security and availability. You really need to be able to guarantee a certain level of availability or things basically self-destruct. Presumably there’s a mechanism that allows a self-destructed pair or cluster of these mandros’d servers to go back to a normal operating mode?
Anyway, I don’t mean to be too critical. It’s a really cool project. A little Byzantine but with a stated reason for that. Would just like to see more focus on the weaknesses and potential critical operational issues. A section called “reasons you may not want to use this” that is very up front about those seems appropriate.
Teddyh's answer describes some of the technical aspects, though I would like to add the security scenarios that Mandos works to address. Any security measure is, in one way or another, designed around known threats, assets, and costs/outcomes.
If one operates a bunch of servers with FDE in a server room, getting there every time there is a need to reboot is a significant problem. To mention a few causes: redundant nodes going up and down in the middle of the night, updates to the operating system and kernel, and misbehaving hardware. At the same time, those servers are likely to hold a lot of data that is sensitive to companies or persons, especially email, which puts the administrator in a conflict over whether to use full disk encryption. In my experience, unless there are regulations that dictate otherwise, servers are left unencrypted because of the hassle and downsides of manual intervention, i.e. needing to attend reboots in person. This was the initial reason why Mandos was created, many years ago. If the server hall loses both primary and backup power, there is a real risk that the administrator does need to travel there to bring the machines back up. That would be one of the major trade-offs, though I would still recommend administrators accept it, compared to the risk of an unencrypted disk getting lost or stolen, or of someone coming in and taking all the servers.
There are naturally other scenarios that one can use Mandos for, but like any tool it's good to know what it is designed for. It is not intended to replace setups where one is already using FDE, typing in the passwords manually at the terminal, and is happy with that. If one does not need the unattended aspect but wants to reboot the server remotely, there are things like Dropbear or IPMI/remote KVMs, in which case the security will rely on those components' security. In my experience, IPMI should not be exposed to the internet, which means one first needs a secure entry point to the local network. Dropbear uses SSH, which means one should use client certificates and verify the signature before use. Depending on the use case and what risks one wants to take there are benefits and drawbacks, but the key point I want to come back to is that people really should use full disk encryption, and Mandos alleviates the primary reason people don't use FDE.
> If the server hall loses both primary and backup power, there is a real risk that the administrator does need to travel there to bring the machines back up. That would be one of the major trade-offs, though I would still recommend administrators accept it, compared to the risk of an unencrypted disk getting lost or stolen, or of someone coming in and taking all the servers.
Yeah, I agree with all of this. My nitpick is basically just a request that the doc talk about this being a conscious tradeoff, where your infra availability and your lack of tolerance for frequent fail-closed events might lead you to intentionally weaken the security guarantees by lengthening the timeouts. In other words, you set the timeouts as short as you can tolerate, based on your infra.
> And what about the vector where someone attacks the CA that issued the certs?
There is no CA involved, nor any X.509 keys. The keys used in TLS are ed25519 raw keys, and the server has a list of, and checks, individual key fingerprints.
> This may be moot if you are using self-signed certs, but of course those introduce their own management issues.
Yes, you have to generate and transport keys out-of-band (i.e. by hand) as part of the initial setup. The instructions on exactly how to do this are shown as part of installation and configuration.
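To make that concrete, here is a rough sketch of the general scheme in Python, using the third-party `cryptography` package. This is my illustration of the idea, not the actual Mandos key-generation tooling, and the helper names are made up:

    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey)
    from cryptography.hazmat.primitives.serialization import (
        Encoding, PublicFormat)

    def fingerprint(public_key):
        # SHA-256 over the DER-encoded SubjectPublicKeyInfo
        # (the bare-key structure; no certificate anywhere).
        der = public_key.public_bytes(
            Encoding.DER, PublicFormat.SubjectPublicKeyInfo)
        return hashlib.sha256(der).hexdigest()

    # On the client: generate a keypair, then carry the fingerprint
    # to the server by hand (the out-of-band step).
    client_key = Ed25519PrivateKey.generate()
    print("Add to the server's list:",
          fingerprint(client_key.public_key()))

    # On the server: a peer is accepted only if its raw public key
    # hashes to a fingerprint that was registered by hand.
    allowed = {"...fingerprint copied from the client..."}

    def peer_is_known(peer_public_key):
        return fingerprint(peer_public_key) in allowed

The point being that trust is anchored in that hand-carried fingerprint list, not in any certificate chain.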
> a pretty gnarly fail-closed kill switch
That’s a feature. A security system should fail closed.
> Presumably there’s a mechanism that allows a self-destructed pair or cluster of these [mandos]’d servers to go back to a normal operating mode?
Yes. You either type in a password on the console of one of the servers, or use Dropbear to ssh in remotely and do it.
> A section called “reasons you may not want to use this” that is very up front about those seems appropriate.
The project is mostly intended for those people who have already decided that full-disk encryption is a requirement, and Mandos is meant to alleviate some of the pain which they have already accepted. But sure, I see your point.
> That’s a feature. A security system should fail closed.
Of course, but there should at least be a mention of the fact that you need to tune the fail-closed parameters to take your availability into consideration. I appreciate that the various attacks would have to be done "pretty quick" according to the FAQ but the definition of "pretty quick" is necessarily countered by what kind of guarantees you can make about your availability (of the server(s) and the client), and this isn't mentioned. If a 30 second network failure causes the server to refuse keys to the client from that point on, but you can't guarantee that level of network availability (taking into account things like replacement of network switches and other types of maintenance), the definition of "pretty quick" may be too quick. It's a very direct and explicit tradeoff between security and availability and that concept is absent from the intro/FAQ. As a mental exercise, consider how you'd answer the FAQ "So I should set my timeouts super low for better security?"
Again, I'm not trying to be a picky ass, and I think the project is cool. I just think this is a topic that non-security-folks don't necessarily think about automatically, and this is the opportunity to make them think about it. The entire doc sounds like "faster timeout == better" and it would be very unfortunate for someone to configure and deploy this based on that understanding.
PS your other responses to my nitpicks were great, and somehow I missed the entry about stealing the client key being possible but having to be done very quickly. Thumbs up. I'm curious about what you mean by saying you aren't using "x509 keys" though. You must be generating a self-signed x509 cert containing the client's pubkey in order to do TLS. The packaging of the key itself isn't really relevant, is it? The "cert validation" on either side doesn't really care much about the contents of the cert other than the pubkey encoded therein, but you still do actually have to create x509 certs using those keys unless you've completely butchered the TLS stack. Right?
The timeout can’t realistically be set very short, since it needs to allow for a normal reboot of a server. Servers are, in my experience, notoriously slow to reboot. Therefore, a typical network hiccup is assumed to be shorter than that. The default timeout value of 5 minutes reflects this.
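As a toy model of the trade-off (my sketch, not actual Mandos code), the server-side logic amounts to something like this:

    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=5)  # the default mentioned above

    class Client:
        def __init__(self):
            self.last_seen = datetime.now()
            self.enabled = True

        def checker_succeeded(self):
            # Called whenever the server successfully verifies that
            # the client machine is still up and in place.
            self.last_seen = datetime.now()

        def expire_if_overdue(self):
            # Fail closed: an outage longer than TIMEOUT looks the
            # same as a stolen machine, so the password is withheld
            # until an administrator re-enables the client.
            if datetime.now() - self.last_seen > TIMEOUT:
                self.enabled = False

The tuning rule this implies: the timeout must exceed your worst-case reboot time plus any network outage you cannot rule out, and your threat model decides how much longer than that you can afford.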
Also, you could add the Mandos server status to your alerting system, and if anything goes wrong with your network and the Mandos server times out for a client, you can be alerted to this fact, so you can fix it before the next time that client happens to reboot.
> consider how you'd answer the FAQ "So I should set my timeouts super low for better security?"
Fair; the text could be clearer about this.
> I'm curious about what you mean by saying you aren't using "x509 keys" though.

No certificate is involved at all, not even a self-signed one. TLS with Raw Public Keys (RFC 7250, the mechanism mentioned in the footnote above) replaces the certificate message in the handshake with a bare SubjectPublicKeyInfo structure containing only the public key. So there is no X.509 anywhere to validate or butcher; the server just pins the fingerprints of the raw keys.
That looks like an awesome project, but I'm not sure building an LFS system would help in developing a system like that. Possibly it would help in understanding and configuring it.
I still recall how to build a Linux system from the ground up. Coding up what you're working on in Python/C would take a large amount of unrelated knowledge.
The knowledge about how to write a program comes naturally when you know, in fine enough detail, the problem which the program should solve, how to solve it, and the environment in which the program should run. In this case, writing a Python server program to respond to requests was relatively simple; Python provides built-in modules which make writing servers easy. And when you know what the client program (i.e. the program running on the currently locked host) should do, and you know what environment the program has to operate in, the program more or less writes itself.
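For instance, the standard library's socketserver module gives you a complete, working TCP server in about a dozen lines (a generic echo server, nothing Mandos-specific):

    import socketserver

    class EchoHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # Read one line from the client and send it back.
            line = self.rfile.readline()
            self.wfile.write(line)

    if __name__ == "__main__":
        with socketserver.TCPServer(("", 9999), EchoHandler) as server:
            server.serve_forever()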
The first version of the program used a simple UDP broadcasting method to a hard-coded port to find servers, which required some rudimentary networking knowledge, but only basic TCP/IP stuff.
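From memory, the discovery step was about as basic as this sketch (a rough reconstruction of the idea, not the original code; the port number and payload here are made up):

    import socket

    PORT = 12345  # hypothetical hard-coded port

    def find_server(timeout=5.0):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            sock.settimeout(timeout)
            # Shout on the local network; any listening server replies.
            sock.sendto(b"Mandos client here", ("<broadcast>", PORT))
            try:
                reply, server_address = sock.recvfrom(1024)
                return server_address
            except socket.timeout:
                return None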
Later, both the server and client parts have gone through numerous refactorings which brought in many features (like a plugin system on the client side, and a D-Bus interface on the server side), but those were manageable chunks to add to an already mature and working system.
But sure, in addition to the knowledge one could acquire from LFS, I also had some high-level knowledge of how TLS and its handshake worked, I knew that there was some way to use OpenPGP keys instead of X.509 certificates in TLS, and I knew a little about how DNS-SD worked. The rest I needed I read up on as I wrote the code.
Hardware security modules are no cakewalk either. For webservers I think most people consider them overkill. IME they mostly get used to handle code signing.
And at one company they were worried about the devices getting stolen, so they had HSMs and still couldn’t reboot unattended (though most of the signing keys were held by humans rather than automated).
There is actually a solution for that (shameless plug): https://www.recompile.se/mandos