
Immutable Infrastructure: No SSH - axelfontaine
https://boxfuse.com/blog/no-ssh.html
======
viraptor
It's quite silly. I mean they mix up two completely separate issues: the
deployment problem and the people problem. Image deployments are not new -
computing clusters have done them for ages. Immutable deployments aren't new
either - actual immutable filesystems booted over the network have been around
for many, many years.

Yes, that's going to prevent drift at first. No, it's not going to prevent
people from doing silly stuff. At least ssh allows them to do silly stuff in a
"normal" and visible way and doesn't stop you from debugging issues when you
actually need it. If you enforce no-login without teaching people, they'll
just reinvent it... badly - for example by making executable configurations,
or pseudo-web-shells.

And there's one big issue with images which will cause a lot of friction in
real teams - they're slow to build, and even slower to test. Doing it properly
means first testing locally in some approximate environment, then handing the
package over to a system that tests the whole deployment, which takes ages to
boot up all the components. (The time scales with system complexity - it's not
ages for a single webserver, of course.) Anyone developing something that
involves integrating services will, unfortunately, find this extra
frustrating.

> Enforcing immutability [...] You need to prevent log in.

No... If you need immutability, you make your system actually immutable and
then audit logins when logins are needed. Preventing logins is just punishing
people and slowing them down when they need it the most, because you don't
trust them.

> Vulnerabilities like ShellShock simply vanish

Said a person who does not understand why ShellShock was an issue and why it
doesn't matter if a user who already has local access uses shellshock.

~~~
shangxiao
> Vulnerabilities like ShellShock simply vanish

I'm actually surprised at the naivety of whoever wrote this. I'm not an op at
all and I understood enough about ShellShock to know that this isn't correct
at all.

------
georgebarnett
Removing ssh ties the hands of your operations team when an outage hits
because you have removed their debugging console.

With this change, the team loses access to all the standard OS tools which are
useful in the event of a service outage for debugging. What usually ends up
happening then is that the standard tools are replaced with substandard and
patchy implementations.

If your infrastructure is drifting then a better solution is to restart it
more regularly. If you find staff are still making untracked changes then
that's not a technology problem - it's a people problem.

~~~
StavrosK
A better alternative is to kill entire servers every once in a while and
reprovision them. That way, console access effectively becomes read-only,
since any changes will soon be lost, but you can still debug stuff.

~~~
nailer
+1. Netflix made Chaos Monkey specifically for that (not that anyone on HN
couldn't code an equivalent, just that it's a technique used at scale).

------
kazinator
_" Sign up for your Boxfuse account."_

What? And disturb the tranquil immutability of boxfuse.com?

~~~
kaonashi
It actually gets swapped with a new Boxfuse.com on write.

------
socketpuppet
We do something similar with Nix
([https://www.nixos.org](https://www.nixos.org)). We have a CI server (Hydra)
which creates a closure containing our software and all its dependencies,
which we then upload to a network of AWS machines that are created with a few
python scripts.
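
For anyone unfamiliar with that workflow, pushing a built closure out to a
machine looks roughly like this (a minimal sketch only - the host, store path
and profile name are placeholders, not our actual setup):

      # copy the closure (the build output plus all of its runtime
      # dependencies) to the target machine
      nix-copy-closure --to deploy@10.0.0.5 /nix/store/abc123-myservice

      # activate it by pointing a profile at the new store path
      ssh deploy@10.0.0.5 nix-env --profile /nix/var/nix/profiles/myservice \
          --set /nix/store/abc123-myservice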

This works fairly well, but from experience we do need to keep an SSH setup on
the machines. Just last week we had a load spike on a production server which
was caused by a software bug, triggered by a usage pattern that was unexpected
and not covered in our tests. If we did not have SSH (and access to the CLI
tools on that machine) we would not have been able to debug this unexpected
problem. I guess what I'm saying is that as long as software has bugs and
hardware has glitches, we'll sometimes need access to low-level tools which
can help us figure out the cause of these unexpected scenarios.

~~~
ris
Exactly what I was thinking as I read this. Just use nix. Forget the dogma.

------
rotten
Somewhere in the infrastructure there have to be mutable servers - databases
and log servers at the least. I can't think of many architectures where
everything can be a closed appliance.

Are they suggesting that every time you need to apply a security patch, you'll
have to build a new server and cut your traffic over to it? That sounds like a
lot of DNS and Load Balancer config updates (mutable!) even if the newly
patched server builds are 99% automated.

The idea of black-box appliances has been around a long time, and it has its
place in the modern infrastructure, but I'm not sure it really solves the
problems they are trying to solve (which sound more like change management
issues).

~~~
axelfontaine
This type of image is primarily designed for 12-factor apps where all
persistent state is kept outside the instance in some geo-redundant highly
available system like Amazon RDS or S3.

~~~
falcolas
The problem with Amazon RDS, and they freely admit it, is that it's not built
for the scale at which some companies operate. If you need advanced features
of your DB, or need to operate at large scales (the magnitude of scale differs
by RDS engine), you'll end up wanting to run your own DB on an EC2 instance.

~~~
jacques_chester
When you have enough business that RDS is too small, you are not going to be
heartbroken about paying for a bespoke setup.

Until then, ignorance is bliss.

~~~
falcolas
Well, ignorance also tends to fuel any number of flamewars...

In my experience with RDS as a MySQL admin, Amazon was not using sane defaults
for quite some time, which resulted in a remarkably non-performant product.
Thankfully, they listened to feedback, so over the course of a few years the
MySQL RDS instances started to become much more dependable and useful. I'd be
happy to use one today, which is something I couldn't have said too long ago.

And even with RDS, having a DBA available (even just as a consultant) is still
quite useful - it's unrealistic to expect your developers to write ideal SQL.
You can get a long way with RDS if your queries and schemas are well tuned.

------
grhmc
So I've considered this before, and I know it is coming from a good place. It
seems great for the good times. At Clarify.io we use immutable infrastructure
and we don't log in to boxes. Except when there are problems. And there are
problems. Always.

You don't want to have to fight your way in in the middle of a disaster.

At Clarify.io we've considered having an on-login event fire when an admin
uses SSH to log in to the box. This would schedule the server for termination
within 24 hours. This gets you the best of all worlds:

- Gives tools to debug and restore service when there is a failure

- Forces admins to not depend on SSH to bring a system up

- Replaces servers when they are potentially drifting

- Encourages a "cattle" mentality about the servers

~~~
aalbertson
I really like this idea. Would be keen to see what your orchestration code
looks like to handle that.

~~~
grhmc
We already use JanitorMonkey. It would likely involve tagging the instance in
AWS with a tainted tag, and adding code to JanitorMonkey to do the normal
mark-and-sweep.
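
For the login-hook half, a rough sketch (assuming an EC2 instance whose
instance profile is allowed to create tags; the script path and tag name are
made up, and the reaper is whatever does your mark-and-sweep):

      # /etc/profile.d/taint-on-login.sh -- hypothetical hook, sourced on login
      # Tag this instance as tainted so the reaper can terminate it during its
      # normal mark-and-sweep pass.
      INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
      aws ec2 create-tags --resources "$INSTANCE_ID" \
          --tags Key=tainted,Value="$(date -u +%Y-%m-%dT%H:%M:%SZ)"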

~~~
aalbertson
That's awesome! I don't know how I've missed JanitorMonkey amongst their
tools, but that looks great. Definitely giving it a whirl.

------
mugsie
"Vulnerabilities like ShellShock simply vanish" \- eh .... not really.

Shellshock was not just from SSH, it was any program that used environment
variables to pass data around.

Immutable infrastructure is a great ideal, but has various downsides.
sometimes people need to access the raw system to do debugging. While shoot
the node, and boot a new one has advantages, it does just move the problem
down the line.

The BBC solution is a nice balance.

------
lamontcg
This is just dumb as nails.

Now I can't log in to the production server to strace the process and figure
out why production is misbehaving.

And don't say you should always replicate the issue in preprod - sometimes you
simply can't. Some issues only emerge at prod scale+load+latency that you
don't have the financial resources or technical ability to exactly replicate
in a testbed.

You are drunk on too much koolaid, go home.

~~~
juliangregorian
That's great if strace et al. can actually tell you why production is
misbehaving. Of course, strace doesn't exist in a vacuum; it changes the way
your code executes. Everything is going to be slower, and race conditions and
locks may not present themselves so readily. Then you also have the added
stress of screwing around with a live server - what if you do something that
messes it up worse?

Sometimes it makes more sense just to kill that server and add a fresh one to
the pool. Better yet if that process is automated. Good logging can actually
be more valuable than mucking about in the live environment (or as I call it,
panic mode).

------
Sleaker
Uhh... you know how we handle drift? We manage the update repository ourselves.
New packages don't get released until we are ready to release them....

You don't need to limit ssh to handle state drift. The article's author doesn't
seem to understand the difference between user access and software deployment.
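
For illustration, gating releases through an internal repo can be as simple as
only publishing a package when you're ready - a minimal sketch assuming
reprepro and a repo rooted at /srv/apt (names are placeholders):

      # hosts only ever point at the internal repo, so nothing reaches them
      # until we explicitly add the vetted package to the "stable" distribution
      reprepro -b /srv/apt includedeb stable mypkg_1.2.3-1_amd64.deb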

------
exarch
This is what happens when dev teams hate ops teams. Troubleshooting a
production bug without a terminal is a recipe for lengthy outages.

~~~
jacques_chester
I worked on Cloud Foundry buildpacks for 7 months. I was working at the
Pivotal Labs offices in NYC and I can tell you this for free: devs want SSH
access too.

When a box dies in staging, it's nice to learn why.

------
eyko
> Trouble starts the minute you start relying on commands like this:
    
    
        sudo apt-get install mypkg
    

So what's wrong with `sudo apt-get install mypkg=<version>`?

~~~
yebyen
Do you do this every time? Must be cumbersome, there's no equivalent of
Gemfile.lock and bundler for apt-get, is there?

(Or is there? Serious question)

~~~
oblio
[https://help.ubuntu.com/community/PinningHowto](https://help.ubuntu.com/community/PinningHowto)

My guess is that apt pinning/holding predates Gemfile.lock by quite a few
years :)
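
For example, a hold or a pin looks roughly like this (package name and version
are placeholders):

      # stop apt from upgrading the package at all
      sudo apt-mark hold mypkg

      # or pin it to a version range via an apt preferences file
      cat <<'EOF' | sudo tee /etc/apt/preferences.d/mypkg
      Package: mypkg
      Pin: version 1.2.*
      Pin-Priority: 1001
      EOF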

~~~
yebyen
This and Ansible are non-answers to me.

The functionality of bundle install and bundle update is not replicated by
apt pinning or holding. How do you roll back a bad update to your pinned
packages? Put back the old Gemfile.lock and bundle (again)? Does that really
work?

I guess I don't really use Ansible so I don't know, but it's very easy to roll
back a bundle update if you keep your Gemfile.lock in revision control too.
I've used apt pinning before though, and I'm quite sure rollback of a package
update that has a dependency conflict with the old version is not as easy.

~~~
icebraining
_I've used apt pinning before though, and I'm quite sure rollback of a
package update that has a dependency conflict with the old version is not as
easy._

Aptitude has always handled that just fine for me, downgrading the
dependencies as well if needed.
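
e.g. by explicitly asking for the older versions in one go (names and versions
here are made up):

      # downgrade the package and the conflicting dependency in one transaction
      sudo aptitude install mypkg=1.2.3-1 libfoo1=0.9-2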

~~~
yebyen
You're still losing the information about what you had pinned before by
unpinning, unless you can dig it out of dpkg.log after the fact. This is
almost nothing like having Gemfile.lock and committing it to the project code
repo, saving each changeset after every "bundle update".

It is an issue with the package manager: as the python/virtualenv guy
suggested, apt can't keep parallel versions installed at the same time and
link the locked version into each separate project that needs a specific one.
The conflicting versions do conflict. This is not a problem for bundler, since
a given project is not likely to need two separate conflicting versions in a
single bundle exec.

There is no "project" concept at all in dpkg. You just have to maintain wholly
separate environments for those conflicting dependencies, on separate
machines, if they show up in cases where you really need both versions.

Or work with containers instead, which is arguably not really a different
solution than already proposed.

~~~
icebraining
_You're still losing the information about what you had pinned before by
unpinning, unless you can dig it out of dpkg.log after the fact._

Oh, that's why I mentioned Ansible in my previous comment. I have a version
controlled file with:

    
    
      - apt: pkg=lib1=11.6 state=present
      - apt: pkg=app=33 state=present
      ...
    

To roll back, I can just revert to the right commit and re-run
ansible-playbook.

As for having multiple parallel versions installed, sure you can. All you need
to do is package them using different prefixes. What you can't do is blindly
use packages from the main repos; those are built to support the OS programs,
not yours. But even those are often made to support multiple parallel
versions: I have Python 2.7 and Python 3 installed, both from Debian's main
repo.

------
stephen
Interesting; so, boxfuse is one of these ("micro" kernels? What's the name
for when you run in a hypervisor without a real OS, so no user isolation, no
process isolation, etc.) OS/image-creation tools...

I didn't realize they were production ready yet, but sounds pretty spiffy.

What's interesting is that I think Boxfuse is spinning what most people would
call a weakness ("if stuff breaks, I can't SSH into the shell and fix it -
because, not only is that a 'bad idea', there just isn't a Linux OS there to
SSH into anyway") and calling it a strength.

Nice spin. :-)

If I tried boxfuse, I'd probably look at embedding the Apache Java-based SSHD
server, not for app setup, but debugging ... assuming its terminal session
would provide useful information in the Boxfuse "there is no OS" environment.

(The boxfuse site uses the term "Secure Micro OS, Few MBs", so if I'm
misinterpreting what their platform does, someone please correct me.)

------
HugoDaniel
On *BSD kernels you can use securelevel to achieve proper immutability to some
extent (like unchangeable pf rules, read-only raw disk devices, etc...).
Disabling ssh only gives you the guarantee that ssh is going to be disabled,
and that is quite different from immutability.
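
For instance, on FreeBSD raising the securelevel is a one-liner, and on a
running system it can only ever be raised, never lowered - a minimal sketch:

      # raise the securelevel at runtime; lowering it again requires a reboot
      sysctl kern.securelevel=2

      # typically it is set at boot instead, via /etc/rc.conf:
      #   kern_securelevel_enable="YES"
      #   kern_securelevel="2"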

------
amalcon
> Vulnerabilities like ShellShock simply vanish.

What? No, they don't! ShellShock was about executing unintended code in a bash
process where you control an environment variable but not stdin. If you
control stdin, there's no need to sneak a command into an environment
variable: you can just type it into the shell.
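
The classic one-liner test makes the distinction obvious (harmless to run; a
vulnerable bash just prints an extra word):

      # a vulnerable bash executes the code smuggled into the environment
      # variable while importing it; a patched one only runs "echo test"
      env x='() { :;}; echo vulnerable' bash -c "echo test"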

Apart from this, the article makes no argument for removing SSH: it only
argues for restricting access to privileged userids. That is to say, the
standard best practice for decades. Root should be reserved for special
situations, not used by default.

------
OriPekelman
This is so the wrong question. Immutable infrastructure neither requires nor
warrants restricting the form of access you have to your services.

Having everything build-oriented (basically building the builder on every
deployment, then using it to build the service) is what prevents drift.

Running on a read-only file system is what guarantees immutability and
prevents drift.
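
As a rough sketch of what that means in practice (usually baked into the image
or /etc/fstab rather than done by hand; the device and mount point are
placeholders):

      mount -o remount,ro /              # the root filesystem stays read-only
      mount -o rw /dev/xvdf1 /srv/data   # one explicit, writable mount for state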

Not allowing SSH is plugging a single hole which guarantees nothing. Can the
app-server write on its own disk? Can it modify configuration files? Maybe
even code?

As a developer I want to be able to write applications that are not specific
to any one deployment method. I want to be able to change my mind and host
them on a different kind of infrastructure. These kinds of limits force a
tight coupling between your application, its tooling and the deployment
method.

Of course, the Read/Write domain has to be explicit and limited, but this is
good practice anyway. Running on a distributed storage grid based on something
like CEPH can allow you to address this domain not only explicitly but also in
a fault-tolerant and scalable way.

When you cut off SSH you are cutting off half of the tooling we have and
making everything overly complicated. I want to be able to tail any log on any
host. I want to be able to scp any file to my dev machine for inspection.

At platform.sh (a PaaS running on a distributed grid of micro-containers) we
run a build-oriented immutable infrastructure, so there is no infrastructure
drift - there can't be any.

We proxy SSH so that security stays the responsibility of the orchestration
layer, not any single host or service. This means we can filter any connection
to any service by role, but developers can still just use the tools that
work.

------
asymmetric
This is not even a problem if you're using Ansible (and possibly others): you
just say you want the latest version of the package installed, and Ansible
will do an `apt-get update` for you as a preliminary step, then install/update
the package if needed.

So basically, all the machines you're running Ansible on will have the latest
version (if that's what you want, of course).
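
e.g. as an ad-hoc command against a group of hosts (the group and package
names here are placeholders):

      # refresh the apt cache, then install/upgrade the package to whatever
      # is the latest version in the repos the hosts point at
      ansible webservers -b -m apt -a "name=mypkg state=latest update_cache=yes"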

------
peterwwillis
What is drift? Local edits, updates not applied, literally ntpd/ntpdate not
updating the time. Your host is misconfigured; it's broken.

How do you deal with systems breaking? By monitoring for them. It also helps
to make them 'disposable', but this is not a replacement for monitoring. Good
monitoring will tell you when time is off, when disks are about to fill up,
when permissions are bad and when packages aren't up to date.

Removing ssh does not give you monitoring. It does not prevent system state
from changing. It really doesn't do anything but remove a very secure and
simple method of communication and file transfer.

Minimal/disposable images? That's fine. It won't make your servers or services
immutable. One day your hardware's going to catch on fire and your first clue
won't be the smoke billowing into the neighboring rack, it will be the errors
in your logs.

------
tgeek
I find the better pattern here is to limit and discourage SSH, and then monitor
and log the hell out of it. There are numerous tools out there that can capture
any action taken on a host and ship it to a centralized log. Outright removing
all SSH puts you in a rough spot if things go south with some piece of software
that your system monitoring/centralized logging doesn't cover 100%, and it
makes it much harder to do things like strace a process.
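
As one concrete example, auditd can record every command run on a box, and
that log gets shipped off-host with everything else (a minimal sketch;
filtering and forwarding are left out):

      # log every execve() syscall, i.e. every command executed on the host;
      # /var/log/audit/audit.log is then shipped to the central log store
      auditctl -a always,exit -F arch=b64 -S execve -k commands
      auditctl -a always,exit -F arch=b32 -S execve -k commands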

~~~
grhmc
Or if one of the server's problems caused logging to not start correctly :)

~~~
aalbertson
Which is certainly a valid case, but I would argue that if that IS the case,
and it cannot autorecover, then that is a good candidate for the instance to
self terminate.

------
mwcampbell
A common response on this thread has been that if you take away SSH, you can't
log into individual machines to debug. But you know the other situation where
you can't do that? When your software is running on your users' machines
rather than yours, i.e. mobile and desktop apps. I agree with the position of
the OP, that it's better to make the software robust by design and through
testing.

~~~
icebraining
_it's better to make the software robust by design and through testing._

But that's a false dichotomy; having SSH in no way prevents you from designing
and testing your software just as well.

~~~
juliangregorian
I don't know, I think there's a legitimate argument to be made that some tools
can become a crutch and a detriment. At one of my consults all the developers
lean heavily on IDE debugging at the expense of developing well-defined
contracts and interfaces. Nobody blinks twice at having a ton of threads
mutating global variables because they have all these great debugging and
tracing tools.

------
njharman
I ssh into machines for many reasons other than to change them. In fact most
changes are done remotely with Ansible.

------
siliconc0w
I'm a fan of immutable infrastructure but there is still a need to get access
to a container now and again to troubleshoot things. We don't run SSH inside
the containers but it's easy enough to write some tooling to find a host
running an instance and use docker exec to get a shell on it.
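
Once the tooling has located the host, it boils down to something like this
(the host and container names are placeholders):

      # get an interactive shell inside the running container, via its host
      ssh -t ops@docker-host-07 docker exec -it web-42 /bin/sh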

------
therealmarv
What about logs? What about looking into the server to find out what's going
on?

~~~
axelfontaine
Application logs should be shipped over the network to either your own ELK
infrastructure or some hosted solution like Loggly, Logentries or Papertrail.
Boot logs can be obtained from your infrastructure (like EC2 instance logs).
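
For example, the EC2 console output (which includes the boot log) can be
pulled without touching the instance at all (the instance id is a
placeholder):

      # fetch the instance's console/boot log straight from the EC2 API
      aws ec2 get-console-output --instance-id i-12345678 --output text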

~~~
spacecowboy_lon
And when that stops working for some reason? Of course, the real solution is
to have out-of-band management.

~~~
liveoneggs
the network is 100% reliable, always. The internet, doubly so (it's so big,
you know). Maybe you missed the memo.

~~~
spacecowboy_lon
Well as a "circuit switched bigot" Vint must have left me off the mailing list
:-)

I used to to do OSI international interconnect support an testing for my sins
many moons ago.

------
dschiptsov
May I suggest another "innovation" - statically link everything into a single
blob and run it under systemd.

The building scripts should be written in JavaScript, of course. Package it as
a startup. Become millionaire.

------
huslage
Docker?

~~~
bojo
Still requires an underlying OS which may need to be patched or upgraded.

