

Upcoming AWS Security Maintenance - mattybrennan
https://aws.amazon.com/premiumsupport/maintenance-2015-03/

======
josh2600
If you use Terminal on top of AWS (one deployment option), we can just migrate
your workloads without rebooting.

The way it works is that you copy the RAM pages from one machine to another in
real time, and when the RAM is almost synchronized you slam the IP address
over to the new box (then you let Amazon reboot your old box, and you can
migrate back post-upgrade if you want to).
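The iterative pre-copy scheme described above can be sketched in a few lines. This is a toy model, not Terminal's actual code; the `Machine` class, page tracking, and threshold are all invented for illustration:

```python
class Machine:
    """Toy stand-in for a host's RAM (hypothetical, for illustration)."""

    def __init__(self, pages=None):
        self.pages = dict(pages or {})  # page number -> contents
        self.dirty = set(self.pages)    # pages not yet synced to the destination
        self.running = True

    def touch(self, page, value):
        # Simulate the workload writing to RAM while migration is in flight.
        self.pages[page] = value
        self.dirty.add(page)


def live_migrate(src, dst, threshold=2, workload=None):
    # Phase 1: copy pages while the workload keeps running; anything the
    # workload dirties during a pass gets re-sent on the next pass.
    while len(src.dirty) > threshold:
        batch, src.dirty = src.dirty, set()
        for p in batch:
            dst.pages[p] = src.pages[p]
        if workload:
            workload(src)  # the guest is still live and dirtying pages
    # Phase 2: brief stop-the-world window for the last few pages,
    # then "slam the IP address over" to the new box.
    src.running = False
    for p in src.dirty:
        dst.pages[p] = src.pages[p]
    src.dirty = set()
    dst.running = True


src = Machine({i: i * 10 for i in range(8)})
dst = Machine()
live_migrate(src, dst, workload=lambda m: m.touch(0, 999))
print(dst.pages == src.pages)  # True: destination RAM is identical
```

Real implementations bound the number of pre-copy passes (a workload that dirties pages faster than the network can copy them would never converge) and also carry over CPU and connection state, which this sketch ignores.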

You can try it out on our public cloud at terminal.com if you'd like (on our
public cloud we auto-migrate all of our customers off the degraded hardware
before it reboots, but you can control that yourself if you're running
Terminal as your infrastructure).

~~~
geofft
... how?? That is seriously nifty.

Are you migrating just a process tree / other contained environment, or the
entire machine?

Are you using CRIU or similar? Do open TCP connections survive the transfer?

~~~
josh2600
We wrote a bunch of hacks to the Linux kernel to do it.

Custom container implementation, custom networking, custom storage.

It's just really good hardcore kernel engineering.

If you wanna talk more and you're in SF, come to our meetup on the 10th:
machinelearningsf.eventbrite.com.

Edit: the whole machine, including RAM cache, CPU state, IP connections,
etc., is carried over. We can also resize your machine in seconds while it's
running.

~~~
stephenr
Is this somehow different from Xen Live Migration/VMware vMotion/etc.?

~~~
josh2600
Yes. VMware vMotion and Xen Live Migration are both VM migration tools; they
migrate VMs, not containers.

The difference is subtle, but important. VMs have overhead because they
virtualize the kernel; containers don't (or rather, containers benefit from
kernel performance much more than VMs do).

In other words, you can achieve the same thing with vMotion, but it's slower,
has more overhead, and is harder to manage.

~~~
stephenr
Ah, I didn't even know you were a container-based shop. So you're moving live
containers between AWS-provided Xen VMs.

~~~
josh2600
Or on bare metal. It works just the same on Xen, only slower because of the VM
overhead.

Containers are really only part of the solution; there are a lot of other
things you have to think about if you want to build a better mousetrap in the
virtualization world (like networking and storage).

------
elmin
It's a bit odd that they don't stop launching new VMs on the old hardware.
That would allow people who wanted to control the transition to just stop and
start their VMs.

~~~
skywhopper
This was my reaction, too. Thinking it over some, I think it's likely that
certain tiers of instance types are more affected than others. And it's also
likely that even though AWS seems to have lots of open capacity available,
they probably are operating at a pretty high percentage, so gradually idling
10% of their host hardware would probably put a pinch on availability. The
average lifetime of instances may also be a factor. I have a few long-running
instances going most of the time, but I do tend to start and stop dozens of
instances within minutes, depending on what I'm working on. So for those use
cases, it hardly matters.

And of course, Amazon has an interest in encouraging its users to build their
systems in a cloud-friendly way; i.e., properly designed services on AWS should
not suffer from having a handful of VMs get rebooted at any time. So, from
that POV, it's just good medicine to encourage the culture they built their
service to accommodate.

~~~
michaelt

      they probably are operating at a pretty high percentage,
      so gradually idling 10% of their host hardware would
      probably put a pinch on availability

If they can withstand the loss of one out of three availability zones, they
must have at least 33% spare capacity in each AZ, ready for new instances to
start to take up the slack from the lost AZ.

Even if they have more availability zones than the three they show to users, I
would hope they would have more than 10% spare capacity!
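The 33% figure above can be checked with exact arithmetic. A minimal sketch,
assuming three equally loaded AZs whose load redistributes evenly to the
survivors when one fails:

```python
from fractions import Fraction

az_count = 3
capacity = Fraction(1)  # normalize each AZ's capacity to 1

# Highest per-AZ load that still survives losing one AZ:
max_load = capacity * (az_count - 1) / az_count  # 2/3 of capacity

# If one AZ dies, its load is split evenly across the survivors:
load_after_failure = max_load + max_load / (az_count - 1)

print(load_after_failure == capacity)  # True: survivors run exactly at capacity
print(capacity - max_load)             # 1/3 -> at least 33% spare per AZ
```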

~~~
Xorlev
They do have more than 3 AZs. Our account has access to 5 distinct zones and I
suspect there are more than that.

~~~
discodave
That depends on which region you're in. us-east has like 5 AZs but some of the
small regions like Sydney have only two.

~~~
vacri
All regions have a minimum of 3 zones, but the client doesn't always see all
of them. Sydney has 3 AZs, but you only get to see 2, for example.

~~~
count
Not all regions (see US GovCloud for example).

------
zytek
Been there, done that. AWS re:Boot in September 2014 showed us how good it was
to invest in Ansible roles for all parts of our infrastructure. Still, a lot
of hassle for the Ops team, especially as it was done during DevOps Days Warsaw
;-) AWS also said "10%" then, but for us it was 81 out of ~300 instances.

What is sad is that we learned about it from Hacker News and not from AWS, even
though we have premium support and our own account manager. :/

Let's see how many of us did our homework after the previous "xen update", and
how big "10%" is this time ;-)

~~~
soccerdave
I have 19 instances (18 in US-West-2) and none of them are affected. I would
guess that lots of people here run in us-east-1 since that's the longest
running region and I would bet that a lot of that 10% exists there. So, it may
be 10% total in all regions but higher percentage if you run in us-east-1.
Just a guess though.

~~~
pkapkg
44% of my instances in us-west-2 are affected, 55% of my instances in
us-east-1, and 18% of my instances in eu-west-1. It seems to be tied pretty
tightly to instance types.

Overall, I'm looking at a huge quantity of affected servers. That said, I
don't blame AWS. I blame my incompetent architect for designing systems that
are incredibly hard to upgrade, and that can't be rebooted safely. Definitely
not bitter at that idiocy at all.

~~~
andrioni
Yeah, I'd guess it depends mainly on what type of instances you run. Only one
of the 28 instances I'm running right now (all in us-east-1) is going to be
rebooted, and it is the only old generation instance I still run (it's a
hi1.4xlarge). None of my M3s, C3s or R3s are affected, even though some are
still on PV.

------
hendersoon
Linode forced a reboot for us last night also. They did not disclose why, for
some reason, even though I pointedly asked. Downtime was ~20 minutes.

This must be some seriously bad mojo, to force reboots with little to no
notice over a week before the vulns are scheduled to leave embargo.

~~~
VonGuard
Yup: [http://xenbits.xen.org/xsa/](http://xenbits.xen.org/xsa/)

5 undisclosed Xen vulns. Wheeeeee!

------
WestCoastJustin
Related: Five new undisclosed Xen vulnerabilities (xen.org)
[https://news.ycombinator.com/item?id=9116937](https://news.ycombinator.com/item?id=9116937)

------
jamescun
We contacted SoftLayer about this issue; they literally had not heard anything
about it and said they would "contact their datacenter team".

If they treat it like the last round of Xen vulnerabilities, they will simply
place a warning on their dashboard an hour beforehand without sending out any
form of email notice. The first we knew about it was when we started receiving
alerts from Nagios.

~~~
blacksmith_tb
Sigh. I opened a ticket with SoftLayer regarding it, too, and got pretty much
exactly the same response: nothing is scheduled 'at this time' but they will
'let us know' if they need to reboot any of the hosts we're on. Joy.

~~~
iancarroll
I just got notified that they'll be sending times for the reboots soon (if
needed).

------
ericcholis
Rackspace notice regarding the same patch:

[https://community.rackspace.com/general/f/53/t/4978](https://community.rackspace.com/general/f/53/t/4978)

I wasn't able to find anything on Digital Ocean's public facing websites.

~~~
akerl_
DigitalOcean uses KVM, I thought? Assuming that's true, they're almost
certainly not affected.

If they are using Xen, they shouldn't know the details of the vuln yet, as
they aren't on the pre-disclosure list:

[http://www.xenproject.org/security-policy.html](http://www.xenproject.org/security-policy.html)

~~~
infamouscow
DigitalOcean uses KVM.

------
edibleEnergy
They've updated the announcement; most of the restarts have been cancelled
because they were able to patch the machines without reboots.

------
mrsirduke
I think it will be interesting to see how other providers handle this.

------
alimoeeny
Does anybody know what this 10% means? I mean:

a) only 10% of the fleet are running a version of the hypervisor that is
affected by the bug

b) based on the turnover rate, they expect 10% of customer instances to still
need rebooting by the date the bugs are released.

c) 10% are running a combination of the affected hypervisor and VMs that are
reasonably at risk of exploitation; others may have the faulty hypervisor but
are either single-tenant (so there is no risk of someone breaking out and
affecting someone else) or are running VMs that may not be able to break out,
depending on the nature of the bugs.

Just speculating, any ideas?

~~~
geofft
In the past, Xen has had vulnerabilities based on things that differ between
Intel and AMD processors, or even between different processors from the same
company. It seems likely that the fleet is all running the same version of the
hypervisor, but the bug only matters on 10% of their hardware.

Here's a previous Xen vulnerability based on Intel implementing the SYSRET
instruction (originally introduced by AMD, along with SYSCALL; Intel's version
of this was SYSENTER and SYSEXIT, with different semantics about kernel stacks
and things) in a slightly different way from how AMD implemented it. Both
Intel's docs and AMD's docs were accurate for their own processors, but if you
only read AMD's docs, you'd implement syscalls in a way that was vulnerable on
Intel.

[https://blog.xenproject.org/2012/06/13/the-intel-sysret-privilege-escalation/](https://blog.xenproject.org/2012/06/13/the-intel-sysret-privilege-escalation/)

~~~
cortesoft
In this case, as is explained in the post, the reason it is only 10% is that
the newer hardware can be patched without requiring a reboot.

------
teh
Does anyone know what this means for spot instances?

------
admbk
Wouldn't using kpatch remove the need to reboot instances?

------
thebouv
Rackspace is doing the same due to the Xen vulns announced.

