
Critical Xen bug in PV memory virtualization code - tshtf
https://raw.githubusercontent.com/QubesOS/qubes-secpack/master/QSBs/qsb-022-2015.txt
======
bluedino
What's the quote from Theo again?

 _You are absolutely deluded, if not stupid, if you think that a worldwide
collection of software engineers who can't write operating systems or
applications without security holes, can then turn around and suddenly write
virtualization layers without security holes._

~~~
makomk
The Qubes commentary on this vulnerability is unintentionally amusing too, it
basically boils down to them complaining that the Xen developers failed to
spot it for seven years even though there's no obvious way to spot it.

~~~
three14
They're arguing that the Xen developers shouldn't code in a way that makes it
so hard to spot, not that they should have spotted it as is.

~~~
devit
Yes, the problem is that this terrible code copies the untrusted "nl2e"
variable from the VM directly into the extremely critical hardware page table,
guarded only by a broken, unclearly written check.

Instead, "nl2e" should have a data type that prevents such a direct copy and
only allows single bits to be tested.

The code should then copy bits one by one (for the bits where that is
appropriate), with a comment on each bit stating why it is safe to copy.

The fact that this is not the case means that none of the other code in Xen
can really be trusted to be bug-free, and there is probably no way to fix that
without starting over or doing an equivalent amount of rewriting work.
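
A minimal sketch of that idea, in Python for brevity. This is not Xen's code: `UntrustedL2Entry`, `build_hardware_entry` and the per-bit policy are hypothetical names and choices made up for illustration; only the flag bit positions are the architectural x86 ones. Python cannot truly hide the raw word the way an opaque C struct could, so treat the wrapper as a convention rather than an enforcement mechanism.

```python
# x86 page-table entry flag bits (architectural positions).
PAGE_PRESENT = 1 << 0
PAGE_RW = 1 << 1
PAGE_USER = 1 << 2
PAGE_PSE = 1 << 7  # "page size": marks a superpage mapping at level 2


class UntrustedL2Entry:
    """Guest-supplied entry. Offers single-bit tests only, no raw accessor."""

    __slots__ = ("_raw",)

    def __init__(self, raw: int) -> None:
        self._raw = raw

    def has(self, flag: int) -> bool:
        # Reject zero and multi-bit masks so each bit needs its own decision.
        if flag == 0 or flag & (flag - 1):
            raise ValueError("exactly one flag bit may be tested at a time")
        return bool(self._raw & flag)


def build_hardware_entry(guest: UntrustedL2Entry, validated_frame: int) -> int:
    """Assemble the hardware entry bit by bit (toy policy, assumed for illustration).

    validated_frame is a frame number the hypervisor has already checked itself.
    """
    entry = validated_frame << 12
    # PRESENT: copied in this toy policy; an absent entry maps nothing.
    if guest.has(PAGE_PRESENT):
        entry |= PAGE_PRESENT
    # USER: copied in this toy policy; it does not grant write access.
    if guest.has(PAGE_USER):
        entry |= PAGE_USER
    # RW: never copied here; write access would need its own validated path.
    # PSE: never copied here; a guest must not create superpages by itself.
    return entry
```

With this shape, a reviewer can audit the policy by reading one function, and a bit that nobody thought about is dropped by default instead of being copied by default.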

~~~
sn
Can you name one substantial software project with more than 10 developers
that you expect to be bug free?

~~~
ploxiln
Space Shuttle embedded controllers firmware

... I'll say a bit more ... I think it's sad that it's not practical anymore
to write really high quality code. And most computer security researchers
aren't interested in that anymore either, because it's not possible to force
all other developers to write high quality code (and if you do you're called
"mean" "alienating" etc.) So all they/we try to do is find new layers to
contain all the bad code that has been and will be written.

Can we really be surprised when those layers, particularly if they are popular
because they came out first and have lots of features, are also low-quality
code? Is there anywhere a foot can be put down?

Also, OpenBSD is another good example of a project whose code has a very,
very low bug density.

~~~
jperras
> Space Shuttle embedded controllers firmware

Not the space shuttle, but if you think things that go to space (and land on
the goddamned Moon) are free of bugs then you are very sorely mistaken.

[https://www.netjeff.com/humor/item.cgi?file=ApolloComputer](https://www.netjeff.com/humor/item.cgi?file=ApolloComputer)

~~~
vacri
At $job-2 doing agricultural telemetry, our firmware engineer found that one
ISP's satellite was randomly dropping the last character in routed traffic.
Took a little while for them to believe us, and that satellite had been up
there for 30 years...

------
tetrep
As the title doesn't indicate, this is a pretty big deal. Full host compromise
from a guest that always works and doesn't leave any traces. Amazon is
claiming it doesn't affect them[0], but they don't give any details as to why.

[0]: [https://aws.amazon.com/security/security-bulletins/XSAsecurityadvisories-October/](https://aws.amazon.com/security/security-bulletins/XSAsecurityadvisories-October/)

~~~
walterbell
Cloud providers typically have early access to security fixes and would be
patched by now. Amazon has the ability to hot-patch Xen, a feature which is
now being developed for general use.

This guest escape vulnerability
([http://xenbits.xen.org/xsa/advisory-148.html](http://xenbits.xen.org/xsa/advisory-148.html))
has been present since 2008 and was found by Alibaba, which recently joined
the Xen advisory board:
[http://www.xenproject.org/about/in-the-news/192-xen-project-announces-alibaba-joins-advisory-board.html](http://www.xenproject.org/about/in-the-news/192-xen-project-announces-alibaba-joins-advisory-board.html)

~~~
zurn
In that case Amazon's wording would be pretty disingenuous - "AWS customers'
data and instances are not affected by these issues" when the reality is more
like "as far as we know nobody exploited this before we patched"?

~~~
jerdfelt
It depends on the version of Xen they are running. XSA-148 only affects Xen
3.4 and newer. If AWS is running an older version of Xen, they may never have
been affected.

~~~
rodgerd
Since Amazon don't tell anyone about what they run, and don't release back to
the broader Xen community, I guess we'll never know.

------
walterbell

      | The code to validate level 2 page table entries is bypassed when
      | certain conditions are satisfied.  This means that a PV guest can
      | create writeable mappings using super page mappings.
      | 
      | Such writeable mappings can violate Xen intended invariants for pages
      | which Xen is supposed to keep read-only.
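
One way a "bypassed when certain conditions are satisfied" bug can arise is a fast path that skips validation when an update looks like it changes nothing important. The sketch below is a toy model of that pattern, not Xen's code: `changed`, `update_buggy`, `update_fixed` and `validate` are hypothetical names, and the validation rule is deliberately simplistic. Only the flag bit positions are the architectural x86 ones.

```python
PAGE_PRESENT = 1 << 0
PAGE_RW = 1 << 1
PAGE_PSE = 1 << 7                    # superpage flag at level 2
ADDR_MASK = 0x000F_FFFF_FFFF_F000    # frame-address bits of a 64-bit entry


def changed(old: int, new: int, flags: int) -> bool:
    """True if the frame address or any bit in `flags` differs."""
    return bool((old ^ new) & (ADDR_MASK | flags))


def validate(entry: int) -> bool:
    """Toy rule: a guest may never install a writable superpage."""
    return not (entry & PAGE_PSE and entry & PAGE_RW)


def update_buggy(old: int, new: int) -> int:
    # Fast path compares only the address and PRESENT, so flipping
    # RW or PSE on an existing entry is installed without validation.
    if not changed(old, new, PAGE_PRESENT):
        return new
    return new if validate(new) else old


def update_fixed(old: int, new: int) -> int:
    # Fast path also compares RW and PSE, so those changes get validated.
    if not changed(old, new, PAGE_PRESENT | PAGE_RW | PAGE_PSE):
        return new
    return new if validate(new) else old


old = 0x5000 | PAGE_PRESENT
new = old | PAGE_RW | PAGE_PSE
print(hex(update_buggy(old, new)), hex(update_fixed(old, new)))
```

Neither function contains an obviously wrong line; the flaw is in which bits the shortcut treats as irrelevant, which matches the Qubes remark that there is no buggy code that could be spotted immediately.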
    

Xen is used by security-focused Qubes, which published an analysis of XSA-148,
[https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-022-2015.txt](https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-022-2015.txt):

 _" The above is a political way of stating the bug is a very critical one.
Probably the worst we have seen affecting the Xen hypervisor, ever. Sadly.

Admittedly this is subtle bug, because there is no buggy code that could be
spotted immediately ... On the other hand, it is really shocking that such a
bug has been lurking in the core of the hypervisor for so many years. In our
opinion the Xen project should rethink their coding guidelines and try to come
up with practices and perhaps additional mechanisms that would not let similar
flaws to plague the hypervisor ever again (assert-like mechanisms perhaps?).
Otherwise the whole project makes no sense, at least to those who would like
to use Xen for security-sensitive work.

Specifically, it worries us that, in the last 7 years (i.e. all the time when
the bug was sitting there having a good time) so much engineering and
development effort has been put into adding all sorts of new features and
whatnots, yet no serious effort to improve Xen security effectively. Because
there have been, of course, many more security bugs found in Xen over the last
years (as the numbering of this XSA suggests)... the bugs in Xen are being
found regularly, and this is no good news. For a type-1 hypervisor of the age
and maturity of Xen, this simply should not be happening. If it does, it
suggests the development process is not prioritizing security."_

------
eli
Linode apparently was affected but got advance notice and patched all servers
over the past week.

~~~
dangrossman
Same for Rackspace, IBM/Softlayer, and everyone else with a large cloud
running Xen.

Linode was actually my failover target for some of those clouds to weather the
reboots. They offer both Xen and KVM based instances. If you're still running
Xen, you can live migrate your instance to KVM and make all your new Linodes
default to KVM via a toggle in your account settings.

~~~
upbeatlinux
Thanks - this is good info to have for personal stuff.

------
bgirard
I was wondering what it was. That explains the 'Critical Xen Maintenance'
reboot ticket I got from Linode 11 days ago.

------
notabot
This is not the first time they have ranted about this sort of thing. Of
course, if they can't stand the code quality of Xen, they are always welcome to
switch to KVM, VirtualBox, bhyve or whatever open source hypervisor they think
has the best code quality and security practices.

I know it is within their rights to write things as they please. But seriously,
ranting like this is not very constructive and doesn't move things forward. If
Qubes thinks the practices in the Xen community are bad, why don't they start a
conversation on the Xen development mailing list?

(edited: typo)

------
rcconf
Does this mean it's a good idea to reset all keys/passwords for any service
that was using Xen?

I don't know much about Xen, so perhaps someone else can chime in on this.
This exploit says it's for PV, and not for HVM.

But Xen can apparently run both at the same time. Does this mean even if you
were using HVM, and Xen had other guests using PV, you were still exploitable?

~~~
devit
Anyone who could create Xen PV VMs could trivially take control of the whole
machine including of course all your VMs regardless of type.

------
arielby
> This bug might also be considered an argument for the view of ditching of
> para-virtualized (PV) VMs, and switch to HVMs

It's not like Xen HVMs have a better security story than PVMs. The
paravirtualization code should probably be more heavily audited than it is
already.

------
upbeatlinux
Rackspace has been patching their Xen stuff for a little over a week now.

It's nice having a multi-region / multi-provider setup otherwise the typical
Rackspace reboot window (much like UPS deliveries or the cable guy) would've
been a pain in the ass.

~~~
gibsonje
We deploy to a single region. This maintenance was a major pain in the ass. We
got a notice 72 hours ahead of time (only 1 business day ahead of time) that
around 60% of all of our hosts failed live migration and would be rebooted.

~~~
upbeatlinux
@gibsonje - similar experience here, except Rackspace gave notice about a week
ahead. On the day the maintenance was scheduled, it was postponed without
notice. Then the reboots happened across 3 days as opposed to 1 day. Glad they
are patching but that's not really fanatical support.

------
_delirium
prgmr.com has a rundown on what affected them:
[http://blog.prgmr.com/operations/2015/10/29/recent-xsas.html](http://blog.prgmr.com/operations/2015/10/29/recent-xsas.html)

------
PaulHoule
I think hardware implementation can have security bugs too and it's a lot
harder to upgrade the hardware than the software.

------
devit
The existence of such a catastrophic bug shows that Xen is unfortunately not
suitable for use as a hypervisor in a secure system and needs to be replaced.

We really need a properly written open source hypervisor: entities using
secure hypervisors commercially like Amazon AWS should fund the development of
one.

~~~
toomuchtodo
Is KVM not a "properly written open source hypervisor"?

~~~
lsc
just in October:

[https://rhn.redhat.com/errata/RHSA-2015-1896.html](https://rhn.redhat.com/errata/RHSA-2015-1896.html)

[http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3456](http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3456)

I could go on and on and on... but the point is that kvm (and xen HVM mode)
may have performance advantages in some situations, but xen pv mode has
consistently had fewer security vulnerabilities, especially if you use
something like pv-grub to load untrusted guest kernels (rather than loading
those kernels directly in the dom0)

The biggest problem is the qemu drivers. Honestly, I don't know enough about
KVM to know if it is possible, but if you could completely remove the device
emulation and force the guest to only use paravirtualized drivers, your
security under KVM (or Xen HVM) would be much better, mostly because you'd
vastly reduce the amount of hypervisor code the guest interacts with.

But the point here is that all systems have problems, and the xen pv mode has
had fewer security holes found in it than most other hypervisors, mostly
because there's a lot less code that the guest interfaces with.

~~~
cthalupa
>may have performance advantages in some situations

s/may/certainly s/some/almost all

It's not a small gap, either. You are basically doubling the number of context
switches required for system calls with PV, because x86_64 long mode dropped
the segment limits that let a 32-bit PV kernel sit in ring 1, so a 64-bit PV
kernel shares ring 3 with its userspace and the separation has to happen in
software. You lose out on EPT, so your page table performance suffers in
almost all workloads. You cannot take advantage of SR-IOV for NICs or NVMe.

PVH might be an answer someday, but for now, there is a pretty massive
performance loss. Context switches are important. Memory latency is important.
(Nearly) direct access to the hardware is important.

~~~
lsc
I'm not saying you are wrong about performance (other than to point out that
actually using the qemu devices is even slower.) Still, if you go through the
xsa list, if I was running HVM, I'd have had to reboot pretty often. It's a
huge firedrill every time the security list sends something out, for those of
us who use (normally lower-stress) local disk. Twice in one year is bad
enough.

~~~
cthalupa
Well, yeah, software-emulated hardware is slow, but no one should be using
emulated PCI devices on anything. There are PV or, better yet, SR-IOV drivers
for those, and they will get you the closest to bare-metal performance you'll
see in a VM.

