
FreeBSD/EC2 on C5 instances - dantiberian
http://www.daemonology.net/blog/2017-11-17-FreeBSD-EC2-C5-instances.html
======
STRML
While the new C5 instances are certainly welcome - I've been hoping for their
release since their announcement in November _2016_ (and they were already
late for Skylake at that point) - we have encountered a number of show-
stopping problems that point to this project being just a bit too ambitious.

To name a few:

1. EBS volumes attached to C5 instances show completely bogus CloudWatch
metrics, over an order of magnitude higher than reality (e.g. average
read/write latency prints at 100-60,000ms depending on load).

2. C5 instances don't work - at all - behind an NLB when the Target Group
registers them as "instance" targets. You have to put the Target Group in
"IP" mode (see the CLI sketch after this list).

3. OpsWorks, as always, lags way behind the rest of AWS: you can't launch C5
instances from it. The same was true of R4 instances for a while, but you
could at least change an instance to R4 via the API. Not so with C5; unless
you want OpsWorks to lose track of the instance type completely, you just
have to abstain for now.

3a. As a result, we have to run R4 instances for some of our web tier -
despite not needing the memory - because they have the highest network
allocation. To make matters worse, AWS won't tell you the network allocation.
You don't know until you start dropping packets.

4. ZFS on C5 instances can behave strangely. We've been unable to resize
drives (zpool online -e <pool> <drive>) when they're identified by ID ("unable
to read disk capacity"); see the sketch below. Moving the instance back to any
other type fixes the issue.
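
For reference, the failing step in point 4 looks roughly like this; the pool
name and diskid path are hypothetical, and the error text is what we see on C5:

    # After growing the EBS volume, expand the vdev onto the new capacity.
    # On a C5 instance this fails when the vdev is referenced by its diskid path:
    zpool online -e tank diskid/DISK-vol0123456789abcdef
    # => "unable to read disk capacity"
    # The same command succeeds after moving the instance to any other type.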
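
For the NLB issue in point 2, here is a minimal sketch of the "IP" mode
workaround using the AWS CLI; the target group name, VPC ID, and addresses
are made up:

    # Create the Target Group in "ip" mode instead of the default "instance" mode.
    aws elbv2 create-target-group \
        --name web-c5 \
        --protocol TCP --port 443 \
        --vpc-id vpc-0123456789abcdef0 \
        --target-type ip

    # Register the C5 instances by private IP rather than by instance ID.
    aws elbv2 register-targets \
        --target-group-arn <target-group-arn> \
        --targets Id=10.0.1.25,Port=443 Id=10.0.2.37,Port=443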

As always, you expect a couple of quirks with a new architecture, but I found
myself wishing they had just stuck a new board with a Skylake chip into a rack
and launched it.

Compared to this, GCE has a far more attractive offering: even ignoring all
these issues, we simply can't get the instance size we want on EC2 (a few fast
CPUs + lots of memory); it just doesn't exist.

~~~
toast0
> 3a. As a result, we have to run R4 instances for some of our web tier -
> despite not needing the memory - because they have the highest network
> allocation. To make matters worse, AWS won't tell you the network
> allocation. You don't know until you start dropping packets.

Ugh, I hate that. They also have hidden limits on the number of incoming TCP
connections you can have.

~~~
STRML
Worse still, we spent weeks on the phone with AWS, who kept insisting there
was no throttling - only to find out there was. And it would have been
discoverable if a single engineer had looked at CloudWatch.

~~~
otterley
Do you have Enterprise Support? If so, this is a very surprising result.

~~~
STRML
Just Business Support; we're not that large a shop.

------
kijiki
> YOU'RE BUILDING A HARDWARE FRONT-END TO EBS? You guys are insane!

It seems more likely that they've put in a software device model of NVMe as a
replacement for the BlkBack software device model that the BlkFront driver
talked to. Not much different from the software e1000 NIC that Xen/QEMU
already supports.

That said, with their virtualizable Annapurna wonder-NIC, they could be doing
it in "hardware", though even in that case a reasonable part of the device
model would be software, just running on an NPU, not a CPU.

Hopefully Amazon will disclose more details.

~~~
cperciva
There's absolutely no way that they would get the performance I'm seeing from
an emulated disk. We're talking to real hardware, exposed via PCI passthrough.

Now, exactly what form that hardware takes is an open question. I would assume
it's something like "NVMe interface hardware" + "ARM CPU which implements the
EBS protocol" + "25 GbE PHY", but that guess is based solely on "that's how I
would design it".

~~~
jsolson
It's fun to speculate about how other clouds do things :)

> There's absolutely no way that they would get the performance I'm seeing
> from an emulated disk. We're talking to real hardware, exposed via PCI
> passthrough.

There's a wide spectrum between "emulated" and "real hardware, exposed via PCI
passthrough". Passing through to PCI hardware, in and of itself, gains you
very little in terms of absolute guest-visible performance versus eliding all
VMEXITs via other means, but it has other important characteristics that I
suspect AWS very much wants in the c5 family.

> I would assume it's something like "NVMe interface hardware" + "ARM CPU
> which implements the EBS protocol" + "25 GbE PHY", but that guess is based
> solely on "that's how I would design it".

I would expect something along these lines, although I'd be a little surprised
if they bothered putting the NVMe bits down in silicon. My personal guess
would be a good silicon DMA engine and PCIe interface married to sufficient
general-purpose processing (ARM SoC, FPGA, etc.) to keep pace with the NVMe
Command/Completion queues.

Regardless of how they've implemented it, the end-to-end result seems to hang
together very nicely. Kudos to the team at AWS.

(note: I work on Google Compute Engine's hypervisor; my speculation about AWS
really is speculation about how they'd do this — the two companies have very
different engineering approaches, so I may be _entirely_ wrong trying to
project onto theirs — like I said at the top, it's fun to try :)

~~~
kijiki
> I would expect something along these lines, although I'd be a little
> surprised if they bothered putting the NVMe bits down in silicon.

You and Colin both know they bought Annapurna Labs, right?

We don't have to speculate _that_ much about what is probably going on here...

~~~
jsolson
I do, hence my speculation about putting the NVMe in firmware instead of
hardware :)

(Colin's speculation in a peer reply is also reasonable -- personally I've
seen enough errata in "off the shelf" IP to shudder at the idea of anything in
silicon that doesn't have to be, but fundamentally I'm a SWE, so that _would_
be my take, wouldn't it)

~~~
cperciva
The way I see it, everything has errata... but if you're taking something off
the shelf, it's more likely that someone else already found them. :-)

------
X86BSD
Colin is one of the many smart wizards we are lucky to have in FreeBSD land.
Well done Colin!

~~~
cperciva
I can't claim much credit here. I haven't made anything work; all I did was
figure out what didn't work and let the right people know.

~~~
ksec
That is basically what a Project Manager does, and from my perspective it is a
lot more valuable then what can been seen form both inside and outside.

~~~
cperciva
Fair enough. Maybe I should say that I didn't demonstrate any special talents
here. What I did could have been done by anyone in the FreeBSD project, but I
ended up managing the FreeBSD/EC2 platform by accident and now I'm the obvious
person to keep on managing it.

~~~
ksec
> Special talents

Speaking from my (limited) project management experience, I found that
distributing tasks to the right people is actually a talent in itself.

Note: I was actually surprised by how many people could not do it when I was
trying to transition away to another project.

------
tedunangst
What kind of device does FreeBSD hang off NVMe? Is it da or something not CAM?
Haven't really been paying attention.

~~~
cperciva
It's nvme(4). The GEOM disks (which are what you want to use) show up as
/dev/nvd#.
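
For example, on a C5 instance with one EBS data volume attached it looks
something like this (hypothetical output; device numbers will vary):

    # The controllers are driven by nvme(4):
    nvmecontrol devlist
    #  nvme0: Amazon Elastic Block Store
    #  nvme1: Amazon Elastic Block Store
    # and the GEOM disks you actually put filesystems/pools on are the nvd devices:
    ls /dev/nvd*
    #  /dev/nvd0   /dev/nvd1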

