FreeBSD/EC2 on C5 instances (daemonology.net)
162 points by dantiberian on Nov 17, 2017 | hide | past | web | favorite | 44 comments



While the new C5 instances are certainly welcome - I've been hoping for their release since their announcement in November 2016 (and they were already late for Skylake at that point) - we have encountered a number of show-stopping problems that point to this project being just a bit too ambitious.

To name a few:

1. EBS volumes attached to C5 instances show completely bogus CloudWatch metrics, over an order of magnitude higher than reality (e.g. average read/write latency prints at 100-60,000ms depending on load)

2. C5 instances don't work - at all - behind an NLB with a Target Group pointing to it as an "instance". You have to put it in "IP" mode.

3. OpsWorks, as always, lags way behind other AWS offerings. You can't launch C5 instances. This was true even of R4 instances for a while, but you could at least change to them via the API. Not so with C5 instances; unless you want to lose track of their type completely, you just have to abstain for now.

3a. As a result, we have to run R4 instances for some of our web tier - despite not needing the memory - because they have the highest network allocation. To make matters worse, AWS won't tell you the network allocation. You don't know until you start dropping packets.

4. ZFS on C5 instances can behave strangely. We've been unable to resize drives (zpool online -e <pool> <drive>) if they're identified by ID ("unable to read disk capacity"). Moving the instance back to any other type fixes the issue.

As always, you expect a couple of quirks with a new architecture, but I found myself wishing they had just stuck a new board with a Skylake chip into a rack and launched it.

Compared to this, GCE has a far more attractive offering: even ignoring all these issues, we simply can't get the instance size we want on AWS (a few fast CPUs + lots of memory). It just doesn't exist.


> 3a. As a result, we have to run R4 instances for some of our web tier - despite not needing the memory - because they have the highest network allocation. To make matters worse, AWS won't tell you the network allocation. You don't know until you start dropping packets.

Ugh, I hate that. They also have hidden limits on the number of incoming TCP connections you can have.


Worse still, we spent weeks on the phone with AWS insisting there was no throttling - only to find out there was throttling. And it would have been discoverable if a single engineer had looked at CloudWatch.


Do you have Enterprise Support? If so, this is a very surprising result.


Just Business Support, we're not that large of a shop.


> They also have hidden limits on the number of incoming TCP connections you can have.

Do you have proof of this?


No, I had an experiment I was trying to run on EC2 in 2013 and ran into this. It was very clear, though: established connections would plateau, and then no more TCP SYNs would arrive at the EC2 host unless a connection was closed. At the time the limit was 5,000 connections on the micro instance -- from my notes, we got an allocation for hi1.4xlarge and I know we hit the limit there too, but I don't recall what it was. This was a very simple TCP proxy, using Linux kernel ipmasq; this uses significantly less memory than HAProxy, although with far fewer features.

We had a very excited account rep because of where I work, but he was only barely able to confirm the limits were there; he wasn't able to get them raised or removed, nor could he tell us the limits by machine type.

There's a thread from 18 months ago on the AWS forums [1] where an AWS rep more or less confirms the limits exist, but again provides no specifics.

[1] https://forums.aws.amazon.com/thread.jspa?threadID=231806


The "Connection Tracking" portion of http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-net... provides some insight here and describes a method to avoid connection tracking in EC2's firewall.


Thanks, this looks like what I needed to know!


> I've been hoping for their release since their announcement in November 2016 (and they were already late for Skylake at that point)

I don't know what you mean by 'late'. Skylake Xeons were delayed and have only recently been released, at least partially. You may be able to get them from vendors like Dell, but you still can't go out and buy one anywhere that I'm aware of.


Does the updated documentation at http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitorin... explain the CloudWatch metrics?


That's more helpful, but that's a lot of special cases - and thought - to apply every time we look at a C5-attached volume as opposed to any other. It also makes no mention of Average Read Latency and Average Write Latency, which appear to be incorrect in all dimensions (average, min, max, sum).


We've found that CloudWatch is useful for some system and I/O related metrics (i.e., from the EBS SAN's point of view) but less useful for others (i.e., metrics from the VM's point of view). It's worthwhile to install a monitoring agent on the VM that can collect I/O latency and other statistics there too. We use Datadog, but there are lots of options out there (collectd etc.).


This is a topic that we're continuing to iterate on between EBS, CloudWatch, and the AWS console team for metrics. When a newly introduced behavior makes step-function changes in the graphs displayed in the AWS console, it doesn't meet the principle of least astonishment.

We're also continuing to investigate the reported latency in the console. Because this is a derived metric from VolumeTotal{Read,Write}Time and Volume{Read,Write}Ops, there may be a miscalculation happening due to the change in dimensions.
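The derived metric described above can be sketched as a simple ratio (a minimal illustration, not AWS's actual console code; the unit handling is an assumption based on the documented meaning of the metrics):

```python
def average_latency_ms(volume_total_time_seconds: float, volume_ops: float) -> float:
    """Derive average per-operation latency from EBS CloudWatch metrics.

    VolumeTotal{Read,Write}Time reports cumulative I/O time (in seconds)
    over the period; Volume{Read,Write}Ops counts operations in the same
    period. If either side's unit or reporting period changes (as it may
    have with C5/Nitro), this ratio silently inflates or deflates.
    """
    if volume_ops == 0:
        return 0.0
    return (volume_total_time_seconds / volume_ops) * 1000.0

# 30 seconds of cumulative I/O time over 60,000 ops -> 0.5 ms per op
print(average_latency_ms(30.0, 60_000))  # 0.5
```

A unit mismatch of, say, microseconds reported where seconds are expected would multiply the displayed latency by a large constant factor, which is consistent with the order-of-magnitude errors reported upthread.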


Not sure if this is related, but when I was getting EFS working last year I noticed that CloudWatch graphs were often complete nonsense due to EFS not logging zeroes for idle filesystems but CloudWatch treating this as "missing data" rather than "implied zeroes".
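The missing-data-vs-implied-zero distinction is easy to illustrate (hypothetical datapoints, not real EFS output): averaging only the reported samples inflates the result, while treating the gaps as zeroes gives the true picture.

```python
# Five one-minute periods; the filesystem only emitted datapoints for the
# two busy minutes, leaving the three idle minutes as gaps, not zeroes.
reported = {0: 100.0, 3: 80.0}  # minute -> IOPS (made-up numbers)
periods = 5

# "Ignore missing data": average over the reported points only
avg_ignore_missing = sum(reported.values()) / len(reported)

# "Treat missing as zero": average over every period
avg_implied_zero = sum(reported.get(m, 0.0) for m in range(periods)) / periods

print(avg_ignore_missing)  # 90.0
print(avg_implied_zero)    # 36.0
```

The first number is what a graph that skips missing datapoints shows; the second is what the workload actually averaged.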


#2 is expected to work. Can you reach out to me directly (joemag@) and I will make sure we take a look.


There's an open support ticket, but I've reached out as well.


So I wouldn't say C5s don't work at all with NLB instance target groups - I'm running one right now.

I wouldn't be surprised if you're hitting an edge case with the tighter integration* between NLBs and instance associations. If you haven't already, please do reach out to support.

* from the docs "If you specify targets using an instance ID, the source IP addresses of the clients are preserved and provided to your applications. If you specify targets by IP address, the source IP addresses are the private IP addresses of the load balancer nodes."


It appears the specific issue is with an NLB in one subnet referencing a C5 in another subnet. It's useful to put the targets in a private subnet and the NLB in a public subnet, so this breaks usage for us.


You have given me a whole other perspective on AWS customers. I spend a lot on AWS ($70k/month+), but I'd be happy with Core 2 Duo CPUs!

What sort of work do you have that requires the latest generation? Or rather, why do you want the latest?

I'd expect AWS to always be behind the ball - are they the right platform for you?

GCE is interesting; I'm moving half of my infra over there - but again, it's not really about their hardware offerings. It's more about being multi-cloud/redundant.


Our particular use case involves a few processes that are heavily serial yet memory-intensive; the fastest possible processor would be a boon to us (for instance, if someone would guarantee an overclocked Xeon, we'd take it in a heartbeat at almost any price).

AWS is indeed behind the curve more often than not, but they also have some amazing hosted products. That was a bigger deal in 2014 than it is now (Kubernetes will eat the world), but it's still a reliable offering, and reliability is our most important criterion.

Agreed re: being multi-cloud in the end. It's the only responsible choice above a certain scale.


> YOU'RE BUILDING A HARDWARE FRONT-END TO EBS? You guys are insane!

It seems more likely that they've put in a software device model of NVMe as a replacement for the blkback software device model that the blkfront driver talked to. Not much different from the software e1000 NIC that Xen/QEMU already supports.

That said, with their virtualizable Annapurna wonder-NIC, they could be doing it in "hardware", though even in that case a reasonable part of the device model would be software, just running on a NPU, not a CPU.

Hopefully Amazon will disclose more details.


There's absolutely no way that they would get the performance I'm seeing from an emulated disk. We're talking to real hardware, exposed via PCI passthrough.

Now, exactly what form that hardware takes is an open question. I would assume it's something like "NVME interface hardware" + "ARM CPU which implements the EBS protocol" + "25 GbE PHY", but that guess is based solely on "that's how I would design it".


It's fun to speculate about how other clouds do things :)

> There's absolutely no way that they would get the performance I'm seeing from an emulated disk. We're talking to real hardware, exposed via PCI passthrough.

There's a wide spectrum between "emulated" and "real hardware, exposed via PCI passthrough". Passing through to PCI hardware, in and of itself, gains you very little in terms of absolute guest-visible performance versus eliding all VMEXITs via other means, but it has other important characteristics that I suspect AWS very much wants in the c5 family.

> I would assume it's something like "NVME interface hardware" + "ARM CPU which implements the EBS protocol" + "25 GbE PHY", but that guess is based solely on "that's how I would design it".

I would expect something along these lines, although I'd be a little surprised if they bothered putting the NVMe bits down in silicon. My personal guess would be a good silicon DMA engine and PCIe interface married to sufficient general-purpose processing (ARM SoC, FPGA, etc.) to keep pace with the NVMe Command/Completion queues.

Regardless of how they've implemented it, the end-to-end result seems to hang together very nicely. Kudos to the team at AWS.

(note: I work on Google Compute Engine's hypervisor; my speculation about AWS really is speculation about how they'd do this — the two companies have very different engineering approaches, so I may be entirely wrong trying to project onto theirs — like I said at the top, it's fun to try :)


> I would expect something along these lines, although I'd be a little surprised if they bothered putting the NVMe bits down in silicon.

You and Colin both know they bought Annapurna Labs, right?

We don't have to speculate _that_ much about what is probably going on here...


That's why I said ARM for the EBS protocol handling rather than MIPS. :-)

I guessed a hardware NVMe interface because that seems like something which could be acquired more or less off-the-shelf, thus minimizing the engineering risks.


I do, hence my speculation about putting the NVMe in firmware instead of hardware :)

(Colin's speculation in a peer reply is also reasonable -- personally I've seen enough errata in "off the shelf" IP to shudder at the idea of anything in silicon that doesn't have to be, but fundamentally I'm a SWE, so that would be my take, wouldn't it)


The way I see it, everything has errata... but if you're taking something off the shelf, it's more likely that someone else already found them. :-)


And I could speculate on hypervisor bypass in Andromeda 2.1 ;-)


Indeed! I didn't leave too much to the imagination with my replies on the original post[0], though.

Honestly, I'm mostly curious about how much of "KVM" you're running that's stock, how much is modified, and how much of the userland is running on the far side of PCIe rather than in host ring3 (particularly given "C5 instances are built using a new light-weight hypervisor, which provides practically all of the compute and memory resources to customers’ instances.").

[0]: Especially this one: https://news.ycombinator.com/item?id=15641391


_msw_ mentioned this in the announcement thread (https://news.ycombinator.com/item?id=15640040#15640360):

"but the latest generation EC2 instances offload networking and storage processing to hardware. This is the case for both for instances that use Xen and C5 that uses the new KVM-based hypervisor."

So that seems to confirm an offload to hardware for storage.


Yes, sorry, I fully expect that's what they're doing -- I was really writing two replies there --

1) You don't have to pass PCIe through to hardware to get performance that makes it look like you have.

2) We know they're offloading though, so here's what I think it looks like.


Throughput is one thing, but what does the latency look like? I am no expert, but my assumption is that latency is the bigger problem with network-attached storage.


NVMe supports SR-IOV in much the same way that NICs do - which I suspect is how AWS is delivering "physical" NICs to VMs currently. So it's a pretty safe bet that this is how NVMe devices are being delivered to guest VMs.


> Hopefully Amazon will disclose more details.

We will have some more details on how this all works at re:Invent in a couple weeks.


Colin is one of the many smart wizards we are lucky to have in FreeBSD land. Well done Colin!


I can't claim much credit here. I haven't made anything work; all I did was figure out what didn't work and let the right people know.


That is basically what a Project Manager does, and from my perspective it is a lot more valuable than it may seem from both inside and outside.


Fair enough. Maybe I should say that I didn't demonstrate any special talents here. What I do could have been done by anyone in the FreeBSD project, but I ended up managing the FreeBSD/EC2 platform by accident and now I'm the obvious person to keep on managing the platform.


> Special talents

Speaking from my (limited) project management experience, I found that distributing tasks to the right person is actually a talent in itself.

Note: I was actually surprised how many people could not do it when I was trying to transition away to another project.


Sounds like you worked a certain kind of magic.

Looking forward to the upcoming 11.2 AMIs :)


What kind of device node does FreeBSD hang off NVMe? Is it da or something not CAM? Haven't really been paying attention.


It's nvme. The GEOM disks (which are what you want to use) show up as /dev/nvd#.


thus spoke the manual nvme(4): "The nvme driver creates controller device nodes in the format /dev/nvmeX and namespace device nodes in the format /dev/nvmeXnsY."

It's been a while since I've had access to NVMe gear, but it "just worked" at the time - although my use case was a daemon that accessed the block device directly to do its own horrible things to it.



