
EC2 Bare Metal Instances with Direct Access to Hardware - jeffbarr
https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-instances-with-direct-access-to-hardware/
======
KenoFischer
I'm really, really, happy about this. I've been complaining about the lack of
cloud servers with exposed performance counters to any cloud vendor that'll
listen (though of course nothing ever came of that). Kudos AWS, this is really
cool.

~~~
aliguori
Thanks! Would love to hear more about the counters that you're interested in.
We've exposed more in C5 than in previous instance types and we are trying to
make more available over time in a safe way.

~~~
KenoFischer
I have two use cases:

\- General performance analysis. For this, more counters are incrementally
better.

\- Running [https://github.com/mozilla/rr](https://github.com/mozilla/rr).
This requires the retired-branch-counter to be available (and accurate -
sometimes virtualization messes that up)

The second one I actually care more about, because I've pretty much stopped
trying to debug software when rr is not available, too painful ;). Feel free
to email me (email is in my profile) for gory details.

~~~
khuey
For the benefit of anyone reading this, KVM and VMware virtualization
generally work. Xen has problems because of a stupid Xen workaround for a
stupid Intel hardware bug from a decade ago. I can provide more details about
that via email (in my profile) if desired.

~~~
paulie_a
Can you please just post the info. Intel deserves to be shamed

~~~
khuey
One of the things the performance monitoring unit (PMU) is capable of doing is
triggering an interrupt (the PMI) when a counter overflows. When combined with
the ability to write to the counters, this lets you program the PMU to
interrupt after a certain number of counted events. Nehalem supposedly had a
bug where the PMI fires not on overflow but whenever the counter is zero. Xen
added a workaround to set the value to 1 whenever it would otherwise be 0.
Later this behavior was observed on microarchitectures other than Nehalem, and
Xen broadened the workaround to run on every x86 CPU. Intel never provided any
help in narrowing it down, and there don't seem to be official errata for this
behavior either.

This behavior is OK for statistical profiling of frequent events, but if you
depend on _exact_ counts (as rr does) or are profiling infrequent events, it
can mess up your day.

[https://lists.xen.org/archives/html/xen-devel/2017-07/msg02242.html](https://lists.xen.org/archives/html/xen-devel/2017-07/msg02242.html)
goes a little deeper and has citations.
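The off-by-one consequence of that workaround can be seen in a toy model. This is a sketch in Python, not real PMU programming (real counters are programmed via MSRs or `perf_event_open`); the 48-bit counter width and the exact placement of the fixup are illustrative assumptions:

```python
# Toy model of overflow-based PMU sampling and the Xen zero-fixup.
WIDTH = 48
MASK = (1 << WIDTH) - 1

def preload(period):
    """Value written to the counter so the PMI fires after `period` events."""
    return (-period) & MASK

def count_events(start, n, workaround=False):
    """Advance the counter by n events; return (final value, overflowed?)."""
    v = start
    overflowed = False
    for _ in range(n):
        v = (v + 1) & MASK
        if v == 0:
            overflowed = True          # the wrap is where the PMI fires
            if workaround:
                v = 1                  # Xen's fixup: never leave the counter at 0
    return v, overflowed

# Normal hardware: preload for 3 events, count 3 events, land exactly on 0.
assert count_events(preload(3), 3) == (0, True)

# With the workaround the counter reads 1 after the wrap, so subsequent
# reads are off by one -- fatal for tools like rr that need exact counts.
assert count_events(preload(3), 3, workaround=True) == (1, True)
```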

------
LogicX
Packet.net, arguably the leader in API-driven, on-demand bare metal instances,
recently blogged about this: [https://www.packet.net/blog/why-we-cant-wait-for-aws-to-announce-its-bare-metal-product/](https://www.packet.net/blog/why-we-cant-wait-for-aws-to-announce-its-bare-metal-product/)

I am a customer of Packet's, along with other virtual and dedicated hosting
providers; I don't use AWS EC2. I've been pleased with Packet, and their
offerings are much more diverse than this initial offering from AWS.

~~~
larrymcp
I just now took a look at Packet's web site and their data center locations.
They categorize each location as either "core" or "edge", but I couldn't find
anything to indicate what those terms mean in this context. Are you familiar
with that distinction?

The location nearest me is an "edge", not a "core". I wonder what I would be
missing out on, if it's not "core".

~~~
mcrae
Edge DCs only have one type of machine (1E) and no block storage, but I think
are otherwise the same.

Even in core DCs though the availability of different types of machines
varies.

Love packet.net btw -- the BGP stuff is really game-changing.

------
CSDude
Impressive hardware, but I wonder what the cost will be, considering even the
regular EC2 VMs are generally more expensive than the dedicated offerings of
other providers.

~~~
Radim
Head-to-head cost comparison of Amazon's AWS, IBM's Softlayer, Hetzner and
Google's ComputeEngine on a machine learning benchmark:

[https://rare-technologies.com/machine-learning-hardware-benchmarks/](https://rare-technologies.com/machine-learning-hardware-benchmarks/)

~~~
tyingq
Interesting, but the really big difference comes when you need to push data
out to end users...network egress charges.

~~~
Radim
True! Not with machine learning though. The target of this benchmark is
hardcore number-crunching.

------
nextzuckerberg
What differentiates these from dedicated boxes in a server rack? Is their
dedicated "cloud" hardware somehow managing access to RAM/storage/etc.?

On another tangent: how do Google Cloud and EC2 attach GPUs to instances?
Given that you can choose CPU and RAM, the GPUs must somehow be modularized
away from a dedicated server?

~~~
aliguori
You can provision these servers just like any other instance. They work just
like any other Amazon EC2 instance (same Nitro System platform as C5).

Disclaimer: I work at AWS on the team responsible for the Nitro System
including EC2 Bare Metal Instances.

~~~
rrix2
Is there any information about Nitro or ENA (assuming this is the "hardware
accelerators" mentioned in TFA) that is publicly available? It seems like the
most nifty little thing.

~~~
uji
how many NICs can you attach to these?

~~~
_msw_
15 Elastic Network Interfaces (ENIs) can be attached, just like i3.16xlarge.

------
cperciva
I'm looking forward to testing out FreeBSD on these... and also bhyve, for a
fully BSD virtualization stack.

~~~
nickpsecurity
With them leaning bare-metal and low cost, I wonder if services like these
could be used to bootstrap clouds in VAR form for niche OS's. It might be
useful at the least for getting bugs out of the virtualization software using
diverse workloads. If costs are kept minimal, it might even be profitable if
the niche OS has enough users.

------
will_hughes
> Storage – 15.2 terabytes of local, SSD-based NVMe storage.

That's probably the most interesting aspect for me.

Does anyone know how that's provisioned? I.e., 8x just-under-2 TB volumes, or
something else?

~~~
aliguori
It's exactly the same as with the i3.16xlarge instance type. There are eight
1900 GB drives. In an i3.16xlarge, those eight drives are passed through to
the instance with PCIe passthrough but for the i3.metal instance, you avoid
going through a hypervisor and IOMMU and have direct access.

~~~
will_hughes
Thanks.

I guess some other open questions:

\- If one of those drives fails, will Amazon hot-swap it, or do you need to
migrate to a new instance? (Moving TBs of data to a new box without causing
outages can be painful.)

\- Is there a hardware RAID controller for those drives, or is it software
only?

\- Can anyone with access to one of these boxes produce some I/O performance
stats on them? Bonus points for stats on a single drive vs. concurrent access
across all drives (i.e. is there any throttling?). More points for RAID10
performance across the whole 8.

~~~
_msw_
The local NVMe storage for i3.metal is the same as i3.16xlarge. There are 8
NVMe PCI devices. For i3.16xlarge those PCI devices are assigned to the
instance running under the Xen hypervisor. When running i3.metal, there simply
isn't a hypervisor and the PCI devices are accessed directly.

\- There is no hot swap for the NVMe storage.

\- The 8 NVMe devices are discrete; there is no hardware RAID controller.

\- Anyone can get I/O performance stats on i3.16xlarge as a baseline. Intel
VT-d can introduce some overhead from the handling (and caching) of DMA
remapping requests in the IOMMU and from interrupt delivery, so I/O
performance may be a bit higher on i3.metal, with a few microseconds lower
latency.
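Since there is no hardware RAID controller, striping across the eight drives would be done in software (e.g. Linux md). A hypothetical sketch that only builds the `mdadm` command line for the RAID10 case asked about above; the `/dev/nvmeXn1` device names are assumptions and nothing is executed:

```python
# Build (but do not run) an mdadm invocation for software RAID10
# across N discrete NVMe drives.
def mdadm_raid10_cmd(n_drives=8, md="/dev/md0"):
    # Device names assume the drives enumerate as /dev/nvme0n1 .. /dev/nvme7n1.
    devices = [f"/dev/nvme{i}n1" for i in range(n_drives)]
    return (f"mdadm --create {md} --level=raid10 "
            f"--raid-devices={n_drives} " + " ".join(devices))

print(mdadm_raid10_cmd())
```

On a real instance you would run the printed command as root and then mkfs the resulting `/dev/md0`.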

------
andy_ppp
For all this progress, AWS billing is so damn confusing (try figuring out
whether some machine has been left on unused) that I won't use AWS again. GCE
and Azure are miles ahead here.

~~~
_wmd
It takes all of 3 clicks to figure this out using Cost Explorer.

~~~
andy_ppp
Why did AWS support send me a 5329 word message on how to check every region
and service then?

~~~
syncsynchalt
Cost Explorer takes [up to] 24 hours to set up, so it's not a good answer to
support questions about billing.

------
azinman2
So if it’s truly bare, how does Amazon give and take control of the machine
for provisioning? Don’t they still need some kind of hypervisor?

~~~
KaiserPro
Most servers have some sort of "lights out" management, which gives KVM plus
remote imaging and BIOS control.

With Amazon, they have complete control over the network in and out, so
cutting you off and re-imaging a server is pretty trivial.

To be fair, it's not that hard to do even if you're not Amazon.

Most of the big server vendors' out-of-band interfaces have an API, so telling
a server to reboot from a network image is pretty trivial. Providing a netboot
infrastructure to install images with a 'userdata' script is also not that
difficult.

You'll need a DHCP server, TFTP to serve the boot image, and usually an NFS
server to pull the rest of the image over. With some engineering work that
could be made to use HTTP.

[https://wiki.centos.org/HowTos/NetworkInstallServer](https://wiki.centos.org/HowTos/NetworkInstallServer)
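The DHCP and TFTP pieces described above can be served from a single dnsmasq instance. A minimal sketch; the subnet, paths, and boot filename are all example values:

```
# /etc/dnsmasq.conf -- minimal PXE boot server (example values throughout)
interface=eth0
dhcp-range=192.168.1.100,192.168.1.200,12h
dhcp-boot=pxelinux.0      # filename handed to PXE clients via DHCP
enable-tftp
tftp-root=/srv/tftp       # pxelinux.0, kernel, and initrd live here
```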

~~~
detaro
It's a bit harder if you host something like this for the general public to
use (vs. administering machines in your private DC). Normal setups aren't
really hardened against someone flashing firmware, messing with UEFI, etc.,
all of which means you can't entirely trust a machine coming back from
customer control. I wouldn't be surprised if Amazon took this seriously and
invested effort in stopping such things. At their scale, they probably can
customize the hardware enough.

~~~
justincormack
Everyone who sells bare metal as a service takes this seriously. As AWS build
their own hardware, especially in these newer machines, I would guess that
it's not possible to flash firmware from the user machine, only from the
control node.

------
oellegaard
I wonder what it will cost - as of now I don’t see it in the price list

~~~
_msw_
Pricing will come with general availability. I suspect (and hope) everyone
will be surprised and happy with the price.

------
FBISurveillance
This is really good news, happy to see this is an option now.

And thanks for posting this here personally @jeffbarr.

~~~
jeffbarr
Any time!

------
hartator
Why fancy words when they are just offering regular dedicated servers?

~~~
anonu
These were my exact same thoughts. I suppose it's almost like a step back from
the framework of "virtualize everything"... what's old is new..

Addon thoughts: nonetheless, the specs on the bare metal box are ridiculous.
Buying something like that will cost you $50k (someone correct me?), and then
you need to find a place to host it... that's not easy to do.

~~~
rrix2
Because they're still virtualizing literally everything but the actual
computer. You can attach NVMe-backed EBS volumes, snapshot them as normal,
etc. You can have this thing sit in a VPC next to your virtualized
components, with a dedicated 25 Gbps link. They're virtualizing the things you
shouldn't need to care about, leaving you with the bare CPU and access to all
the things that make AWS AWS.

------
witten
Accounting question: Might this qualify as a capital expense? If you squint
hard enough?

For context, AWS is coy (at least publicly) about the existing dedicated
instances and CapEx vs. OpEx.

~~~
_msw_
The information found here should help your finance team or accountants
determine how best to classify your expenses:
[https://aws.amazon.com/ec2/dedicated-hosts/faqs/#Should_I_Consider_a_Dedicated_Host_Reservation_a_Capital_or_Operational_Expense](https://aws.amazon.com/ec2/dedicated-hosts/faqs/#Should_I_Consider_a_Dedicated_Host_Reservation_a_Capital_or_Operational_Expense)

Since EC2 Bare Metal instances will use the same pricing models as all other
EC2 instances (on demand, reserved instances, dedicated host, spot), the same
information is relevant.

------
zrail
Will there be smaller instances available eventually? I'm interested in bare
metal performance but I don't need an instance that huge for my current
workload.

~~~
_msw_
Our goal is for the majority of virtualized EC2 instances to be
indistinguishable from bare metal (if not better). In most CPU- and
memory-intensive benchmarks there is very little difference between a
virtualized EC2 instance and bare metal, especially for smaller numbers of
cores and memory sizes.

------
a012
So now you can rent a dedicated server on AWS, which is nice.

~~~
thesandlord
AWS already had dedicated instances, but they still had a VM running on top.
These are bare metal, which means you run directly on the hardware.

------
sdfjkl
Interesting. I expect we'll be seeing a lot more VPS providers running on AWS
with these instance types.

------
SteveNuts
This will expose virtualization? As in I can run my own virtualization stack
on these instances (KVM, etc)?

~~~
_msw_
EC2 Bare Metal instances provide all the typical Intel processor features,
including VT-x and VT-d. Yes you can use KVM.
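A quick way to confirm VT-x is exposed to your OS is to look for the `vmx` flag in `/proc/cpuinfo` (on Linux). A small sketch; the parsing helper is run against an inline sample string so it is self-contained, but on a real instance you would feed it the file's contents:

```python
# Check a /proc/cpuinfo-style text for a CPU feature flag.
def has_flag(cpuinfo_text, flag):
    """Return True if `flag` appears in the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return flag in line.split(":", 1)[1].split()
    return False

# Sample text standing in for open("/proc/cpuinfo").read().
sample = "processor : 0\nflags : fpu vme vmx ept smx\n"
assert has_flag(sample, "vmx")      # Intel VT-x present in this sample
assert not has_flag(sample, "svm")  # (svm is the AMD equivalent)
```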

------
empath75
Seems like this would be better for container farms, depending on cost.

~~~
icebraining
Seems like a good way to build your own version of Joyent Triton, running
containers without VMs.

------
xena
What is the price? I can't find it.

~~~
syncsynchalt
From _msw_ in this thread: "Pricing will come with general availability. I
suspect (and hope) everyone will be surprised and happy with the price."

------
zsmith928
Awesome stuff! Great to see AWS pushing on the bare metal problem set.

------
GrumpyNl
Isn't this just regular web hosting?

~~~
dboreham
Not quite: this is cloud-provisioned, so you can do things like supply your
own image, and it integrates with all the other AWS services just as virtual
machines do. Provisioning is automated and self-serve. There's also per-second
billing, which you couldn't get in the olden days with hosting.

------
brutopia
Thanks Jeff!

------
titzer
I think Amazon is exposing themselves to far greater security risks than they
realize.

~~~
aeorgnoieang
Like what?

~~~
titzer
Blackhats, state actors, etc., all trying to attack Amazon or colocated
services. As an example (I don't know the extent of "bare metal" access, so I
can't be sure): with the ability to run their own operating system, a client
could potentially get all the way down to the NIC and form arbitrary network
packets. With this they could potentially map and attack Amazon's internal
network protocols (routers, etc.). Any kind of vulnerability in Amazon's
software stack on other servers now gets a whole lot worse. If the client did
this at a very low rate, it would be difficult to detect. Firewalling off
these servers only helps so much, since they could still attack other clients'
colocated servers, or potentially spoof the protocol of Amazon's own server
management.

I hope they have thought this through carefully, because it potentially
exposes everyone on EC2 to more, potentially worse, attacks.

~~~
_msw_
The NIC that is used by EC2 Bare Metal instances is an Elastic Network Adapter
(ENA) PCI device that surfaces a logical VPC Elastic Network Interface. ENA is
implemented in an ASIC that we design and build.

When ENA is used in virtualized instances, Intel VT-d and SR-IOV are used to
bypass the hypervisor. When ENA is used in a bare metal instance, the OS
simply has direct access to the PCI device. In either case the device is a
controlled surface, and VPC software defined networking deals with verifying
and encapsulating network traffic.

~~~
titzer
It's really cool that you design and build your own NICs. They are probably
awesome tech designed by really smart people.

But how many hundreds of millions of lines of code are on these systems,
roughly? Ballpark estimate.

------
k__
OT: Job advice needed (because I think many back-end devs will be here :D)

I'm thinking about going full-stack next year. I have a bit of experience
building APIs besides being mainly a front-end developer.

Is going "cloud only" a good idea? I thought about starting with AWS Lambda,
S3, DynamoDB and the Serverless framework.

Are the providers hugely different or is it a good idea to spread out and do
some Azure and GCP too?

~~~
dotancohen
That's completely off topic. In fact, the question is so broad that I cannot
think of anyplace other than the water cooler or Quora to ask it.

Career advice: Never go "foobar-only". Make an effort to learn "foobar" but
understand whatever is one layer below it in the stack. Want to go "cloud-
only"? Learn OpenCloud, not AWS.

~~~
k__
lol, while complaining you still gave me a decent answer, thanks :)

~~~
dotancohen
Of course, we're all here to help each other! Good luck!

