
Visualizing Meltdown on AWS - mike_heffner
https://blog.appoptics.com/visualizing-meltdown-aws/
======
patrickxb
It would be nice if AWS could write something official about what they are
doing.

I've been noticing major performance changes in our instances and have no idea
if it is related to Meltdown or something else.

Google released a blog post specifically on performance:
[https://blog.google/topics/google-cloud/protecting-our-
googl...](https://blog.google/topics/google-cloud/protecting-our-google-cloud-
customers-new-vulnerabilities-without-impacting-performance/)

It would be nice to have similar transparency from AWS.

~~~
pathseeker
Google likely only wrote a blog post about it because they were able to find a
way to brag that there was effectively no performance hit. IIRC, they have
never written a blog post explaining bad/unpredictable performance on GCE.

~~~
jacksmith21006
Do they not deserve to brag about it? Heck, they are among those who found
these flaws, as well as Broadpwn, Cloudbleed, and Heartbleed, among others,
and they deserve credit for the work they have done, imo.

I think the constant ragging on Google that appears on HN is becoming a little
too much.

~~~
plandis
It was Google who discovered the vulnerabilities in the first place. How long
did they wait internally, giving themselves an advantage, before disclosing
these? I'm honestly asking; I don't know.

------
rdtsc
One interesting aspect of these issues and mitigations is that the performance
impact really depends on the workload. Just because Google saw little
performance impact on their servers doesn't mean your application won't see
any. And just because someone said their CPU usage went up 2x doesn't mean it
will go up for you.

On an unrelated note, I kind of wish Meltdown had been discovered and exposed
separately from Spectre. Intel has managed to weasel its way out of taking
responsibility by implying that this is not a bug and that all the other CPUs
have similar issues. If they had to respond to Meltdown only, it would have
been a bit harder for their PR and legal departments to deny the security and
performance implications.

~~~
boulos
Disclosure: I work on Google Cloud.

Just a nit, we said [1] that not only are our own applications in production
doing fine (even against Variant 2) but also we haven’t been inundated with
support calls over the last few months while mitigations were silently rolled
out at the host kernel and hypervisor layer. So _this_ class of “Hey, my
instances are suddenly way slower, I didn’t do anything” isn’t happening on
GCE.

That does _not_ mean that if _you_ perform guest OS updates, don’t have PCID
enabled, etc. that you won’t see a degradation. We certainly haven’t tried all
permutations of all guests, and the kernel patches are still improving. That’s
why we’re actively trying to get everyone to rally behind retpoline, KPTI
(with PCID enabled), and so on.

[1] [https://www.blog.google/topics/google-cloud/protecting-
our-g...](https://www.blog.google/topics/google-cloud/protecting-our-google-
cloud-customers-new-vulnerabilities-without-impacting-performance/)

~~~
rdtsc
> but also we haven’t been inundated with support calls over the last few
> months

Sorry, I didn't mean to imply that it was Google's customers specifically who
should measure and be suspicious. I meant in general: say someone running a
service on bare metal, outside of any public cloud, draws the conclusion
"Google measured the impact of these bugs and mitigations and didn't see any
significant performance regressions, so I probably don't have to worry
either".

Also (since you disclosed you work on GCP), I like how Google sponsored
Project Zero. Fantastic work, and I am sure it will be a great return on that
investment. I can certainly see someone thinking about that when deciding
between GCP and another solution.

------
mrep
My team saw a 40% CPU usage increase on all of our EC2 instances and even our
RDS instances. We were shocked since the media was downplaying the performance
impact.

I tried to start a poll but it seems as though my team was just the unlucky
one:
[https://news.ycombinator.com/item?id=16109036](https://news.ycombinator.com/item?id=16109036)

~~~
schak
Hi,

How are you trying to measure the performance impact? Are you checking the
CloudWatch data or running a specific test?

I'd like to establish baseline statistics (unpatched so far) and compare them
after patching. Any suggestions?
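For context, the kind of comparison I have in mind, sketched on hypothetical CPUUtilization datapoints (in practice these would be pulled from CloudWatch, e.g. via boto3's `get_metric_statistics` for the AWS/EC2 CPUUtilization metric; the numbers below are made up):

```python
# Compare mean CPUUtilization between a pre-patch and a post-patch
# window. The datapoint values are hypothetical; in practice they
# would come from CloudWatch (AWS/EC2 CPUUtilization).
from statistics import mean

def percent_change(before, after):
    """Percent change in mean utilization between two windows."""
    b, a = mean(before), mean(after)
    return (a - b) / b * 100.0

# Hypothetical hourly averages (percent CPU).
pre_patch = [30.1, 29.8, 30.4, 30.0]
post_patch = [42.2, 41.7, 42.5, 42.0]

print(f"{percent_change(pre_patch, post_patch):.1f}% increase")  # 40.0% increase
```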

~~~
mrep
My team's usage is pretty even, so the CloudWatch graph is pretty obvious:
[https://m.imgur.com/a/khGxU](https://m.imgur.com/a/khGxU)

~~~
tomsmeding
That is the clearest graph I've ever seen showing the degradation in
performance. How do you get your usage so flat?

------
hanklazard
Pardon this likely naive question, but I haven’t seen it addressed yet in all
the coverage: what’s the cost in electricity of patching this vulnerability?
Does a company like Amazon, running a massive cloud infrastructure, see a non-
negligible increase in its cost of doing business?

~~~
dkuebric
It may impact the AWS control plane and amazon.com, but as far as AWS services
go, it just means customers will be paying for more instances.

~~~
cm2187
But for VMs, do customers pay by CPU usage or by runtime? I am sure the
Netflixes of this world optimise their CPU usage, but I also suspect the
majority of VMs are mostly idle or have little traffic, as their tasks are
either intermittent or sized for peak usage.

~~~
013a
Technically, you pay for both time (hours on) and CPU usage (instance tier).
It's not like different instance tiers (at least in the same class) use
fundamentally more or less powerful processors. They all use the same
processors; you just get more or less of one depending on what you pay.

Conceptually it is "pay as you use" by CPU usage, just rounded into buckets by
instance tier.

Of course, there's a _lot_ of underutilization within each bucket, because the
granularity isn't per 1% used but (more or less) per 100% used (aka each
core). And most applications can't switch instance tiers easily to adapt to
demand (though some certainly can).
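A quick back-of-the-envelope illustration of the bucketing; the prices below are rough approximations of m5 on-demand rates at the time, so treat the exact figures as assumptions:

```python
# Within an instance class, price scales linearly with vCPU count,
# so the effective per-core rate is constant: you buy capacity in
# core-sized buckets. Prices are illustrative approximations.
tiers = {
    "m5.large":   {"vcpus": 2, "usd_per_hr": 0.096},
    "m5.xlarge":  {"vcpus": 4, "usd_per_hr": 0.192},
    "m5.2xlarge": {"vcpus": 8, "usd_per_hr": 0.384},
}

for name, t in tiers.items():
    per_vcpu = t["usd_per_hr"] / t["vcpus"]
    print(f"{name}: ${per_vcpu:.3f}/vCPU-hour")  # $0.048 for every tier
```

Every tier works out to the same rate per vCPU-hour; the granularity you pay for is whole cores, not percent used.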

~~~
user5994461
>>> Its not like different instance tiers (at least in the same class) use
fundamentally more or less powerful processors. They all use the same
processors, you just get more or less of it depending on what you pay.

There is a variety of CPUs. You can "cat /proc/cpuinfo" to see what you got.
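For example, a small script (assuming a Linux-style /proc/cpuinfo) that tallies the models reported:

```python
# Tally the distinct "model name" entries in a Linux /proc/cpuinfo.
from collections import Counter

def cpu_models(path="/proc/cpuinfo"):
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                counts[line.split(":", 1)[1].strip()] += 1
    return counts

# e.g. Counter({'Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz': 2})
```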

>>> most applications can't switch instance tiers easily to adapt to demand
(though some certainly can).

Most applications can and do change. Changing the instance type is just a
reboot of the machine.

~~~
013a
> There is a variety of CPUs.

There certainly are, but generally they are the same within the same "class"
(ie t2, c5, etc). Example:

- C5: 3.0 GHz custom Xeon processors
- C4: Xeon E5-2666 v3
- C3: Xeon E5-2680 v2
- X1/X1e: Xeon E7-8880 v3
- R4: Xeon E5-2686 v4
- R3: Intel Xeon E5-2670 v2

Etc. The only exception I'm aware of is the burstable instances; AWS's
documentation doesn't explicitly guarantee a CPU bin there, it just says "high
frequency Xeon".

The difference between an m5.large and an m5.xlarge isn't that you're getting
a "faster" processor (meaning higher gigahertz or a newer architecture);
you're just getting "more" of the same processor (more cores). This is
different from, say, GCP, where you just ask for cores and can _specify_ a
broad generation of Xeon (like Broadwell), but you can't be guaranteed
specific chips.

My intention behind saying that most applications can't switch instance tiers
is more to point out that most applications aren't prepared to handle node
failure, not that there is something intrinsic to the node types which stops
them from being able to switch (that'd be much more rare).

------
mike_heffner
Would love to know if anyone else had data on:

* Impact on M5/C5 instances over similar time period, any difference with the Nitro hypervisor?

* Were Dedicated instances ([https://aws.amazon.com/ec2/purchasing-options/dedicated-inst...](https://aws.amazon.com/ec2/purchasing-options/dedicated-instances/)) patched as well?

* Other examples of software that automatically adapted its batching behavior as call latency increased.

~~~
taf2
We had a lot of m5 and c5 servers randomly die. It was as if someone was
running chaos monkey from Netflix in our VPCs...

~~~
otterley
Likewise. Can you reach out to me privately? I'd love to have independent
corroboration.

------
jdangu
Does anyone have more info on the performance recovery today? We experienced
similar performance issues over the last few days, with a seemingly complete
recovery today (on a cluster of ~2500 HVM T-1s).

~~~
bpicolo
Very curious as to what changed today if performance increased. Some sort of
smarter patch? That'd be an amazingly impressive thing to cobble together so
quickly.

------
k__
This is especially interesting for workloads that already ran at >70% CPU.

Some stuff won't run in the free tiers anymore, and people will have to switch
to bigger machines :/

------
yclept
We saw instances which normally kept a healthy stock of CPU credits quickly
burn through them and severely degrade in performance thanks to Meltdown :<
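The arithmetic, roughly; the earn rate and balance below are assumptions based on AWS's published t2.micro numbers (~6 credits earned per hour, 144-credit cap):

```python
# Rough model of t2 CPU-credit drain. One credit = one vCPU at
# 100% for one minute; a t2.micro earns ~6 credits/hour (a 10%
# baseline) and caps at 144 credits. Rates are assumptions.
def hours_until_empty(balance, cpu_pct, earn_per_hr=6.0):
    """Hours until the balance hits zero at a steady CPU load."""
    burn_per_hr = cpu_pct / 100.0 * 60.0  # credits spent per hour
    net_drain = burn_per_hr - earn_per_hr
    if net_drain <= 0:
        return float("inf")               # at/below baseline: never
    return balance / net_drain

# A full balance at a post-patch steady 25% load drains in 16 hours:
print(hours_until_empty(144, 25))  # 16.0
```

So a workload that used to sit at the 10% baseline and drifted up after the patches can go from "never runs out" to exhausting a full balance within a day.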

------
alacombe
Trying to foresee the future...

Could we expect Intel to fix the design flaw^Wfeature so that future server
(but also desktop) CPUs can run without KPTI while still not being affected by
Meltdown? If so, what timeline could we expect? Say a year for new CPU
designs, plus a year to roll out new machines in datacenters?

~~~
MBCook
What I’ve seen is that it takes five years to design a CPU from scratch.

I imagine they’ll try to rush this out as fast as possible (obviously a lot of
people would like to buy CPUs that don’t have this issue, for
security/performance reasons), but it’s going to take a while. I think years
is definitely the minimum.

Meltdown is easy enough (relatively) but Spectre is kind of a disaster. What
do you do? Does the branch predictor have to start tagging every branch guess
with some sort of process ID to prevent one process from messing with
another’s predictions? Tag the cache lines instead so even though the data is
in cache you can’t see it because YOUR process didn’t pull it in yet? What a
mess.

~~~
voidlogic
>Does the branch predictor have to start tagging every branch guess with some
sort of process ID to prevent one process from messing with another’s
predictions?

It's worth pointing out that, for their newest designs, AMD (and Samsung
Exynos) use the full memory address for branch prediction; no doubt Intel's
next design will do the same.

~~~
MBCook
Ah, that makes sense. Sounds like a much less complicated fix than my idea.

------
scurvy
"Why I like to run my own hardware for $100, Alex"

You can patch various tiers of servers at your own leisure, depending on
threat levels and exposure, measure the impact, do capacity planning, etc.,
rather than having it forced on you across all tiers because cloud.

~~~
amazingman
You forgot to type 5 or 6 zeros there.

~~~
gnosek
[https://www.kimsufi.com/us/en/](https://www.kimsufi.com/us/en/)

Granted, that's the bottom of the barrel (single disk, no IPKVM, etc.), but
$100 keeps you running for over a year. Better servers are easily available as
well, usually several times cheaper than AWS.

Is this a US thing? Based on HN only, I'd never know there's anything between
the public cloud and racks of own hardware that you have to wire up and
maintain.

I have a bunch of quad core 32 GB machines with dual 480GB SSDs for less than
$100/month each (and that's a rather expensive provider with great support,
you'll cut the price almost in half with e.g. SoYouStart).

Yes, AWS is convenient, but it's far from the only thing in the world.

~~~
user5994461
HN has a lot of professionals. They can't run a business on a refurbished
server without ECC, without RAID, and without dual power supplies.

Saying that they should run on Kimsufi is like telling a wholesale company to
use motorbikes instead of trucks because motorbikes are cheaper.

~~~
late2part
AWS advocates eventual consistency and, I believe, offers less than three
nines of guaranteed uptime on many products.

We've been taught to build distributed systems with unreliable components and
temporal, JIT eventual consistency.

Of course we can run production cloud-scale operations on unreliable systems.
And single power supplies without ECC or RAID are pretty low on my list of
things that cause outages. Most big Hadoop/Cassandra shops are running without
RAID and without redundant power.

~~~
user5994461
Hadoop and Cassandra replicate data to multiple nodes in software. That's a
terrible argument for not using any RAID in regular setups.

~~~
late2part
I suspect your and my definition of regular setups is different. My regular
setups are stateless and automatically installed and configured.

~~~
user5994461
Lucky you, never having to deal with any database or storage.

------
perfmode
Over a 15 year time scale, there is no way AWS will remain competitive with
GCP.

~~~
dgsb
Why is that? They both belong to the top cloud providers today, and both seem
to invest heavily in R&D. I don't see anything obvious about what could happen
in the future.

~~~
perfmode
Google’s network engineering is better.

------
bufferoverflow
Is there an option for AWS Dedicated Instances without these patches? I
thought all these new vulnerabilities are only really dangerous in shared
environments.

~~~
nolok
> I thought all these new vulnerabilities are only really dangerous in shared
> environments.

You're not the first I've seen saying that, here and on other sites, and it is
absolutely wrong.

Shared environments like clouds were singled out because not only were they
impacted the worst security-wise, they were also going to suffer the most from
the fixes.

But even if you only have a regular, normal, happy server or computer for you
alone, remote code execution vulnerabilities aren't unheard of; once one of
your applications (be it your own, a specific one you use, or one of the
bazillion things running on your system as part of the OS) gets broken, you're
a free target. The same goes for anything with a proper scripting surface.

If your system isn't protected and any of the applications you run has a major
security hole, everything will be at risk.

------
k__
Guess humanity lost 30% of its computing power.

