
Degraded performance after forced reboot due to AWS instance maintenance - woliveirajr
https://forums.aws.amazon.com/thread.jspa?threadID=269858
======
panarky
"I guess you are essentially confirming that the instance maintenance was
likely to be the reason for the major change to cpu usage, and that the only
solution now is to change instance type."

"Of course I am not entirely happy about this. I bought a 3-year reserved
instance 2 years ago, and now have to hope I can sell the remaining year for a
reasonable amount (which may be a stretch given that I am apparently using
legacy instance type), and then purchase new reserved instance after
upgrading."

~~~
taf2
Or migrate to Google, since they don’t have a broken discount billing system.

~~~
RhodesianHunter
I'll stick with the vastly superior offering and swallow the occasional
hiccup, but thanks.

~~~
vertis
I have used Amazon for many years happily, but I'm struggling to understand
what you consider to be vastly superior? Google has some very solid product
offerings.

Google Container Engine for instance is amazing. Granted Amazon has recently
launched a couple of products that offer similar features, but as of mid last
year running containers on Google felt like it was ahead of Amazon.

~~~
fishywang
One thing is s3/gcs latency. At my last job we had a Go server running on
ec2/gce that did occasional reads from s3/gcs (same, single region). We used
to run on ec2+s3 but later switched to gce+gcs for cost reasons (gcp cost only
~60% of what aws did for our setup). But the thing we noticed was that on aws,
reading a single ~10KB file from s3 usually took <100ms, occasionally hit
100ms+, and rarely hit >200ms. On gcp it usually took >100ms, sometimes spiked
to >1s, and rarely went <100ms.
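
A minimal sketch of how a comparison like this could be measured, assuming a hypothetical `fetch` callable that stands in for the single-object read (no real S3 or GCS client API is used here). It times repeated calls and reports the median, the 99th percentile and the maximum, since the difference described above is mostly in the tail:

```python
import time
from statistics import quantiles

def measure_latencies_ms(fetch, runs=100):
    """Call fetch() `runs` times and return each call's wall time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()  # hypothetical stand-in, e.g. reading one ~10KB object
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def summarize(samples_ms):
    """Return the median, 99th percentile and maximum of the samples."""
    # n=100 yields 99 cut points: index 49 is p50, index 98 is p99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples_ms)}
```

Running `summarize(measure_latencies_ms(fetch))` once per provider from the same region would give numbers comparable to the ones quoted above.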

~~~
boulos
Disclosure: I work on Google Cloud.

Yeah, the underlying system is optimized for throughput (though we're working
on small file latency). AWS clearly has a small file optimization and caching
that we don't (yet).

I often point people at Zach Bjornson's blogpost [1] since he compared all
three providers and is a neutral third party.

[1] [http://blog.zachbjornson.com/2015/12/29/cloud-storage-perfor...](http://blog.zachbjornson.com/2015/12/29/cloud-storage-performance.html)

------
eob
Unsure if this is related, but a few weeks back AWS paravirtualized instances
were experiencing extreme clock skew and suffering frequent restarts. Our
clocks were skewing about 8 seconds per hour, and the instances were getting
swapped out at least once per day, sometimes more.

Switching to an HVM instance type resolved the problems.
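
To put the 8 seconds per hour figure in perspective, here is a small worked calculation. The helper names are made up for illustration; the only input is the skew rate quoted above.

```python
def drift_ppm(skew_seconds, interval_seconds):
    """Clock drift expressed in parts per million."""
    return skew_seconds / interval_seconds * 1_000_000

def drift_per_day(skew_seconds, interval_seconds):
    """Seconds of drift accumulated over 24 hours at the same rate."""
    return skew_seconds / interval_seconds * 86_400

print(round(drift_ppm(8, 3600)))      # 2222 (ppm)
print(round(drift_per_day(8, 3600)))  # 192 (seconds per day)
```

That is over three minutes of drift per day if nothing corrects the clock.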

~~~
kainosnoema
It's related. This was during the early versions of the security patch, which
was causing performance and clock skew issues for PV instances
([https://twitter.com/kainosnoema/status/941816042103324672](https://twitter.com/kainosnoema/status/941816042103324672)).
Since then AWS has mostly resolved these, but we've moved all our instances to
HVM AMIs as well since they're apparently not affected.

------
amluto
Two possibilities:

1. Your distro has PTI/KAISER enabled and the reboot caused you to upgrade
your guest kernel. Sucks to be running on an Intel CPU right now...

2. AWS implemented some kind of hypervisor-side mitigation for the VM attack
that made VM entries and exits slower. Sucks to be running on an affected CPU
(which seems to include AMD, too, but I could be wrong).

~~~
chrisper
I am pretty sure the patch needs to be applied on the host kernel and not the
guest.

~~~
alpb
This is not true.
[https://cloud.google.com/compute/docs/security-bulletins](https://cloud.google.com/compute/docs/security-bulletins)
says:

> However, all guest operating systems and versions must be patched to protect
> against this new class of attack regardless of where those systems run.

~~~
chrisper
It depends on what you are trying to fix. VM-to-VM is host only. But now it
seems the patch also needs to be applied to guests, because applications
inside the guests could abuse this as well.

------
phoe-krk
If I understand this correctly, paravirtualization in this scenario means that
the host kernel is impacted by the KAISER patch's overhead AND the guest
kernel is impacted by the KAISER patch overhead, and that these overheads are
multiplicative. Could someone more knowledgeable than me explain and/or verify
my assumption?

~~~
anarazel
No, I don't think that's correct. KAISER/KPTI isn't enabled on the guest side
on xen PV instances:

[https://news.ycombinator.com/item?id=16065647](https://news.ycombinator.com/item?id=16065647)

~~~
drvdevd
but since the _host_ still has to perform some paging (and possibly for the
guest) in paravirt, it should affect guest performance. In this snippet I
think that is paravirt _guest_ code so page tables don’t need to be isolated
twice.

So I think the host kernel’s perf degradation will still affect it.

~~~
anarazel
> So I think the host kernel’s perf degradation will still affect it.

Oh, yes, definitely. I was just responding to "that these overheads are
multiplicative." (I suspect additive rather than multiplicative was meant)

~~~
phoe-krk
No, I actually meant multiplicative. I was worried that the worst case is 30%
syscall overhead on virtualization host multiplied by 30% syscall overhead on
virtualization guest.

Still, even if this mitigation is only enabled on Xen host and not on the
guest, can this attack be then used inside the guest's userspace to read
memory of the guest's kernel, as opposed to reading memory of the host's
kernel?

~~~
Dylan16807
The form of the overhead is doing a set amount of extra work per syscall. It's
hard for me to picture any way that could multiply. The extra work doesn't
change based on how long the syscall takes.
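
A rough sketch of the two models being contrasted, with purely illustrative numbers (the 0.3 µs extra cost and the 1.3 factor are assumptions chosen to match the "30%" figure for a 1 µs syscall, not measurements):

```python
def additive_cost(base_us, host_extra_us, guest_extra_us):
    """Fixed extra work per syscall: the overheads simply add up."""
    return base_us + host_extra_us + guest_extra_us

def multiplicative_cost(base_us, host_factor, guest_factor):
    """The feared model: each layer scales the total cost."""
    return base_us * host_factor * guest_factor

for base in (1.0, 100.0):  # a short and a long syscall, in microseconds
    add = additive_cost(base, 0.3, 0.3)
    mul = multiplicative_cost(base, 1.3, 1.3)
    print(f"base={base:6.1f}us additive={add:7.2f}us multiplicative={mul:7.2f}us")
```

Under the fixed-cost model the relative overhead shrinks as the syscall gets longer, while the multiplicative model would charge 69% regardless of how long the syscall takes.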

------
blaisio
Yeah this is due to the recent security bug related to speculative execution
on CPUs. It causes a performance hit, and they have to force people who
haven't restarted in a long time to restart now because it's around the
official disclosure deadline.

------
slizard
First indications of early KAISER fix testing in the wild..?

(which might also have been a test to see how customers will react)

~~~
anarazel
I think this is possibly the xen PV side of the fix. Linux' PTI mitigation
isn't enabled for paravirt xen:

    
    
    if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
            pti_print_if_insecure("disabled on XEN PV.");
            return;
    }
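
From inside a guest, one way to see what the kernel decided is the sysfs vulnerability report. This is a minimal sketch, assuming a kernel new enough to expose `/sys/devices/system/cpu/vulnerabilities/meltdown`; the `classify` and `read_status` helpers are hypothetical names, and any string not matched below is reported as "unknown" rather than guessed at.

```python
from pathlib import Path

MELTDOWN_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")

def classify(status_text):
    """Map the kernel's one-line status report to a coarse category."""
    text = status_text.strip()
    if text.startswith("Mitigation:"):
        return "mitigated"      # e.g. "Mitigation: PTI"
    if text.startswith("Not affected"):
        return "not affected"
    if text.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"

def read_status(path=MELTDOWN_STATUS):
    """Read and classify the status file; a missing file counts as unknown."""
    try:
        return classify(path.read_text())
    except OSError:
        return "unknown"  # older kernel or non-Linux system

if __name__ == "__main__":
    print(read_status())
```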

~~~
justryry
Interesting. I took a look and this is the only hypervisor specific piece of
code I can find in the patches.

I have wondered what the impact would be on hypervisors. Xen seems like they
patched it in a way that removes the need for guests to mitigate, but would
guests of other hypervisors get hit with the penalty twice in some cases?

~~~
panarky
> removes the need for guests to mitigate

Google says guests need to upgrade.

_"Compute Engine customers must update their virtual machine operating
systems and applications so that their virtual machines are protected from
intra-guest attacks and inter-guest attacks that exploit application-level
vulnerabilities."

"Compute Engine customers should work with their operating system provider(s)
to download and install the necessary patches."_

[https://support.google.com/faqs/answer/7622138#gce](https://support.google.com/faqs/answer/7622138#gce)

------
newman314
Given that Google has been experimenting with POWER, does anyone know if POWER
is affected too?

~~~
ghaff
Apparently potentially: Red Hat Security statement
[https://access.redhat.com/security/vulnerabilities/speculati...](https://access.redhat.com/security/vulnerabilities/speculativeexecution)

~~~
newman314
IBM has confirmed that POWER is vulnerable.

See
[https://www.ibm.com/blogs/psirt/potential-impact-processors-...](https://www.ibm.com/blogs/psirt/potential-impact-processors-power-family/)

Linux patches are due to drop 1/9; AIX and IBM i patches are due to drop 2/12.

------
justincormack
"I have reviewed support case 4743634091 regarding what you're experiencing on
your instance. You're correct that what you are seeing is not the same issue
as what others are reporting. In the first correspondence from support they
correctly pointed out that the kernel in your instance is encountering an out
of memory (OOM) condition and made suggestions about how to adjust the
configuration within your instance to avoid the OOM processor killer from
kicking in.

The update that is being applied to instances that have scheduled reboot
maintenance can cause slight changes to system resources available to
paravirtualized instances, including a small reduction in usable memory. This
can cause smaller instances, like m1.medium, that run workloads that were
previously just fitting within the usable memory available to the instance to
trigger out of memory conditions. Adding a swap file (as no swap is configured
in your instance) or reducing the number of processes may resolve the issue on
your existing PV instance."

i.e. there is less memory on PV after the fix, and this machine was nearly out
of memory and is now OOMing. This does not indicate any significant
performance hit in normal cases.
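
A quick way to check whether an instance is in the situation the support reply describes (little headroom, no swap) is to look at `/proc/meminfo`. The sketch below is an assumption-laden illustration: the helper names and the 10% headroom threshold are made up, and `MemAvailable` is only present on newer kernels, so it falls back to `MemFree`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def memory_pressure(info, min_free_fraction=0.10):
    """Flag low headroom and missing swap; the threshold is an assumption."""
    total = info["MemTotal"]
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return {
        "low_memory": available < total * min_free_fraction,
        "has_swap": info.get("SwapTotal", 0) > 0,
    }

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(memory_pressure(parse_meminfo(f.read())))
```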

~~~
PuffinBlue
You've mis-read this section - it was answering another commenter who claimed
to be having the same issue but was in fact having a different issue as you
described in your quote.

The underlying problem the update caused was acknowledged by an AWS rep in the
thread who suggested moving to an instance with more CPU resources.

> If moving to a HVM based AMI is not easy, changing your instance size to
> m3.medium, which provides more compute than m1.medium at a lower price, may
> be a workaround.

------
stedaniels
Are we presuming this is the Intel patch being applied?

------
soccerdave
So this sounds like this Intel bug will not impact performance on HVM
instances.

~~~
panarky
That's not what it says.

Amazon suggests moving to higher-performance instances to offset the
performance hit from the fix.

HVM and m3 instances have higher performance for the same price, but they may
also have been degraded by the fix.

~~~
jamesjoethomas
I don't think that's right, Ctrl-F for "HVM" here:
[https://xenbits.xen.org/xsa/advisory-254.html](https://xenbits.xen.org/xsa/advisory-254.html)

In the HVM case an attacker can't generate hypervisor addresses because the
hypervisor runs in a separate address space, so HVM isn't vulnerable to the
most easily exploitable of the disclosed issues.

------
kreitje
We ran into this and finally made the switch this morning to an HVM AMI. The
performance has been much better.

We started seeing issues on Dec 23rd, with HDD reads and it magically went
away the night of Jan 1st. The cpu loads remained high though.

------
macintux
"As the notice points out, the update that is being applied is important to
maintain the high security and operational aspects for your instances."

Yeah, definitely Intel Inside.

~~~
yeukhon
AWS has always told customers they use both stock and custom Intel processors.

------
k-ian
For those who don't want to log in:

[https://i.imgur.com/f76tOvY.png](https://i.imgur.com/f76tOvY.png)

[https://i.imgur.com/MhXyT3g.png](https://i.imgur.com/MhXyT3g.png)

~~~
madez
Even simple .png pictures are not displayed on imgur.com without JavaScript.

~~~
Macha
imgur checks your Referer header, and if it can tell you've navigated to the
image directly (or there's no Referer header), it redirects you to the image
viewer + ads page instead of just serving the image file.

It's a pity, given the reasons imgur started, but it's the cycle of all image
hosts I guess.

~~~
leggomylibro
The site doesn't even perform its basic functions on mobile anymore.

Agreed, it's a shame, but such is image hosting.

------
tuna
PV vs HVM woes

