Hacker News
Degraded performance after forced reboot due to AWS instance maintenance (amazon.com)
302 points by woliveirajr on Jan 3, 2018 | 68 comments

"I guess you are essentially confirming that the instance maintenance was likely to be the reason for the major change to cpu usage, and that the only solution now is to change instance type."

"Of course I am not entirely happy about this. I bought a 3-year reserved instance 2 years ago, and now have to hope I can sell the remaining year for a reasonable amount (which may be a stretch given that I am apparently using legacy instance type), and then purchase new reserved instance after upgrading."

Sounds like all instance types can now be run in HVM mode (news to me, but the AWS tech in the thread made the claim). Converting to HVM is annoying but ultimately not that hard or risky. Since HVM mode appears to resolve much of the performance issue, this seems like a reasonable workaround.
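Whether a given instance is exposed is visible from its virtualization type. A small helper over the kind of instance description boto3's `describe_instances` returns (the `VirtualizationType` field is real; the recommendation strings are just the workaround from this thread, sketched for illustration):

```python
def suggested_action(instance):
    """Given one EC2 instance description dict (as found in boto3's
    describe_instances response), suggest the workaround discussed above."""
    if instance.get("VirtualizationType") == "paravirtual":
        return "convert AMI to HVM (or resize, e.g. m1.medium -> m3.medium)"
    return "already HVM; no conversion needed"
```

For example, `suggested_action({"VirtualizationType": "paravirtual"})` flags the instance, while an `"hvm"` instance is left alone.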

>Sounds like all instance types can now be run in HVM mode (news to me, but the AWS tech in the thread made the claim)


"All instance types support HVM AMIs."

This is a recent change.

(I work at AWS, not directly involved with this change and not speaking in any sort of official capacity, but can confirm you can use HVM on m1, etc, now)

Or migrate to google since they don’t have a broken discounting billing system

I'll stick with the vastly superior offering and swallow the occasional hiccup, but thanks.

Having used both, I'd say that AWS has a lot more "stuff" (services) and they tend to be more scalable and practical. Google tends to create fewer services that are (generally) rock-solid with some gems like BigQuery.

Fun Anecdote time:

6 months ago I needed 100 GPU machines for a bunch of batch processing.

I called my Google rep and contacted AWS and Azure at the same time.

Google struggled to give me enough instances and was only able to provide 10 at the time, with more on the way in other regions...

However, GCE wouldn't let me create an autoscaling group with GPU instances. Their tooling simply didn't support it. After much hassling of their support centers, my rep informed me they simply didn't support it yet, and he recommended I use AWS.

Azure never responded to my inquiries. Their UI and tooling are pretty lame too, and the service is expensive.

AWS increased my quota quickly and it worked great. Their support tends to be responsive if brief and biased towards making the problem go away by giving you what you want.

Disclosure: I work on Google Cloud.

If you had no history, our folks were probably being cautious in granting quota (you may have noticed that cryptocurrency has been going crazy). Feel free to request another increase, but I understand if you're not interested.

I don't understand your managed instance group thing though... Is that perhaps because you tried to create a regional instance group, and we only had GPUs in particular zones? (I saw a bug like that, conveyed an answer, and never heard back).

I got 10 or so in one region, 10 or so in another region, and 32 or so in Asia. It was all spread out. It really seemed like a capacity issue. This was in March 2017.

In fairness, Google was pretty transparent about the whole thing. It had a lot more personal touch than the other providers. Google set me up on a call with some of their engineers to discuss autoscaling. But at some point the communication broke down a bit, or maybe interest was lost in helping with my issue. It concluded with the realization that they didn't support managed instance groups with GPUs yet, and that was that.

Please please please convince someone to invest in your documentation. Your services could be great but people will never know because the documentation is either poor, spread out across multiple seemingly unrelated pages, or non-existent.

Already in process, but we hear you. The most helpful thing is to actually use the Send Feedback features throughout the docs and console (you need to be signed in) to highlight things that you find unclear. Often the initial documentation is written by the product manager in concert with our docs team, but it's easy to gloss over "obvious" things when you know it in your head.

I'd encourage you to take a(nother) look at Google Cloud. I've been using it for the past few months and have been very impressed.

There are a few things missing compared to AWS, but all the main stuff is there (and to my mind a lot more logical and integrated). Some key features blow AWS out of the water.

They've come a long way in the past few years.

I have used Amazon for many years happily, but I'm struggling to understand what you consider to be vastly superior? Google has some very solid product offerings.

Google Container Engine for instance is amazing. Granted Amazon has recently launched a couple of products that offer similar features, but as of mid last year running containers on Google felt like it was ahead of Amazon.

running containers on google compute or ec2 has always been the same.

gke is great, however. if you need to run kubernetes in production, today, i would not even slightly hesitate to yell to the corners of the earth to run gke. it makes me excited for eks!

i would also say that the new c5 instance type w/ ena is extremely nice. if they are unaffected by this, they're probably the fastest cpu-based instances you can get anywhere (admittedly i haven't checked... really anything else, but they're fast).

One thing is s3/gcs latency. At my last job we had a Go server running on ec2/gce doing occasional reads from s3/gcs (same single region). We started on ec2+s3 but later switched to gce+gcs for cost reasons (gcp only cost ~60% of what aws did for our setup). But we noticed that on aws, reading a single ~10KB file from s3 usually took <100ms, occasionally 100ms+, but rarely >200ms. On gcp it was usually >100ms, sometimes spiking to >1s, and rarely <100ms.
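The kind of comparison described above reduces to a few percentile figures. A minimal sketch of how to summarize such measurements (the helper and any sample data are for illustration, not real s3/gcs numbers):

```python
def latency_summary(samples_ms):
    """Summarize a list of GET latencies (ms): median, p99, and the
    share of requests over the 100ms / 1s thresholds discussed above."""
    s = sorted(samples_ms)
    n = len(s)
    pct = lambda p: s[min(n - 1, int(p * n))]  # simple nearest-rank percentile
    return {
        "p50_ms": pct(0.50),
        "p99_ms": pct(0.99),
        "over_100ms": sum(1 for x in s if x > 100) / n,
        "over_1s": sum(1 for x in s if x > 1000) / n,
    }
```

Feeding in per-provider samples then makes "usually <100ms" vs "sometimes >1s" concrete instead of anecdotal.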

Disclosure: I work on Google Cloud.

Yeah, the underlying system is optimized for throughput (though we're working on small file latency). AWS clearly has a small file optimization and caching that we don't (yet).

I often point people at Zach Bjornson's blogpost [1] since he compared all three providers and is a neutral third party.

[1] http://blog.zachbjornson.com/2015/12/29/cloud-storage-perfor...

"Vastly superior" could use some qualification. I find the gcloud tooling to be superior to awscli.

I would describe much of AWS and the supporting tooling as "ugly", sometimes hideously ugly, but the services tend to work well.

They clearly prioritize making things work over making them pleasant to use.

AWS Lambda Mapping Templates and S3 Bucket Policy are two things that come to mind as hideously ugly. But they work!

Google offers exactly the same reservations mechanism:


"Discounts apply to the aggregate number of vCPUs or memory within a region so they are not affected by changes to your instance's machine type."

Sure, if you're happy with global outages.

https://status.cloud.google.com/incident/compute/17003 - On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes.

https://status.cloud.google.com/incident/compute/16007 - On Monday, 11 April, 2016, Google Compute Engine instances in all regions lost external connectivity for a total of 18 minutes, from 19:09 to 19:27 Pacific Time.

Or, complete outages for a region: https://status.cloud.google.com/incident/compute/17008 - On Thursday 8 June 2017, from 08:24 to 09:26 US/Pacific Time, datacenters in the asia-northeast1 region experienced a loss of network connectivity for a total of 62 minutes.

AWS guarantees regional independence for all services, and EC2 guarantees availability-zone independence.

Unsure if this is related, but a few weeks back AWS paravirtualized instances were experiencing extreme clock skew and suffering frequent restarts. Our clocks were skewing about 8 seconds per hour, and the instances were getting swapped out at least once per day, sometimes more.

Switching to a HVM instance type resolved the problems.
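Drift of that magnitude is easy to quantify: compare the instance clock against a reference (e.g. an NTP server) at two points and extrapolate to seconds per hour. A rough sketch, with invented sample offsets:

```python
def skew_per_hour(t0_local, t0_ref, t1_local, t1_ref):
    """Estimate clock drift in seconds per hour from two
    (local, reference) timestamp pairs."""
    offset_change = (t1_local - t1_ref) - (t0_local - t0_ref)
    elapsed_hours = (t1_ref - t0_ref) / 3600.0
    return offset_change / elapsed_hours

# An instance whose clock gains 4s over a reference half hour is
# drifting at ~8 s/hour, matching the symptom described above:
rate = skew_per_hour(t0_local=0.0, t0_ref=0.0, t1_local=1804.0, t1_ref=1800.0)
```

Sampling the offset like this periodically is enough to catch skew long before NTP gives up stepping the clock.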

It's related. This was during the early versions of the security patch, which was causing performance and clock skew issues for PV instances (https://twitter.com/kainosnoema/status/941816042103324672). Since then AWS has mostly resolved these, but we've moved all our instances to HVM AMIs as well since they're apparently not affected.

Two possibilities:

1. Your distro has PTI/KAISER enabled and the reboot caused you to upgrade your guest kernel. Sucks to be running on an Intel CPU right now...

2. AWS implemented some kind of hypervisor-side mitigation for the VM attack that made VM entries and exits slower. Sucks to be running on an affected CPU (which seems to include AMD, too, but I could be wrong).

I am pretty sure the patch needs to be applied on the host kernel and not the guest.

This is not true. https://cloud.google.com/compute/docs/security-bulletins says:

> However, all guest operating systems and versions must be patched to protect against this new class of attack regardless of where those systems run.

It depends on what you try to fix. VM-to-VM is host only. But now it seems it also needs to be applied to guests, because applications inside the guests could abuse this as well.

from what i understand, it's both. if you don't protect the hypervisor, a guest can break through and read its ram. if you don't protect the guest kernel, your apps can break into your own OS and see its ram.

i could be wrong; if i am, clarification would be good!

Yes it is both now.

The host patch protects against vm-to-vm leaks, but you still need to protect your vm against application-to-application leaks.

When I wrote the other comment I did not have all the information we have now; it took me some time to read into it.

The host PTI patch does nothing whatsoever to protect from attacks initiated by a guest.

(Well, if host QEMU gets compromised, PTI protects the host kernel a bit.)

If I understand this correctly, paravirtualization in this scenario means that the host kernel is impacted by the KAISER patch's overhead AND the guest kernel is impacted by the KAISER patch overhead, and that these overheads are multiplicative. Could someone more knowledgeable than me explain and/or verify my assumption?

No, I don't think that's correct. KAISER/KPTI isn't enabled on the guest side on xen PV instances; the kernel explicitly disables PTI when it detects it is running as a Xen PV guest.

But since the host still has to perform some paging (possibly on behalf of the guest) in paravirt, it should affect guest performance. I think that snippet is paravirt guest code, so page tables don't need to be isolated twice.

So I think the host kernel's perf degradation will still affect it.

> So I think the host kernel's perf degradation will still affect it.

Oh, yes, definitely. I was just responding to "that these overheads are multiplicative." (I suspect additive rather than multiplicative was meant)

No, I actually meant multiplicative. I was worried that the worst case is 30% syscall overhead on virtualization host multiplied by 30% syscall overhead on virtualization guest.

Still, even if this mitigation is only enabled on Xen host and not on the guest, can this attack be then used inside the guest's userspace to read memory of the guest's kernel, as opposed to reading memory of the host's kernel?

The form of the overhead is doing a set amount of extra work per syscall. It's hard for me to picture any way that could multiply. The extra work doesn't change based on how long the syscall takes.
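The additive-versus-multiplicative distinction is easy to check with toy numbers. If PTI adds a fixed cost per kernel entry/exit at each level, rather than scaling the whole syscall, then two patched levels add two fixed costs instead of compounding two 30% factors (all figures here are invented for illustration, not measurements):

```python
BASE_US = 1.0   # hypothetical unpatched syscall cost, microseconds
EXTRA_US = 0.3  # hypothetical fixed PTI cost per kernel entry/exit

def additive(levels):
    """Fixed extra work per trap at each virtualization level."""
    return BASE_US + levels * EXTRA_US

def multiplicative(levels):
    """What a compounding 30%-per-level penalty would look like instead."""
    return BASE_US * (1 + EXTRA_US / BASE_US) ** levels
```

With two levels (host + guest), the additive model gives 1.6 vs 1.69 for the compounding one; the gap stays small because the extra work does not grow with the syscall's own duration, which is the point being made above.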

I’ve been wondering the same thing. I think it depends on how/if the hypervisor tries to “emulate” speculative execution on some calls or instructions. If it does and does so faithfully, then perhaps. My guess though is that such emulated functionality would be missing this bug.

Yeah this is due to the recent security bug related to speculative execution on CPUs. It causes a performance hit, and they have to force people who haven't restarted in a long time to restart now because it's around the official disclosure deadline.

First indications of early KAISER fix testing in the wild..?

(which might have also been a test to see how will customers react)

I think this is possibly the xen PV side of the fix. Linux' PTI mitigation isn't enabled for paravirt xen:

        if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
                pti_print_if_insecure("disabled on XEN PV.");
                return;
        }

Interesting. I took a look and this is the only hypervisor specific piece of code I can find in the patches.

I have wondered what the impact would be on hypervisors. Xen seems like they patched it in a way that removes the need for guests to mitigate, but would guests of other hypervisors get hit with the penalty twice in some cases?

> removes the need for guests to mitigate

Google says guests need to upgrade.

"Compute Engine customers must update their virtual machine operating systems and applications so that their virtual machines are protected from intra-guest attacks and inter-guest attacks that exploit application-level vulnerabilities."

"Compute Engine customers should work with their operating system provider(s) to download and install the necessary patches."


Note that this is only for PV guests, which most people don't use anymore...

Given that Google has been experimenting with POWER, does anyone know if POWER is affected too?

Apparently potentially: Red Hat Security statement https://access.redhat.com/security/vulnerabilities/speculati...

IBM has confirmed that POWER is vulnerable.

See https://www.ibm.com/blogs/psirt/potential-impact-processors-...

Linux patches to drop January 9; AIX and IBM i patches on February 12.

"I have reviewed support case 4743634091 regarding what you're experiencing on your instance. You're correct that what you are seeing is not the same issue as what others are reporting. In the first correspondence from support they correctly pointed out that the kernel in your instance is encountering an out of memory (OOM) condition and made suggestions about how to adjust the configuration within your instance to avoid the OOM process killer from kicking in.

The update that is being applied to instances that have scheduled reboot maintenance can cause slight changes to system resources available to paravirtualized instances, including a small reduction in usable memory. This can cause smaller instances, like m1.medium, that run workloads that were previously just fitting within the usable memory available to the instance to trigger out of memory conditions. Adding a swap file (as no swap is configured in your instance) or reducing the number of processes may resolve the issue on your existing PV instance."

i.e. there is less memory on PV after the fix, and this machine was nearly out of memory and is OOMing. This is not a reflection of any significant performance hit in normal cases.
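The failure mode support describes is just an arithmetic of headroom: a workload that previously fit with a little RAM to spare stops fitting once the update shaves usable memory, unless swap makes up the difference. A toy illustration (the numbers are made up, not actual m1.medium figures):

```python
def fits(workload_mb, usable_mb, swap_mb=0):
    """Does the working set fit in usable RAM plus swap?"""
    return workload_mb <= usable_mb + swap_mb

before_mb = 3750            # hypothetical usable RAM before the update
after_mb = before_mb - 64   # hypothetical small reduction after it
workload = 3700             # a workload that was "just fitting"

ok_before = fits(workload, before_mb)         # fit before the update
ok_after = fits(workload, after_mb)           # now triggers the OOM killer
ok_with_swap = fits(workload, after_mb, 512)  # swap restores the headroom
```

This matches the support reply above: add a swap file or shed processes, and the same instance runs fine again.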

You've mis-read this section - it was answering another commenter who claimed to be having the same issue but was in fact having a different issue as you described in your quote.

The underlying problem the update caused was acknowledged by an AWS rep in the thread who suggested moving to an instance with more CPU resources.

> If moving to a HVM based AMI is not easy, changing your instance size to m3.medium, which provides more compute than m1.medium at a lower price, may be a workaround.

That's a different person than the one that created the support thread though.

Are we presuming this is the Intel patch being applied?

So this sounds like this Intel bug will not impact performance on HVM instances.

That's not what it says.

Amazon suggests moving to higher-performance instances to offset the performance hit from the fix.

HVM and m3 instances have higher performance for the same price, but they may also have been degraded by the fix.

I don't think that's right, Ctrl-F for "HVM" here: https://xenbits.xen.org/xsa/advisory-254.html

In the HVM case an attacker can't generate hypervisor addresses because the hypervisor runs in a separate address space, so HVM isn't vulnerable to the most easily exploitable of the disclosed issues.

I don’t know enough about Xen to be sure, but I know typically your dom0 is a Linux kernel even (or especially) with HVM. If it’s been patched and is performing any paging on behalf of the guest, this will indeed affect performance of both paravirt and HVM instances.

We ran into this and finally made the switch this morning to an HVM ami. The performance has been much better.

We started seeing issues on Dec 23rd with HDD reads, and they magically went away the night of Jan 1st. The CPU loads remained high, though.

"As the notice points out, the update that is being applied is important to maintain the high security and operational aspects for your instances."

Yeah, definitely Intel Inside.

AWS has always told customers they use both stock and custom Intel processors.

Strange. I see the thread perfectly fine without logging in.

The AWS forums are really weird. If you don't have any AWS Console cookies (incognito window, for example), they load fine, but if they know you've got an AWS account, they require a login.

Sounds like Google forums.

25% bump in CPU usage from that graph, ouch.

It definitely shows how many people are running close to the limits, too. I have a couple of m1.mediums running ancillary workloads (e.g. rabbitmq) and it added something like ~0.5%.

What's more interesting is later in that thread, where a support staffer noted that one of the people reporting problems was running out of memory: they'd apparently been very close to the max, and available memory dropped slightly with the update. I had a similar problem: two test RDS instances suddenly failed into the invalid-parameters state, which turned out to be a years-old config for the InnoDB pool size being just a little too large after a reboot reduced the available RAM by just enough to matter.

Even simple .png pictures are not displayed on imgur.com without JavaScript.

imgur checks your Referer header, and if it can tell you've navigated to the image directly (or there's no Referer at all), it redirects you to the image viewer + ads page instead of just serving the image file.
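The decision being described is simple to sketch. This is not imgur's actual code, just the pattern: a direct navigation typically sends no Referer header, so such requests get bounced to the viewer page, while requests arriving via a link get the raw image:

```python
from urllib.parse import urlparse

def serve_raw_image(referer):
    """Return True to serve the raw file, False to redirect to the
    viewer page. Direct navigations usually send no Referer at all."""
    return bool(referer and urlparse(referer).scheme in ("http", "https"))
```

So `serve_raw_image(None)` redirects, while a request referred from another page gets the image, which is exactly the behavior the comment above observes.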

It's a pity, given the reasons imgur started, but it's the cycle of all image hosts I guess.

The site doesn't even perform its basic functions on mobile anymore.

Agreed, it's a shame, but such is image hosting.

I have no JS and I can see the images fine.

PV vs HVM woes
