
AWS EC2 T2 Instances Demystified: Don’t Learn the Hard Way - rtisdale
https://roberttisdale.com/aws-ec2-t2-instances-demystified-dont-learn-hard-way/
======
sudosteph
> You will likely find it is difficult or impossible to access the server to
> take any measures to solve the issue until the CPU credits have accrued.

Only if you don't want to enable t2 unlimited ->
[https://docs.aws.amazon.com/en_us/cli/latest/reference/ec2/m...](https://docs.aws.amazon.com/en_us/cli/latest/reference/ec2/modify-instance-credit-specification.html)

But really seems like it would be simple enough to rig something up that
would:

1. Trigger a CloudWatch alarm on insufficient T2 credits

2. Send an SNS notification to an SNS topic when triggered

3. Have a subscribed Lambda function enable T2 Unlimited

4. Have a subscribed email or Slack alert to be like "hey, something's up, we enabled T2 Unlimited for now"

Though CW alarms for this are limited to running every 5 minutes, so it could take a bit to notify, or you might need to set your threshold a bit high at first. I usually prefer to configure a CW event rule for this kind of thing, but I don't see any out-of-the-box support for that.
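The Lambda piece of the steps above can be sketched roughly as follows. This is a hedged sketch, not a tested implementation: the event shape assumes the standard CloudWatch-alarm-via-SNS payload, and the Lambda's execution role is assumed to have the `ec2:ModifyInstanceCreditSpecification` permission. The `modify_instance_credit_specification` call is the boto3 counterpart of the CLI command linked above.

```python
# Hypothetical Lambda subscribed to the SNS topic: parse the alarming
# instance out of the CloudWatch alarm payload and flip it to unlimited.
import json

def instance_id_from_alarm(event):
    """Pull the InstanceId dimension out of an SNS-delivered CW alarm."""
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    for dim in msg["Trigger"]["Dimensions"]:
        if dim["name"] == "InstanceId":
            return dim["value"]
    return None

def handler(event, context):
    import boto3  # imported here so the parser above is testable offline
    instance_id = instance_id_from_alarm(event)
    if instance_id:
        boto3.client("ec2").modify_instance_credit_specification(
            InstanceCreditSpecifications=[
                {"InstanceId": instance_id, "CpuCredits": "unlimited"}
            ]
        )
    return instance_id
```

The alarm itself would watch the `CPUCreditBalance` metric with a low threshold, per step 1.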

At least they explain the math pretty decently for predicting the usage:
[https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-insta...](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances-monitoring-cpu-credits.html)
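The math in those docs boils down to a simple earn/spend balance. A minimal sketch, using the published t2.micro figures (6 credits earned per hour, i.e. a 10% baseline, and a 144-credit balance cap; one credit = one vCPU-minute at 100%):

```python
# Rough model of a t2.micro's CPU credit balance over time.
EARN_PER_HOUR = 6.0    # t2.micro earns 6 credits/hour
MAX_BALANCE = 144.0    # 24 hours' worth of earned credits

def balance_after(hours, util, start=MAX_BALANCE):
    """Credit balance after running `hours` at `util` (0.0-1.0 of one vCPU)."""
    spend = util * 60.0           # credits spent per hour
    net = EARN_PER_HOUR - spend   # net change per hour
    return max(0.0, min(MAX_BALANCE, start + net * hours))

# At 10% utilization (the baseline) the instance breaks even:
print(balance_after(10, 0.10))   # 144.0
# At 50% utilization it drains 24 credits/hour:
print(balance_after(3, 0.50))    # 72.0
```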

~~~
jweir
Terraform just released support for the unlimited flag via the
'credit_specification' field

[https://github.com/terraform-providers/terraform-provider-aw...](https://github.com/terraform-providers/terraform-provider-aws/blob/master/CHANGELOG.md#1160-april-25-2018)

~~~
sudosteph
And CloudFormation users can specify it here:
[https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-instance-creditspecification.html)

Good to see it's becoming more common knowledge though.

------
sudosteph
Sorry, I already commented but did want to recognize an important point OP
made.

> The examples above are contrived but it does illustrate the danger of not
> properly architecting your AWS system and monitoring it once it’s in place
> (especially if you’re using T2 instances).

> I often see this issue with a website that receives increases in the
> baseline traffic over long periods of time with occasional spikes. If the
> instance type is not changed and sized appropriately, this eventually leads
> to all the accrued CPU credits being used paired with downtime and a
> significant amount of head scratching.

This is so true. Proper architecture and planning up front will save all kinds of headaches. Anyone serious about using EC2 needs to understand and use autoscaling groups with ELBs to manage scaling. If production availability is important, and you don't want to be bugged to fix something at 3 AM, ASGs are a nice way to mitigate potential AZ issues (you can do that even with an ASG with a max size of 1!). The biggest problem I always saw was that people would not take the time to test what happens when an ASG instance is terminated. Also, you can't be SSHing in to manually change things and expect those changes to stick around.

Proper monitoring and alerting is of course the other side of this.

~~~
sudhirj
If I remember correctly, just setting autoscaling based on CPU usage already works fine and does the right thing with T2 instances as well - if you don't have enough credits to drive your load, your CPU is going to be stuck at 100%, irrespective of credit balance or instance size. So the common-sense method of scaling up at 70%/80% CPU capacity should already work fine for any instance, no special knowledge required.

~~~
sudosteph
If your load is still consistently high enough, it is possible to be caught in
an ASG cycling situation with t2s that can affect your application.

Ex:

- Say your ASG is set to a max of 4 instances, min of 2

- You get a sustained increase in traffic and scale up to 4 instances

- Sustained 70% CPU across 4 instances consumes more credits than are earned

- As instances run out of CPU credits, they'll be pegged at 100% and stop responding correctly to TCP health checks

- Autoscaling will see that an instance has passed the threshold for too many failures, stop routing traffic to it, terminate it, and add another

- The other instances fail at about the same time for the same reason (though hopefully this is staggered) and also get terminated and replaced

- Each time an instance gets pegged, the ELB routes its traffic to the remaining instances, causing them to consume their remaining CPU credits faster

Now, it's unlikely that the timings would work out so that all 4 instances are concurrently unavailable, and launching new instances is relatively quick. However, it's still possible that an instance that runs out of credits is mid-request and thus sends an error to the user. It would also create a lot of monitoring noise.
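A back-of-envelope check on the scenario above, assuming t2.micro numbers (6 credits earned/hour, a 10% baseline): at a sustained 70%, every instance in the fleet drains credits regardless of how many the ASG runs, so scaling out only buys time unless it drops per-instance load below the baseline.

```python
def hours_until_empty(balance, util, earn_per_hour=6.0):
    """Hours until a T2 instance at `util` (0.0-1.0) exhausts `balance` credits."""
    net = earn_per_hour - util * 60.0   # credits gained (or lost) per hour
    if net >= 0:
        return float("inf")             # at/below baseline: never empties
    return balance / -net

# Four t2.micros at a sustained 70%, each starting near a full balance (144):
print(hours_until_empty(144, 0.70))   # 4.0 hours per instance
```

So in this hypothetical, the cycling would begin roughly four hours after the traffic increase.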

The other thing I could see being problematic is that, depending on your ASG termination policy, scale-in is not guaranteed to terminate the instance with the fewest remaining CPU credits. So you may scale down with traffic closer to 40% CPU and still run out of credits, because the instance with the most credits was the one terminated on scale-in.

Edit: And I should add that the scale-in scenario can be especially problematic if a cooldown period is configured (the default is 5 minutes). So if the group just scaled back in to 2 instances, and those two are close to max CPU and get pegged, it will not scale back up for 5 minutes. It will still attempt to replace them for health check failures, but once you're down to a scenario where only 2 are up, your likelihood of a full application outage goes up quite a bit (even if it's only for 2-3 minutes).

So I would say that even if using an ASG, special knowledge is required for
effective use of T2 instances.

------
cyberferret
The post raises some good points, but in reality, I've been using T2 instances
on my projects for a long time with good results.

It really comes down to your use cases. If you have an infrequently used web
app, or have a site that needs to run some small periodic tasks, then T2
instances are fantastic for this sort of thing, and they are super cheap too.

I normally run Sinatra/Padrino based web apps, and keep a close eye on my CPU
credits via AWS's monitoring tools, and I have never gotten even close to
using up the allocated credits.

~~~
2RTZZSro
If you are not looking for an exotic instance type or instance-type flexibility, Vultr provides superior servers at every comparable EC2 price point. Additionally, Vultr provides attachable external storage (so-called "block storage") at a very low rate. Digital Ocean is great as well. Keep your eyes peeled for deals on [https://lowendbox.com/](https://lowendbox.com/)

I have nothing to do with any of these services.

~~~
cyberferret
I do use Digital Ocean for some projects, and have been looking at Vultr
recently as well.

However, most of my projects are reliant on the entire AWS suite, and nearly
all my projects use RDS for database storage and SES for handling emails etc.,
not to mention DynamoDB, S3, VPC, Elastic Beanstalk, CodeDeploy, CodePipeline
etc. etc. which makes it easier to stay with the one provider and choose their
cheapest instances to do the heavy lifting.

~~~
StudentStuff
Maybe I'm just an oddball, but I've found significant use in setting up shared
infrastructure including a mailserver, gitlab, VPN server and similar, with
things like CodeDeploy replaced by a simple & reliable git hook.

Whenever I look at the AWS ecosystem, I see confusing and unclear
documentation combined with pricing models that ensure there are many
different ways my credit card could end up with significant surprise charges,
and a vendor that has repeatedly shipped me broken & fake items in the past.
What is the incentive to use AWS?

~~~
rtisdale
I agree that AWS pricing and documentation can be incredibly obtuse in some
cases and that it's often overkill for small personal projects.

Vultr and Digital Ocean are incredibly competitive from a performance per
dollar perspective, while still offering a high level of flexibility.

With that said, AWS does offer some features that don't have any equivalent
open source solutions (at least that I'm aware of).

The most apparent and commonly used example of this is Autoscaling Groups, something that is difficult to replicate on Digital Ocean or Vultr unless you intend to write custom code (I believe OpenStack or CloudStack may offer something similar, but that's a whole other can of worms).

Cloudwatch is also an incredibly powerful solution for correlating logs and
metrics that can be difficult to replicate without rolling your own solution,
which can become quite complex (generally combining something such as
Prometheus + ELK + Grafana).

In short, AWS allows you to utilize some incredibly powerful tools without
nearly as much management overhead as rolling your own solution.

This is the area where I think AWS excels.

~~~
somecallitblues
I would add Route 53, S3, and the ease of setting up security groups as killer features. Plus the ability to take a snapshot and boot an instance from it, an awesome API, Elastic IPs, and reliability. I've had 8 instances across 3 data centres for 5 years, with a single hardware failure where I've had to take action. Their shit just works.

~~~
grey-area
DO has most of that I think - DNS, Object Storage, reuse snapshots, API,
floating IPs, load balancers, reliability on par with AWS, and simpler
pricing, but no auto-scaling that I'm aware of.

------
antoncohen
Why use t2 instances at t2.large or larger? The price difference to go to m5 instances is negligible[1], and with m5 you don't have to worry about CPU credits and you get better networking.

[1] e.g., Linux on-demand for t2.xlarge comes to $135.49/month, whereas m5.xlarge is $140.16/month. I seem to remember that the last time I was doing reserved instance purchasing, the RI cost for m4 instances was actually less than for equivalent t2 instances (I don't see it now, so maybe my memory is wrong).

------
timjulien
As I myself learned the hard way, T2 are awful for production websites with
sustained high traffic. The author points out the biggest issue:

“About 40 minutes into Hour 6 your instance no longer has any credits
remaining. At this point, you’re now limited to the baseline performance of
the instance, in this case, 10% of the vCPU.

At this point your application grinds to a halt.”

Agreed. As terrible as that is, it's actually worse than that in my experience, because even while you have positive CPU credits, you can't spend them down smoothly to achieve high sustained CPU - the scheduler gives you something more like a sine wave of performance. That is mystifying to say the least, and it leads me to my final point about why T2 is so awful: it violates the principle of least astonishment. Why would anyone expect that a CPU is limited to a small fraction of its output? T2 should not be categorized in the "general computing" AWS instance family. It should be in a family all its own called "burstable" that explains the major difference between it and the general computing family.

~~~
_pmf_
What would be the purpose (for the layman)? Handling spikey event ingress?

~~~
sudosteph
They're still good for:

- non-production dev servers

- executing recurring scheduled tasks with a known estimate of CPU usage (cron jobs)

- potentially, build servers that aren't overly busy
I wouldn't use them for anything that needs production reliability though.

------
benmmurphy
You get the same problem with gp2 EBS storage when your volume's baseline performance is below the maximum. If you have sustained peak I/O, the disk will eventually take a big performance drop. I guess people are less likely to hit this wall than the T2 CPU wall. It is worth checking how long your application can run at peak load before it gets throttled, and how it behaves when it is throttled.
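The arithmetic is the same shape as the T2 credit math. A minimal sketch using the published gp2 figures (a 5.4-million-I/O credit bucket, a baseline of 3 IOPS per GiB, and a 3,000 IOPS burst ceiling):

```python
# How long a gp2 volume with a full credit bucket can sustain max burst.
BUCKET = 5_400_000   # I/O credits in a full gp2 burst bucket
BURST_IOPS = 3000    # gp2 burst ceiling

def gp2_burst_seconds(volume_gib):
    """Seconds a full gp2 volume of `volume_gib` GiB can sustain 3,000 IOPS."""
    baseline = 3 * volume_gib          # credits earned per second
    if baseline >= BURST_IOPS:
        return float("inf")            # >= 1,000 GiB: baseline covers burst
    return BUCKET / (BURST_IOPS - baseline)

# A 100 GiB volume (300 IOPS baseline) bursting flat out:
print(gp2_burst_seconds(100) / 60)   # ~33.3 minutes
```

(For simplicity this ignores the 100 IOPS baseline floor for very small volumes.)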

------
01walid
I'm wondering where Amazon Lightsail stands in all of this?

Is it exempt?

~~~
laCour
Lightsail instances are just T2 instances, and they are not exempt from CPU
credits.

From
[https://aws.amazon.com/lightsail/faq/](https://aws.amazon.com/lightsail/faq/)

> Lightsail uses burstable performance instances that provide a baseline level
> of CPU performance with the additional ability to burst above the baseline.

------
appdrag
I'm really happy with T2 instances, and T2 Unlimited have solved the issue
described in this article!

~~~
rtisdale
Howdy appdrag.

I've added a note about T2 unlimited to my article, I will likely be writing
up another article in the near future explaining T2 Unlimited concepts.

Thanks for the note!

------
rosege
No mention of the newish T2 unlimited setting either

~~~
rtisdale
Author here, thanks for the suggestion!

I just added a note about T2 Unlimited and will see about writing a follow up
post explaining the new T2 Unlimited concepts.

EDIT: Added the note and gave a shout out, thanks again :).

~~~
rosege
Great thanks - sent your article to some colleagues who need to know this
stuff :-)

------
pritambarhate
Has anyone noticed that even after enabling T2 Unlimited it takes a few minutes for the application to become responsive again? I have an application with a spiky load. I have turned on T2 Unlimited for this server, but I still noticed the application becoming unresponsive for a few minutes under continuous load.

Didn't get a chance to get to the bottom of this yet. Did anybody else notice
something similar?

------
inopinatus
I’ve found them really great for a memory-bound Rails app that needs very
little CPU. As ever choosing the right instance type is a matter of mechanical
sympathy.

I do turn on t2 unlimited mode in case some horrible bug pegs the CPU.

------
budhajeewa
Any idea whether this is the same behavior on comparable compute instances in GCP as well?

