
Things You Should Know About AWS - jpmc
http://highscalability.com/blog/2013/11/5/10-things-you-should-know-about-aws.html
======
falcolas
> Stripe your RDS disks for better performance

This is a fun hack to perform, but it opens you up to another problem: latency
on any of the striped EBS volumes will lag out the entire striped array.

Attempts to mitigate this problem (including setting up RAID 10) work in the
short term, but it really is easier to just purchase a guaranteed-IOPS volume
if you want to run a database on EC2.
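For reference, the striping trick on a plain EC2 host (RDS does the equivalent internally) is usually done with mdadm. A minimal sketch, assuming four attached EBS volumes at hypothetical device names:

```shell
# Assumed device names; check what EC2 actually attached (lsblk).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.ext4 /dev/md0
mount /dev/md0 /data
# Caveat from above: I/O touching one lagging member stalls the whole array.
```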

~~~
DrJ
RDS is the hosted sql database solution.

When RDS stripes your disks (I have seen it kick in when I jumped from 100GB to
300GB) you benefit from faster IOPS (I have seen 700 IOPS improve to 2500
IOPS).

PIOPS (Provisioned IOPS) is the way to go in the long run though.
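A provisioned-IOPS EBS volume can be created directly with the AWS CLI; a sketch, where the size, IOPS figure, and availability zone are placeholders:

```shell
# io1 volumes must keep IOPS within a fixed ratio of volume size
# (10:1 at the time of writing), so 2000 IOPS needs at least 200GB.
aws ec2 create-volume --volume-type io1 --iops 2000 --size 200 \
    --availability-zone us-east-1a
```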

~~~
falcolas
> you benefit from faster IOPS

Yes, you do.

However, if a single EBS volume backing the RDS instance lags out (as EBS
volumes in AWS are wont to do), you lose the IOPS.

The key point is that as the number of EBS volumes you depend on goes up, _the
chances of this happening_ go up correspondingly.

Striping EBS volumes works, but I would not depend on it for a production
environment.

------
vosper
This is great and all, but I'd love to see a community wiki / discussion area
for AWS tips, tricks, and gotchas. Something like Quora crossed with
Wikipedia, just for AWS.

Is anyone aware of such a thing? If not, any advice for starting one?

~~~
lifeformed
Sounds like something ideal for a StackExchange site.

~~~
vosper
My problem with StackExchange sites is that they're not that good as a
reference. There are lots of great questions and answers (and lots of
repetition thereof) but as a place to collate community knowledge about a
topic they fall short.

Something like a wiki might be better, I suppose.

------
talonx
These are more "tricks on optimizing AWS usage" than things to know "about"
AWS.

------
jwilliams
"9) Use Virtual Private Cloud (VPC) from the start"

This is now a no-brainer. New registrations in certain zones will kick you
into a basic VPC from the get-go.

The only inbound should be via an ELB (HTTP/HTTPS) and a non-DNS-resolvable
SSH bastion/NAT host (m1.small is more than enough). Your bastion is the only
host that is on the public Internet.

Setting up a bastion is reasonably straightforward. There is an AMI that will
do it all for you. Just make sure you've got "src/dst check" turned off in the
EC2 panel for that server. Just that tip can save you hours of hair-tearing.
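The "src/dst check" toggle can also be flipped from the CLI instead of the panel; a sketch with a placeholder instance ID:

```shell
# NAT/bastion hosts forward traffic not addressed to themselves, so the
# EC2 source/destination check must be disabled or packets get dropped.
aws ec2 modify-instance-attribute --instance-id i-0123abcd \
    --no-source-dest-check
```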

Outbound is via the bastion, which you can lock down to certain protocols via
VPC security groups. I limit to HTTPS (no HTTP!).

The bastion should use ssh keys only (no username/password). Put in place
fail2ban on the bastion. You can also add a firewall rule that backs off
repeated failed SSH attempts. This all but nukes brute-force attacks. Also,
make sure you patch regularly.
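The firewall back-off rule can be done with the iptables "recent" module; a sketch that drops any source opening more than 4 new SSH connections within 60 seconds:

```shell
# Record each new SSH connection attempt per source address...
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
         -m recent --set --name SSH
# ...and drop sources that hit 5 or more new connections in 60 seconds.
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
         -m recent --update --seconds 60 --hitcount 5 --name SSH -j DROP
```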

I go as far as to have separate keys for the bastion and then the hosts, but
sensible policies should apply here (e.g. passphrase on your keys please).

Keep the bastion superuser login to a very select group. You can quickly
remove employees' access by taking them out of the bastion. If you're in panic
mode, you can turn off the bastion and isolate your network until you can
regroup (similarly with the ELB for web-based attacks).

This is a pretty sound foundation for a secure setup.

~~~
joevandyk
Why not use a vpn on the bastion?

~~~
jwilliams
I guess you could, but I don't see it being that far different from SSH. One
mild plus of SSH is that you don't have addressing problems (e.g. a WiFi
network colliding with your AWS one).

------
oakwhiz
>Use ZFS and RAIDZ with EBS

That's a really cool idea, and I'm curious to see what kind of performance
losses will occur during normal usage.
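For the curious, the pool itself is nearly a one-liner; a sketch assuming four attached EBS volumes at hypothetical device names:

```shell
# raidz survives the loss (or lag) of one member; lz4 compression often
# reduces the amount of EBS I/O needed in the first place.
zpool create tank raidz /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
zfs set compression=lz4 tank
```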

------
leftnode
What is the 'aws' command in example #2? Is that a Ruby/Python/other script
released by Amazon or a 3rd party?

~~~
jeffbarr
That's the AWS CLI (Command Line Interface), available at
[http://aws.amazon.com/cli/](http://aws.amazon.com/cli/) .
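After an `aws configure` to store credentials, its s3 subcommands cover most of what s3cmd does; a quick sketch (bucket names are placeholders):

```shell
aws configure                                     # prompts for keys and region
aws s3 ls                                         # list buckets
aws s3 sync ./local-dir s3://my-bucket/prefix/    # rsync-style upload
aws s3 cp s3://my-bucket/key ./file               # download a single object
```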

~~~
oulipo
Is there a huge difference with s3cmd?

~~~
vacri
It's more of a swiss army knife. It's also still pretty new - the aws
documentation misses out on some things for it (like autoscaling) and it's
still in development to some degree, but it means you have to remember one
command with one auth, instead of a handful (s3cmd, boto, others I can't
recall).

~~~
DrJ
awscli actually depends on botocore, so in some sense awscli is a fork of
boto.

------
ademarre
I don't hear enough people talking about vendor lock-in with AWS, or any other
cloud provider for that matter. Too many of us are building our company's
infrastructure on AWS with little regard for the cost of switching.

I try to stick with Amazon's IaaS offerings, only employing PaaS products when
they offer considerable advantages over anything I could roll economically on
bare infrastructure.

Ask yourself, "in terms of time or money, what would it cost me if this cloud
product were discontinued without notice?"

AWS might lose a patent battle, get shut down by the government, or get
crippled by hackers or a long-term service interruption. If that happened, how
long would you be down while you struggle to get up and running at another
provider? What if Amazon raises their prices to the point where you want to
change cloud providers? Would the cost of switching be prohibitive?

------
anan0s
Actually, I was wondering if there are any experiences/benchmarks with
Amazon's high-performance instances.

Has anyone already used them?

~~~
vosper
I am using a number of high-performance instances - cc2.8xlarge machines for
EMR jobs, and cr1.8xlarge machines for analytics databases. From my testing:

\- Running EMR jobs on cc2.8xlarge machines as spot instances is a great way
to get a LOT of compute power very cheaply. Because our jobs are periodic we
run both Core and Task as spots and simply retry the job if our machines get
terminated. I did a lot of benchmarking and found that a small number of
cc2.8xlarge machines out-performs and is cheaper than a large number of lesser
instances (and I tried most of the lesser machines). In us-west-2 it's very
uncommon to lose our instances, unlike us-east-1 which has major price
fluctuations (this is true for all types of spot instance).

\- The cr1.8xlarge has fantastic performance, relative to the rest of the AWS
machines. It's also very expensive compared to the cost of hardware or a
similar solution on another cloud provider. Since we're fully integrated with
AWS and don't want to run our own hardware we're sucking up the cost for now,
but it's definitely a sore point in our budget. The cr1.8xlarge is also
all-around a better machine than the hi1.4xlarge, which has a lot of disk but
is pitiful in terms of CPU.
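The regional price-fluctuation difference mentioned above is easy to verify yourself; a sketch querying spot price history via the AWS CLI (the region and start date are just examples):

```shell
aws ec2 describe-spot-price-history --region us-west-2 \
    --instance-types cc2.8xlarge --product-descriptions "Linux/UNIX" \
    --start-time 2013-11-01T00:00:00
```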

~~~
mdellabitta
One thing to note with EMR: you still pay 25% of the _ondemand_ price as
overhead to use EMR. If you're bringing up and turning off clusters all the
time, it's probably worth it, but you might want to look into using Whirr
instead.

~~~
vosper
You make a good point about the EMR charge - that's easy to overlook.

I took a look at Whirr [1] but I don't see how having a cloud-agnostic
platform helps - are there really alternatives to EMR out there? Can they give
me 500+ cc2.8xlarge equivalent machines on-demand but at spot prices?

[1] Assuming this is the Whirr to which you refer:
[https://whirr.apache.org/](https://whirr.apache.org/)

~~~
mdellabitta
Whirr just uses the AWS APIs to provision a cluster for you, but then you're
getting a cluster built on EC2 instances rather than EMR, so you don't pay
that EMR overhead. You can choose spot or ondemand instances. If your spots
get reaped, I think it would fail pretty similarly to an EMR cluster built on
spots. I have no idea how quickly it could provision a 500 node cluster,
however.

~~~
vosper
Ahh, I see - I thought it was about being able to move my EMR jobs to other
providers. Running on EC2 without the EMR charge would be nice, and assuming
there's no special-casing of spot requests for EMR (I don't know of any; the
spot instance requests show up like normal ones) then it may well be able
to get me that many instances.

Thanks for the tip!

------
ebaxt
An important gotcha regarding CloudFormation - there is no way to recover from
a failed rollback (except maybe by contacting Amazon support). So it's
basically only safe for the initial setup of resources.

~~~
ojiikun
Failed rollbacks are definitely an exception, not a norm, and something
support should know about!

------
ideaoverload
The description of the noisy neighbor problem in #6 lacks some depth.

The AWS noisy-neighbor problem is very often misunderstood. CPU steal time
under Linux does NOT mean that somebody is stealing your CPU. It simply means
that you wanted to use the CPU and the hypervisor gave it to another instance
instead. This may happen because you have exceeded your quota, or because the
scheduling algorithm selected another pending instance at that very moment and
will give the CPU back to you a bit later. In the end, in both cases, your
instance gets its fair share.
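Steal time is visible as the 8th value on the `cpu` line of /proc/stat (field order: user, nice, system, idle, iowait, irq, softirq, steal, ...). A small sketch that turns it into a percentage:

```shell
# Compute steal as a percentage of all CPU time from a /proc/stat "cpu" line.
# $9 is the steal counter; the sum over $2..$NF is total jiffies.
steal_pct() {
  echo "$1" | awk '{ total = 0; for (i = 2; i <= NF; i++) total += $i
                     printf "%.1f", 100 * $9 / total }'
}
# On a live host: steal_pct "$(grep '^cpu ' /proc/stat)"
steal_pct "cpu 100 0 100 700 0 0 0 100 0 0"   # prints 10.0
```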

Great detailed explanations of steal time:
[https://support.cloud.engineyard.com/entries/22806937-Explanation-of-Steal-Time](https://support.cloud.engineyard.com/entries/22806937-Explanation-of-Steal-Time)
and: [http://www.stackdriver.com/understanding-cpu-steal-experiment/](http://www.stackdriver.com/understanding-cpu-steal-experiment/).
The latter article is mentioned by the OP but seems not to have been fully
read/understood.

Why does killing and restarting an instance help? It likely moves the instance
to a different hardware node with less active neighbors. When your neighbor is
not active, your instance can use your neighbor's idle CPU cycles! You sort of
become the noisy one. Still, the hypervisor will throttle you once the
neighbor starts to fully utilize his CPU quota, and you are back to square
one.

Amazon does not oversubscribe CPU according to their CTO:
[http://itknowledgeexchange.techtarget.com/cloud-computing/amazon-does-not-oversubscribe/](http://itknowledgeexchange.techtarget.com/cloud-computing/amazon-does-not-oversubscribe/)

Amazon specifically states that t1.micro instances do not guarantee CPU
performance: "Micro instances are a very low-cost instance option, providing a
small amount of CPU resources. Micro instances may opportunistically increase
CPU capacity in short bursts when additional cycles are available. They are
well suited for lower throughput applications and websites that require
additional compute cycles periodically, but are not appropriate for
applications that require sustained CPU performance."

While CPU sharing is pretty well documented, the noisy-neighbor problem still
exists for network and disk resources shared by multiple instances on the same
hardware node. The only way to detect these problems is to track network
throughput/loss rate for the network and IO stats for the disk.
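The standard sysstat tools expose all three metrics; a sketch of what to watch (5-second sampling interval is arbitrary):

```shell
mpstat 5        # %steal column: CPU time taken by the hypervisor
iostat -dx 5    # await / %util per device: EBS latency trouble shows up here
sar -n DEV 5    # rxkB/s, txkB/s per interface: network throughput
```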

You are guaranteed to avoid the noisy-neighbor CPU problem by using AWS
dedicated instances:
[http://aws.amazon.com/dedicated-instances/](http://aws.amazon.com/dedicated-instances/)

I work for an APM (Application Performance Management) vendor, so I have no
business praising AWS.

[Edit:spelling and clarity]

[Edit2: changed CPU scheduling description]

~~~
JulianWasTaken
Why does this happen even when the instance isn't using all of its CPU?

~~~
ideaoverload
>Why does this happen even when the instance isn't using all of its CPU?

Probably the hypervisor does not assign the CPU every time it is requested,
but it still manages to assign as much as needed, because in the end there is
some idle time left.

~~~
JulianWasTaken
We've seen, for example, that if we have 5 instances running the same app, one
instance potentially uses 20% more CPU with tons of CPU steal time, but none
are using 100%: the "normal" ones use ~40% and the one with tons of steal time
spikes between 60% and 70%.

But rebuilding the instance brings them all in line.

------
darkarmani
Does anyone understand the point they are making in #9 about VPC?

Are they suggesting using HAProxy in your public subnet and ELBs in your
private subnets? Is this their reference to avoiding using a NAT box (actually
PAT)? Don't you still need a HAProxy box in each of your AZs?

~~~
falcolas
I'm not sure, but I read it as putting an ELB in front of your HAProxy
instances (for high availability), and using HAProxy to do the intelligent
load balancing (which would address the "sticky" argument put forth by the
author).

------
victor_haydin
One more nice post on this topic:
[http://www.elekslabs.com/2013/11/aws-10-things-youre-probably-doing.html](http://www.elekslabs.com/2013/11/aws-10-things-youre-probably-doing.html)

------
simonebrunozzi
It looks like this post was partly inspired by a recent presentation that I
made:
[http://www.slideshare.net/simone.brunozzi/5-thingsyoudontkno...](http://www.slideshare.net/simone.brunozzi/5-thingsyoudontknowaboutaws)

