
AWS Tips I Wish I'd Known Before I Started - richadams
http://wblinks.com/notes/aws-tips-i-wish-id-known-before-i-started/
======
rkalla
Fantastic list with much more depth than I expected. Some surprises that
others might be interested in from this article and comments below:

    
    
      [1] Keeping buckets locked down and allowing direct client -> S3 uploads
      [2] Using ALIAS records for easier redirection to core AWS resources instead of CNAMES.
      [3] What's an ALIAS?
      [-] Using IAM Roles
      [4] Benefits of using a VPC
      [-] Use '-' instead of '.' in S3 bucket names that will be accessed via HTTPS.
      [-] Automatic security auditing (damn, entire section was eye-opening)
      [-] Disable SSH in security groups to force you to get automation right.
    
    

[1]
[http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlU...](http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html)

[2]
[http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Cre...](http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/CreatingAliasRRSets.html)

[3] [http://blog.dnsimple.com/2011/11/introducing-alias-
record/](http://blog.dnsimple.com/2011/11/introducing-alias-record/)

[4] [http://www.youtube.com/watch?v=Zd5hsL-
JNY4](http://www.youtube.com/watch?v=Zd5hsL-JNY4)
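For tip [1], direct-to-S3 uploads boil down to signing a URL server-side and handing it to the client. A rough sketch using the legacy V2 query-string signing scheme (no SDK required; bucket, key, and credential names here are made up, and newer SigV4 signing is preferred these days):

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def presign_put(bucket, key, access_key, secret_key, expires_in=300):
    """Build a V2 query-string-authenticated PUT URL for a direct upload."""
    expires = int(time.time()) + expires_in
    # Canonical string S3 expects for a presigned PUT under V2 signing
    string_to_sign = f"PUT\n\n\n{expires}\n/{bucket}/{key}"
    signature = base64.b64encode(
        hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()
    return (f"https://{bucket}.s3.amazonaws.com/{key}"
            f"?AWSAccessKeyId={access_key}"
            f"&Expires={expires}"
            f"&Signature={quote(signature, safe='')}")

url = presign_put("example-uploads", "incoming/photo.jpg",
                  "AKIAEXAMPLE", "not-a-real-secret")
```

The client can then PUT the file to that URL directly, without ever holding your AWS credentials, and the bucket itself stays locked down.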

~~~
jamiesonbecker
I like SSH. But I'm the founder of Userify ;)
[http://userify.com](http://userify.com)

Also, S3 buckets cannot scale infinitely. This is a huge myth
[http://aws.typepad.com/aws/2012/03/amazon-s3-performance-
tip...](http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-
seattle-hiring-event.html)

------
mbreese
I'd also add to the list - make sure that AWS is right for your workload.

If you don't have an elastic workload and are keeping all of your servers
online 24/7, then you should investigate dedicated hardware from another
provider. AWS really only makes sense ($$) when you can take advantage of the
ability to spin up and spin down your instances as needed.

~~~
AaronBBrown
AWS makes a lot of sense in a 24/7 environment, particularly when you are a
new startup and don't have enough information (or capital!) to make educated
server purchases.

~~~
_zen
> particularly when you are a new startup and don't have enough information
> (or capital!) to make educated server purchases.

I doubt there are many founders who are technically informed enough to know
about Amazon Web Services, but don't know about the other big 3 (Digital
Ocean, Linode, Rackspace). If you truly don't, then you must not be a tech
company, and I have a hard time believing a non-tech company without any
technical founders would even know about AWS.

~~~
AaronBBrown
DO, Linode, and Rackspace have lower bottom line costs, but a (much) smaller
feature set which means more operational work. Especially when kicking things
off, developer and operational time is often far more valuable than the cost
of the servers.

~~~
mbreese
Yeah, but if you start out using too many of the AWS tools, you're far more
likely to get trapped in the AWS infrastructure and end up paying
significantly more in the long term (which is what they want!). I'm not saying
that AWS doesn't have useful features, but you need to appreciate the costs of
these things before starting. If you need to spend a bit more time on devops
in the beginning, then so be it. If you're starting a company, there had
better be some good reasons behind using AWS, aside from "because faster
iteration, developer time!".

Specifically - what AWS features do you find to be useful at the beginning?
You seem to have some specific use-cases in mind. I'm legitimately curious.

~~~
jon-wood
Depending on which tools you're using, you'll have to do a bit of work to
migrate out of AWS, but their tools are usually pretty standard stuff managed
for you.

For example, RDS is just a database instance with management. You'd have to
invest some time to replicate what AWS already does for you, but it's not
rocket science. The same for Elasticache, Autoscaling, ELB, and most of their
other services.

------
Judson
One thing the article mentions is terminating SSL on your ELB. If you want
more control over your SSL setup AND want to get remote IP information (e.g.
X-Forwarded-For) ELB now supports PROXY protocol. I wrote a little
introduction on how to set it up[0]. They haven't promoted it very much, but
it is quite useful.

[0]: [http://jud.me/post/65621015920/hardened-ssl-ciphers-using-
aw...](http://jud.me/post/65621015920/hardened-ssl-ciphers-using-aws-elb-and-
haproxy)
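For anyone curious what the PROXY protocol actually adds: the ELB prepends one plain-text line to the TCP stream carrying the original client address, which your backend parses before the real traffic. A rough sketch of parsing a v1 header (field layout per the protocol spec, not code from the linked post):

```python
def parse_proxy_v1(header: bytes) -> dict:
    """Parse a PROXY protocol v1 header line, e.g.
    b"PROXY TCP4 203.0.113.7 10.0.0.5 51234 443\r\n"
    """
    parts = header.rstrip(b"\r\n").split(b" ")
    if not parts or parts[0] != b"PROXY":
        raise ValueError("not a PROXY protocol v1 header")
    proto = parts[1].decode()
    if proto == "UNKNOWN":  # sender couldn't determine the origin
        return {"proto": proto}
    src_ip, dst_ip, src_port, dst_port = parts[2:6]
    return {
        "proto": proto,
        "client_ip": src_ip.decode(),   # the real remote IP, pre-ELB
        "client_port": int(src_port),
        "proxy_ip": dst_ip.decode(),
        "proxy_port": int(dst_port),
    }
```

HAProxy and nginx can consume this header natively, so in practice you rarely parse it by hand.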

~~~
richadams
Great post, I had no idea you could do this with ELB. I've added your link to
the additional reading list in my post, thanks for sharing!

------
mslot
Be very careful with assigning IAM roles to EC2 instances. Many web
applications have some kind of implicit proxying, e.g. a function to download
an image from a user-defined URL. You might have remembered to block
127.0.0.*, but did you remember 169.254.169.254? Are you aware why
169.254.169.254 is relevant to IAM roles? Did you consider hostnames pointed
to 169.254.169.254? Did you consider that your HTTP client might do a
separate DNS look-up? etc.

There are other subtleties which make roles hard to work with. The same
policies can have different effects for roles and users (e.g., permission to
copy from other buckets).

IAM Roles can be useful, especially for bootstrapping (e.g. retrieving an
encrypted key store at start-up), but only use them if you know what you're
doing.

Conversely, tips like disabling SSH have negligible security benefit if you're
using the default EC2 setup (private key-based login). It's really quite
useful to see what's going on in an individual server when you're developing a
service.

Also, it does matter whether you put a CDN in front of S3. Even when
requesting a file from EC2, CloudFront is typically an order of magnitude
faster than S3. Even when using the website endpoint, S3 is not designed for
web sites and will serve 500s relatively frequently, and does not scale
instantly.

~~~
richadams
Great point with regards to IAM roles. The applications I've worked on don't
download things from user-defined URLs, so this never even occurred to me.

Is the purpose of blocking 169.254.169.254 important because it could
potentially give users access to the instance metadata service for your
instance? I'd be interested to hear more information on securing EC2 with
regard to IAM roles; you seem to have lots of experience in that area.

The disabling SSH tip wasn't really about security (I agree that it has
negligible security benefit), it's more about quickly highlighting parts of
your infrastructure that aren't automated. It's often tempting to just quickly
SSH in and fix this one little thing, and disabling it will force you to
automate the fix instead.

The CDN info has been mentioned elsewhere too, lots of things I didn't know.
I'll be updating the article soon to add all of the points that have been made.
Thanks for the tips!

~~~
mslot
The way IAM roles work is terrifyingly simple. IAM generates a temporary
access key ID and secret access key with the configured permissions, and EC2
makes them available to your instance via the instance metadata as JSON at
[http://169.254.169.254/latest/meta-data/iam/security-
credent...](http://169.254.169.254/latest/meta-data/iam/security-credentials/)
. The AWS SDK periodically retrieves and parses the JSON to get the new
credentials. That's it. I'm not entirely sure whether the credentials can be
used from a different IP or not, but given a proxying function that does not
really matter.

I make sure all HTTP requests in my (Java) application go through a DNS
resolver that throws an exception if: ip.isLoopbackAddress() ||
ip.isMulticastAddress() || ip.isAnyLocalAddress() || ip.isLinkLocalAddress()

The last clause captures 169.254.169.254. Of course, many libraries use their
own HTTP client, so it's easy to make a mistake.
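The same guard in Python, for illustration (an assumed helper, not from the thread): resolve first, reject the dangerous address classes, then connect to the vetted IP rather than the hostname, so a second DNS lookup can't swap the answer out from under you:

```python
import ipaddress
import socket

def vetted_ip(hostname: str) -> str:
    """Resolve hostname and return its IP only if it's safe to fetch from."""
    ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    if ip.is_loopback or ip.is_multicast or ip.is_unspecified or ip.is_link_local:
        # is_link_local is what catches 169.254.169.254, the metadata service
        raise ValueError(f"refusing to fetch from {hostname} ({ip})")
    return str(ip)  # connect to this IP, not the hostname
```

As mslot notes, the hard part is making sure every HTTP client in your dependency tree actually goes through a check like this.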

I'm trying to bring my usage of IAM roles down to 0 as a matter of policy.
Currently, I'm only using an IAM role to retrieve an encrypted Java key store
from S3 (key provided via CloudFormation) and encrypted AWS credentials for
other functions (keys contained in the key store). I'd be happier to bootstrap
using CloudFormation with credentials that are removed from the instance after
start-up.

Thanks for making updates. There are definitely some great tips in there.

------
Fizzer
> you pay the much cheaper CloudFront outbound bandwidth costs, instead of the
> S3 outbound bandwidth costs.

What? CloudFront bandwidth costs are, at best, the same as S3 outbound costs,
and at worst much more expensive.

S3 outbound costs are 12 cents per GB worldwide. [1]

CloudFront outbound costs are 12-25 cents per GB, depending on the region. [2]

Not only that, but your cost-per-request on CloudFront is way more than S3's
($0.004 per 10,000 requests on S3 vs $0.0075-$0.0160 per 10,000 requests on
CloudFront)

[1] [http://aws.amazon.com/s3/pricing/](http://aws.amazon.com/s3/pricing/) [2]
[http://aws.amazon.com/cloudfront/pricing/](http://aws.amazon.com/cloudfront/pricing/)
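A quick back-of-the-envelope check of these numbers (prices as quoted above; real bills vary by region and volume tier, and the traffic figures below are hypothetical) shows the request fees tipping the balance even in CloudFront's best bandwidth case:

```python
S3_PER_GB, CF_PER_GB = 0.12, 0.12        # $/GB out; CloudFront's best case
S3_PER_10K, CF_PER_10K = 0.004, 0.0075   # $ per 10,000 GET requests

def monthly_cost(gb_out, requests, per_gb, per_10k_requests):
    # bandwidth charge plus per-request charge
    return gb_out * per_gb + requests / 10_000 * per_10k_requests

# Hypothetical month: 1 TB out, 50 million requests
s3 = monthly_cost(1_000, 50_000_000, S3_PER_GB, S3_PER_10K)   # 120 + 20.0 = 140.0
cf = monthly_cost(1_000, 50_000_000, CF_PER_GB, CF_PER_10K)   # 120 + 37.5 = 157.5
```

With identical bandwidth pricing, the request fees alone make CloudFront dearer here; a cache in front of S3 has to earn its keep through performance, not price.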

~~~
richadams
Doh, I feel stupid now. I only looked at bandwidth costs, not the request
prices. That's what I get for editing my post late at night based on reading,
instead of based on personal experience.

For low bandwidth, you're absolutely right, the costs are at best the same.
For high bandwidth however (once you get above 10TB), CloudFront works out
cheaper (by about $0.010/GB, depending on region). But that wasn't taking into
account the request cost, which as you point out, is more expensive on
CloudFront, which can negate the savings from above depending on your usage
pattern.

I'll update my post accordingly, thanks for pointing this error out!

~~~
_hyn3
You do have to pay for S3 to CloudFront traffic, so really you're paying
twice. (Although the S3 to CF traffic might be cheaper than S3 to Internet,
according to the Origin Server section on the Cloudfront pricing page.)
[http://aws.amazon.com/cloudfront/pricing/](http://aws.amazon.com/cloudfront/pricing/)

Also, S3 buckets cannot scale infinitely. They have to have their key names
managed appropriately to do it.
[http://aws.typepad.com/aws/2012/03/amazon-s3-performance-
tip...](http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-
seattle-hiring-event.html)

Finally :) I like SSH. But I'm the founder of Userify!
[http://userify.com](http://userify.com)

------
krallin
Lots of very useful tips there!

There's one that I think could be improved on a little:

    
    
        Uploads should go direct to S3 (don't store on local filesystem and have another process move to S3 for example). 
    

You could even use a temporary URL[0,1] and have the user upload directly to
S3!

[0]: [http://stackoverflow.com/questions/10044151/how-to-
generate-...](http://stackoverflow.com/questions/10044151/how-to-generate-a-
temporary-url-to-upload-file-to-amazon-s3-with-boto-library) [1]:
[http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlU...](http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html)

~~~
toomuchtodo
I've always seen issues pushing objects directly to S3 from a browser using
CORS. YMMV.

~~~
ceejayoz
You can specify CORS headers for S3, or you can just use a standard form POST.

~~~
toomuchtodo
You still need a stub API for generating the signature to sign the upload
requests to S3, correct?

~~~
ceejayoz
Not technically, but generally in practice. You can open things up
permissions-wise but run the risk of folks uploading lots of large files.
Keeping permissions locked down and doing a signature allows things like file
size, location, etc. restrictions.
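A sketch of what that signing stub does, using the classic (V2-era) browser POST policy: the server signs a JSON policy embedding the restrictions, and the browser submits the policy and signature as form fields alongside the file. Names here are made up; see the S3 docs for the full field list:

```python
import base64
import hashlib
import hmac
import json

def sign_post_policy(secret_key, bucket, key_prefix, max_bytes, expiration_iso):
    """Return (policy_b64, signature) for a browser-based S3 form POST."""
    policy = {
        "expiration": expiration_iso,  # e.g. "2014-02-07T12:00:00Z"
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],    # pin the upload location
            ["content-length-range", 0, max_bytes], # cap the file size
        ],
    }
    policy_b64 = base64.b64encode(json.dumps(policy).encode()).decode()
    signature = base64.b64encode(
        hmac.new(secret_key.encode(), policy_b64.encode(), hashlib.sha1).digest()
    ).decode()
    return policy_b64, signature
```

Because the conditions are inside the signed policy, a client can't quietly raise the size cap or upload outside the prefix.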

------
j-kidd
Good article, but I think it touches too little on persistence. The trade-off
of EBS vs ephemeral storage, for example, is not mentioned at all.

Getting your application server up and running is the easiest part of
operations, whether you do it by hand via SSH, or automate and autoscale
everything with ansible/chef/puppet/salt/whatever. Persistence is the hard
part.

~~~
crescentfresh
Good point. We're struggling to see the benefits of EBS for Cassandra, which
has its own replication strategy (i.e. data is not lost if an instance is
lost), voiding the "only store temporary data on ephemeral stores" argument.

~~~
blakesmith
How do you handle entire datacenter outages with an ephemeral-only setup? You
can replicate to another datacenter, but if power is lost to both, do you just
accept that you'll have to restore from a snapshotted backup?

~~~
ismarc
Our use-case is different (not Cassandra or a DB hosted on ephemeral drives),
but what we've found using AWS for about 2 years now is that when an
availability zone goes out, it's either linked to or affects EBS. Our setup
now: base-load data and PG WAL files are stored/written to S3, all servers use
ephemeral drives, the difference in data is loaded at machine-creation time,
and the AMI that servers are loaded from is recreated every night. We always
deploy to 3 AZs (2 if that's all a region has), with Route 53 latency-based
DNS lookups pointing to an ELB that sits in front of our servers for the
region. (We previously had 1 ELB per AZ, since ELBs used DNS lookups to
determine where to route someone amongst AZs, and some of our sites are the
origin for a CDN, so it didn't balance appropriately; this has since been
changed.) That ELB is in the public+private section of a VPC, with all the
rest of our infrastructure in the private section (the VPC spans all 3 AZs).
We use ELBs internal to the AZ for services that communicate with each other.
The entire system is designed so that you can shoot a single server, a single
AZ, or a single region in the face, and the worst you get is degraded
performance (say, going to the west coast from the east coast, etc.).

Using this type of setup, we had 100% availability for our customers over a
period of 2 years (up until a couple of weeks ago, when the flash 12 upgrade
caused a small number of our customers to be impacted). This includes the
large outage in US East 1 from the electrical storm, as well as several other
EBS related outages. Overall costs are cheaper than setting up our own geo-
diverse set of datacenters (or racks in datacenters) thanks to heavy use of
reserved instances. We keep looking at the costs and as soon as it makes
sense, we'll switch over, but will still use several of the features of AWS
(peak load growth, RDS, Route 53).

The short answer is to design your entire system so that any component can
randomly be shot in the face, from a single server to the eastern seaboard
falling into the ocean to the United States immediately going the way of Mad
Max. Design failure into the system and you get to sleep at night a lot more.

------
PhilipA
Really useful article, though I don't agree with skipping a CDN in front of
S3. There are multiple articles which show that S3's performance is quite bad
and not suited to serving assets, compared to CloudFront.

~~~
tyw
also the outbound bandwidth cost of S3 is very high. it would cost us several
times what we're paying for s3+cloudfront to serve our content straight from
s3.

~~~
tedivm
Even cloudfront is ridiculously overpriced for a CDN. If you're pushing
anything close to real bandwidth you could do a lot better elsewhere.

~~~
Consultant32452
If you're pushing anything close to real bandwidth you're probably getting an
individually negotiated price that is not public.

~~~
tedivm
Not with Amazon you aren't, they are amazingly stringent in their pricing on
this matter.

~~~
tyw
[http://aws.amazon.com/cloudfront/pricing/](http://aws.amazon.com/cloudfront/pricing/)
the reserved capacity pricing is much better than the on demand pricing.
Basically like EC2 on demand vs reservations. We set our reserved capacity at
about 70-80% of what we expect to use most of the time. We could probably
shave a few tenths of a cent per gig off but we get a good price on everything
above what we've reserved so it's worked out.

If you use a lot of cloudfront bandwidth without setting up a reservation,
yeah... you're gonna pay through the nose.

------
drob
Along these lines, I recommend installing New Relic server monitoring on all
your EC2 instances.

The server-level monitoring is free, and it's super simple to install. (The
code we use to roll it out via ansible:
[https://gist.github.com/drob/8790246](https://gist.github.com/drob/8790246))

You get 24 hours of historical data and a nice webUI. Totally worth the
effort.

------
match

      > Use random strings at the start of your keys.
      > This seems like a strange idea, but one of the implementation details 
      > of S3 is that Amazon use the object key to determine where a file is physically 
      > placed in S3. So files with the same prefix might end up on the same hard disk 
      > for example. By randomising your key prefixes, you end up with a better distribution 
      > of your object files. (Source: S3 Performance Tips & Tricks)
    

This is great advice, but just a small conceptual correction: the prefix
doesn't control where the file contents are stored, it just controls where
the index entry for that file is stored.
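A minimal version of the randomization trick, deriving the prefix from a hash of the key itself so it's reproducible at read time (the 4-character length is an arbitrary choice):

```python
import hashlib

def randomized_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash so lexically similar keys spread across partitions."""
    prefix = hashlib.md5(key.encode()).hexdigest()[:prefix_len]
    return f"{prefix}/{key}"

# Sequential names like 2014/02/06/photo-0001.jpg, 2014/02/06/photo-0002.jpg
# now start with unrelated hex prefixes instead of a shared "2014/02/06/".
```

Because the prefix is a deterministic function of the key, you can recompute it when reading instead of storing a mapping anywhere.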

------
lfuller
Your body tag is set to "overflow: hidden;". I wasn't able to scroll until I
tweaked it manually in the inspector.

~~~
richadams
Oops, sorry about that. Should be fixed now.

~~~
paulgb
Also, if you change the first line of
[http://wblinks.com/css/style.css](http://wblinks.com/css/style.css)

to

    
    
        @import url(http://fonts.googleapis.com/css?family=Droid+Sans:400,700);
    

you should notice an improvement in the boldface font rendering.

Great article, btw.

~~~
riffraff
OT, but how does this work? Why does it improve rendering? Is it downloading
both the normal and bold weights rather than only one, leaving the browser to
do the rest?

~~~
paulgb
Yes, you have it exactly right. The browser will try to make fonts bold on its
own, but it's not a match for the bold that the designer intended.

------
kolev
One painful-to-learn issue with AWS is the limits on services, some of which
are not so obvious. Everything has a hard limit, and unless you have the
support plan, it can take days or weeks to get those lifted. They are all
handled by the respective departments and lifted (or rejected) one by one.
Many times we've hit a Security Group limit right before a production push, or
something similar. Last, but not least, RDS and CloudFront are extremely
painful to launch. I've had many incidents where RDS took nearly 2 hours to
launch - a blank multi-AZ instance! CloudFront distributions take 30 minutes
to complete. I hate those two taking so long, as my CloudFormation templates
take in excess of an hour due to the blocking RDS and CloudFront steps. VPC is
nice, I love it, but it takes time to grasp the difference between Network
ACLs and Security Groups, and especially - why the heck do you need to run
NATs yourself?! Why isn't that part of the service?! They provide some
outdated "high" availability NAT scripts which are, in fact, buggy, and
support only 2 AZs. Also, a CloudFront "flush" takes over 20 minutes - even
for empty distributions! And you can't do a hot switch from one distribution
to another, as it also takes 30 minutes to change a CNAME and you cannot have
two distributions with the same CNAME (it's a weird edge case, but anyway).

~~~
kolev
Just recalled another big annoyance! CloudFormation allows you to store JSON
files in the user data, which is a bit similar to cloud-init, but... it turns
your numbers into strings! So, imagine you need to put some JSON config file
in there and the software expects an integer and craps out if there's a string
value instead. I won't even bring up how limited and behind the API
CloudFormation is... Even their AWS CLI is behind and doesn't support major
services like CloudFront. They even removed the nice landing page of the CLI
tool, which made it very obvious which services are NOT supported - I guess
they just got embarrassed by having so many unsupported ones!

------
noelherrick
> Have tools to view application logs.

Yes! Centralized logging is an absolute must: don't depend on being able to
log in and look at logs on each box. That grows wearisome fast.

~~~
txttran
What tools do you recommend for centralized logging?

------
Mizza
That '.' instead of '-' tip for SSL'd buckets just saved me a large future
headache. Good stuff!

~~~
croddin
I think you reversed them.

------
michaelmior
Disabling SSH is an interesting tip. I guess the OP doesn't do any automation
via SSH.

~~~
richadams
Just disabling inbound SSH connections, the servers can still SSH out to other
systems to pull in files, configurations, clone git repos, etc.

It's just a way to stop yourself from cheating and SSHing in just to fix that
one thing, instead of automating it.

~~~
milkshakes
except that some automation frameworks rely on inbound ssh access to the
machines. ansible would be an example of such a framework, in its default
configuration at least.

~~~
richadams
Ah, I wasn't aware of that, very good point!

The goal of the tip is really to stop users SSHing in just to fix that one
little thing, so you could still allow your automation frameworks SSH access
and just disable it for users.

------
novaleaf
i'm a devops noob. what tools should i use to log / monitor all my servers?

i don't want to learn some complex stuff like chef/puppet btw.... anything
SIMPLE?

~~~
carlio
Though I haven't tried it, people tell me that ansible is pretty simple -
[http://www.ansible.com/home](http://www.ansible.com/home)

For logging, try logstash? [http://logstash.net/](http://logstash.net/)

Monitoring... well that's a large and complicated topic!

~~~
adenot
+1 on Ansible, great tool and super simple to configure and use.

------
freerobby
Can you (or somebody else) elaborate on disabling ssh access? Is this a dogma
of "automation should do everything" or is there a specific security concern
you are worried about? What is the downside of letting your ops people ssh
into boxes, or for that matter of their needing to do so?

~~~
pavel_lishin
Based on the article, it seems it's there to make sure that you're automating
everything, instead of logging in to do that one little thing by hand.

~~~
freerobby
Thanks.

Does anybody else here agree with this mentality? This seems like a major bad
practice to me. I've worked at companies with as few as two people to as
many as 50,000 people. None of them have had production systems that are
entirely self-maintaining. Most startups are better off being pragmatic than
investing man-years of time handling rare error cases like what to do if you
get an S3 upload error while nearly out of disk space. There's a good reason
why even highly automated companies like Facebook have dozens of sysadmins
working around the clock.

I thought all of his other points were spot-on but this one rings very
dissonant to my experience.

~~~
pnathan
I agree - it seems off to me. Sometimes you really want to diagnose your
problems manually.

I'm also wondering how command and control is maintained without SSH access.
Is there some kind of autoupdating service polling a master configuration
management server (i.e., puppet's puppetmaster)?

I can appreciate ensuring that a typical deploy doesn't require hand-
twiddling. That makes sense, lots of it. But not disabling SSH.

~~~
richadams
I think the problem is that I've made it seem like a strict rule in the
article; "You must disable SSH or everything will go wrong!!!". It's really
just about quickly highlighting what needs automating. Like you say, sometimes
you just want to diagnose your problems manually, and that's fine, re-enable
SSH and dive in. But if you're constantly fixing issues by SSHing in and
running commands, that's probably something you can try to automate.

Personally I always had a bad habit of cheating with my automation. I would
automate as much as I could, and then just SSH in to fix one little thing now
and then. I disabled SSH to force myself to stop cheating, and it worked well
for me, so I wanted to share the idea.

Of course, there's always going to be cases where it's simply not feasible to
disable it completely. It depends on your application. The ones I've worked on
aren't massively complex, so the automation is simpler. I can certainly see
how not having SSH access for larger complex systems could become more of a
hindrance.

------
Estragon
How hard is it to roll your own version of AWS's security groups? I want to
set up a Storm cluster, but the methods I have come up with for firewalling it
while preserving elasticity all seem a bit fragile.

~~~
jamiesonbecker
Check out Dome9. Amazing tool and I think they work with both AWS and
elsewhere.

------
mblaney
As an Australian developer, using an EC2 instance seems to be the cheapest
option if you want a server based in this country. Anyone got any other
recommendations?

~~~
kibibu
Ninefold aren't bad either

~~~
mootpointer
As a Ninefold employee, I'd like to think we're pretty good. We do virtual
servers and we have a solid Rails platform as well.

------
simonlebo
Can anyone explain how disabling ssh has anything to do with automation? We
automate all our deployments through ssh and I was not aware of another way of
doing it.

~~~
ceejayoz
I believe the idea is that by preventing SSH, the temptation to just pop in
and tweak something manually is removed.

~~~
richadams
Yup, this was the intention. You could still allow your automation processes
SSH access, just disable it for your users.

The idea is that if a user can't SSH in (at least not without modifying the
firewall rules to allow it again), it will force them to try and automate what
they were going to do instead. It worked well for me, but it's probably not
for everyone.

------
rdl
I'd probably also say "avoid EBS where possible, especially in favor of
instance storage" and "avoid ELB, roll your own."

------
late2part
Thing I wish I'd known before I started: Don't rely on proprietary AWS
solutions when open source solutions work just as well.

------
jamiesonbecker
With regards to managing ssh, keys, etc... userify. Disclaimer: founder.

------
gesman
Someone needs to create such list for Azure as well.

And make it Wiki-ized.

------
ape4
Wow looks like a big pain.

------
Fasebook
What's the point of auditing security in the Cloud? Is there any point at
which you can know that you're making any progress?

~~~
mscarborough
Just one example -- Amazon will sign a Business Associate's Agreement for
HIPAA compliance. That doesn't absolve you of your application security
responsibilities, but it does give you piece of mind on the PAAS EC2/S3 side
of things.

~~~
tel
For further note though, they won't unless you buy dedicated instances. This
also disables RDS.

------
5ersi
Aww man, my head hurts just looking at this list.

Just go with a PaaS, like Heroku or AppEngine, and forget about this sysadmin
crap.

~~~
q3k
> sysadmin crap

Without this “sysadmin crap” you would not have your precious PaaS.

~~~
tburch
>> sysadmin crap

>Without this “sysadmin crap” you would not have your precious PaaS.

The difference being that _I_ don't have to deal with the “sysadmin crap”.

~~~
q3k
Sure, but there's no reason to call it crap - have some respect for your
fellow ops ;).

