
AWS Tips, Tricks, and Techniques - sehrope
https://launchbylunch.com/posts/2014/Jan/29/aws-tips/
======
thomseddon
Here's another one for you: distribute your s3 paths/names

Because of the way S3 is designed, where files are stored on the physical
infrastructure depends on the prefix of the key name. I'm not exactly sure
how much of the key name is used, but, for example, if you prefixed all your
images with images/....jpg it's highly likely they will all be stored on the
same physical hardware.

I know of at least two companies for whom this has caused large problems;
one of them is Netflix. Imagine all the videos in a single bucket with key
names like "/video/breaking_bad_s1_e1.mp4" (a crude example, I know): all
requests hit the same physical hardware, and under high load the hardware
just can't keep up. This exact issue has apparently been the cause of more
than one Netflix outage.

The solution is simple: ensure your files have a random prefix
({uuid}.breaking_bad_s1_e1.mp4) and they will be spread around the datacentre
:)
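
A minimal sketch of the idea in Python, assuming boto3 and a made-up bucket name:

```python
import uuid

import boto3  # assumes credentials are configured in the environment

s3 = boto3.client("s3")

def upload_with_random_prefix(bucket, filename, body):
    # Prepend a short random hex prefix so keys don't cluster under a
    # single common prefix (and therefore a single S3 partition).
    key = "{}-{}".format(uuid.uuid4().hex[:8], filename)
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return key

# upload_with_random_prefix("my-videos", "breaking_bad_s1_e1.mp4", data)
```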

~~~
jeffbarr
This is definitely a good tip. Here's a blog post with more info on the topic:

[http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html](http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html)

~~~
moe
This seems a rather odd implementation detail to burden the user with.

Why don't you just use a hash of the filename for the partition key, so all
files are distributed randomly regardless of their name?

Is there an advantage for amazon or for the user in having S3 files cluster up
by filename by default?

~~~
penguindev
Most likely so that dir listings are a lot faster (they are sorted by key, and
can use delimiters for subdir emulation).
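
For example, the delimiter trick looks roughly like this (a sketch with boto3; the bucket and prefix are made up):

```python
import boto3

s3 = boto3.client("s3")

# Keys come back in sorted order; Delimiter groups them into "subdirectories"
# without having to list every object under the prefix.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="photos/", Delimiter="/")

for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])   # e.g. photos/2013/, photos/2014/
for obj in resp.get("Contents", []):
    print(obj["Key"])    # objects sitting directly under photos/
```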

~~~
dllthomas
... but for that they just need to store the _filenames_ sorted somewhere.
Content could be located based on a hash of the filename, still.

------
grosskur
Another tip: IAM roles for EC2 instances.

[http://aws.typepad.com/aws/2012/06/iam-roles-for-ec2-instances-simplified-secure-access-to-aws-service-apis-from-ec2.html](http://aws.typepad.com/aws/2012/06/iam-roles-for-ec2-instances-simplified-secure-access-to-aws-service-apis-from-ec2.html)

Basically, the apps you run on EC2 often need to access other AWS services. So
you need to get AWS credentials onto your EC2 instances somehow, which is a
nontrivial problem if you are automatically spinning up servers. IAM roles
solve this by providing each EC2 instance with a temporary set of credentials
in the instance metadata that gets automatically rotated. Libraries like boto
know to transparently fetch, cache, and refresh the temporary credentials
before making API calls.

When you create an IAM role, you give it access to only the things it needs,
e.g., read/write access to a specific S3 bucket, rather than access to
everything.
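
As a rough sketch of what that looks like from the application side (boto3 here; the bucket name is made up), no long-lived keys appear anywhere in code or config:

```python
import boto3

# On an EC2 instance with an IAM role attached, boto3 fetches temporary
# credentials from the instance metadata service and refreshes them
# automatically before they expire -- no access keys to manage.
s3 = boto3.client("s3")

# If the role's policy only grants s3:GetObject/s3:PutObject on
# my-app-bucket, calls against any other bucket fail with AccessDenied.
s3.put_object(Bucket="my-app-bucket", Key="reports/today.csv", Body=b"...")
```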

~~~
penguindev
You can also use those temporary credentials to call AssumeRole on an external
IAM role, for example to access a customer's account. It's a lot to think
about - my EC2 instance role then assuming a third-party account's IAM role -
but it's quite secure; neither you nor the customer has any private secrets or
rotation headaches.
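
A sketch of that flow with boto3 (the role ARN is a placeholder):

```python
import boto3

# Runs on an EC2 instance whose own role is allowed to call sts:AssumeRole
# on the customer's role. No static secrets on either side.
sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/CustomerAccessRole",  # hypothetical
    RoleSessionName="cross-account-demo",
)
creds = resp["Credentials"]  # short-lived key, secret, and session token

# Use the temporary credentials to act inside the customer's account.
customer_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```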

------
teraflop
I would temper the suggestion to use Glacier with a warning: make sure you
thoroughly understand the pricing structure. If you don't read the details
about retrieval fees in the FAQ, it's easy to shoot yourself in the foot and
run up a bill that's vastly higher than necessary. You get charged based on
the _peak_ retrieval rate, not just the total amount you retrieve. Details
here:
[http://aws.amazon.com/glacier/faqs/](http://aws.amazon.com/glacier/faqs/)

For example, suppose you have 1TB of data in Glacier, and you need to restore
a 100GB backup file. If you have the luxury of spreading out the retrieval in
small chunks over a week, you only pay about $4. But if you request it all at
once, the charge is more like $175.

In the worst case, you might have a single multi-terabyte archive and ask for
it to be retrieved in a single chunk. I've never been foolhardy enough to test
this, but according to the docs, Amazon will happily bill you tens of
thousands of dollars for that single HTTP request.

~~~
mjn
They explain it _really_ weirdly, but it's not that hard to cap your
expenditure if you translate it into bandwidth units rather than storage
units, since that's what's really being billed. The footnote on the main
Pricing page is the most misleading part of their explanation:

 _If you choose to retrieve more than this amount of data in a month, you are
charged a retrieval fee starting at $0.01 per gigabyte._

From this you might think the retrieval fee is some function of the gigabytes
retrieved, but it isn't, at least not in any direct way: you're charged ((max
gigabytes retrieved in any one hour in the month) x (720 hours in the month) x
per-GB price), modulo some free quota. So you can really be charged for up to
"720 gigabytes" for a single gigabyte retrieved, which is a bit silly to think
of in terms of per-gigabyte pricing.
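
As a back-of-the-envelope sketch (using the $0.01/GB figure and ignoring the free quota):

```python
def glacier_retrieval_fee(peak_gb_in_any_hour, per_gb=0.01, hours_in_month=720):
    # Billed on your single busiest retrieval hour, scaled to the whole month.
    return peak_gb_in_any_hour * hours_in_month * per_gb

print(glacier_retrieval_fee(1.0))  # 7.2  -- pull 1 GB in one hour, pay as if
                                   #         you did that every hour all month
print(glacier_retrieval_fee(0.1))  # 0.72 -- same data throttled over 10 hours
```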

It makes more sense to me if you think of it as priced by your peak retrieval
bandwidth. Then you can cap your expenditure by just rate-throttling your
retrieval. Make sure to throttle the actual requests, using HTTP range queries
on very large files if necessary (many Glacier front-ends support this)...
throttling just your network or router will not have the desired effect, since
pricing doesn't depend on download completion.

Anyway, the retrieval pricing in units of Mbps is ~$3/Mbps, billed at the peak
hour-average retrieval rate for the month above the free quota. If you use
that as a rule of thumb it's not bad for personal use. For example, given my
pipe, it's perfectly reasonable for me never to pull down more than 10 Mbps
anyway, in which case my max retrieval cost is < $30/month.

~~~
teraflop
Yeah, that's a good summary.

Another way to look at it is that if you retrieve data as fast as possible --
say, to S3 or EC2 -- it costs about $1.80/GB, minus the free quota. So for
large objects, it's cheaper to store them in Glacier than S3 if you don't
expect to need them in a hurry any time in the next 18 months. The free
retrieval quota starts making a difference if your typical "ASAP-retrieval"
size is less than 0.1% of your total archived data.

------
sehrope
OP here. Really cool to see people enjoying the write-up.

It started off as a list of miscellaneous AWS topics that I found myself
repeatedly explaining to people. It seemed like a good idea to write them down.

I'm planning on listing out more in follow-up posts.

~~~
cheese1756
Fantastic post with really helpful tips. Thanks a lot for that.

I really want to read the details on the spot instance setup for a web
service. Do you have a quick summary for that?

~~~
sehrope
> Fantastic post with really helpful tips. Thanks a lot for that.

Thanks!

> I really want to read the details on the spot instance setup for a web
> service. Do you have a quick summary for that?

I don't have anything written up to point to, but the idea is to have the spot
instances dynamically register themselves with an ELB on startup. As long as
they stay up ( _ie. the spot price is below your bid_ ) you get incredibly
cheap scaling for your web service. The ELB will automatically kick out
instances that get terminated, as they will fail their health checks. Combine
this with a couple of regular ( _ie. non-spot_ ) instances and you get a web
service that scales cheaply when spot prices are low and gradually degrades
the QoS for your users when they get more expensive.
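
A rough sketch of the self-registration piece, run at instance boot (boto3; the load balancer name is made up):

```python
import urllib.request

import boto3

# Look up our own instance ID from the metadata service and add ourselves
# to the load balancer. If the spot instance is terminated later, the ELB
# health checks will drop it automatically.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

elb = boto3.client("elb")  # classic ELB API
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-web-elb",  # hypothetical
    Instances=[{"InstanceId": instance_id}],
)
```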

I've got a couple of other topics in the works first, but I should have the
write-up for that one up pretty soon too. I haven't gotten around to adding an
email signup form to my site yet ( _need to do that..._ ), so shoot me an
email if you want to be notified when the full post is up. Email is in my
profile.

------
dmourati
This is all good stuff. AWS takes a while to grok but once you do, it offers
so many new possibilities.

The Aha! moment for me came when playing with SimianArmy, the wonderful
Netflix OSS project and in particular, Chaos Monkey.

Rather than build redundancy into your system, build failure in and force
failures early and often. This will surface architectural problems better than
any whiteboard.

[https://github.com/Netflix/SimianArmy](https://github.com/Netflix/SimianArmy)

Also, check out boto and aws cli.

[http://aws.amazon.com/sdkforpython/](http://aws.amazon.com/sdkforpython/)

[http://aws.amazon.com/cli/](http://aws.amazon.com/cli/)

~~~
benjaminwootton
I think I saw a tweet along the lines of "don't think of EC2 instances as you
would servers."

The point is, architecting for the cloud is all about treating your
infrastructure as very transient and failure-prone.

Hopefully stuff doesn't fail but architecting for that gives you a much better
application.

Netflix are my favourite tech company based on their open source distributed
systems stuff. Check out Hystrix and their circuit breaker implementation. I
am going to lean on that heavily in an upcoming project, such that entire
application tiers can disappear with graceful degradation.

------
ultimoo
I for one think that S3's server-side encryption is amazing for most if not
all data. We back up hundreds of gigabytes of data every day to S3, and
enabling server-side encryption means we don't have to worry about generating,
rotating, and managing keys and encryption strategies. It also saves us time
since we don't have to compute the encryption or decryption ourselves. The
best part is that the AWS AES-256 server-side encryption at rest suffices for
compliance.

Of course, the data that we store, while confidential, isn't super-mission-
critical-sensitive data. We trust AWS to not peek into it, but nothing will be
lost if they do.

~~~
thomseddon
I suppose it's OK for some use cases but realistically it doesn't provide a
whole lot of security.

Doing this only really protects you if someone steals the physical disks from
inside AWS; whilst this is a legitimate risk, it seems incredibly unlikely,
and I'm not aware of any such instance.

HOWEVER, it does not protect you from the far more likely scenario of someone
gaining access to your AWS account. If someone is able to get your credentials
(or otherwise hack your account) then all your data is vulnerable.

~~~
gfodor
In a lot of cases your AWS account being compromised is going to open you up
to having any encryption keys you use for GPG encryption stolen too anyway. So
it seems that there's a decent chance that even GPG encryption is just
protecting you from the "disks stolen from the datacenter" risk.

At the end of the day your AWS account is often the "master key", so make sure
you use a good password, rotate it, and enable 2-factor auth.

~~~
e12e
How's that? Say you set up duplicity on your AWS _instances_. They only need
access to the public encryption key and the private signing key -- the session
(actual encryption) keys will be destroyed after a backup is made. Now, to
_restore_ you'll need to provide access to the public signing key (this _is_
public, so no problem) and the private encryption key. That key should only
ever be online and decrypted in the case of a restore.

Of course, if you set things up differently, then, yes. Compromised by design.

------
kmfrk
Cloud66 is a good cautionary tale about resisting the temptation to double-dip
when it comes to your AWS tokens:

[https://news.ycombinator.com/item?id=5685406](https://news.ycombinator.com/item?id=5685406)

[https://news.ycombinator.com/item?id=5669315](https://news.ycombinator.com/item?id=5669315)

------
kudu
I know this is about AWS, but it might be helpful to mention that DreamObjects
([http://www.dreamhost.com/cloud/dreamobjects/](http://www.dreamhost.com/cloud/dreamobjects/))
is an API-compatible cheaper alternative to S3.

------
geoffroy
Very good article, thanks. I did not know about S3 temporary URLs; I might use
that in the future.

~~~
sehrope
Yes, they're a _really_ useful feature. You can use them to allow both
downloads and uploads. The latter is particularly cool for creating scalable
webapps that involve users uploading large content.

Rather than having them tie up your app server's resources during what could
be a long/slow upload, users upload directly to S3, and then your app downloads
the object from S3 for processing. If your app is hosted on EC2 you can even
make the entire process bandwidth-free, as user-to-S3 uploads are free and
S3-to-EC2 downloads are free. Just delete the object immediately after use, or
set a short-lived object expiration for the entire bucket.

I've made a note to add an example of this to my next post on this topic.
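
In the meantime, here's the basic shape of it as a sketch with boto3 (bucket and key names are made up):

```python
import boto3

s3 = boto3.client("s3")

# Pre-signed upload URL: the user PUTs the file straight to S3, so the
# app server never touches the bytes.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-uploads", "Key": "incoming/user-123.mp4"},
    ExpiresIn=900,  # 15 minutes
)

# Pre-signed download URL: hand this to an authenticated user (or redirect
# to it) instead of streaming the object through your app.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-uploads", "Key": "incoming/user-123.mp4"},
    ExpiresIn=300,
)
```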

~~~
mje__
Also useful for authenticated downloads - don't tie up an app thread; have
your app redirect to a pre-signed URL instead.

------
mrfusion
Question about the underlying SaaS product. I'm not understanding how a
database client on the cloud can connect to a database server on my own
machine?

Or am I misunderstanding what it does?

~~~
sehrope
> Question about the underlying SaaS product. I'm not understanding how a
> database client on the cloud can connect to a database server on my own
> machine?

OP and founder of JackDB[1] ( _the SaaS product referred to_ ) here.

The majority of people using JackDB connect to cloud hosted databases such as
Heroku Postgres or Amazon RDS. As they're designed for access across the
internet, no special configuration is necessary for them.

To connect to a database on your local machine or local network you would need
to set up your firewall to allow inbound connections from our NATed IP. More
network details available at [2].

[1]: [http://www.jackdb.com/](http://www.jackdb.com/)

[2]:
[http://www.jackdb.com/docs/security/networking.html](http://www.jackdb.com/docs/security/networking.html)

------
aerlinger
If you're running an application that spans more than one server, it's
definitely worth checking out AWS OpsWorks. It's a huge time saver and
extremely useful for integrating and managing setup, configuration, and
deployment across servers, databases, caches, etc. without any loss of control
or customization.

------
sandGorgon
So, in older reddit threads, I read about how you need to set up RAID-1 across
EBS volumes for all your EC2 servers, as well as test your EBS storage, because
individual volumes can perform really badly.

Is anybody doing this for their EC2 deployments and, more importantly,
automating it?

------
rschmitty
Does anyone have experience with Reduced Redundancy Storage on S3?

How often do you lose files? Do you run a daily job to check in on them?

~~~
jhgg
I've been using S3 RRS for a few hundred thousand files, and have yet to lose
one over the course of the year that I've stored them.

You don't need to run a daily job, though; you simply set up an SNS
notification to tell you if a file is lost[0].

[0]:
[http://docs.aws.amazon.com/AmazonS3/latest/UG/SettingBucketNotifications.html](http://docs.aws.amazon.com/AmazonS3/latest/UG/SettingBucketNotifications.html)
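
A sketch of wiring that up with boto3 (the bucket name and topic ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to publish to an SNS topic whenever a reduced-redundancy object
# is lost, rather than polling the bucket yourself.
s3.put_bucket_notification_configuration(
    Bucket="my-rrs-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:rrs-lost",  # hypothetical
                "Events": ["s3:ReducedRedundancyLostObject"],
            }
        ]
    },
)
```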

~~~
rschmitty
Excellent, thank you so much!

------
alimoeeny
Thanks for the [http://www.port25.com/](http://www.port25.com/) tip.

------
sparkzilla
The first tip should be to use something cheaper and faster.

~~~
FourEdenSix
Such as?

~~~
sparkzilla
I was on AWS for over a year. I moved to Digital Ocean. Much cheaper and much
faster.

~~~
rubyist_delight
Except for the fact that DigitalOcean doesn't take security seriously at all.
It's good for testing small projects, but to run a business on? No thanks. DO
has a very small fraction of the features that AWS has.

~~~
toomuchtodo
And DigitalOcean doesn't have ELBs, or S3, or DNS service checks. Need I go
on? Digital Ocean is good for small projects, cheap projects, or anything
where you only need semi-reliable compute power.

