AWS Tips, Tricks, and Techniques (launchbylunch.com)
295 points by sehrope on Jan 29, 2014 | 56 comments

Here's another one for you: distribute your s3 paths/names

Because of the way S3 is designed, where files are stored on the physical infrastructure depends on the prefix of the key name. I'm not exactly sure how much of the key name is used, but, for example, if you prefixed all your images with images/....jpg, it's highly likely they would all be stored on the same physical hardware.

I know of at least two companies for whom this has caused large problems; one of them is Netflix. Imagine all the videos in a single bucket with key names like "/video/breaking_bad_s1_e1.mp4" (a crude example, I know): all requests hit the same physical hardware, and under high load the hardware just can't keep up. This exact issue has apparently been the cause of more than one Netflix outage.

The solution is simple: ensure your files have a random prefix ({uuid}.breaking_bad_s1_e1.mp4) and they will be spread around the datacentre :)
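
A minimal sketch of the prefixing idea, assuming a short hash of the original name is an acceptable stand-in for the random UUID suggested above (unlike a UUID, it can be recomputed from the name later). The function name `prefixed_key` and the 4-character prefix length are illustrative choices, not anything S3 requires.

```python
import hashlib


def prefixed_key(name: str, prefix_len: int = 4) -> str:
    """Return a key whose leading characters are spread evenly.

    The prefix comes from a SHA-256 digest of the name, so the same
    name always maps to the same key.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}.{name}"


if __name__ == "__main__":
    for episode in ("breaking_bad_s1_e1.mp4", "breaking_bad_s1_e2.mp4"):
        print(prefixed_key(episode))
```

The trade-off is that listings sorted by key no longer group related files together, so anything that relies on prefix listing needs its own index.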

This is definitely a good tip. Here's a blog post with more info on the topic:


This seems a rather odd implementation detail to burden the user with.

Why don't you just use a hash of the filename for the partition key, so all files are distributed randomly regardless of their name?

Is there an advantage for amazon or for the user in having S3 files cluster up by filename by default?

Most likely so that directory listings are a lot faster. (They are sorted by key, and can use delimiters for subdirectory emulation.)

... but for that they just need to store the filenames sorted somewhere. Content could be located based on a hash of the filename, still.

If this is true, it is umm, naively suboptimal (I have harsher words in my head).

Relevant sessions from re:Invent:

Building Scalable Applications on Amazon Simple Storage Service (2012) https://www.youtube.com/watch?v=YYnVRYbUR6A

Maximizing Amazon S3 Performance (2013) https://www.youtube.com/watch?v=uXHw0Xae2ww

Randomness isn't necessary to distribute the path names. Somewhere I read that reversing sequential ids is sufficient.

So, for instance, if you have some sequentially numbered files you want to store: 123.jpg 124.jpg 125.jpg 126.jpg

You save them at: 321.jpg 421.jpg 521.jpg 621.jpg
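
The reversal described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes non-negative integer IDs.

```python
def reversed_id_key(seq_id: int, extension: str = "jpg") -> str:
    """Build a key from a sequential ID with its digits reversed.

    The fastest-changing digit moves to the front, so consecutive IDs
    get different leading characters.
    """
    return f"{str(seq_id)[::-1]}.{extension}"


def original_id(key: str) -> int:
    """Recover the sequential ID from a key built by reversed_id_key."""
    stem = key.rsplit(".", 1)[0]
    return int(stem[::-1])


if __name__ == "__main__":
    print([reversed_id_key(i) for i in (123, 124, 125, 126)])
    # ['321.jpg', '421.jpg', '521.jpg', '621.jpg']
```

Because the reversed digits are kept as a string, an ID such as 120 becomes "021.jpg" and still maps back to 120.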

S3 keys use the leading ~16 bytes for partitioning, most significant first, as I recall. Prefixing with guids is ideal, epoch timestamps are pathologically bad, a "random" char should be enough for 99% of users.

Another tip: IAM roles for EC2 instances.


Basically, the apps you run on EC2 often need to access other AWS services. So you need to get AWS credentials onto your EC2 instances somehow, which is a nontrivial problem if you are automatically spinning up servers. IAM roles solve this by providing each EC2 instance with a temporary set of credentials in the instance metadata that gets automatically rotated. Libraries like boto know to transparently fetch, cache, and refresh the temporary credentials before making API calls.

When you create an IAM role, you give it access to only the things it needs, e.g., read/write access to a specific S3 bucket, rather than access to everything.
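
As a rough illustration of "only the things it needs", here is a sketch that builds a policy document scoped to a single bucket. The bucket name `example-app-uploads` and the exact set of actions are assumptions for the example; a real role should list whatever the app actually uses.

```python
import json


def bucket_read_write_policy(bucket: str) -> dict:
    """Build an IAM policy document limited to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object reads and writes apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(json.dumps(bucket_read_write_policy("example-app-uploads"), indent=2))
```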

You can also use those temporary credentials to call AssumeRole on an external IAM role, for example to access a customer's account. It's a lot to think about (my EC2 instance role then assuming another, third-party account's IAM role), but it's quite secure; neither you nor the customer has any private secrets or rotation headaches.
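
A hedged sketch of that cross-account hop, assuming a client that behaves like `boto3.client("sts")` is passed in. The function name and the default session name are made up for the example, and the role ARN would come from the customer.

```python
def assume_customer_role(sts_client, role_arn: str,
                         session_name: str = "cross-account") -> dict:
    """Exchange the caller's credentials for temporary ones in another account.

    sts_client is expected to behave like boto3.client("sts"), which on
    an EC2 instance picks up the instance role's credentials by itself.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )
    creds = response["Credentials"]
    # The keys below match the keyword arguments boto3.client() accepts.
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

The returned dictionary could then be unpacked into another client, for example `boto3.client("s3", **assume_customer_role(sts, role_arn))`, to act inside the customer's account until the temporary credentials expire.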

I would temper the suggestion to use Glacier with a warning: make sure you thoroughly understand the pricing structure. If you don't read the details about retrieval fees in the FAQ, it's easy to shoot yourself in the foot and run up a bill that's vastly higher than necessary. You get charged based on the peak retrieval rate, not just the total amount you retrieve. Details here: http://aws.amazon.com/glacier/faqs/

For example, suppose you have 1TB of data in Glacier, and you need to restore a 100GB backup file. If you have the luxury of spreading out the retrieval in small chunks over a week, you only pay about $4. But if you request it all at once, the charge is more like $175.

In the worst case, you might have a single multi-terabyte archive and ask for it to be retrieved in a single chunk. I've never been foolhardy enough to test this, but according to the docs, Amazon will happily bill you tens of thousands of dollars for that single HTTP request.

They explain it really weirdly, but it's not that hard to cap your expenditure if you translate it into bandwidth units rather than storage units, since that's what's really being billed. The footnote on the main Pricing page is the most misleading part of their explanation:

If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.01 per gigabyte.

From this you might think the retrieval fee is some function of the gigabytes retrieved, but it isn't really, in any direct way: you're charged ((max gigabytes retrieved in any one hour in a month) x (720 hours in the month) x per-GB price), modulo some free quota. So you can really get charged up to "720 gigabytes" for a single gigabyte retrieved, which is a bit silly to think of in terms of per-gigabyte pricing.
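
The formula above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the free quota is reduced to a single hypothetical `free_gb_per_hour` parameter (0 by default), and the 4-hour averaging window for an all-at-once request is an assumption, chosen because it roughly reproduces the "about $4" versus "more like $175" figures quoted earlier in the thread.

```python
HOURS_PER_MONTH = 720   # as in the formula above
PRICE_PER_GB = 0.01     # USD, the quoted starting price


def retrieval_fee(peak_gb_per_hour: float, free_gb_per_hour: float = 0.0) -> float:
    """Monthly fee implied by the peak hourly retrieval rate."""
    billable = max(peak_gb_per_hour - free_gb_per_hour, 0.0)
    return billable * HOURS_PER_MONTH * PRICE_PER_GB


def peak_rate(total_gb: float, spread_over_hours: float) -> float:
    """Peak hourly rate if total_gb is retrieved evenly over the given hours."""
    return total_gb / spread_over_hours


if __name__ == "__main__":
    # 100 GB spread evenly over a week (168 hours).
    print(round(retrieval_fee(peak_rate(100, 168)), 2))  # 4.29
    # 100 GB requested at once, ASSUMING it is averaged over 4 hours.
    print(round(retrieval_fee(peak_rate(100, 4)), 2))    # 180.0
```

The second figure is before any free quota is subtracted, which is why it lands slightly above the $175 mentioned earlier.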

It makes more sense to me if you think of it as priced by your peak retrieval bandwidth. Then you can cap your expenditure by just rate-throttling your retrieval. Make sure to throttle the actual requests, using HTTP range queries on very large files if necessary (many Glacier front-ends support this)... throttling just your network or router will not have the desired effect, since pricing doesn't depend on download completion.
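
One way to throttle the requests themselves is to plan the byte ranges up front and issue one range per hour. The sketch below only produces the schedule; it does not call Glacier, and the function name is hypothetical. Glacier may impose alignment rules on ranges, so check the API documentation before using offsets like these.

```python
def throttled_ranges(archive_bytes: int, max_bytes_per_hour: int):
    """Yield (hour_index, first_byte, last_byte), one range per hour.

    Byte offsets are inclusive, matching HTTP Range semantics.
    """
    if max_bytes_per_hour <= 0:
        raise ValueError("max_bytes_per_hour must be positive")
    start = 0
    hour = 0
    while start < archive_bytes:
        end = min(start + max_bytes_per_hour, archive_bytes) - 1
        yield hour, start, end
        start = end + 1
        hour += 1


if __name__ == "__main__":
    GIB = 1024 ** 3
    # A 10 GiB archive, capped at 1 GiB of requested data per hour.
    for hour, first, last in throttled_ranges(10 * GIB, 1 * GIB):
        print(f"hour {hour}: bytes={first}-{last}")
```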

Anyway, the retrieval pricing in units of Mbps is ~$3/Mbps, billed at the peak hour-average retrieval rate for the month above the free quota. If you use that as a rule of thumb it's not bad for personal use. For example, given my pipe, it's perfectly reasonable for me never to pull down more than 10 Mbps anyway, in which case my max retrieval costs are about $30/month.

Yeah, that's a good summary.

Another way to look at it is that if you retrieve data as fast as possible -- say, to S3 or EC2 -- it costs about $1.80/GB, minus the free quota. So for large objects, it's cheaper to store them in Glacier than S3 if you don't expect to need them in a hurry any time in the next 18 months. The free retrieval quota starts making a difference if your typical "ASAP-retrieval" size is less than 0.1% of your total archived data.

OP here. Really cool to see people enjoying the write up.

It started off as a list of misc AWS topics that I found myself repeatedly explaining to people. It seemed like a good idea to write them down.

I'm planning on listing out more in follow up posts.

Great write-up! Could merge some useful thoughts from here too: https://news.ycombinator.com/item?id=6991974

I'd certainly contribute to this if it was on Github :)

It'd be great if you could contribute to Ops School (http://www.opsschool.org/en/latest/introduction.html)!

Fantastic post with really helpful tips. Thanks a lot for that.

I really want to read the details on the spot instance setup for a web service. Do you have a quick summary for that?

> Fantastic post with really helpful tips. Thanks a lot for that.


> I really want to read the details on the spot instance setup for a web service. Do you have a quick summary for that?

I don't have anything written up to point to, but the idea is to have the spot instances dynamically register themselves with an ELB on startup. As long as they stay up (i.e. the spot price is below your bid) you get incredibly cheap scaling for your web service. The ELB will automatically kick out instances that get terminated, as they will fail their health checks. Combine this with a couple of regular (i.e. non-spot) instances and you get a web service that scales cheaply when spot prices are low and gradually degrades the QoS for your users when it gets more expensive.
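
A rough sketch of the self-registration step, not the poster's actual setup. It assumes a client that behaves like `boto3.client("elb")` (the classic ELB API), a load balancer name chosen by you, and an instance metadata service that answers plain unauthenticated requests; instances that require session tokens for metadata would need an extra step.

```python
import urllib.request

# EC2 instance metadata endpoint for the instance's own ID.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def current_instance_id(timeout: float = 2.0) -> str:
    """Ask the metadata service which instance this code is running on."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("ascii")


def register_with_elb(elb_client, load_balancer_name: str, instance_id: str) -> None:
    """Add the instance to a classic ELB.

    elb_client is expected to behave like boto3.client("elb").
    """
    elb_client.register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name,
        Instances=[{"InstanceId": instance_id}],
    )
```

Run from a boot script, this would look like `register_with_elb(boto3.client("elb"), "my-web-elb", current_instance_id())`, where "my-web-elb" is a placeholder name. No deregistration step is shown because, as described above, failed health checks remove terminated instances.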

I've got a couple of other topics in the works first, but I should have the write-up for that one up pretty soon too. Haven't gotten around to adding an email signup form to my site yet (need to do that...) so shoot me a mail if you want to be notified when the full post is up. Email is in my profile.

Great write-up. Thank you for sharing. I like the billing triggers. In hindsight very obvious. Good to learn it without getting burned. :-)

This is all good stuff. AWS takes a while to grok but once you do, it offers so many new possibilities.

The Aha! moment for me came when playing with SimianArmy, the wonderful Netflix OSS project and in particular, Chaos Monkey.

Rather than build redundancy into your system, build failure in and force failures early and often. This will surface architectural problems better than any whiteboard.


Also, check out boto and aws cli.



I think I saw a tweet along the lines of "don't think of EC2 instances as you would servers."

The point is, architecting for the cloud is all about treating your infrastructure as very transient and failure prone.

Hopefully stuff doesn't fail but architecting for that gives you a much better application.

Netflix are my favourite tech company based on their open source distributed system stuff. Check out Hystrix and their circuit breaker implementation. I am going to lean on that heavily in an upcoming project such that entire application tiers can disappear with graceful degradation.
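
For readers who have not met the circuit breaker pattern, here is a deliberately small sketch of the general idea. It is not Hystrix's implementation (Hystrix is a Java library with far more features); the class name, thresholds, and the injectable clock are all choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would catch `CircuitOpenError` and serve a fallback (cached data, a reduced page), which is where the graceful degradation comes from.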

I for one think that S3's server side encryption is amazing for most if not all data. We back up hundreds of gigabytes of data every day to S3, and enabling server side encryption means we don't have to worry about generating, rotating, and managing keys and encryption strategies. It also saves us time since we don't have to compute the encryption or the decryption. The best part is that the AWS AES-256 server side Encryption at Rest suffices for compliance.

Of course, the data that we store, while confidential, isn't super-mission-critical-sensitive data. We trust AWS to not peek into it, but nothing will be lost if they do.

I suppose it's OK for some use cases but realistically it doesn't provide a whole lot of security.

Doing this only really protects you if someone steals the physical disks from inside AWS. Whilst this is a legitimate risk, it seems incredibly unlikely and I'm not aware of any such instance.

HOWEVER, it does not protect you from the far more likely scenario of someone gaining access to your AWS account. If someone is able to get your credentials (or otherwise hack your account) then all your data is vulnerable.

In a lot of cases your AWS account being compromised is going to open you up to having any encryption keys you use for GPG encryption stolen too anyway. So it seems that there's a decent chance that even GPG encryption is just protecting you from the "disks stolen from the datacenter" risk.

At the end of the day your AWS account is often the "master key", so make sure you use a good password, rotate it, and enable 2-factor auth.

How's that? Say you set up duplicity on your AWS instances. They only need access to the public encryption key and private signing key -- the session (actual encryption) keys will be destroyed after a backup is made. Now, to restore you'll need to provide access to the public signing key (this is public, so no problem), and private encryption key. This should only ever be on-line and decrypted in case of a restore.

Of course, if you set things up differently, then, yes. Compromised by design.

Well, clearly you wouldn't keep your keys in S3. If you were to keep them on an EC2 instance, for example, then as long as you had proper SSH access configured, the servers would still be secure even if someone had access to your AWS account.

That then makes your local SSH key the last link in the chain (and OpenSSL of course, and that you trust AWS is actually encrypting the data).

What kind of "encryption-at-rest" compliance industry or standard are you dealing with?

I'm curious as to where what AWS supplies suffices. As the encryption and decryption is, AFAICT, automatic and transparent, aren't you merely protecting your data from someone at AWS running off with it?

I.e. if your system that does the API calls is somehow exploited, the attacker has your AWS keys and so can get to the encrypted data. Or if there's a programming error in your data access routines that check whether user X can access data Y.

Perhaps I'm reading too much into what "encrypted data at rest" really is supposed to provide though (i.e. just data encrypted, key stored on a different system)

Just interested in what compliance you are referring to? We are looking at options for storing medical data in the cloud and I understand that Amazon can be used so as to conform with HIPAA [http://en.wikipedia.org/wiki/Health_Insurance_Portability_an...].

Yes, it can be used, but it needs some work up front. Like all compliance schemes it is not entirely devoid of effort. There used to be a whitepaper outlining the basics somewhere on the AWS site, IIRC. There are also a number of companies providing HIPAA-compliant cloud services at various tiers, and some of the offerings are backed by AWS. So it's possible and it's been done already.

Yes, I appreciate that a service like AWS can't give you HIPAA or 21 CFR Part 11 compliance, only supply tools that can be used that way. What I'm looking for are examples (e.g. white papers) that describe how people have done it, preferably having been audited too.

Cloud66 is a good cautionary tale of refusing the temptation to double-dip, when it comes to your AWS tokens:



I know this is about AWS, but it might be helpful to mention that DreamObjects (http://www.dreamhost.com/cloud/dreamobjects/) is an API-compatible cheaper alternative to S3.

Very good article, thanks. I did not know about S3 temporary Urls, might use that in the future.

Yes they're a really useful feature. You can use them to allow both downloads and uploads. The latter is particularly cool for creating scalable webapps that involve users uploading large content.

Rather than having them tie up your app server resources during what could be a long/slow upload, they upload directly to S3, and then your app downloads the object from S3 for processing. If your app is hosted on EC2 you can even have the entire process be bandwidth free as user-to-S3 uploads are free and S3-to-app-server downloads to EC2 are free. Just delete the object immediately after use or use a short lived object expiration for the entire bucket.
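
The direct-to-S3 upload flow starts with the app handing the browser a short-lived signed URL. A minimal sketch, assuming a client that behaves like `boto3.client("s3")`; the function name and the 15-minute default expiry are illustrative.

```python
def presigned_upload_url(s3_client, bucket: str, key: str,
                         expires_seconds: int = 900) -> str:
    """Return a time-limited URL that permits a PUT to bucket/key.

    s3_client is expected to behave like boto3.client("s3").
    """
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )
```

The client then sends the file with an HTTP PUT to that URL, and the app server fetches the object from S3 afterwards. Swapping "put_object" for "get_object" gives the download variant.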

I've made a note to add an example of this to my next post on this topic.

Also useful for authenticated downloads: don't tie up an app thread, but have your app redirect to a pre-signed URL.

Question about the underlying SaaS product. I'm not understanding how a database client on the cloud can connect to a database server on my own machine?

Or am I misunderstanding what it does?

> Question about the underlying SaaS product. I'm not understanding how a database client on the cloud can connect to a database server on my own machine?

OP and founder of JackDB[1] (the SaaS product referred to) here.

The majority of people using JackDB connect to cloud hosted databases such as Heroku Postgres or Amazon RDS. As they're designed for access across the internet, no special configuration is necessary for them.

To connect to a database on your local machine or local network you would need to set up your firewall to allow inbound connections from our NATed IP. More network details available at [2].

[1]: http://www.jackdb.com/

[2]: http://www.jackdb.com/docs/security/networking.html


Isn't that dangerous? I wonder if they would be better off having users install something locally that goes between the DB and the client?

If you're running an application that runs on more than one server it's definitely worth checking out AWS OpsWorks. It's a huge time saver and extremely useful in integrating and managing setup, configuration, and deployment across a server/db/cache etc without any loss of control or customization.

So, in older reddit threads, I read about how you need to build RAID-1 across EBS volumes for all your EC2 servers, as well as test your EBS storage, because the volumes could be really bad.

Is anybody doing this for their EC2 deployments and more importantly, automating this?

Does anyone have experience with Reduced Redundancy Storage on S3?

How often do you lose files? Do you run a daily job to check in on them?

I've been using S3 RRS for a few hundred thousand files, and have yet to lose one over the course of the year that I've stored them.

You don't run a daily job though, you simply set up an event on SNS to notify you if a file is lost[0].

[0]: http://docs.aws.amazon.com/AmazonS3/latest/UG/SettingBucketN...
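
For those who prefer the API to the console, the notification described above might be set up roughly like this. The sketch assumes a client that behaves like `boto3.client("s3")` and an existing SNS topic whose policy allows S3 to publish to it; the function names are hypothetical.

```python
def lost_object_notification(topic_arn: str) -> dict:
    """Build a bucket notification config for lost RRS objects."""
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                "Events": ["s3:ReducedRedundancyLostObject"],
            }
        ]
    }


def enable_lost_object_alerts(s3_client, bucket: str, topic_arn: str) -> None:
    """Attach the notification to a bucket.

    Note that this call replaces the bucket's existing notification
    configuration rather than adding to it.
    """
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=lost_object_notification(topic_arn),
    )
```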

Excellent, thank you so much!

Thanks for the http://www.port25.com/ tip.

The first tip should be to use something cheaper and faster.

Such as?

I was on AWS for over a year. I moved to Digital Ocean. Much cheaper and much faster.

Except for the fact Digital Ocean doesn't take security seriously at all. It's good for testing small projects but to run a business on, no thanks. DO has a very small fraction of the features available that AWS does.

And DigitalOcean doesn't have ELBs, or S3, or DNS service checks. Need I go on? Digital Ocean is good for small projects, cheap projects, or anything where you only need semi-reliable compute power.

A motorbike is faster and cheaper than a car, but you can't use it to carry as many people. Or cargo. Or protect you from the elements.

If you're looking for cheaper and faster then http://ramnode.com/ is the VPS you want.

wow, so here we are at the bottom of the page, below the grayed-out guy, but heh, ramnode looks cool - $2 per month!!! That's perfect for goof-off domains, silly projects, sandboxes and whatnot. Thanks!

I've been using them for backup DNS. Great uptime, fast seeks, and I only need enough RAM for NSD to cache my zonefiles.

Looking to move more serious projects on, but we'll see.

DO is not an alternative to AWS, it is an alternative to EC2. If you want to tell people not to use AWS, you have to provide an alternative to AWS. Even Google's AWS-like offering is missing tons of things lots of AWS customers rely on.

Windows Azure, but it doesn't offer everything AWS has (though it does offer some stuff AWS doesn't have).

It's not cheaper though.
