AWS Tips I Wish I'd Known Before I Started (wblinks.com)
606 points by richadams on Feb 3, 2014 | 149 comments

Fantastic list with much more depth than I expected. Some surprises that others might be interested in from this article and comments below:

  [1] Keeping buckets locked down and allowing direct client -> S3 uploads
  [2] Using ALIAS records for easier redirection to core AWS resources instead of CNAMES.
  [3] What's an ALIAS?
  [-] Using IAM Roles
  [4] Benefits of using a VPC
  [-] Use '-' instead of '.' in S3 bucket names that will be accessed via HTTPS.
  [-] Automatic security auditing (damn, entire section was eye-opening)
  [-] Disable SSH in security groups to force you to get automation right.

[1] http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlU...

[2] http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Cre...

[3] http://blog.dnsimple.com/2011/11/introducing-alias-record/

[4] http://www.youtube.com/watch?v=Zd5hsL-JNY4

I like SSH. But I'm the founder of Userify ;) http://userify.com

Also, S3 buckets cannot scale infinitely. This is a huge myth http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tip...

I'd also add to the list - make sure that AWS is right for your workload.

If you don't have an elastic workload and are keeping all of your servers online 24/7, then you should investigate dedicated hardware from another provider. AWS really only makes sense ($$) when you can take advantage of the ability to spin up and spin down your instances as needed.

The startup I'm working for has minimal scaling required but we still use AWS despite the higher cost for EC2 because the broad ecosystem of AWS products makes it easier to develop interesting things quickly and efficiently.

If we went with all of our own dedicated hardware, or cheaper instances from a different cloud provider then we'd miss out on ELB, have slower and more expensive communication to and from S3, not to mention that services like Elastic Beanstalk make deploying to EC2 instances very easy compared with rolling your own deployment system. And for those who don't want to bother with administrating databases and cache machines RDS and Elasticache are going to be cheapest and fastest if your instances are EC2.

So yeah I agree that EC2 is expensive, but the benefits of living fully within the Amazon ecosystem are pretty large.

I think of AWS as a tool for prototyping and early-stage outsourcing of your infrastructure. Use it when you're fighting past market and technology risk with a single-digit team; consider dropping it in order to optimize costs when you've got more people (including fractional people) to evaluate, configure, and operate alternatives.

Isn't relying so much on AWS like "putting all your eggs in one basket"?

AWS != EC2. Do not assume that AWS is only used as VMs. It's different for everyone but AWS provides massive savings for many companies. It makes sense for many use cases in addition to elastic workloads.

True, but outside of S3 and Route53, how much under the AWS umbrella is much use without using at least one EC2 instance?

I can see a lot of benefit to using S3 without EC2, but after that, I'm not sure what else would be possible. Care to elaborate more?

Can you use their queues and database tools w/o using EC2? (If you are using a VPC, maybe?)

Not sure why it has to be without EC2 to be useful. The main benefit is indeed the fact that you can have VMs, queues, load balancers, databases, DNS, cache, etc. all using consistent APIs. You can create dev/test environments that are as close to prod as possible at a fraction of the cost.

In our case (and I'm pretty sure we're not on the fringe), when you consider all costs including the administrative overhead, EC2 costs are not a significant portion of the total. Just to give an idea, for us, even if someone had offered VM hosting for free, it would not be cost effective for us to move. For a company that has very high processing requirements it could be a different story. I just wanted to mention that the value added by services in addition to EC2 is sufficiently high that even if EC2 costs are higher than the alternatives (and they are higher), AWS can still provide significant savings.

Time of a highly skilled dev/ops person is (very) valuable, and not something we can buy more of easily. Anything that saves us time or lets us implement faster pays for itself pretty easily. If you don't have a massive EC2 bill, chances are AWS overall is a good proposition.

Considering just how much more expensive AWS is, I have my doubts on this. AWS also really likes "hiding" costs, like bandwidth costs. It's not obvious that most sites will go over the bandwidth limit and pay extra, whereas for sites on VPS providers, this won't happen. In reality they have a billion different line items on every bill, and most are not mentioned on their pricing pages.

Expect to pay a LOT more than you'd expect to pay from the pricing page.

I don't get the whole line of reasoning : either your project is not important and having weekly or monthly backups is certainly sufficient. In such a case, having a script to backup a VPS is by far the most cost efficient way to run the project.

If your project is important enough to have a complex AWS setup, you have an admin anyway.

"But you can trust amazon". Well, no : http://www.businessinsider.com.au/amazon-lost-data-2011-4 (note that disasters on VPS/dedicated servers are also rare)

If you're making complex AWS setups because "it's cool", then by all means, but keep in mind that you're paying a lot for the privilege. Don't do this on production.

I can only give you a datapoint. This has not been the case in our case. We never paid more than what we expected.

If you think VPS providers are an alternative to AWS, you have an entirely different use case than ours. For an application with high availability requirements, message queues, a scalable data store, etc., admin work is not "monthly backups". Implementing and maintaining highly available load balancing, message queues, data stores, DNS, auto scaling, etc. all require a lot of work, and often skills that you may not have in the team. AWS makes these tasks easier. You still have to do admin work, but much less of it, and you don't have to be an expert at each one. That's worth a lot.

"Can you use their queues and database tools w/o using EC2?" Yes, you very easily can. RDS is a simple setup, and so is SQS (queuing). The same goes for Dynamo and Redshift.

Are you really going to tolerate 10+ ms of latency between a database and an app server? I have never personally tried it, but it seems like it would be a disaster.

I just tested pinging our public RDS instance (non-VPC) from one of our EC2 instances that is loading data every 1-3 minutes from S3 objects into that RDS instance from within a VPC. 3-5ms ping times. More than acceptable.

It's not 10ms. Halfway through a staged AWS migration now and the latency has really not been an issue. This is from a London data centre to AWS Ireland.

At most it's been 4 to 6 ms. And that has coped with our heaviest load.

Heh, heck my db and app live in the same data center, and the latency is more than 30-40 ms.

Really? That seems enormously too slow. Even over the internet, I'm within 5-10ms of most things in the bay area.

If you have a dozen slow (100us) switches, three slow (2ms) routers, and 500 kilometers of fibre (2ms), you should still be under 10ms each way. And that's enough gear to cross some countries.
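The arithmetic in that estimate is easy to check. A rough sketch, using only the per-hop figures quoted in the comment (they are the commenter's estimates, not measurements; the function name is just for illustration):

```python
# One-way latency budget using the figures quoted above (all in milliseconds).
SWITCH_MS = 0.1   # a "slow" switch: 100 microseconds
ROUTER_MS = 2.0   # a "slow" router
FIBRE_MS = 2.0    # 500 km of fibre, as estimated in the comment

def one_way_latency_ms(switches, routers, fibre_ms):
    """Sum the per-hop delays plus the propagation delay over the fibre."""
    return switches * SWITCH_MS + routers * ROUTER_MS + fibre_ms

total = one_way_latency_ms(switches=12, routers=3, fibre_ms=FIBRE_MS)
print(f"{total:.1f} ms one way")  # 1.2 + 6.0 + 2.0 = 9.2 ms
```

So even with that much gear in the path, the budget stays just under 10ms each way.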

I'm curious how you can do much worse than this in a data center. Maybe some pokey "application firewall" or massive buffer bloat?

You're moving the goalposts. First you said:

AWS really only makes sense ($$) when you can take advantage of the ability to spin up and spin down your instances as needed.

After it's pointed out that other facilities besides EC2 motivate people to use AWS, you can't turn around and complain that those other facilities are useless without EC2. Even if that were true (it's not), such people would be using EC2 not for its load-scaling, but rather to enable their use of the non-EC2 facilities.

Who's moving goal posts? After it was (correctly) pointed out that AWS!=EC2 (even though most of the original post was talking about EC2), I asked a simple question - how useful are the rest of the services outside of EC2? It is a legitimate question.

I'm not sure how useful everything in the AWS umbrella is outside of EC2. I was hoping to learn something, not provoke anyone.

I guess I misunderstood you; no worries!

The problem here is that you're classifying everything AWS does as "EC2", which then causes confusion when someone else comes along and says "vendor X does everything EC2 does for half the price", and they mean solely the VM-related stuff, not the additional array of services.

That's a problem with the comments for this whole post. AWS means a lot of different things to different people. While I focused on EC2, others focused on RDS or S3 or ELB, etc... it's easy to get off on a tangent.

When everything is covered under the AWS umbrella, it's easy to argue from any side you want.

I still think that it's important to consider whether or not AWS (for all values of AWS) correctly fits your workload. You may not need the ability to spin up servers within minutes, or have multi-datacenter redundancy in a database, or have virtually unlimited storage, or a robust queuing system. A lot of great engineering has gone into all of the AWS products, and for many instances it is probably overkill. And that can cost you a lot if you don't know what you're doing.

AWS makes a lot of sense in a 24/7 environment, particularly when you are a new startup and don't have enough information (or capital!) to make educated server purchases.

They don't have to make any purchasing decisions. They can rent servers from companies like Softlayer and Rackspace (#3 and #4 behind AWS for YC startups), or spin up much cheaper VPS's (Linode's #2). We're talking $120/month commitments, not buying hardware and driving to a data center to install it. Deploying to a freshly imaged physical server is the same as deploying to EC2, and they can be provisioned for you in an hour or two. Each of those servers gets you many times the performance of an EC2 instance in the same price class, which means much more time to figure out your capacity needs as you grow.

As someone who has worked w/ AWS (and Rackspace) for several years with multiple startups...

Unless they have dramatically improved their offering in the last couple years, an hour or two from "I need a new server" to delivery is 1) not an accurate timeframe for physical servers from Rackspace and 2) even if it was realistic, that's an eternity when you are trying to iterate quickly. I can have a new server in 30 seconds with AWS and in the course of an hour could have tested my automation tools half a dozen times or more vs reimaging a server over and over again.

I'm not saying it's always the right choice or that it's cheaper, but that flexibility combined with some of the pre-canned tools (ELB, RDS, CloudWatch, SQS, SNS) has tremendous value even when you aren't autoscaling.

You've completely moved the goalposts; "testing automation tools" where your servers live for a few minutes then get destroyed is not the same as "a 24/7 environment". I didn't suggest renting servers to do ephemeral testing.

I only have experience renting physical hardware from Softlayer. They have built and imaged new servers for me in under 2 hours a dozen times, day and night. They also have a "xpress servers" line with guaranteed <2hr delivery. They also let you reimage your servers through their control panel or API; you don't need to have a new one built just to get a fresh disk image.

You speak as though ops and automation is too mysterious for a startup to handle. There are so many tools and frameworks that do what AWS does that it's easy to acquire that expertise. And who says you always need a new server to iterate quickly?

I moved my last company completely off AWS and it was proven to be a great decision across a number of dimensions.

Can you list some of those tools and provide an estimate of how long they take to configure and how much day-to-day support they require?

Sure. Pop in opscode chef. Took me a weekend to write the basic framework, 3 weeks to make it solid. The 3 weeks more than paid for itself and hosting the config servers with them is a couple hundred a month. I could've hosted it myself too. This includes support for things like a load balancer, heterogeneous nodes (db, app, cache, chat, etc).

Ansible, puppet, sprinkler, and the like would take a similar amount of time to configure.

A bunch of Chef cookbooks does not AWS make. Configuration management tools are of course a necessity in AWS but do not replace their offerings.

I'm very impressed that you were able to build in 3 weeks time a low-latency multi-data center application with master/slave database failover, robust fault tolerant load balancing, and backups that can be restored in minutes with an API to control all of those services. That would normally take a senior team of engineers several months to accomplish and have it be of the quality and reliability of the services provided by AWS.

More likely is that you had a use case or a mindset that did not suit AWS very well and was easy to implement on your own. That's awesome and I'm glad you were able to find better value elsewhere. AWS is not for everyone, and is definitely quite expensive on the pocketbook.

had master slave, had fault tolerant load balancing, had backup scripts, tested restore procedures. also had node upgrade procedures, and more.

it was more than just chef obviously but chef + any bare metal host environment gets you a large percentage of the way there. Tacking on specific aws services like route53 when necessary works too.

That's a great accomplishment in such a short period of time. I'm glad that you were able to save time and money by using another service. Thankfully, there are tons of options out there to suit every business' use cases.

I don't really follow you here; AWS is infrastructure. So all the chef/puppet stuff has to happen anyway.

The benefit of AWS is that you have some immediate bootstrap, and simple auto scaling. This last is a killer feature. Being able to scale your caches and load balancers silently, based on metrics, is a real time and money saver.

Sure. You need to have a level of scale to need that :) but when you do, AWS can have some good features.

When I said AWS I was speaking more about the entire ecosystem which includes a load balancer, databases with failover, snapshotting, backups, dns lookups, and more.

So use AWS or Linode for that specific use case. Great, that's its strength.

Once it's clear that the new server will be needed long-term, transition to dedicated hardware to save money and get better performance.

DigitalOcean - Same or better speed, 1/6 or less of the price of EC2.

DO has a very limited API, no ability to add additional storage without resizing your droplets, and has no firewall protection without iptables being enabled. DO has its uses (I'm migrating my personal server there presently), but it's not even close to being in the same market/caliber as EC2.

For some businesses, the huge AWS feature set (RDS, EBS, ELB, security groups, VPC, EIPs, etc, etc, etc) is more valuable than the bottom $$ line. For others, those features aren't worth the added cost, but hand-wavey "just use DO" or "Just rent physical servers from SoftLayer/Rackspace" is disingenuous.

TL;DR AWS has value above and beyond simply hour-to-hour elasticity.

Hey, I've started to learn about devops and systems administration recently and I've learned a ton in this thread and this article, so thanks for that and thanks to everyone else.

But do you know of any good resources to learn about those two things? And I'm talking about basic devops, before you even start worrying about automating, and the things you would actually automate–because I don't know what it is I should be doing in the first place.

Things like what you should be doing right after you SSH into your server, how to make your server secure, how to use nginx, chmod'ing permissions of files correctly, and things I don't even know about.

Is there a One Month Rails or Michael Hartl's RoR Tutorial for devops/sys admin?

Regardless, thanks for taking the time to read this :)

AWS is PCI compliant. http://aws.amazon.com/compliance/pci-dss-level-1-faqs/

DigitalOcean and most others are not.

AWS's security and API is lightyears ahead of everyone else.

Yes, this is all true. But the question remains - how much is that worth to you? For some it will be mandatory. For others (particularly startups), not as much.

If said startup is storing any personal data, it better be important.

I don't see how that makes sense.

If all you need is a server that is up 24/7, rent it by the month. You don't need information to make an educated choice, since they are pretty much all cheaper than EC2.

With the additional caveat that you have tons of money sure. With 24/7 load, you'd probably pay a 10x premium to use AWS.

Reserved Instances really drive the cost down quite a bit, but yes, it's expensive.

> particularly when you are a new startup and don't have enough information (or capital!) to make educated server purchases.

I doubt there are many founders who are technically informed enough to know about Amazon Web Services, but don't know about the other big 3 (Digital Ocean, Linode, Rackspace). If you truly don't, then you must not be a tech company, and I have a hard time believing a non-tech company without any technical founders would even know about AWS.

I've seen places where they had an Amazon instance up 24/7 just sitting there idle, while they paid hundreds a month. Just because they occasionally needed a server for data processing. Someone told them that it was fast to spin up an AWS instance, so they did. But no one told them how to manage it, and the people doing the work were not very competent at system administration. So, they wasted thousands on idle time, just because no one told them how to snapshot their EBS volumes and terminate an instance.

More people know about AWS than you may realize.

DO, Linode, and Rackspace have lower bottom line costs, but a (much) smaller feature set which means more operational work. Especially when kicking things off, developer and operational time is often far more valuable than the cost of the servers.

Yeah, but if you start out using too many of the AWS tools, you're far more likely to get trapped in the AWS infrastructure and end up paying significantly more in the long term (which is what they want!). I'm not saying that AWS doesn't have useful features, but you need to appreciate the costs of these things before starting. If you need to spend a bit more time on devops in the beginning, then so be it. If you're starting a company, there had better be some good reasons behind using AWS, aside from "because faster iteration, developer time!".

Specifically - what AWS features do you find to be useful at the beginning? You seem to have some specific use-cases in mind. I'm legitimately curious.

Depending on which tools you're using you'll have to do a bit of work to migrate out of AWS, but their tools are usually pretty standard stuff managed for you.

For example, RDS is just a database instance with management. You'd have to invest some time to replicate what AWS already does for you, but it's not rocket science. The same goes for Elasticache, Autoscaling, ELB, and most of their other services.

This is off-topic, but because I've just recently started my foray into devops/systems administration do you know any lists of good resources to learn about those two things? I'd love to see a guide like this [1] except that's for analytics.

Things like what you should do right after you SSH into a server, how to make your server secure, setting up nginx, chmod'ing permissions of files correctly, and things I don't even know.

Regardless if you get back to me, I appreciate you taking the time to read this :)

[1] https://segment.io/academy/

Absolutely, great point! AWS isn't for everyone, and there can be lots of cases where it's cheaper to use dedicated hardware. Shop around before jumping in. I've added this as a new tip at the end of the article (crediting you of course). Thanks!

One thing the article mentions is terminating SSL on your ELB. If you want more control over your SSL setup AND want to get remote IP information (e.g. X-Forwarded-For) ELB now supports PROXY protocol. I wrote a little introduction on how to set it up[0]. They haven't promoted it very much, but it is quite useful.

[0]: http://jud.me/post/65621015920/hardened-ssl-ciphers-using-aw...
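For anyone wondering what the backend actually sees once PROXY protocol is enabled: the load balancer prepends one human-readable line to each connection, carrying the original client address. A minimal sketch of parsing that line, assuming the version 1 text format (`PROXY TCP4 <src> <dst> <srcport> <dstport>` followed by CRLF); `parse_proxy_v1` and `ProxyInfo` are illustrative names, not part of any library:

```python
from typing import NamedTuple

class ProxyInfo(NamedTuple):
    protocol: str
    client_ip: str
    client_port: int
    dest_ip: str
    dest_port: int

def parse_proxy_v1(line: bytes) -> ProxyInfo:
    # Expected shape: PROXY TCP4 198.51.100.22 203.0.113.7 35646 80 + CRLF
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    # This sketch only handles TCP4/TCP6; it rejects everything else.
    if len(parts) != 6 or parts[0] != "PROXY" or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("not a PROXY v1 TCP header")
    _, proto, src_ip, dst_ip, src_port, dst_port = parts
    return ProxyInfo(proto, src_ip, int(src_port), dst_ip, int(dst_port))

info = parse_proxy_v1(b"PROXY TCP4 198.51.100.22 203.0.113.7 35646 80\r\n")
print(info.client_ip, info.client_port)
```

In practice your web server or TLS terminator does this parsing for you once you tell it to expect the header; the sketch is only meant to show how little is involved.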

Great post, I had no idea you could do this with ELB. I've added your link to the additional reading list in my post, thanks for sharing!

Be very careful with assigning IAM roles to EC2 instances. Many web applications have some kind of implicit proxying, e.g. a function to download an image from a user-defined URL. You might have remembered to block 127.0.0.*, but did you remember [...]? Are you aware why [...] is relevant to IAM roles? Did you consider hostnames pointed to [...]? Did you consider that your HTTP client might do a separate DNS look-up? etc.

There are other subtleties which make roles hard to work with. The same policies can have different effects for roles and users (e.g., permission to copy from other buckets).

IAM Roles can be useful, especially for bootstrapping (e.g. retrieving an encrypted key store at start-up), but only use them if you know what you're doing.

Conversely, tips like disabling SSH have negligible security benefit if you're using the default EC2 setup (private key-based login). It's really quite useful to see what's going on in an individual server when you're developing a service.

Also, it does matter whether you put a CDN in front of S3. Even when requesting a file from EC2, CloudFront is typically an order of magnitude faster than S3. Even when using the website endpoint, S3 is not designed for web sites and will serve 500s relatively frequently, and does not scale instantly.

Great point with regards to IAM roles. The applications I've worked on don't download things from user-defined URLs, so this never even occurred to me.

Is blocking [...] important because it could potentially give users access to the instance metadata service for your instance? I'd be interested to hear more information on securing EC2 with regards to IAM roles; you seem to have lots of experience in that area.

The disabling SSH tip wasn't really about security (I agree that it has negligible security benefit), it's more about quickly highlighting parts of your infrastructure that aren't automated. It's often tempting to just quickly SSH in and fix this one little thing, and disabling it will force you to automate the fix instead.

The CDN info has been mentioned elsewhere too, lots of things I didn't know. I'll be updating the article soon to add all of the points that have been made. Thanks for the tips!

The way IAM roles work is terrifyingly simple. IAM generates a temporary access key identifier and secret access key with the configured permissions, and EC2 makes them available to your instance via the instance metadata as JSON at [...]. The AWS SDK periodically retrieves and parses the JSON to get the new credentials. That's it. I'm not entirely sure whether the credentials can be used from a different IP or not, but given a proxying function that does not really matter.
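To make the "retrieve and parse" step concrete, here is a rough sketch of what a client has to do with that JSON. The field names (`AccessKeyId`, `SecretAccessKey`, `Token`, `Expiration`) and the timestamp format are my recollection of the response and should be treated as assumptions; the sample values are made up, and nothing here talks to the metadata service itself.

```python
import json
from datetime import datetime, timedelta, timezone

# Made-up sample document; real values come from the instance metadata.
SAMPLE = """{
  "Code": "Success",
  "AccessKeyId": "EXAMPLEKEYID",
  "SecretAccessKey": "example-secret",
  "Token": "example-session-token",
  "Expiration": "2014-02-03T18:00:00Z"
}"""

def parse_credentials(text):
    """Pull the temporary credentials and their expiry out of the JSON."""
    doc = json.loads(text)
    expires = datetime.strptime(doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "access_key": doc["AccessKeyId"],
        "secret_key": doc["SecretAccessKey"],
        "token": doc["Token"],
        "expires": expires.replace(tzinfo=timezone.utc),
    }

def needs_refresh(creds, now, margin=timedelta(minutes=5)):
    """True once we are within `margin` of the expiry time."""
    return now >= creds["expires"] - margin

creds = parse_credentials(SAMPLE)
print(needs_refresh(creds, datetime(2014, 2, 3, 17, 56, tzinfo=timezone.utc)))
```

The point of the sketch is that anything able to read that document gets working credentials until the expiry time, which is why a proxying function on the instance is a problem.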

I make sure all HTTP requests in my (Java) application go through a DNS resolver that throws an exception if: ip.isLoopbackAddress() || ip.isMulticastAddress() || ip.isAnyLocalAddress() || ip.isLinkLocalAddress()

The last clause captures [...]. Of course, many libraries use their own HTTP client, so it's easy to make a mistake.
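The same check is easy to sketch outside Java. Below is a rough Python equivalent of the four clauses using the standard `ipaddress` module; `is_forbidden` and `resolve_checked` are hypothetical helper names, and the mapping from the Java methods to the Python properties (`isAnyLocalAddress` to `is_unspecified`, and so on) is my own reading rather than anything from the comment above.

```python
import ipaddress
import socket

def is_forbidden(ip_text):
    """Mirror the four clauses: loopback, multicast, wildcard, link-local."""
    # Drop any IPv6 zone index ("fe80::1%eth0") before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return ip.is_loopback or ip.is_multicast or ip.is_unspecified or ip.is_link_local

def resolve_checked(hostname, port=80):
    """Resolve a hostname and refuse it if any resulting address is forbidden."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    for address in addresses:
        if is_forbidden(address):
            raise ValueError(f"{hostname} resolves to forbidden address {address}")
    return addresses

print(is_forbidden("169.254.10.10"))  # an arbitrary link-local address
```

To cover the "separate DNS look-up" problem, the HTTP request would then have to connect to one of the returned addresses directly instead of letting the client resolve the hostname a second time. This sketch does not handle IPv4-mapped IPv6 addresses or redirects, both of which a real filter would need to consider.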

I'm trying to bring my usage of IAM roles down to 0 as a matter of policy. Currently, I'm only using an IAM role to retrieve an encrypted Java key store from S3 (key provided via CloudFormation) and encrypted AWS credentials for other functions (keys contained in the key store). I'd be happier to bootstrap using CloudFormation with credentials that are removed from the instance after start-up.

Thanks for making updates. There are definitely some great tips in there.

I agree. Not to toot own horn too much, but I use SSH for everything, including backups, automated jobs, live log monitoring, etc. disclaimer: founder Userify

> you pay the much cheaper CloudFront outbound bandwidth costs, instead of the S3 outbound bandwidth costs.

What? CloudFront bandwidth costs are, at best, the same as S3 outbound costs, and at worst much more expensive.

S3 outbound costs are 12 cents per GB worldwide. [1]

CloudFront outbound costs are 12-25 cents per GB, depending on the region. [2]

Not only that, but your cost-per-request on CloudFront is way more than on S3 ($0.004 per 10,000 requests on S3 vs $0.0075-$0.0160 per 10,000 requests on CloudFront)

[1] http://aws.amazon.com/s3/pricing/ [2] http://aws.amazon.com/cloudfront/pricing/
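To put numbers on that comparison, here is a back-of-the-envelope sketch using only the prices quoted in this comment (early 2014, so long out of date). The workload of 1,000 GB and 10 million requests is a made-up example, and the calculation ignores volume tiers and any S3-to-CloudFront origin traffic.

```python
# Prices quoted in the comment above (USD, early 2014).
S3_PER_GB = 0.12
S3_PER_10K_REQ = 0.004
CF_PER_GB = (0.12, 0.25)            # cheapest and most expensive region
CF_PER_10K_REQ = (0.0075, 0.0160)

def monthly_cost(gb, requests, per_gb, per_10k_requests):
    """Bandwidth cost plus request cost for one month."""
    return gb * per_gb + (requests / 10_000) * per_10k_requests

gb, requests = 1_000, 10_000_000    # hypothetical workload
s3 = monthly_cost(gb, requests, S3_PER_GB, S3_PER_10K_REQ)
cf_low = monthly_cost(gb, requests, CF_PER_GB[0], CF_PER_10K_REQ[0])
cf_high = monthly_cost(gb, requests, CF_PER_GB[1], CF_PER_10K_REQ[1])
print(f"S3: ${s3:.2f}  CloudFront: ${cf_low:.2f} to ${cf_high:.2f}")
```

With these inputs S3 comes to $124.00 and CloudFront to between $127.50 and $266.00, so at this scale the CDN is a performance decision rather than a cost saving.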

Doh, I feel stupid now. I only looked at bandwidth costs, not the request prices. That's what I get for editing my post late at night based on reading, instead of based on personal experience.

For low bandwidth, you're absolutely right, the costs are at best the same. For high bandwidth however (once you get above 10TB), CloudFront works out cheaper (by about $0.010/GB, depending on region). But that wasn't taking into account the request cost, which as you point out, is more expensive on CloudFront, which can negate the savings from above depending on your usage pattern.

I'll update my post accordingly, thanks for pointing this error out!

You do have to pay for S3 to CloudFront traffic, so really you're paying twice. (Although the S3 to CF traffic might be cheaper than S3 to Internet, according to the Origin Server section on the Cloudfront pricing page.) http://aws.amazon.com/cloudfront/pricing/

Also, S3 buckets cannot scale infinitely. They have to have their key names managed appropriately to do it. http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tip...
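One common way to "manage key names" is the article's own tip of putting a random-looking string at the start of each key. A minimal sketch, assuming a short hash of the key is an acceptable stand-in for a random string (my substitution: a hash can be recomputed from the original name, a random string has to be stored somewhere); `prefixed_key` is a hypothetical helper:

```python
import hashlib

def prefixed_key(key, prefix_len=4):
    """Prepend a few hex characters derived from the key itself."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"

for name in ("logs/2014-02-03/0001.gz", "logs/2014-02-03/0002.gz"):
    print(prefixed_key(name))
```

The trade-off is that you lose the ability to list keys in their natural order, so you need some other index of what you stored.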

Finally :) I like SSH. But I'm the founder of Userify! http://userify.com

Lots of very useful tips there!

There's one that I think could be improved on a little:

    Uploads should go direct to S3 (don't store on local filesystem and have another process move to S3 for example).

You could even use a temporary URL[0,1] and have the user upload directly to S3!

[0]: http://stackoverflow.com/questions/10044151/how-to-generate-... [1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlU...

I have a desktop client that requests one-time upload URLs from my server via an API. Later they get downloaded and processed somewhere else - never actually touching my web server.

Even cooler I think if you need a lot of file uploads (and potentially organized into their own folders) is letting your customer connect to a WebDAV interface with their system file browser, then they can just drag and drop whatever. (https://code.google.com/r/1meref-sabredav-amazons3/)

Wow, didn't know about pre-signed URLs, very useful. I've added this info to my article, thanks!

I've always seen issues pushing objects directly to S3 from a browser using CORS. YMMV.

You can specify CORS headers for S3, or you can just use a standard form POST.

You still need a stub API for generating the signature to sign the upload requests to S3, correct?

Not technically, but generally in practice. You can open things up permissions-wise but run the risk of folks uploading lots of large files. Keeping permissions locked down and doing a signature allows things like file size, location, etc. restrictions.
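A rough sketch of what such a stub might compute for a browser form POST: a policy document listing the restrictions, base64-encoded and signed with the secret key. This is modeled on the legacy (Signature Version 2) form-upload flow as I recall it; the exact fields and the HMAC-SHA1 signing step are assumptions to verify against the current S3 documentation, which uses Signature Version 4. The function name, bucket name, and key below are made up.

```python
import base64
import hashlib
import hmac
import json

def sign_upload_policy(secret_key, bucket, key_prefix, max_bytes, expiration):
    """Return (base64 policy, base64 signature) for a restricted upload."""
    policy = {
        "expiration": expiration,  # ISO 8601 UTC, e.g. "2014-02-04T00:00:00Z"
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],     # restrict location
            ["content-length-range", 0, max_bytes],  # restrict file size
        ],
    }
    policy_b64 = base64.b64encode(json.dumps(policy).encode("utf-8")).decode("ascii")
    digest = hmac.new(secret_key.encode("utf-8"), policy_b64.encode("ascii"), hashlib.sha1).digest()
    return policy_b64, base64.b64encode(digest).decode("ascii")

policy_b64, signature = sign_upload_policy(
    "example-secret", "example-bucket", "uploads/", 10 * 1024 * 1024, "2014-02-04T00:00:00Z"
)
print(signature)
```

The server hands the policy and signature to the client, which includes them as form fields in the upload. The secret key never leaves the server, and uploads that break the size or location conditions are rejected.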

Good article, but I think it touches too little about persistence. The trade-off of EBS vs ephemeral storage, for example, is not mentioned at all.

Getting your application server up and running is the easiest part in operation, whether you do it by hand via SSH, or automate and autoscale everything with ansible/chef/puppet/salt/whatever. Persistence is the hard part.

Good point. We're struggling to see the benefits of EBS for Cassandra that has its own replication strategy (ie data is not lost if an instance is lost), voiding the "only store temporary data on ephemeral stores" argument.

You've probably read this, but in case anyone else is considering EBS and cassandra...

"EBS volumes are not recommended for Cassandra data volumes."


How do you handle entire datacenter outages with ephemeral only setup? You can replicate to another datacenter, but if power is lost to both do you just accept that you'll have to restore from a snapshotted backup?

Our use-case is different (not cassandra or db hosted on ephemeral drives), but what we've found using AWS for about 2 years now is that when an availability zone goes out, it's either linked to or affects EBS. Our setup now is to have base-load data and PG WAL files stored/written to S3, all servers use ephemeral drives, difference in data is loaded at machine creation time, AMI that servers are loaded from is recreated every night. We always deploy to 3 AZs (2 if that's all a region has) with Route 53 latency based DNS lookups that points to an ELB that sits in front of our servers for the region (previously had 1 ELB per AZ as they used DNS lookups to determine where to route someone amongst AZs and some of our sites are the origin for a CDN, so it didn't balance appropriately...this has since been changed) that is in the public+private section of a VPC with all the rest of our infrastructure in the private section of a VPC (VPC across all 3 AZs). We use ELBs internal to the AZ for services that communicate with each other. The entire system is designed to where you can shoot a single server, a single AZ or a single region in the face and the worst you have is degraded performance (say, going to the west coast from the east coast, etc.).

Using this type of setup, we had 100% availability for our customers over a period of 2 years (up until a couple of weeks ago where the flash 12 upgrade caused a small amount of our customers to be impacted). This includes the large outage in US East 1 from the electrical storm, as well as several other EBS related outages. Overall costs are cheaper than setting up our own geo-diverse set of datacenters (or racks in datacenters) thanks to heavy use of reserved instances. We keep looking at the costs and as soon as it makes sense, we'll switch over, but will still use several of the features of AWS (peak load growth, RDS, Route 53).

The short answer is to design your entire system so that any component can randomly be shot in the face, from a single server to the eastern seaboard falling into the ocean to the United States immediately going the way of Mad Max. Design failure into the system and you get to sleep at night a lot more.

Persistence for AWS can be relatively simple if you use a distributed filesystem, such as GlusterFS (http://gluster.org) or our ObjectiveFS (https://objectivefs.com). You get a shared namespace for all your instances and persisting your data becomes as simple as writing files.

Really useful article, though I don't agree with not using a CDN instead of S3. There are multiple articles which proves the performance of S3 being quite bad, and not useful for serving assets, comparing to CloudFront.

The issue with CloudFront is the tremendous cost of $600/mo for custom domain SSL certificate. You also need to apply and get approved. There many not-so-obvious limits and blocks and unless you pay for support (which is pretty cheap, by the way), it may take you a week to lift those - you need to request them one by one and various teams approve/disapprove the requests. It's totally ridiculous.

I'll admit I hadn't really looked at this in depth. Using S3 without a CDN solved a particular use case I had a while ago, and it just seemed unnecessary to add a CDN in front of it. I've been doing some reading today, and it seems I was wrong. Adding a CDN in front adds lots of benefits I didn't know about!

I'll update the article soon to add in the new information.

also the outbound bandwidth cost of S3 is very high. it would cost us several times what we're paying for s3+cloudfront to serve our content straight from s3.

Even cloudfront is ridiculously overpriced for a CDN. If you're pushing anything close to real bandwidth you could do a lot better elsewhere.

If you're pushing anything close to real bandwidth you're probably getting an individually negotiated price that is not public.

Not with Amazon you aren't, they are amazingly stringent in their pricing on this matter.

http://aws.amazon.com/cloudfront/pricing/ the reserved capacity pricing is much better than the on demand pricing. Basically like EC2 on demand vs reservations. We set our reserved capacity at about 70-80% of what we expect to use most of the time. We could probably shave a few tenths of a cent per gig off but we get a good price on everything above what we've reserved so it's worked out.

If you use a lot of cloudfront bandwidth without setting up a reservation, yeah... you're gonna pay through the nose.
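To make the reservation trade-off concrete, here is a back-of-the-envelope sketch. The billing model (pay for the full commitment, then a separate rate for anything above it) and every rate and volume in it are assumptions for illustration, not actual CloudFront prices.

```python
def monthly_cost(used_gb, committed_gb, reserved_rate, overage_rate):
    """Cost when you pay for the whole commitment plus any usage above it."""
    overage_gb = max(0.0, used_gb - committed_gb)
    return committed_gb * reserved_rate + overage_gb * overage_rate


def on_demand_cost(used_gb, on_demand_rate):
    """Cost with no reservation at all."""
    return used_gb * on_demand_rate


if __name__ == "__main__":
    expected_gb = 100_000          # hypothetical monthly transfer
    committed_gb = 75_000          # reserve roughly 75% of expected usage
    # Hypothetical per-GB rates, chosen only to show the shape of the math.
    reserved = monthly_cost(expected_gb, committed_gb, 0.05, 0.06)
    on_demand = on_demand_cost(expected_gb, 0.12)
    print(f"reserved: ${reserved:,.2f}  on-demand: ${on_demand:,.2f}")
```

Note that under this model a quiet month still costs the full commitment, which is one reason to reserve below your expected usage rather than at it.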

Along these lines, I recommend installing New Relic server monitoring on all your EC2 instances.

The server-level monitoring is free, and it's super simple to install. (The code we use to roll it out via ansible: https://gist.github.com/drob/8790246)

You get 24 hours of historical data and a nice webUI. Totally worth the effort.

  > Use random strings at the start of your keys.
  > This seems like a strange idea, but one of the implementation details 
  > of S3 is that Amazon use the object key to determine where a file is physically 
  > placed in S3. So files with the same prefix might end up on the same hard disk 
  > for example. By randomising your key prefixes, you end up with a better distribution 
  > of your object files. (Source: S3 Performance Tips & Tricks)
This is great advice, but just a small conceptual correction: the prefix doesn't control where the file contents will be stored, it just controls where the index to that file's contents is stored.
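One way to get such prefixes is to derive them from a hash of the key itself, so they look random but can be recomputed later. This is just a sketch; the helper name, the prefix length and the "-" separator are arbitrary choices for the example.

```python
import hashlib


def prefixed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix so similar keys spread out."""
    # The hash is used only for distribution here, not for security.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"


if __name__ == "__main__":
    # Keys that would otherwise share the "2014-02-03/" prefix.
    for name in ("2014-02-03/photo-0001.jpg", "2014-02-03/photo-0002.jpg"):
        print(prefixed_key(name))
```

Because the prefix is a function of the original key, you can always rebuild the full object key from the name you already know, without keeping a lookup table.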

Your body tag is set to "overflow: hidden;". I wasn't able to scroll until I tweaked it manually in the inspector.

Oops, sorry about that. Should be fixed now.

Also, if you change the first line of http://wblinks.com/css/style.css to


    @import url(http://fonts.googleapis.com/css?family=Droid+Sans:400,700);
you should notice an improvement in the boldface font rendering.

Great article, btw.

OT, but how does this work? Why does it improve rendering? Is it downloading both the normal and bold weight rather than only one leaving the browser to do the rest?

Yes, you have it exactly right. The browser will try to make fonts bold on its own, but it's not a match for the bold that the designer intended.

>Is it downloading both the normal and bold weight rather than only one leaving the browser to do the rest?

Yes. The original URL only provides the normal weight (400), but you can specify other weights/styles[1]. 700 is the weight for bold and GP's URL requests both 400 and 700.

[1]: https://developers.google.com/fonts/docs/getting_started#Syn...

Thanks for the tip! CSS updated.

Since we're giving a little help here, you have two typos in this sentence.

> You can't change bucket names one you've creatd them, so you'd have to copy everything to a new bucket.

once and created

Thanks, fixed.

I can't zoom on Firefox mobile for Android

One painful-to-learn issue with AWS is the service limits, some of which are not so obvious. Everything has a hard limit, and unless you have the support plan it can take days or weeks to get those lifted. They are all handled by the respective departments and lifted (or rejected) one by one. Many times we've hit a Security Group limit right before a production push, or other similar things. Last, but not least, RDS and CloudFront are extremely painful to launch. I've had many incidents where RDS took nearly 2 hours to launch a blank multi-AZ instance! CloudFront distributions take 30 minutes to complete. I hate those two taking so long, as my CloudFormation templates pretty much take in excess of an hour due to the blocking RDS and CloudFront. Also, VPC is nice, I love it, but it takes time to understand the difference between Network ACLs and Security Groups, and especially: why the heck do you need to run NATs?! Why isn't this part of the service?! They provide some outdated "high" availability scripts, which are, in fact, buggy, and support only 2 AZs. Also, a CloudFront "flush" takes over 20 minutes, even for empty distributions! Also, you can't do a hot switch from one distribution to another, as it also takes 30 minutes to change a CNAME and you cannot have two distributions with the same CNAME (it's a weird edge case scenario, but anyway).

Just recalled another big annoyance! CloudFormation allows you to store JSON files in the user data, which is a bit similar to CloudInit, but... it turns your numbers into strings! So, imagine you need to put some JSON config file in there and the software expects an integer and craps out if there's a string value instead. I won't even bring up how limited and behind the API CloudFormation is... Even their AWS CLI is behind and doesn't support major services like CloudFront. They even removed the nice landing page of the CLI tool, which made it very obvious which services are NOT supported - I guess they just got embarrassed by having so many unsupported ones!
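If you control a bootstrap step on the instance, one possible workaround is to post-process the config before the software reads it. The sketch below is an assumption about how you might do that, not a CloudFormation feature; the function name and sample config are made up.

```python
import json
import re

_INT = re.compile(r"-?\d+")
_FLOAT = re.compile(r"-?\d+\.\d+")


def restore_numbers(value):
    """Recursively turn numeric-looking strings back into int/float."""
    if isinstance(value, dict):
        return {k: restore_numbers(v) for k, v in value.items()}
    if isinstance(value, list):
        return [restore_numbers(v) for v in value]
    if isinstance(value, str):
        if _INT.fullmatch(value):
            return int(value)
        if _FLOAT.fullmatch(value):
            return float(value)
    return value  # leave everything else untouched


if __name__ == "__main__":
    # Hypothetical config as it might arrive, with numbers stringified.
    raw = '{"port": "8080", "ratio": "0.5", "name": "web"}'
    print(restore_numbers(json.loads(raw)))
```

The obvious caveat is that it converts every numeric-looking string, so values that are meant to stay strings (zip codes, version numbers like "10") would need an exclusion list.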

> Have tools to view application logs.

Yes! Centralized logging is an absolute must: don't depend on the fact that you can log in and look at logs. This will grow so wearisome.
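As a minimal sketch of what shipping logs off the box can look like from the application side, Python's standard library can forward records to any collector that accepts syslog over UDP. The host, port and logger name below are placeholders.

```python
import logging
import logging.handlers


def build_logger(name, host="127.0.0.1", port=514):
    """Return a logger that forwards records to a syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.handlers.SysLogHandler(address=(host, port))
        handler.setFormatter(
            logging.Formatter("%(name)s: %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Placeholder address; point this at your own log collector.
    log = build_logger("myapp", host="127.0.0.1", port=514)
    log.info("instance started")
```

UDP is fire-and-forget, so a collector outage won't block the application, but it also means messages can be dropped silently.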

What tools do you recommend for centralized logging?

That '.' instead of '-' tip for SSL'd buckets just saved me a large future headache. Good stuff!

I think you reversed them.

Disabling SSH is an interesting tip. I guess the OP doesn't do any automation via SSH.

Just disabling inbound SSH connections, the servers can still SSH out to other systems to pull in files, configurations, clone git repos, etc.

It's just a way to stop yourself from cheating and SSHing in just to fix that one thing, instead of automating it.

except that some automation frameworks rely on inbound ssh access to the machines. ansible would be an example of such a framework, in its default configuration at least.

Ah, I wasn't aware of that, very good point!

The goal of the tip is really to stop users SSHing in just to fix that one little thing, so you could still allow your automation frameworks SSH access and just disable it for users.

It can also be useful to SSH into a system to check what's going on with a specific problem. Sometimes weird things happen that you can't always anticipate or automate away.

Userify is awesome for this - disable SSH user accounts at any time and then re-enable when you realize you still need SSH to find out why your instance stopped sending logs!! ;)

Thanks Userify CEO! :)

i'm a devops noob. what tools should i use to log / monitor all my servers?

i don't want to learn some complex stuff like chef/puppet btw.... anything SIMPLE?

Though I haven't tried it, people tell me that ansible is pretty simple - http://www.ansible.com/home

For logging, try logstash? http://logstash.net/

Monitoring... well that's a large and complicated topic!

+1 on Ansible, great tool and super simple to configure and use.

thanks, i'll give ansible a try!

Can you (or somebody else) elaborate on disabling ssh access? Is this a dogma of "automation should do everything" or is there a specific security concern you are worried about? What is the downside of letting your ops people ssh into boxes, or for that matter of their needing to do so?

Based on the article, it seems it's there to make sure that you're automating everything, instead of logging in to do that one little thing by hand.


Does anybody else here agree with this mentality? This seems a major mispractice to me. I've worked at companies with as few as two people to as many as 50,000 people. None of them have had production systems that are entirely self-maintaining. Most startups are better off being pragmatic than investing man-years of time handling rare error cases like what to do if you get an S3 upload error while nearly out of disk space. There's a good reason why even highly automated companies like Facebook have dozens of sysadmins working around the clock.

I thought all of his other points were spot-on but this one rings very dissonant to my experience.

I agree - it seems off to me. Sometimes you really want to diagnose your problems manually.

I'm also wondering how command and control is maintained without SSH access. Is there some kind of autoupdating service polling a master configuration management server (i.e., puppet's puppetmaster)?

I can appreciate that ensuring that a typical deploy doesn't require hand-twiddling. That makes sense, lots of it. But not disabling SSH.

I think the problem is that I've made it seem like a strict rule in the article; "You must disable SSH or everything will go wrong!!!". It's really just about quickly highlighting what needs automating. Like you say, sometimes you just want to diagnose your problems manually, and that's fine, re-enable SSH and dive in. But if you're constantly fixing issues by SSHing in and running commands, that's probably something you can try to automate.

Personally I always had a bad habit of cheating with my automation. I would automate as much as I could, and then just SSH in to fix one little thing now and then. I disabled SSH to force myself to stop cheating, and it worked well for me, so I wanted to share the idea.

Of course, there's always going to be cases where it's simply not feasible to disable it completely. It depends on your application. The ones I've worked on aren't massively complex, so the automation is simpler. I can certainly see how not having SSH access for larger complex systems could become more of a hindrance.

Facebook has dozens of sysadmins working 24x7, but they also have 200,000 servers.

While diagnosing issues may require logins, you aren't going to fix anything on 200,000 servers without automation. At large scale, all problems become programming problems.

It also means the death of jobs for sysadmins that only know how to go in, tweak *.conf files and reboot servers. So, I guess that sucks for you.

Agreed, but two important notes:

1) Automation doesn't have to be complete automation. I can use Chef tools like knife-ssh to run a single command on every one of our boxes in near-real time. This might not be automated to the extent that the OP is referring but it's 99.8% more efficient than logging into 500 boxes to do it (or 200,000 in Facebook's case). If you can get 99.8% with ease, it may not be worth pre-automating for that final 0.2%.

2) Knowing what is worth automating comes from experience. If I have to do something nasty once, it might be a total waste of time to automate it. There are lots of one-off things I type at a command prompt that are quicker to run than to automate. I think being so dogmatic about automation as to say you should never run things at a command line requires you to spend people's time non-optimally.

No, there is no reason to give up on such a tool; the only good reason to support such an idea is to get some sort of attention for yourself by claiming something crazy.

This is correct. The tip about disabling SSH isn't about security, it's just about quickly highlighting areas where you're not automated.

When developing an application, for example, it's often necessary to SSH in to play with some things. But once you're ready to go to production, you want as much automation as possible. Forcing yourself to not use SSH will quickly show you where you aren't automated.

What if my automation tool uses SSH (ie, Ansible)?

Someone else pointed this out too. The goal of the tip is really to stop users SSHing in just to fix that one little thing, so you could still allow your automation frameworks SSH access and just disable it for users (the idea is to disable in firewall, not turning off SSH on your server, that way you can still use it for emergencies). The idea worked well for me, but obviously isn't for everyone, YMMV.
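A quick way to check that the firewall actually matches this intent is to audit the security group rules for who can reach port 22. The sketch below works on plain dicts shaped like the IpPermissions entries the EC2 API returns when describing security groups; the function names and sample CIDRs are made up, and it ignores group-to-group rules and IPv6 ranges.

```python
from typing import Dict, List


def ssh_sources(ip_permissions: List[Dict]) -> List[str]:
    """Return every IPv4 CIDR block allowed to reach TCP port 22."""
    cidrs = []
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_ssh = rule.get("FromPort", 0) <= 22 <= rule.get("ToPort", 65535)
        else:
            covers_ssh = False
        if covers_ssh:
            cidrs.extend(r["CidrIp"] for r in rule.get("IpRanges", []))
    return cidrs


def unexpected_ssh(ip_permissions: List[Dict], allowed_cidrs: List[str]) -> List[str]:
    """CIDRs with SSH access that are not on the automation allow-list."""
    allowed = set(allowed_cidrs)
    return [c for c in ssh_sources(ip_permissions) if c not in allowed]


if __name__ == "__main__":
    # Hypothetical rules: SSH open to the world plus one automation host.
    rules = [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}, {"CidrIp": "10.0.1.5/32"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ]
    print(unexpected_ssh(rules, allowed_cidrs=["10.0.1.5/32"]))
```

Anything this reports is a path by which a person, rather than your automation, could still SSH in.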

Thanks. I'm a fan of automation but respectfully disagree with this (see my response above for details).

Perfectly valid. This particular tip certainly seems to have caused some great discussion! It worked for my particular case, but I can definitely see it not working for everyone.

I've added a link to this thread to my tip, and expanded on it a little to warn people that it's not for everyone.

Thanks for the reply here and above! Good discussion indeed.

I like SSH personally.. how else do you log in to figure out why your production instances quit logging (or anything else). I do appreciate the logic behind this, though... what he really seems to be saying is "turn off SSH to see if you can live without it." Good call. (disclaimer: I developed Userify, which pushes out SSH keys and lets you disable SSH for any or all users anytime.. and then re-enable when you need it!)

Pure bullshit

How hard is it to roll your own version of AWS's security groups? I want to set up a Storm cluster, but the methods I have come up with for firewalling it while preserving elasticity all seem a bit fragile.

Check out Dome9. Amazing tool and I think they work with both AWS and elsewhere.

As an Australian developer, using an EC2 instance seems to be the cheapest option if you want a server based in this country. Anyone got any other recommendations?

Ninefold aren't bad either

As a Ninefold employee, I'd like to think we're pretty good. We do virtual servers and we have a solid Rails platform as well.

thanks will keep them in mind.

Can anyone explain how disabling ssh has anything to do with automation? We automate all our deployments through ssh and I was not aware of another way of doing it.

I believe the idea is that by preventing SSH, you remove the temptation to just pop in and tweak something manually.

Yup, this was the intention. You could still allow your automation processes SSH access, just disable it for your users.

The idea is that if a user can't SSH in (at least not without modifying the firewall rules to allow it again), it will force them to try and automate what they were going to do instead. It worked well for me, but it's probably not for everyone.

ssh is handy if you're creating instances and then setting them up. However, if you're doing that on a regular basis, you might ought to use custom AMIs instead. Then (with proper "user data" management) you can just roll out instances that are already set up how you want.

I'd probably also say "avoid ELB where possible, especially for instance storage" and "avoid ELB, roll your own."

Thing I wish I'd known before I started: Don't rely on proprietary AWS solutions when open source solutions work just as well.

With regards to managing ssh, keys, etc... userify. Disclaimer: founder.

Someone needs to create such a list for Azure as well.

And make it Wiki-ized.

Wow looks like a big pain.

What's the point of auditing security in the Cloud? Is there any point at which you can know that you're making any progress?

Just one example -- Amazon will sign a Business Associate's Agreement for HIPAA compliance. That doesn't absolve you of your application security responsibilities, but it does give you peace of mind on the PAAS EC2/S3 side of things.

One further note, though: they won't unless you buy dedicated instances. This also disables RDS.

Aww man, my head hurts just looking at this list.

Just go with a PaaS, like Heroku or AppEngine, and forget about this sysadmin crap.

> sysadmin crap

Without this “sysadmin crap” you would not have your precious PaaS.

>> sysadmin crap

>Without this “sysadmin crap” you would not have your precious PaaS.

The difference being that I don't have to deal with the “sysadmin crap”.

Sure, but there's no reason to call it crap - have some respect for your fellow ops ;).
