Keeping buckets locked down and allowing direct client -> S3 uploads
 Using ALIAS records for easier redirection to core AWS resources instead of CNAMEs.
 What's an ALIAS?
[-] Using IAM Roles
 Benefits of using a VPC
[-] Use '-' instead of '.' in S3 bucket names that will be accessed via HTTPS.
[-] Automatic security auditing (damn, entire section was eye-opening)
[-] Disable SSH in security groups to force you to get automation right.
Also, S3 buckets cannot scale infinitely; that's a huge myth.
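The dash-versus-dot bucket-name tip above comes down to TLS wildcard matching: the certificate for `*.s3.amazonaws.com` covers only a single DNS label, so a dotted bucket name pushes the hostname outside what the wildcard matches. A rough sketch of the rule (simplified; real validation follows RFC 6125):

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified check: '*' covers exactly one DNS label (see RFC 6125)."""
    p_labels = pattern.split(".")
    h_labels = hostname.split(".")
    if len(p_labels) != len(h_labels):
        return False
    return all(p == "*" or p == h for p, h in zip(p_labels, h_labels))

# A dash-only bucket name stays within one label, so the wildcard matches:
print(wildcard_matches("*.s3.amazonaws.com", "my-bucket.s3.amazonaws.com"))  # True
# A dotted bucket name adds labels, and the wildcard no longer covers it:
print(wildcard_matches("*.s3.amazonaws.com", "my.bucket.s3.amazonaws.com"))  # False
```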
If you don't have an elastic workload and are keeping all of your servers online 24/7, then you should investigate dedicated hardware from another provider. AWS really only makes sense ($$) when you can take advantage of the ability to spin up and spin down your instances as needed.
If we went with all of our own dedicated hardware, or cheaper instances from a different cloud provider, then we'd miss out on ELB and have slower, more expensive communication to and from S3, not to mention that services like Elastic Beanstalk make deploying to EC2 instances very easy compared with rolling your own deployment system. And for those who don't want to bother with administering databases and cache machines, RDS and ElastiCache are going to be cheapest and fastest if your instances are EC2.
So yeah I agree that EC2 is expensive, but the benefits of living fully within the Amazon ecosystem are pretty large.
I can see a lot of benefit to using S3 without EC2, but after that, I'm not sure what else would be possible. Care to elaborate more?
Can you use their queues and database tools w/o using EC2? (If you are using a VPC, maybe?)
In our case (and I'm pretty sure we're not on the fringe), when you consider all costs, including the administrative overhead, EC2 is not a significant portion of the total cost.
Just to give an idea, for us, even if someone had offered VM hosting for free, it would not be cost effective for us to move.
For a company that has very high processing requirements it could be a different story. I just wanted to mention that the value added by the services on top of EC2 is sufficiently high that even if EC2 costs are higher than the alternatives (and they are higher), AWS can still provide significant savings.
The time of a highly skilled dev/ops person is (very) valuable, and not something we can buy more of easily. Anything that saves us time or lets us implement faster pays for itself pretty easily. If you don't have a massive EC2 bill, chances are AWS overall is a good proposition.
Expect to pay a LOT more than you'd expect to pay from the pricing page.
I don't get the whole line of reasoning: either your project is not important, and weekly or monthly backups are certainly sufficient. In that case, a script to back up a VPS is by far the most cost-efficient way to run the project.
If your project is important enough to have a complex AWS setup, you have an admin anyway.
"But you can trust Amazon." Well, no: http://www.businessinsider.com.au/amazon-lost-data-2011-4 (note that disasters on VPS/dedicated servers are also rare)
If you're making complex AWS setups because "it's cool", then by all means, but keep in mind that you're paying a lot for the privilege. Don't do this on production.
If you think VPS providers are an alternative to AWS, you have an entirely different use case than ours. For an application with high-availability requirements, message queues, a scalable data store, etc., admin work is not "monthly backups". Implementing and maintaining highly available load balancing, message queues, data stores, DNS, auto scaling, etc. all require a lot of work, and often skills that your team may not have. AWS makes these tasks easier. You still have to do admin work, but much less of it, and you don't have to be an expert at each one. That's worth a lot.
At most it's been 4 to 6 ms. And that has coped with our heaviest load.
If you have a dozen slow (100us) switches, three slow (2ms) routers, and 500 kilometers of fibre (2ms), you should still be under 10ms each way. And that's enough gear to cross some countries.
I'm curious how you can do much worse than this in a data center. Maybe some pokey "application firewall" or massive buffer bloat?
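For what it's worth, the figures quoted above do add up to a comfortably sub-10ms one-way trip:

```python
# Figures from the comment: 12 switches at 100 us each, 3 routers at
# 2 ms each, and ~2 ms for 500 km of fibre (light in glass covers
# roughly 200 km per ms).
switches_us = 12 * 100
routers_us = 3 * 2000
fibre_us = 2000
one_way_ms = (switches_us + routers_us + fibre_us) / 1000
print(one_way_ms)  # 9.2
```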
AWS really only makes sense ($$) when you can take advantage of the ability to spin up and spin down your instances as needed.
After it's pointed out that other facilities besides EC2 motivate people to use AWS, you can't turn around and complain that those other facilities are useless without EC2. Even if that were true (it's not), such people would be using EC2 not for its load-scaling, but rather to enable their use of the non-EC2 facilities.
I'm not sure how useful everything in the AWS umbrella is outside of EC2. I was hoping to learn something, not provoke anyone.
When everything is covered under the AWS umbrella, it's easy to argue from any side you want.
I still think that it's important to consider whether or not AWS (for all values of AWS) correctly fits your workload. You may not need the ability to spin up servers within minutes, or have multi-datacenter redundancy in a database, or have virtually unlimited storage, or a robust queuing system. A lot of great engineering has gone into all of the AWS products, and for many instances it is probably overkill. And that can cost you a lot if you don't know what you're doing.
Unless they have dramatically improved their offering in the last couple of years, an hour or two from "I need a new server" to delivery is 1) not an accurate timeframe for physical servers from Rackspace and 2) even if it were realistic, that's an eternity when you are trying to iterate quickly. I can have a new server in 30 seconds with AWS, and in the course of an hour could have tested my automation tools half a dozen times or more, versus reimaging a server over and over again.
I'm not saying it's always the right choice or that it's cheaper, but that flexibility combined with some of the pre-canned tools (ELB, RDS, CloudWatch, SQS, SNS) has tremendous value even when you aren't autoscaling.
I only have experience renting physical hardware from SoftLayer. They have built and imaged new servers for me in under 2 hours a dozen times, day and night. They also have an "xpress servers" line with guaranteed <2hr delivery. They also let you reimage your servers through their control panel or API; you don't need to have a new one built just to get a fresh disk image.
I moved my last company completely off AWS, and it proved to be a great decision across a number of dimensions.
Ansible, puppet, sprinkler, and the like would take a similar amount of time to configure.
I'm very impressed that you were able to build, in 3 weeks' time, a low-latency multi-data-center application with master/slave database failover, robust fault-tolerant load balancing, and backups that can be restored in minutes, with an API to control all of those services.
That would normally take a senior team of engineers several months to accomplish and have it be of the quality and reliability of the services provided by AWS.
More likely is that you had a use case or a mindset that did not suit AWS very well and was easy to implement on your own. That's awesome and I'm glad you were able to find better value elsewhere. AWS is not for everyone, and is definitely quite expensive on the pocketbook.
It was more than just Chef, obviously, but Chef plus any bare-metal host environment gets you a large percentage of the way there. Tacking on specific AWS services like Route 53 when necessary works too.
The benefit of AWS is that you get some immediate bootstrap, plus simple auto scaling. That last one is a killer feature. Being able to scale your caches and load balancers silently, based on metrics, is a real time and money saver.
Sure. You need to have a level of scale to need that :) but when you do, AWS can have some good features.
Once it's clear that the new server will be needed long-term, transition to dedicated hardware to save money and get better performance.
For some businesses, the huge AWS feature set (RDS, EBS, ELB, security groups, VPC, EIPs, etc., etc.) is more valuable than the bottom-line $$. For others, those features aren't worth the added cost, but hand-wavey "just use DO" or "just rent physical servers from SoftLayer/Rackspace" is disingenuous.
TL;DR AWS has value above and beyond simply hour-to-hour elasticity.
But do you know of any good resources to learn about those two things? And I'm talking about basic devops, before you even start worrying about automating, and about the things you would actually automate, because I don't know what it is I should be doing in the first place.
Things like what you should be doing right after you SSH into your server, how to make your server secure, how to use nginx, chmod'ing permissions of files correctly, and things I don't even know about.
Is there a One Month Rails or Michael Hartl's RoR Tutorial for devops/sys admin?
Regardless, thanks for taking the time to read this :)
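Not a full resource, but for the "right after you SSH in" part of the question above, a common first move is locking down sshd itself. A minimal `/etc/ssh/sshd_config` sketch (common defaults, not a complete hardening guide; confirm your key-based login works first, and run `sshd -t` before restarting the service):

```
# No direct root logins; use a regular user plus sudo
PermitRootLogin no
# Keys only; disables password guessing entirely
PasswordAuthentication no
PubkeyAuthentication yes
```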
DigitalOcean and most others are not.
AWS's security model and API are light-years ahead of everyone else's.
If all you need is a server that is up 24/7, rent it by the month. You don't need much information to make an educated choice, since they are pretty much all cheaper than EC2.
I doubt there are many founders who are technically informed enough to know about Amazon Web Services, but don't know about the other big 3 (Digital Ocean, Linode, Rackspace). If you truly don't, then you must not be a tech company, and I have a hard time believing a non-tech company without any technical founders would even know about AWS.
More people know about AWS than you may realize.
Specifically - what AWS features do you find to be useful at the beginning? You seem to have some specific use-cases in mind. I'm legitimately curious.
For example, RDS is just a database instance with management. You'd have to invest some time to replicate what AWS already does for you, but it's not rocket science. The same goes for ElastiCache, Auto Scaling, ELB, and most of their other services.
There are other subtleties which make roles hard to work with. The same policies can have different effects for roles and users (e.g., permission to copy from other buckets).
IAM Roles can be useful, especially for bootstrapping (e.g. retrieving an encrypted key store at start-up), but only use them if you know what you're doing.
Conversely, tips like disabling SSH have negligible security benefit if you're using the default EC2 setup (private key-based login). It's really quite useful to see what's going on in an individual server when you're developing a service.
Also, it does matter whether you put a CDN in front of S3. Even when requesting a file from EC2, CloudFront is typically an order of magnitude faster than S3. And even when using the website endpoint, S3 is not designed for web sites: it serves 500s relatively frequently and does not scale instantly.
Is blocking 169.254.169.254 important because it could potentially give users access to the instance metadata service for your instance? I'd be interested to hear more about securing EC2 with regard to IAM roles; you seem to have lots of experience in that area.
The disabling SSH tip wasn't really about security (I agree that it has negligible security benefit), it's more about quickly highlighting parts of your infrastructure that aren't automated. It's often tempting to just quickly SSH in and fix this one little thing, and disabling it will force you to automate the fix instead.
The CDN info has been mentioned elsewhere too; lots of things I didn't know. I'll be updating the article soon to add all of the points that have been made. Thanks for the tips!
I make sure all HTTP requests in my (Java) application go through a DNS resolver that throws an exception if:
ip.isLoopbackAddress() || ip.isMulticastAddress() || ip.isAnyLocalAddress() || ip.isLinkLocalAddress()
The last clause captures 169.254.169.254. Of course, many libraries use their own HTTP client, so it's easy to make a mistake.
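For reference, a Python sketch of roughly the same checks, using the stdlib `ipaddress` module (the function name is mine):

```python
import ipaddress

def is_forbidden_destination(ip: str) -> bool:
    """Reject addresses an outbound HTTP request should never hit:
    loopback, multicast, unspecified (0.0.0.0), and link-local."""
    addr = ipaddress.ip_address(ip)
    return (addr.is_loopback or addr.is_multicast
            or addr.is_unspecified or addr.is_link_local)

print(is_forbidden_destination("169.254.169.254"))  # True -- the metadata service
print(is_forbidden_destination("93.184.216.34"))    # False -- a public address
```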
I'm trying to bring my usage of IAM roles down to 0 as a matter of policy. Currently, I'm only using an IAM role to retrieve an encrypted Java key store from S3 (key provided via CloudFormation) and encrypted AWS credentials for other functions (keys contained in the key store). I'd be happier to bootstrap using CloudFormation with credentials that are removed from the instance after start-up.
Thanks for making updates. There are definitely some great tips in there.
What? CloudFront bandwidth costs are, at best, the same as S3 outbound costs, and at worst much more expensive.
S3 outbound costs are 12 cents per GB worldwide. 
CloudFront outbound costs are 12-25 cents per GB, depending on the region.
Not only that, but your cost per request on CloudFront is way more than on S3 ($0.004 per 10,000 requests on S3 vs $0.0075-$0.0160 per 10,000 requests on CloudFront).
For low bandwidth, you're absolutely right, the costs are at best the same. For high bandwidth however (once you get above 10TB), CloudFront works out cheaper (by about $0.010/GB, depending on region). But that wasn't taking into account the request cost, which as you point out, is more expensive on CloudFront, which can negate the savings from above depending on your usage pattern.
I'll update my post accordingly, thanks for pointing this error out!
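To make the comparison above concrete, a back-of-the-envelope sketch (rates are the ones quoted in this thread and will drift over time; the crossover figures are illustrative only):

```python
def s3_cost(gb, requests):
    # 12 cents/GB out, $0.004 per 10k requests (figures quoted in-thread)
    return gb * 0.12 + requests / 10_000 * 0.004

def cloudfront_cost(gb, requests, per_gb=0.12, per_10k_req=0.0075):
    # per-GB rate varies by region (12-25 cents) and drops at volume tiers
    return gb * per_gb + requests / 10_000 * per_10k_req

# At low volume with equal per-GB rates, CloudFront's higher request
# price makes it the pricier option:
low = (100, 1_000_000)  # 100 GB, 1M requests
assert cloudfront_cost(*low) > s3_cost(*low)

# Above ~10 TB a cheaper tiered per-GB rate can flip the comparison,
# request charges permitting:
high = (20_000, 1_000_000)  # 20 TB, 1M requests
assert cloudfront_cost(*high, per_gb=0.11) < s3_cost(*high)
```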
Also, S3 buckets cannot scale infinitely. They have to have their key names managed appropriately to do it. http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tip...
Finally :) I like SSH. But I'm the founder of Userify! http://userify.com
There's one that I think could be improved on a little:
Uploads should go direct to S3 (don't store files on the local filesystem and have another process move them to S3, for example).
Getting your application server up and running is the easiest part in operation, whether you do it by hand via SSH, or automate and autoscale everything with ansible/chef/puppet/salt/whatever. Persistence is the hard part.
"EBS volumes are not recommended for Cassandra data volumes."
Using this type of setup, we had 100% availability for our customers over a period of 2 years (up until a couple of weeks ago, when the Flash 12 upgrade caused a small number of our customers to be impacted). This includes the large outage in US East 1 from the electrical storm, as well as several other EBS-related outages. Overall costs are cheaper than setting up our own geo-diverse set of datacenters (or racks in datacenters) thanks to heavy use of reserved instances. We keep looking at the costs, and as soon as it makes sense, we'll switch over, but we will still use several of the features of AWS (peak load growth, RDS, Route 53).
The short answer is to design your entire system so that any component can randomly be shot in the face, from a single server to the eastern seaboard falling into the ocean to the United States immediately going the way of Mad Max. Design failure into the system and you get to sleep at night a lot more.
I'll update the article soon to add in the new information.
If you use a lot of cloudfront bandwidth without setting up a reservation, yeah... you're gonna pay through the nose.
The server-level monitoring is free, and it's super simple to install. (The code we use to roll it out via ansible: https://gist.github.com/drob/8790246)
You get 24 hours of historical data and a nice webUI. Totally worth the effort.
> Use random strings at the start of your keys.
> This seems like a strange idea, but one of the implementation details
> of S3 is that Amazon use the object key to determine where a file is physically
> placed in S3. So files with the same prefix might end up on the same hard disk
> for example. By randomising your key prefixes, you end up with a better distribution
> of your object files. (Source: S3 Performance Tips & Tricks)
Great article, btw.
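A minimal sketch of the quoted tip, prepending a short deterministic hash so lexically adjacent keys land in different partitions (the prefix length is arbitrary, and because the prefix is derived from the key you can recompute it to retrieve objects later):

```python
import hashlib

def randomized_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a deterministic hash prefix so lexically adjacent keys
    spread across S3's internal partitions."""
    prefix = hashlib.md5(key.encode()).hexdigest()[:prefix_len]
    return f"{prefix}/{key}"

# Sequential upload names now start with unrelated prefixes:
print(randomized_key("uploads/2014/02/photo-0001.jpg"))
print(randomized_key("uploads/2014/02/photo-0002.jpg"))
```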
Yes. The original URL only provides the normal weight (400), but you can specify other weights/styles. 700 is the weight for bold and GP's URL requests both 400 and 700.
> You can't change bucket names one you've creatd them, so you'd have to copy everything to a new bucket.
once and created
Yes! Centralized logging is an absolute must: don't depend on the fact that you can log in and look at logs. This will grow so wearisome.
It's just a way to stop yourself from cheating and SSHing in just to fix that one thing, instead of automating it.
The goal of the tip is really to stop users SSHing in just to fix that one little thing, so you could still allow your automation frameworks SSH access and just disable it for users.
I don't want to learn some complex stuff like Chef/Puppet, btw... anything SIMPLE?
For logging, try logstash? http://logstash.net/
Monitoring... well that's a large and complicated topic!
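To show how small a starting point can be, a minimal Logstash sketch that tails syslog into Elasticsearch (1.x-era config syntax; the path and hostname are placeholders):

```
input {
  file {
    path => "/var/log/syslog"
    type => "syslog"
  }
}
output {
  elasticsearch {
    host => "logs.example.internal"
  }
}
```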
Does anybody else here agree with this mentality? This seems like major malpractice to me. I've worked at companies with as few as two people and as many as 50,000 people. None of them have had production systems that are entirely self-maintaining. Most startups are better off being pragmatic than investing man-years of time handling rare error cases like what to do if you get an S3 upload error while nearly out of disk space. There's a good reason why even highly automated companies like Facebook have dozens of sysadmins working around the clock.
I thought all of his other points were spot-on but this one rings very dissonant to my experience.
I'm also wondering how command and control is maintained without SSH access. Is there some kind of autoupdating service polling a master configuration management server (i.e., puppet's puppetmaster)?
I can appreciate that ensuring that a typical deploy doesn't require hand-twiddling. That makes sense, lots of it. But not disabling SSH.
Personally I always had a bad habit of cheating with my automation. I would automate as much as I could, and then just SSH in to fix one little thing now and then. I disabled SSH to force myself to stop cheating, and it worked well for me, so I wanted to share the idea.
Of course, there are always going to be cases where it's simply not feasible to disable it completely. It depends on your application. The ones I've worked on aren't massively complex, so the automation is simpler. I can certainly see how not having SSH access could become more of a hindrance for larger, more complex systems.
While diagnosing issues may require logins, you aren't going to fix anything on 200,000 servers without automation. At large scale, all problems become programming problems.
It also means the death of jobs for sysadmins who only know how to go in, tweak *.conf files, and reboot servers. So, I guess that sucks for you.
1) Automation doesn't have to be complete automation. I can use Chef tools like knife-ssh to run a single command on every one of our boxes in near-real time. This might not be automated to the extent that the OP is referring but it's 99.8% more efficient than logging into 500 boxes to do it (or 200,000 in Facebook's case). If you can get 99.8% with ease, it may not be worth pre-automating for that final 0.2%.
2) Knowing what is worth automating comes from experience. If I have to do something nasty once, it might be a total waste of time to automate it. There are lots of one-off things I type at a command prompt that are quicker to run than to automate. I think being so dogmatic about automation as to say you should never run things at a command line requires you to spend people's time non-optimally.
When developing an application, for example, it's often necessary to SSH in to play with things. But once you're ready to go to production, you want as much automation as possible. Forcing yourself not to use SSH will quickly show you where you aren't automated.
I've added a link to this thread to my tip, and expanded on it a little to warn people that it's not for everyone.
The idea is that if a user can't SSH in (at least not without modifying the firewall rules to allow it again), it will force them to try and automate what they were going to do instead. It worked well for me, but it's probably not for everyone.
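In security-group terms, "disabling SSH" just means the group carries no port-22 ingress rule at all. A hedged CloudFormation-style sketch (the resource name is mine; this group admits web traffic only):

```yaml
NoSshSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Web traffic only; no SSH ingress
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 80
        ToPort: 80
        CidrIp: 0.0.0.0/0
```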
And make it Wiki-ized.
Just go with a PaaS, like Heroku or AppEngine, and forget about this sysadmin crap.
Without this “sysadmin crap” you would not have your precious PaaS.
>Without this “sysadmin crap” you would not have your precious PaaS.
The difference being that I don't have to deal with the “sysadmin crap”.