My problem with StackExchange sites is that they're not that good as a reference. There are lots of great questions and answers (and lots of repetition thereof) but as a place to collate community knowledge about a topic they fall short.
This would be great. I actually was searching for something like this only a week or so ago.
In my current role I have been doing a lot with Elastic Beanstalk. Many cool features are non-obvious and/or not clearly documented, especially when it comes to extending its functionality with various hacks and recipes.
By now I have a fair accumulation of notes on this stuff and would happily pour them into a wiki.
"9) Use Virtual Private Cloud (VPC) from the start"
This is now a no-brainer. New registrations in certain zones will kick you into a basic VPC from the get-go.
The only inbound access should be via an ELB (HTTP/HTTPS) and a non-DNS-resolvable SSH bastion/NAT host (m1.small is more than enough). Your bastion is the only host that is on the public Internet.
Setting up a bastion is reasonably straightforward. There is an AMI that will do it all for you. Just make sure you've got "src/dst check" turned off in the EC2 panel for that server. That tip alone can save you hours of hair-tearing.
Outbound is via the bastion, which you can lock down to certain protocols via VPC security groups. I limit to HTTPS (no HTTP!).
The bastion should use SSH keys only (no username/password). Put fail2ban in place on the bastion. You can also add a firewall rule that backs off after multiple failed SSH attempts. This all but nukes brute-force attacks. Also, make sure you patch regularly.
I go as far as to have separate keys for the bastion and then the hosts, but sensible policies should apply here (e.g. passphrase on your keys please).
Keep the bastion superuser login to a very select group. You can quickly remove employees' access by taking them out of the bastion. If you're in panic mode, you can turn off the bastion and isolate your network until you can regroup (similarly with the ELB for web-based attacks).
This is a pretty sound foundation for a secure setup.
It's more of a Swiss Army knife. It's also still pretty new - the AWS documentation misses out on some things for it (like autoscaling) and it's still in development to some degree, but it means you have to remember one command with one auth, instead of a handful (s3cmd, boto, others I can't recall).
In terms of CLIs for Amazon S3, 'aws s3' is a collection of commands that are almost equivalent to 's3cmd' functionality, while 'aws s3api' is a collection of commands that interact directly with the API using the corresponding action names.
I don't hear enough people talking about vendor lock-in with AWS, or any other cloud provider for that matter. Too many of us are building our company's infrastructure on AWS with little regard for the cost of switching.
I try to stick with Amazon's IaaS offerings, only employing PaaS products when they offer considerable advantages over anything I could roll economically on bare infrastructure.
Ask yourself, "in terms of time or money, what would it cost me if this cloud product were discontinued without notice?"
AWS might lose a patent battle, get shut down by the government, or get crippled by hackers or a long-term service interruption. If that happened, how long would you be down while you struggle to get up and running at another provider? What if Amazon raises their prices to the point where you want to change cloud providers? Would the cost of switching be prohibitive?
I am using a number of high-performance instances - cc2.8xlarge machines for EMR jobs, and cr1.8xlarge machines for analytics databases. From my testing:
- Running EMR jobs on cc2.8xlarge machines as spot instances is a great way to get a LOT of compute power very cheaply. Because our jobs are periodic we run both Core and Task as spots and simply retry the job if our machines get terminated. I did a lot of benchmarking and found that a small number of cc2.8xlarge machines outperforms and is cheaper than a large number of lesser instances (and I tried most of the lesser machines). In us-west-2 it's very uncommon to lose our instances, unlike us-east-1, which has major price fluctuations (this is true for all types of spot instance).
- The cr1.8xlarge has fantastic performance relative to the rest of the AWS machines. It's also very expensive compared to the cost of hardware or a similar solution on another cloud provider. Since we're fully integrated with AWS and don't want to run our own hardware we're sucking up the cost for now, but it's definitely a sore point in our budget. The cr1.8xlarge is also all-round a better machine than the hi.4xlarge, which has a lot of disk but is pitiful in terms of CPU.
One thing to note with EMR: you still pay 25% of the on-demand price as overhead to use EMR, even when the instances themselves are spots. If you're bringing up and turning off clusters all the time, it's probably worth it, but you might want to look into using Whirr instead.
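To put numbers on that overhead, here is a rough sketch of the arithmetic. It assumes the surcharge is a flat 25% of the on-demand price per instance-hour, as stated above; the prices and the `hourly_cost` helper are made up for illustration and are not real AWS rates.

```python
def hourly_cost(instance_price, on_demand_price, nodes, emr_overhead=0.25):
    """Hourly cluster cost: instance price plus the EMR surcharge, per node.

    emr_overhead is the fraction of the on-demand price charged for EMR
    (25% as quoted above). Use 0.0 for a self-managed cluster on plain EC2.
    """
    return nodes * (instance_price + emr_overhead * on_demand_price)

# Hypothetical prices in USD per instance-hour, for illustration only.
ON_DEMAND = 2.40
SPOT = 0.30

emr_spot = hourly_cost(SPOT, ON_DEMAND, nodes=10)
plain_spot = hourly_cost(SPOT, ON_DEMAND, nodes=10, emr_overhead=0.0)

print(f"EMR on spot:    ${emr_spot:.2f}/hour")
print(f"Plain EC2 spot: ${plain_spot:.2f}/hour")
```

With prices like these the surcharge is larger than the spot price itself, which is why it is easy to overlook when you only watch the spot market.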
You've a good point about the EMR charge - that's easy to overlook.
I took a look at Whirr but I don't see how having a cloud-agnostic platform helps - are there really alternatives to EMR out there? Can they give me 500+ cc2.8xlarge equivalent machines on-demand but at spot prices?
Whirr just uses the AWS APIs to provision a cluster for you, but then you're getting a cluster built on EC2 instances rather than EMR, so you don't pay that EMR overhead. You can choose spot or on-demand instances. If your spots get reaped, I think it would fail pretty similarly to an EMR cluster built on spots. I have no idea how quickly it could provision a 500 node cluster, however.
Ahh, I see - I thought it was about being able to move my EMR jobs to other providers. Running on EC2 without the EMR charge would be nice, and assuming that there's no special casing for spot requests for EMR (I don't know of any, the spot instance requests show up like normal ones) then it may well be able to get me that many instances.
The description of the noisy neighbor problem in #6 lacks some depth.
The AWS noisy neighbor problem is very often misunderstood. CPU steal time under Linux does NOT mean that somebody is stealing your CPU. It simply means that you wanted to use the CPU and the hypervisor gave it to another instance. This may happen because you have exceeded your quota, or because the scheduling algorithm selected another pending instance at that moment and will give the CPU back to you a bit later. In both cases your instance still gets its fair share.
Why does killing and restarting an instance help? It likely moves the instance to a different hardware node with less active neighbors. When your neighbor is not active, your instance can use your neighbor's idle CPU cycles! You sort of become the noisy one. The hypervisor will still prevent this once the neighbor starts to fully utilize his CPU quota, and then you are back to square one.
Amazon specifically states that t1.micro instances do not guarantee CPU performance:
"Micro instances are a very low-cost instance option, providing a small amount of CPU resources. Micro instances may opportunistically increase CPU capacity in short bursts when additional cycles are available. They are well suited for lower throughput applications and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance."
While CPU sharing is pretty well documented, the noisy neighbor problem still exists for network and disk resources shared by multiple instances on the same hardware node. The only way to detect these problems is to track throughput and loss rate for the network and IO stats for disk.
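For the CPU side, the steal time discussed above can be measured directly. A minimal sketch, assuming a Linux guest: the first line of /proc/stat holds cumulative CPU counters, with steal as the eighth value after the `cpu` label. The two sample lines below are invented for illustration; on a real host you would read the first line of /proc/stat twice, a few seconds apart.

```python
def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already included in user and nice, so skip them.
    total = sum(values[:8])
    return steal, total

def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two samples."""
    steal0, total0 = parse_cpu_line(before)
    steal1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (steal1 - steal0) / elapsed

# Invented samples standing in for two reads of /proc/stat.
before = "cpu  1000 0 500 8000 100 0 50 350 0 0"
after = "cpu  1400 0 700 8600 100 0 50 650 0 0"

print(f"steal: {steal_percent(before, after):.1f}%")
```

Tracking this number over time per instance, rather than looking at a single reading, is what makes an outlier stand out.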
"You are guaranteed to avoid noisy neighbors CPU problem by using AWS dedicated instances"
This is incorrect. You simply ensure that the only EC2 instances that can be a noisy neighbor are other instances of yours. Since most users investing in dedicated instances tend to put their instances to work, this sometimes has the net effect of reducing performance.
There was recently [a paper] published in ACM Transactions on Computer Systems where the argument was that if your instance is scheduled out by the hypervisor during TCP traffic, the latency in ACKing packets would reduce the network throughput of instances, because of the slow start mechanism.
Given that distributed systems depend heavily on network latency and capacity, I found it very interesting to see the effect of a busy CPU propagate to slower network IO.
If you're sharing a CPU with a neighbour and you're using 3% and the neighbour is at 100%, the hypervisor needs to somehow squash that 103% into the 100% it has available - which means you lose a little bit of the scheduled time slots that otherwise would have been yours. If your neighbour drops down to 50%, there's now enough CPU time to handle both of you, so there's no 'steal'.
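The arithmetic in that description can be sketched as a toy model. It assumes the hypervisor scales every instance down by the same factor when total demand exceeds capacity, which is a simplification for illustration and not a description of how any real hypervisor scheduler works; the function names are hypothetical.

```python
def proportional_share(demands, capacity=100.0):
    """Scale all demands by one factor when their sum exceeds capacity."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)
    factor = capacity / total
    return {name: demand * factor for name, demand in demands.items()}

def steal(demands, capacity=100.0):
    """CPU each instance asked for but did not get, in percentage points."""
    granted = proportional_share(demands, capacity)
    return {name: demands[name] - granted[name] for name in demands}

# Neighbour at 100%: 103% of demand has to fit into 100%.
contended = steal({"you": 3.0, "neighbour": 100.0})
# Neighbour at 50%: everything fits, so nothing is taken away.
relaxed = steal({"you": 3.0, "neighbour": 50.0})

print(f"steal with busy neighbour:  {contended['you']:.3f} points")
print(f"steal with quiet neighbour: {relaxed['you']:.3f} points")
```

Under this model the light user loses only a small fraction of a percentage point, which matches the "you lose a little bit" description.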
We've seen, for example, that with 5 instances running the same app, one instance can use 20% more CPU and show tons of CPU steal time even though none of them are at 100%: the "normal" ones use ~40% and the one with tons of steal time spikes between 60% and 70%.
But rebuilding the instance brings them all in line.
Does anyone understand the point they are making in #9 about VPC?
Are they suggesting using HAProxy in your public subnet and ELBs in your private subnets? Is this their reference to avoiding using a NAT box (actually PAT)? Don't you still need a HAProxy box in each of your AZs?
I'm not sure, but I read it as ELB over your HAProxy instances (for high availability), and use HAProxy to do the intelligent load balancing (which would solve the "sticky" argument put forth by the author).