AWS mistakes to avoid (cloudonaut.io)
537 points by hellomichibye on Dec 26, 2015 | 262 comments

Full disclosure: I work for Linode and I am super biased. This is my own opinion etc etc

Perhaps the first AWS mistake you might make is... using AWS? Even before I started at Linode, I thought it was terrible. It's extremely, unreasonably pricey. The UI is terrible. Their offerings are available elsewhere. I started MediaCrush, a now-defunct media hosting website, on AWS. After a while, we switched to dedicated hosting (really sexy servers). We were looking at $250 a month and scaled up to millions of visitors per day! I ran our bandwidth and CPU usage and such numbers through the AWS price calculator a while ago - over $20,000 per month. AWS is a racket. It seems to me like the easiest way to burn through your new startup's seed money real fast.

Edit: not trying to sell you on Linode, just disclosing that I work there. There are lots of options, just do the research before you reach for AWS.

Since you're dealing in absolutes, let me return the favor; you're being a little silly.

Yes, Linode beats AWS when it comes to price. On the other hand, Linode's offer is incredibly basic and simplistic. AWS offers service after service that Linode simply doesn't and realistically cannot.

Sure, you can emulate a subset of these services (and a subset of their features) using open-source software, but at what price? That's the major flaw in your reasoning. Getting to the point where your installations are as stable and reliable as AWS', given a large stress on the system, will cost you a lot of money and time. Directly comparing the cost of hardware access ignores other costs and headaches that are not very easy to estimate.

There's a market for a service like Linode and there's a market for a service like AWS. You've simply never worked on a project/system that works better on AWS than on Linode. I know I couldn't run the systems that I currently operate on Linode without multiplying the workload that is needed to maintain them.

Linode and AWS are competitors but there's space in the market for both; they simply fill different niches. Establishing one as absolutely superior to the other is silly and closed-minded. A lot of people chose AWS; go and ask them why (feel free to reach out to me at nick at nasx dot io - I'll be more than happy to talk to you).

Do you even understand his point? He is just nice enough to disclose his current employer. He's not recommending Linode. Really, all he's doing is just asking you to do some research and it needn't be Linode, there are plenty of alternatives out there.

If we keep attacking the people for their honest disclosures, we're only discouraging them from expressing their conflict of interest in the future. Please. Don't.

I'm really not trying to compare Linode to AWS here. I just felt that it was necessary to mention that I work for Linode because there _is_ some overlap. I'm really also not trying to sell anyone on Linode in particular. I'm pointing out that from my own personal experience, AWS is grossly overpriced and worse than the alternatives for a large number of use-cases.

AWS is insanely overpriced. It's weird to me that everyone is jumping on AWS when some of their customers never come close to attracting the internet traffic required to justify setting up CloudFormation. I think if you're starting a webapp out of your own pocket and not with millions of dollars from investors, it makes sense to save where you can.

Using locust.io, I've seen that my current site on two $10/month Linodes can scale up to approximately 300k people/day and increasing that substantially just means I press a button and upgrade my app/database servers.

If it came to a point where I was growing at a pace I didn't want to manage, money was flowing in, and I was out of ideas on software optimization, only then would I consider spending tens of thousands/month on AWS.

I'm not denying the great benefits AWS gives, I honestly would love to use it now and just be done with most of my devop headaches, but the costs are prohibitive.

What are some alternatives to AWS? Are there realistic alternatives?

The "alternatives" depend on your situation.

Picture a continuum between brain-dead simple websites and business-critical complex websites:

  simple:    static website, WordPress blog
  moderate:  small business CMS, etc.
  complex:   Netflix, Airbnb

If you're running a simple WordPress blog, the AWS prices are absurdly overkill. For this use-case, there are a zillion alternatives. Linode, Digital Ocean, Rackspace bare metal, etc, etc.

On the other end of the spectrum, you want to run a high-availability website with failover across multiple regions like Netflix. You need the value-added "services" of a comprehensive cloud provider (the "I" and "S" in "IaaS" as in "Infrastructure Services"). For that scenario, there are currently 4 big competitors: AWS, MS Azure, Google Compute Cloud, and IBM SoftLayer. However, many observers see that Google and IBM are not keeping pace with AWS and Azure on features so at the moment, it's more of a 2 horse race than a 4.

Keep in mind that the vast majority of cost comparisons showing AWS to be overpriced are based on comparing Amazon's EC2 vs bare metal. The EC2 component is a small part of the complete AWS portfolio.[1] If you're doing more complicated websites, you have to include the costs of Linux admins + devops programmers to reinvent what AWS has out of the box. (The non-EC2 services.) Even if you use OpenStack as a baseline for a "homegrown AWS", you'll still need extensive staffing to configure and customize it for your needs. It may very well turn out that homegrown on Linode is cheaper but most articles on the web do not have quality cost analysis on the more complicated business scenarios. Anecdotes yes! But comprehensive unbiased spreadsheets with realistic cost comparisons?!? No.


DigitalOcean is a great alternative for simple things.

I use it for dev / early projects and as things get complex or need more redundancy I make the production spend on AWS.


Much like I wouldn't trust a person whose only comments on HN are AWS puff, why should I trust a person whose only posts are Google Cloud puff?

I've found Google Compute Engine decent, but with App Engine I ran into difficulties with vendor lock-in. I first had to modify my code to get the app to run on my local computer. After that I never got it deployed because GAE requires me to bind their own TCP context for outbound connections. This was a shame because I wanted to have my app on GAE and my database on Heroku.

Just saying that calling 90% of Google Cloud's products far better than AWS (or any equivalent provider for that matter) is bold.

App Engine has a lot of faults, but your use case sounds incredibly niche. You're using the multiregion availability of App Engine's HTTP servers but want to use Heroku's single AZ database across relatively flaky, slow internet instead of App Engine's multiregion available database on a managed internal SDN. I'm having a hard time imagining a scenario where that setup would make sense.

Google Cloud is not just App Engine now. It's pretty much equivalent to AWS in features. Let me also say that I have a PhD in cloud and big data (graduated in 2013), worked for Amazon AWS, worked for Rackspace for a year and a half, and was a core contributor to OpenStack for the products OpenStack Poppy and OpenStack Zaqar. I hope this puts some more weight behind my statements. If you are interested in the details and why Google Cloud is better than AWS for 90% of products, reach out to me.

They aren't even close to each other in features. All the various additional services AWS supports go way beyond what Google has at the moment: numerous database offerings, queueing, a transcoder, and so on.

And no, the ability to run it yourself on Google is not the same.

Your statement is simply false.

Here's a material answer - https://cloud.google.com/docs/google-cloud-platform-for-aws-...

In short, Google has parity with AWS on many fronts, and exceeds AWS on many others. Only material AWS advantage at this point is full IAM. Biggest thing you gotta remember is AWS is stuck in "VM" world and only slightly deviates from that. Google's advantage is in its fully managed services, which AWS does poorly (don't tell me Redshift is "fully managed").

Google has Bigtable, which AWS has no competitor for. Google's Pubsub is vastly superior to SQS. Google doesn't need Firehose because Google's services scale to Firehose levels without needing a new product and a new price. Google has Zync transcoder service.

Google has BigQuery, which is vastly superior to Redshift in price, performance, scale, and manageability. Google has Dataflow, which AWS has no competitor for.

Even for VMs, Google offers better networking, faster disks, more reliability (live migration), better load balancer, etc.

And to put the cherry on top, Google's no-commitment, no-contract price is only beaten if you lock yourself into AWS's 3-year contract.

(disclaimer: work on BigQuery)

Cool story, bro.


Please take a look at Google Cloud. It's now feature-equivalent with AWS (https://cloud.google.com/docs/google-cloud-platform-for-aws-...). Google Cloud is better than AWS on all fronts other than relational databases: Google Cloud has MySQL only, while AWS has MySQL, PostgreSQL, Oracle, and MSSQL. On the rest of the services, Google beats AWS by a wide margin. I have done a fair amount of benchmarking of Google vs AWS in the past 3 months and the conclusions are shocking. With Google Cloud, you can save about 50% of your bill (I have taken reserved instances into account), get a huge increase in company agility (speed in making decisions like capacity planning, instance type, sizing, zones, cluster sizes ... due to simplified infrastructure and pricing, and performance-wise too: blazing fast disks, live VM migration), and get far better security (data encryption at rest and on the wire by default for all products, SSH key management). I am presenting my findings at our meetup group (http://www.meetup.com/Cloud-Big-Data-and-Data-Science-Group-...) in a couple of months. If any of you want to talk, reach out to me at obulpathi at gmail.

You'll need a better source than a Google marketing page to support claims of "feature parity". An objective comparison also shouldn't start with "Please look into Google cloud".

That is why I am sharing my findings in detail at our local meetup. If you are interested in knowing more, just shoot an email to me at obulpathi at gmail. Also see my other comment above.

Thanks; I hadn't seen that page from Google, and it's a pretty good idea to make that available.

You know what would be a cool idea: if Google developed a method of copying an Amazon Machine Image (AMI) to Google Cloud. That would give us an easy way to try out our same servers at Google without having to rebuild everything, as we do not use containers yet.

You're right, but I know of a few startup teams that chose Linode, Rackspace etc. at start to account for future 'scalability' concerns. Well the number of users never came and they spent a lot of time just configuring stuff. Of course I have also seen teams that couldn't build a decent business on AWS.

I think the net cost of an extra person (a server admin) on a startup team is more than the cost of the server.

Huh? In general AWS is chosen for 'scalability' concerns given its auto-scaling functionality, not Linode/Rackspace/etc.

If anything, I see a lot of people choose AWS for 'scalability' concerns when they never end up needing to scale.

Your 300k people/day metric conveys little information unless we get an idea of the workload each user entails.
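
To make that concrete, here is a rough back-of-the-envelope sketch in Python of how a visitors-per-day figure turns into a request rate. The requests-per-visit values and the peak factor are purely illustrative assumptions, not numbers from the parent comment; the point is only that the same 300k/day can mean very different loads.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_second(visitors_per_day, requests_per_visit, peak_factor=1.0):
    """Request rate implied by a daily visitor count.

    requests_per_visit and peak_factor are assumptions the reader must supply;
    a peak_factor of 1.0 gives the flat daily average.
    """
    total_requests = visitors_per_day * requests_per_visit
    return total_requests * peak_factor / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical workloads: a cached landing page vs. a chatty web app.
    for per_visit in (1, 10, 50):
        average = requests_per_second(300_000, per_visit)
        peak = requests_per_second(300_000, per_visit, peak_factor=3.0)
        print(f"{per_visit:>3} requests/visit: "
              f"avg {average:7.1f} req/s, assumed 3x peak {peak:7.1f} req/s")
```

At one request per visit the average is only a few requests per second; at fifty it is a couple of hundred, before even asking how expensive each request is.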

AWS and DigitalOcean customer here (as well as internally hosted).

The main advantages I see are:

Security - AWS has security zones, firewalls, etc. for free.

Momentum - constant new features (Apache Spark, Lambda, SQS).

Your post basically picks the one thing he voluntarily disclosed and ignores the rest of his perfectly reasonable statements in some manic attempt to defend the one true stack.

The only points you even remotely bring up are vague and wishy-washy and could easily be applied to any statement, even outside of programming.

Most of the reasons to go with AWS as a major strategic decision are to avoid paying for 3-4 (or perhaps... 40+) ops engineers to run and manage so many things that really don't add much business value in themselves and that you could hold to some form of an SLA (especially with enterprise support accounts) and recoup some costs if things came down to it. Furthermore, engineering a lot of AWS services would cost you developer time as well.

I consult / contract for a Fortune 50 customer that pays an insane amount of money to AWS per month and it seems ridiculous what they're paying for and what they get. But do you know what's even more ridiculous? Paying three times as much but getting even worse reliability and a massive amount of time wasted while waiting weeks to provision servers. One group internally used to have massive outages daily, resulting in needing to hire a dedicated team of about 6 just to handle that load of production incidents (the terrible reliability also resulted in loss of revenue), and since migrating their software, even with the most bone-headed SPOF-everywhere AWS-based software architecture, they haven't had more than one or two outages per week. The cost benefits get better the more incompetent / incapable your internal IT organization is. Cutting the resources wasted on internal bureaucracy, legacy, and working with companies that don't really practice any form of technology at scale has been an incredible amount of savings for my customer.

And knowing how much my customer relies upon AWS support for the most menial of tasks (primarily a cultural thing with how they treat their vendors as well as internal resources) I am sure that nobody but massive companies that deal with external bureaucracies effectively could deliver the size and kind of support team to handle the volume of support requests generated by my customer for the most ridiculous of tasks (I've seen executives file AWS support requests to reboot VMs at 3 am because the bureaucracy is extremely brittle here and reflects into the software systems reliability). We may throw millions / month at AWS, but Amazon has to hire at least a few dozen engineers and several account managers just to handle support alone which I might argue could be a net loss to handle the customer.

Most vendors that work with my customer have a very hard time working with them because they approach sourcing quite similar to how Walmart does and abuses them to the point where the vendors abuse their employees.

"I've seen executives file AWS support requests to reboot VMs at 3 am"

Doesn't this fall outside Amazon's area of responsibility?

At least the documentation I read and a few AWS promotional talks I attended seem to suggest the contrary.

Or have I understood it completely wrong, and Amazon is now also providing managed IT services?

There are some legitimate cases where AWS support is the right place to issue a request. Just the other day I had a VM go down and I issued a request to stop the instance and it just hung there with no valid way to keep shutting it down. Turns out AWS support is the only way for most users to get an instance on a host that suddenly goes down while stopping your VM to be correctly stopped because the underlying API call does not go through and it seems like a small critical section of sorts.

But when you pay AWS so much they are pretty beholden to your requests and can wind up becoming a lot like an MSP. I've spent countless hours with those poor guys waiting on a traceroute failure or for mtr or sat to give some anomalous event that everyone suspects is something wrong in AWS when I've insisted AWS is almost never at fault because our incompetence is almost certainly the problem.

The AWS support team's worst engineer is probably at least in the top 25% of our support engineers I'd argue so I'm sure if my customer had the option they'd want to pay to have them troubleshoot our own internal networks but that's what my team is for I guess.

"The cost benefits get better the more incompetent / incapable your internal IT organization is. "

Yes. There will always be a less stupid solution out there that is not as bad as the bad solution.

Or, you could fix IT. But that takes thought.

Easier to give money to a vendor you can blame...?

Fixing completely dysfunctional F500 IT organizations is something that many, many people have tried across completely different company cultures with maybe a handful of successes I've heard of. Leading huge organizations (especially ones damaged by previously failed leadership efforts - think how attempts to refactor a legacy codebase go and multiply by an order of magnitude in number of variables) is very hard, and because large organizations tend to have so much capital working for them, IT management failures create a large opportunity (in theory) for optimization by others. In practice, consulting for organizations that don't want to fundamentally change how they do X (or think $ must solve everything and leadership becomes so fragmented into non-communicating middle managers) while wanting large changes to X is the equivalent of an obese person wanting to lose weight without changing their diet or exercising. Yet we know there's a huge "fitness" industry with large profits being made off of those that do not use their products and subscriptions effectively.

Change is tricky. What you said reminded me of an interesting tweet I came across: https://twitter.com/JICare/status/662339652825780224

This is a fairly naive way of looking at it. Often it's easier to deal with an external organization than to fix your own organization's behaviour.

In a lot of pathologically-managed companies or divisions, there is effectively zero chance of an employee being able to change the pathology.

I have heard of a few poorly managed companies pulling out of ruin by using consultants and third party labor to supplement their collapsing workforce. But I've seen poorly managed companies on occasion fire their executives and stabilize as well. People love to spend 7 figures on consultants and refuse to take their advice in favor of suck-ups that recommend that they just slightly change what they're doing or just sit in silence and continue to bill out. This is why I stopped consulting / contracting in the federal government space - nothing I do, even if I were a company founder, can fundamentally fix even a tiny fraction of the deep flaws. People got rich off of manipulating contracts to do as little as legally possible while the best at delivering were cut because they actually finished the job right. Reward for failure is a system that exists plenty in the private world, but only in the public sector is it institutionalized.

This sounds about right. This is why I don't believe in the "industry is more efficient than government" dogma. Large organizations of people are generally inefficient unless extraordinary care has gone into the engineering and maintenance of the organization.

Are there any particularly memorable situations in your consulting career that illustrate the political realities of such organizations? Would be really awesome if you can share any :)

Hi, first of all, thanks for your comment, and your full-disclosure. I am a happy Linode customer, and I have been using your service since 2010-09-02 (that is the date of the first invoice), but I have a few issues with your product that you might be able to look at.

- It is too hard to figure out that you offer hourly billing, I found it by accident, and a year too late.

- I tried out your NodeBalancers, as I thought they would be a nice help for my stack, but I ended up building my own HAProxy solution since the performance wasn't good. When I pay for a dedicated load balancer, I would expect it to be top-tuned from your side.

- Your API is a mess, and so is your documentation. Everything is GET-requests. And creating a new instance, shouldn't take 11 requests, and a dance to set every little detail correctly. And to be fair, your documentation doesn't even document that, this is a subset of what I use: https://gist.github.com/kaspergrubbe/756f0a227db8aeb92818 I even managed to find some errors in the documentation when I went through it, so it didn't feel very polished or tested, but maybe it isn't used a lot. It also seems strange that to emulate one or two clicks in the interface, needs 11 requests through the API.

- Your advanced settings are extremely hard to find, especially "Edit Configuration Profile" and the "Auto-configure Networking"

- Elastic IPs. I feel like it should be easier to swap IPs around Linodes, without me using heartbeatd and some strange configs. This is much easier with Amazon, and one of the reasons why I prefer them.

- Anycast routing. Linode has several datacenters, it would be awesome if my setup in the US could have the same IP as my setup in the UK, because right now I am doing ugly DNS hacks to send my customers to the closest datacenter. Is this planned?

- Lack of an object store, even though I use Linode, I am still using AWS S3, which means added latency, and I have to pay Amazon for outside traffic, you can't even compare with AWS on these things without starting to have these things.

Can I setup a Linode through the API with private networking enabled?

I am a happy customer, and I loved your KVM upgrade, I love your prices, and I love your service. But it would be nice to see more innovations happening.

Hi there. I wasn't looking to represent Linode tonight (just my personal opinions here), but I'll go ahead and fill you in. I can't say more than "I've made a note" of most of these things, but I wanted to mention that we're looking at overhauling some things that will definitely fix a lot of your gripes. Linode has been around for 13 years now, and a huge part of my job and the reason I was hired is to modernize a lot of the stale parts, including lots of the things that are frustrating you.

I'm also a happy customer and want to address a few of these points.

> It is too hard to figure out that you offer hourly billing, I found it by accident, and a year too late.

It's still a pretty new feature, only about 18 months old, and their price sheet has clearly listed e.g. ".06/hr to $40/mo" for quite a while now. I vaguely remember getting an email and seeing a blog post about it when it was announced...

> When I pay for a dedicated load balancer, I would expect it to be top-tuned from your side.

I think it is top-tuned, but only on $20/month worth of hardware and networking. ;-) They explicitly say in some blog posts to add more NB and scale out if you need more concurrent connections. I agree that this is a little alarming, and I'd be willing to pay more for beefier NBs—just like I do with regular nodes. Building out your own HAproxy nodes is a fine solution and what I plan to do if this becomes an issue for us.

> Your API is a mess... It also seems strange that to emulate one or two clicks in the interface, needs 11 requests through the API.

It's not the greatest API in the history of computers but it gets the job done and I found the documentation to be more than adequate. It's bare-bones and you're going to want an abstraction over it, whether you write your own (like I did), use their CLI, or use Salt or Knife or what have you. Whether it's one API call or 100 doesn't matter if all you have to do on a regular basis is `linode-create foo` or `ansible-playbook create-linode.yml bar`.

Anyway, I just ran through a quick build using the web interface and it takes like 10+ clicks to build a node with the default disks, no private interface, no label, no forward or reverse DNS, not booted, etc. For this to take four API calls to allocate the node, create root and swap volumes, and build the config doesn't seem like a big deal to me. Tack on another four API calls to set the label and display group, set the DNS, add a private IP, and boot the node. I'm not seeing the problem here.
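
To illustrate the "abstraction over it" point, here is a minimal Python sketch of hiding a multi-call build sequence behind one function. Everything in it is hypothetical: the action names (`node.create`, `disk.create`, and so on), the parameters, and the response shape are made up for illustration and are not Linode's actual API. The transport is injected so the sketch runs against a fake instead of a real endpoint.

```python
from typing import Callable, Dict, List

# A transport takes an action name plus parameters and returns a response dict.
# In real use this would wrap an HTTP client; here it is an assumption.
Transport = Callable[[str, Dict[str, object]], Dict[str, object]]


def create_node(call: Transport, label: str, plan: int, datacenter: int) -> int:
    """Run the whole build sequence so callers only ever make one call."""
    # Hypothetical action names; substitute whatever the real API expects.
    node_id = int(call("node.create", {"plan": plan, "datacenter": datacenter})["id"])
    root_id = call("disk.create", {"node": node_id, "kind": "root"})["id"]
    swap_id = call("disk.create", {"node": node_id, "kind": "swap"})["id"]
    call("config.create", {"node": node_id, "disks": [root_id, swap_id]})
    call("node.update", {"node": node_id, "label": label})
    call("ip.addprivate", {"node": node_id})
    call("node.boot", {"node": node_id})
    return node_id


class RecordingTransport:
    """Fake transport that records actions and hands back incrementing ids."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self._next_id = 1

    def __call__(self, action: str, params: Dict[str, object]) -> Dict[str, object]:
        self.calls.append(action)
        response = {"id": self._next_id}
        self._next_id += 1
        return response


if __name__ == "__main__":
    fake = RecordingTransport()
    node_id = create_node(fake, label="foo", plan=1, datacenter=2)
    print(f"node {node_id} built with {len(fake.calls)} calls: {fake.calls}")
```

Once the sequence lives in one function, whether it costs four requests or eleven only matters to whoever maintains the wrapper.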

> Lack of an object store

I mean, sure, Linode lacks a lot of the things that AWS provides. They also don't provide a database aaS or a queue aaS or a CDN aaS or... you get the idea. I don't think they're trying to compete with Amazon at this level. Where do you draw the line between IaaS and PaaS? Can you implement your own object store using some open source software just like you run your own HAproxy and database instances?

> Can I setup a Linode through the API with private networking enabled?

Yes, but it's a separate API call. :-) https://www.linode.com/api/linode/linode.ip.addprivate

> I am a happy customer, and I loved your KVM upgrade, I love your prices, and I love your service. But it would be nice to see more innovations happening.

Agreed on all points. My innovation wishlist includes VPC (I really just want my own private VLAN) and more flexible node configurations (another commenter mentioned that you can't add disk without adding CPU and RAM, and vice versa, and this is a huge pain point). I haven't had to deal (yet) with IP swapping and anycast but I can see those becoming issues for us in the future. And I actually do agree that Linode should implement some kind of an object store to compete with S3... I would categorize that as IaaS, not PaaS, and it's the most glaring hole in Linode's offering right now.

Linode offers the equivalent of a tiny, tiny subset of AWS. I haven't looked recently, but do they have something equivalent to SQS? Autoscaling groups? DynamoDB? SES? Cloudwatch?

If you are going to end up re-building AWS's services from scratch on top of what are essentially naked Linux machines (or doing your own integration work between services from many vendors), you are doing a huge amount of work that isn't necessary and will be a lot more expensive than that $20k/month you could have been paying to Amazon.

$20k/month is $240k/year, which is at least two developer salaries. Do you really need two full time devops/whatever to manage the open source equivalents to the AWS services?
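
For anyone who wants to plug in their own numbers, the arithmetic here can be sketched in a few lines of Python. The $20,000 and $250 monthly figures come from the top comment; the $120,000 per-developer cost is an assumption chosen only because it makes "two salaries" come out even, and real fully loaded costs will differ.

```python
def annual_cost(monthly_cost):
    """Convert a monthly bill into a yearly one."""
    return monthly_cost * 12


def salaries_covered(monthly_cost, annual_salary):
    """How many salaries of the given size a monthly bill would pay for."""
    return annual_cost(monthly_cost) / annual_salary


if __name__ == "__main__":
    aws_monthly = 20_000        # AWS calculator estimate quoted upthread
    dedicated_monthly = 250     # dedicated hosting figure quoted upthread
    assumed_salary = 120_000    # assumption: cost of one developer per year

    difference = annual_cost(aws_monthly) - annual_cost(dedicated_monthly)
    print(f"AWS per year:        ${annual_cost(aws_monthly):,}")
    print(f"Dedicated per year:  ${annual_cost(dedicated_monthly):,}")
    print(f"Difference per year: ${difference:,}")
    print(f"Salaries covered:    {salaries_covered(aws_monthly, assumed_salary):.1f}")
```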

Aren't there tools that can provide a similarly high level/seamless/polished experience running on your own computer, using your choice of cloud provider as the backend? If not, why not?

(I wouldn't know - I don't even run a website, and if I did, I would avoid the locked-in services like the plague, out of principle. But I'm curious.)

Do you really need two full time devops/whatever to manage the open source equivalents to the AWS services?

No, you need many more than that. AWS's offerings are complex systems, each backed by multiple large teams of developers and operations people. And that's not just because they have tons of customers.

You want to run a partitioned, replicated data store with five nines of reliability? Good luck doing even just that with two people. Throw in metrics (CloudWatch), a queuing system (SQS), monitoring/failover with on-demand provisioning (EC2 + auto scaling groups), and you're talking a lot more than just two devops people... well, unless you want them both on-call 24/7/365, and you want them to burn out in a matter of weeks.
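
For a sense of scale, "five nines" leaves roughly 5.3 minutes of downtime per year. A small Python sketch of that arithmetic, assuming a 365-day year and counting every minute of unavailability equally:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of N nines (e.g. 3 -> 99.9%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in (2, 3, 4, 5):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines} nines: about {minutes:,.1f} minutes of downtime per year")
```

A single on-call engineer who takes ten minutes to wake up and log in has already spent about two years of that budget.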

AWS isn't perfect by any means, and you can certainly argue that it can be overpriced... but maybe you pick and choose, and use AWS for the things you don't have expertise in or time for, and roll your own for the rest. The nice thing is that AWS lets you offload all that stuff when you're small and don't want to or can't retain that kind of expertise in-house, and then you can focus on actually building your awesome product that runs on AWS's plumbing. And when you have more time and people later, you can look into doing things differently to save money, piece by piece, when it makes sense to do so.

From a management standpoint, it takes at least 2 developers to run something equivalent. Perhaps not full time, but it doesn't scale out that much - not more than 10 highly available services per team. A good example of this is roughly a 3 instance RabbitMQ cluster used by ~100 people.

1. The "Bus Factor" - https://en.wikipedia.org/wiki/Bus_factor

2. Context switching - if you have a developer that also works on coding or other things.

So maybe it's not a full $240k/year for only 1 service, but the price in the door IS that. 2 to 3 extra services will cost no more - that's where AWS will make the money.

I'd think you'd need those staff (or approximate) regardless of whether you used AWS or not, no? It ain't the marketing team that orchestrates all that amazon stuff.

At scale, the arguments about needing 10 admins vs needing 15, for example, start making sense.

It depends on which services, but for simpler services with well understood open source equivalents (e.g. route53 and BIND) you can probably manage two per sysadmin at a similar quality to AWS with a basic API. For something like RDS or Redshift, if you're very lucky you'll get away with one sysadmin per service.

This covers not just running the services, but keeping up to date with security updates, scaling the systems, feature additions, bug fixes and keeping up with the state of the art.

Services like RDS are an amazing net cost saving for small to medium users. Something like Cloudfront would cost you a million just to get going. Raw EC2 is nearer break even.

"Something like Cloudfront would cost you a million just to get going."

We've used CDNs in the past, way below a million/year for a medium, growing startup. Care to elaborate?

I believe the context is running a CDN as opposed to using one.

Look into Kubernetes. Since everyone is migrating to Docker and microservices, Kubernetes provides a perfect fit to run your workloads on premises and then migrate to the cloud when extra capacity is needed. It's open source and supported by several vendors.

Or even VPC. I'd like to be able to run a small Kubernetes cluster on Linode at some point, but I'm not sure I'd want to run it without a VPC.

I'm involved in two different startups, one on Linode (because that's all it can afford) and the other on AWS. For its current offering, Linode is great for MVPs, and if you like your servers as pets, but it's doubtful we'd stay with Linode once we start having to scale.

False. Most AWS services are implementations of open source solutions. Once you implement it yourself on Linux, the ongoing cost drops. With AWS you pay more the larger you get.

> Most AWS services are implementations of open source solutions.

This is fundamentally not understanding what AWS is.

- You could say "EC2 is just Xen". But next time there is a 0-day exploit, I'll have AWS working all weekend to patch my servers. And Xen still doesn't have an API for scaling physical hardware.

- You could say "S3 is just Apache". But I will never see a "disk full" message, I will never get paged if something is borked (but it will still get fixed), I will never worry about DDOS attacks, etc.

> Once you implement it yourself on Linux, the ongoing cost drops.

That's like saying "it's cheaper if you change your own oil". Might be true, but doesn't matter. I'm still taking my car in. Lots of other people do. You might try asking them why.

Some services are unique. EMR and RDS are not, and neither are several others.

Outsource if you want. When your bill hits $500k/mo and you realize you're paying for things you can do yourself, your position may change.

Have you been through an AWS outage and been paged? It happens. When you realize your business depends on an opaque organization you may want to diversify.

If you're small it makes sense to outsource sometimes. Not always and not forever.

I am especially curious to see that 'cheaper if you do it' done for their databases offers.

I'm using them in prod already. Trying to create a similar setup on real hardware for enterprise customers using pg, pgpool2 and haproxy is already a pain, and even then you can't auto-recover them as conveniently when they go belly up.

What? Implement RDS/SQS/Redshift myself? That is not easy, believe me.

It's not easy, but also not very difficult. The company I work for has two DevOps people (only one until a year ago) and runs on AWS serving millions of users.

We are using MySQL RDS, but are migrating to self-managed Cassandra. We use RabbitMQ instead of SQS. We use Hive/Presto instead of Redshift.

You sure can adapt the open-sourced solution, though I don't think Cassandra/Presto can match up to what RDS/Redshift can offer.

The key point is that by running those things yourself, you are likely to lose all the nice failover/monitoring/auto-snapshotting mechanisms that AWS offers. To live up to that, you not only need an extensive understanding of the software you are running, but a considerable amount of your time will also be dedicated to the Ops side, which can lead to some really big frustration from time to time. In that sense, I don't think you can call it re-IMPLEMENTING something if you accept too many compromises.

Cassandra in particular doesn't compete with RDS as much as DynamoDB and, respectfully, setting up failover and monitoring and autosnapshotting isn't actually that difficult. I'm a big fan of outsourcing that stuff--the new system I'm building uses AWS Lambda, for example--but when using a managed solution requires compromising on features or tying yourself into knots to make things work, hosting your own is doable, especially in the age of devops. Automation is your friend!

It is very difficult to run it at AWS's scalability, reliability, security and automation level. These all cost money and/or time. Not all companies require these features to the same extent, of course, so AWS is not always the right choice, but it isn't always the wrong choice either.

I've been a customer of Linode for over a year and will soon be migrating to another provider, most probably Amazon. The main reason is a lot of connectivity issues and downtime. I always hosted pet projects and simple web apps on Linode, but just this week I used it for a major project. Now the whole Fremont, CA region has connectivity issues and is down for a couple of hours per day. The same happens in other regions as well. Maybe it's always been like that and I just never paid attention. You can see their incident history here: http://status.linode.com (I browsed it a little and it seems they had a lot of issues in the past as well). I will probably lose about 6 times more money than Linode has saved me over AWS from this week's downtime alone.

So my suggestion is: for small, non-critical projects you can use Linode, but for anything serious you'd better use something that is reliable.

They've been getting DDoSed for the past week. I've had good luck getting credit from them after past outages.

There's a lot more that goes into the decision than service costs. There are lots of other costs that must be considered. E.g. if using AWS can save you one employee, that alone might make it worth the extra service costs.

I'm not saying you're wrong, just that your analysis is incomplete.

I'm sure there's a situation where AWS saves money on TCO. If you ever find it, please let us know.

Umm, does linode offer higher level services like SQS, Lambda, SWF and friends?

Their open source alternatives running on Linux?

Because ops is free.

+ the cost of paying people to set up, deploy & manage these things.

In the past during security incidents (and there have been more than three major ones in as many years), there has been absolutely no transparency from Linode management.

That's more than enough reason for serious customers to avoid Linode...

I don't think it's appropriate for me to comment on security incidents (before my time, anyway), but I'm not trying to sell people on Linode here. MediaCrush was actually using a combination of Voxility and Digital Ocean (and Route 53, actually). There are lots of options, guys.

I was going to say the same thing. The amount of transparency in general from aws is DRAMATICALLY better than linode.

I simply don't believe the $250 - $20,000 month comparison. Could you please provide some real-world numbers/hardware/stack information?

I just did this migration (non-AWS -> AWS), and we saw a ~20% reduction in monthly prices with 1-year contract at AWS vs. physical hosting (admittedly we were getting terrible pricing with our physical hosts, but still). This will be reduced with the addition of t2.nano servers as we run enough t2.micro instances for little odds and ends that it will put a dent in the bill.

Your parent is [rightfully] including bandwidth. If he was able to get 1gbps unmetered, then the math is in the right ballpark (that's a huge if though).
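For anyone who wants to check the ballpark themselves, here is a rough sketch of the arithmetic. The per-GB prices are assumptions picked for illustration (they are not quoted AWS rates, and real egress pricing is tiered), and the function names are made up for this example.

```python
# Back-of-the-envelope: how much data does a saturated 1 Gbps link move in a
# month, and what would that cost at a flat per-GB egress price?
# The $/GB values below are ASSUMPTIONS for illustration, not quoted AWS rates.

SECONDS_PER_DAY = 86_400


def monthly_transfer_gb(link_gbps: float, utilization: float = 1.0, days: int = 30) -> float:
    """Gigabytes moved in `days` days over a link of `link_gbps` gigabits per second."""
    gigabytes_per_second = link_gbps / 8  # 8 bits per byte
    return gigabytes_per_second * utilization * SECONDS_PER_DAY * days


def egress_cost(transfer_gb: float, price_per_gb: float) -> float:
    """Flat-rate egress cost; real pricing is tiered, so treat this as a rough figure."""
    return transfer_gb * price_per_gb


if __name__ == "__main__":
    full = monthly_transfer_gb(1.0)
    print(f"1 Gbps saturated for 30 days: {full:,.0f} GB")
    for price in (0.05, 0.09):  # hypothetical $/GB
        print(f"  at ${price:.2f}/GB: ${egress_cost(full, price):,.0f}/month")
```

A fully saturated link works out to roughly 324 TB a month, so anything in the range of a few cents per GB lands in five-figure territory. Nobody runs a link at 100% around the clock, so scale `utilization` down to taste.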

Still, I agree with the parent's general sentiment re price. The gap used to be much worse, but it's still anywhere from 2x-20x depending on what you're doing.

- EC2 prices are simply high and exclude things like bandwidth.

- The CPUs are usually 1-3 generations behind (they've gotten better with this).

- Virtualization adds overhead (again, this can be very significant if you're doing a lot of system calls)

- IO options are significantly worse

- Their network connectivity is, at best, average

You said it yourself, you're comparing a "terrible" deal to AWS and only saved 20%. Sounds to me like AWS is only 20% better than terrible ;)

I think in this case I could have interchanged "terrible" with "enterprise" and it would've had a more ""correct"" meaning (seriously I wish I could keep adding quotes to all of that...). I had to fight and argue with the previous host (Rackspace) to get them to lower our prices. It also took weeks to deploy a server. Now it takes 5-25 seconds with a prebuilt AMI that I've cooked together, deployed on exactly the hardware that I want, with the network configuration that I set up, and the security rules (bi-directional! Rackspace firewall only offers inbound at the level we had) that I want/need. On top of this, to get lower prices, I open a ticket and say "Hello, I'll be needing this server for 1 year, lower prices please." No phone calls involved. I'm not trying to diss Rackspace, they seem to be doing a lot of cool things... I simply had a permanently sour taste left in my mouth. AWS is a big juicy BBQ sandwich in comparison.

I'm not going to argue with your points about AWS itself, as they certainly are valid. However, in my specific setup (everything production is c4/r3 types, 1tb gp2 volumes as all volumes, and running network benchmarks to ensure everything is properly sized) I've actually had a pretty big performance gain over dedicated servers in a DMZ behind a hardware firewall.

With regards to the bandwidth, it seems a common issue is when you have very chatty services (Apache Kafka) deployed across multiple AZ's the bill gets real big real fast. Thus far, we're not even cracking 3-digits in bandwidth monthly. Maybe we're simply not at a scale to notice these problems yet.

This is correct. The biggest cost I saw was bandwidth, followed by cycles on their machines. MediaCrush was a "media hosting website" (see original comment) and as such, used a lot of bandwidth and did a lot of work transcoding things, and had a large storage requirement.

Were you comparing bandwidth/data transfer in your numbers for EC2, S3 or CloudFront?

I don't know, it was an anecdotal story from a while ago. Before we left AWS we were utilizing all three (iirc these numbers didn't factor in CloudFront though). Point I'm making is - do the research and don't just reach immediately for AWS.

Oh okay I was just curious. I completely agree. The AWS bill at my startup went up quite a bit after the free trial ended since there are many variables that are included with each service such as how many requests/mo, tiered bandwidth, etc. The AWS calculator is nice and so is the billing estimator, but it's still quite confusing.

We started with Linode. We moved to AWS. Here's the main reason: you can't decouple disk space from RAM. So we paid thru the nose just to get more disk space. With AWS it's pay for what you need. That's the killer feature for us.

Sure, but you could have also just moved to SoftLayer or any number of dedicated hardware providers for the same, probably cheaper, result.

Which ones decouple disks from compute instances when it comes to paying?

As neither an AWS nor a Linode customer (but I might become one), I have a couple of questions:

- Can Linode offer short-term compute or memory intensive service? e.g., If I want to consume 100 teraflops 24x7 for about a couple weeks, or maybe 10 teraflops but a terabyte of memory. (Pay-as-you-go, not monthly/yearly subscription)

- On top of that, can I do the above with my IPython notebook code, which is already set up (it uses numpy at the backend so python slowdown is not a problem), with a step-by-step process for making it run on Linode (without having to install the python scientific stack)?

These are some of the use cases that I think are getting popular these days, and for which AWS is known for, e.g., in the areas of machine learning, 3D rendering, etc.

Good luck trying to get huge capacity from AWS with no history.

Their service limits start low and ratchet slowly, if they have the capacity to give.

All these features and elasticity sound nice until you get big and they don't deliver.

This is categorically false. If you have large capacity needs talk to your TAMs/Account Manager and they can make it happen.

It's been true five times in the last three years for me. The elasticity isn't always there and service limits are not always granted when needed.

I don't know what an AWS TAM is but the SAs have tried and failed.

Make a new account and try to run 1000 c4.4xl in parallel for 1 hour. It won't happen.

That's not what happened the five times, but it's illustrative.

And who defines what is a need and what is a want?

That's because of a history of bitcoin mining abuse. People post their API keys on github every day and Amazon refunds almost all of the lost compute time.

I've asked and got basically a "no". I can believe that if you have a history of spending big $$$ on AWS they get increasingly flexible with raising the limits. But if you don't, they aren't going to let you spin up 500 instances, even if you ask nicely and have a reasonable justification.

I'm working at a medium, growing startup and wonder when/why to use AWS?

We currently use a mixture of root servers (DB) and VPS (App, HAProxy, NGinx) at a hoster.

* We use a Docker/Consul/Nomad/SaltStack setup to install new servers for replacing broken ones and growth

* App setup is Redis/Disque, Postgres(DB, BI) and app servers

* New dedicated servers we get in around 1h, VPS in around 1min (Only VPS over API)

* We have around 0.2 Ops FTE (devs rollout apps, ops for security updates, new infrastructure etc. and incidents)

* Price seems to be way below Amazon

0.2 Full-time employees to build all the automation, scale up (and down!) the stack as needed, monitor, operate and maintain it at an HA level? That is a huge feat, and I'd love to read about how you pulled it off. Please consider sharing your story, and perhaps your Salt recipes ;)

With all those things that you have in place, AWS would not be a big win for you, though you still might find some use of AWS services like S3 for reliable object storage, Route53 for DNS, or Lambda for "server-free" event handling. The unique thing about AWS is the huge portfolio of services they have and the absolutely amazing rate at which they manage to pump out new (and useful!) ones.

It takes some weeks to get it working, I'd say two to four, depending on how often you've done it. Then for weeks there is nearly no Ops at all (mostly just security updates while the servers are humming), with releases done via Docker push by developers. Adding and provisioning a new VPS takes some seconds (plus some seconds of checking later), and with Consul the stuff running on the VPS usually integrates itself. Sometimes hardware breaks down; then it's either set up a new VPS (setup_server <IP> <Role>), which takes around 5sec, or mail the hoster if it's a DB server with a (mirrored RAID) SSD crash, which takes around 2h to fix all in all. So on average per year I'd guess 0.2 Ops FTE (which equals 10 weeks a year, which is a lot if you're not Facebook and don't need to do releases of new app versions). Today ops is relatively easy with a well-trodden path, it's not 1999.

The Salt is off the shelf, with some bash scripts on top to auto-generate some config and make installation of a new server one script call.

We surely could put endless hours into Ops, playing around with stuff etc. but the benefit would be marginal.

Our DNS setup is simple, not much to say: we use DNS Made Easy for failover. We do use S3, though I do not consider it "AWS" (S3 is comparatively cheap).

I've wondered about Mesosphere with autoscaling etc., but since adding a new VPS takes seconds, the two weeks needed to set up Mesosphere/Kubernetes would take quite some time to amortize.

If you're just using AWS for EC2 then you're probably better off with another provider, but I think the key advantage AWS has is in the 'WS' part of their name.

Cloudwatch, SQS, SNS, Dynamo, S3 etc offer good managed services that you can plug into your applications/systems with not that much effort - and crucially very little OPS needed.

FWIW, I'm a current Linode customer that will be migrating to AWS soon. I like Linode as a basic provider, but I really need the better features such as ASGs, AMIs, and RDS. That's worth the higher cost per hour for me.

AWS hands out credits like candy and there's tons of ways to get AWS credit or earn it. I wouldn't be surprised if cash-strapped startups built on AWS credit to go out and get funding and then just never move off of it.

Correct. Lock-in is a key part of their strategy.

Hmm. Well, somehow these heavy-hitters have presumably done their research and then chose AWS:


I see your characterization of AWS above as "terrible", but I'm curious how that squares with the (probably very well-informed) decisions these people made to use AWS.

I would not "do my research" on the sales site of one of the vendors, if that's what you're suggesting.

The poster is suggesting that the firms listed on Amazon's page did their research before buying, then settled on AWS and felt strongly enough that it was a good choice to make a public recommendation.

Or got enough of a discount/payment to make a public recommendation in return?

People pay for convenience. AWS is just convenient. Managing your own servers requires more work. It is far far far cheaper in terms of server costs. You do have to:

- Know how to maintain a server by yourself

- Hire someone who can

AWS removes part of that and thus people pay for it. An average (good enough) sys admin costs the same as a programmer.

A disclosure does not allow you to make a blanket statement like "Perhaps the first AWS mistake you might make is... using AWS?".

Reply by kawsper to your comment highlights several drawbacks of Linode. Specifically this feedback "Your API is a mess, and so is your documentation. Everything is GET-requests. And creating a new instance, shouldn't take 11 requests" alone, if true, is sufficient for me to not consider Linode for hosting. As being in an ultra small team of 2, we can't afford the luxury of API requests failing so many times.

Having used AWS since 2010, and since 2007 for backups (S3), I can say that your comparison of $20,000 on AWS vs. $250 is wrong. Have you assumed a very foolish selection by a user, wherein the user just picks all of their managed services (e.g. RDS for a DB, etc.)?

Just as a fact, we don't use RDS but install MySQL ourselves, as it allows us to have other things on that instance and also comes out cheap in comparison. On this note, my contribution to the OP (article) would have been: bundle different services on an instance, rather than buying stereotyped instances. Of course YMMV.

The biggest advantage is for ultra small teams, where you have comfort of a stable environment, with respect to instances launching; APIs working; volumes getting attached/unattached; snapshots getting taken; backups done on s3. All taking place automatically, while you rest peacefully.

I am not arguing that apples to apples Linode (or some other service) will not be cheaper than AWS. I am sure it will be. But still many people would like to stick with AWS because of stability and maturity of its cloud.

Also you got to give credit to AWS for pioneering hourly billing, and disrupting the cloud environment. Despite agreeing with you mainly on the cheaper point, when I moved to AWS from dedicated hosting in 2010, my monthly bills reduced.

Lastly. The experimentation. The wide variety of instances it has from micro (to the recently launched nano) to ultra large instances, allow you to experiment a lot. For example: in the past few months: I moved from couple m3.large to couple c4.xlarge, while experimenting in between on c3.xlarge type of instances.

Finally, when you make such a blanket statement, you not only demonstrate your naivete but also undermine the decisions of thousands of satisfied AWS users.

All said, I would like you (Linode) and others to be a good alternative. I am glad you are there. This is to keep AWS on its toes. Which is in my interest as their customer.

PS: Lest my handle makes someone think, I represent AWS. Please be assured I do not. I just created this handle, when I had to ask an AWS specific question (you can check my first post in this handle). And then continued with it for other things.

Edit: minor correction

I can't believe this is the top comment. This is supposed to be about AWS advice, not your personal feelings about AWS. This comment is bad and you should feel bad and everyone who upvoted it should feel bad.

As an AMZN shareholder, I encourage you to please use AWS. That is all.

This definitely reads like an ad for Linode to me and it really is turning me off from your company. Does the company know you are posting these comments on HN?

Really? I didn't get that at all.

What I saw was a post giving negative opinions on various aspects of AWS, and stating that AWS' offerings are available elsewhere. There was an example given, of a previous project where he'd switched away from AWS to dedicated physical servers (at an unnamed provider). There was no suspicious boosterism/puffery in the direction of his own employer, like you generally see when a shill is trying to get away with something.

In fact, it seems like the only real mention he made of linode was in the disclosure that he kindly added, pointing out that he worked there and was therefore biased. Most ads don't bother to do that.

If this counts as an ad, then by those standards nobody could ever post in any thread covering the industry in which they work.

Things are looking really bad for Linode:


I feel like you wouldn't be too pleased if a super-biased Amazon employee posted a similarly anecdata-filled post about Linode. Is it really a smart idea to bad-mouth AWS in a post that really doesn't intersect with what Linode provides in any substantive way?

Considering he put the disclaimer that he is super-biased up front, I personally appreciate the insider review

You know what I've realized is really important? More AWS tutorials are really needed. There are numerous new programmers who want to learn AWS, but can't finish building anything because they get buried in documentation.

I find there are a lot of high-level abstracted tutorials, but for the new services, there aren't a lot of detailed tutorials.

For instance, an implemented cognito->gateway->lambda->dynamodb is really hard for a newbie to do.

I agree that a lot more AWS tutorials and cookbooks would be helpful- but one reason I find so many developers having issues wrapping their head around AWS concepts is that most developers don't have the basic understanding of what they're doing with AWS. Programmers that just write code don't really need to know ANYTHING about AWS because someone else should be setting it up for them. I see so many people doing the most ridiculous things or leaving things completely wide open because they have no concept of network topology, firewalls, segments, VPCs, security groups, deployment templates, etc. I would not trust the vast majority of developers I've ever met to properly setup a simple architecture in AWS because the things you NEED to know aren't covered in the things they've been focusing on for most of their lives. It was predominantly in the domain of the operations/networking/system admin side of things.

I think there's a big need for a crash course for devs that starts with all the crap they previously ignored or had someone else do for them and I say this as someone who has always written code first and done sysadmin second.

> There are numerous new programmers who want to learn AWS, but can't finish building anything because they get buried in documentation.

I agree partially. In the end the documentation is what you are going to need to read sooner or later. Or training equivalent to that documentation. Good tutorials are good for starting, but they don't make you a professional.

I guess that in 10 years anyone will be able to create websites for billions of internet-connected users. But for now, as easy as it is, it is still complicated enough to require an expert. In the same fashion, 20 years ago you needed an expert to make a 3D game, and nowadays there are plenty of technologies that allow you to do that with a limited amount of programming knowledge.

I have seen horrible things, usually security related, because untrained people think that they can achieve anything with just standard configurations and quick tutorials. And it looks like they were able to do it, until something really bad happens. Even people with long experience can make mistakes because it is a complex thing.

AKA what happens when you take a bunch of God complex programmers and set them loose operating production infrastructure on the Internet?

The real issue is that AWS isn't designed as a tool for product developers. Product developers get asked to use it but do not usually have a clue about good systems engineering. AWS was designed for ops and systems engineers first and foremost.

> Product developers get asked to use it but do not usually have a clue about good systems engineering. AWS was designed for ops and systems engineers first and foremost.

Yeah that line is blurring, too.

That may have been true 5 years ago, but a lot of the newer services require developers to just read the documentation, not have a serious amount of background in the setup of the service they are using. Look at things like ECS and Lambda for compute, SQS for messaging etc.

I found this series very helpful: https://medium.com/aws-activate-startup-blog

There's a tutorial on the AWS Labs GitHub that goes through this exact scenario: https://github.com/awslabs/api-gateway-secure-pet-store.

I definitely agree that better tutorials and/or a simpler interface would make AWS more accessible and user friendly.

A YC S15 startup, Convox (http://convox.com/), aims to "make AWS as easy as using Heroku." It looks really promising.

Convox member here. Thanks for the shout out.

This is definitely a goal of Convox: to remove as much AWS complexity as possible.

Our approach matches this guide to a tee. We are using CloudFormation to set up a private app cluster, as well as to create and update (deploy) apps. We are also using ASGs.

The instance utilization point is spot on too. The first thing Convox does to make this easy is provide a single command to resize your cluster safely (no app downtime).

Coming next is monitoring of ECS and CloudWatch, plus Slack notifications if we detect over- or under-utilization.

I strongly believe that these AWS best practices can and should be available for everyone. For anyone starting from scratch or migrating apps off a platform or EC2 Classic onto "modern" AWS.

I hadn't heard of Convox before, but it sounds interesting.

I'd like to use AWS more, but each time I tried to get into it I felt overwhelmed. I currently use PagodaBox a lot, which is great (most of the time) because it handles a lot of the complexity for me, but it can often be expensive. How does Convox compare to PagodaBox?

I've never used PagodaBox but it looks like a nice PaaS.

Convox has the same goal of a PaaS: to give you and your team an easy way to focus on your code and never worry about your infrastructure.

One big difference with Convox is that we accomplish this with single-tenant AWS things. You and your team's deployment target is an isolated VPC, ECS (EC2 container service), and ELB (load balancers).

If you're asking for a cost comparison, we're building Convox to be extremely cost competitive by unlocking AWS resource costs for everyone.

It's easiest to compare the cost of memory across platforms, though it's not always apples to apples...

The base Convox recommendation is 3 t2.smalls which is 6 GB of memory which costs about $100 / month. If your app can be sliced up into 512 MB processes, you can easily run 10 processes, which could be 2 to 5 medium traffic PHP apps on the cluster.

I'm finding PagodaBox pricing calculator a bit confusing but 6 512 MB processes, so 3 GB of memory, is $189.
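To put the two figures side by side, here is a quick sketch of the per-GB arithmetic. It uses only the rough numbers quoted in this comment (not current price lists), and the variable names are made up for illustration.

```python
# Cost per GB of memory per month, using only the rough figures quoted above.
# These are the commenter's numbers, not values from a current price list.


def cost_per_gb(monthly_cost_usd: float, memory_gb: float) -> float:
    """Monthly cost divided by the memory you get for it."""
    return monthly_cost_usd / memory_gb


convox_cluster = cost_per_gb(100, 6)   # 3 x t2.small, 6 GB total, about $100/month
pagodabox = cost_per_gb(189, 6 * 0.5)  # 6 processes x 512 MB = 3 GB, $189

print(f"Convox on t2.small: ${convox_cluster:.2f} per GB-month")
print(f"PagodaBox:          ${pagodabox:.2f} per GB-month")
print(f"ratio:              {pagodabox / convox_cluster:.1f}x")
```

Keep in mind that this divides by raw cluster memory. Only about 10 x 512 MB of the 6 GB is assumed to be usable for app processes, so the real gap is somewhat smaller than the raw ratio suggests.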

So eventually Convox will have to make money, but I'm not sure I see the path... given that I have free access to the software and the infra is provided by AWS.

Do you plan to eventually charge a monthly fee for using the command-line tool?

Thanks for your question. A much more thorough pricing page is in the works.

The most straightforward model, and where we are already making some money, is running Convox as a managed service.

In this setup you and your team get Convox API keys. Convox installs, runs and updates everything for you in our accounts. You get a monthly bill that's your AWS resource costs plus a percentage to Convox for management.

We will be tweaking this model to sell packages so bills are really easy to understand.

Some other experiments we're doing...

We sell support packages and professional services for app setup, migration and custom feature development.

We have a per-seat model for productivity features. Private GitHub repos and Slack integrations are $19 / user / month. There are more closed SaaS tools like this coming.

Infra is trending to commodity prices industry wide.

We'll be selling SLAs, support, productivity tools on top of that infra.

You'll get a cutting edge private platform without hiring and managing your own devops team to build and maintain it.

Open source users will help grow the user base and make the platform better without us running a freemium platform.

Any plans to support the other public cloud providers?

Short term, no.

The plan is to get the Convox API locked in while mastering advanced AWS like VPC, ECS, ELB, Kinesis and Lambda behind the scenes.

Long term, yes, in tandem with the other cloud providers leveling up. For example, Google Cloud Logging (for container logs on GCE) is still in beta.

As an alternative, Cloud Foundry already drives AWS, as well as Azure, vSphere and OpenStack. Those coming from Heroku will find most of what they want, including buildpacks. Those who want to skip buildpacks can use docker images instead.

Disclaimer: I work for Pivotal, who donate the majority of the engineering effort to CF.

Convox member here.

CloudFoundry is a really solid platform, but there is a very important distinction between CloudFoundry and Convox.

Convox is a very thin layer on top of "raw" AWS. It gives you a PaaS abstraction, but behind the scenes is a well-configured VPC, ECS, Kinesis, Lambda, KMS, etc.

For those of us with no need to run on multiple clouds, using pure AWS is simpler, cheaper and more reliable than a middleware like CloudFoundry or Deis.

If you want to run a private platform without bringing in operation dependencies like etcd (Deis) or Lattice (CloudFoundry), give Convox a look.

how does convox scale when moving to non-trivial aws scenarios? if i have three autoscale groups running seven different apps with associated kinesis streams, s3 buckets, elasticache, rds and elasticsearch instances how cleanly does convox handle this? what about monitoring and reporting?

We are working to normalize the AWS scenario for all apps.

Every Convox cluster is an autoscale group managed cluster of ECS instances.

Every Convox app gets its own Kinesis stream for logs, ELB for load balancing, and S3 buckets for settings, build artifacts and encrypted environment. And the app processes are run via ECS.

So I'm confident we could handle the 7 apps in a single cluster, scale the cluster instance size and count, scale any individual app process type, and handle any individual app load balancing or log throughput.

You can provision some services like RDS with our tooling which make it really easy to link to apps. You can also bring your own services like elasticsearch or pre-existing RDS and set them in an app's environment to use it.

There are a couple monitoring tools built in.

We automatically monitor AWS events like ECS capacity problems and send Slack notifications.

You can also use our tooling to forward all your logs to Papertrail and configure your searching and alerting there.

More CloudWatch Logs and Metrics work is coming in the near future.

i appreciate you taking the time to reply, thanks.

some additional questions though:

1. are security groups opaque within convox or are they exposed to developers?

2. when you say monitoring tools are built in and you have tools for logging does this lock me in to the convox log pipeline and monitoring? what if i want to use sensu on my instances? do i have to add sensu to every container? if you run, for instance, five containers per vm do i pay the overhead of five separate sensu instances? same question for something like logstash?

3. you mention vpc. aws has proven stingy with vpc service limit requests in the past. i have trouble getting them to grant more than low double digits per region/account. can i run multiple convox racks per vpc or is the one vpc per rack a hard requirement?

1. Devs don't have to worry about security groups with `convox install && convox deploy && convox ssh`. But security groups are created with CloudFormation and you can use your AWS keys to introspect and change them.

2. Currently every app gets a Kinesis stream and we tail all Docker logs and put them into Kinesis. Then `convox logs` can stream logs from Kinesis, and `convox services add papertrail` adds a Lambda / Kinesis event source mapping to emit the stream as syslog to Papertrail.

I'm pretty happy with this setup and think it represents a good default infrastructure that is still extensible.

Would Kinesis -> Lambda -> Sensu make sense too? It's a pretty new pattern but this seems a lot saner to me than per-container log agents, or even bothering with custom logging drivers.

That said, one user has been using logstash by bringing a custom AMI with his logstash agent and creds baked in.

3. It's one VPC per rack, but I could see modifying that. We've already started to parameterize some VPC settings like the CIDR block to help integrating with your existing VPC usage.
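For readers unfamiliar with the Kinesis -> Lambda -> syslog pattern from point 2, here is a minimal sketch of what such a Lambda function could look like. This is not Convox's actual code: the app name, host name and the `send` hook are hypothetical, and a real forwarder would write to a TLS socket instead of printing. The only thing taken as given is that Lambda hands Kinesis payloads to the function base64-encoded under `Records[i]["kinesis"]["data"]`.

```python
import base64
from datetime import datetime, timezone
from typing import List


def format_syslog(message: str, app: str, host: str = "convox") -> str:
    """Render one log line in an RFC 5424-style layout.

    <14> is facility "user" (1 * 8) plus severity "info" (6); the three
    dashes are empty PROCID, MSGID and STRUCTURED-DATA fields.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    return f"<14>1 {timestamp} {host} {app} - - - {message}"


def decode_records(event: dict) -> List[str]:
    """Pull the log lines out of a Kinesis-triggered Lambda event."""
    lines = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        lines.append(payload.decode("utf-8", errors="replace").rstrip("\n"))
    return lines


def handler(event, context, send=print):
    """Lambda entry point. `send` stands in for whatever ships a line to the log service."""
    app = "example-app"  # hypothetical app name
    lines = decode_records(event)
    for line in lines:
        send(format_syslog(line, app))
    return {"forwarded": len(lines)}
```

Swapping the destination (Papertrail, Sensu, logstash) then only means changing what `send` does, which is the appeal of this pattern over running a log agent in every container.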


Obviously I disagree with the specifics of relative merits. As a nitpick, Lattice was a project to extract core components out of Cloud Foundry into a self-contained unit intended for experimentation, not a standalone component folded into CF.

The feedback was that Lattice is not what developers wanted, so it's been wound up in favour of MicroPCF[1], which is a single VM image that runs an entire, actual Cloud Foundry installation.

When developers decide they want to scale up to any size, they simply retarget a regular AWS/vSphere/OpenStack/Azure CF API server and push again.

I'm sure Convox has a single-VM version I can tinker with on my laptop.

[1] https://github.com/pivotal-cf/micropcf

There are also quite a few Ansible core modules for EC2 if you're looking for a uniform way to manage your resources outside of the AWS GUI, although to be honest I'm not sure if that's "easier" so much as more flexible. Still solves the problem with mistake #1.

Would recommend Terraform here - https://terraform.io/

Try Kubernetes, maybe?

That's for containers. It isn't used to, say, set up SQS.

As much as I agree with your point about it being hard, I'd also like to point out that there isn't anything even remotely similar to documentation on how a pure dev person could do this outside of AWS, with the lack of managed cognito, API gateway + lambda, and dynamodb.

By making those notions and technologies easy and cheap to access, AWS suddenly gave devs the idea that they can roll out complex infrastructure on their own, similar to copy-pasting a piece of code. Well, it is still a bit harder than that, and if you are that kind of dev (which I'd applaud), you'd better dedicate some time to learning those technologies.

This is probably one of my biggest gripes. Our CTO/PMs are like "you weren't able to containerize and get our app autoscaling in a few days? I don't get it? How long will it take? While you're at it, can you implement kinesis/lambda/microservices for the parts of our app that are taking more than 3000ms?"

Me: "I just said I was interested in DevOps and would like to give it a shot, I don't know how long it'll take, I'm working on it."

Great point. Also, the pipeline you mention removes many risks related to the 'common' mistakes mentioned in the article (scaling, monitoring, provisioning: it's all done for you, or almost).

We're using gateway -> lambda -> dynamodb, and there are tons of gotchas and small things that AWS needs to iron out, especially with gateway -> lambda.

Agreed. Was looking into that for a serverless model inspired by (1) but there are still loads of things they can't do/won't do.

1 https://github.com/serverless/serverless

Working on it :) We are hacking away full time on the Serverless Framework and have some big features coming in the next few days.

acloud.guru has some pretty decent video tutorials. It was easier for me to start with those tutorials before diving into the AWS documentation.

I have had some good experiences with qwiklab. Their self-paced labs start with the UI, then move on to the AWS CLI. They also give a nice lesson on CF.

CloudAcademy provides several good ones.

> There is no reason - beside manually managed infrastructure - to not decrease the instance size (number of machines or c3.xlarge to c3.large) if you realize that your EC2 instances are underutilized.

CPU/Memory aren't the only measures of underutilization. If you require high instantaneous bandwidth throughput, then the networking capacity available to your instance roughly increases with the size of your instance. This includes both EBS as well as other Network traffic.

Table with Low/Medium/High: https://aws.amazon.com/ec2/instance-types/#instance-type-mat...

Example benchmark with c3 instances: http://blog.flux7.com/blogs/benchmarks/benchmarking-network-...

If you're more concerned with just EBS network throughput, check out the table on this page instead: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-ec2-...

This is super important - I've run a number of 'oversized' instances just to get better networking. AFAIK none of the cloud providers offer a way to create a 'network' optimized instance, though containerized deployments should help with underutilizing the rest of the instance.

This was a big issue for me too. Google Compute Engine seems to have no throttling like AWS does, or at least it's a much higher cap. On an n1-standard-1 I routinely get 250+ MB/s, and near 1 GB/s for n1-standard-4 and larger. The only high bandwidth AWS instances I've found are the few expensive ones that spec 10 Gbit.

According to this page, you can get ~14Gbit networking on Google Cloud's n1-standard-8


I'd like to add, if using Elastic Beanstalk, don't directly attach an RDS instance when creating the environment. If you do, you won't be able to destroy your environment without also deleting the RDS instance. Instead, create the RDS instance separately, and just add the proper security group for the environment to be able to access the host. Then you can easily create a new eb environment with any config changes (there are some config changes you can't make to an eb environment without creating a new one from scratch) and then connect to your existing db.

Is there a site that collects these sorts of tips? Seems like there are a lot of potential pitfalls that can be hard to find in docs like: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/java-r...

interesting. i was under the impression you could just create a snapshot (before destroying the environment) and then restore as needed. am i missing something?

Edit: http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_R...

If you can take the database instance offline and if you do not need to access it from outside of the beanstalk app then that is fine.

Another, somewhat obvious one:

Be extremely careful when using public customized AMIs. A lot of times ~/.ssh/authorized_keys contains public keys, and this is obviously a huge security problem.

Yeah, I almost feel like AWS should clear the contents of authorized_keys for each user when you make an AMI public. That of course is bound to break some things, but prevents security oversights.

I would be more worried about a backdoored sshd in a public ami. At least authorized_keys is a known place to check.
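
Since authorized_keys is a known place to check, a quick audit is easy to script. A minimal sketch, assuming keys live in the default .ssh/authorized_keys (or authorized_keys2) location under each home directory. The function name is made up, and it will not catch a custom AuthorizedKeysFile path in sshd_config, let alone a backdoored sshd.

```python
import os


def find_authorized_keys(root="/"):
    """Return {path: [key entries]} for every authorized_keys file under root.

    Only files named authorized_keys or authorized_keys2 inside a .ssh
    directory are considered; blank lines and comment lines are skipped.
    """
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != ".ssh":
            continue
        for name in filenames:
            if name not in ("authorized_keys", "authorized_keys2"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    entries = [
                        line.strip()
                        for line in handle
                        if line.strip() and not line.lstrip().startswith("#")
                    ]
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            if entries:
                found[path] = entries
    return found
```

Running find_authorized_keys("/") as root on a freshly launched instance lists every file that grants SSH access; anything beyond the key pair you launched with deserves a look.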

no, public keys are not a security problem.

They are if you don't know who has the corresponding private key.

oh right when using public AMIs. I thought parent meant publishing AMIs. derp.

Public keys on your host where you don't control who has the private key are.

But the fact they allow access to your system is.

I like CloudFormation. Unfortunately it is very unwieldy to write CloudFormation templates directly, and we're not about to start using the AWS CFN GUI editor!

It seems like the assembly of the AWS ecosystem.

Does anyone else have a favourite hammer for this particular nail? I'd love to have something better than our home-baked solution, but I'm yet to find anything which doesn't introduce other flaws, such as an incomplete implementation (missing parameters or resource types) or ultimately making a leaky abstraction on top of CloudFormation somehow.

I ended up brewing a reasonably straightforward solution using Python as a (minimal) DSL which emits JSON. Its primary purpose is to support the whole of the CFN ecosystem (not just implement some small part of EC2, for instance) while also not trying to be too clever.

It has about 50-100 lines of python which implements helper functions such as ref(), join() and load_user_data(), and not many other things. There is an almost 1-to-1 correspondence between the generated CFN configuration and the python source. As a bonus it checks for a few common mistakes like broken refs or parameters which aren't used.
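
To make the idea concrete, here is a rough sketch of what such a minimal DSL could look like. It is not the commenter's code: ref() and join() mirror the helper names mentioned above, user_data() is a simplified stand-in for load_user_data() that takes the script directly instead of reading a file, and check() plus the sample template (including the deliberately undefined AmiId) are invented for illustration.

```python
import json


def ref(name):
    """CloudFormation's Ref intrinsic: points at a parameter or resource."""
    return {"Ref": name}


def join(delimiter, parts):
    """CloudFormation's Fn::Join intrinsic."""
    return {"Fn::Join": [delimiter, list(parts)]}


def user_data(script):
    """Wrap a script (a string or a join()) in Fn::Base64 for EC2 user data."""
    return {"Fn::Base64": script}


def _collect_refs(node, out):
    """Recursively gather every name used in a {"Ref": name} node."""
    if isinstance(node, dict):
        if set(node) == {"Ref"} and isinstance(node["Ref"], str):
            out.add(node["Ref"])
        for value in node.values():
            _collect_refs(value, out)
    elif isinstance(node, list):
        for item in node:
            _collect_refs(item, out)
    return out


def check(template):
    """Return a list of problems: broken refs and unused parameters."""
    params = set(template.get("Parameters", {}))
    resources = set(template.get("Resources", {}))
    used = _collect_refs(template, set())
    problems = []
    for name in sorted(used - params - resources):
        if not name.startswith("AWS::"):  # pseudo parameters such as AWS::Region
            problems.append(f"broken ref: {name}")
    for name in sorted(params - used):
        problems.append(f"unused parameter: {name}")
    return problems


template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "InstanceType": {"Type": "String", "Default": "t2.micro"},
        "KeyName": {"Type": "String"},  # deliberately never referenced
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": ref("InstanceType"),
                "ImageId": ref("AmiId"),  # deliberately undefined
                "UserData": user_data(
                    join("", ["#!/bin/sh\n", "echo ", ref("InstanceType"), "\n"])
                ),
            },
        }
    },
}

print(check(template))  # ['broken ref: AmiId', 'unused parameter: KeyName']
print(json.dumps(template, indent=2))
```

The near 1-to-1 mapping is the point: the Python is just the JSON with a little less punctuation. This sketch only follows Ref; a fuller checker would also need to look at Fn::GetAtt and DependsOn.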

I have heard that similar solutions have been reinvented in a few places, including the BBC. But I'm yet to see a good public solution!

A big problem of many tools out there, and of CloudFormation in general, is that validation is a mess. And the bigger the template, the bigger the problem, even with existing tools.

There are validation problems even within specific tools: Just look at the RDS setup alone: A ton of options that are often mutually exclusive. It's brutal. And don't get me started with security groups.

At Monsanto, we built our own toolset, and open sourced it. It is all Scala, so it might be a bit of a learning curve for many folks, but there's an actual attempt in there at making sure that if you can write it, it is valid: the tool blows up before the template gets to AWS, instead of AWS realizing something went wrong and having to roll everything back.


If you'd consider something other than CloudFormation, there is also Hashicorp's Terraform. It has an AWS provider (https://terraform.io/docs/providers/aws/index.html) which creates resources and maintains the state in a file that you can store in version control (https://terraform.io/docs/state/index.html).

Terraform, as an idea, is brilliant. Mitchell and company isolated a hugely important need and tried to fill it, and I give them all the credit in the world for that. Cross-platform cloud provisioning? Gimme. But I cannot in good conscience not relate what a disastrous experience Terraform has been for me at both jobs and clients.

Writing reusable code in Terraform is an exercise in frustration due to the extreme clumsiness of HCL (which, I understand, was used because "YAML is complicated"--well, that's true, but YAML isn't a good solution either, you're HashiCorp, you wrote Vagrant, you already know how to do this!). The application architecture is reckless and full of race conditions; your state will be hosed if one resource errors out at the wrong time, while other resources are being successfully updated--the resources that return successfully after the failed resource will on many occasions fail to be persisted to state. What's more, application testing seems to be at best an afterthought: there have been regressions in the providers that will break your existing states.

I would under no circumstances use Terraform if I didn't have clients who had selected it before I was working with them. If in AWS, I would use CloudFormation, with a tool like Cfer[1] (which is excellent, reliable code) or SparkleFormation[2] (which is more full-featured but I hope you never need to debug it) to provision my stuff.

(Full disclosure: I'm building a much, much better provisioner for multi-provider cloud infrastructure. Neither of the projects I recommend are mine; mine's not done yet.)

[1] - https://github.com/seanedwards/cfer

[2] - http://www.sparkleformation.io/

If you're writing your own, you might also look to BOSH[1] for inspiration.

It's older than CloudFormation and Terraform (born 2010). It can manage anything that someone's written a driver for. So far that includes AWS, Azure, vSphere/vCloud, OpenStack, VirtualBox, Google Compute Engine, Apache CloudStack and there might be others I missed.

It stores state in a database. It is able to recover from mismatches between the state of the world and the desired state. Cloud Foundry users have been using it for years to deploy and update CF installations. Pivotal Web Services (I work for Pivotal, in a different division) has been upgrading to the most recent CF release every few weeks, live, without much fuss, for years.

For any kind of heavily stateful infrastructure, BOSH is a strong candidate.

[1] https://bosh.io/

Augh, how did bosh slip my mind? I've never used it in production, but I've used it to roll out a CF environment for testing and was impressed to dig into it a little more (most of a year ago now, I think your mention of CF was actually what kicked that off). From (admittedly limited) experience I'm not crazy about its developer-facing feel, but I appreciate the significant and responsible effort in it.

> The application architecture is reckless and full of race conditions

Honestly curious, can you point to one or two?

I don't have a toy example offhand, but the resource failure case I mentioned is one. If resources A, B, and C are in flight at the same time and A fails, Terraform in some as-yet-undiagnosed circumstances will not record state changes caused by the still-in-progress work on B and C. This happens a lot with SNS queues, IIRC, because SNS queue operations on the AWS API take a relatively long time to resolve. So if, say, you mistyped an attribute for an EC2 instance, it can fail out and Terraform will happily forget that it created an SNS queue for you.

I have a sneaking hunch that the continuing problems with template-file resources (complaining that the "rendered" attribute doesn't exist in dependent resources) are related to this, but can't prove that; my clients don't pay me to debug Terraform, but to get their stuff working, and that doesn't leave much time to get in-depth with it now that I've decided not to use it for my own purposes anymore.

> Full disclosure: I'm building a much, much better provisioner for multi-provider cloud infrastructure. Neither of the projects I recommend are mine; mine's not done yet.

One of the convenient things about software that doesn't exist is that it doesn't have any bugs.

Let your software speak for itself when it exists; until then, this seems an undeserved critique of software, and a team, that is solving problems every day.

This is a crazy sentiment. I was just considering checking out Terraform, and I'm really glad to have read the previous commenter's experience.

I've been using Terraform for over a year, maintaining a standard 3 AZ load balanced production cluster.

HCL has improved dramatically, and now that template strings are a thing, most of my variable interpolation issues are solved. However, you still can't specify lists as input variables, so you frequently have to resort to joining and splitting strings. It's hackish and, worse, changing one value in the list will invalidate all other resources that use the variable.

Race conditions and dependency cycles are still a problem. Particularly with auto scaling groups and launch configurations -- I have to migrate them in two steps (create then destroy) to avoid a conflict. Same with EBS volumes, I ended up scripting my instance to attach the volume by itself, otherwise there's ordering issues when destroying and replacing.

There are also missing features, such as the ability to create elasticache redis clusters and cloudformation resources.

I'm still glad that I went with Terraform though. It takes a good amount of time to get around the limitations and bugs, which can be really frustrating, but when it works, it works beautifully.

Strongly agree. Support for new AWS features hits Terraform much faster than CloudFormation (still waiting for CF support for AWS's managed ElasticSearch service that was unveiled two months ago--Terraform got it right away). Some of the critiques below are true... HCL is fine, but Terraform's interpolation syntax has a long way to go. That said, CF's JSON is way more painful to deal with. As for the other problems, they go back mainly to someone using a tool they don't fully understand. Yes you can get into some odd states in rare cases, but Terraform gives you the ability to rapidly build and tear down your infrastructure over and over if necessary to work out details and you have fine-grained control over which pieces are built how. Not only that but you can inspect the logs to see what's happening and if there are bugs, you can fix them yourself because the tool is open source and free. CloudFormation gives zero visibility, no fine-grained control, and it's completely opaque and where it's broken, you can't fix it.

Terraform is relatively new and improving rapidly. It has its problems, but it's light-years beyond CloudFormation. It's clear that Amazon doesn't place a high priority on making CloudFormation easy to use, or to support new features. The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.

> HCL is fine, but Terraform's interpolation syntax has a long way to go

Oh, HCL is fine, you say so authoritatively? Well then do me a solid and show me an if statement, show me a for loop. Because you're not building nontrivial, reusable infrastructural modules without logic. I know. I've tried. I've committed, between different projects and clients, somewhere around ten thousand lines of Terraform and probably half are copy-paste garbage because HCL is so crippled a tool.

It hurts me to say this at a deep and visceral level: Terraform's interpolation syntax makes freaking Ansible and its "no, really, it's totally cool, string templates for logic are awesome" look good.

> The right approach to any problems with Terraform is not to spread FUD about it like below, but to contribute code fixes.

Spread FUD? Oh, no no no, you can take your assertions of FUD and insert them somewhere uncomfortable, thank you very much. I wrote Terraframe[1] specifically to contribute back to the Terraform community, to make it better, and stopped (to create a different project) because I was stymied. By no documentation, by HCL <-> JSON not actually working, and by no interest from the developers in any sort of dialogue about actually fulfilling the promises they themselves assert for their software. Between this and bugs that a trivial testing framework should catch (Why are you validating AWS resource names differently between point releases? Why are you changing that validation to be wrong? Why are you breaking my existing states when you've done this? Why did your tests not catch this before you pushed this out to your entire userbase?) I cannot take the project seriously as a tool for being used in infrastructure I care about. Because I don't trust them to take Terraform seriously, either.

[1] - https://github.com/eropple/terraframe

I believe you can hack an if statement by doing a length, substring and equality comparison to make it equivalent.

I'm a big Terraform fan, but I really don't like HCL and its limitations. I ended up writing a PHP "SDK" of sorts that generates JSON that Terraform consumes [1]. It uses the AWS SDK for some things (like listing all available AZs in a VPC), and provides some macros. I made this for use at work, and it powers a few production sites for a large company.

There's still a lot to do to make it ideal for public consumption (like writing docs and freezing the API), but it'll get there sometime soon. PRs are most welcome.

[1] https://github.com/ameir/terraform-php

I second the Terraform suggestion...my team loves it. But we've found storing state in version control to be clunky. Storing state remotely in Consul has been less problematic for us, though S3 would also work for those that don't have a running Consul cluster.

What I love most about Terraform is that we can include the output of terraform plan in pull requests that make infrastructure changes. Then our continuous deployment process runs plan again and requires an identical output before running apply. This both makes it easier for team members to review changes but also ensures that we don't accidentally destroy infrastructure, which is really easy to do with a lot of these infrastructure-as-code tools.
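
The "identical output before apply" gate can be sketched roughly like this. It is an illustration of the idea, not our actual tooling: the function names are made up, the sample plan lines are only stand-ins for real terraform plan output, and how the plan text gets captured in CI is left out.

```python
import hashlib


def plan_fingerprint(plan_text):
    """Hash a plan's text, ignoring trailing whitespace and surrounding blank lines."""
    lines = [line.rstrip() for line in plan_text.strip().splitlines()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def safe_to_apply(reviewed_plan, fresh_plan):
    """True only if the plan produced at deploy time matches the reviewed one."""
    return plan_fingerprint(reviewed_plan) == plan_fingerprint(fresh_plan)


# Stand-in plan text, for illustration only.
reviewed = '+ aws_instance.web\n    instance_type: "t2.micro"\n'
fresh_same = '+ aws_instance.web   \n    instance_type: "t2.micro"\n\n'
fresh_drifted = "- aws_instance.web\n"

print(safe_to_apply(reviewed, fresh_same))     # True
print(safe_to_apply(reviewed, fresh_drifted))  # False
```

If the fingerprints differ, something changed between review and deploy (the code, the state, or the real infrastructure), and the safe move is to stop and send the new plan back through review.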

The other thing that Terraform has going for it over CloudFormation is for hybrid cloud deployments, since it can provision infrastructure in vSphere and OpenStack as well as AWS.

Can you go into how you're using consul with terraform?

We're using Consul to store the state remotely (see: https://terraform.io/docs/commands/remote-config.html). In a nutshell, it just stores the JSON it would have stored in the tfstate file in a key in Consul instead. In addition to being easily available in a shared location, this allows you to leverage Consul's features (ACLs, watches, etc) to improve the process of making infrastructure changes.

Stuff we've thought of but haven't gotten around to yet:

- Build relatively simple tooling around terraform and Consul to acquire a lock before running apply...we haven't gone to that length yet since only our continuous deployment environment has credentials to mutate production and it runs builds of the infrastructure project sequentially.
- Watching the Consul key where the tfstate is stored for changes to kick off sanity checks to ensure that everything is still healthy.

They're both so flexible that there's probably other ways in which they'd work well together that we haven't thought of yet.

https://github.com/cloudtools/troposphere is a good option here

Also https://github.com/russellballestrini/botoform looks to be a newer solution in this space

Terraform is another option and then there's the model we're actively moving towards at work: using Ansible to abstract and completely replace calls to CloudFormation with a combination of existing and bespoke modules to dynamically spin up the infrastructure we need.

Troposphere (https://github.com/cloudtools/troposphere) is a mature Python CloudFormation solution that sounds similar to your home brewed one.

Yes.. It is mature and very active too. AWS keeps on adding services and also makes them available on CF. The Troposphere community is very quick in implementing them..

CloudFormation is not very elegant. Things become complicated by the concept of regional and zonal resources (the same AMI is represented by a different id in each region, etc). Try Google Cloud's Deployment Manager. It is functionally similar, but far easier (everything is a global resource), with Jinja-based templates (you get to write for's and if's, and evaluate lists and dictionaries inside templates).

I am asking this question often.

We are using very advanced CloudFormation in the open source Convox platform.


I have touched every corner of CF including lots of Custom Resources.

Right now we are using the golang template tools and tests to generate our templates.

But I have lots of needs and ideas for improving this. A CF template compiler and simulator should be possible, giving us all tons of confidence in making template changes and therefore any infrastructure update.

I have some sketches that I haven't published yet.

And I strongly believe CF is the best tool in this space if you're all in on AWS. Let Amazon be responsible for operating a transactional infrastructure mutation service. It's ridiculously hard to do this right.

If you want to brainstorm some ideas send me a message :)

One of the things that I thought would be neat is an open source CloudFormation that could work for multiple cloud vendors, possibly using a driver pattern.

Also, you might want to update your HN profile with contact info.

Thanks, I updated my profile with contact info.

Terraform is awesome if you want an OSS project to manage multiple cloud vendors.

But I think that infrastructure change management is a really hard problem and the state of the art solution is how AWS runs CloudFormation as a managed service.

Once it's set up properly, it's amazing watching what CloudFormation can do. It can execute updating 20 instances to roll out a new AMI, and then roll the whole operation back on demand or if a failure happens. All with no application downtime in the cluster!

I built a JSON templating system[0] (based on Handlebars) that is project based. Basically, you create new JSON files for each project and the system generates a CFN template that contains everything that project needs (EC2 instances, RDS instances, security groups, etc.) It's still a work in progress but I'm using it in production in my day job and I'm pretty happy with how it works. More help is always accepted. :)

[0] https://github.com/rnhurt/CFNBuilder

Create the infrastructure manually and then use the CloudFormer tool to generate the template based on the already created infrastructure. You can then edit the generated template to make it more maintainable and you have a nice reusable template.

Some people like SparkleFormation (http://www.sparkleformation.io/). I'll warn you, though, that it's not a good example of how to program in Ruby. It abuses method_missing to the point that it makes your implementations difficult to debug.

I also don't like their made-up terms such as "dynamics", etc. The documentation is pretty confusing as well.

Using a DSL is tempting. I've found the AWS CLI best, and a lot of the time I think it's easier just to write a Ruby script using the SDK.

This obviously doesn't necessarily handle teardown very well, and it tends to be copying boilerplate and modifying it, but I find it the most straightforward thing, and simple, if a little verbose.

There's already a python module called troposphere.

Also, don't use Dynamo unless you have a really good reason. "It scales better than MySQL" is not a good reason.

Have fun migrating data and re-indexing constantly!

Are local secondary indices and lower operational costs than Cassandra good reasons?

Operational costs are in the eye of the beholder. We evaluated DynamoDB for a new use case but we're going with Cassandra because the storage costs alone more than made up for ops overhead. At least IMO, Cassandra's support for multicolumn range queries (via clustering columns) and the cheaper storage you can use obviate local secondary indices. (Cassandra has secondary indices as well but it seems like most folks prefer further denormalization--at least, that's what we're doing.)

Not without considering the throttling you will hit if you exceed provisioned IOPS. You really want to consider all the angles on dynamoDB before committing to it; limitations, unique benefits, interface, etc.

What about as a kind of Redis stand-in? I'm curious about the DynamoDB PHP session driver.

It sounds like you have a lot of experience with Dynamo. All the use-cases I seem to keep coming up with are more for storing global environment keys outside of the environment itself and using a large number of IAM roles to access individual keys, and as a kind of throwaway 'I need to store this somewhere, but it doesn't really fit in the main DB, and I still need it to persist for at least awhile' case.

Not the grandparent, but I have a lot of experience with DynamoDB as well. I migrated a self-managed, sharded redis store to DDB close to a year ago, and what it really depends on are your read and write patterns, and whether or not you can live with the opacity of what DDB is doing behind the scenes.

An example: if you provision X capacity units, you're actually provisioning X/N capacity units per partition. AWS is mostly transparent (via documentation) about when a table splits into a new partition (I say "mostly" because I was told by AWS that the numbers in the docs aren't quite right), but you'll have no idea how the keys are hashed between partitions, and you won't know if you have a hot partition that's getting slammed because your keys aren't well-distributed. Well, no, I take that back -- you will know, because you'll get throttled even though you're not consuming anywhere near the capacity you've provisioned. You just won't know how to fix it without a lot of trial and error (which isn't great in a production system). If your r/w usage isn't consistent, you'll either have periods of throttling, or you'll have to over-provision and waste money. There's no auto-scaling like you can do with EC2.
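
A toy calculation shows how the X/N split leads to throttling well below the provisioned total. This is a simplified model with made-up numbers and partition labels (in reality you can't see the partitions at all), and it ignores any burst allowance.

```python
def per_partition_capacity(provisioned_units, partitions):
    """Provisioned throughput is split evenly: X units over N partitions is X/N each."""
    return provisioned_units / partitions


def throttled_partitions(provisioned_units, traffic_by_partition):
    """Return the partitions whose traffic exceeds their even share.

    traffic_by_partition maps a partition label to consumed units per second.
    """
    share = per_partition_capacity(provisioned_units, len(traffic_by_partition))
    return {
        name: used
        for name, used in traffic_by_partition.items()
        if used > share
    }


# Hypothetical table: 1000 provisioned units over 4 partitions = 250 each.
traffic = {"p0": 400, "p1": 50, "p2": 50, "p3": 50}  # 550 total, 55% of provisioned
print(throttled_partitions(1000, traffic))  # {'p0': 400}
```

The table as a whole is using just over half of what is being paid for, yet the hot partition is 150 units over its share and gets throttled.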

Not trying to knock the product: still using DDB... but getting it to a point where I felt reasonably confident about it took way longer than managing my own data store... and then they had a 6-hour outage one early, early Sunday morning a couple months ago. Possible solution: dual-writes to two AWS regions at once and the ability to auto-pivot reads to the backup region. Which of course doubles provisioning costs.

Ok, maybe I'm knocking it a little. It's a good product, but there are definitely tradeoffs.

You should put some effort into designing the key schema, especially the hash key. Don't use a timestamp or a sequence. I had to use an 'encrypted' key schema on my sequence IDs so sequential records are spread into different partitions. Actually it worked out well, as I can expose these encrypted keys externally too.

Agreed that DDB documentation could be better on the best practices. I found this deep dive youtube video very helpful: https://www.youtube.com/watch?v=VuKu23oZp9Q
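
One way to get a reversible, scattered key out of a sequence ID is sketched below. To be clear, this is a swapped-in technique and not the scheme described above: it multiplies by an odd constant modulo 2**64, which is a bijection but only obfuscation, not encryption. The constant is arbitrary, the function names are made up, and pow() with a negative exponent needs Python 3.8+.

```python
MASK = (1 << 64) - 1
# Any odd multiplier is invertible modulo 2**64; this constant is arbitrary.
MULTIPLIER = 0x9E3779B97F4A7C15
INVERSE = pow(MULTIPLIER, -1, 1 << 64)  # modular inverse


def scramble(sequence_id):
    """Map a sequential ID to a scattered 64-bit value, as a 16-char hex key."""
    return format((sequence_id * MULTIPLIER) & MASK, "016x")


def unscramble(key):
    """Recover the original sequence ID from a scrambled key."""
    return (int(key, 16) * INVERSE) & MASK


for n in (1, 2, 3):
    key = scramble(n)
    print(n, key, unscramble(key))
```

Consecutive IDs end up with very different leading digits, so the keys no longer share a common prefix. Anyone who knows the multiplier can undo the mapping, so if the keys are exposed externally and must not be guessable, a real keyed cipher is the better choice.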

Oh, absolutely. The problem is that your key schema also ties you to what kind of queries you can do efficiently. My primary keys actually are very well-distributed across the hash space, but in my initial implementation, one of the global secondary indexes had a key that caused throttling and absolutely destroyed performance. Dropping that index meant losing the ability to do an entire class of queries. In this case, we were (sorta) able to live with that, but I can imagine many cases where that's a problem.

That's actually another instance of the lack of transparency and trial-and-error: "oh hey, writes are getting throttled... no idea why... let's drop this index and see if it helps".

Was the ElastiCache service around when you migrated Redis to Dynamo? This is another alternative I'm considering.

The problem with ElastiCache -- and why I rejected it as an option -- is that they make you define a 30-minute "maintenance window" where AWS can apply patches and reboot your cache instances. In practice I've heard that this happens rarely, and when it does, the downtime is short, but it can in theory cause longer outages.

And in the case of both the redis and memcached backends, if the maintenance requires restarting redis/memcached or rebooting the instance, you lose all data in the cache (at least up until your last backup). For this particular project, that amount of downtime would easily cause a real outage for customers, and was unacceptable.

I use Dynamo because I'm lazy. Need some JSON fast? 3 lines and no worries.

        // AWS SDK for Java document API; `id` is the item's hash key value.
        DynamoDB db = new DynamoDB(new AmazonDynamoDBClient());
        Table myTable = db.getTable("table");
        Item myItem = myTable.getItem("id", id);

Sure, I don't use it for a lot of things, but for cross-VPC / cross-account configuration, it's pretty nice.

Can you explain these comments? I am currently developing a system using Dynamo as the database. I need a JSON-like structure, as the documents need to be flexible: different documents will contain different keys. So I am interested to hear why you think Dynamo is a bad choice.

Can you expand on your last statement?

Beginner here, how do you keep the database available to multiple instances without Dynamo?

Use RDS.

My huge recommendation is to put production instances in a completely separate region from development and staging instances. I actually just discovered that you can limit IAM API keys to a specific region; you just need to create a custom policy. The following policy is an example:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SOME-ID-HERE",
            "Effect": "Allow",
            "Action": [
                ...
            ],
            "Condition": {
                "StringEquals": {
                    "ec2:Region": "us-west-2"
                }
            },
            "Resource": [
                ...
            ]
        }
    ]
}

Why would you do this instead of using multiple AWS accounts? Different regions have different feature sets (available instance types, beta eligibility, etc.). I strongly recommend using multiple AWS accounts instead and keeping them in the same region.

That said, since you should be using an infrastructure provisioning tool like CloudFormation, the tagging solution should not be a particularly big obstacle.

yes, use multiple accounts. you can use STS to grant permissions between the accounts if needed.

STS works, but I feel like multiple accounts linked is more of a hack than using a single AWS account. Of course, if you use a service that is not available in both regions, use STS.

How is multiple accounts a hack? It's the correct solution to isolation. Using regions is just stupid because now you've attached extra meaning to regions and can't bring up production stuff in other regions for better latency.

Alternatively, you can run two AWS accounts and get a single bill with 'combined billing'. One account is prod and the other is test.


If you segment production instances from develop/staging you can use IAM rules to grant specific privileges based on the entire region, instead of by tag (which is fragile). Additionally, it is less error prone when you are making manual changes in the console as it requires switching regions between production and develop/staging.

The biggest mistake I've seen with AWS (and committed myself), is not reading the manuals for the services you're using. While some people complain about the AWS manuals not being complete, there is still a lot of good information in there that you might miss if you're just clicking through the console.

I wonder if AWS is a good location to run a VoIP server. VoIP is a real time application that is very prone to jitter, latency and packet loss. I'm concerned about "noisy neighbors" and decreased network performance at AWS.

Does anybody have experience with running a VoIP server (e.g. Asterisk) on AWS?

If you're willing to pay, then you can avoid the noisy neighbor problem almost completely by using c3.8xlarge or c4.10xlarge instances. Those instances get their own dedicated 10G NIC. You can use reserved instances to reduce the cost.

But even with smaller instances, AWS network performance is going to be about the same as any VPS provider. The only way to get guaranteed performance is to do Colo, which is expensive.

I sell an application built around VoIP. I dislike running my own servers, so just lately I moved them to AWS.

At first, I used instances that were not very performant, causing some problems. I quickly moved to more powerful instances and since then there have been no problems at all. There is a slight delay, which is normal, since the servers are not in the same country as the users anymore, but the users haven't noticed it at all.

I can't say how well it scales, though. I have a small user-base, so scaling has not been a problem yet. Network performance might become a problem if you have a huge user-base.

How can you be sure that your users haven't noticed?

Twilio runs on AWS.

Check out voicehub.com, their entire stack is on ec2.

I second the recommendation to use CloudFormation + packer + ELBs + auto scaling groups for web applications whenever possible, it just makes everything so easy and automatic. Of course, there's a learning curve and you pay a premium for all that automation, but in my experience it has been usually worth it so far.

I like going one step further and making deployments and AWS resources into reusable code using Troposphere: https://github.com/cloudtools/troposphere
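
The idea of treating AWS resources as reusable code can be illustrated without any library. The sketch below is plain Python that emits a CloudFormation template as JSON; it is not Troposphere's API, and the helper names, the logical ID `WebServer`, and the AMI ID are hypothetical.

```python
import json


def ec2_instance(image_id, instance_type="t2.micro"):
    """Return a CloudFormation resource definition for one EC2 instance."""
    return {
        "Type": "AWS::EC2::Instance",
        "Properties": {"ImageId": image_id, "InstanceType": instance_type},
    }


def build_template(resources, description=""):
    """Wrap a dict of logical ID -> resource into a CloudFormation template."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": description,
        "Resources": resources,
    }


if __name__ == "__main__":
    # "ami-12345678" is a placeholder, not a real image ID.
    template = build_template(
        {"WebServer": ec2_instance("ami-12345678")},
        description="Single web server (illustrative)",
    )
    print(json.dumps(template, indent=2))
```

Because resources are ordinary function calls, the same helper can be reused across stacks, which is the main appeal of generating templates from code.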

I wonder when Amazon is going to invest their ungodly earnings into UX. Their UI is absolutely terrible and I shudder every time I have to log into it. Is it purposely built that way to confuse people who don't belong in there?

I've heard this a few times -- AWS is a very complex product that gives you a lot of freedom in how you build your product using it. I think the web UI works about as well as it could given that complexity.

If you want something simpler, you have Heroku and a bunch of similar things which make a bunch of decisions for you -- but you don't have the flexibility there that you do with AWS of course.

There has to be some middle ground between AWS and Heroku though.

There are some options, but the point is, what you're trading for simpler UX is a simpler overall product with less flexibility.

It's called Elastic Beanstalk! As its name suggests, you can start as easily as with Heroku, and adapt it later with whatever additional AWS infrastructure you may need.

Use OpsWorks if possible. It's free and provides a simple interface that allows you to deploy/upgrade your apps automatically and monitors your instances automatically using CloudWatch.

+1. AWS has done an amazing job with OpsWorks. It's free Chef 12. What's not to love? (The only drawback I've noticed so far is some standard Chef stuff--vault in particular--not working. Other than that, very pleased.)

Check out http://convox.com/ (YC S15) for a way to avoid manual infrastructure.

Convox co-founder here. Thanks for the shout out!

That's exactly right, avoid manual infrastructure.

Someday we will all have something like Rails for infrastructure. Strong conventions around best practices.

If you follow these conventions you can avoid bespoke or manual configurations and focus solely on your app logic.

We're building Convox to advance this goal.


+1. I concentrated on non-security-related mistakes. Security will follow next week... :)

Great points siddharth_mal

So I'd modify these a bit. We run a very large AWS infrastructure as an engineering team (no dedicated ops).

1. Use CloudFormation only for infrastructure that largely doesn't change, like VPCs, subnets, internet gateways etc. Do not use it for your instances, databases etc. I can't recommend that enough; you'll get into a place where updating them is risky. We have a regional migration (like database migrations) that runs in each region we deploy to and sets up ASG, RDS etc. It gives us control over how things change, for example if we need to change a launch configuration.

2. Use auto-scaling groups in your stateless front ends that don't have really bursty loads, it isn't responsive enough for really sharp spikes (though not much is). Otherwise do your own cluster management if you can (though you should probably default to autoscaling if you can't make a strong case not to use it).

3. Use different accounts for dev / qa / prod etc. Not just different regions. Force yourself to put in the correct automation to bootstrap yourself into a new account / region (we run in 5 regions in prod, and 3 in qa, and having automation is a lifesaver).

4. Don't use IP addresses for things if you can help it; just create a private hosted zone in Route53 and map it that way.

5. Use instance roles, and in dev force devs to put their credentials in a place where they get picked up by the provider chain. Don't get into a place where you are copying creds everywhere; assume they'll get picked up from the environment.

6. Don't use DynamoDB (or any non-relational store) until you have to (even though it is great). RDS is a great service and you should stick with it as long as you can (you can make it scale a long way with the correct architecture, and bumping instance sizes is easy). IMO a relational store is more flexible than others since you (at least with Postgres) get transactional guarantees on DDL operations, which makes it easier to build in correct migration logic.

6. If you are using cloudformation, use troposphere: https://github.com/cloudtools/troposphere

7. Understand which instances need internet access and which ones don't, so you can either give them public IPs or put in a NAT. Sometimes security teams get grumpy (for good reason) when you open up machines to the internet that don't need to be, even if it's just outbound.

8. Set up ELB logging, and pay attention to CloudTrail.

9. We use CloudWatch Logs. It has its warts (and it's a bit expensive), but it's better than a lot of the infrastructure you see out there (we don't generally index our logs, we just need them to be viewable in a browser and exportable for grep). It's also easy to get started with; just make sure your date formats are correct.

10. By default, stripe yourself across AZs if possible (and it's almost always possible). Don't leave it for later; take the pain up front, you'll be happy about it later.

11. Don't try to be multi-region at first if you can avoid it; just replicate your infrastructure into different regions (other than users / accounts etc.). People get hung up on being able to flip back and forth between regions, and it's usually not necessary.

edit: Track everything in cloudwatch, everything.
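
As a small illustration of point 10 (striping across AZs), here is a sketch of spreading a fixed number of instances as evenly as possible over a list of zones. The function name and the zone names are assumptions for the example; an Auto Scaling group that spans several AZs balances instances for you, so this only shows the idea.

```python
def stripe_across_azs(instance_count, zones):
    """Return {zone: count}, spreading instances as evenly as possible."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    base, extra = divmod(instance_count, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    print(stripe_across_azs(7, ["us-west-2a", "us-west-2b", "us-west-2c"]))
```

With this layout, losing a single zone removes at most one instance more than an even share.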

Have you looked at Terraform? I have everything defined (unless it doesn't work very well... which still happens as it's still under heavy development), and if it needs to be dealt with care (e.g.: core EC2 instances) I'm slowly filtering it into a separate set of units/variables, and setting the "static" infrastructure (VPC) as a downstream group, and slurping the statefile upstream as to not potentially damage anything when playing nice with deploying/redeploying EC2 instances.

DynamoDB is way cheaper than RDS. It's fine for small apps.

I work in a team that uses quite a lot of AWS for various reasons beyond cost. We also use various other clouds, depending on the needs of a specific project or situation. Our experience with cloud is that it is a journey. Obviously, coupled with DevOps tooling, it has allowed us to deal with environment requirements at the speed of software, i.e. Infrastructure as Code and all. One thing we find ourselves doing over and over is changing the infrastructure, either because requirements change, because new cloud services are launched that are useful to us, or simply because we find more efficient ways of running cloud resources (Trusted Advisor helps).

We built a tool called Liquid Sky (https://liquidsky.singtel-labs.com) to help us keep track of the cost impact of the changes we constantly make. I did mention that we use cloud for reasons beyond cost, but we definitely still want to know that we're being sensible and maximising cost efficiency as well; it's just another (important) factor. Because we change our cloud resources so frequently, we didn't want a very rigid process for dealing with cloud cost. Hence, we built Liquid Sky in a way that gives our engineers the freedom to explore better ways of running things on the cloud while keeping cost in check and keeping the team (including cost guardians) in the loop.

Nice article. I have been working at a small company (Beintoo) since September 2015 and we use AWS here. I think it is a very interesting set of products, and you can build any kind of business on it. The advice given in the article is really useful; in fact, I will apply it at my workplace.

About the comparison with other services: I was working at Deloitte Analytics before, managing the cloud services provided by IBM SoftLayer. You cannot compare them. AWS offers many more services, and I was not really satisfied with SoftLayer. For example, I had a problem with a network upgrade they did in January 2014 and lost a lot of data, with poor support to restore it. Also, the starting price of $25 per month is really expensive. AWS is far more mature and interesting.

Then for my own servers I use CloudAtCost because it is cheaper, but if I ran a business I would certainly go with AWS. If you are making money, it is not that expensive, and if you stick with Amazon's advice and philosophy it is very reliable.

I'll add one that was especially common for people coming off the year of free tier a couple years ago. I'm not sure that AWS has changed it yet.

6. Starting a box/instance/database and forgetting it's running until you receive the bill after your free tier expires.

I expected to see something like this generalized in the article as

6. Not setting up billing alarms, http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/...
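
A billing alarm is a CloudWatch alarm on the `EstimatedCharges` metric in the `AWS/Billing` namespace, which is published in us-east-1 once billing alerts are enabled on the account. Below is a hedged sketch that only builds the arguments for boto3's `put_metric_alarm`; the function name, alarm name, threshold, and SNS topic ARN are placeholders, and the actual API call is left commented out so that the snippet runs without credentials.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn, name="monthly-bill-alarm"):
    """Keyword arguments for CloudWatch put_metric_alarm on estimated charges."""
    return {
        "AlarmName": name,  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notified when the alarm fires
    }


if __name__ == "__main__":
    params = billing_alarm_params(
        50, "arn:aws:sns:us-east-1:111111111111:billing-alerts"  # placeholder ARN
    )
    print(params)
    # With boto3 installed and credentials configured, something like:
    #   import boto3
    #   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```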

I warmly recommend the Serverless framework for building basic web applications on AWS. It handles CloudFormation details for you, but lets you customize them if needed. Not suitable for every possible app though.

This might be the wrong place to ask, but I'm curious how people feel Azure stacks up to AWS? The services seem comparable (maybe even nicer), but I'm unclear how it compares on cost.

I don't get all the complaints about AWS prices. I just spun up a 60 node Spark cluster with over 5TB of memory, processed 100B data records, and spent $5.50!

As there's too much push to Terraform, which I personally dislike due to many opinionated features and the marketing push to cover as many services as possible and not do one thing and do it great (AWS), you can look at Bazaarvoice's CloudFormation Ruby DSL [0].

[0]: https://github.com/bazaarvoice/cloudformation-ruby-dsl

Personally, I never quite got the appeal of EC2. You're basically replicating what you would be doing on physical servers anyway. The real productivity gains come from using a fully managed PaaS (Heroku, Azure Web Apps, Google App Engine, OpenShift etc.) where there is no maintenance, and scaling and redundancy are taken care of for you. Granted, those are even more expensive, but they are the only ones that provide any kind of value.

I find IAM particularly difficult to use - I feel like there should be a button to create a user/group that can do only X, Y, Z. I realise policy templates get most of the way there but I still had to go and read the syntax for them because DescribeRegions wasn't in the list I needed.

I'm also not sure how to make the jump from exporting AWS_ACCESS_KEY_ID and having my instances automatically request the permissions they need - STS?

If your code is using the AWS SDK then you don't need to export an access key for your running code. Just create an EC2 Service Role in IAM with the policy you want attached to the role; then launch the instance with that role, to get automatic temporary credentials.

Refs: http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use... http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles...

> I'm also not sure how to make the jump from exporting AWS_ACCESS_KEY_ID and having my instances automatically request the permissions they need - STS?

Check out instance profiles. This feature allows any AWS API-aware application to request credentials on demand, eliminating key management/rotation:


I'd like to add two more:

      1) Not giving out your access and secret keys in scripts/buckets.

      2) Always using IAM roles with your EC2

IAM roles are OK as long as you realize that anything and anybody on that instance gets access to those credentials.

how do you avoid 1 - it seems impossible ?

Take a look at hashicorp vault - https://hashicorp.com/blog/vault.html

IAM roles let you assign temporary credentials to machines running scripts. The machine can then hit an internal AWS URL to get the temporary credentials. Many tools know to look for these credentials by default, e.g. boto checks for credentials in environment variables, config files, and the machine's IAM role.
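
The lookup order described above can be sketched roughly as follows. This is not boto's actual implementation, just an illustration of the idea using the standard library: the function names are made up, it only reads the `default` profile, and it uses the original token-less form of the instance metadata request. The metadata URL is the link-local address where EC2 exposes role credentials.

```python
import configparser
import json
import os
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def from_instance_role(timeout=1):
    """Fetch temporary credentials for the instance's role, or None off EC2."""
    try:
        # The first request lists the role name, the second returns its credentials.
        with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
            role = resp.read().decode().split("\n")[0]
        with urllib.request.urlopen(METADATA_URL + role, timeout=timeout) as resp:
            data = json.loads(resp.read().decode())
        return data["AccessKeyId"], data["SecretAccessKey"], data["Token"]
    except (OSError, KeyError, ValueError):
        return None


def find_credentials(env=None, credentials_file=None, role_lookup=from_instance_role):
    """Return (source, credentials) from the first place that has them."""
    env = os.environ if env is None else env
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "environment", (env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"])

    path = credentials_file or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    parser.read(path)  # silently skips a missing file
    if parser.has_option("default", "aws_access_key_id") and parser.has_option(
        "default", "aws_secret_access_key"
    ):
        return "config file", (
            parser.get("default", "aws_access_key_id"),
            parser.get("default", "aws_secret_access_key"),
        )

    creds = role_lookup()
    return ("instance role", creds) if creds else (None, None)
```

Calling `find_credentials()` with no arguments walks the real environment, then `~/.aws/credentials`, then the metadata service; the parameters exist so that each step can be swapped out when experimenting.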

And there are a few tools that emulate the metadata service locally if you need it on dev laptops, so that a laptop uses a role as if it were a server.
