Why I left Heroku, and notes on my new AWS setup (holovaty.com)
404 points by adrianh 1585 days ago | 217 comments

We're working on solving these sorts of devops problems in Ubuntu with Juju: https://juju.ubuntu.com/

An alternative approach would be for a Juju charm (script) to handle the initial deployment of a stock Ubuntu AMI and the customization in one step (or with puppet/chef), and then let you add new instances as you scale (though currently not automatically). When you have changes to your service, you update the charm and just run `juju upgrade-charm`.

This would consolidate steps 1 and 5 and the "ongoing" work into one tool, and you'd get a cloud-agnostic deployment (OpenStack cloud or bare metal).

I would really love to recommend Juju. I used it for a good couple months attempting to get even just a working Openstack deployment. The trouble is that for the bare-metal functionality you guys mean it to be paired with MaaS. Unfortunately the community I found to exist around this combination was basically zero.

Even a simple question went unanswered the couple times I posted it. Eventually I gave up and just went back to Puppet+Openstack modules for the configuration I was working on at the time.

This was exactly my experience as well. Then I was introduced to Mirantis Fuel which uses Cobbler/Puppet to do bare metal provisioning and remote configuration.

Very cool. I looked at the Fuel stuff as well, but at the time it wasn't quite able to do everything I needed so I ended up rolling my own. The 2.2 update they have coming up soon sounds like it fixes all of the things that were 'wrong' for me though!

FWIW, I tested juju a few months ago and found it to be buggy and unreliable. Sometimes the instances would connect together correctly, and sometimes they would fail inexplicably. Didn't seem ready for any kind of production use to replace config mgmt tools.

If you are looking for another general purpose orchestrator, I'd suggest taking a look at Ansible - http://ansible.cc/

Disclaimer: I am the primary author.

I have tried ansible and love it. I invested the time (a few hours) to get a script working for my stack, and now I have a 70ish line script that will provision a server or VM (with a shared code directory in the VM), from clean installation to the codebase up and running in its production state, in one command.

New project? I just copy the script, change a few variables (names and packages it needs), and I get deployment for free.

Sounds interesting. Any possibility that you could anonymize the script and paste a link to a gist of it? It would be interesting to see some real-world Ansible examples.

Sure: http://pastebin.com/KPrm3Zky

It could use variables a bit more, I think, but there were a few bugs with expanding them, so I didn't use them. I'll fix them later on, though.

amazing. that example explains how it works better than 200 pages of documentation :-)


You're welcome, examples are pretty useful for this sort of thing. Plus, after you have a base to work off, it's trivial to extend.

I'll put this up in a post on my blog, maybe it'll help others as well.

I never thought of Ansible for orchestration particularly, but that sounds interesting! The current orchestration support in saltstack is somewhat limited, so it would be neat to check out.

Are people using Ansible to replace or manage fabric scripts? I'd like to figure out some way to limit the spread of one-off fabric scripts (reminds me too much of bash script proliferation) and it'd be great to get everything under one roof.

1.10 (what's in 13.04) is much more reliable now if you want to give it a shot, here's a PPA: https://launchpad.net/~juju/+archive/devel

Yet another post about scalability/architecture that goes down under load. There should be some rule for that.


EDIT: Back online. The site that was migrated to AWS is not the blog.

Heh -- my blog is still hosted on Heroku, and I forgot to up the dynos. Which I suppose is a nice (and ironic) argument for doing the AWS auto-scaling stuff!

The site I was writing about, soundslice.com, has not seen any blips.

There's no need to increase dynos yourself. Just use the Adept Scale add-on: https://www.adeptscale.com/

Heroku App Error page. Clever way to show why he left Heroku? I'm sure my constant refreshing isn't helping.

It also looks like the Heroku app error page to me

Isn't this guy one of the two best Django people around? Maybe heroku shut him down?

> The way we set up Soundslice is relatively simple. We made a custom AMI with our code/dependencies, then set up an Elastic Load Balancer with auto-scaling rules that instantiate app servers from that AMI based on load.

Doesn't sound that simple to me (as a complete sysadmin noob). Somebody should write a book about this.

Would you be interested in a blog post about it? I'm a sysadmin/devops with 12 years in, and build stuff like this on a daily basis. I wouldn't mind sharing how the sausage is made.

I'm a PHP dev working with Apache on a daily basis, but besides .htaccess and some minor changes I can't really do much. I would love an article which explained in a simple way how to scale your server and really debug problems with it.

Last week, I literally spun down an EC2 instance and signed up for Digital Ocean because I couldn't figure out how the hell to make stuff work on EC2, but have lots of VPS experience. As a dev and not a sysadmin, it's much easier to go with what you know... but I want to learn.

What's the difference between EC2 and a VPS for you? Do your VPSs already have things installed or a GUI? I've used EC2 before as a single server, never scaling. The main difference was installing things that are usually pre-installed (like on Ubuntu's official desktop image). Is that it or is it more about the scaling?

And thanks for the reference to Digital Ocean. Never heard of them before. Seems great, might try using them :)

> What's the difference between EC2 and a VPS for you?

For me it is that restarting an EC2 instance deletes all the local storage. I have had good success just getting a big ass VPS, running the database locally, and pushing text backups to S3. It is trivial to manage, and in the real world, downtime is more likely to be caused by configuration wonkiness than hardware failures.

You also have to have a huge amount of traffic to overwhelm a 24-core / 96GB RAM server. Why not put off managing the complexity until you really are doing 10M page views per day?

If you were having all your data deleted when you restarted your EC2 server, then something was VERY wrong. I'm not an expert, but I've used EC2 a little bit and I think I hit that exact problem.

The thing is that for whatever reason, data wasn't being written to the EBS (the virtual hard disks for use with EC2 instances) and was instead being written to the "ephemeral storage", a really big local data store that every EC2 instance has that is basically a `/tmp` directory. If the server restarts, everything in the ephemeral storage is destroyed.
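One rough way to check where a data directory actually lives is to walk up from it to the nearest mount point and then compare that against the instance's volume layout. The sketch below only does the first half using the Python standard library; the `/var/lib/postgresql` path is a hypothetical example, and whether a given mount is EBS or ephemeral still has to be read off the instance's own block-device mapping.

```python
import os


def mount_point_of(path):
    """Walk up from `path` until we reach a directory that is a mount point."""
    current = os.path.realpath(path)
    while not os.path.ismount(current):
        current = os.path.dirname(current)
    return current


# Hypothetical data directory; substitute wherever your database writes.
data_dir = "/var/lib/postgresql"
print(data_dir, "lives on the filesystem mounted at", mount_point_of(data_dir))
```

If the mount point reported here is the ephemeral volume rather than the EBS one, the data will not survive the instance being stopped.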

EBS is a slow turd. Running a database on EBS is running a database on network attached storage. I know tons of people do this, but I have no idea why. Provisioned IOPS will work, but it is expensive.

Local storage on a $5 Digital Ocean plan will do 2000 IOPS, whereas that would cost $200/month with Amazon Provisioned IOPS. I know it is not an apples to apples comparison, but it is worth thinking about. Running a database on a local SSD is a good option for many people, and it is not an option that Amazon offers.
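To put the quoted numbers side by side, here is a small arithmetic sketch. It uses only the figures from the comment above (which are the commenter's estimates, not current pricing) and, as noted, compares a whole droplet against a storage charge alone.

```python
# Figures as quoted in the comment above (the commenter's estimates, not current pricing).
IOPS = 2000
DROPLET_USD_PER_MONTH = 5      # whole Digital Ocean plan, local SSD included
PIOPS_USD_PER_MONTH = 200      # Amazon Provisioned IOPS for the same 2000 IOPS


def usd_per_iops(monthly_usd, iops):
    """Monthly dollars paid per unit of IOPS."""
    return monthly_usd / iops


def cost_ratio(expensive_usd, cheap_usd):
    """How many times more the first option costs than the second."""
    return expensive_usd / cheap_usd


print(f"Digital Ocean:    ${usd_per_iops(DROPLET_USD_PER_MONTH, IOPS):.4f} per IOPS-month")
print(f"Provisioned IOPS: ${usd_per_iops(PIOPS_USD_PER_MONTH, IOPS):.4f} per IOPS-month")
print(f"Ratio: {cost_ratio(PIOPS_USD_PER_MONTH, DROPLET_USD_PER_MONTH):.0f}x")
```

On these figures the provisioned option works out to 40 times the monthly cost for the same nominal IOPS.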

Actually there absolutely is an instance type for this. It's the high storage io instance. Not cheap, but if you want local SSD it is definitely on the menu

I have no idea, just things don't work on EC2.

> Do your VPSs already have things installed or a GUI?

Nope. I prefer straight-up Arch linux. ssh in and go from there.

To add some concrete-ness to the mix, I was installing ejabberd. When it came time to ping the server... no response. I did the exact same steps on my Digital Ocean VPS and everything went fine. I had done whatever commands EC2 expects to open the right ports...

Ha! I knew I wasn't the only one.

I'm a large user of AWS but, in general, I never feel like I'm in a true VPS. My last experience:

All of a sudden one of our EC2 instances could not connect to another, causing our HA solution to spin up dozens of instances and eventually crash too. It looked to me as if the destination machine was behind some firewall, so we checked the security group: the machine was supposed to accept any connection, on any port, from any host. The instance itself had no active firewall.

Out of desperation I added the security group to itself. It worked for a few hours, then stopped again. I removed the rule, it started working again, and it is still working (8+ months now).

This is only one of various mysterious events I've seen happening on AWS.

For EC2 security groups you have to open access to both the correct ports and protocols. Ports are a concept at layer 4 of the OSI model, while ping, or more correctly 'ICMP Echo Request', is lower down the TCP/IP stack at layer 3. So when configuring the security group, look for the option to choose Protocols, then enable ICMP :)
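A toy model can make the ports-versus-protocols point concrete. This is not the AWS API, just a default-deny rule matcher written for illustration; the `Rule`, `Packet`, and `allowed` names are made up here, and port 5222 (the usual XMPP client port) stands in for whatever the ejabberd setup opened. In a real security group, ICMP rules are expressed with an ICMP type and code rather than a port range.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    protocol: str                      # "tcp", "udp" or "icmp"
    ports: Optional[Tuple[int, int]]   # inclusive range; None for portless protocols


@dataclass(frozen=True)
class Packet:
    protocol: str
    port: Optional[int] = None         # ICMP packets carry no port


def allowed(packet, rules):
    """Default-deny: a packet passes only if a rule matches its protocol and port."""
    for rule in rules:
        if rule.protocol != packet.protocol:
            continue
        if rule.ports is None:
            return True
        low, high = rule.ports
        if packet.port is not None and low <= packet.port <= high:
            return True
    return False


xmpp_only = [Rule("tcp", (5222, 5222))]
with_ping = xmpp_only + [Rule("icmp", None)]

print(allowed(Packet("tcp", 5222), xmpp_only))   # the opened TCP port gets through
print(allowed(Packet("icmp"), xmpp_only))        # ping does not: no ICMP rule
print(allowed(Packet("icmp"), with_ping))        # ping passes once ICMP is enabled
```

Opening a TCP port says nothing about ICMP, which is why a server can be reachable by its service and still not answer ping.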

Thanks. This is exactly the kind of stuff that I could learn from said blog post, and why I think "EC2 == VPS" is false.

The documentation I was following made no mention of this at all.

> I have no idea, just things don't work on EC2.

A raw EC2 instance is identical to your average VPS offering. You don't have to use all the extra alphabet soup of ELB, SQS, SNS, SES, etc.

> When it came time to ping the server... no response.

What were your security group settings? ICMP (ping, traceroute, etc.) are blocked by default, you have to enable them.

> A raw EC2 instance is identical to your average VPS offering.

See, you say that, but then you later say

> What were your security group settings?

I dunno, man, I just want a server. Every other VPS provider works, and I can set up my own iptables etc.

I'm not saying EC2 is _bad_, as I'm sure things being extra mega locked down by default is good overall. But I don't think it's fair to say that EC2 is the same as a VPS.

The solution to your problem was one Googling for "AWS ping" away, and I've had more significant setup differences between two VPS providers than having to configure the AWS firewall to allow pings.

EC2 fits every definition of VPS I've ever seen.

At that point, I had already been googling for so long that giving up was the best decision. Also, when I said 'ping' I meant 'hit via a web browser' as well as ping on the command line.

Furthermore, I don't feel very comfy when doing 'sysadmin via Google,' who knows what stuff I'm screwing up?

I had never used Digital Ocean before. But getting going with them was the exact same as my previous Linode, Rackspace Cloud, prgmr.com, and every other VPS provider I have tried. I don't need to install special command line tools, or set up security groups, or generate .pems... I ask for a server, they send me an email with a password, I log in, change it, and set up my ssh prefs. Super easy, using the same stuff I use everywhere else.

The big difference between traditional VPS and IaaS services like EC2 (or Google Compute Engine, etc.) is that the latter has dynamic scaling (and a pricing model built around the assumption that you will use dynamic scaling) as a core feature; for traditional VPS-style work on an IaaS, you are likely to end up paying a premium for flexibility you aren't using (and possibly dealing with some attendant management complexity from the same source), but, other than that, IaaS should be a complete substitute for VPS.

I would absolutely love this. Can't get enough solid sysops info - as a dev I find the docs, tutorials, etc out there fairly esoteric. Anything written recently and actually finished helps a ton : )

I have quite a bit of devops experience myself, especially on Amazon Web Services and their CloudFormation service combined with Chef. I would be very interested in a blog post (or a series, go nuts!) and I'm considering doing a series of write-ups on our setup as well.

I would LOVE to see a detailed breakdown on how an experienced sysadmin would set something like this up. I've cobbled together systems before, but I've found that doing so in a robust way is difficult.

Hot dog! I am also interested. I hope your recipe is free of fillers and artificial ingredients. Looking forward to casing your blog site. Address?

Very interested!

Definitely interested. The less you assume about your readers the more helpful it will be to me.

I wouldn't mind it personally.

yes. very interested. very very interested :)

yes please.. ohh god yes.



Very interested as well!

That's actually, honestly, a bad way to do it. It's fine to pre-load an instance with source and application/package dependencies. At this point though, you really should be backing those up with it being deployed by Puppet/Chef.

Essentially you pre-run a puppet manifest and have it do all the first-run processing and then store THAT as the AMI/instance. This way you have an instance that only needs a few seconds to get itself ready while also integrating configuration management (which really you should be doing these days). Just some suggestions from a sysadmin.

Depends on how fast you want new instances to be online. Sometimes you want to respond to demand very fast. It can take a few minutes for an instance to register with ELB, and if you have to download/build/configure packages first you are just adding minutes on top. The faster you can spin up instances, the higher you can run your servers, since you need to tell Amazon at what threshold it should spawn more instances.
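The link between boot time and threshold can be sketched with a back-of-the-envelope formula. This assumes utilization climbs linearly while the new instance is coming up, which is a simplification; the function name and the example numbers (90% ceiling, 5 points per minute, 1 versus 8 minutes to boot) are made up for illustration.

```python
def scale_out_threshold(ceiling_pct, growth_pct_per_min, boot_minutes):
    """Utilization at which to trigger a new instance so that load stays under
    `ceiling_pct` while the instance boots, assuming linear growth."""
    threshold = ceiling_pct - growth_pct_per_min * boot_minutes
    return max(threshold, 0.0)


# A pre-baked AMI that serves traffic after about 1 minute.
print(scale_out_threshold(90, 5, 1))
# An instance that installs and configures everything at boot, about 8 minutes.
print(scale_out_threshold(90, 5, 8))
```

Under these assumptions the fast-booting instance lets you wait until 85% utilization before scaling, while the slow one forces the trigger down to 50%.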

There are 3 general levels you can take with AMIs:

1) Vanilla instance AMI, have puppet/chef install everything for you, then fetch your app code and configure that

2) Use another tool to have your instance built with puppet/chef and then capture that into an AMI. Once it spins up it just needs to get your app code. The idea here is that your services aren't likely to need updating as fast as your app code.

3) Same as two, but when your production code is in release/maintenance mode, bake everything into an AMI. When you need to deploy new code, you need to create a new AMI, but you are creating an AMI with scripts so it's no big deal :) All you have to do now is update your CloudFormation
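The three levels above differ only in how much work is left for boot time. A minimal sketch of that idea follows; the step names and level labels are paraphrased from the list and are not tied to any particular tool.

```python
# Everything that has to happen before an instance can serve traffic.
ALL_STEPS = ["install packages", "apply configuration", "fetch app code"]

# What each level has already baked into the AMI.
BAKED_IN = {
    "1) vanilla AMI": [],
    "2) services baked": ["install packages", "apply configuration"],
    "3) fully baked": list(ALL_STEPS),
}


def boot_time_steps(level):
    """Steps that still have to run when an instance of this level starts."""
    return [step for step in ALL_STEPS if step not in BAKED_IN[level]]


for level in BAKED_IN:
    print(level, "->", boot_time_steps(level) or "nothing left to do")
```

The trade-off is that the less there is to do at boot, the more often the AMI itself has to be rebuilt.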

Well, this is why I suggested pre-running Puppet and installing your code and dependencies into the AMI before saving it. That way you get the best of both worlds.

Obviously if you have a setup with rapidly deployed code changes you'll want to have your puppet manifest grab the latest version before deploying live when spinning up a new instance.

I believe #3 is what Netflix does.

Puppet/Chef is overkill for a site running a config this simple.

I don't really agree with this. If you're defining a whole crazy master/agent relationship with Puppet then it's overkill for them, no doubt.

There's a reason many Vagrant boxes I see use Puppet to instantiate though: It's friggin simple to get started with. More advanced configurations might even end up requiring you to build your own modules and extended manifests, but I expect this configuration detailed in the article would probably amount to a couple hundred lines for the entire manifest.

I think this also underscores one of the things I like the most about Puppet (instead of Chef, which is a fine CM system as well); It's easy to get started with but powerful enough to get things done down the road. Honestly, there's never a better time to start working with Puppet than from the very beginning of a project!

What do you recommend for simple configurations where you still want the repeatability/documentation/versioning tools like Chef/Puppet provide? Shell scripts?

You could build containers using Docker (http://docker.io) and Dockerfiles. Here's an example: https://github.com/steeve/docker-opencv

scriptrock.com - FD: I'm one of the founders, we think of it as an executable documentation tool.

babushka - http://babushka.me/

Overkill for now. But it's best to learn the tools you will need while things are still simple. You know how things should get set up and can ensure that what you're telling Puppet/Chef to do achieves that.

If you wait until you've got a more complex configuration, you're fighting both the tool syntax and system setup requirements. Not to mention the clock that's ticking and telling you that you needed to be all up and running yesterday.

After building a half dozen one-off sites, each with a simple, single-server config, I found that sitting down and learning Chef was helpful for jumpstarting the next project. All of the basic needs are the same: nginx, unicorn, postgresql, etc, and I was tired of reading through my jumble of system configuration notes. Yes, I could have baked AMIs and created new AMIs, but not everything was Amazon-based.


Elastic Load Balancer = software based load balancer or traffic director

auto-scaling rules = if number of connections > 50 or bandwidth > 100k or response time > 10ms, start a new instance and automatically update the load balancer so the new instance takes connections

Add in reasonable max-node count: automated processes that cost you money are dangerous if not governed. You don't want your provisioning tool spinning up endless amounts of nodes, all because some backend database has become slow (some new user pattern has emerged and is causing indexed queries to queue up), and all response times are getting trashed.
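The rule and the cap described above can be sketched together. This is a plain-Python illustration, not how AWS Auto Scaling is configured; the thresholds are the ones quoted in the rule, the "100k" bandwidth figure is kept unitless because the comment gives no unit, and `Metrics` and `desired_instances` are names invented for the example.

```python
from dataclasses import dataclass

# Thresholds quoted in the rule above; real values depend on the application.
MAX_CONNECTIONS = 50
MAX_BANDWIDTH_K = 100      # "100k" in the comment, unit not specified
MAX_RESPONSE_MS = 10


@dataclass
class Metrics:
    connections: int
    bandwidth_k: float
    response_ms: float


def overloaded(m):
    """True if any one of the three thresholds is exceeded."""
    return (
        m.connections > MAX_CONNECTIONS
        or m.bandwidth_k > MAX_BANDWIDTH_K
        or m.response_ms > MAX_RESPONSE_MS
    )


def desired_instances(current, m, max_nodes):
    """Add one instance when overloaded, but never go past the max-node cap."""
    if overloaded(m) and current < max_nodes:
        return current + 1
    return current


slow_backend = Metrics(connections=10, bandwidth_k=20, response_ms=400)
print(desired_instances(2, slow_backend, max_nodes=4))   # scales out: 2 -> 3
print(desired_instances(4, slow_backend, max_nodes=4))   # capped: stays at 4
```

With the cap in place, a slow backend database can push response times up indefinitely and the group still stops growing at `max_nodes`, so the bill stays bounded while someone investigates.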

Hehe, yeah, that sentence doesn't sound particularly simple. :-)

Check out the code snippets I linked to from that blog post. It's pretty easy, I promise -- the tricky thing is just figuring out the various APIs.

I apparently need to get back to that book.

You know those days when you think you know a bit, and stumble upon someone who knows vastly more than you? Today was one of those days. It appears we're both Devops guys in Chicago; can I buy you a beer sometime?

If free beer is the price of entry here, I'd love to pick your brains as well, since we're about to run into this exact same problem. (I'm in Chicago as well)

Where do all you Chicago hackers hang out? I'm new to the city and I never see anything like a Chicago HN meetup posted here.

If you find out, let me know. I can tell you where the infosec / security researcher types are, but not where the startup people are.

Maybe we should just invent something new.

Sounds great. where do I sign up? :-)

Mail me? We should just arrange the next HN meetup ourselves. We can probably pick a better venue than the last one I went to here. :)

Email sent (earlier today).

I would love to come, just to listen and learn.

Hey, I'd love to join! I'll buy the beers if you two guys just teach me all the sysadmin stuff you know. :-D

I'd certainly be down. I love talking about this entire arena, specifically where it falls on its face.

Thanks for the post. Would you mind sharing how much you were paying Heroku per month (on average and peak usage)? Thanks again!

It's not simple, AMIs are a pretty bad idea and not something you should rely on due to the statefulness and vendor lock-in unless you can reproduce it in disparate deployment environments from scripts.

The AMIs can be imported and work on a Eucalyptus private cloud, so there is no lock-in: http://en.wikipedia.org/wiki/Eucalyptus_%28computing%29

It's actually a lot simpler than it sounds:

- Create a new EC2 instance (literally click "Launch instance" and select Ubuntu).

- Log in and install any dependencies (sudo apt-get update && sudo apt-get install <my list of awesome packages I need installed>).

- Go back to the EC2 dashboard and click "Create AMI", which just images the server.

- I can't attest to the ease of ELB and auto-scaling rules as I haven't used them, but I would assume they're fairly straightforward, and there are a ton of resources on AWS to help you out :-)

Have fun!

I pitched a book to No Starch and a few others, and they all said there wasn't enough demand...

I know Packt was looking for a Chef author a few months ago. I declined to work with them, but I'm working on an outline for a Chef/DevOps book to self-publish.

I'm surprised I didn't know that; they've invited me to write a book on nearly every other technology I listed on my linkedin profile. :-/

That's why I don't buy books from Packt unless I can find good reviews of it written by members of the community around the software the book is about.

PostgreSQL 9.0 High Performance is a Packt book with excellent content, but it has multiple typos and grammar errors on every page. I'm sad to see such an outstanding book marred by bad editing.

I'm a friend and former colleague of the author, and having read the book from cover to cover I don't think it's true that it has lots of typos. I think you might be confusing it with another title.

If you think they're wrong, write it anyway and put it in the Kindle store.

That's the approach I'm taking :)


Check out Amazon's Elastic Beanstalk - it handles the provisioning/deployment side of things - set up your environment with "eb start", update your app with just a "git aws.push".

Quite configurable via scripts too (install additional packages, etc). Under the hood it's basically the same - Elastic Load Balancer and EC2 instances.

I gave AEB a try with Django about 3 months ago, and let me say it was a major headache. Sometimes it would work, other times the load balancer would fail, other times it would just laugh in your face (there were many more, but you get the point). I wasted weeks trying to get our app running. When it came down to only 3 days before our demo run, I made some bash and fabric scripts and had everything running on multiple instances in about 1 hour. I took a snapshot of our state and used that AMI for auto scaling. I myself won't be trying AEB again anytime soon.

I am working on a site right now, http://makerops.com, that will walk through a lot of the issues described by the OP via screencasts, text, and interactive learning: how to interact, as a dev, with various cloud APIs to auto scale, manage configurations, etc.

This really resonates with me. I have been having identical issues with Dotcloud, a Heroku competitor.

There's nothing worse than wondering if deploying your stupid one-character typo fix will hang or leave the app in an incomplete state.

I myself am working on setting up salt stack for deployments. I like that all of it is in python, and after a couple of days I've begun to make progress.

I am migrating things slowly: first cron jobs, then internal tools, and at some point, our website. The only thing that really needs "scale" is our frontend, and I don't yet know how I'll manage that, but you know, I guess it is time to learn!

Huge fan of salt stack, and I may push Adrian in that direction. The changes we made here were minimal in the way he deploys, and with so many architecture changes behind the scenes, we decided less is more on that front.

any link on the advantage of using saltstack for aws instead of AMIs ?

Salt is more easily repeatable across heterogeneous environments.

For example, we're using Salt to manage our development, testing, staging and production environments. We couldn't do this with AMIs, since the environments are quite different.

It makes updating quite easy, as you don't need to create a new AMI, just push your update to your release branch, then run state.highstate from your Master.

If your configuration changes, you update the master and then you can push that change across all (or just one) of your environments.

Does having an identical staging environment help, or are these deploy errors transient/non-deterministic?

The latter, unfortunately. (I wish I had more details - but for one reason or another, it never seems to work, be it a bug in the deployment system or a host which is down.)

Dreams of an identical staging environment were somewhat shattered when the "free tier" plans went away. Plus, sometimes a new version will be deployed to 2 servers, and the 3rd one will be hanging on an update. These kinds of things are difficult to plan for and test, since intentionally messing up a deployment is rather difficult with PaaS.

Shame, I was looking into Dotcloud just the other day and liked the more control they gave you, especially the custom nginx conf. And that they hosted static files for you instead of relying on third party services (i.e. S3).

i'm starting a process to deploy on aws, and i just finished creating my salt stack conf for a vagrant VM ( as a warm up). After reading the post i was in the process of ditching that for a simpler AMI-based deployment.

Could you elaborate on why salt stack is better than custom AMIs ? also if you have any link explaining how to deploy to ec2 using saltstack, that would be awesome.

AMIs are really useful for production servers when you want to quickly launch a bunch of similar servers or set up auto-scaling groups to launch new servers based on load. The issue is that to take full advantage of using AMIs, you need to re-build (bake) the AMI every time you make a change to your production machines.

Disadvantages of AMIs are that they are large files (which can be unwieldy) and are not in a format that can be immediately used in a vagrant instance or elsewhere, so you need to convert between AMI and .vbox or .box formats to use AMIs with your local vagrant instance.

With saltstack, you can deploy and configure machines from base Ubuntu or RHEL distros by using salty-vagrant for vagrant instances and salt-cloud for AWS / rackspace / openstack / vps machines. Once machines are deployed and connected to your salt-master, you can use your yaml config files to bring all the machines to the desired roles (api machine, load balancer, web front-end, etc.). This is more flexible than using AMIs, since you can roll out a minor change quickly to all servers, including production servers, without needing to wait to build a new AMI.

There are hybrid approaches also (some puppet/chef users do this) where for production machines you can have saltstack deploy to a machine that then gets built into a new AMI automatically. And then this new AMI gets updated into the autoscaling groups and is deployed to replace the current production servers. Sounds complicated... and it is, but at the cost of extra complexity it does give you the best of both worlds.

Thanks, Sounds like the last option seems the best indeed. My salt configuration is way too slow to build to be used in an autoscaling scenario, that's one reason why AMI seemed suitable to me, but that last combination you mentioned (which i didn't know was possible) looks great.

Thanks again

On my end I get a complete crash once a week when running with one dyno. The official response I got from support was to upgrade to two dynos.

Apart from the PostgreSQL hosting, I'm not really happy with Heroku. Add to that the random deploy breakage I've experienced mid-launch (not to mention it's slow as hell), and I'm going to move back to my own servers.

Heroku docs do say (and I'm afraid the link to it is escaping me) that if you run only 1 dyno, it will shut down after something like 6 hours of inactivity. So if no one hits your site for 6 hours, the next time someone does it has to boot the dyno up again, which can take enough time for the request to time out.

Not saying this is a great policy, just explaining in case you weren't sure. When you go to 2 dynos they keep things up for you presumably because now you're a paying customer.

This can be mitigated by having cron hit your site every few hours. There are some free cron services like https://www.setcronjob.com/ that do this.
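For anyone who would rather run the keep-alive themselves, a minimal sketch using only the Python standard library is below. The `ping` name and the injectable `fetch` argument are choices made for this example (the latter so the function can be exercised without a network); nothing here is specific to Heroku.

```python
import urllib.request


def ping(url, fetch=urllib.request.urlopen, timeout=10):
    """Request `url` once and return the HTTP status code, or None on failure."""
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Saved as a hypothetical `keepalive.py` that ends with a call such as `ping("https://your-app.example.com/")`, it could be run from any machine's crontab. How often it needs to run depends on the actual idle window, which commenters in this thread put anywhere from 30 minutes to 6 hours.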

Welcome to the cloud, where you need to rely on a third party service to keep your other third party service online. Of course, what happens when setcronjob.com goes down.... well you then use pingdom to check that it is up...... down the rabbit hole we go!

Welcome to the web, where you get an easy hosting service like Heroku without paying a dime and then complain about how you need to put some effort to hack the system so you can continue to pay nothing.

Just to be clear: I am paying $70/mo for SSL and Postgres, so it's not that I'm paying nothing. However, based on the minimal server requirements for my app (most users are using our client-side Javascript tools), one dyno "should" be enough.

I really can't blame Heroku for spinning down non-paying instances that don't get a single hit in 6 hours.

I completely agree, but sometimes third party tools are better than baking in-house, especially single purpose services. App crash analytics for example, or performance trending tools like NewRelic.

But cron? That's just ridiculous.

Or enable availability monitoring with the (free) newrelic addon.

My app is pretty highly trafficked, there isn't a 6 hour window of inactivity anywhere from my logs. What about it running out of RAM?

I've not seen that issue, I suppose it's possible, without knowing more about your app it's hard to say. I only know of the 1 dyno thing, that's what people most commonly run into when they are trying to run a free app.

I think it is much less than 6 hours. Like 30 mins or so.

What's your monthly heroku bill?

So I take it you don't pay for the service? No offense but if I'm in management at Heroku I say to you "I'm very sorry to hear about that. We'll take your concerns into consideration [as soon as you have something to offer us in return]."

See my replies below. I currently spend $70/mo on Postgres and SSL support. I would up my dynos but I don't see the point since one handles it just fine (minus the crashing).

If you think puppet/chef is too much complexity you might find saltstack worth some attention. I haven't used it in real anger yet, but I've really liked what I've seen so far. I see they've even got instructions specific to aws now too [1]

Also worth mentioning cloud formation as well [2]. That might make the pain of chef/puppet more of a worthwhile investment!

[1] https://salt-cloud.readthedocs.org/en/latest/topics/aws.html

[2] http://aws.amazon.com/cloudformation/

Ansible is another alternative that is relatively simple.

I've been using Ansible for a couple weeks now and it is awesome how simple it is compared to Chef/Puppet (IMO)

I really want to use Salt for something one of these days given I work with Python quite a bit... but basically at this point my investment with Puppet is relatively significant, so I'm not sure if the pain of switching would be worth it. I'm sure eventually I'll find a use for it though! I like what I've seen before.

Also there are some useful example states on GitHub: https://github.com/esacteksab/salt-states

My salt states are public[1] and include among other things setup for nginx/redis/postgresql/uwsgi which powers a few Django and Flask sites.

[1]: https://github.com/uggedal/states

Just my 2 cents... the problems described here seem to fall into the category of "things didn't work 100% of the time".

I have some bad news for you: nothing I have ever used worked 100% of the time. Doubly so on AWS. Just off the top of my head, in the last week alone we have seen 5% of AWS instances act so badly that we had to recycle them, and those are just the easy-to-diagnose problems. Don't get me started on the IOPS marketing BS that Amazon sells.

It's just the reality of a lot of moving parts and complex systems. From the sound of it, the OP had very little _actual_ downtime, and had to make some end-runs a couple times. Shrug, just life in a high-scale world IMHO.

The goal should not be to bounce around providers until you find the Garden of Eden. The real goal should be to accept failure and build to tolerate it. Now maybe it is easier to build fault-tolerance if you are closer to the metal... BUT I would bet that Heroku has much better experience and tooling to detect and resist faults than rolling your own.

Really loved working on this with you, and I'm excited to see what you come up with now that you have a much better understanding of AWS. Congrats on the re-launch!

When engaging a super qualified expert like this, say in devops, what levels of interaction are available? For example, is it possible to get the one-hour consult that points the way and mentions the land mines to avoid, with maybe some template files, or is it only the complete package where the expert does everything from soup to nuts? Is it cost effective for an early-stage venture to use the rockstar for something like this? Would someone like this only apply after product-market fit?

Well, it depends on the application. Most people think about scaling far too early, and the same applies to configuration management. I am a fan of shortcuts, early and often... worry about things when you have things to worry about.

That being said, making some easy architecture choices early on can have an enormous effect on your sanity. It is probably worth it to bring someone in to hint you in the right direction after you have a prototype to show and have already started to make choices. This is basically where I helped Adrian.

If your app is struggling under success, you are ideally in a better position financially than most startups are at the beginning. This is when you bring someone in for longer term engagements, or potentially when you hire someone to work on this full time.

I think you are hitting on something that is really a need in the market right now. When you don't need a consultant, and you need 30-90 minutes of "how do I do this?" Q&A, it is hard to figure out who to talk to.

From my perspective, it is hard to find those engagements. I spend a sad majority of my time on the sales side of things. Recently, I spoke to some folks at 10xmanagement and Adrian pointed me to anyfu, both of which are approaching this problem but in slightly different manners.

Lastly, you don't want the soup to nuts guy. This is your app, and if you aren't deeply involved with what is going on with it, you are putting yourself at a huge disadvantage post consulting engagement.

I've often gone on Quora or Stack Exchange to answer questions about scaling or stack design, but the questions (and answers) are often very simplistic, or the questions are so specific that I'd need to spend hours answering.

If there was a place I could go to get some cash for my time answering these questions, I'd certainly be more inclined to spend the time to do so thoroughly.

Mentoring contractor start-up perhaps?

From the OP:

> I changed the app to use cookie-based sessions, so that session data is stored in signed cookies rather than in memcache. This way, the web app servers don't need to share any state (other than the database). Plus it's faster for end users because the app doesn't have to hit memcache for session data.

The switch to using cookies for storing session data instead of memcache has tradeoffs. Sure, you no longer need to ask memcache for the session data. But you are also shipping a significantly larger cookie back and forth on every request.

If you're storing a lot of data in your session, this could actually slow things down in the long run. [1]

[1]: http://yuiblog.com/blog/2007/03/01/performance-research-part...
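To make the tradeoff concrete, here is a minimal Python sketch of a signed session cookie using only the standard library. It is not Django's implementation (Django ships its own `signed_cookies` session backend); `SECRET_KEY`, `sign_session`, and `load_session` are names made up for illustration. The signature makes tampering detectable, but every byte of session data travels with each request.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"not-a-real-secret"  # hypothetical; a real app would use its own secret


def sign_session(data: dict) -> str:
    """Serialize session data and append an HMAC so tampering is detectable."""
    raw = json.dumps(data, separators=(",", ":")).encode()
    payload = base64.urlsafe_b64encode(raw)
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig


def load_session(cookie: str) -> dict:
    """Verify the signature and return the session data, or raise ValueError."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(payload))


small = sign_session({"user_id": 42})
large = sign_session({"user_id": 42, "cart": list(range(500))})
# The full cookie is resent on every request, so its length matters.
print(len(small), len(large))
```

With a tiny session the cookie stays small; once you start stuffing lists or blobs into it, the per-request overhead grows with it.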

Definitely a great point. Fortunately my session cookies are teeny.

Sad to see the recommendation to use MySQL - in my experience RDS hasn't been worth the effort - but I'm probably biased since I already invested the time in automating PostgreSQL replication setup.

Believe me, I am sad to stop using Postgres. What parts of RDS haven't been worth the effort? I didn't have to make much effort, beyond rewriting my Postgres triggers into MySQL syntax.

Inconsistent performance, significantly slower than MySQL on a real box, and failure of the ridiculously expensive automatic failover were the reasons I dumped RDS. I ate a big loss after reserving instances, but seeing the server fall off the planet, and no spare ever take over for it despite paying for MultiAZ, what other response would have been reasonable?

RDS is pretty expensive for what you get. You can't restore to a running instance from snapshots, you get very little control over the environment, and you can't replicate between geographic regions (only availability zones).

I know you said you're a two man shop, but in this case it may make more sense to leverage other IaaS DB services instead of RDS.

I am not sure what you mean by "You can't restore to a running instance from snapshots" cause, well, I've done it a few times.

Could you expand a bit maybe?

My apologies. I should've said "to an existing instance."


"You must create a DB snapshot before you can restore a DB instance from one. When you restore the DB instance, you provide the name of the DB snapshot to restore from, and then provide a name for the new DB instance that is created from the restore. You cannot restore from a DB snapshot to an existing DB instance; a new DB instance is created when you restore."

I have also used their snapshots with success

"you get very little control over the environment" also seems spurious. You get parameter groups to control just about everything you can from a standalone mysql instance. logs or things that load from disk (LOAD DATA ...) are the only things you can't touch, but a small price to pay for automated backups, failover and scaling (both up AND down...)

Sorry if I wasn't specific enough.

In my opinion, you don't get enough control over the environment for what you're paying on a per hour basis. Automated backups? Great, they aren't that difficult to begin with. Failover? Sure, within the same AZ, when you need to be doing it between datacenters. Scaling? I will grant that it scales up and down fast automagically.

I've seen RDS have significant per-query overhead, and some very, very variable read times - in particular, some simple SELECTs randomly taking > 10s every so often.

RDS is subject to the exact same intra-region latency issues that other AWS instances have. Perhaps worse in some cases, since their failover appears to be DRBD based.

Is your ec2 instance in the exact same zone/region as your rds instance? An ec2 instance in us-east-1a will have extra latency dealing with an instance in any other zone, even us-east-1c.

Stating us-east-1a means nothing, since your us-east-1a is someone else's us-east-1d. They're not the same across AWS accounts.

Right, but 1a is never the same as 1c.

Did you consider using Heroku Postgres?

I'm not the OP, but the thing that's always scared me off from those types of services is that I think there would be high latency, where in a DB you really need low latency. Is that not true?

Since Heroku runs on EC2, as long as OP kept his servers in the same availability zone as the DB, latency shouldn't be any worse than running it himself. But yes, in general I'd be wary of a database far away from my app servers.

That can depend on whether you can hit the internal IPs, or just the external IPs. If you can only hit the external IPs, you're subject to greater network latency, plus data transfer fees.

We came across this issue as well; we'd been wasting too much time setting up Postgres. RDS seemed like a good choice, except that MySQL is just that much worse than Postgres. We had just used things like concurrent indexes.

We ended up going with Heroku's hosted Postgres solution. It costs the exact same as we were spending w/ two High CPU instances w/ provisioned IOPS volumes. Now we get fully managed, same price, and all the features of Postgres.

We still host our full application on AWS, the only thing is that we have a managed database. While Postgres makes it easy to setup replication and what not, AWS hardware just sucks. It takes time to properly tune it.

What tier of postgres do you use? How's performance?

In my experience RDS has been amazing, and I am a MySQL guy. I say it like that because as a MySQL guy I would much prefer to setup my own server and use a third party engine like Percona or Maria, but RDS(MySQL) performance has been incredible.

The only thing I would complain about is the lack of a proxy or load balancer in front of their Multi-AZ setup. With any failover you are forced to have a downtime of 3-5 minutes while DNS propagates.

A month or so ago I thought about starting a hosted Postgres service on top of EC2, since lots of people seem to want "RDS for Postgres." I think if you specialized you could beat Heroku's offering, but ultimately I decided there was too much risk of Amazon simply adding Postgres themselves. But if anyone is more daring than me, perhaps there is an opportunity here.

I saw this in the AWS marketplace (aws.amazon.com/marketplace) - http://www.enterprisedb.com/cloud-database/amazon

no experience or affiliation, though.

I saw that, but I could do better. :-) Heroku's Postgres-only option is the product to beat. It makes it really easy and has some great replication features. But a dedicated company could offer more.

I hadn't seen it prior to reading/replying to your comment, but I thought it looked comparable to RDS as I skimmed http://www.enterprisedb.com/products-services-training/produ...

I've not used Heroku's Postgres stuff, so no idea what it offers beyond that

I believe EnterpriseDB already offer something like this: http://www.enterprisedb.com/

I wouldn't say there is much effort required at all with RDS. We use it without issue and while there are some annoying issues with limited access to the environment, it's been pretty solid for us.

Two quick Heroku recommendations that wouldn't fix all (any?) of these issues, but would at least make them more palatable.

1. Change your default heroku error pages. From this: http://s3.amazonaws.com/heroku_pages/error.html and this: http://s3.amazonaws.com/heroku_pages/maintenance.html to something else (in the app settings tab).

2. Keep a staging instance up and running with a cloned database so that you can avoid random heroku-created errors in production.

I get a 500 when I go to the site. It seems like a bad omen for an article about your hosting setup.

Which site? My blog? The blog is still on Heroku, and it sounds like it was overloaded for a while there. I just upped the number of dynos, so it should be fine.

I don't have any indication that soundslice.com (which is the site I'm hosting on AWS) was returning any 500s, but if it was, please let me know.

I am curious how many dynos you end up using under traffic for a blog site. Care to share?

Also, how many dynos did it take to handle the load on soundslice, when it was on the front page of reddit?

I don't have a definitive answer for the blog, to be honest. I just added four more dynos (to bring it up to five total) and it seems to be fine. I figure it's only for a few hours, so I can be sloppy about it without needing to worry about huge costs. Wish I could be more scientific about it!

As for the number of dynos when Soundslice was on the Reddit homepage, I think I was using 10 at that point, but, again, it was a totally non-scientific thing. 10 did the job, but fewer dynos might have been just as fine.

Thank you for responding. That's exactly what I was looking for, a ballpark figure for what can get you through a peak in traffic.

It's timing out for me. This has certainly never happened to me with Heroku.

AWS is great. We use it at work and I spun up extra server capacity for a once-a-year open studios event.

That being said, as someone who's been in the Unix world for a while, figuring out which AWS services to use is not obvious at all. It took me some time to figure out the alphabet soup of services, and I was already familiar with it from work. I ended up using CloudFormation, which builds you a server (LAMP or other) and an optional database/load balancer configuration. It was that or selecting a LAMP AMI (Amazon Machine Image). They have a lot of documentation, but it's hard to get an overview of what everything is (S3, elastic storage...). Plus, configuring web servers and database servers for best performance can be non-trivial.

I get better speeds from our organization's cheap "shared hosting" during low loads. I was pretty sure the one week of very high loads would have crushed the shared host, thus AWS was perfect.

Switched from Rackspace to AWS - couldn't be happier. I think AMZN is really far ahead of the competition and their pricing seems to be as good if not better for many use cases.

The OP says he opted not to use Chef/Puppet and went with baked AMIs instead. In my experience, Chef simply takes too long to build a machine. I use it to ensure a repeatable, self-documented machine setup, but then if I'm on AWS I snapshot an AMI for quick scaling. This also lets you more easily use Amazon auto-scaling tools like Cloud Formation, which bring up new machines based on an AMI. But having the Chef script is still great for knowing what's on your box and having "source code" to change it.

One thing I'm curious about re the OP's process: he says he is using Fabric for deploys. Does that mean every time he deploys new code he has to snapshot a new AMI? In that case, why use Fabric at all? I'd be worried about auto-deploying AMIs with outdated code.

Since EC2 instances are "disposable," one approach is to never "update" an instance, but instead you release new code by simply launching fresh instances with the latest code, then destroying the old ones.

The baked AMI comes up and pulls, so it doesn't need to be baked each time.

If you take the route you just mentioned, I highly recommend using Netflix's Asgard (https://github.com/Netflix/asgard) and check out their recently released Aminator (https://github.com/Netflix/aminator). Asgard specifically makes AMI based deploys outrageously easy.

My preference is to make a base AMI that's configured with Puppet and a couple of core users/keys. I make this instance automatically hit our puppetmaster and use things like security_groups and user data to figure out what type of node it is. One of the things that Puppet installs is a script to generate a new AMI from the built instance; this script optionally deletes the Puppet package.

Now you have an easy way to build your ami from the ground up, you only need to worry about core OS package enhancements on the base ami, and you can use your created AMI with autoscaling behind an ELB.

For VERY rapid changes - direct git access is optional (even on startup for current code pulls), but I tend to side with the Netflix guys and focus on an AMI as the minimum unit of deployment.

Check out asgard to make this easier (senseless self plug: I built an asgard ami this weekend - http://imperialwicket.com/netflix-asgard-ubuntu-1204-lts-ami...).

I'm a fan of CFEngine and use it for the things that many use Puppet or Chef to do. I think the key advantage of some kind of system domain-specific language is that it helps document how things are configured. The only way to figure out what's going on in AMIs is mounting their file systems and running diffs to see what changed from AMI to AMI. Having a large number of AMIs can also be a problem in that they proliferate, particularly when you deploy new applications that are based on old applications but slightly different.

On the other hand, if you are using auto-scaling, Chef/Puppet/CFEngine may not be able to finish their work in time if they have to do a lot of work. You have to strike the right balance between what needs to be in the AMI (provisioning) and automation (post-install).

How come nobody thinks about Google's App Engine (and/or Compute Engine) in these situations? Is it really that unbaked?

OP here. I actually used to use App Engine for Soundslice, about a year before Soundslice launched publicly. It was cool in principle, but it's severely limited to the point where I just couldn't run my app on it. Plus there was wildly varying performance with the database layer.

I'm sure it's gotten better since then (this was circa late 2011), but I'd think long and hard about using App Engine for anything nontrivial.

It's too expensive and has dangerous vendor lock-in. I myself was burnt 2 years ago and vowed never to return to that platform. Now we use combined Heroku and AWS services. I do miss GAE, though: its deployment is smooth and its automatic scaling up is great. It's just that I don't want to be locked in any more.

That being said, we still use GAE as a platform for fast prototypes, as a cron service to keep our Heroku instances up + firing up an EC2 instance every night to run a script for 5 minutes. So we only use GAE when there's minimal risk of being locked in.

Compute Engine looks promising. It used to require a subscription to Gold support ($400/month) before you could create any VMs, but they removed this restriction last week.

It still has very limited quotas, though: Maximum of 8 instances or 8 total CPUs. You can request a quota increase through a web form, but I did this a week ago and haven't heard back yet.

Also, it's still in a "limited preview" period, which means your entire zone can go down for weeks for scheduled maintenance, which is pretty inconvenient:


What's the cost difference between Heroku and AWS for you?

Ah, I should have mentioned this in the post. The price is going to end up being basically the same. I could have done it even cheaper, but I'm paying extra for multi-AZ stuff (i.e., making the database and load balancer available in multiple availability zones, for failover protection).

You should edit this in at the end.

edit: though you'd have to factor in the cost of bugging Scott when you're not his friend.

Be careful about the multi-AZ promise. One of the reasons we don't use RDS is because there is no multi-region failover/replication capability.

For discerning users who are ready to spend effort, AWS is definitely the way to go. No restrictions: fire up your virtual machine, set up everything as you like and optimize as you need. You have the control button.

Can you share approximately how long the process took you? I'm thinking of going down that road with 2 small, basic Heroku Rails apps: 1 web, 1 worker, 1 small Postgres DB, very little traffic. Both apps are already at ~$170/month on Heroku. These are paid SaaS B2B apps that generate very little traffic, but because of the way Heroku partitions its services and add-ons, the bill adds up; this really should cost more like $30-40/month in a regular "hosting environment".

I know almost nothing from a unix sysadmin perspective, but would invest the time if it is feasible.

It took about four full days of work, spread out over a couple of weeks. It would have taken me much longer had I not been helped by Scott (my friend mentioned in the post).

I don't know the details of your app, but $170/month sounds high for that server arrangement. If you set up a few auto-scaling Amazon micro instances, it would give you more capacity and be cheaper. Of course, the tradeoff is that you'd need to either learn how to do it or hire somebody. :-/ Good luck!

2 apps, same profile:

SSL: $20
Postgres basic: $9
1 free web dyno: $0
1 additional worker: $36
Scheduler add-on usage: $5

$70 for each app (it's actually ~$140.00). Approximately 100-200 business customers might use the apps daily.
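For anyone checking the arithmetic, the line items quoted above can be summed in a few lines of Python. The dictionary keys are just labels for the figures in the comment; nothing here comes from Heroku's actual price list.

```python
# Monthly line items per app, as listed in the comment above (USD).
per_app = {
    "SSL": 20,
    "Postgres basic": 9,
    "free web dyno": 0,
    "additional worker": 36,
    "Scheduler add-on": 5,
}

apps = 2
per_app_total = sum(per_app.values())
print(per_app_total, per_app_total * apps)  # prints: 70 140
```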

And of course, there are the bad billing practices: they don't stop charging you for add-ons you have stopped/removed, and they send invoices a month late, so by the time you notice, another round of billing has occurred incorrectly. This has resulted in me paying $350 and $251 in the past 2 months.

Exact quote from Heroku customer support email, after overcharge of ~$300 in past 2 months

"I'm sorry that the delay in receiving your invoices caused the charges to continue for longer than you would have liked, however please keep in mind that we offer your Current Usage details[ on your account page]"

Yeah, not happy with Heroku at all. Even though I could better use my time generating additional sales for the apps, I'm so pissed that I plan to burn the time to move and cancel ASAP...

You could host your apps easily with a $5 a month digital ocean VPS.

You can also look into aws beanstalk. It does all the autoscaling/deployment stuff in a gui/easier api

I used to be a big fan of Heroku for a few customers' projects and for small stuff (mostly free hosting) of my own. At least for my own projects, I decided that for the monthly cost of a large VPS (I use RimuHosting) I can run 5 web apps written in Clojure (I mention the language because this is an efficient runtime setup, once the JVMs warm up). I am giving up temporary scalability for a lot more bang for the buck. I also like dealing with smaller companies because you get great personal service.

Mark, thanks for sharing. Do you use an app server like immutant or does each app run on a separate JVM instance?

AWS OpsWorks is another reason to move to AWS. It is not as good as Heroku yet! But it is evolving...

I had a great conversation the other day with one of the OpsWorks product managers. I've been using OpsWorks for a new client and so far I'm loving it over Rubber/Capistrano. Rubber was OK, but OpsWorks wraps up the entire process just enough (while leaving enough control).

Not sure how I feel about these "Being a sys-admin is hard, I'd rather not do it" type posts. This is literally THE CORE OF YOUR COMPANY, your site not working is inexcusable!

I think this is going to be an emerging problem for many startups these days. Patterns of completely unstable products because "ops is hard LOL" followed by a desperate attempt to hire actual sysadmins for their company and excuse after excuse of how "awesome we are for growing so quickly!".

This. If you don't want to and/or are unable to properly operate your systems, please stay on Heroku, App Engine, etc.; otherwise this is a disaster waiting to happen.

Is anyone aware of an up to date comparison of the various orchestration systems?

I have rolled my own which, being totally biased, I do believe is more full featured than most of those mentioned in this thread, as well as more redeployable/generic. (100% pluggable platforms, built to integrate tightly with build services and developer environments, etc.). Limited internal testing at this stage (1000s of VMs instantiated, across three cloud providers (one internal), only two distinct OS deployment environments targeted thus far).

However, it'd be great to have an overview of this area... it's certainly exciting. I am beginning to feel like the DevOps baton has been passed from large companies with massive infrastructure requirements back to the community over this last year, as more developers are becoming aware of the pitfalls of single-provider cloud solutions and the rough edges and missing features of existing multi-cloud deployment APIs. We can probably expect great things in this area over the next 12 months.

When exactly do you bake a new AMI? On every Soundslice deploy?

No, definitely not on every deploy! For a deploy, I just do a "git pull" on all of the production boxes and restart the web server.

Baking a new AMI only happens when there are new underlying dependencies, like a new package from apt-get or a new Python module from pip. In other words, it's rare.

Hope that helps!
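A rough sketch of what that "git pull everywhere, then restart" deploy looks like, written with plain `ssh` via Python's standard library rather than Fabric. This is not the OP's fabfile: the host names, `APP_DIR`, and `RESTART_CMD` are all hypothetical placeholders, since the thread doesn't give them.

```python
import shlex
import subprocess

# Hypothetical values; the thread does not name hosts, paths, or the service.
HOSTS = ["web1.example.com", "web2.example.com"]
APP_DIR = "/srv/app"
RESTART_CMD = "sudo service webserver restart"


def deploy_command(app_dir: str, restart_cmd: str) -> str:
    """Build the remote shell line: pull the latest code, then restart."""
    return f"cd {shlex.quote(app_dir)} && git pull && {restart_cmd}"


def ssh_argv(host: str, remote_cmd: str) -> list:
    """Argument list for running remote_cmd on host over ssh."""
    return ["ssh", host, remote_cmd]


def deploy(hosts, run=subprocess.run):
    """Run the deploy command on each host in turn, stopping at the first failure."""
    cmd = deploy_command(APP_DIR, RESTART_CMD)
    for host in hosts:
        run(ssh_argv(host, cmd), check=True)
```

Calling `deploy(HOSTS)` would walk the boxes one at a time. Note that this only updates code; new system packages or pip dependencies are the case where re-baking the AMI comes in, as described above.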

Is there any reason for not letting pip update things automagically from your `prod_requirements.txt` using a simple fabric `fab update_pip` when necessary?

Yes, in fact that's what I do! The issue is I need to account for any new servers that might spin up later (hence updating the AMI).

I could also change the AMI's "user data" to run pip when the instance loads, but I'm not 100% confident it will always run without errors. I feel better doing it manually and baking it into the AMI. Personal taste.

For the record, we have a very similar production environment and we bake on every deploy. We have some tricks to make it as fast as possible though. A deployment takes 2-3 minutes with code changes and more with requirements changes. We then place it in the ELB server pool and pull out old machines if everything behaves correctly. Looking into autoscaling now.
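The "add new machines, then pull out the old ones if everything behaves" step can be sketched as a small pure function. This is a toy model, not real ELB code: the pool is just a list of made-up instance IDs, and `is_healthy` stands in for whatever health check you trust.

```python
def roll_out(pool, new_instances, is_healthy):
    """Return the pool after a deploy.

    If every freshly baked instance passes the health check, the pool
    becomes the new instances only (old machines are pulled out).
    Otherwise the old pool is kept unchanged.
    """
    old = list(pool)
    new = list(new_instances)
    if new and all(is_healthy(instance) for instance in new):
        return new
    return old


# Hypothetical instance IDs, and a health check that always passes.
print(roll_out(["old-1", "old-2"], ["new-1", "new-2"], lambda i: True))
```

The useful property is that a bad bake never takes traffic away from the machines that were already working.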

I saw your video (http://37signals.com/svn/posts/3446-adrian-holovaty-talks-so...) on the 37signals blog where you mentioned that the server stuff was boring. Now it just got interesting!

Ha, I guess you're right! I must say it's become more interesting than I expected.

I have always wanted to learn how to build stacks on AWS and use them with my apps. Does anyone know of any good resources out there for this, definitely for a noob like myself? Perhaps a Codecademy-like site or user-friendly book?

This sounds pretty great. I haven't used Heroku, but have used Rackspace, AWS and Linode.

What worries me the most with AWS, is that there's virtually no support. I guess you can pay and get decent support, but on the low-end of the scale, it's pretty thin or non-existent. Now, I don't hear of so many issues with AWS that warrant contacting support about, and I don't face any issues myself, but just the thought of something going strange and having nobody to talk to makes me nervous.

In contrast, Linode does not have so many cool features, but their support is there and very responsive.

You can just bump up to AWS's $50 paid support as-needed for one month, if you need to. They've been very responsive to us with the three cases we've opened - roughly an hour to resolution each time.

Thanks. That makes more sense, and I guess a reasonable price if you consider you don't need this every month.

According to https://aws.amazon.com/premiumsupport/ - the developer support is within local business hours though...

I still wish this could have been factored into your hourly AWS costs rather than as a monthly fee. Say, pay extra 10% for every EC2 instance-hour or something and get it included (without the $100 minimum as in the business support that is).

It wouldn't be sustainable. The folks trying to build the next Amazon.com on a t1.micro would be paying a buck a month and expecting the same level of support Netflix gets.
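Back-of-the-envelope version of that point, in Python. The figures are assumptions, not from the thread: an on-demand t1.micro price of $0.02/hour and a 730-hour month. Only the 10% surcharge comes from the parent comment.

```python
# Assumed figures (not from the thread): t1.micro at $0.02/hour, 730 hours/month.
HOURLY_PRICE = 0.02
HOURS_PER_MONTH = 730
SUPPORT_SURCHARGE = 0.10  # the 10% idea from the parent comment


def monthly_support_fee(hourly_price, instances=1):
    """Monthly support revenue if support were billed as 10% of instance-hours."""
    return hourly_price * HOURS_PER_MONTH * SUPPORT_SURCHARGE * instances


print(round(monthly_support_fee(HOURLY_PRICE), 2))
```

Under those assumptions a single micro instance would contribute about a dollar and a half a month toward support, which is the "buck a month" problem.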

Yep, you're right. I guess support is one of those aspects that can't be measured per Gb, so there must be some kind of an entry cost. I do wonder how Linode manages to keep this manageable for them though.

> In contrast, Linode ... but their support is there and very responsive.

People keep saying this but it simply isn't true. It is only responsive when everything is working fine and you have some OS/app level issue.

But watch what happens when their Fremont DC goes down. You will submit a ticket and not hear back for hours (if at all) all whilst your site is down. You're then forced to jump on IRC and listen to gossip about what is happening.

Linode's support is as good as their security. Smoke and mirrors.

The notes are nice, but this seems like a highly relevant detail: "I'm lucky to be friends with Scott VanDenPlas, who was director of dev ops for the Obama reelection tech team"

At least a bit of the knowledge has now been shared via this article.

My company does automatic setup too (https://circleci.com). However, we went in a different direction than Heroku, and allow people to nail down the exact version of their platform they want to use (for example, our users use over 30 different Ruby versions). We haven't needed the same flexibility for DBs and libraries, but I imagine that will come.

I love Heroku: git deployment, the "dyno" abstraction, Procfiles, buildpacks, putting configuration in environment variables (all the 12 Factor App stuff: http://www.12factor.net/).
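For anyone who hasn't seen the "configuration in environment variables" part in practice, a minimal Python sketch. The variable names (`DATABASE_URL`, `DEBUG`, `WEB_CONCURRENCY`) and the `load_config` helper are illustrative choices, not something the 12 Factor document or Heroku mandates.

```python
import os


def load_config(env=os.environ):
    """Read settings from environment variables, 12-factor style.

    The variable names here are illustrative only.
    """
    return {
        "database_url": env["DATABASE_URL"],               # required: KeyError if unset
        "debug": env.get("DEBUG", "0") == "1",             # optional, defaults to off
        "workers": int(env.get("WEB_CONCURRENCY", "2")),   # optional, defaults to 2
    }


# Passing a plain dict instead of os.environ makes this easy to try out.
print(load_config({"DATABASE_URL": "postgres://localhost/app"}))
```

The same code then runs unchanged on a laptop, a dyno, or an EC2 box; only the environment differs.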

Is there an open source implementation of Heroku (close to 100% compatible, not just similar ideas) that runs on your own cloud?

Clone of Heroku? None I'm aware of. A bring-your-own PaaS? Many. I work with Cloudify myself: http://www.cloudifysource.org/

Nice things: It integrates with Chef/Puppet, allows built-in auto-scaling so it's cloud independent... has recipes for many applications and services ready to go: https://github.com/CloudifySource/cloudify-recipes

Drawbacks in my mind are that it's based off of Groovy which may or may not be helpful to some people. The other thing is that it doesn't have a direct method to do source/application deployment so you kind of have to roll your own via 'Custom Commands'+Groovy.

You can also launch many of the popular frameworks, such as MongoDB, Play Framework, etc., online: http://www.cloudifysource.org/cloudifyRecipeCatalog.html

See OpenRuko:


I haven't used it myself, but it seems to be a Heroku clone written in Node.js.

There's also CloudFoundry, which used to be quite different but is now adding support for Heroku-style buildpacks in V2:


I haven't seen a very good reason to go with Heroku over AWS in any relevant scenario. It seems like you're paying more for less.

Just wait until AWS East goes down. Not if, just when. But Heroku is in AWS East, too. I am so happy to have gotten out of AWS.

This doesn't put me off heroku very much. Seems a pretty meager list of complaints.

you can use Cloud 66 to deploy your apps from Heroku right to your own servers with little hassle and no need to do much sys admin work.

well, that escalated quickly

AWS is simply amazing infrastructure automation. Learning how to use it, even if you're not a sysadmin, is going to be a part of your job as an engineer at some point in the future.

How many times has your boss or your organization said "we want xyz", but we can't hire admins and we have no hardware?

Every time someone hits that brick wall, AWS is right there. Although I rarely use the retail side of Amazon.com, I think AWS is a stroke of genius whose full impact won't be felt for a few more years.

Very interesting, thanks!
