An alternative approach would be to have a Juju charm (script) handle the initial deployment of a stock Ubuntu AMI and the customization in one step (or with Puppet/Chef), and then let you add new instances as you scale (though currently not automatically). When you have changes to your service, you update the charm and just `juju upgrade-charm`.
This would consolidate steps 1, 5, and the "ongoing" work into one tool, and you'd get a cloud-agnostic deployment (an OpenStack cloud or bare metal).
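To give a flavor of what that looks like, here's a minimal sketch of a charm's install hook written in Python (the package and port are just examples; a real charm also needs a `metadata.yaml` describing the service):

```python
#!/usr/bin/env python
# hooks/install -- runs once on a fresh Ubuntu unit when you `juju deploy` the charm.
import subprocess

# Install whatever the service needs (nginx is just a placeholder package).
subprocess.check_call(['apt-get', 'update'])
subprocess.check_call(['apt-get', 'install', '-y', 'nginx'])

# Tell Juju which port to open once the admin runs `juju expose`.
subprocess.check_call(['open-port', '80'])
```

When the service changes, you edit the charm and run `juju upgrade-charm`; to scale, `juju add-unit` spins up another instance from the same charm.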
Even a simple question went unanswered the couple of times I posted it. Eventually I gave up and just went back to the Puppet + OpenStack modules for the configuration I was working on at the time.
Disclaimer: I am the primary author.
New project? I just copy the script, change a few variables (names and packages it needs), and I get deployment for free.
It could use variables a bit more, I think, but there were a few bugs with expanding them, so I didn't use them. I'll fix them later on, though.
I'll put this up in a post on my blog, maybe it'll help others as well.
Are people using Ansible to replace or manage fabric scripts? I'd like to figure out some way to limit the spread of one-off fabric scripts (reminds me too much of bash script proliferation) and it'd be great to get everything under one roof.
EDIT: Back online. The site I migrated to AWS is not the blog.
The site I was writing about, soundslice.com, has not seen any blips.
Doesn't sound that simple to me (as a complete sysadmin noob). Somebody should write a book about this.
And thanks for the reference to Digital Ocean. Never heard of them before. Seems great, might try using them :)
For me, it's that restarting an EC2 instance deletes all the local storage. I have had good success just getting a big-ass VPS, running the database locally, and pushing text backups to S3. It is trivial to manage, and in the real world, downtime is more likely to be caused by configuration wonkiness than hardware failures.
You also have to have a huge amount of traffic to overwhelm a 24-core / 96GB RAM server. Why not put off managing the complexity until you really are doing 10M page views per day?
The thing is that for whatever reason, data wasn't being written to the EBS (the virtual hard disks for use with EC2 instances) and was instead being written to the "ephemeral storage", a really big local data store that every EC2 instance has that is basically a `/tmp` directory. If the server restarts, everything in the ephemeral storage is destroyed.
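One cheap way to catch that class of mistake is to assert at startup that the data directory really lives on the EBS mount. A rough sketch (the paths here are assumptions, not from the article):

```python
import os

DATA_DIR = '/var/lib/postgresql'  # wherever your database actually writes
EBS_MOUNT = '/vol'                # wherever the EBS volume is mounted

# Two paths are on the same filesystem iff they share a device number.
if os.stat(DATA_DIR).st_dev != os.stat(EBS_MOUNT).st_dev:
    raise SystemExit('%s is NOT on the EBS volume -- it lives on ephemeral '
                     'storage and will disappear' % DATA_DIR)
print('%s is on the EBS volume' % DATA_DIR)
```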
Local storage on a $5 Digital Ocean plan will do 2000 IOPS, whereas that would cost $200/month with Amazon Provisioned IOPS. I know it is not an apples-to-apples comparison, but it is worth thinking about. Running a database on a local SSD is a good option for many people, and it is not an option that Amazon offers.
> Do your VPSs already have things installed or a GUI?
Nope. I prefer straight-up Arch Linux. ssh in and go from there.
To add some concreteness to the mix: I was installing ejabberd. When it came time to ping the server... no response. I did the exact same steps on my Digital Ocean VPS and everything went fine. I had done whatever commands EC2 expects to open the right ports...
I'm a heavy user of AWS, but, in general, I never feel like I'm on a true VPS. My last experience:
All of a sudden, one of our EC2 instances could not connect to another, causing our HA solution to spin up dozens of instances and eventually crash too. It was clear to me that the destination machine was behind some firewall, so we went to the security group: the machine was supposed to accept any connection, from any port, from any host. The instance itself had no active firewall.
Out of desperation I added the security group to itself. It worked for a few hours, then stopped again; I removed it, it came back to work, and it has kept working ever since (8+ months now).
This is only one of various mysterious events I've seen happening on AWS.
The documentation I was following made no mention of this at all.
A raw EC2 instance is identical to your average VPS offering. You don't have to use all the extra alphabet soup of ELB, SQS, SNS, SES, etc.
> When it came time to ping the server... no response.
What were your security group settings? ICMP (ping, traceroute, etc.) is blocked by default; you have to enable it.
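For reference, opening ICMP up is a one-liner with boto (the group name is a placeholder):

```python
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Allow ICMP echo (ping) from anywhere into the instance's security group.
conn.authorize_security_group(group_name='my-web-sg',
                              ip_protocol='icmp',
                              from_port=-1, to_port=-1,
                              cidr_ip='0.0.0.0/0')
```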
See, you say that, but then you later say
> What were your security group settings?
I dunno, man, I just want a server. Every other VPS provider works, and I can set up my own iptables etc.
I'm not saying EC2 is _bad_, as I'm sure things being extra mega locked down by default is good overall. But I don't think it's fair to say that EC2 is the same as a VPS.
EC2 fits every definition of VPS I've ever seen.
Furthermore, I don't feel very comfy when doing 'sysadmin via Google,' who knows what stuff I'm screwing up?
I had never used Digital Ocean before. But getting going with them was the exact same as my previous Linode, Rackspace Cloud, prgmr.com, and every other VPS provider I have tried. I don't need to install special command line tools, or set up security groups, or generate .pems... I ask for a server, they send me an email with a password, I log in, change it, and set up my ssh prefs. Super easy, using the same stuff I use everywhere else.
Essentially you pre-run a puppet manifest and have it do all the first-run processing and then store THAT as the AMI/instance. This way you have an instance that only needs a few seconds to get itself ready while also integrating configuration management (which really you should be doing these days). Just some suggestions from a sysadmin.
There are 3 general levels you can take with AMIs:
1) Vanilla instance AMI, have puppet/chef install everything for you, then fetch your app code and configure that
2) Use another tool to have your instance built with puppet/chef and then capture that into an AMI. Once it spins up, it just needs to get your app code. The idea here is that your services aren't likely to need updating as fast as your app code (there's a sketch of the spin-up step below).
3) Same as 2, but when your production code is in release/maintenance mode, bake everything into an AMI. When you need to deploy new code, you need to create a new AMI, but you are creating AMIs with scripts so it's no big deal :) All you have to do now is update your CloudFormation.
Obviously if you have a setup with rapidly deployed code changes you'll want to have your puppet manifest grab the latest version before deploying live when spinning up a new instance.
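For level 2, the spin-up step can be as small as a user-data script handed to the new instance. A hedged sketch with boto (the AMI ID, repo path, key name, and service name are all made up):

```python
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# cloud-init runs this as root on first boot; the pre-baked AMI already has
# the services installed, so all the new instance does is grab current code.
user_data = """#!/bin/bash
cd /srv/myapp && git pull origin master
service myapp restart
"""

reservation = conn.run_instances('ami-1234abcd',
                                 key_name='deploy-key',
                                 instance_type='m1.small',
                                 security_groups=['web'],
                                 user_data=user_data)
print(reservation.instances[0].id)
```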
There's a reason many Vagrant boxes I see use Puppet to instantiate, though: it's friggin' simple to get started with. More advanced configurations might end up requiring you to build your own modules and extended manifests, but I expect the configuration detailed in the article would amount to a couple hundred lines for the entire manifest.
I think this also underscores one of the things I like most about Puppet (as opposed to Chef, which is a fine CM system as well): it's easy to get started with but powerful enough to get things done down the road. Honestly, there's never a better time to start working with Puppet than at the very beginning of a project!
If you wait until you've got a more complex configuration, you're fighting both the tool syntax and system setup requirements. Not to mention the clock that's ticking and telling you that you needed to be all up and running yesterday.
Elastic Load Balancer = a software-based load balancer or traffic director.
Auto-scaling rules = if the number of connections > 50, or bandwidth > 100k, or response time > 10ms, start a new instance and automatically update the load balancer so the new instance takes connections.
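To make that concrete: an auto-scaling rule is just a scaling policy plus a CloudWatch alarm that fires it. A rough boto sketch, assuming an autoscaling group named `web-asg` already exists (and using average CPU as the trigger metric instead of connections, since that's the easiest one to wire up; all names and thresholds are made up):

```python
import boto.ec2.autoscale
import boto.ec2.cloudwatch
from boto.ec2.autoscale import ScalingPolicy
from boto.ec2.cloudwatch import MetricAlarm

as_conn = boto.ec2.autoscale.connect_to_region('us-east-1')
cw_conn = boto.ec2.cloudwatch.connect_to_region('us-east-1')

# "Add one instance" policy attached to the existing autoscaling group.
policy = ScalingPolicy(name='scale-up', as_name='web-asg',
                       adjustment_type='ChangeInCapacity',
                       scaling_adjustment=1, cooldown=300)
as_conn.create_scaling_policy(policy)
policy = as_conn.get_all_policies(as_group='web-asg',
                                  policy_names=['scale-up'])[0]

# Fire the policy when average CPU stays above 70% for five minutes.
alarm = MetricAlarm(name='web-high-cpu', namespace='AWS/EC2',
                    metric='CPUUtilization', statistic='Average',
                    comparison='>', threshold=70, period=300,
                    evaluation_periods=1,
                    alarm_actions=[policy.policy_arn],
                    dimensions={'AutoScalingGroupName': 'web-asg'})
cw_conn.create_alarm(alarm)
```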
Check out the code snippets I linked to from that blog post. It's pretty easy, I promise -- the tricky thing is just figuring out the various APIs.
Maybe we should just invent something new.
- Create a new EC2 instance (literally click "Launch instance" and select Ubuntu).
- Log in and install any dependencies (sudo apt-get update && sudo apt-get install <the packages I need>).
- Go back to the EC2 dashboard and click "Create AMI", which just images the server (the same step via the API is sketched after this list).
- I can't attest to the ease of ELB and auto-scaling rules as I haven't used them, but I would assume it's fairly straightforward, and there are a ton of resources on AWS to help you out :-)
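If you ever want to script the "Create AMI" step rather than clicking through the console, it's one call with boto (the instance ID and name here are placeholders):

```python
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

# Same as clicking "Create AMI" in the console: snapshot the configured
# instance into a reusable image you can launch copies from later.
ami_id = conn.create_image('i-0abc1234', 'myapp-base-20130401',
                           description='Ubuntu + the packages my app needs')
print(ami_id)
```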
That's why I don't buy books from Packt unless I can find good reviews of it written by members of the community around the software the book is about.
Quite configurable via scripts too (install additional packages, etc). Under the hood it's basically the same - Elastic Load Balancer and EC2 instances.
There's nothing worse than wondering if deploying your stupid one-character typo fix will hang or leave the app in an incomplete state.
I myself am working on setting up SaltStack for deployments. I like that all of it is in Python, and after a couple of days I've begun to make progress.
I am migrating things slowly: first cron jobs, then internal tools, and at some point, our website. The only thing that really needs "scale" is our frontend, and I don't yet know how I'll manage that, but you know, I guess it is time to learn!
For example, we're using Salt to manage our development, testing, staging and production environments. We couldn't do this with AMIs, since the environments are quite different.
It makes updating quite easy, as you don't need to create a new AMI; just push your update to your release branch, then run state.highstate from your Master.
If your configuration changes, you update the master and then you can push that change across all (or just one) of your environments.
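For anyone curious what that push looks like: it's just `salt 'web*' state.highstate` from the shell on the master, or the equivalent through Salt's Python API. A minimal sketch (the target pattern is an assumption):

```python
import salt.client

# Run on the Salt master: re-apply the full configured state to the web minions.
local = salt.client.LocalClient()
result = local.cmd('web*', 'state.highstate')

for minion, states in result.items():
    print(minion, 'applied', len(states), 'states')
```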
Dreams of an identical staging environment were somewhat shattered when the "free tier" plans went away. Plus, sometimes a new version will be deployed to 2 servers, and the 3rd one will be hanging on an update. These kinds of things are difficult to plan for and test, since intentionally messing up a deployment is rather difficult with PaaS.
Could you elaborate on why SaltStack is better than custom AMIs? Also, if you have any link explaining how to deploy to EC2 using SaltStack, that would be awesome.
Disadvantages of AMIs are that they are large files (which can be unwieldy) and are not in a format that can be immediately used in a Vagrant instance or elsewhere, so you need to convert between AMI and .vbox or .box formats to use AMIs with your local Vagrant instance.
With saltstack, you can deploy and configure machines from base Ubuntu or RHEL distros by using salty-vagrant for vagrant instances and salt-cloud for AWS / rackspace / openstack / vps machines. Once machines are deployed and connected to your salt-master, you can use your yaml config files to bring all the machines to the desired roles (api machine, load balancer, web front-end, etc.). This is more flexible than using AMIs, since you can roll out a minor change quickly to all servers, including production servers, without needing to wait to build a new AMI.
There are hybrid approaches also (some puppet/chef users do this) where for production machines you can have saltstack deploy to a machine that then gets built into a new AMI automatically. And then this new AMI gets updated into the autoscaling groups and is deployed to replace the current production servers. Sounds complicated... and it is, but at the cost of extra complexity it does give you the best of both worlds.
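The "new AMI gets updated into the autoscaling groups" piece is scriptable too. A rough boto sketch, assuming an existing group named `web-asg` (all identifiers here are made up):

```python
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# Register a launch configuration that points at the freshly baked AMI...
lc = LaunchConfiguration(name='web-lc-v42', image_id='ami-1234abcd',
                         instance_type='m1.small', key_name='deploy-key',
                         security_groups=['web'])
conn.create_launch_configuration(lc)

# ...then point the existing autoscaling group at it. Instances launched from
# now on use the new AMI; the old ones can be cycled out gradually.
group = conn.get_all_groups(names=['web-asg'])[0]
group.launch_config_name = 'web-lc-v42'
group.update()
```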
Apart from the PostgreSQL hosting, I'm not really happy with Heroku. Add to that the random deploy breakage I've experienced mid-launch (not to mention it's slow as hell), and I'm going to move back to my own servers.
Not saying this is a great policy, just explaining in case you weren't sure. When you go to 2 dynos they keep things up for you presumably because now you're a paying customer.
But cron? That's just ridiculous.
CloudFormation is also worth mentioning. That might make the pain of Chef/Puppet more of a worthwhile investment!
I have some bad news for you: nothing I have ever used worked 100% of the time. Doubly so on AWS. Just off the top of my head, in the last week we have seen 5% of our AWS instances act so badly that we had to recycle them, and that is just the easy-to-diagnose problems. Don't get me started on the IOPS marketing BS that Amazon sells.
It's just the reality of a lot of moving parts and complex systems. From the sound of it, the OP had very little _actual_ downtime, and had to make some end-runs a couple times. Shrug, just life in a high-scale world IMHO.
The goal should not be to bounce around providers until you find the Garden of Eden. The real goal should be to accept failure and build to tolerate it. Now, maybe it is easier to build fault tolerance if you are closer to the metal... BUT I would bet that Heroku has much better experience and tooling to detect and resist faults than rolling your own.
That being said, making some easy architecture choices early on can have an enormous effect on your sanity. It is probably worth it to bring someone in to hint you in the right direction after you have a prototype to show and have already started to make choices. This is basically where I helped Adrian.
If your app is struggling under success, you are ideally in a better position financially than most startups are at the beginning. This is when you bring someone in for longer term engagements, or potentially when you hire someone to work on this full time.
I think you are hitting on something that is really a need in the market right now. When you don't need a consultant, and you need 30-90 minutes of "how do I do this?" Q&A, it is hard to figure out who to talk to.
From my perspective, it is hard to find those engagements. I spend a sad majority of my time on the sales side of things. Recently, I spoke to some folks at 10xmanagement, and Adrian pointed me to anyfu, both of which are approaching this problem in slightly different manners.
Lastly, you don't want the soup to nuts guy. This is your app, and if you aren't deeply involved with what is going on with it, you are putting yourself at a huge disadvantage post consulting engagement.
If there was a place I could go to get some cash for my time answering these questions, I'd certainly be more inclined to spend the time to do so thoroughly.
Mentoring contractor start-up perhaps?
> I changed the app to use cookie-based sessions, so that session data is stored in signed cookies rather than in memcache. This way, the web app servers don't need to share any state (other than the database). Plus it's faster for end users because the app doesn't have to hit memcache for session data.
The switch to using cookies for storing session data instead of memcache has tradeoffs. Sure, you no longer need to ask memcache for the session data. But you are also shipping a significantly larger cookie back and forth on every request.
If you're storing a lot of data in your session, this could actually slow things down in the long run. 
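If the app is Django (which the "signed cookies" wording suggests, though I'm guessing), the switch itself is a one-line setting, and the tradeoff is exactly as described: keep what goes into `request.session` tiny, because it rides along in every request and response.

```python
# settings.py
# Store session data in a signed (not encrypted) cookie instead of memcache.
SESSION_ENGINE = 'django.contrib.sessions.backends.signed_cookies'
```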
I know you said you're a two man shop, but in this case it may make more sense to leverage other IaaS DB services instead of RDS.
Could you expand a bit maybe?
"You must create a DB snapshot before you can restore a DB instance from one. When you restore the DB instance, you provide the name of the DB snapshot to restore from, and then provide a name for the new DB instance that is created from the restore. You cannot restore from a DB snapshot to an existing DB instance; a new DB instance is created when you restore."
In my opinion, you don't get enough control over the environment for what you're paying on a per-hour basis. Automated backups? Great, but they aren't that difficult to begin with. Failover? Sure, within the same AZ, when you need to be doing it between datacenters. Scaling? I will grant that it scales up and down fast, automagically.
We ended up going with Heroku's hosted Postgres solution.
It costs exactly the same as we were spending with two High-CPU instances with Provisioned IOPS volumes.
Now we get a fully managed database, at the same price, with all the features of Postgres.
We still host our full application on AWS; the only difference is that we have a managed database. While Postgres makes it easy to set up replication and whatnot, AWS hardware just sucks, and it takes time to tune it properly.
The only thing I would complain about is the lack of a proxy or load balancer in front of their Multi-AZ setup. With any failover you are forced to have a downtime of 3-5 minutes while DNS propagates.
no experience or affiliation, though.
I've not used Heroku's Postgres stuff, so no idea what it offers beyond that
1. Change your default heroku error pages. From this: http://s3.amazonaws.com/heroku_pages/error.html and this: http://s3.amazonaws.com/heroku_pages/maintenance.html to something else (in the app settings tab).
2. Keep a staging instance up and running with a cloned database so that you can avoid random heroku-created errors in production.
I don't have any indication that soundslice.com (which is the site I'm hosting on AWS) was returning any 500s, but if it was, please let me know.
Also, how many dynos did it take to handle the load on soundslice, when it was on the front page of reddit?
As for the number of dynos when Soundslice was on the Reddit homepage, I think I was using 10 at that point, but, again, it was a totally non-scientific thing. 10 did the job, but fewer dynos might have done just as well.
That being said, as someone who's been in the Unix world for a while, figuring out which AWS services to use is not obvious at all. It took me some time to figure out the alphabet soup of services, and I was familiar with it from work. I ended up using CloudFormation, which builds you a server (LAMP or other) and an optional database/load-balancer configuration. It was that or selecting a LAMP AMI (Amazon Machine Image). They have a lot of documentation, but it's hard to get an overview of what everything is (S3, elastic storage...). Plus, configuring web servers and database servers for best performance can be non-trivial.
I get better speeds from our organization's cheap "shared hosting" during low loads. I was pretty sure the one week of very high loads would have crushed the shared host, so AWS was perfect.
One thing I'm curious about re: the OP's process: he says he is using Fabric for deploys. Does that mean every time he deploys new code he has to snapshot a new AMI? In that case, why use Fabric at all? I'd be worried about auto-deploying AMIs with outdated code.
Since EC2 instances are "disposable," one approach is to never "update" an instance, but instead you release new code by simply launching fresh instances with the latest code, then destroying the old ones.
If you take the route you just mentioned, I highly recommend using Netflix's Asgard (https://github.com/Netflix/asgard) and check out their recently released Aminator (https://github.com/Netflix/aminator). Asgard specifically makes AMI based deploys outrageously easy.
Now you have an easy way to build your AMI from the ground up; you only need to worry about core OS package enhancements on the base AMI, and you can use your created AMI with auto-scaling behind an ELB.
For VERY rapid changes - direct git access is optional (even on startup for current code pulls), but I tend to side with the Netflix guys and focus on an AMI as the minimum unit of deployment.
Check out Asgard to make this easier (senseless self-plug: I built an Asgard AMI this weekend - http://imperialwicket.com/netflix-asgard-ubuntu-1204-lts-ami...).
On the other hand, if you are using auto-scaling, Chef/Puppet/CFEngine may not be able to finish their work in time if they have to do a lot of work. You have to strike the right balance between what needs to be in the AMI (provisioning) and automation (post-install).
I'm sure it's gotten better since then (this was circa late 2011), but I'd think long and hard about using App Engine for anything nontrivial.
That being said, we still use GAE as a platform for fast prototypes, as a cron service to keep our Heroku instances up + firing up an EC2 instance every night to run a script for 5 minutes. So we only use GAE when there's minimal risk of being locked in.
It still has very limited quotas, though: Maximum of 8 instances or 8 total CPUs. You can request a quota increase through a web form, but I did this a week ago and haven't heard back yet.
Also, it's still in a "limited preview" period, which means your entire zone can go down for weeks for scheduled maintenance, which is pretty inconvenient:
edit: though you'd have to factor in the cost of bugging Scott when you're not his friend.
I know almost nothing from a unix sysadmin perspective, but would invest the time if it is feasible.
I don't know the details of your app, but $170/month sounds high for that server arrangement. If you set up a few auto-scaling Amazon micro instances, it would give you more capacity and be cheaper. Of course, the tradeoff is that you'd need to either learn how to do it or hire somebody. :-/ Good luck!
$70 for each app (it's actually ~$140.00 total), with approximately 100-200 business customers using the apps daily.
And of course, bad billing practices: they don't stop charging you for add-ons you have stopped/removed, and they send invoices a month late, so by the time you notice, another round of incorrect billing has occurred. This has resulted in me paying $350 and $251 in the past 2 months.
Exact quote from a Heroku customer support email, after an overcharge of ~$300 in the past 2 months:
"I'm sorry that the delay in receiving your invoices caused the charges to continue for longer than you would have liked, however please keep in mind that we offer your Current Usage details[ on your account page]"
Yeah, not happy with Heroku at all. Even though I could better use my time generating additional sales for the apps, I'm so pissed I plan to burn the time to move and cancel ASAP...
I think this is going to be an emerging problem for many startups these days. Patterns of completely unstable products because "ops is hard LOL" followed by a desperate attempt to hire actual sysadmins for their company and excuse after excuse of how "awesome we are for growing so quickly!".
I have rolled my own which, being totally biased, I do believe is more full featured than most of those mentioned in this thread, as well as more redeployable/generic. (100% pluggable platforms, built to integrate tightly with build services and developer environments, etc.). Limited internal testing at this stage (1000s of VMs instantiated, across three cloud providers (one internal), only two distinct OS deployment environments targeted thus far).
However, it'd be great to have an overview of this area... it's certainly exciting. I am beginning to feel like the DevOps baton has been passed from large companies with massive infrastructure requirements back to the community over the last year, as more developers become aware of the pitfalls of single-provider cloud solutions and the rough edges and missing features of existing multi-cloud deployment APIs. We can probably expect great things in this area over the next 12 months.
Baking a new AMI only happens when there are new underlying dependencies, like a new package from apt-get or a new Python module from pip. In other words, it's rare.
Hope that helps!
I could also change the AMI's "user data" to run pip when the instance loads, but I'm not 100% confident it will always run without errors. I feel better doing it manually and baking it into the AMI. Personal taste.
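For the everyday code-only deploys, the Fabric side of this can stay tiny. A sketch of the sort of task I'd expect (host names and paths are invented; the actual fabfile surely differs):

```python
# fabfile.py
from fabric.api import cd, env, run, sudo

env.hosts = ['web1.example.com', 'web2.example.com']
env.user = 'deploy'

def deploy():
    """Push the latest code to the running instances; no new AMI needed."""
    with cd('/srv/myapp'):
        run('git pull origin master')
        sudo('service myapp restart')
```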
What worries me the most with AWS is that there's virtually no support. I guess you can pay and get decent support, but on the low end of the scale, it's pretty thin or non-existent. Now, I don't hear of so many issues with AWS that warrant contacting support, and I don't face any issues myself, but just the thought of something going strange and having nobody to talk to makes me nervous.
In contrast, Linode does not have so many cool features, but their support is there and very responsive.
According to https://aws.amazon.com/premiumsupport/, though, developer support is only available during local business hours...
I still wish this could have been factored into your hourly AWS costs rather than as a monthly fee. Say, pay extra 10% for every EC2 instance-hour or something and get it included (without the $100 minimum as in the business support that is).
> In contrast, Linode ... but their support is there and very responsive.
But watch what happens when their Fremont DC goes down. You will submit a ticket and not hear back for hours (if at all) all whilst your site is down. You're then forced to jump on IRC and listen to gossip about what is happening.
Linode's support is as good as their security. Smoke and mirrors.
Is there an open source implementation of Heroku (close to 100% compatible, not just similar ideas) that runs on your own cloud?
Nice things: it integrates with Chef/Puppet, allows built-in auto-scaling so it's cloud-independent, and has recipes for many applications and services ready to go: https://github.com/CloudifySource/cloudify-recipes
Drawbacks in my mind are that it's based on Groovy, which may or may not be helpful to some people. The other thing is that it doesn't have a direct method to do source/application deployment, so you kind of have to roll your own via 'Custom Commands' + Groovy.
I haven't used it myself, but it seems to be a Heroku clone written in Node.js.
There's also CloudFoundry, which used to be quite different but is now adding support for Heroku-style buildpacks in V2:
How many times has your boss or your organization said "we want XYZ", but we can't hire admins and we have no hardware?
Every time someone hits that brick wall, AWS is right there. Although I rarely use the retail side of Amazon.com, I think AWS is a stroke of genius whose full impact won't be felt for a few more years.