
Surviving AWS Failures with a Node.js and MongoDB Stack - djensen47
http://www.kinvey.com/blog/104/surviving-aws-failures-with-a-nodejs-and-mongodb-stack
======
shykes
Here's how I do it:

    
    
      $ pip install dotcloud
      $ echo 'frontend: {"type": "nodejs"}' >> dotcloud.yml
      $ echo 'db: {"type": "mongodb"}' >> dotcloud.yml
      $ dotcloud push $MYAPP
      $ dotcloud scale $MYAPP frontend=3 db=3
    

This will deploy my nodejs app across 3 AZs and set up load-balancing to
them, deploy a Mongo replica set across 3 AZs, set up authentication, and
inject connection strings into the app's environment. It's also way cheaper
than AWS.

The only difference from OP's setup is that the Mongo ports are publicly
accessible. This means authentication is the only thing standing between you
and an attacker (plus maybe the need to find your particular TCP port among a
couple million others in dotCloud's pool).
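
To make that concrete, here's roughly what the app side can look like - a
minimal sketch using the official Node.js mongodb driver. The
DOTCLOUD_DB_MONGODB_URL variable name and the fallback URL are assumptions
for illustration, not dotCloud's documented interface:

    // Minimal sketch (TypeScript): connect to the injected replica set.
    // DOTCLOUD_DB_MONGODB_URL is an assumed variable name; the fallback
    // URL below is purely illustrative.
    import { MongoClient } from "mongodb";

    const url =
      process.env.DOTCLOUD_DB_MONGODB_URL ??
      "mongodb://user:secret@db0:27017,db1:27017,db2:27017/mydb?replicaSet=rs0";

    const client = new MongoClient(url);
    // Credentials are checked on connect; with publicly reachable ports,
    // they really are the only barrier.
    await client.connect();
    const db = client.db(); // database name comes from the connection string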

(disclaimer, I work at dotCloud)

~~~
veesahni
"It's also way cheaper than AWS."

3 AWS Small instances cost under $200 / mo and come with 1.7GB of RAM each.

The dotCloud pricing calculator is coming up with $700 / mo for 3 mongodb
instances with 1.7GB of RAM.

Obviously this isn't an apples-to-apples comparison. But how are dotCloud
instances different from AWS instances?

~~~
shykes
It's cheaper at an equivalent level of best practice (back-of-envelope tally after the list):

* For a clean architecture you want to isolate each Mongo and node process in its own system. So you need 6 instances, not 3.

* You'll need load-balancers in front of these node instances. That costs extra on AWS, and is included on dotCloud.

* Did you include the cost of bandwidth and disk IO in your estimate? Those are extra on AWS, but included on dotCloud.

* Monitoring is extra on AWS. It's included on dotCloud.

* I love to have a sandbox version of my entire stack, with the exact same setup but separate from production. That's an extra 2 instances on AWS (+io +bandwidth +load-balancing +monitoring). It's free on dotCloud, and I can create an unlimited number of sandboxes, which is killer for team development: 1 sandbox per developer!

* We only charge for ram usable by your application and database. AWS charges for _server_ memory - including the overhead of the system and the various daemons you'll need to run.

* For small apps specifically, you can allocate memory in much smaller increments on dotCloud, which means you can start at a lower price-point: the smallest increment is 32MB.
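
Back-of-envelope, the tally looks something like this. Only the ~$200/mo for
3 Small instances comes from this thread; every other line item is an
illustrative placeholder, not a quoted AWS price:

    // Illustrative tally of the line items above (TypeScript).
    const SMALL_INSTANCE = 200 / 3; // ~$67/mo each, from the thread
    const LOAD_BALANCER = 20;       // placeholder: ELB base cost
    const BANDWIDTH_AND_IO = 50;    // placeholder: EBS I/O + transfer
    const MONITORING = 10;          // placeholder: detailed CloudWatch

    // 6 instances (3 node + 3 mongo, each isolated), plus the extras that
    // are bundled on dotCloud but billed separately on AWS:
    const production =
      6 * SMALL_INSTANCE + LOAD_BALANCER + BANDWIDTH_AND_IO + MONITORING;

    // A sandbox copy of the whole stack roughly doubles that on AWS:
    const withSandbox = production * 2;

    console.log({ production, withSandbox }); // { production: 480, withSandbox: 960 }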

I didn't even get into the _real_ value-add of dotCloud: all the work you
won't have to do, including security upgrades, centralized log collection,
waking up at 4am to check on broken EBS volumes, and dealing with AWS support
(which is truly the most horrible support in the world, and we pay them a lot
of money).

Plus, our support team is awesome and might even fix a bug in your own code
if you're lucky :)

~~~
Teef
My recommendation is to validate "best practice" claims, no matter which
hosting solution you use. Measure, measure, measure to make sure not only
that you are getting the claimed benefit but also that the end result meets
your expectations. As an example, in the past I had 7 "instances" (as shykes
points out, make sure they are hosted on separate nodes!), 4 of which were
load-balanced Python web apps. One of the nodes was overloaded, so 1 out of 4
requests was very slow (5-10x). This was a big ajax app, so the initial page
load would hang on the request(s) to that one instance. My point is that
because I had measured, I could see that the node was the problem; now that I
am on dedicated EC2, each node is consistent. Good luck.

~~~
shykes
That's good advice. As the saying goes: "trust, but verify".

Regarding your performance issue - most platforms (including dotCloud)
enforce RAM and CPU separation between nodes, but are vulnerable to IO
contention at some level. This is also true for EC2 if you use EBS: your
standalone instances will almost certainly, at some point, suffer from
degraded and inconsistent performance because another instance is competing
for IOPS [1].

You can avoid this with the new "provisioned IOPS" volumes [2], or by skipping
EBS altogether for stateless components.

[1] http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/

[2] http://aws.amazon.com/about-aws/whats-new/2012/07/31/announcing-provisioned-iops-for-amazon-ebs/
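
For illustration, provisioning such a volume looks roughly like this with
the AWS SDK for JavaScript v3; the region, AZ, size, and IOPS values are
placeholder assumptions, not recommendations:

    // Sketch: create a Provisioned IOPS ("io1") EBS volume.
    import { EC2Client, CreateVolumeCommand } from "@aws-sdk/client-ec2";

    const ec2 = new EC2Client({ region: "us-east-1" });

    const { VolumeId } = await ec2.send(
      new CreateVolumeCommand({
        AvailabilityZone: "us-east-1a",
        Size: 100,         // GiB
        VolumeType: "io1", // provisioned-IOPS volume type
        Iops: 1000,        // reserved IOPS, insulated from noisy neighbors
      })
    );
    console.log(`created volume ${VolumeId}`);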

------
helper
Considering how often EC2 outages are EBS-related, we've moved all our
servers off EBS to ephemeral drives. I'm surprised there aren't more people
advocating this route.

~~~
zorked
I think part of the problem is that EBS issues also cause ELB problems, from
what I read here on HN. I wouldn't know because we only use us-east for
Hadoop.

On the other hand, our Cassandra cluster runs on ephemeral drives and it's way
better than EBS even with the guaranteed IOPS thing. Everyone should
definitely give this option a try.

~~~
helper
Yup. ELB is great except it uses EBS. So part of our migration was to move off
of ELBs.

~~~
shykes
Same here, dotCloud originally used ELB and we eventually moved off, which
brought immediate and huge gains in latency and overall reliability.

------
diminoten
Reddit refuses to move away from their current infrastructure, despite being
held together with little more than string and silly-putty.

According to a dev, they haven't even _talked_ about it. It simply hasn't
ever come up.

So Reddit's gonna keep going down like this. Don't be like Reddit.

------
justinsb
You need to be in multiple regions to tolerate EC2 outages, not just multiple
AZs. Even then, this is only good until AWS's first multi-region failure,
which doesn't seem an impossible event given EC2's recent track record.
Though I can well understand that designing for EC2 region failure is not
worth the cost for most systems.
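
Client-side, "multiple regions" can be as simple as trying one regional
endpoint after another. A minimal sketch, assuming Node 18+'s global fetch
and one hostname per region (all names hypothetical):

    // Try each regional endpoint in turn until one answers.
    const endpoints = [
      "https://api.us-east-1.example.com",
      "https://api.us-west-2.example.com",
      "https://api.eu-west-1.example.com",
    ];

    async function fetchWithRegionFailover(path: string): Promise<Response> {
      let lastError: unknown;
      for (const base of endpoints) {
        try {
          // Time out quickly so a dead region doesn't stall the failover.
          const res = await fetch(base + path, { signal: AbortSignal.timeout(2000) });
          if (res.ok) return res;
          lastError = new Error(`status ${res.status} from ${base}`);
        } catch (err) {
          lastError = err; // region unreachable: try the next one
        }
      }
      throw lastError;
    }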

~~~
ceejayoz
> Even then, this is only good until AWS's first multi-region failure; this
> doesn't seem to be an impossible event given EC2's recent track record.

Doesn't everything in their track record indicate that regions are nicely
partitioned from each other? Even the biggest region failures they've had have
stayed completely isolated to that region.

~~~
justinsb
AZs were supposed to be the unit of isolation; then, when multiple AZs
failed, that shifted to Regions. It seemed like a "blame the victim"
mentality to me.

Given that AWS are running the same software across regions and have the same
people & processes in place, and further that there's software that runs
across regions (e.g. S3), I'd wager it's not long before we have a multi-
region outage.

Finally, some of the multi-AZ problems in the past were compounded because,
as one AZ went down, everyone hammered the other AZs, taking out the APIs at
least. That was back when everyone believed AZs were isolated. Now that
people know that's not the case, those same systems are going to be hammering
across multiple regions.
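
The standard mitigation for that stampede is retrying with exponential
backoff and jitter, so the surviving AZ or region isn't hit by everyone at
once. A minimal sketch:

    // Retry with exponential backoff and full jitter.
    async function retryWithJitter<T>(
      fn: () => Promise<T>,
      maxAttempts = 5,
      baseMs = 100,
      capMs = 10_000,
    ): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          if (attempt + 1 >= maxAttempts) throw err;
          // Sleep a random duration in [0, min(cap, base * 2^attempt)).
          const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
          await new Promise((resolve) => setTimeout(resolve, delay));
        }
      }
    }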

~~~
res0nat0r
Regions are 100% independent of one another, both physically and at the
control-plane level. Code pushes for new features also never go out to
multiple regions on the same day.

~~~
justinsb
Source?

AZs were supposed to be independent; they aren't. Fool me once...

~~~
res0nat0r
I used to work on the EC2 team. The regions are wholly independent of one
another.

~~~
justinsb
I hope you blog more of these practices then. AWS doesn't put this stuff in
writing, which is very convenient for them when something goes wrong, but
makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions
won't go down together strikes me as "the Titanic is unsinkable" hubris. I
hope the AWS team doesn't share your attitude :-)

