My setups usually consist of an nginx instance serving static content and proxying application requests (doing gzip, etc.). The data tier is initially collapsed into the application as described in http://www.underengineering.com/2014/05/22/DIY-NoSql/ This architecture allows very fast iteration while providing enough performance headroom; it can serve 10k simple (CRUD) http requests per second on a single core.
You could do the same thing without the IP swap trick using DNS, it just takes a bit longer to propagate. The real catch is: what happens to data that was added or changed between cloning the server and making the switch? Ideally you'd want to put your site in a read-only mode if possible.
>I think rolling release distros are your friend for this type of all-in-one setup. Small weekly updates. Easy to test too because you only need one test machine.
I think the important thing here is a distro that tests its changes well, and one that doesn't force you into a major upgrade (where you have to change your configs) very often.
Releases like Debian, which expect you to do an in-place major upgrade every two years, are, I think, more difficult to deal with: if you have anything at all custom and the config file format changed, the major upgrade is going to require work and testing for you to move the configs over, even if the developers test perfectly (and nothing is perfect).
I think RHEL/CentOS is best, assuming that the latest RHEL/CentOS supports all the packages in-distro. (If you need to step outside of the distro repos, that kind of defeats the point. Maintaining a package yourself on an ancient distro gets old fast, and most of the smaller third-party repos don't put as much effort into keeping the old package versions patched up.)
That's the thing, sure, you have to format and re-install for a major upgrade, but you have ten years before you have to worry about that.
I've seen quite a few places with over-complex setups due to gross inefficiencies in the software they deployed.
KISS is good advice, even in 2014.
I'll add that security is a concern in that case, too. Getting access to your frontend server means getting access to your whole solution, while on a properly architected solution that wouldn't be true. In any case I don't know the details of your situation so I might be wrong, but I highly doubt it :(
Edit: You just left an answer to this on a different part of this thread. I'll add a reply to your answer here:
>>> MTBF is a factor of the number of parts in your system. A single machine will have substantially less risk of breaking down than a complex setup. In the past (when servers were not this powerful) and the system consisted of 9 (!) servers, 5 web front ends, a load balancer, replicated DB back ends and a logging host we had a lot of problems due to bits & pieces failing.
That's not true. A single machine has a higher chance of breaking than a complex setup... if the complex setup is properly designed. It doesn't matter if you have 5 frontends and replicated DBs if you're going to have one single load balancer. Also, redundancy is not only about hardware, but about processes. If a critical process fails and there are no plans or procedures to avoid disaster, then it doesn't matter if you've got every single piece of hardware at least duplicated.
I mostly agree with the rest of your answer and am happy to see that going simpler suits you. There's no "final solution design" and it depends a lot on your company's particularities and your application design.
What matters is that you stay away from complexity as long as you can't afford to expend your time, energy and funds on it.
And if you have to then go for it.
Staying away from complexity is fine, but only if you are running some toy site.
After all, if your controller goes on the blink your precious RAID could easily die with it.
One machine is OK when downtime is OK too. Such situations do exist, so it is sometimes a viable solution.
If you're serving traffic to 100K+ users then chances are that's not one of those situations. How are you handling redundancy? Or is possible downtime just an accepted risk?
Moving it all to one box was an interesting decision, it has paid off handsomely over time.
Redundancy is a good thing to have, obviously. But it is not simple (nor cheap) to get it right. This machine has redundant power supplies, redundant drives and we back-up multiple times per day. Worst case (a total system failure or a fire in the hosting center) we'd be down for a while but that exact scenario has hit us once before (we were an EV1 customer when their datacenter had a fire) and we came through that quite well.
It all depends on the kind of service you are running what your competitive space looks like and how much money you can throw at the problem.
But for the majority of web apps, especially when funds are critical and you're concentrating on the business side of things rather than the tech you will find that having it all on one box allows you to focus on your immediate problems rather than on how to stay on top of all the complexities running a distributed application brings.
A. The probability that a single component in your system fails increases as the number of components increases.
B. The probability of the entire system failing decreases as the number of components increases.
Where (B) goes wrong is if the system is designed in such a way that components are dependent on each other.
Imagine you have a system containing 4 parts, all of which have to have at least one operating component for the system to remain operational. The components are:
WS = Web Server
DB = Database Server
AS = Application Server (executing long-running tasks)
LB = Load Balancer
Each component has a different probability of failure on a given day, given here:
WS, AS = 0.001
DB = 0.002
LB = 0.000001
If you do this:
10 x WS = (0.001)^10
10 x AS = (0.001)^10
1 x DB = (0.002)^1
2 x LB = (0.000001)^2
Then the probability of failure is roughly 0.002, because if the database fails then the system fails. To increase redundancy you need to increase the number of DB servers too. If you have two DB servers, then the probability drops to (0.002)^2 = 0.000004, 500 times lower.
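The arithmetic above can be sketched in a few lines. This is a toy model using the hypothetical daily failure rates from the example, and it assumes components fail independently:

```python
def tier_failure_prob(p_component, n):
    """Probability that ALL n redundant components in a tier
    fail on a given day (failures assumed independent)."""
    return p_component ** n

def system_failure_prob(tiers):
    """Probability that at least one tier loses every component.
    tiers is a list of (per-component failure probability, count)."""
    p_ok = 1.0
    for p, n in tiers:
        p_ok *= 1.0 - tier_failure_prob(p, n)
    return 1.0 - p_ok

# The setup from the comment: 10 web servers, 10 app servers,
# 1 database, 2 load balancers.
single_db = [(0.001, 10), (0.001, 10), (0.002, 1), (0.000001, 2)]
dual_db   = [(0.001, 10), (0.001, 10), (0.002, 2), (0.000001, 2)]

print(system_failure_prob(single_db))  # ~0.002, dominated by the lone DB
print(system_failure_prob(dual_db))    # ~0.000004, about 500x lower
```

The point falls straight out of the numbers: the system's failure probability is pinned to its least-redundant tier.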
I believe that you really did experience a problem with your setup: and I'll hazard a guess that the root cause is nothing to do with the architecture of your system but everything to do with the exponential increase in SNAFUs caused by the extra complexity.
Hardware failures are rare, people failures are common.
Almost :) It has more to do with the fact that no amount of testing under realistic conditions can model all the potential failure modes; there is no substitute for the sheer variety of ways in which a distributed system can fail. Network cards that still send but don't receive? Check (heartbeat thinks you're doing a-ok). Link between two DCs down, while the DCs themselves are still up and running? Check... and so on.
Doing this right is extremely hard, and even the best of the best still get caught out (witness Amazon and Google outages, and I refuse to believe they don't know their stuff).
Hardware failures are rare, people failures are common, distributed systems are hard.
Being pretty disingenuous here.
A single machine breaking down means an outage; in a properly redundant complex system, a single component breaking down means no outage. Again, if you are running a toy site then sure, go with the single machine.
And redundancy is very easy to get right if you are using something like AWS or even DigitalOcean: a provider-based load balancer + an app tier + a multi-master database like Cassandra.
It's fine to say in most instances you don't need more than one server that you can scale "physically" - but it's unfair to suggest you shouldn't consider the risks involved and the (sometimes very rapid) needs at future scale.
It's a run-of-the-mill HP, total costs including memory were under $10K.
Spending a few hours early on thinking about the following will save you days of headache:
- Your app can eventually talk to a database on a different machine.
- Your web app is either stateless, or has an external session cache.
- You can have a process that is not your web app process do "jobs."
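As a sketch of the "stateless" option: session state can travel with the client in a signed token instead of living in one web server's memory, so any instance can handle any request. This is a toy using only the standard library (the key name and token format are made up); a real app would reach for an established session library or an external cache:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"rotate-me"  # hypothetical signing key; keep it out of source control

def issue_token(session):
    """Serialize session state and append an HMAC so any server
    instance can verify it without shared in-memory state."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_token(token):
    """Return the session dict if the signature checks out, else None."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or malformed token
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token({"user": "alice"})
assert read_token(token) == {"user": "alice"}
assert read_token(token + "tampered") is None
```

The design choice this illustrates: once no request depends on which box served the previous one, adding a second web server later is purely an ops task.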
It probably depends upon the size of the project you have in mind, but if you think you'll have to scale, imho solution 3 is not that much of an extra burden and it forces you to do things better.
The worst thing is to find yourself at the 11th hour, when your app is a raging success, having to plan infrastructure upgrades and potential downtime right at the moment when you would be hurt the most by downtime and upgrade-related problems.
However, I find it odd that it's seen as such a wide-ranging absolute in these comments. These days, I'm not given to thinking of number of 'machines' as a sole or even primary axis of complexity.
Like so many things, it depends.
That's a text-book example of premature optimization.
> A lot of my job is helping people who are stuck on a single box with an increasingly complex rapidly scaling application and no real well thought out plan for how to start splitting that app into clustered components.
That's a fine time to start thinking about how to solve that problem and there are plenty of good, battle tested solutions out there. The first one is cache as much as you can, that will buy you a lot of time to get to a more scale-friendly setup.
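A minimal illustration of the "cache as much as you can" advice: even an in-process memoization layer (here the stdlib's functools.lru_cache standing in for a real cache like memcached or Redis, with invented function names) can absorb most repeated read traffic before it reaches the database:

```python
from functools import lru_cache

CALLS = {"db": 0}  # instrumentation so we can see how often the DB is hit

def fetch_profile_from_db(user_id):
    """Stand-in for an expensive database read."""
    CALLS["db"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}

@lru_cache(maxsize=10_000)
def get_profile(user_id):
    # Hot reads are served from memory; only cache misses hit the DB.
    return fetch_profile_from_db(user_id)

for _ in range(1000):
    get_profile(42)
print(CALLS["db"])  # 1 -- the other 999 requests never touched the DB
```

The caveat, per the surrounding discussion: this buys time, it is not the scale-friendly endgame (cache invalidation on writes is the hard part a real setup has to solve).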
Remember that if they had spent their precious runway time on thinking about scaling instead of customer acquisition that they probably would not have a company at all at this stage, rather than a solvable scaling problem.
[root@one www]# uptime
18:06:48 up 794 days, 1:46, 2 users, load average: 0.32, 0.28, 0.20
Are you comfortable sharing your application's actual availability? I'd love to throw away my preconceptions about redundancy but I find it hard to believe you can reach five 9s with a single machine.
23:42:22 up 1433 days, 22:56, 1 user, load average: 0.20, 0.11, 0.03
[root@twelve ~]# uptime
23:54:38 up 1425 days, 5:38, 5 users, load average: 0.07, 0.15, 0.16
The biggest issues in reliability are hard drives, power supplies, network interfaces and power infrastructure. At least, over the last 16 years of operating a series of websites those have been the main causes of trouble.
Uptime does not say much about service uptime, for instance, if the network uplink on one of those machines is down then the users will experience an outage, having a redundant, multi-data center setup would guard against such a situation.
But that would immediately introduce a whole pile of other problems. For instance, in a multi-master setup it would be quite difficult to recover if the only thing that went down was the peering link between the two data centers, with both locations still accessible from the public internet.
In that situation there is a 50/50 chance that my simple-but-dumb strategy would not even be noticed and a 50/50 chance that we'd be down.
That doesn't mean there are no situations where such a distributed setup would be warranted, but from where I'm sitting the economics just aren't there.
Having regular hardware is no reason by itself why such hardware could not be reliable; regular application stacks perform remarkably well, and the weak points in networking are just as weak when they connect otherwise reliable components across WAN links as when they connect outsiders to your co-location facility.
Once you start scaling up and/or out the whole equation changes and you need to invest a lot more into planning and testing your setup. Most people find out that their distributed setup was a little less distributed than they thought it was when the first outage hits them. This stuff is very hard to get right and most companies do not operate at a scale where this is a requirement, nor do they have a 100% uptime requirement. Of course we'd all like to pretend we're that important but that's a nonsense argument, the only way you're going to get to 100% is by spending an infinity of money. Everything can go down.
Now, when dealing with what you are talking about, typically I've seen that come from developers just making bad decisions in the development process, full stop. E.g. moving the database to a different server should amount to a config change, not a code change.
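To make that point concrete: if the database location is read from configuration at startup (environment variables here, as one common convention; the variable names are illustrative), moving the DB to another machine is a deployment change, not a code change:

```python
import os

def db_settings():
    """Resolve the database location from the environment, with
    localhost defaults for the everything-on-one-box setup."""
    return {
        "host": os.environ.get("DB_HOST", "127.0.0.1"),
        "port": int(os.environ.get("DB_PORT", "5432")),
    }

# Day one: app and DB share a machine, so no env vars are needed.
# Later: export DB_HOST=db1.internal and restart -- no code edits.
print(db_settings())
```

The same idea applies to cache endpoints, job-queue addresses, and anything else that might move off the box later.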
No, you probably mean “waste”.
It should soon be possible to recover automatically, in a different data center, in a matter of minutes. That's tolerable for me.
The "orange box" that represents the private network in each of the examples is taken for granted, but for someone coming from an application development perspective that piece isn't trivial to make. EC2 Security groups make that sort of box incredibly easy to make, but DO doesn't have anything like that.
You can find our most recently published tutorials in our community here: https://www.digitalocean.com/community/?filter=tutorials or catch it in our twitter feed (@digitalocean)
But these apply to your desired documentation:
Let us know in the comments if you have any other suggestions :)
It's nice to see you got a lot out of the article, but this is hardly a complete course on how the web works from the server side. It is more of a quick guide to a number of common server set-ups for mid-sized web sites. If you want to learn more about how web-serving applications work, I suggest you follow one of the how-to guides about setting up a web server of your own and serving up a couple of pages. You won't need any extra hardware for this, all the software is open source and won't cost you a dime. Depending on what kind of operating system you normally use, you could start with any of these:
best of luck!
I actually meant just the hardware aspect of the setup, sorry for the confusion. That said, I'm still super interested in how the actual serving works. The resources you've provided seem to be exactly what I'm looking for. Thanks so much for providing those.
My experience is in HPC where 'serving content' actually means 'sending data to other nodes'. The upside of this is that in a compute cluster, all the nodes are, usually, in the same room and are actually located very close together. There's still a lot of networking involved in getting the nodes to communicate, but it's super interesting to me to see how to scale things on the web where nodes are not necessary even located in the same country! The example of having the DB and application servers on different machines is a good example.
Anyway, sorry for the digression, and thanks again for the links. It'll be bed-time reading for me :)
Regarding scaling: a couple of years ago I ran a database on a single CPU core (because of licensing issues). It stored 50M rows a day and also executed various queries quite quickly. So I seriously doubt that most of us are going to need large clusters.
"1.126 million inserts per second (single insert)"
The website is hosted on 1 droplet; an additional droplet per customer is deployed through the Stripe and DO APIs.
DO lets you save a snapshot and load it onto a droplet. I have a snapshot that is basically a copy of my 'software'. It's a LAMP stack with an init script to load the webapp from a git repo.
Customer logs in at username.mywebapp.com
The beauty of this is that I never have to worry about things breaking or becoming a bottleneck. If one customer outgrows their droplet, they won't affect anyone else's resources. It scales linearly: new customer, new droplet. I don't need to worry about writing crazy deployment scripts, although I use paramiko to ssh into each server when I need to get dirty.
The main website is mostly static content. I could host it even on Amazon S3 but currently using cloudflare.
Updating the product code requires me to restart the droplet instance. However, I test things out on another staging droplet. Once things work on there, I use the DO api to iterate through all the customer droplets and do a restart.
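That restart loop could look something like the sketch below, using only the standard library. The endpoint shape matches DigitalOcean's v2 API (POST to /v2/droplets/{id}/actions with a "reboot" action), but treat the details, and the placeholder ids and token, as assumptions to verify against the API docs:

```python
import json
import urllib.request

API = "https://api.digitalocean.com/v2"

def build_reboot_request(droplet_id, token):
    """Build the DigitalOcean v2 'reboot' action request for one droplet."""
    return urllib.request.Request(
        f"{API}/droplets/{droplet_id}/actions",
        data=json.dumps({"type": "reboot"}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def reboot_all(droplet_ids, token):
    """Roll the restart across every customer droplet, one at a time."""
    for droplet_id in droplet_ids:
        with urllib.request.urlopen(build_reboot_request(droplet_id, token)) as resp:
            print(droplet_id, resp.status)

# Example usage (don't run without a real token and real droplet ids):
# reboot_all([1001, 1002], token="your-do-api-token")
```

Rebooting droplets one at a time, rather than all at once, keeps most customers up at any given moment during the rollout.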
However, this approach won't give you a viral article title like "eight server setups for your app" (replace eight by 2^n where n is the layer count).
For now though, those tasks are still "hard", which means that for many developers DigitalOcean is still hard to use relative to other emerging platforms such as Red Hat's OpenShift or Heroku. I know there are many shops who would love to jump ship from PaaS to a less expensive platform, but they feel the cost of rolling their own zero-downtime clustered deployment infrastructure is not worth the $ savings.
I suspect that if IaaS providers were to dedicate resources towards producing more educational material for developers with the aim of demonstrating how to achieve these deployment objectives on all the popular platforms using modern open source tools then loads of PaaS developers would jump ship.
For example: how can I use ansible to instantiate 5 new droplets and automatically install a load-balancing server on one of them, while setting up the Ruby on Rails platform and ganglia on the remaining ones? How can I run a load-balancing test suite against the newly created cluster, interpret the results, and then tear the whole thing back down again, all with a few keystrokes? How could this same script allow me to add additional nodes, and how does the resulting system allow for the deployment of fresh application code? How can it be improved to handle logging and backup?
I know that it's possible to create a deployment system to answer the above questions in less than a few hundred lines of ansible + Ruby, so I imagine it could be explained in a short series of blog posts, but you would probably need to hire a well-paid dev-ops guru to produce such documentation. I bet if you ask around on HN...
p.s. keep an eye on these:
^ If either of these becomes production-quality software it could be a game changer for DigitalOcean.