So, I'd think that rather than place something in Ohio and New York, both of which can be affected by the same weather pattern within a day or two, different coasts offer better protection.
I lived in KC for some years and we'd occasionally get remnants of gulf hurricanes, but they'd just be severe storm systems that would pass through. We rarely lost power, but if we did, it was never for protracted time periods, and generators in data centers should easily be able to deal with power loss from severe storms.
Tornadoes are pretty much the least threatening natural disaster out there, as their area of effect is usually small and their duration is usually short, so I think "tornado alley" is actually a fine place for a data center meteorologically.
I think the idea of the second data center is that it is far away from the first one.
I think you got confused about the "NE Pacific Ocean" thing under the Hurricane column. The NE Pacific Ocean is the part of the Pacific Ocean that Oregon is on.
It would technically be an extratropical cyclone, but it would likely have started as a tropical hurricane off the coast of Mexico.
Thank you for this discussion!
When you offer hosted services (not cheap, mind you), you take on responsibilities. Among them are disaster recovery scenarios. We have ours, and I expect any company offering a cloud-hosted solution to have theirs.
It cost a small fortune, but there is nothing more expensive than not being there for your paying customers.
(Sorry for the conversation derail.)
So what's the alternative? Keep it hosted-only. The downside is that, yes, outages like this happen. But I would argue that, on the whole, the overall Trello downtime has been far less than the cumulative downtime of people trying to run it themselves. Moreover, this was an extremely unusual storm. Buoys reported waves 5 times higher than anything on record. My guess is, the cost-benefit analysis is still solidly on the side of having a hosted product in an easily accessible data center.
They brought all the email server management in-house, probably cost a lot of money, took a while, and when they finally turned it on... half the company was without reliable email for about a month.
If you don't have a good answer, it's simply grass-is-greener thinking.
And I do not want to host FogBugz myself; in fact, I'm paying precisely for the comfort of not having to plan for FogBugz failures myself.
Incidentally, we run a SaaS company. Our disaster recovery worst-case scenario means recreating our services from scratch in any Amazon AWS datacenter on Earth in less than 4 hours. Yes, we have an easier job, because we do not store a lot of transactional user data. But our service is also way, way cheaper than FogBugz.
Preferably one not located in a hurricane path, flood plain, earthquake zone, fire area, landslide track, or subject to political or economic instability.
Or highly redundant with tested failover paths.
All of which costs money, and still doesn't assure reliability. Look at last week's AWS EBS outage and root cause analysis: the service was brought down by its own monitoring (exacerbating an existing memory bug).
It's not easy. Sandy is the most extreme hurricane to hit NYC in a century (though the second in as many years). NYC is a sufficiently important commerce and financial hub to have excellent services and recovery capabilities, but it still isn't immune to perturbations.
The "reasonable cost" is a good point of course. Also relative to what they are able to charge and how that would change their business model. One type of customer might be willing to pay for a more robust service, others wouldn't. Take any garden variety website hosting service where the charge is under $10 per month and try to operate it giving better uptime and charging, say, $20 per month and see your customer base vanish. People expect it to work 24x7 but aren't willing to pay for it. They would rather take their chances.
That said, one thing they might be able to do at a "reasonable" cost is simply spread their customer base over multiple data centers, so the failure of one would only bring down a smaller percentage of their customers.
Frankly this is just making us appreciate Fogbugz all the more since tracking our time without it will be a real PITA.
Trello is fantastic, but now I'm worried that I'm too dependent on it and I should arrange an offline alternative.
Take my money, Fog Creek.
If this is the kind of problem that excites you, we're hiring :-)
Sadly the better implementations I've used myself (or have heard about) are not publicly available. The closest thing in semi-widespread use seems to be Zookeeper, but it's more like Oracle when you really wanted SQLite (standalone service vs. library).
There is also a print option, but that doesn't seem to print the back of the cards so it's pretty useless, unless I'm mistaken.
In my experience, what self-hosted bug-tracking systems might gain in redundancy (a few hours every time a major storm hits) they lose to context switches (which hit team morale, not merely time.)
"Hey, Bob! Can you reset the MySQL server again on the bug-tracking maze set up by Fred over 4 years ago?"
Bob obligingly does so, and then, having been interrupted in the middle of a hard problem, loses his place and can't get back up to speed by the end of day. So while you may gain four hours of butt-in-seat time, you're losing four hours of real productivity on a far more frequent basis.
I'd rather pay Fog Creek to worry about that stuff, and actually ship products.
I do hope at a better time, Fogbugz can consider redundancy/failover that Stack Overflow enjoyed.
That sounds like some really ... unfortunate planning of the positioning of these machines, made me think of the Daiichi incident when backup assets failed to come online because parts of the backup infrastructure were destroyed. Not as serious, of course. Fog Creek's hosting isn't a nuclear power plant. :)
Makes me glad I don't work with things that are as hard to test in the real world as these kinds of backup solutions must be. I hope they manage the refuelling, somehow.
The first is that the fluid will draw a vacuum at the top of the pump. A vacuum on Earth can only give you about 14.7 psi of lift - atmospheric pressure. For water, that corresponds to a 33.9' column - beyond that you could draw a perfect vacuum at the top of the pipe and the water still wouldn't rise any higher. Since diesel is less dense than water you might go a little higher, but probably not much.
The second problem is that as you create a low-pressure zone at the top of the pipe, the fluid will boil (cavitate). This gasified liquid will fill the space and drop your pressure. Oils like diesel should have a significantly higher boiling point (i.e. a lower vapor pressure) than water, so they cavitate less readily, but it's still a limit.
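For a rough sanity check on those numbers (a sketch assuming standard atmospheric pressure of ~101.3 kPa and a typical diesel density of roughly 830 kg/m^3):

    ATM_PA = 101325.0   # standard atmospheric pressure, Pa
    G = 9.81            # gravitational acceleration, m/s^2

    def max_lift_m(density_kg_m3):
        # A perfect vacuum at the top of the pipe can only support a column
        # whose weight balances atmospheric pressure pushing up from below.
        return ATM_PA / (density_kg_m3 * G)

    for fluid, rho in [("water", 1000.0), ("diesel", 830.0)]:
        meters = max_lift_m(rho)
        print(f"{fluid}: {meters:.1f} m ({meters * 3.281:.1f} ft)")
    # water:  10.3 m (33.9 ft)
    # diesel: 12.4 m (40.8 ft) -- before cavitation eats into it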
If you keep sufficient water (up to the weight of the fuel) at the location where it's needed, the entire system wouldn't need a pump at all when called into action, just a tap - gravity would be sufficient. This presumes the fuel is kept at basement level for safety purposes, of course; otherwise you could just keep the fuel where you're storing the water. You can get by with less water and active pumping, since hydraulics are easy to turn into gearing (force multiplication) effects.
At least in this scenario people can carry fuel up stairs to fill the generators.
move the servers? On what planet is server infrastructure movable on a whim?
EDIT: Just because it's feasible doesn't mean we will actually do it; just wanted to clarify that we weren't firing from the hip.
Having power and/or rack space is not the same as having servers, switches, etc. anyway.
Hopefully it will not be down that long. We'll let you know more when we know more.
FogBugz is more of a problem. Some days the "Resolve" button is my only source of job satisfaction.
I'm not trying to second-guess your ops team, but the whole point of having off-site backups is to facilitate your RTO plan in case you lose your primary DC with no warning. I guess I'd be surprised if you don't have a < 24 hour RTO plan in place. With how quickly you can get VMs and even dedicated servers provisioned by many hosting providers (minutes to a couple hours), the idea of physically moving servers off-site into new racks, with new networking, etc... seems kinda nutty...
Moving the hard drives could be an option, and I believe it is sometimes done, but it assumes there are empty boxes on the other end waiting to receive the drives in a similar configuration to the one they came from. There are also separate issues depending on how many drives they are dealing with, and what redundancy is involved. If it's very few, then you might as well move the server outright. If it's very many, then there's extra human overhead (and room for error) in keeping the drives together.
On the other hand, with a 1000 Mbit/s uplink that they were allowed to saturate, they'd still only be able to copy out roughly 1 terabyte every 3 hours.
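Rough arithmetic behind that figure (a sketch; at full line rate it's closer to 2.2 hours, and real-world protocol overhead pushes it toward 3):

    link_bps = 1e9              # 1 Gbit/s uplink
    payload_bits = 1e12 * 8     # 1 TB expressed in bits

    ideal_hours = payload_bits / link_bps / 3600
    print(round(ideal_hours, 1))          # 2.2 hours at full saturation
    print(round(ideal_hours / 0.75, 1))   # ~3.0 hours at 75% effective throughput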
Essential quote (literally from Networking 101): "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."
100 TB of high speed RAID-10 would fit in maybe 30 shoe boxes.
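Back-of-the-envelope on that claim (a sketch assuming 2012-era 3 TB drives and two to three 3.5" drives per box):

    usable_tb = 100
    raw_tb = usable_tb * 2      # RAID-10 mirrors everything, so double the raw capacity
    drives = raw_tb / 3         # ~67 drives at 3 TB each
    boxes = drives / 2.5        # two to three 3.5" drives per shoe box
    print(round(drives), round(boxes))   # 67 drives, ~27 boxes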
Then, after things are moved, you have to deal with drives that have jiggled loose, components that outright fail to work again, or things that get accidentally broken in transit.
Let's just declare an emergency federal holiday until Nov 2 so everybody can recover without dangerous heroic measures.
We seem to be arguing separate points -- I'm saying that it's not unreasonable that a small company could be moved fairly easily. Possibly as easily as unplugging a blade enclosure and throwing it in a station wagon. There are loads of small businesses that can run on an amount of hardware that can be easily transported. (When I used to gig on electric bass, my amp and other rack gear was in a portable 8U rack and that was more "portable" than the 100 pound speaker cabinet.)
I can't tell if your point is that it's unreasonable for all companies, which is wrong, or that it's unreasonable for some companies, which is obvious.
1) Like any prepared person, I got all my data out of the east coast before this whole hurricane thing. If someone from Fog Creek hooked me up with some emergency licenses while they got their stuff sorted out, we'd be fine. Actually, this would be a good time to switch to self-hosted.
2) While I was doing our hurricane prep, I ran into this blog post from Joel:
> Copies of the database backups are maintained in both cities, and each city serves as a warm backup for the other. If the New York data center goes completely south, we’ll wait a while to make sure it’s not coming back up, and then we’ll start changing the DNS records and start bringing up our customers on the warm backup in Los Angeles. It’s not an instantaneous failover, since customers will have to wait for two things: we’ll have to decide that a data center is really gone, not just temporarily offline, and they’ll have to wait up to 15 minutes for the DNS changes to propagate. Still, this is for the once-in-a-lifetime case of an entire data center blowing up
Obviously this was written in 2007, but they claim to be geographically redundant and have geographic backups that are "never more than 15 minutes behind". Presumably things haven't deteriorated since then.
Not sure I see the need for a propagation delay if customers can be pointed to the new site by simply using domainbackup.com instead of domain.com - in other words, completely separate DNS as well as a completely different domain (even through a completely separate registrar) pointing to a site hosted elsewhere. Customers could know the alternate address in advance, of course.
What happened to that? Was it turned off? How long did it last?
I also wonder how long FogCreek will still maintain the for-your-server version of FogBugz? Will it still be available next year?
Sorry for what you are going through, obviously. One thing I have found in disaster planning, though, is that it pays to be a pessimist. While worrying certainly doesn't help once you have a problem to solve, doing so in advance helps you anticipate things that you need to take into consideration when planning.
Still, carrying 200 lb barrels up 17 flights of seawater/diesel-slicked stairs sounds ...unpleasant.
fogcreekstatus.typepad.com and @fogcreekstatus on twitter will continue to have updates.
edit: All Fog Creek services have been shut down ahead of power failure. We'll update the status blog as we know more from our DC.
Edit: here are the emails I've received from Internap regarding LGA11 https://gist.github.com/3980482
That bit made me chuckle.
Try not to let your fingers type cheques your datacenters can't cash...!
Alternatively, you could design some type of system that would allow you to fail over to a geographically redundant datacenter. Joel claimed in 2007 that they had such a system, and touted it as a selling point of the reliability of the hosted service. What has happened to it is probably only something that a Fog Creek engineer can tell you.
(The hurricane was also predicted to weaken to a tropical storm before landfall at the time I wrote that.)
I think using a general catch-all, rather than a narrowly-defined technical term that didn't/wouldn't universally apply, was actually a prudent and defensible choice, given their goal of collecting all the concerns of all the stages of the storm under one umbrella.
Even if their motivation was just stupid news branding/sensationalism.
Even now, you're concerned with the hurricane classification and missed the fact that barometric pressure, tide timing and bathymetry of the New York Harbor/Long Island Sound were the currently predicted causes of flooding, not simply windspeed.
As I responded to your comment, "hurricane" or "tropical storm" classifications were not appropriately descriptive, as Sandy was predicted to (and did) merge with another system to morph from a warm core tropical style system to a cold core nor'easter system. The area of the storm was particularly large, which was another reason for the "super" attribution.
It would have been much better to say something like "We have put all reasonable preparations & precautions in place (see above), and we feel confident they will deal with most things the storm will throw at us. However please be aware that we have no fail-over datacenter available so please plan accordingly."
When you get this statement from an outfit with the pedigree and experience of Fog Creek, it's as close to a guarantee as you're going to find. No misreading necessary.
At the end of the day, they could not be reasonably sure, there was a non-trivial risk that they should have been (and almost certainly were) aware of (this isn't the first time bad weather and datacenters have mixed), and they didn't communicate that to their customers.
Seriously? Were they not living in New York in 2003?
I'm guessing everyone is basing their experience on last year's "hurricane", which was not nearly as bad as this one.
Humans have a hard time reasoning about events on decadal time scales.
This storm had days, if not a week, of heads-up. While I agree "no one expected Manhattan to lose power" (another comment), as someone who heads up development of a SaaS product, I would have spent most of that week planning for worst cases and recovery. I constantly think about worst-case scenarios and how long they'll take to recover from, even without huge storms bearing down on the data center.
So, I'm disappointed. I really love the Fogbugz and Trello products. Now I'm in a position where I have to question whether we should depend on them.
This was a monster of a storm, with unprecedented water levels. Buoys around New York reported waves 5 times higher than anything on record.
So consider it this way: would you rather the services be significantly more expensive (remember, doubling the hardware is the cheap part of it) or have the possibility of a few hours of downtime in a once in 100 years event?
"Consider this the "Everything is Perfectly Fine Alarm."
Having run a few HA datacenters, I don't think that level of confidence is ever warranted.
Our servers are in a state far away from hurricanes, but in a state with many other natural disasters, including tornadoes, so it's hard to say if it's a good trade or not. Interesting question: why aren't there more DCs in Utah, Wyoming, Idaho, or New Mexico? And is physical location a huge determinant in where you colo your servers?
At the time, physical location wasn't a big deal, but as the company grew and the data center overhead grew with it, it became cheaper to have the core data centers closer to our operations, where our staff could be utilized. Georedundancy ultimately ended up being used for DR and minimum required service availability during major issues.
There are a few of them out here in Utah that I know of, but none at the scale that they really could be. It would make a lot of sense to put some out here, I would think.
I only wish Trello hadn't tried to reload on its own, so I could still see the screen before the shutdown. Now all I have is a blank page :(
The details are all here: http://status.fogcreek.com/2012/10/fog-creek-services-update...
Thanks everyone for your patience!
Storm #Sandy highlights value of #cloud storage http://www.peer1hosting.co.uk/industry-news/us-storm-highlig...
The geographic load balancing side is basically a solved problem (although you don't want to use only DNS-based load balancing like Route53 in most cases), but the hard part is wide area replication of databases for hot failover.
It's pretty easy to do failover if you'll accept a 5-10 minute outage, though.
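As a sketch of what the simple DNS-based flavor might look like - the hostnames, health endpoint, and failure threshold here are hypothetical, and the DNS update is a placeholder for whatever your provider's API actually is:

    import time
    import urllib.request

    PRIMARY = "https://primary.example.com/healthz"   # hypothetical health-check endpoint
    FAILOVER_IP = "203.0.113.10"                       # warm standby in the other DC
    TTL_SECONDS = 60                                   # low TTL so clients re-resolve quickly

    def primary_healthy(timeout=5):
        try:
            with urllib.request.urlopen(PRIMARY, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def update_dns_record(name, ip, ttl):
        # Placeholder: call your DNS provider's API here (Route53, NS1, etc.).
        print(f"pointing {name} -> {ip} (ttl={ttl}s)")

    # Require several consecutive failures before failing over, so a blip
    # doesn't make the "is the datacenter really gone?" decision for us.
    failures = 0
    while True:
        failures = 0 if primary_healthy() else failures + 1
        if failures >= 5:
            update_dns_record("app.example.com", FAILOVER_IP, TTL_SECONDS)
            break
        time.sleep(60)

With a 60-second TTL and five failed checks, that's roughly the 5-10 minute window mentioned above before traffic lands on the standby.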
Good luck FogCreek.
Excluding back-seat systems engineers on sites like this, I suspect that most of their customers will be a bit upset, but give them the benefit of the doubt and be glad to pay slightly less monthly (or nothing for Trello) and suffer a short outage.
Fogbugz IS a commercial app, and at $25/mo (if I remember right) it's in the same category as most other commercial apps, i.e. it's not particularly cheap. It's still down.
It's a bummer that their sites are down... but I think I can go a day without my to-do list when they have a once-in-a-lifetime natural disaster.
For perspective, it's not like they have down-time once every few months.
We're still down. I'm happy for FogCreek, and I'm generally happy with Peer 1, but I wish they would have been honest with everyone in this situation.
I think it's ironic that Peer 1 is getting accolades from Business Insider's Squarespace article, while their misinforming email has hurt my firm and the small companies that use our software.
I won't hold my breath for an apology email.
That's life in the big city.
Thanks to Michael and Joel for sharing their information and assessments with all of us.
And now that we're up, thanks to Peer 1 for doing their best in a trying situation.
Or is it just that something at the datacenter level is redundant?
EDIT: this site is amazing - divergent opinions seem to be actively discouraged given how many "points" I've lost thanks to stating mine. Is the point of this site for all of the members to think in the same way?
Then again, having cross-datacenter backups that can easily be taken online would be a bit more professional than 'we want to physically move the servers'.
As a simple example, I've seen at least a half dozen people who had issues because they thought it was as simple as throwing a mysql node into each datacenter, only to discover (much later) that the databases had become inconsistent and that failing over created bigger problems than it solved.
Similarly, I've seen complex high-availability infrastructures where the complexity of that infrastructure created more net downtime than a simpler infrastructure would've, it just went down at slightly different times.
And you really need to think about the implications of various failure modes. If you go down in the middle of a transaction, is that a problem for your application? Is it okay to roll back to data that's 3 hours old? 3 minutes? 3 seconds?
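One way to make that trade-off explicit is to encode the tolerated data loss as a pre-failover check. A sketch, assuming a MySQL-style replica and hypothetical hostnames:

    import subprocess

    MAX_TOLERATED_LAG_SECONDS = 180   # "is 3 minutes of lost data acceptable?"

    def replica_lag_seconds():
        # Ask the replica how far behind the primary it is
        # (MySQL reports this as Seconds_Behind_Master in SHOW SLAVE STATUS).
        out = subprocess.run(
            ["mysql", "-h", "replica.example.com", "-e", "SHOW SLAVE STATUS\\G"],
            capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if "Seconds_Behind_Master" in line:
                value = line.split(":", 1)[1].strip()
                return None if value == "NULL" else int(value)
        return None

    lag = replica_lag_seconds()
    if lag is None:
        print("replication broken or stopped - failing over loses an unknown amount of data")
    elif lag > MAX_TOLERATED_LAG_SECONDS:
        print(f"replica is {lag}s behind - failover would roll back that much data")
    else:
        print("replica is close enough - promote it and repoint the application")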
There are any number of situations where it's reasonable to say "we expect our datacenter will fail once every couple decades and when it does, we'll be down for a couple days."
What is a simple no-brainer, however, is to have offline offsite backups that can easily be brought online. A best practice is to have your deployment automated in such a way that deploying to a new datacenter that already has your data is a trivial thing.
But yeah, even if you're running a tight ship, sometimes things like that go overboard without anyone noticing.
Remember the story of the 100% uptime banking software that ran for years without ever going down, always applying patches at runtime. Then one day a patch finally came in that required a reboot, and it was discovered that in all those years of runtime patches without reboots, nobody had ever tested whether the machine could actually still boot - and of course it couldn't :)
Here's where it gets really simple. Resize the staging instance to match live. Put live into maintenance mode and begin the data transfer to staging (with a lot of cloud providers, steps #1 and #2 can be done in parallel). As soon as it finishes copying, take live down, point the DNS records at staging and wait a few minutes. Staging is now live, with all of live's data. Problem solved. Total downtime: hardly anything compared to not being prepared. Total data loss: none.
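Roughly, that procedure scripted (a sketch only - "cloud-cli" is a stand-in for whatever your provider's tooling is, and the hostnames are hypothetical):

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Resize staging to match live (provider-specific; placeholder command).
    run(["cloud-cli", "resize", "staging-db", "--to-match", "live-db"])

    # 2. Put live into maintenance mode so no new writes arrive mid-copy.
    run(["ssh", "live.example.com", "touch", "/var/www/app/maintenance.flag"])

    # 3. Copy live's data onto staging (rsync keeps the final sync short if
    #    you have been pre-seeding staging while step 1 runs).
    run(["rsync", "-az", "--delete",
         "live.example.com:/var/lib/app/data/",
         "staging.example.com:/var/lib/app/data/"])

    # 4. Point DNS at staging; with a low TTL, clients follow within minutes.
    run(["cloud-cli", "dns", "update", "app.example.com",
         "--target", "staging.example.com", "--ttl", "60"])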
You can look to Amazon to see that cloud architecture brings with it hidden complexity that also increases the risk of downtime, while you relinquish a lot of control over, for example, the latency and bandwidth between your nodes.
What I don't know, by the way, is whether the total cost of ownership is larger for colocation or for cloud hosting.
1) Their engineers never thought of it
2) They considered it, and it is as simple as you think... but they don't care about uptime.
3) Implementing geographic redundancy is harder than you think given whatever other constraints or environment they face.
4) Some other explanation
#3 seems like the most likely explanation to me.
Unless you're just talking out of your arse of course and you have no experience with that sort of thing at all.
Code that exists in production is often buggy and unwieldy, and doesn't necessarily make a lot of sense. Because when you have a product that makes money, your priorities also change.
You need to become more defensive about your maneuvers, and you have to have a real reason to justify changing code.
To commit to doing redundancy well, you need a lot of resources, and you need to justify diverting resources that could otherwise be used to build a better product.
There's a common misconception that you can just throw stuff at the cloud (AWS, Heroku, etc.), and things will just stay up. In practice, between caching, database server backups, heavy writes, and crazy growth, there's a lot to deal with. It's not nearly a solved or a simple problem.
So people are probably down voting you because your opinion seems naive to them. I've personally migrated a top 80,000 global eCommerce operation, and everything broke in a million different places, and we spent 2 weeks afterwards getting things working properly again.
There's a big difference between the way things are in your head and the way things are in production. Don't say people don't know what they're doing because they don't have a perfect system. No system is perfect.
The decision to avoid cross data center replication was probably a carefully considered one instead of amateurish. They probably have multiple layers of redundancy in their setup and decided that the cost and overhead of cross data center replication was not justified.
In hindsight this doesn't seem like such a good decision, but I don't see how that makes someone an amateur or a fraud.
Whatever this post says, Jeff clearly didn't share your view of Joel being an amateur and a fraud, given that he went on to start a pretty successful business with him.
The argument you have just presented is irrational, since its central point rests upon the fallacy of false cause.
Another satisfied customer. Next!
After all, all we get from Joel is a decade of sharing what he's worked on and why he's done stuff a particular way, in a relatively transparent manner that allows us to maybe learn stuff but importantly to put it all in a context that allows us each to make a judgement on whether what he says is useful / interesting to us.
By contrast, with you we have the rich tapestry of an anonymous account on an internet message board, a superior manner bordering on trolling, and a series of aggressively worded posts.
I don't know what I was thinking. Death to Spolsky!
Just one thing. Now that you too have taken to the internet to teach the rest of us how things should be done, if someone spots any errors in what you say it's fine to term you a fraud I take it? What's good for the goose and all.
Label me however you want, it's a free internet (for now, anyway).
I still find it funny how anyone can start a blog and become famous for it. Maybe I should do the same and cash in on all that buttery goodness of advertising revenue...
Imagine you are at a dinner party at Paul Graham's house. He's there, obviously, along with several startup founders, aspiring founders, and a few established industry figures, including the person you are about to disagree with or criticize.
It will undoubtedly take more effort to figure out how to frame your criticism so that it doesn't make you a pariah, but the advantage will be that you will leave open the possibility of forming beneficial business and personal relationships.
In this case, I would try describing your own successes with building redundant services, and describe some of the other approaches you found while researching ones that you have built.
Incidentally, I'm not here to form relationships - personal or otherwise. The primary goal of social media sites is to indulge in procrastination while advertisers bombard us with new products, not to improve one's life. For the latter, there are books, actions and real people made of flesh and blood. This reminds me a lot of some of the people I encountered in my gaming days - they tend to forget about the context of the platform they are using.