Fog Creek is about to go down (fogcreekstatus.typepad.com)
137 points by adv0r 1785 days ago | 180 comments

Stack Exchange (Stack Overflow) barely made it out. We are in the same datacenter but we just finished building out and testing a secondary datacenter in Oregon literally last weekend. We did an emergency failover last night after the datacenter went to generators. Read more at http://blog.serverfault.com

Wow, nice timing there.

Just curious, wouldn't it have been wiser to put the failover servers somewhere in the Midwest? It's pretty much as far away from the ocean as one can get, making tsunamis/hurricanes/etc. irrelevant, low earthquake risk, and a shorter flight from NYC. Seems a little inadvisable to place the infrastructure in two coastal areas; I guess it's probably about the local talent pool.

The Midwest isn't immune to the effects of hurricanes. We've frequently lost power, sometimes up to a week, when bad hurricanes come through. They just become super storms over the Midwest and take down trees which cut power lines. And snow/ice can often take out power for up to a day.

So, I'd think that rather than placing something in Ohio and New York, both of which can be affected by the same weather pattern within a day or two, different coasts offer better protection.

In raw geographical terms, Ohio only barely counts as "the Midwest"; I think it is included in that region primarily for cultural/economic reasons. What about Kansas City or Omaha?

I lived in KC for some years and we'd occasionally get remnants of gulf hurricanes, but they'd just be severe storm systems that would pass through. We rarely lost power, but if we did, it was never for protracted time periods, and generators in data centers should easily be able to deal with power loss from severe storms.

Tornadoes are pretty much the least threatening natural disaster out there, as their area of effect is usually small and their duration is usually short, so I think "tornado alley" is actually a fine place for a data center, meteorologically.

You mean in Tornado Alley?

Tornado Alley runs from northern Texas through Oklahoma, Kansas, Nebraska, and South Dakota.

Yeah, Oregon is known for its vicious tsunamis and earthquakes. Also, if a tropical cyclone hit Oregon it would not be a hurricane, it would be a typhoon. http://www.diffen.com/difference/Hurricane_vs_Typhoon

I think the idea of the second data center is that it is far away from the first one.

FYI, according to your link: "The difference between hurricane and typhoon is that tropical cyclones in the west Pacific are called Typhoons and those in the Atlantic and east Pacific Ocean are called Hurricanes." Last I checked, Oregon is in the east Pacific, so the tropical cyclone would (indeed) be called a hurricane, yes?

If a tropical cyclone hit Oregon it would probably have started as a hurricane, not a typhoon. Typhoons form west of the International Date Line, and they move predominantly west. Typhoons are pretty much guaranteed never to move east across the IDL, because of the prevailing winds.

I think you got confused about the "NE Pacific Ocean" thing under the Hurricane column. The NE Pacific Ocean is the part of the Pacific Ocean that Oregon is on.

It would technically be an extratropical cyclone, but it would likely have started as a tropical hurricane off the coast of Mexico.

I have been following the podcasts, and when you guys mentioned this failover for the relocation of your servers, the impending hurricane didn't even come to mind. That is some amazing timing; it is a good thing you had to change data centers over there in NY. Also, it is a good thing you didn't move your servers over on a bed dolly.

> Read more at http://blog.serverfault.com

Thank you for this discussion!

We depend on FogBugz (hosted) to answer our support E-mails. If the downtime is on the order of several hours, I'm fine with it, these things happen. But if (as it looks like) it is on the order of days, I'll be looking for another solution.

When you offer hosted services (not cheap, mind you), you take on responsibilities. Among them are disaster recovery scenarios. We have ours, and I expect any company offering a cloud-hosted solution to have theirs.

This is precisely why we have a self host requirement for all of our software. We did have a ton of stuff in salesforce but due to a number of problems with salesforce availability and the inevitable problem of relying on British Telecom's infrastructure monkeys, it got moved to a locally hosted dynamics CRM solution with off site transaction log shipping should the office catch fire.

Cost a small fortune but there is nothing more expensive than not being there for your paying customers.

Fogbugz has a self hosted version too, it's your choice which one to use.

Judging by current Twitter traffic for @trello, there is a clear need for a self-hosted version.

Based on my past experience working on self-hosted Kiln, I firmly believe creating a self-hosted version would cost more than it would rake in. It's different for every product and it's different for every company, though.

(Sorry for the conversation derail.)

That's not going to happen.

It'd be a smart way for them to finally make money off of it.

I'm no longer with the company, so I may be out of date, but the consensus was that it was just too expensive to support licensed products if it weren't necessary. The thing that people don't take into account is that there are an endless number of server configurations that can screw up the application, and for a small company, dealing with each one of those is quite expensive.

So what's the alternative? Keep it hosted-only. The downside is that, yes, outages like this happen. But I would argue that, on the whole, the overall Trello downtime has been far less than the cumulative downtime of people trying to run it themselves. Moreover, this was an extremely unusual storm. Buoys reported waves 5 times higher than anything on record. My guess is the cost-benefit analysis is still solidly on the side of having a hosted product in an easily accessible data center.

The opposite is what happened with the company I work for.

They brought all the email server management in-house, probably cost a lot of money, took a while, and when they finally turned it on... half the company was without reliable email for about a month.

If you don't mind my asking, who is `we` in this?

We have a comms NDA which prevents me revealing the company name but we're in the financial sector and are an old fashioned "enterprise company".

Do you think that your company could do this better at a reasonable cost? How would your company handle a scenario like this internally (an entire data center becomes unrecoverable)?

If you don't have a good answer, it's simply grass-is-greener thinking.

I pay $25/month per a single hosted FogBugz user. I believe you can build in reasonable disaster recovery procedures at "reasonable cost" given these kinds of price points. I don't see any technical obstacles — in fact I think Fog Creek used to write about how they have hot backups (replication, I'd assume) in various geographical locations.

And I do not want to host FogBugz myself, in fact I'm paying exactly for the comfort of not having to plan for failures in the case of FogBugz.

Incidentally, we run a SaaS company. Our disaster recovery worst-case scenario means recreating our services from scratch in any Amazon AWS datacenter on Earth in less than 4 hours. Yes, we have an easier job, because we do not store a lot of transactional user data. But our service is also way, way cheaper than FogBugz.

Honestly, it's not FC vs. self-hosted, it's FC vs. an alternate product.

Preferably one not located in a hurricane path, flood plain, earthquake zone, fire area, or landslide track, nor subject to political or economic instability.

Or highly redundant with tested failover paths.

All of which costs money, and still doesn't assure reliability. Look at last week's AWS EBS outage and root cause analysis: the service was brought down by its own monitoring (exacerbating an existing memory bug).

It's not easy. Sandy is the most extreme hurricane to hit NYC in a century (though the second in as many years). NYC is a sufficiently important commerce and financial hub to have excellent services and recovery capabilities, but it still isn't immune to perturbations.

An entire data center is becoming unrecoverable, but they had 3 days of advance notice to pull drives and bring up servers somewhere else. Don't people plan to have redundant data centers?

This wasn't a surprise storm, and it wasn't "more severe than expected" - this has been national news for a while. More likely it was "eh, we'll be alright, why pay for something and not use it, we'll just apologize later".

"reasonable cost?"

The "reasonable cost" is a good point of course. Also relative to what they are able to charge and how that would change their business model. One type of customer might be willing to pay for a more robust service, others wouldn't. Take any garden variety website hosting service where the charge is under $10 per month and try to operate it giving better uptime and charging, say, $20 per month and see your customer base vanish. People expect it to work 24x7 but aren't willing to pay for it. They would rather take their chances.

That said one thing they might be able to do at a "reasonable" cost is simply spread their customer base over multiple data centers. So the failure of one would only bring down a smaller percentage of their customers.
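A back-of-the-envelope sketch of that idea (datacenter names here are purely hypothetical): hash each customer ID to a site, so losing one DC only takes down roughly 1/N of the customer base.

```python
import hashlib

# Hypothetical datacenter names -- purely illustrative.
DATACENTERS = ["nyc", "oregon", "chicago"]

def datacenter_for(customer_id: str) -> str:
    """Deterministically assign a customer to one datacenter."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return DATACENTERS[int(digest, 16) % len(DATACENTERS)]
```

A DC outage then affects only the customers hashed to it; the catch is that adding or removing sites reshuffles assignments, which is where consistent hashing would come in.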

Ironic, given your comment, that FogBugz is $25-30/user/month and aimed at teams rather than individuals, so a client could easily be paying Fog Creek hundreds of dollars or more a month.

We use Fogbugz for all our internal project tracking. The consensus among our engineers is that this downtime is understandable and we'd rather deal with it, even in a mission-important web app, than pay more every month to ensure redundancy was available.

Frankly this is just making us appreciate Fogbugz all the more since tracking our time without it will be a real PITA.

I don't know what Fogbugz costs, but they really should start charging for Trello now, esp. if they can use some of that revenue to add geo-redundancy.

Trello is fantastic, but now I'm worried that I'm too dependent on it and I should arrange an offline alternative.

Take my money, Fog Creek.

Geo-redundancy is a tough engineering problem. We're building a long term solution but it's a lot of work and it's not in place today.

If this is the kind of problem that excites you, we're hiring :-)

It's particularly hard to shoehorn in after the fact. Certain development models (e.g. replicated state machines) make it much easier... mix in some magic Paxos dust and it can handle machine failures as well.

Sadly the better implementations I've used myself (or have heard about) are not publicly available. The closest thing in semi-widespread use seems to be Zookeeper, but it's more like Oracle when you really wanted SQLite (standalone service vs. library).
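For the curious, the replicated-state-machine idea itself is simple to sketch (a toy Python example; what Paxos/ZooKeeper actually provide is agreement on the order of the command log):

```python
class ReplicatedCounter:
    """Toy replicated state machine: replicas that apply the same
    command log in the same order end up in the same state."""
    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, amount = command
        if op == "add":
            self.value += amount

# The agreed-upon log -- in real systems, consensus produces this ordering.
log = [("add", 5), ("add", 3)]

a, b = ReplicatedCounter(), ReplicatedCounter()
for cmd in log:
    a.apply(cmd)  # replica in one datacenter
    b.apply(cmd)  # replica in another

assert a.value == b.value == 8  # identical state on both replicas
```

The hard part geo-redundancy adds is reaching that agreed ordering across datacenters despite latency and partitions, not applying the commands.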

How tough it is depends on how you're engineering your geo-redundancy. I've been doing it since 1998, and a simple active-passive solution is not as hard as some would believe, but it does cost money. Active-active is much more challenging, and multi-master is obviously the ideal and the most difficult to engineer given geo-network latency. I haven't seen one solution that couldn't be staged to at least provide active-passive DR capability for a pretty reasonable price. You can even do it in EC2, RackSpace Cloud or Joyent if you're feeling "cloudy".

I totally agree. Did this thing have an export function, or should I be prepared to write a scraper?

There is an export-to-JSON, from whose output I'm now trying to salvage my project notes.
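For anyone else digging through that export, a minimal sketch of pulling card titles and notes out of the JSON. The miniature document and the field names here ("cards", "name", "desc") are assumptions based on what the export appears to contain, not a documented schema:

```python
import json

# Made-up miniature of a board export; real exports carry many more
# fields, but a "cards" list with "name"/"desc" entries is the gist.
export = json.loads("""
{"name": "Launch plan",
 "cards": [{"name": "Write copy", "desc": "Front page hero text"},
           {"name": "Deploy", "desc": ""}]}
""")

for card in export["cards"]:
    print(f"- {card['name']}: {card['desc'] or '(no description)'}")
```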

There is also a print option, but that doesn't seem to print the back of the cards so it's pretty useless, unless I'm mistaken.

There is an api: trello.com/docs/api

Either your engineers are underpaid or you're anticipating a larger price hike than would seem justified to gain the redundancy you need.

Or the engineers want to work on interesting problems, rather than patching the bug tracker every 2 weeks.

In my experience, what self-hosted bug-tracking systems might gain in redundancy (a few hours every time a major storm hits) they lose to context switches (which hit team morale, not merely time.)

"Hey, Bob! Can you reset the MySQL server again on the bug-tracking maze set up by Fred over 4 years ago?"

Bob obligingly does so, and then, having been interrupted in the middle of a hard problem, loses his place and can't get back up to speed by the end of day. So while you may gain four hours of butt-in-seat time, you're losing four hours of real productivity on a far more frequent basis.

I'd rather pay Fog Creek to worry about that stuff, and actually ship products.

My first thought is that everyone affected is doing OK. I'm sure that everyone just wants their stuff working too, like electricity. :)

I do hope that, at a better time, Fogbugz can consider the kind of redundancy/failover that Stack Overflow enjoyed.

Too bad that the backup generator refuelling pumps have been submerged (while the generators themselves are running).

That sounds like some really ... unfortunate planning of the positioning of these machines, made me think of the Daiichi incident when backup assets failed to come online because parts of the backup infrastructure were destroyed. Not as serious, of course. Fog Creek's hosting isn't a nuclear power plant. :)

Makes me glad I don't work with things that are as hard to test in the real world as these kinds of backup solutions must be. I hope they manage the refuelling, somehow.

Note that it's not as trivial as you might think. Pumps on the 17th floor cannot pull fuel up from the basement: even if they create a vacuum in the tube, atmospheric pressure can't push the fuel up 17 floors. So the pump has to be positioned at the base, and is thus subject to flooding.

You could siphon the fuel up the tube to the pump at the time of the pump's installation, right? Through a one-way valve?

That is not how a siphon works. The destination must be at a lower elevation than the supply reservoir.

That's what I meant: siphon the fuel up to the pump at installation, so you don't have any gas in the line and don't have to rely on a vacuum. As the pump draws fuel through the one-way valve, it continues to draw more fuel from below.

There are two reasons why this won't work.

The first is that the fluid will draw a vacuum at the top of the pipe. A vacuum on Earth can only sustain about 14.7 psi, i.e. atmospheric pressure. For water, that corresponds to a 33.9' column; beyond that you could draw a perfect vacuum at the top of the pipe and still not get the water to rise higher. Since diesel is less dense than water you might go a little higher, but not much.

The second problem is that as you create a low-pressure zone at the top of the pipe, the fluid will boil (cavitate). The gasified liquid fills the space and drops your pressure. Oils like diesel have a significantly higher boiling point (i.e. a lower vapor pressure) than water, so this kicks in later, but it's still a limit.
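The first limit is easy to check numerically: with atmospheric pressure P and fluid density ρ, the maximum lift of a suction column is P/(ρg). The diesel density below is an approximation.

```python
P_ATM = 101_325.0   # Pa, standard atmosphere
G = 9.81            # m/s^2, gravitational acceleration
FT_PER_M = 3.281

def max_lift_m(rho_kg_m3: float) -> float:
    """Height at which the fluid column's weight balances atmospheric pressure."""
    return P_ATM / (rho_kg_m3 * G)

print(max_lift_m(1000.0) * FT_PER_M)  # water: ~33.9 ft, the figure above
print(max_lift_m(850.0) * FT_PER_M)   # diesel (approx. density): ~40 ft
```

Either way, nowhere near 17 floors, which is why the pump has to sit at the bottom.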

That's the first thing I thought of too. Seems like there should be a new rule that your backup generator infrastructure should be on at least the second floor.

That means that the fuel tanks will have to be there as well, or you're going to need some specialized lines/equipment to prime the pumps. How much would a 3" cylinder of diesel fuel about ten feet tall weigh? (Whatever it is, that would be a crapload of vacuum to produce/maintain.) Then you have the weight of the tanks themselves and refilling logistics. All fun engineering problems. :)

What if you pumped it up by having a hydrostatically balanced system: instead of sucking up the fuel, pump water down to drive a piston that pushes the fuel up.

Interesting idea. I'm not an engineer, so I can't really speak to that. It sounds like you'd be adding more points of failure and still have the potential of a submerged pump being an issue. Maybe something other than a piston?

A flooded piston would be more reliable than a flooded motor IMO. Lots of reliability critical things like vehicle brakes use pistons. All you would need is enclosed tubing sufficient to withstand the internal pressure. Submersion shouldn't be an issue because the system already needs to be sealed - the pressure would cause a leak if it wasn't already sealed, and in fact flooding would reduce the probability of a leak by reducing the pressure differential.

If you keep sufficient water (up to the weight of the fuel) at the location where it's needed, the entire system needn't have a pump at all when called into action, just a tap; gravity would be sufficient. This is presuming that the fuel is kept at basement level for safety purposes, of course, otherwise you could just keep the fuel where you're storing the water. You can get by with less water and active pumping, since hydraulics are easy to turn into gearing (force multiplication) effects.

Most basements are already waterproof (since you don't want groundwater getting in). It's generally the first floor that is the weak point, when the water gets over the basement walls. So if you think about it, it's not that hard to make the basement waterproof to a flood... just make the walls higher.

Cuomo's suggesting steps such as that, given that major storms seem to be "the new reality" in New York (both city & state).


If water is leaking in, rather than a crushing tidal wave hitting, all you need is a small elevation above a drain/pump to route the leaking water to another location.

Or just make the generators work submerged?

The pumps have to be on the bottom floor. How else will you be able to get the fuel to them?

At least in this scenario people can carry fuel up stairs to fill the generators.

There are pretty much two ways to deal with this. Either admit this is a low-probability failure scenario, that it isn't cost-effective to have global redundancy, and that the outage will be resolved as soon as possible. Or admit you failed to build a geo-redundant HA infrastructure, and apologize with a tentative plan to build out redundant infrastructure in a different catastrophe zone.

move the servers? On what planet is server infrastructure movable on a whim?

It is movable. It would take a few days. We have another DC. We wouldn't say that if we knew it wasn't doable.

EDIT: Just because it's feasible doesn't mean we will actually do it; just wanted to clarify that we weren't firing from the hip.

Wow, you expect Fog Creek to be down for that long? Why can't you just scp everything over to your other DC, or at least move the hard drives? Seems like that would cost less time.

Kiln alone has >4 TB of data; you want to SCP that with 90 minutes heads-up?

Having power and/or rack space is not the same as having servers, switches, etc. anyway.

Hopefully it will not be down that long. We'll let you know more when we know more.

Ha, I wouldn't worry too much about Kiln. I can push when the servers come back. Heck, considering how often some folks I know of push their changesets I wouldn't be surprised if a lot of them never even notice that anything happened.

FogBugz is more of a problem. Some days the "Resolve" button is my only source of job satisfaction.

You could have rsync'd it with a few days heads up, and freshened that in the last 90min.

>4 TB of data? I didn't realize Kiln's gotten that big already. When this whole thing is sorted out, @gecko @kevingessner - would love to see a "State of the Kiln" post and some stats!

Do you guys do off-site backups?

Yes, we have multiple off-site backups (cloud as well as an offsite storage DC) for all customer data. All data is still safe in NYC -- we've just brought down service to prevent problems in case of an abrupt power failure.

Yes we do. Unfortunately, that's all they are -- data backups, sans infrastructure.

Sure, but that invalidates the complaint of it being infeasible to scp out 4 TB of data before your NYC DC runs out of power. Those 4 TB of data are already out of NYC, safely in some other DC where you have hardware. You just need new servers/VMs in/near that DC to restore the backups to.

I'm not trying to second-guess your ops team, but the whole point of having off-site backups is to facilitate your RTO plan in case you lose your primary DC with no warning. I'd be surprised if you don't have a < 24 hour RTO plan in place. With how quickly you can get VMs and even dedicated servers provisioned by many hosting providers (minutes to a couple of hours), the idea of physically moving servers off-site into new racks, with new networking, etc., seems kinda nutty...

I can't imagine that relocating an entire environment for a multitude of applications and services is as simple as scping things over. I'm sure there is a process wherein scp could be a step, but I don't see it being any easier/faster.

Moving the hard drives could be an option, and I believe it is sometimes done, but it assumes there are empty boxes on the other end waiting to receive the hard drives in a similar configuration to how the hard drives came. Also there's separate issues depending how many drives they are dealing with, and what redundancy is involved. If it's very few, then you might as well move the server outright. If it's very many, then there's extra human overhead (and room for error) in keeping the drives together.

Why would moving the servers be hard? You can fit 100 terabytes of storage in a shoe-box these days. I'd be extremely surprised if you couldn't run all of FogCreek off of a single 10U blade enclosure. That would be up to 128 CPU cores; I suspect they need only a small fraction of that.

On the other hand, with a 1000 Mbit/s uplink that they were allowed to saturate, they'd still only be able to copy out 1 terabyte in about 3 hours.
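The arithmetic, for anyone checking; the 75% utilization figure is purely an assumption to account for protocol overhead:

```python
def transfer_hours(terabytes: float, link_gbit_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours to move `terabytes` of data over a `link_gbit_s` Gbit/s link."""
    bits = terabytes * 1e12 * 8
    return bits / (link_gbit_s * 1e9 * efficiency) / 3600

print(transfer_hours(1, 1))        # ~2.2 h at a fully saturated 1 Gbit/s
print(transfer_hours(1, 1, 0.75))  # ~3 h at an assumed 75% utilization
print(transfer_hours(4, 1, 0.75))  # ~12 h for Kiln's >4 TB
```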

Essential quote (literally from Networking 101): "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

Depends on your mental model of "servers." People have different views of servers, from "that noisy hot thing under my desk" to multiple cages and dozens of racks with 1G, 10G, 40G fiber, rack switches, core switches, edge switches, terminal servers, database servers, application servers, monitoring servers, corporate boxes, .... All with various weights, accessibility, cable routing, and those damn three servers with stripped mounting screws that have been there for six years.

100 TB of high speed RAID-10 would fit in maybe 30 shoe boxes.

Then, after things are moved, you have to deal with drives that have jiggled loose, components that outright fail to work again, or things that get accidentally broken in transit.

Let's just declare an emergency federal holiday until Nov 2 so everybody can recover without dangerous heroic measures.

In this case my mental model of "servers" is "the computers that run the specific small company under discussion, who has already said that they can move them if they want to".

We seem to be arguing separate points -- I'm saying that it's not unreasonable that a small company could be moved fairly easily. Possibly as easily as unplugging a blade enclosure and throwing it in a station wagon. There are loads of small businesses that can run on an amount of hardware that can be easily transported. (When I used to gig on electric bass, my amp and other rack gear was in a portable 8U rack and that was more "portable" than the 100 pound speaker cabinet.)

I can't tell if your point is that it's unreasonable for all companies, which is wrong, or that it's unreasonable for some companies, which is obvious.

So a couple of things:

1) Like any prepared person, I got all my data out of the east coast before this whole hurricane thing. If someone from Fog Creek hooked me up with some emergency licenses while they got their stuff sorted out, we'd be fine. Actually, this would be a good time to switch to self-hosted.

2) Second, while I was doing our hurricane prep, I ran into this blog post from Joel:

> Copies of the database backups are maintained in both cities, and each city serves as a warm backup for the other. If the New York data center goes completely south, we’ll wait a while to make sure it’s not coming back up, and then we’ll start changing the DNS records and start bringing up our customers on the warm backup in Los Angeles. It’s not an instantaneous failover, since customers will have to wait for two things: we’ll have to decide that a data center is really gone, not just temporarily offline, and they’ll have to wait up to 15 minutes for the DNS changes to propagate. Still, this is for the once-in-a-lifetime case of an entire data center blowing up


Obviously this was written in 2007, but they claim to be geographically redundant and have geographic backups that are "never more than 15 minutes behind". Presumably things haven't deteriorated since then.

"and they’ll have to wait up to 15 minutes for the DNS changes to propagate"

Not sure I see the need for a propagation delay if customers can be pointed to the new site by simply using domainbackup.com instead of domain.com: in other words, completely separate DNS as well as a completely different domain (even through a completely separate registrar) pointing to a site hosted elsewhere. Customers can know this in advance, of course.

That text says "To implement this warm backup feature, I wrote a SQL mirroring application that implements transaction log shipping: ..... Right now, we’re log shipping twice a day, so you might lose a day of work if an entire city blew up, but in a couple of weeks, we’ll implement a system that does more continuous backups, and we expect that the warm backups will never get more than 15 minutes behind."
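For anyone unfamiliar with log shipping, a toy Python sketch of the idea (this has nothing to do with Fog Creek's actual implementation): the standby only ever sees whole shipped batches, so whatever was written after the last shipment is lost on failover. That window is the "15 minutes behind" in the quote.

```python
class Primary:
    def __init__(self):
        self.data, self.log = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))   # every write also lands in the log

    def ship_log(self):
        batch, self.log = self.log, []  # hand off accumulated log entries
        return batch

class WarmStandby:
    def __init__(self):
        self.data = {}

    def restore(self, batch):
        for key, value in batch:        # replay shipped entries in order
            self.data[key] = value

primary, standby = Primary(), WarmStandby()
primary.write("case-42", "open")
standby.restore(primary.ship_log())     # runs every N minutes in a real setup
primary.write("case-43", "open")        # after the last shipment: lost on failover
assert standby.data == {"case-42": "open"}
```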

What happened to that? Was it turned off? How long did it last?

I also wonder how long FogCreek will still maintain the for-your-server version of FogBugz? Will it still be available next year?

I'm kind of an optimist; I believe it'll only be a matter of hours total outage. The generator is fine, the equipment is fine, the internet connectivity is fine... the only problem is getting fuel up to the generator on the 17th floor, while the fuel pumps in the basement are submerged. Someone will carry it up 17 flights if need be.

Given an average requirement of 500 gallons an hour? That's nearly 2 tonnes. Nobody's carrying that up 17 flights.
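The arithmetic, taking the parent's 500 gal/hour burn rate as given (the density is approximate; the result comes out around 1.6 tonnes):

```python
GALLON_L = 3.785        # litres per US gallon
DIESEL_KG_PER_L = 0.85  # approximate density of diesel

def diesel_kg(gallons: float) -> float:
    return gallons * GALLON_L * DIESEL_KG_PER_L

print(diesel_kg(500))     # ~1600 kg per hour at the assumed burn rate
print(diesel_kg(55) / 2)  # ~90 kg for a half-full 55-gallon drum
```

Which squares with the reports below of people hauling half-full drums (roughly 200 lb each) up the stairs.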

"I'm kind of an optimist"

Sorry for what you are going through, obviously. One thing I have found in disaster planning, though, is that it pays to be a pessimist. While worrying certainly doesn't help once you have a problem to solve, doing so in advance helps you anticipate things you need to take into consideration when planning.

I feel for the guy/gal/group that draws the short straw and has to tote a 400-ish pound 55 gallon drum of diesel up 17 flights. "Whelp that lasted for 5 minutes, again!" ;)

They are carrying half-full 50-gallon barrels of diesel (source: http://forums.peer1.com/viewtopic.php?f=37&t=7532&si...).

Still, carrying 200 lb barrels up 17 flights of seawater/diesel-slicked stairs sounds ...unpleasant.

I'm a cynic. It's going to be at LEAST a couple of days.

It's already back up. So I think that made it only two or three hours.

Yup, I'm pleasantly surprised. Although I don't envy the facilities people who are lugging jugs of diesel up 17 flights of stairs...

They've since completely evacuated the building, no?

The datacenter is in Zone A, which did receive the mandatory evacuation order from the city, but we have a few people from Fog Creek onsite or nearby right now and there is at least one person from PEER 1 there right now. Roads to the area are open, life is returning.

Good luck, guys!

We are beginning to bring down all Fog Creek services (FogBugz, Kiln, Trello, etc.) as our datacenter is shutting down.

fogcreekstatus.typepad.com and @fogcreekstatus on twitter will continue to have updates.

edit: All Fog Creek services have been shut down ahead of power failure. We'll update the status blog as we know more from our DC.

I'm assuming their servers are in LGA11; according to Internap, the basement has flooded and damaged the fuel pumps for the generators. They also said in an earlier email that they have no staff on site and are urging all customers to shut down their servers.

Edit: here are the emails I've received from Internap regarding LGA11 https://gist.github.com/3980482

Cut long lines to make it more readable. https://gist.github.com/3980570

> For our cloud customers, we will also being shutting down the infrastructure at this time.

That bit made me chuckle.

"Given the preparation work that's gone into this, we are confident that all of our services will remain available to our customers throughout the weather." - yesterdays update.

Try not to let your fingers type cheques your datacenters can't cash...!

It's not even clear what kind of prep you would do for a hurricane that ensures service will be available in a hurricane.

There isn't much beyond what they did, assuming you don't have another datacenter to preemptively switch load to. But what you don't do is tell your customers it will all be good. You tell them what you've done and warn them there may be downtime so they can plan accordingly.

You might start by moving infrastructure out of the hurricane. Off the east coast would be nice, but at the very least outside of a mandatory evacuation zone [1].

Alternatively, you could design some type of system that would allow you to fail over to a geographically redundant datacenter. Joel claimed in 2007 [2] that they had such a system, and touted it as a selling point of the reliability of the hosted service. What has happened to it is probably only something that a Fog Creek engineer can tell you.

[1] http://news.ycombinator.com/item?id=4718653

[2] http://webcache.googleusercontent.com/search?q=cache:lHEK939...

Nobody was really expecting this much of Manhattan to lose power.

It's pretty amusing to see you posting this when a couple days ago, you were accusing the media of "blowing it out of proportion". Oh hey, it turns out they were right.


I never used the words that you "quoted". I said "superstorm" was a stupid term. It's a hurricane. Use a well-defined word when it's available.

(The hurricane was also predicted to weaken to a tropical storm before landfall at the time I wrote that.)

Earlier yesterday it wasn't a hurricane though. And shortly after landfall it was no longer a hurricane again - even before it weakened. (cold core => post-tropical cyclone?)

I think using a general catch-all, rather than a narrowly-defined technical term that didn't (and wouldn't) universally apply, was actually a prudent and defensible thing, given their goal of collecting all the concerns of all the stages of the storm under one umbrella.

Even if their motivation was just stupid news branding/sensationalism.

You predicted it was just going to rain a lot; you completely ignored everything reported on storm surges.

Even now, you're concerned with the hurricane classification and missed the fact that barometric pressure, tide timing and bathymetry of the New York Harbor/Long Island Sound were the currently predicted causes of flooding, not simply windspeed. As I responded to your comment, "hurricane" or "tropical storm" classifications were not appropriately descriptive, as Sandy was predicted to (and did) merge with another system to morph from a warm core tropical style system to a cold core nor'easter system. The area of the storm was particularly large, which was another reason for the "super" attribution.

I'm sorry, but these chaps have been around the block a few times, they're not new start-ups. They know that there is no way on earth you can guarantee (or even reasonably be sure that) a single datacenter won't fail, even under non-emergency conditions, so their customers (I'm not one) should be calling them out on why they said that all would be A-OK.

It would have been much better to say something like "We have put all reasonable preparations & precautions in place (see above), and we feel confident they will deal with most things the storm will throw at us. However please be aware that we have no fail-over datacenter available so please plan accordingly."

You can be reasonably sure without being certain. They didn't guarantee anything. If you misread anything they've written as a "guarantee" that their services will never go down, you're blaming the wrong person. (Full disclosure: I am one of their customers.)

> "we are confident that all of our services will remain available"

When you get this statement from an outfit with the pedigree and experience of Fog Creek, it's as close to a guarantee as you're going to find. No misreading necessary.

At the end of the day, they could not be reasonably sure, there was a non-trivial risk that they should have been (and almost certainly were) aware of (this isn't the first time bad weather and datacenters have mixed), and they didn't communicate that to their customers.

Well, it seems like their confidence was based on their single datacenter not going down. Which seems misplaced.

... especially since there is more than one historical example for all of Manhattan losing power. (One of those times involved looting and civil strife.) Combined with the verbiage about "once-in-a-generation storm", it is fortunate that any part of Manhattan had power.

Nobody except the whole world watching the Weather Channel. I'm frankly amazed that Manhattan survived!

I would tend to disagree with you, jrock. Nearly all the models talked about this storm wreaking this kind of havoc.

I meant ordinary citizens, not emergency planners. Despite being told that they could be without power, many of my friends did not believe it. "How could Manhattan lose power?"

> "How could Manhattan lose power?"

Seriously? Were they not living in New York in 2003?

They were not.

I'm guessing everyone is basing their experience on last year's "hurricane", which was not nearly as bad as this one.

Same thing happens with people who moved to LA after the Northridge quake. They don't really believe in earthquakes at a gut level.

I live in California, so I know what you mean (s/weather/earthquake/). People were forgetting about events like:

http://en.wikipedia.org/wiki/Northeast_blackout_of_2003 http://en.wikipedia.org/wiki/New_York_City_blackout_of_1977 http://en.wikipedia.org/wiki/Northeast_blackout_of_1965

Humans have a hard time reasoning about events on decadal time scales.

You mean nobody that somehow missed the media coverage expected this. Those that watched the media coverage prepared for this level of disaster.

As a newish customer of Fogbugz, I'm disappointed.

This storm came with days, if not a week, of heads-up. While I agree that "no one expected Manhattan to lose power" (another comment), as someone who heads up development of a SaaS product, I would have spent most of that week planning for worst cases and recovery. I constantly think about worst-case scenarios and how long they'll take to recover from, even without huge storms bearing down on the data center.

So, I'm disappointed. I really love the Fogbugz and Trello products. Now I'm in a position where I have to question whether we should depend on them.

So, you agree that no one expected Manhattan to lose power. No one expected that in 2003 during the blackouts, either. And in that case, Peer1 was able to keep the servers running without disruption.

This was a monster of a storm, with unprecedented water levels. Buoys around New York reported waves 5 times higher than anything on record.

So consider it this way: would you rather the services be significantly more expensive (remember, doubling the hardware is the cheap part of it) or have the possibility of a few hours of downtime in a once in 100 years event?

Submerged fuel pumps are the same problem that Internap are experiencing. https://news.ycombinator.com/item?id=4715889

Fog Creek's services are hosted at Internap's LGA11 (75 Broad St) data center.

Seems their previous post[0] from last night showed a bit of overconfidence?

"Consider this the 'Everything is Perfectly Fine Alarm.'"

Having run a few HA datacenters, I don't think that level of confidence is ever warranted.

[0] http://status.fogcreek.com/2012/10/feelin-fine-no-expected-d...

What's a little surprising to me is that as of now www.fogcreek.com returns nothing but an error message. Presumably they have control of their DNS, and could quickly throw something -- even a simple web page with a status message -- on a server at some other datacenter.

This is a good reminder that no system is immune to failure, cloud or otherwise. Georedundancy is expensive and difficult, so it's a delicate trade-off, but engineering good physical backup systems is also difficult.

Our servers are in a state far away from hurricanes, but in a state with many other natural disasters, including tornadoes, so it's hard to say if it's a good trade or not. Interesting question: why aren't there more DCs in Utah, Wyoming, Idaho, or New Mexico? And is physical location a huge determinant in where you colo your servers?

We once had space in a datacenter in Arizona, but everyone else had the same idea. We had to move out and find a new datacenter when they were at capacity and we couldn't expand. While they were building a new facility, it was over a year away and our expansion needed to happen sooner. As a final point, the added space was already being pre-reserved at a premium, and we couldn't afford the new rates vs. other areas of the U.S.

At the time, physical location wasn't a big deal, but as the company grew, and the data center overhead with it, it became cheaper to have the core data centers closer to our operations, where our staff could be utilized. Georedundancy ultimately ended up being used for DR and minimum required service availability during major issues.

I know that Arizona is a huge place for DCs just because there aren't any natural disasters there.

There are a few of them out here in Utah that I know of, but none at the scale that they really could be. It would make a lot of sense to put some out here, I would think.

I guess one of the most famous DCs in Utah is the NSA one: http://www.wired.com/threatlevel/2012/03/ff_nsadatacenter/

Most places I've worked have had the first DC close to the office, then the second one in another country. When you're just starting up it's important to have easy physical access, and it's seldom worth migrating away from an existing DC rather than just opening up a new one.

Well, good luck to them, both personally and in bringing it back soon.

I only wish Trello hadn't tried to reload on its own, so I could still see the screen before the shutdown. Now all I have is a blank page :(

I've managed to successfully failover to our own emergency backup version: https://twitter.com/williamlannen/status/263294924382937090

And we're back! All Fog Creek services are back on line. Our datacenter has enough fuel for several hours and is working on getting a delivery of more. We are hoping that Kiln, FogBugz, Trello, and all our services will remain up, though things are still a bit dicey.

The details are all here: http://status.fogcreek.com/2012/10/fog-creek-services-update...

Thanks everyone for your patience!

I initially read that as Fog Creek the company was soon to be going out of business.

"When the tide goes out, you find out who is swimming naked." Warren Buffett said it in reference to economic problems, but somehow it applies...

Latest tweet from PEER1, Fog Creek's datacenter.

Storm #Sandy highlights value of #cloud storage http://www.peer1hosting.co.uk/industry-news/us-storm-highlig...


Why would loss of power cause "unrecoverable data corruption"? Don't modern databases work hard to prevent this sort of thing?

They use SQL Server for Stack Overflow, FogBugz and Kiln. But for Trello they use only MongoDB.

What about Trello, is that implied as well?

Trello is now down. We're simultaneously working on migrating to AWS, as well as physically moving the hardware as mentioned above.

Trello is now down.


I guess "The Cloud" got rained on in this case.

Wow, that kind of sucks for them. I hope they consider prioritizing some kind of geographic replication after the storm is done. It adds cost and complexity (which slows down development, too), but seems like a good tradeoff when you have customers depending on it.

The geographic load balancing side is basically a solved problem (although you don't want to use only DNS-based load balancing like Route53 in most cases), but the hard part is wide area replication of databases for hot failover.

It's pretty easy to do failover if you'll accept a 5-10 minute outage, though.
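A minimal sketch of that "5-10 minute" style of failover, assuming an illustrative setup (none of these hosts or thresholds are anyone's real configuration): a TCP health probe against the primary datacenter, and a low-TTL DNS record pointed at whichever origin the probe passes.

```python
import socket

def healthy(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_origin(primary, secondary, probe=healthy):
    """Return the (host, port) the DNS record should point at.

    Serves the primary unless its probe fails. With a ~60s record TTL,
    failover time is bounded by roughly TTL + probe interval -- the
    "few minutes of outage" flavor of failover, not hot failover.
    """
    return primary if probe(*primary) else secondary
```

The hard part the parent mentions (keeping the secondary's database warm enough that pointing DNS at it is safe) is deliberately out of scope here.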

Reminds me of NASDAQ's post from over a year ago: http://news.ycombinator.com/item?id=2928519

Good luck FogCreek.

Let me join the chorus of people who are slightly miffed at not having access to their Trello boards. In a way it highlights again the importance of local storage and offline accessibility for web apps. I just checked, and in the Trello app on my Nexus 7 I can still see (and browse) all my boards, for example (the only problem is that the content is a few days old, as I have not used the app recently).

Can someone explain to me how someone like Fog Creek would let an app like Trello go dark? Don't their carefully selected and perfectly screened engineers get paid gobs of money to prevent exactly this scenario, by having data centers in other locations replicate the one you have in your own house?

They decide that the cost is not worth the benefit.

Excluding back-seat systems engineers on sites like this, I suspect that most of their customers will be a bit upset, but give them the benefit of the doubt and be glad to pay slightly less monthly (or nothing for Trello) and suffer a short outage.

Last time I checked, Trello was a free app...

Because, per Spolsky, Fog Creek wants to achieve widespread adoption before they start charging for it (probably with a free tier remaining). Other than that, Trello is very much a commercial app.

Fogbugz IS a commercial app, and at $25/mo (if I remember right) it's in the same category as most other commercial apps, i.e. it's not particularly cheap. It's still down.

Sure, the situation is not ideal. Trello is a boon to my productivity and has been a gift at being free. Spolsky has given so much to the community that I, personally, can tolerate this inconvenience to my workflow.

I keep hearing that this is the worst natural disaster that people (who are currently alive) have ever seen in NYC.

It's a bummer that their sites are down... but I think I can go a day without my to-do list when they have a once-in-a-lifetime natural disaster.

For perspective, it's not like they have down-time once every few months.

Unfortunately, smaller Peer 1 customers like mine were told that power was going to be shut off, and we therefore brought down our servers.

We're still down. I'm happy for FogCreek, and I'm generally happy with Peer 1, but I wish they would have been honest with everyone in this situation.

The issue is communication between the boots on the ground and the NOC. Earlier today, we brought our servers down too. People that stayed up (Squarespace) have been up the whole time. Until I understood the whole situation, we made the same choice you did. Now that I know more, having spent the day at the DC, I realize we should get 1-2 hours' warning before power goes out. Peer1 should keep running even after all the fuel in the header (shared tank) on 17 runs dry... at least for a little bit. I can't guarantee it, but that's my take on the facts at hand. (I am President of Fog Creek and spent the day at Peer1 and our office. They are currently using my aquarium pumps to try to pump diesel to the tank on the 17th floor.)

Thanks for your assessment. I actually chose Peer 1 because of the recommendation from FogCreek and I'll still thank you even after this incident.

I think it's ironic that Peer 1 is getting accolades in Business Insider's Squarespace article, while their misinforming email has hurt my firm and the small companies that use our software.

I won't hold my breath for an apology email. That's life in the big city.

It's 7:30pm in NYC and after a few persistent support requests, we're back up.

Thanks to Michael and Joel for sharing their information and assessments with all of us.

And now that we're up, thanks to Peer 1 for doing their best in a trying situation.

Send me an email at my name at my company and I'll keep you updated with what I know.

Curious: what kind of architecture shows a 503 error when your servers are dead (like theirs are currently doing), but can't show a status page? Presumably that server is not in the dead datacenter.

Or is it just that something at the datacenter level is redundant?

We've shut down all of our servers to protect data, except some of the outermost infrastructure and gateways. That 503 is coming from HAProxy, our load balancer -- it's unable to send your traffic to any of the (powered-down) servers.
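For the curious, this behavior falls out of even a tiny HAProxy config; a sketch (server names, addresses, and paths here are made up, not Fog Creek's): when every server in a backend fails its health check, HAProxy has nowhere to route traffic and answers with its 503 errorfile, which can be any static HTTP response, including a hand-written status page.

```haproxy
defaults
    mode http
    # Served when no backend server is available. This file is a raw
    # HTTP response, so it could be a "we're down, here's why" page.
    errorfile 503 /etc/haproxy/errors/503-status.http

backend app_servers
    # With the app servers powered down, both checks fail and every
    # request gets the 503 errorfile configured above.
    server web1 10.0.0.11:8080 check
    server web2 10.0.0.12:8080 check
```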

Any kind of front-end reverse proxy could be doing that. At a complete guess, might they be using cloudflare?

Does anyone else find it funny to think of the internet running on diesel?

You might be surprised to know how much is normally running on petroleum: http://en.wikipedia.org/wiki/List_of_power_stations_in_New_Y...

No redundancy whatsoever? What an amateur operation. I still say Joel is a fraud.

EDIT: this site is amazing - divergent opinions seem to be actively discouraged given how many "points" I've lost thanks to stating mine. Is the point of this site for all of the members to think in the same way?

Of course they have redundancy, just not cross-datacenter redundancy. And if you knew anything about it, you'd know that cross-datacenter redundancy is something you do not decide upon lightly.

Then again, having cross-datacenter backups that can easily be taken online would be a bit more professional than 'we want to physically move the servers'.

I'll be the first to admit I don't really know anything about cross-datacenter redundancy; however, I always thought that was pretty high on the list once you had SaaS products that were pulling in enough revenue to warrant full-time employees outside of the founders. What are the reasons why you would choose not to do it? Are they all financial or are there other implications?

I think the biggest argument against complex cross-DC redundancy is that it can add complexity and failure modes, not just during the emergency, but every day.

As a simple example, I've seen at least a half dozen people who had issues because they thought it was as simple as throwing a mysql node into each datacenter, only to discover (much later) that the databases had become inconsistent and that failing over created bigger problems than it solved.
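One concrete flavor of that failure: promoting a replica that has silently stopped keeping up. A sketch of the sanity check people skip (the field names are MySQL's real `SHOW SLAVE STATUS` columns; the lag threshold is invented for illustration):

```python
def safe_to_promote(slave_status, max_lag_seconds=5):
    """Decide whether a MySQL replica looks safe to promote.

    `slave_status` is a dict of fields from SHOW SLAVE STATUS.
    Refuses if either replication thread has stopped, or if the
    replica is lagging beyond the threshold.
    """
    if slave_status.get("Slave_IO_Running") != "Yes":
        return False
    if slave_status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:  # NULL: replication is broken, lag unknowable
        return False
    return lag <= max_lag_seconds
```

Note that this only catches lag and stopped replication, not the silent divergence you get from accidentally writing to both nodes; that's exactly the class of problem that surfaces much later.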

Similarly, I've seen complex high-availability infrastructures where the complexity of that infrastructure created more net downtime than a simpler infrastructure would've, it just went down at slightly different times.

And you really need to think about the implications of various failure modes. If you go down in the middle of a transaction, is that a problem for your application? Is it okay to roll back to data that's 3 hours old? 3 minutes? 3 seconds?

There are any number of situations where it's reasonable to say "we expect our datacenter will fail once every couple decades and when it does, we'll be down for a couple days."

Great explanation, thank you.

Are you kidding me? If you run big sites like FogBugz then of course you have cross-datacenter redundancy. It's not complicated to host your staging site in another physical location and point the DNS records to it when things go pear-shaped.

Yes, so this staging site of yours has exactly the same databases as your production site? Without customer data, FogBugz and Trello are useless. This means that this simple staging site of yours needs to have all data replicated to it, which means it also needs the same hardware provisioned for it, effectively doubling your physical costs and maintenance costs, and reducing the simplicity of your architecture. Of course, if you're big enough you can afford to do this, and one could argue Fog Creek is big enough. I'm just saying it's not a simple no-brainer.

What is a simple no-brainer, however, is to have offline offsite backups that can easily be brought online. A best practice is to have your deployment automated in such a way that deploying to a new datacenter that already has your data is a trivial thing.

But yeah, if you're running a tight ship, sometimes things like that go overboard without anyone noticing.

Remember the story of the 100% uptime banking software that ran for years without ever going down, always applying patches at runtime. Then one day a patch finally came in that required a reboot, and it was discovered that in all the years of runtime patches without reboots, no one had ever tested whether the machine could still boot, and of course it couldn't :)

Data should be backed up to staging nightly anyway. There should also be scripts in place to start this process at an arbitrary point in time and to import the data into the staging server. You do not need to match the hardware if you use cloud hosting since you can scale up whenever you want.

Here's where it gets really simple. Resize the staging instance to match live. Put live into maintenance mode and begin the data transfer to staging (with a lot of cloud providers, step #1 and #2 can be done in parallel). As soon as it finishes copying, take live down, point the DNS records at staging and wait for a few minutes. Staging is now live, with all of live's data. Problem solved. Total downtime: hardly anything compared to not being prepared. Total dataloss: none.
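Assuming a setup like the one described above (every step name here is a placeholder, not any real company's runbook), the ordering is the part worth writing down: maintenance mode before the copy so the data is frozen, DNS cutover last.

```python
def promote_staging(run):
    """Execute the promote-staging failover steps, in order.

    `run` executes one step -- in real life an ssh/fabric command;
    here it only needs to accept a description string.
    """
    steps = [
        "resize staging to match production capacity",
        "enable maintenance mode on production",   # stop new writes
        "copy latest production data to staging",  # data is now frozen
        "smoke-test staging against the copied data",
        "point DNS (low TTL) at staging",          # cutover happens last
    ]
    for step in steps:
        run(step)
    return steps
```

Running the DNS step first, or skipping maintenance mode, is how you turn "hardly any downtime" into data loss; encoding the order in a script is cheap insurance.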

I fully agree that this is how it could, and perhaps should, be done. But you assume they are already on cloud hosting, which they obviously aren't. Of course this is also a choice that has to be made consciously, especially since Fog Creek has been around a lot longer than the big cloud providers.

You can look to Amazon to see that cloud architecture brings hidden complexity that also increases the risk of downtime, while you relinquish a lot of control over, for example, the latency and bandwidth between your nodes.

What I don't know, by the way, is whether the total cost of ownership is larger for colocation or for cloud hosting.

Why do you think they aren't doing this?

Possible explanations

1) Their engineers never thought of it

2) They considered it, and it is as simple as you think... but they don't care about uptime.

3) Implementing geographic redundancy is harder than you think given whatever other constraints or environment they face.

4) Some other explanation

#3 seems like the most likely explanation to me.

So which of your big sites have cross-datacenter redundancy? Why don't you talk about the decision process that lead to that and costs associated?

Unless you're just talking out of your arse of course and you have no experience with that sort of thing at all.

The relationship between willingness to opine on a topic and knowledge of that topic:


There's a huge difference between code you've written in your spare time, and code that exists in production.

Code that exists in production is often buggy and unwieldy, and doesn't necessarily make a lot of sense. Because when you have a product that makes money, your priorities also change.

You need to become more defensive about your maneuvers, and you have to have a real reason to justify changing code.

To commit to doing redundancy well, you need a lot of resources, and you need to justify diverting resources that could otherwise be used to build a better product.

There's a common misconception that you can just throw stuff at the cloud (AWS, Heroku, etc.) and things will just stay up. In practice, between caching, database server backups, heavy writes, and crazy growth, there's a lot to deal with. It's not nearly a solved or a simple problem.

So people are probably downvoting you because your opinion seems naive to them. I've personally migrated a top-80,000 global e-commerce operation, and everything broke in a million different places, and we spent two weeks afterwards getting things working properly again.

There's a big difference between the way things are in your head, and the way things are in the production. Don't say people don't know what they're doing because they don't have a perfect system. No system is perfect.

FWIW I agreed with you but downvoted because of the posting style.

The decision to avoid cross-datacenter replication was probably a carefully considered one rather than an amateurish one. They probably have multiple layers of redundancy in their setup and decided that the cost and overhead of cross-datacenter replication were not justified.

In hindsight this doesn't seem like such a good decision, but I don't see how that makes someone an amateur or a fraud.

Sorry, should have linked to previous evidence of the fact: http://www.codinghorror.com/blog/2006/09/has-joel-spolsky-ju...

Quoting Jeff in an attack on Joel has got to be irony yes?

Whatever this post says Jeff clearly didn't share your view of Joel being an amateur and a fraud given that he went on to start a pretty successful business with him.

Zynga has (up until now) made a lot of money and they write shit code. Hell, most of the companies I've seen have made money while writing shit code. Making money indicates that a person knows how to make money. Writing good code indicates that a person knows how to write good code. Since the two are disconnected, I stand by my statement that this man is a fraud. You simply don't start a programming blog when you created a new language just to address a small concern in the project spec. Start a blog on how to make money or run a business, sure, but don't tread into a field where people are trying to produce something of quality and try to 'teach' them something.

The argument you have just presented is irrational, since its central point rests upon the fallacy of false cause.


Another satisfied customer. Next!

You're right, why would we want Joel to step into the field and teach us stuff when we've got you with vast knowledge and your winning manner?

After all, all we get from Joel is a decade of sharing what he's worked on and why he's done stuff a particular way, in a relatively transparent manner that allows us to maybe learn stuff but importantly to put it all in a context that allows us each to make a judgement on whether what he says is useful / interesting to us.

By contrast with you we have the rich tapestry of an anonymous account on an internet message board, a superior manner bordering on trolling and a series of aggressively worded posts.

I don't know what I was thinking. Death to Spolsky!

Just one thing. Now that you too have taken to the internet to teach the rest of us how things should be done, if someone spots any errors in what you say, it's fine to term you a fraud, I take it? What's good for the goose and all.

I'm not running a blog or expecting anyone to take what they read in a comment on some site on the internet seriously.

Label me however you want, it's a free internet (for now, anyway).

I still find it funny how anyone can start a blog and become famous for it. Maybe I should do the same and cash in on all that buttery goodness of advertising revenue...

Not just anyone can start a blog and become famous for it. People have to want to read your blog.

So it's essentially a marketing problem. I get the feeling that you were trying to suggest that it's worthy because it's popular. If so, argumentum ad populum is irrational. If not, apologies.

So are you saying that your comments shouldn't be taken seriously?

I would suggest the following mental exercise the next time you want to make a comment on HN:

Imagine you are at a dinner party at Paul Graham's house. He's there, obviously, along with several startup founders, aspiring founders, and a few established industry figures, including the person you are about to disagree with or criticize.

It will undoubtedly take more effort to figure out how to frame your criticism so that it doesn't make you a pariah, but the advantage will be that you will leave open the possibility of forming beneficial business and personal relationships.

In this case, I would try describing your own successes with building redundant services, and describe some of the other approaches you found while researching ones that you have built.

I've outlined how we solved this problem in another comment - http://news.ycombinator.com/item?id=4717713

Incidentally, I'm not here to form relationships - personal or otherwise. The primary goal of social media sites is to indulge in procrastination while advertisers bombard us with new products, not to improve one's life. For the latter, there are books, actions and real people made of flesh and blood. This reminds me a lot of some of the people I encountered in my gaming days - they tend to forget about the context of the platform they are using.
