
Fog Creek is about to go down - adv0r
http://fogcreekstatus.typepad.com/
======
df07
Stack Exchange (Stack Overflow) barely made it out. We are in the same
datacenter but we just finished building out and testing a secondary
datacenter in Oregon literally last weekend. We did an emergency failover last
night after the datacenter went to generators. Read more at
<http://blog.serverfault.com>

~~~
cookiecaper
Just curious, wouldn't it have been wiser to put the failover servers
somewhere in the Midwest? It's pretty much as far away from the ocean as one
can get, making tsunamis/hurricanes/etc. irrelevant, with low earthquake risk
and a shorter flight from NYC. Seems a little inadvisable to place the
infrastructure in two coastal areas; I guess it's probably about the local
talent pool.

~~~
ShawnBird
Yeah, Oregon is known for its vicious tsunamis and earthquakes. Also, if a
tropical cyclone hit Oregon it would not be a hurricane, it would be a
typhoon. <http://www.diffen.com/difference/Hurricane_vs_Typhoon>

I think the idea of the second data center is that it is far away from the
first one.

~~~
beambot
FYI, according to your link: "The difference between hurricane and typhoon is
that tropical cyclones in the west Pacific are called Typhoons and those in
the Atlantic and east Pacific Ocean are called Hurricanes." Last I checked,
Oregon is in the east Pacific, so the tropical cyclone would (indeed) be
called a hurricane, yes?

------
jwr
We depend on FogBugz (hosted) to answer our support e-mails. If the downtime
is on the order of several hours, I'm fine with it; these things happen. But
if (as it looks like) it is on the order of days, I'll be looking for another
solution.

When you offer hosted services (not cheap, mind you), you take on
responsibilities. Among them are disaster recovery scenarios. We have ours,
and I expect any company offering a cloud-hosted solution to have theirs.

~~~
meaty
This is precisely why we have a self-host requirement for all of our software.
We did have a ton of stuff in Salesforce, but due to a number of problems with
Salesforce availability and the inevitable problem of relying on British
Telecom's infrastructure monkeys, it got moved to a locally hosted Dynamics
CRM solution with off-site transaction log shipping should the office catch
fire.

Cost a small fortune but there is nothing more expensive than not being there
for your paying customers.

~~~
kranner
Judging by current Twitter traffic for @trello, there is a clear need for a
self-hosted version.

~~~
tghw
That's not going to happen.

~~~
jonny_eh
It'd be a smart way for them to finally make money off of it.

~~~
tghw
I'm no longer with the company, so I may be out of date, but the consensus was
that it was just too expensive to support licensed products if it weren't
necessary. The thing that people don't take into account is that there are an
endless number of server configurations that can screw up the application, and
for a small company, dealing with each one of those is quite expensive.

So what's the alternative? Keep it hosted-only. The downside is that, yes,
outages like this happen. But I would argue that, on the whole, the overall
Trello downtime has been far less than the cumulative downtime of people
trying to run it themselves. Moreover, this was an extremely unusual storm.
Buoys reported waves 5 times higher than anything on record. My guess is the
cost-benefit analysis is still solidly on the side of having a hosted product
in an easily accessible data center.

------
mdc
We use Fogbugz for all our internal project tracking. The consensus among our
engineers is that this downtime is understandable and we'd rather deal with
it, even in a mission-important web app, than pay more every month to ensure
redundancy was available.

Frankly this is just making us appreciate Fogbugz all the more since tracking
our time without it will be a real PITA.

~~~
kranner
I don't know what Fogbugz costs, but they really should start charging for
Trello now, esp. if they can use some of that revenue to add geo-redundancy.

Trello is fantastic, but now I'm worried that I'm too dependent on it and I
should arrange an offline alternative.

Take my money, Fog Creek.

~~~
spolsky
Geo-redundancy is a tough engineering problem. We're building a long term
solution but it's a lot of work and it's not in place today.

If this is the kind of problem that excites you, we're hiring :-)

~~~
enigmo
It's particularly hard to shoehorn in after the fact. Certain development
models (e.g. replicated state machines) make it much easier... mix in some
magic Paxos dust and it can handle machine failures as well.

Sadly the better implementations I've used myself (or have heard about) are
not publicly available. The closest thing in semi-widespread use seems to be
Zookeeper, but it's more like Oracle when you really wanted SQLite (standalone
service vs. library).
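
For illustration, a minimal sketch of the replicated-state-machine idea in
Python -- just the deterministic-apply half; the hard half, getting every
replica to agree on the log order, is what Paxos (or ZooKeeper) provides and
is elided here:

    class Replica:
        def __init__(self):
            self.state = {}

        def apply(self, command):
            # command is ("set", key, value) or ("del", key)
            op, key, *rest = command
            if op == "set":
                self.state[key] = rest[0]
            elif op == "del":
                self.state.pop(key, None)

    # Replicas that apply the same agreed-upon log end in the same state,
    # so any survivor can take over after a machine (or site) failure.
    log = [("set", "a", 1), ("set", "b", 2), ("del", "a")]
    r1, r2 = Replica(), Replica()
    for cmd in log:
        r1.apply(cmd)
        r2.apply(cmd)
    assert r1.state == r2.state == {"b": 2}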

------
unwind
Too bad that the backup generator refuelling pumps have been submerged (while
the generators themselves are running).

That sounds like some really ... unfortunate planning of the positioning of
these machines. It made me think of the Fukushima Daiichi incident, when
backup assets failed to come online because parts of the backup infrastructure
were destroyed. Not as serious, of course. Fog Creek's hosting isn't a nuclear
power plant. :)

Makes me glad I don't work with things that are as hard to test in the real
world as these kinds of backup solutions must be. I hope they manage the
refuelling, somehow.

~~~
mikeash
That's the first thing I thought of too. Seems like there should be a new rule
that your backup generator infrastructure should be on at least the second
floor.

~~~
emeraldd
That means the fuel tanks will have to be there as well, or you're going to
need some specialized lines/equipment to prime the pumps. How much would a 3"
cylinder of diesel fuel about ten feet tall weigh? (Whatever it is, that would
be a crapload of vacuum to produce/maintain.) Then you have the weight of the
tanks themselves and refilling logistics. All fun engineering problems. :)
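
A rough back-of-the-envelope on both questions, assuming ~840 kg/m^3 for
diesel (the exact density varies): the column itself is light; the vacuum is
the real limit, since no amount of suction can lift fuel past the point where
the column's weight balances atmospheric pressure.

    import math

    RHO = 840.0                      # diesel density, kg/m^3 (approximate)
    r = 1.5 * 0.0254                 # 3" diameter -> 1.5" radius, in metres
    h = 10 * 0.3048                  # ten feet, in metres

    mass = RHO * math.pi * r**2 * h
    print("mass per 10-ft section: %.1f kg" % mass)   # ~11.7 kg, ~26 lb

    # Hard ceiling on suction lift: perfect vacuum vs. column weight.
    max_lift = 101325 / (RHO * 9.81)
    print("max theoretical suction lift: %.1f m" % max_lift)  # ~12.3 m

So even a perfect vacuum tops out around four storeys; a 17th-floor tank has
to be pushed from below (or relayed), not sucked from above.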

~~~
barrkel
What if you pumped it up by having a hydrostatically balanced system: instead
of sucking up the fuel, pump water down to drive a piston that pushes the fuel
up.

~~~
emeraldd
Interesting idea. I'm not an engineer so I can't really speak to that. It
sounds like you'd be adding more points of failure and still have the
potential of a submerged pump being an issue. Maybe something other than a
piston?

~~~
barrkel
A flooded piston would be more reliable than a flooded motor IMO. Lots of
reliability critical things like vehicle brakes use pistons. All you would
need is enclosed tubing sufficient to withstand the internal pressure.
Submersion shouldn't be an issue because the system already needs to be sealed
- the pressure would cause a leak if it wasn't already sealed, and in fact
flooding would reduce the probability of a leak by reducing the pressure
differential.

If you keep sufficient water (up to the weight of the fuel) at the location
where it's needed, the entire system wouldn't need a pump at all when called
into action, just a tap - gravity would be sufficient. This is presuming that
the fuel is kept at basement level for safety purposes, of course; otherwise
you could just keep the fuel where you're storing the water. You can get by
with less water and active pumping, since hydraulics are easy to turn into
gearing (force multiplication) effects.
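
Back-of-the-envelope on the gravity-fed version, assuming ~1000 kg/m^3 for
water and ~840 kg/m^3 for diesel, equal piston areas, and no friction losses:

    RHO_WATER, RHO_DIESEL = 1000.0, 840.0   # kg/m^3, approximate
    lift = 17 * 3.5                 # ~17 storeys at ~3.5 m each (a guess)

    # Pressure balance across the piston:
    #   rho_water * g * h_water >= rho_diesel * g * h_fuel
    h_water = lift * RHO_DIESEL / RHO_WATER
    print("water head to lift diesel %.0f m: %.0f m" % (lift, h_water))

So roughly 50 m of water head to lift the fuel ~60 m; storing that much water
up high is the catch, hence the "less water and active pumping" variant.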

------
seiji
There are pretty much two ways to deal with this. Either admit this is a
low-probability failure scenario, that it isn't cost effective to have global
redundancies, and that the outage will be resolved as soon as possible. Or
admit you failed to build a georedundant HA infrastructure and apologize,
with a tentative plan to build out a redundant infrastructure in a different
catastrophe zone.

 _move_ the servers? On what planet is server infrastructure movable on a
whim?

~~~
gecko
It is movable. It would take a few days. We have another DC. We wouldn't have
said it if we didn't know it was doable.

 _EDIT:_ Just because it's feasible doesn't mean we will actually do it; just
wanted to clarify that we weren't firing from the hip.

~~~
tinco
Wow, you expect Fog Creek to be down for that long? Why can't you just scp
everything on there to your other DC? Or at least just move the hard drives?
Seems like that would take less time.

~~~
gecko
Kiln alone has >4 TB of data; you want to SCP that with 90 minutes' heads-up?

Having power and/or rack space is not the same as having servers, switches,
etc. anyway.

Hopefully it will not be down that long. We'll let you know more when we know
more.
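
For scale, the arithmetic on that (assuming exactly 4 TB and the full 90
minutes, both generous):

    data = 4e12            # bytes; ">4 TB", so this is a floor
    window = 90 * 60       # seconds
    rate = data / window
    print("%.0f MB/s, or about %.1f Gbit/s sustained" %
          (rate / 1e6, rate * 8 / 1e9))
    # ~740 MB/s, ~5.9 Gbit/s -- for 90 straight minutes, out of a
    # datacenter that is about to lose power. Not happening over scp.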

~~~
modoc
Do you guys do off-site backups?

~~~
shadytrees
Yes we do. Unfortunately, that's all they are -- data backups, sans
infrastructure.

~~~
modoc
Sure, but that invalidates the complaint of it being infeasible to scp out 4
TB of data before your NYC DC runs out of power. Those 4 TB of data are
already out of NYC, safely in some other DC where you have hardware. You just
need new servers/VMs in/near that DC to restore the backups to.

I'm not trying to 2nd guess your ops team, but the whole point of having off-
site backups is to facilitate your RTO plan in case you lose your primary DC
with no warning. I guess I'd be surprised if you don't have a < 24 hour RTO
plan in place. With how quickly you can get VMs and even dedicated servers
provisioned by many hosting providers (minutes to a couple hours), the idea of
physically moving servers off-site into new racks, with new networking,
etc... seems kinda nutty...

------
drewcrawford
So a couple of things:

1) Like any prepared person, I got all my data out of the east coast before
this whole hurricane thing. If someone from Fog Creek hooked me up with some
emergency licenses while they got their stuff sorted out, we'd be fine.
Actually, this would be a good time to switch to self-hosted.

2) Second, while I was doing our hurricane prep, I ran into this blog post
from Joel:

> Copies of the database backups are maintained in both cities, and each city
> serves as a warm backup for the other. If the New York data center goes
> completely south, we’ll wait a while to make sure it’s not coming back up,
> and then we’ll start changing the DNS records and start bringing up our
> customers on the warm backup in Los Angeles. It’s not an instantaneous
> failover, since customers will have to wait for two things: we’ll have to
> decide that a data center is really gone, not just temporarily offline, and
> they’ll have to wait up to 15 minutes for the DNS changes to propagate.
> Still, this is for the once-in-a-lifetime case of an entire data center
> blowing up

[http://webcache.googleusercontent.com/search?q=cache:lHEK939...](http://webcache.googleusercontent.com/search?q=cache:lHEK939AKiEJ:www.joelonsoftware.com/items/2007/07/09.html+&cd=5&hl=en&ct=clnk&gl=us&client=safari)

Obviously this was written in 2007, but they claim to be geographically
redundant and have geographic backups that are "never more than 15 minutes
behind". Presumably things haven't deteriorated since then.

~~~
larrys
"and they’ll have to wait up to 15 minutes for the DNS changes to propagate"

Not sure I see the need for a propagation delay if customers can be pointed to
the new site by simply using domainbackup.com instead of domain.com (in other
words, completely separate DNS, as well as a completely different domain (even
through a completely separate registrar), pointing to a site hosted
elsewhere). They can know this in advance, of course.
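
From the client side, the idea looks something like this sketch (domain.com
and domainbackup.com are the placeholder names from above): the backup name's
records never change, so there is nothing to propagate; clients just fall
through to it.

    import urllib.request

    # Both names would be registered and hosted completely independently.
    HOSTS = ["https://domain.com", "https://domainbackup.com"]

    def fetch(path):
        for host in HOSTS:
            try:
                return urllib.request.urlopen(host + path, timeout=5).read()
            except OSError:
                continue   # primary unreachable; try the backup domain
        raise RuntimeError("all hosts down")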

------
spolsky
I'm kind of an optimist; I believe it'll only be a matter of hours total
outage. The generator is fine, the equipment is fine, the internet
connectivity is fine... the only problem is getting fuel up to the generator
on the 17th floor, while the fuel pumps in the basement are submerged. Someone
will carry it up 17 flights if need be.

~~~
theycallmemorty
They've since completely evacuated the building, no?

~~~
shadytrees
The datacenter is in Zone A, which did receive the mandatory evacuation order
from the city, but we have a few people from Fog Creek onsite or nearby, and
there is at least one person from PEER 1 there right now. Roads to the area
are open; life is returning.

------
kevingessner
We are beginning to bring down all Fog Creek services (FogBugz, Kiln, Trello,
etc.) as our datacenter is shutting down.

fogcreekstatus.typepad.com and @fogcreekstatus on twitter will continue to
have updates.

edit: All Fog Creek services have been shut down ahead of power failure. We'll
update the status blog as we know more from our DC.

------
samarudge
I'm assuming their servers are in LGA11; according to Internap, the basement
has flooded and damaged the fuel pumps for the generators. They also said in
an earlier email that they have no staff on site, and they are urging all
customers to shut down their servers.

Edit: here are the emails I've received from Internap regarding LGA11
<https://gist.github.com/3980482>

~~~
richardwhiuk
Cut long lines to make it more readable. <https://gist.github.com/3980570>

~~~
sparkinson
> For our cloud customers, we will also being shutting down the infrastructure
> at this time.

That bit made me chuckle.

------
RobAley
"Given the preparation work that's gone into this, we are confident that all
of our services will remain available to our customers throughout the
weather." - yesterdays update.

Try not to let your fingers type cheques your datacenters can't cash...!

~~~
jrockway
Nobody was really expecting this much of Manhattan to lose power.

~~~
TillE
It's pretty amusing to see you posting this when a couple days ago, you were
accusing the media of "blowing it out of proportion". Oh hey, it turns out
they were right.

<http://news.ycombinator.com/item?id=4706959>

~~~
jrockway
I never used the words that you "quoted". I said "superstorm" was a stupid
term. It's a hurricane. Use a well-defined word when it's available.

(The hurricane was also predicted to weaken to a tropical storm before
landfall at the time I wrote that.)

~~~
roc
Earlier yesterday it wasn't a hurricane though. And shortly after landfall it
was no longer a hurricane again - even before it weakened. (cold core => post-
tropical cyclone?)

I think using a general catch-all, rather than a narrowly-defined technical
term that didn't/wouldn't universally apply, was actually a prudent and
defensible thing, given their goal of collecting all the concerns of all the
stages of the storm under one umbrella.

Even if their motivation _was_ just stupid news branding/sensationalism.

------
trimbo
As a newish customer of Fogbugz, I'm disappointed.

This storm had days, if not a week, of heads-up. While I agree "no one
expected Manhattan to lose power" (another comment), as someone who heads up
development of a SaaS product, I would have spent most of that week planning
for worst cases and recovery. I constantly think about the worst case
scenarios and how long they'll take to recover, even without huge storms
bearing down on the data center.

So, I'm disappointed. I really love the Fogbugz and Trello products. Now I'm
in a position where I have to question whether we should depend on them.

~~~
tghw
So, you agree that no one expected Manhattan to lose power. No one expected
that in 2003 during the blackouts, either. And in that case, Peer1 was able to
keep the servers running without disruption.

This was a monster of a storm, with unprecedented water levels. Buoys around
New York reported waves 5 times higher than anything on record.

So consider it this way: would you rather the services be significantly more
expensive (remember, doubling the hardware is the _cheap_ part of it) or have
the possibility of a few hours of downtime in a once-in-100-years event?

------
andyjohnson0
Submerged fuel pumps are the same problem that Internap are experiencing.
<https://news.ycombinator.com/item?id=4715889>

~~~
teuobk
Fog Creek's services are hosted at Internap's LGA11 (75 Broad St) data center.

------
abarringer
Seems their previous post[0] from last night showed a bit of overconfidence?

"Consider this the 'Everything is Perfectly Fine Alarm.'"

Having run a few HA datacenters, I don't think that level of confidence is
ever warranted.

[0] [http://status.fogcreek.com/2012/10/feelin-fine-no-
expected-d...](http://status.fogcreek.com/2012/10/feelin-fine-no-expected-
downtime-due-to-hurricane-sandy.html)

------
redler
What's a little surprising to me is that as of now www.fogcreek.com returns
nothing but an error message. Presumably they have control of their DNS, and
could quickly throw something -- even a simple web page with a status message
-- on a server at some other datacenter.

------
calinet6
This is a good reminder that no system is immune to failure, cloud or
otherwise. Georedundancy is expensive and difficult, so it's a delicate trade-
off, but engineering good physical backup systems is also difficult.

Our servers are in a state far away from hurricanes, but in a state with many
other natural disasters, including tornadoes, so it's hard to say if it's a
good trade or not. Interesting question: why aren't there more DCs in Utah,
Wyoming, Idaho, or New Mexico? And is physical location a huge determinant in
where you colo your servers?

~~~
Pwntastic
I know that Arizona is a huge place for DCs just because there aren't any
natural disasters there.

There are a few of them out here in Utah that I know of, but none at the
scale that they really could be. It would make a lot of sense to put some out
here, I would think.

~~~
meaydinli
I guess one of the most famous DCs in Utah is the NSA one:
<http://www.wired.com/threatlevel/2012/03/ff_nsadatacenter/>

------
erre
Well, good luck to them, both personally and in bringing it back soon.

I only wish Trello hadn't tried to reload on its own, so I could still see the
screen before the shutdown. Now all I have is a blank page :(

~~~
william_uk
I've managed to successfully failover to our own emergency backup version:
<https://twitter.com/williamlannen/status/263294924382937090>

------
kevingessner
And we're back! All Fog Creek services are back on line. Our datacenter has
enough fuel for several hours and is working on getting a delivery of more. We
are hoping that Kiln, FogBugz, Trello, and all our services will remain up,
though things are still a bit dicey.

The details are all here: [http://status.fogcreek.com/2012/10/fog-creek-
services-update...](http://status.fogcreek.com/2012/10/fog-creek-services-
updates.html)

Thanks everyone for your patience!

------
panda_person
I initially read that as meaning Fog Creek the company was about to go out of
business.

------
patrickgzill
"When the tide goes out, you find out who is swimming naked." Warren Buffett
said it in reference to economic problems, but somehow it applies...

------
shill
Latest tweet from PEER1, Fog Creek's datacenter.

Storm #Sandy highlights value of #cloud storage
[http://www.peer1hosting.co.uk/industry-news/us-storm-
highlig...](http://www.peer1hosting.co.uk/industry-news/us-storm-highlights-
value-cloud-)

<https://twitter.com/PEER1/status/263197209959493632>

------
joevandyk
Why would loss of power cause "unrecoverable data corruption"? Don't modern
databases work hard to prevent this sort of thing?

~~~
atesti
They use SQL Server for StackOverflow, FogBugz and Kiln. But for Trello they
use only MongoDB.
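
On the question itself: the usual defense is journaling (write-ahead
logging), where a change is made durable on disk before it is applied or
acknowledged, so after a power cut the database replays the intact prefix of
the log. A toy sketch of the idea only -- not any particular database's
implementation; MongoDB's journal and SQL Server's transaction log differ in
the details:

    import json, os

    class WalStore:
        def __init__(self, path):
            self.state = {}
            if os.path.exists(path):       # crash recovery: replay the log
                with open(path) as f:
                    for line in f:
                        try:
                            k, v = json.loads(line)
                        except ValueError:  # torn final write: discard it
                            break
                        self.state[k] = v
            self.log = open(path, "a")

        def set(self, key, value):
            self.log.write(json.dumps([key, value]) + "\n")
            self.log.flush()
            os.fsync(self.log.fileno())     # durable before we acknowledge
            self.state[key] = value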

------
ph0rque
What about Trello? Is that implied as well?

~~~
jjg
Trello is now down. We're simultaneously working on migrating to AWS, as well
as physically moving the hardware as mentioned above.

------
tbourdon
I guess "The Cloud" got rained on in this case.

------
rdl
Wow, that kind of sucks for them. I hope they consider prioritizing some kind
of geographic replication after the storm is done. It adds cost and complexity
(which slows down development, too), but seems like a good tradeoff when you
have customers depending on it.

The geographic load balancing side is basically a solved problem (although you
don't want to use only DNS-based load balancing like Route53 in most cases),
but the hard part is wide area replication of databases for hot failover.

It's pretty easy to do failover if you'll accept a 5-10 minute outage, though.
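
A sketch of that kind of DNS-based failover. Everything here is hypothetical:
the addresses are RFC 5737 examples and update_dns is a stand-in for a real
record change (with Route53, a ChangeResourceRecordSets call):

    import socket
    import time

    PRIMARY, STANDBY = "192.0.2.10", "198.51.100.10"  # example addresses

    def update_dns(name, ip):
        # Stand-in for a real DNS API call.
        print("pointing %s at %s (low TTL set in advance)" % (name, ip))

    def healthy(ip, port=443, timeout=3):
        try:
            socket.create_connection((ip, port), timeout).close()
            return True
        except OSError:
            return False

    failures = 0
    while True:
        failures = 0 if healthy(PRIMARY) else failures + 1
        if failures >= 3:       # several probes in a row, to avoid flapping
            update_dns("www.example.com", STANDBY)
            break
        time.sleep(60)

The probe interval plus the record's TTL is where the "5-10 minute outage"
comes from.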

------
ewams
Reminds me of NASDAQ's post from over a year ago:
<http://news.ycombinator.com/item?id=2928519>

Good luck FogCreek.

------
eric5544
Let me join the chorus of people who are slightly miffed that they don't have
access to their Trello boards. In a way it highlights again the importance of
local storage and off-line accessibility for web apps. I just checked, and in
the Trello app on my Nexus 7 I can still see (and browse) all my boards, for
example (the only problem is that the content is a few days old, as I have
not used the app recently).

------
medinismo
Can someone explain to me how someone like Fog Creek would let an app like
Trello go dark? Don't their carefully selected and perfectly screened
engineers get paid gobs of money to prevent exactly this scenario from
happening, by having data centers in other locations replicate the one you
have in your own house?

~~~
danmaz74
Last time I checked, Trello was a free app...

~~~
kalininalex
Because, per Spolsky, Fog Creek wants to achieve widespread adoption first
before they start charging for it (with probably a free tier remaining).
Other than that, Trello is very much a commercial app.

Fogbugz IS a commercial app, and at $25/mo (if I remember right) it's in the
same category as most other commercial apps, i.e. it's not particularly
cheap. It's still down.

------
dbecker
I keep hearing that this is the worst natural disaster that people (who are
currently alive) have ever seen in NYC.

It's a bummer that their sites are down... but I think I can go a day without
my to-do list when they have a once-in-a-lifetime natural disaster.

For perspective, it's not like they have down-time once every few months.

------
mfrankel
Unfortunately, smaller Peer 1 companies like mine were told that power was
going to be shut off, and we therefore brought down our servers.

We're still down. I'm happy for FogCreek, and I'm generally happy with Peer 1,
but I wish they had been honest with everyone in this situation.

~~~
mhp
The issue is communication between the boots on the ground and the NOC.
Earlier today, we brought our servers down too. People that stayed up
(squarespace) have been up the whole time. Until I understood the whole
situation, we made the same choice you did. Now that I know more and I spent
the day at the DC, I realized we should get 1-2hr warning before power goes
out. Peer1 _should_ be running even after all fuel in the header (shared tank)
on 17 runs dry... at least for a little bit. I can't guarantee it, but that's
my take on the facts I have at hand (I am President of Fog Creek and spent the
day at Peer1 and our office. They are currently using my aquarium pumps to try
to pump diesel to the tank on the 17th floor).

~~~
mfrankel
Thanks for your assessment. I actually chose Peer 1 because of the
recommendation from FogCreek and I'll still thank you even after this
incident.

I think it's ironic that Peer 1 is getting accolades from Business Insider's
Squarespace article, while their misinforming email has hurt my firm and the
small companies that use our software.

I won't hold my breath for an apology email. That's life in the big city.

~~~
mfrankel
It's 7:30pm in NYC and after a few persistent support requests, we're back up.

Thanks to Michael and Joel for sharing their information and assessments with
all of us.

And now that we're up, thanks to Peer 1 for doing their best in a trying
situation.

~~~
mhp
Send me an email at my name at my company and I'll keep you updated with what
I know.

------
smackfu
Curious: what kind of architecture shows a 503 error when your servers are
dead (as theirs currently do), but can't show an error status page?
Presumably that server is not in the dead datacenter.

Or is it just that something at the datacenter level is redundant?

~~~
kevingessner
We've shut down all of our servers to protect data, except some of the
outermost infrastructure and gateways. That 503 is coming from HAProxy, our
load balancer -- it's unable to send your traffic to any of the (powered-down)
servers.
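
A toy sketch of that shape (an illustration of the behavior, not HAProxy
itself; the backend addresses are RFC 5737 examples): the front end stays
powered, and when no backend accepts a connection, it answers the 503 itself.

    import http.server
    import socket

    BACKENDS = [("192.0.2.21", 80), ("192.0.2.22", 80)]  # example addresses

    def any_backend_up(timeout=0.5):
        for addr in BACKENDS:
            try:
                socket.create_connection(addr, timeout).close()
                return True
            except OSError:
                pass
        return False

    class Frontend(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            if any_backend_up():
                # A real load balancer would forward the request here.
                self.send_error(501, "proxying elided in this sketch")
            else:
                # All backends powered down: answer the 503 ourselves.
                self.send_error(503, "No backends available")

    http.server.HTTPServer(("", 8080), Frontend).serve_forever()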

------
whatgoodisaroad
Does anyone else find it funny to think of the internet running on diesel?

~~~
hashtree
You might be surprised to know how much is normally running on petroleum:
[http://en.wikipedia.org/wiki/List_of_power_stations_in_New_Y...](http://en.wikipedia.org/wiki/List_of_power_stations_in_New_York#Petroleum)

------
Supreme
No redundancy whatsoever? What an amateur operation. I still say Joel is a
fraud.

EDIT: this site is amazing - divergent opinions seem to be actively
discouraged, given how many "points" I've lost thanks to stating mine. Is the
point of this site for all of the members to think in the same way?

~~~
tinco
Of course they have redundancy, just not cross-datacenter redundancy. And if
you knew anything about cross-datacenter redundancy, you'd know it's
something you do not decide upon lightly.

Then again, having cross-datacenter backups that can easily be taken online
would be a bit more professional than 'we want to physically move the
servers'.

~~~
Supreme
Are you kidding me? If you run big sites like FogBugz then _of course_ you
have cross-datacenter redundancy. It's not complicated to host your staging
site in another physical location and point the DNS records to it when things
go pear-shaped.

~~~
tinco
Yes, so this staging site of yours has exactly the same databases as your
production site? Without customer data, Fogbugz and Trello are useless. This
means that this simple staging site of yours needs to have all data
replicated to it, which means it also needs the same hardware provisioned for
it, effectively doubling your physical costs and your maintenance costs and
reducing the simplicity of your architecture. Of course, if you're big enough
you can afford to do this, and one could argue Fog Creek is big enough. I'm
just saying it's not a simple no-brainer.

What is a simple no-brainer, however, is to have offline offsite backups that
can easily be brought online. A best practice is to have your deployment
automated in such a way that deploying to a new datacenter that already has
your data is a trivial thing.

But yeah, if you're running a tight ship, sometimes things like that go
overboard without anyone noticing.

Remember the story of the 100% uptime banking software that ran for years
without ever going down, always applying patches at runtime. Then one day a
patch finally came in that required a reboot, and it was discovered that in
all those years of runtime patches without reboots, no one had ever tested
whether the machine could still boot - and of course it couldn't :)

~~~
Supreme
Data should be backed up to staging nightly _anyway_. There should also be
scripts in place to start this process at an arbitrary point in time and to
import the data into the staging server. You do _not_ need to match the
hardware if you use cloud hosting since you can scale up whenever you want.

Here's where it gets really simple. Resize the staging instance to match
live. Put live into maintenance mode and begin the data transfer to staging
(with a lot of cloud providers, steps #1 and #2 can be done in parallel). As
soon as it finishes copying, take live down, point the DNS records at
staging, and wait for a few minutes. Staging is now live, with all of live's
data. Problem solved. Total downtime: hardly anything compared to not being
prepared. Total data loss: none.
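
That runbook as a script, for concreteness. Every helper is a hypothetical
stand-in for a cloud provider's API or in-house tooling; the point is the
ordering, not the calls.

    # Hypothetical stand-ins; swap in your provider's real API calls.
    def resize(instance, size):    print("resize %s to %s" % (instance, size))
    def maintenance(instance, on): print("maintenance %s: %s" % (instance, on))
    def sync_data(src, dst):       print("sync %s to %s" % (src, dst))
    def point_dns(target):         print("point DNS records at %s" % target)

    def failover_to_staging():
        resize("staging", "live-sized")  # can run in parallel with next step
        maintenance("live", True)        # stop writes so the copy is clean
        sync_data("live", "staging")     # nightly backups keep this small
        point_dns("staging")             # staging is now live

    failover_to_staging()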

~~~
tinco
I fully agree that this is how it could, and perhaps should, be done. But you
assume they are already on cloud hosting, which they obviously aren't. Of
course this is also a choice that has to be made consciously, especially
since Fog Creek has been around a lot longer than the big cloud providers.

You can look to Amazon to see that cloud architecture brings with it hidden
complexity that also increases the risk of downtime, while you relinquish a
lot of control over, for example, the latency and bandwidth between your
nodes.

What I don't know, by the way, is whether the total cost of ownership is
larger for colocation or for cloud hosting.

