
Basecamp Outage Post-mortem - sudhirj
https://m.signalvnoise.com/update-on-basecamp-3-being-stuck-in-read-only-as-of-nov-8-9-22am-cst-c41df1a58352
======
BrentOzar
Short story: storing identities in an integer field instead of a bigint.

Now would be a good time to check your own IDs and get the max value of each.
Sure is easier to fix this ahead of time than when your app is down.
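
For example, on MySQL a check like this (the table and column names here are
placeholders) shows how close a table's IDs are to the signed INT ceiling:

    -- How much of the signed 32-bit ID space has this table used up?
    SELECT MAX(id)                    AS current_max,
           2147483647                 AS signed_int_max,
           100 * MAX(id) / 2147483647 AS percent_used
    FROM widgets;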

~~~
nathanaldensr
Consider using GUIDs/UUIDs instead of integers. GUIDs are friendlier for
replication, generating fake/mock data, and more. The only situations where
I'd recommend against GUIDs are when total row size is a concern (GUIDs are
usually larger on disk than integers) or when some legacy requirement forces
integer keys (e.g., an integration with a third party).

Another thing I've done is use GUIDs for the primary key, then have an
identity/sequence column called something like Sequence, which retains the
sequencing capabilities of a regular integer identity column.
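
A minimal sketch of that layout (MySQL syntax here; the table name is made up
and the UUID is assumed to be generated by the application):

    -- UUID primary key plus an auto-incrementing sequence column.
    -- MySQL requires the AUTO_INCREMENT column to be indexed, hence the UNIQUE KEY.
    CREATE TABLE orders (
        id       BINARY(16) NOT NULL,            -- UUID, generated by the app
        sequence BIGINT     NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (id),
        UNIQUE KEY (sequence)
    );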

~~~
otterley
The flip side is that GUIDs take up a ton of space and are full of entropy,
which makes them very inefficient to work with: they aren't cache-friendly and
they aren't sortable. Certain database engines (e.g., InnoDB on MySQL) also
limit the length of primary keys for performance reasons, which disqualifies
GUIDs from being used. With clustering, short keys are better, as they reduce
the IOPS required to fetch data.
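
To make the size difference concrete (MySQL/InnoDB syntax; the tables below
are just illustrations):

    -- Approximate key widths in InnoDB:
    --   BIGINT                       8 bytes
    --   UUID stored as BINARY(16)   16 bytes
    --   UUID stored as CHAR(36)     36+ bytes, depending on character set
    -- Every secondary index repeats the clustered primary key, so a wide
    -- primary key is paid for once per secondary index, not once per table.
    CREATE TABLE events_int  (id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY);
    CREATE TABLE events_uuid (id BINARY(16) NOT NULL PRIMARY KEY);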

See [https://www.percona.com/blog/2006/10/03/long-primary-key-for-innodb-tables/](https://www.percona.com/blog/2006/10/03/long-primary-key-for-innodb-tables/)
for an interesting discussion of the topic.

In general, it's unwise to make recommendations about what data types to use,
especially for primary keys, without thorough research on their implications.

~~~
snowwolf
I found [https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439](https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439)
has a good rundown of the pros and cons.

------
preek
Great outage ticker by DHH on the longest Basecamp “outage” in 10 years! It
was only 5 hours. I know that feels long for users, and it is. However, it
wasn't even a complete outage; it just put the app into read-only mode. And 10
years is a really long time to maintain 99.998% uptime.

Everything about this is amazing to me. Rails and good small dev teams still
rock, probably more so now than ever!

~~~
xiphias2
99.998% sounds great, but 99% here, not so much:
[https://basecamp.com/3/uptime](https://basecamp.com/3/uptime) It would be
better if it showed a more precise number.

~~~
preek
What do you mean? Your link is showing "Basecamp 3 has been up 99.978% of the
time since launch." as the main header. That's pretty precise(;

~~~
sudhirj
The text starts with 99%, and the .978 is added with JavaScript after page
load. If you turn off JS you’ll only see 99.

------
JshWright
We also ran into this issue a few months ago. Our problem was that the data
model for the table in question is a mess, with literally hundreds of foreign
keys pointing at that ID column (so all of those tables would have needed to
be altered as well). Migrating all of them would have meant an unacceptably
long downtime (i.e., multiple days). We started down that path, but quickly
realized it wasn't a viable option.

So, plan B... It's a signed integer... The auto increment started at 1 and
counted up from there... We literally had 50% of the ID space still available
to us. So now we've started at -1 and are counting our way down from there.
Obviously we're adding rows a lot faster now than we were at the beginning of
time, but it has bought us more than enough time to fix the issue properly.

It took a bit of cleverness to get those negative IDs (MySQL won't let you set
the auto increment start point to a value less than the current max ID), but
the end result is that we got back up and running in a couple hours, as
opposed to the multiple days we initially thought we were looking at.
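
One possible way to do it (just a sketch; the table, column, and trigger names
are made up) is a counter table plus a BEFORE INSERT trigger, since
AUTO_INCREMENT won't count downward on its own:

    -- Hypothetical sketch: hand out descending negative IDs in MySQL.
    CREATE TABLE negative_id_counter (last_id INT NOT NULL);
    INSERT INTO negative_id_counter VALUES (0);

    DELIMITER //
    CREATE TRIGGER big_table_negative_id
    BEFORE INSERT ON big_table
    FOR EACH ROW
    BEGIN
      -- Decrement the counter and use it as the new row's ID.
      UPDATE negative_id_counter SET last_id = (@next_id := last_id - 1);
      SET NEW.id = @next_id;
    END//
    DELIMITER ;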

------
davedx
Stop-the-world DB migrations like this suck. I read the other day that
Postgres can create indexes concurrently; can any DB engine do bigger changes,
like the one in TFA, concurrently? That would be cool.
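
For reference, the Postgres feature mentioned above looks like this (table and
index names are made up); it builds the index without taking the write-blocking
lock a plain CREATE INDEX would:

    -- Postgres: build an index without blocking concurrent writes to the table.
    CREATE INDEX CONCURRENTLY index_accounts_on_owner_id ON accounts (owner_id);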

------
faitswulff
What does data verification in a large rails project like that entail?

------
lykr0n
> We should have known better. We should have done our due diligence when this
> improvement was made to the framework two years ago. I accept full
> responsibility for failing to heed that warning, and by extension for
> causing the multi-hour outage today. I’m really, really sorry

I don't know why, but seeing "I’m really, really sorry" in a Post-mortem
doesn't feel right.

~~~
vikingcaffiene
> I don't know why, but seeing "I’m really, really sorry" in a Post-mortem
> doesn't feel right.

Maybe your comment just reads wrong (context is tough in text format), but I
wish you would reconsider this idea. Here is a business owner accepting full
responsibility and doing his best to provide some insight into what went wrong
and why it took so long to get back online. It reads to me like someone who
cares deeply about his customers and wants to project the sense that they are
his number one concern. It's honestly refreshing, and I wish it were a more
common thing. I'm not a BC customer, but I'll be giving them a try next time
I'm in the market. They clearly care.

