EDIT: Someone, somewhere is going to have a bad day because they didn't know what you did. This is why sharing knowledge is so important. That's part of what HN exists for! Share what you know! Help improve open source tools! Help your fellow IT professional get a good night's sleep.
In that case I don't see why I need to congratulate Stripe for messing up and then posting a PR piece about how they messed up, especially when it's such a trivial mistake (by which I mean nothing technically interesting went wrong). I'll concede that "technically interesting" isn't objective - many things I consider complicated, others would consider basic - so I don't really have the right to serve as the arbiter of what's technically interesting.
What happened? We dropped an index. Why? Bad tooling. Fix? Patch code, add index. Future fix? Vague goals to stop this from happening.
Though I might be taking it too far, I don't see why I need to give props to someone for messing up something relatively basic and then fixing it - don't people complain enough about kids getting participation trophies?
Anyway, I don't mean to call out any specific Stripe engineer; it's a failed process at multiple levels (the person who dropped the index had no visibility into the DB?).
For their customers.
> In that case I don't see why I need to congratulate Stripe for messing up and then posting a PR piece about how they messed up, especially when it's such a trivial mistake (by which I mean nothing technically interesting went wrong).
Few screw-ups are ever that technically interesting. The point of a postmortem isn't to be interesting, it's to explain what went wrong, why, and what you are doing to prevent it from happening again in the future.
Why? How does the honesty (in this case openness really) change the quality?
Genuine question: Would you rather have an org that's always reliable but private in their tech or one that has issues but open about them?
Plenty of companies also communicate status and acknowledge that something is happening without fully expounding on all the internal details. I'm not sure why it makes such a big difference when they do. It feels like fake PR trust to me.
You are, of course, entitled to your opinion. I don't think it's going to hurt their business at all.
To me it seemed that it was a bug in the tooling that split the index change into two separate change requests. I'm sure a change request supports more than one piece of DDL, and it must have worked in the normal case, otherwise they would have run into this problem much earlier. So it was likely some weird corner case.
Now that I think about it, this could happen to us as well. We have peer review for each database change, but only at the source level (so, definition of the schema); the actual commands for schema changes are usually generated automatically. If some bug in that step existed that was only triggered in weird corner cases, we'd be screwed.
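To make that corner case concrete, here's a hypothetical sketch of what such a generation step might look like (all table, index, and column names here are invented for illustration - this is not Stripe's or any real migration tool). The subtle part is that a *changed* index comes out as a DROP plus a CREATE, two statements that are only safe when applied together:

```python
# Hypothetical sketch of auto-generating schema-change DDL by diffing
# two schema definitions. Names are invented for this illustration.

def generate_index_ddl(table, old_indexes, new_indexes):
    """Diff two {index_name: [columns]} maps and emit DDL statements.

    Corner case: a changed index is emitted as a DROP followed by a
    CREATE. Those two statements are only safe applied together, in
    order; if a later step files each one as its own change request,
    the table can be left without the index in between.
    """
    statements = []
    # Indexes removed outright.
    for name in old_indexes:
        if name not in new_indexes:
            statements.append(f"DROP INDEX {name} ON {table}")
    # Indexes added or redefined.
    for name, cols in new_indexes.items():
        if name in old_indexes and old_indexes[name] != cols:
            # Changed definition: tightly coupled drop-then-create pair.
            statements.append(f"DROP INDEX {name} ON {table}")
        if name not in old_indexes or old_indexes[name] != cols:
            statements.append(
                f"CREATE INDEX {name} ON {table} ({', '.join(cols)})"
            )
    return statements

ddl = generate_index_ddl(
    "payments",
    {"idx_created": ["created"]},
    {"idx_created": ["merchant_id", "created"]},
)
print(ddl)
# -> ['DROP INDEX idx_created ON payments',
#     'CREATE INDEX idx_created ON payments (merchant_id, created)']
```

A reviewer looking only at the source-level schema sees a single changed index definition and never sees the drop/create split, which is exactly why a bug at this layer could slip past source-level peer review.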
A backup of our main database takes about 7 hours to restore, and then we'd have to replay the binlogs from the time of the last snapshot up to the failed schema change, so I guess we'd lose about a solid work day if things went south. Yikes.
If you're using any of the DBIx::Class deployment tools they should be perfectly happy writing the DDL to disk and then running it from there, specifically to make it possible to audit the DDL as part of the commit that changed the result classes.
Generated != unauditable, especially when the tools are trying their best to co-operate :D
If nothing else, I don't understand the role of "database operator" if they'll just blindly delete a critical index without thinking about it. Shouldn't that person have known better than anybody how critical the index was?
There are a couple of ways to do database operations. One is to keep the database running - monitor disk space, response times, etc. - but otherwise leave the schema to the developers.
Or you can own the schema, discuss all the changes and migrations with the developers etc.
If it was the first kind of DB operator (which wouldn't surprise me, because that's what our $work has as well), then it's understandable that they trust the developers to provide sane DDL patches.
TBH, it seems odd to call the first one a "database operator" instead of an IT admin.
The issue IMO is splitting an otherwise atomic procedure (the creation/drop of an index) into two change tickets. I'd be interested to know how the DDL was communicated to ops such that it got split.
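One guard against that failure mode - a sketch under the assumption that the tooling mechanically turns each generated statement into its own ticket (the actual mechanism isn't described in the report) - is to refuse to separate coupled statements when building change requests:

```python
# Hypothetical sketch: when turning generated DDL into change requests,
# keep a DROP INDEX in the same request as the CREATE INDEX that
# replaces it, so the pair can't be applied (or skipped) separately.
# The statement format is invented for this illustration.

import re

def group_into_requests(statements):
    """Split DDL into change requests, but never separate a DROP INDEX
    from an immediately following CREATE INDEX on the same index name."""
    requests = []
    i = 0
    while i < len(statements):
        stmt = statements[i]
        drop = re.match(r"DROP INDEX (\w+)", stmt)
        if drop and i + 1 < len(statements):
            create = re.match(r"CREATE INDEX (\w+)", statements[i + 1])
            if create and create.group(1) == drop.group(1):
                # Coupled pair: one atomic change request.
                requests.append([stmt, statements[i + 1]])
                i += 2
                continue
        requests.append([stmt])
        i += 1
    return requests

reqs = group_into_requests([
    "DROP INDEX idx_created ON payments",
    "CREATE INDEX idx_created ON payments (merchant_id, created)",
    "DROP INDEX idx_old ON payments",
])
print(len(reqs))  # 2 requests: the coupled pair, then the lone drop
```

An even safer ordering, where the database supports it, is to create the replacement index under a new name first and drop the old one only after the new one is in place, so there's never a window with no index at all.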
This incident report reminded me of the 'Game Day Exercise' post from 2014, in which one robustness check that should be a continuous-integration kind of test, or at least a daily test of a normally working system, is such a big deal to them that they make a whole 'Game Day' of it - and serious problems result from this one simple test.
After they have lots of paying customers, of course.
I know we are supposed to be positive and supportive on HN, but this was a red flag that the entire department had no idea what an actual robust system looks like - and was so far from it, after having built a substantial amount of software, that expecting them ever to get there may be wishful thinking.
So I am completely unsurprised that they are having this kind of problem. The post-mortem reveals problems that could only occur in systems designed by people who do not think carefully about robustness ... which is consistent with the 2014 post. It kind of shocks me that anyone lets Stripe have anything to do with money.