
I appreciate the post-mortem and, of course, we've all been there. I have to say, though, that the cause is a little surprising. That DDL needs to be executed sequentially is pretty basic and known by, I'm sure, everyone in the engineering and operations organizations at Stripe. It surprises me that an engineering group that is obviously so clever and competent would come up with a process that lost track of the required execution order of two pieces of DDL. If process (like architecture) reflects organization, what does this mean about the organization that came up with this process? It's not sloppiness exactly. Is it over-specialization? It reminds me of that despair.com poster "Meetings: none of us is as dumb as all of us" in that a process was invented that makes the group function below the competence level of any given member of the group.



It's pretty easy to Monday-morning quarterback other orgs' choices or actions here. Every person posting here will have something in their org break at some point that was "obvious" to the rest of us.

EDIT: Someone, somewhere is going to have a bad day because they didn't know what you did. This is why sharing knowledge is so important. That's part of what HN exists for! Share what you know! Help improve open source tools! Help your fellow IT professional get a good night's sleep.


I get that, but I'm wondering: who is this postmortem written for? For other engineers? Not entirely; it seems to be written partly as a PR piece.

In that case I don't need to basically congratulate Stripe on messing up and then posting a PR piece on how they messed up, especially when it's such a trivial mistake (by that I mean there's nothing technically interesting about what went wrong). I guess I'll concede that what is technically interesting is not objective - many things I consider complicated others would consider basic, so I don't really have the right to serve as the arbiter of what is technically interesting.

What happened? We dropped an index. Why? Bad tooling. Fix? Patch code, add index. Future fix? Vague goals to stop this from happening.

Though I might be taking it too far, I don't see why I need to give props to someone for messing up something relatively basic and then fixing it - don't people complain enough about kids getting participation trophies?

Anyway, I don't mean to call out any specific Stripe engineer; it's a failed process at multiple levels (the person who drops an index has no visibility into the DB?).


> I get that, but I'm wondering: who is this postmortem written for?

For their customers.

> In that case I don't need to basically congratulate Stripe on messing up and then posting a PR piece on how they messed up, especially when it's such a trivial mistake (by that I mean there's nothing technically interesting about what went wrong).

Few screw-ups are ever that technically interesting. The point of a postmortem isn't to be interesting; it's to explain what went wrong, why, and what you are doing to prevent it from happening again in the future.


I would encourage you to accept that their post mortem was released in good faith, and that its purpose is both technical knowledge sharing and PR. I know I personally value technical organizations that are honest and forthcoming when things go south.


> I know I personally value technical organizations that are honest and forthcoming when things go south.

Why? How does the honesty (in this case openness really) change the quality?

Genuine question: would you rather have an org that's always reliable but private about their tech, or one that has issues but is open about them?


The second. Because "always reliable" isn't. So when something goes down and there is nothing being communicated, that's truly infuriating.


This particular post-mortem by Stripe makes me trust them less as it's a fairly simple mistake that shouldn't have been made.

Plenty of companies also communicate the status and that something is happening but don't fully expound on all the internal details. I'm not sure why it makes such a big difference that they did here. It feels like fake PR trust to me.


> This particular post-mortem by Stripe makes me trust them less as it's a fairly simple mistake that shouldn't have been made.

You are, of course, entitled to your opinion. I don't think it's going to hurt their business at all.


Didn't say it was going to hurt their business. Only that a post-mortem doesn't somehow change my opinion on their quality or reputation, and in this case is the opposite.


> It surprises me that an engineering group that is obviously so clever and competent would come up with a process that lost track of the required execution order of two pieces of DDL.

To me it seemed that it was a bug in the tooling that split the index change into two separate change requests. I'm sure a change request supports more than one piece of DDL, and it must have worked in the normal case, otherwise they would have run into this problem much earlier. So it was likely some weird corner case.
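
To make the ordering point concrete, here's a minimal sketch with made-up table and index names and MySQL-style syntax (not Stripe's actual change): the two statements only form a safe change if they are applied together and in this order.

    -- Change request A: add the replacement index first.
    CREATE INDEX idx_charges_account_created ON charges (account_id, created_at);

    -- Change request B: drop the old index. Only safe AFTER A has completed.
    -- If tooling emits A and B as independent change requests, nothing
    -- enforces that ordering and B can land on its own.
    DROP INDEX idx_charges_account ON charges;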

Now that I think about it, this could happen to us as well. We have peer review for each database change, but only at the source level (so, definition of the schema); the actual commands for schema changes are usually generated automatically. If some bug in that step existed that was only triggered in weird corner cases, we'd be screwed.

A backup of our main database takes about 7 hours to restore, and then we'd have to replay the binlogs from the time of the last snapshot up to the failed schema changes, so I guess we'd lose about a solid work day if things went south. Yikes.


> Now that I think about it, this could happen to us as well. We have peer review for each database change, but only at the source level (so, definition of the schema); the actual commands for schema changes are usually generated automatically.

If you're using any of the DBIx::Class deployment tools they should be perfectly happy writing the DDL to disk and then running it from there, specifically to make it possible to audit the DDL as part of the commit that changed the result classes.

Generated != unauditable, especially when the tools are trying their best to co-operate :D


I was thinking the same thing.

If nothing else, I don't understand the role of "database operator" if they'll just blindly delete a critical index without thinking about it. Shouldn't that person have known better than anybody how critical the index was?
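
To illustrate what not-blindly-deleting could look like: a quick usage check before applying the drop would flag a hot index. A sketch, assuming Postgres and a hypothetical index name (other databases have equivalent statistics views):

    -- If idx_scan is large and still climbing, the index is load-bearing
    -- and the DROP request deserves a second look before it's applied.
    SELECT relname, indexrelname, idx_scan, idx_tup_read
    FROM pg_stat_user_indexes
    WHERE indexrelname = 'idx_charges_account';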


There is an open secret you may not be aware of: every startup is a shitshow on the inside.


Also top 20 financial institutions, USG orgs and places that store your healthcare and tax information ;-) I'd argue that a good number of startups these days (especially ones born out of larger organizations with lots of combined experience) are way more capable of handling these issues with finesse and speed.


Just startups? It's just as hilarious in corporate America.


There are different levels at which you can operate a database.

One is to keep it running, monitor disc space, response times etc. but otherwise leave the schema to the developers.

Or you can own the schema, discuss all the changes and migrations with the developers etc.

If it was the first kind of DB operation (which wouldn't surprise me, because that's what our $work has as well), it's not surprising that they trust the developers to provide sane DDL patches.


Okay, I've never seen the first kind of DB op. The places I've worked with dedicated DB people were a combination of the first and second types you listed. That situation would explain the Stripe outage, though.

TBH, it seems odd to call the first one a "database operator" instead of an IT admin.


I had this thought too. I'm actually surprised it was something this simple for an engineering org as competent and storied as Stripe; I guess it proves Murphy gets all of us at one point or another.

The issue IMO is splitting an otherwise atomic procedure (creation/drop of an index) into two change tickets. I'd be interested to know how the DDL was communicated to ops in a way that led to it getting split.


We do a ton of database updates regularly, and we rely on a simple wiki we put together that has all the install instructions for a release, which go through peer review by senior engineers. Seems like having a single tool/document for code and db updates could have mitigated this?


How about automated migrations? At the scale of Stripe, you may want to have dedicated personnel keeping an eye on it, but had it been in the same SQL file, none of that would have happened...
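
Roughly what that could look like, as a sketch assuming a database with transactional DDL (e.g. Postgres) and made-up names: with both statements in one migration file, inside one transaction, the drop cannot be applied without the replacement index.

    -- migrations/0042_replace_charges_index.sql (hypothetical file)
    BEGIN;
    CREATE INDEX idx_charges_account_created ON charges (account_id, created_at);
    DROP INDEX idx_charges_account;
    COMMIT;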


Don't presume to give them so much benefit of the doubt. Apparently Stripe are kind of jokers.

This incident report reminded me of the 'Game Day Exercise' post from 2014:

https://stripe.com/blog/game-day-exercises-at-stripe

in which one robustness check that should be a continuous-integration kind of test, or at least a daily test of a normally working system, is such a big deal to them that they make a big 'Game Day' about it, and serious problems result from this one simple test.

After they have lots of paying customers, of course.

I know we are supposed to be positive and supportive on HN, but this was a red flag that the entire department has no idea what an actual robust system looks like and was so far away from that, after having built a substantial amount of software, that expecting them to ever get there may be wishful thinking.

So I am completely unsurprised that they are having this kind of problem. The post-mortem reveals problems that could only occur in systems designed by people who do not think carefully about robustness ... which is consistent with the 2014 post. It kind of shocks me that anyone lets Stripe have anything to do with money.




