
Crates.io incident report for 2020-02-20 - lukastyrychtr
https://blog.rust-lang.org/inside-rust/2020/02/26/crates-io-incident-report.html
======
klysm
This seems like a consequence of using thin wrappers around C libraries in
Rust, or comparably of using a Java API from Scala, where the error handling
methodology is significantly different.

In this case, the git2 crate (https://docs.rs/git2/) is a thin wrapper over
libgit2 (https://libgit2.org).
To ensure a push went through properly you have to watch what happens inside a
callback. The docs state the following:

 _Note that you'll likely want to use RemoteCallbacks and set
push_update_reference to test whether all the references were pushed
successfully._

It looks like the only sensible way to use this is to close over mutable
state and then check it imperatively afterwards, which is far from idiomatic
(this is what they do in the patch). If you were pushing multiple refs it
would be even worse!
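
Very roughly, and assuming a local repository, an "origin" remote, and a
single refspec (all placeholders here), that pattern looks something like
this sketch:

    use git2::{Error, PushOptions, RemoteCallbacks, Repository};

    // Sketch of the "close over mutable state" pattern described above.
    // The repository path, remote name, and refspec are placeholders.
    fn push_and_check(repo_path: &str, refspec: &str) -> Result<(), Error> {
        let repo = Repository::open(repo_path)?;
        let mut remote = repo.find_remote("origin")?;

        // Collect any per-ref failures reported by the server.
        let mut ref_errors: Vec<String> = Vec::new();

        {
            let mut callbacks = RemoteCallbacks::new();
            // `status` is None when the ref was accepted, Some(msg) when rejected.
            callbacks.push_update_reference(|refname, status| {
                if let Some(msg) = status {
                    ref_errors.push(format!("{}: {}", refname, msg));
                }
                Ok(())
            });

            let mut opts = PushOptions::new();
            opts.remote_callbacks(callbacks);

            // push() can return Ok(()) even when the server rejected a ref,
            // which is why the bookkeeping in the callback is needed at all.
            remote.push(&[refspec], Some(&mut opts))?;
        }

        // Only now can we tell whether every ref actually landed.
        if ref_errors.is_empty() {
            Ok(())
        } else {
            Err(Error::from_str(&ref_errors.join("; ")))
        }
    }

The per-ref status only arrives through the callback, so the caller has to
smuggle it out via captured mutable state before it can decide whether the
push really succeeded.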

Usually, when you use some Rust code, it's pretty easy to tell what can fail
simply by looking at the types - this is a huge win. Thin wrappers can subtly
violate this intuition: sometimes errors are handled idiomatically and
sometimes they aren't, leading to a sort of false sense of security.

The takeaway here is to exercise an abundance of caution when using a thin
wrapper and to understand just how thick it really is, especially around errors.

~~~
kbenson
> Thin wrappers can subtly violate this intuition: sometimes errors are
> handled idiomatically and sometimes they aren't, leading to a sort of false
> sense of security.

Unfortunately common. That said, I think the best route is usually a thin
wrapper module, and then a module that attempts to map that to something more
idiomatic. At least then there's a clear place where changes/additions in the
underlying library should be handled. I've lived through the problem of
multiple modules that each try to map the underlying library directly to an
idiomatic interface. They're good for a while, but they always seem much more
susceptible to drift in the underlying library causing problems over time.

~~~
stouset
Agreed! This is how I've always developed FFI code, for both Rust and Ruby. A
thin wrapper that does little more than FFI and maybe mapping onto native
memory-management techniques, and then a separate higher-level wrapper that
provides as idiomatic an interface on top of the library as possible.

------
btmorex
> We should also strive to reduce the amount of time PRs sit in master without
> being live.

The simplest solution, at the scale and level of reliability I've had
experience with, is to effectively say that master is what's live. In other
words, a merge to master triggers a deploy under normal circumstances. It'll
save you a lot of headaches. That, plus a one-command/one-click deploy of the
last known good build and feature flags on all major updates, pretty much
removes the whole deploy process as a significant failure risk.

~~~
rabidferret
Oftentimes database changes are a complex multi-stage process. Since we're
still a small team, we're able to have them merged as a single PR, and the
author deploys each commit as needed. A complicated CD process would just add
more burden on our team at this size, and is really more useful when you have
enough people contributing that those sorts of complex deploys block other
people's work.

~~~
yazaddaruvala
This is counter to my experience. I've worked with multiple small teams (six
people at most) that all have datastore integrations, and CI/CD pipelines
generally make them faster.

> those sort of complex deploys block other people's work

I think this is the root cause of your concerns: the database migrations seem
to always be backwards incompatible, so deployments and migrations need to be
synced, blocking other changes.

Typically, CI/CD pipelines are never blocked.

Making sure the database migration and code change are backwards compatible
is marginally more tedious (i.e. multiple incremental changes across multiple
days), but it speeds up development overall.

~~~
jbmsf
I think you make some assumptions that are context-specific.

The decision to always be backwards compatible is just that, a decision. It
comes with some really nice benefits around CI/CD, but also adds overhead for
certain kinds of development, especially if data models are evolving rapidly.
I've been in situations where we made the explicit decision not to be backwards
compatible due to business context that prioritized those evolutions.

Likewise, deploying to production continuously is not always an option. I've
been in situations where customer relationships (contractual or otherwise)
require that changes not be made continuously. Accidental outages or
regressions are simply not an acceptable risk in some business contexts.

~~~
Aeolun
On the other hand, without continuous deployment, my experience is that a
deployment to live _always_ goes wrong, meaning you have to budget an entire
night for one deployment and hope you'll fix all the issues before the next
morning.

~~~
jbmsf
As always, it depends.

I've seen staged releases work quite well at multiple companies and fail
miserably at others.

That said, continuous deployment is the right choice in many/most contexts.
I'd just prefer that people see it as a choice instead of assuming that it's
the only way that can work.

------
Aeolun
Am I the only one who thinks this postmortem is more professional and clear
than many of the ones I see written by actual companies?

------
jasonpeacock
> We should also strive to reduce the amount of time PRs sit in master without
> being live.

I'd love to see an action item to address this. The CI/CD system should be
pushing all changes in master to live immediately. Or does a CI/CD system
still need to be set up for this?

------
kryogen1c
Not knowing anything about crates, I find this a very amateurish and subpar
postmortem; it provides neither a simple executive summary nor a technical
deep-dive. Problems aren't annotated (or even clearly identified) and so can't
be correlated to solutions or a POAM. Everything is written in a very passive,
non-action-oriented voice. I do not leave this confident that issues have been
identified, triaged, and queued for repair or mitigation.

> Deploying the change took way longer than expected

What does this mean? How much longer is "way longer"? What was your RTO goal,
how did you fail to meet it, and how will you meet it in the future?

> We should also strive to reduce the amount of time PRs sit in master without
> being live.

That's not a plan. Everyone's striving at whatever they're doing. Were you not
striving before? How will you accomplish your goal of minimizing non-live time?

> It took 1 hour and 31 minutes from the start of the incident to the deploy
> of the fix.

"deploy of the fix" is an unclear statement. does that mean when code was
pushed, or when you verified your clients no longer had problems?

~~~
jen20
There is a clear timeline shown at the end which answers the last of these
questions — it was 1 hour and 31 minutes until all ill effects of the bug were
addressed and service was fully restored.

~~~
kryogen1c
> There is a clear timeline shown at the end

Yes, I saw it, and it did not clarify the question to which you are
responding, which is obvious because...

> all ill effects of the bug were addressed and service was fully restored.

...this is not true. The timeline says a customer problem was submitted the
next day, and additional action was required to ensure the service was
"fully restored".

