
Move Fast and Fix Things - samlambert
http://githubengineering.com/move-fast/
======
jerf
I'll highlight something I've learned both by succeeding and by failing at
this: when rewriting something, you should generally strive for a _drop-in_
replacement that does the same thing, in some cases even matching it
bug-for-bug or, as in the article, taking a _very close look_ at the new vs.
the old bugs.

It's tempting to throw away the old thing and write a brand new bright shiny
thing with a new API and new data models and generally NEW ALL THE THINGS!,
but that is a high-risk approach that usually comes without correspondingly
high payoffs. The closer you can get to a drop-in replacement, the happier
you will be. You can then separate the risks of deployment from the risks of
the new shiny features/bug fixes you want to deploy, and since risks tend to
multiply rather than add, anything you can do to split a risk into two halves
is almost always a big win, even if the "total risk" is in some sense the
same.

Took me a lot of years to learn this. (Currently paying for the fact that I
just sorta failed to do a correct drop-in replacement, because I was drop-in
replacing a system with no test coverage, no official semantics, and not even
agreement among all its consumers about what it was and how it worked, let
alone how it _should_ work.)

~~~
colanderman
You can break this down even more.

As we speak, I'm "replacing" old code by just _writing a wrapper around it_
with the new API it should have.

 _Then_ I'll rewrite it without the wrapper, bug-for-bug.

And _then_ I'll actually fix the bugs.
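
For the first stage, a minimal Ruby sketch of what I mean (hypothetical
names, not from a real codebase):

    # Hypothetical old code with the awkward API it grew over the years.
    class LegacyReport
      def initialize(rows)
        @rows = rows
      end

      def compute!          # old API: mutate internal state...
        @total = @rows.sum
      end

      def get_total         # ...then read it back with a getter
        @total
      end
    end

    # Stage 1: a thin wrapper that presents the API the code _should_ have,
    # implemented entirely in terms of the old code underneath.
    class Report
      def initialize(rows)
        @legacy = LegacyReport.new(rows)
      end

      def total             # new API: one call, no visible state
        @legacy.compute!
        @legacy.get_total
      end
    end

    Report.new([1, 2, 3]).total # => 6

Callers migrate to the wrapper immediately; stages 2 and 3 can then happen
behind it without anyone noticing.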

~~~
derefr
There's a simpler method than this that provides even _more_ surety, used by
e.g. LibreSSL:

1\. Start writing your new implementation (or heavily refactoring your old
implementation, whichever), but in parallel, for each legacy function you
remove, write an equivalent "legacy wrapper" function that implements the
_old_ API (and ABI; you have to return the same structs and all) in terms of
the _new_ API.

2\. As you develop the new code, continue to run the _old code's_ tests.
(This shouldn't require any work; as far as the tests can tell, the codebase
containing all of {the new code, what's left of the old code, and the legacy
wrapper} presents exactly the same ABI as the old codebase.) The old tests
should still all pass, at every step.

3\. Once you're finished developing the new code, and all the old code's tests
are passing, rewrite the tests in terms of the new API.

4\. Split off all the legacy-wrapper code into a new, second library project;
give it the new "core" library as a dependency. Copy all the _old_ tests—from
a commit before you rewrote them—into this project, too. This wrapper library
can now be consumed in place of the original legacy library. Keeping this
wrapper library up-to-date acts to ensure that your new code _remains_ ABI-
compatible; the old tests are now _regression-tests_ on whether a change to
the new "core" library breaks the legacy-ABI-wrapper library.

5\. Document and release your new core as a separate, new library, and
encourage devs to adopt it in place of the legacy library; release the legacy-
wrapper (with its new-core dependency) as the next major version of the _old_
library.

When all-or-nearly-all downstream devs have transitioned from the legacy
wrapper to the new core, you can stop supporting/updating the legacy wrapper
and stop worrying about your updates breaking it. You're free!

In LibreSSL, if you're wondering, the "new core" from above is called libtls,
and the "legacy wrapper" from above is called libssl—which is, of course, the
same linker name as OpenSSL's library, with a new major version.
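
In miniature, and translated to Ruby (LibreSSL is C, so take this as a shape
sketch with made-up names rather than the real thing), steps 1 and 2 look
like:

    # The new core: the API you actually want going forward.
    module NewCore
      Result = Struct.new(:ok, :value)

      def self.merge(base, ours, theirs)
        # ...new implementation; placeholder logic for the sketch...
        ours == theirs ? Result.new(true, ours) : Result.new(false, nil)
      end
    end

    # The legacy wrapper: re-implements the _old_ entry points (old names,
    # old return conventions) in terms of the new core. The old test suite
    # only touches these, so it should keep passing at every step.
    module LegacyAPI
      # Old convention: return the merged value, or nil on conflict.
      def self.old_merge(base, ours, theirs)
        result = NewCore.merge(base, ours, theirs)
        result.ok ? result.value : nil
      end
    end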

~~~
EvanPlaice
This is a textbook example of 'How to Kill an OSS project'

#1, #4, and #5 are completely unnecessary.

If it's desirable to 'kill with fire' the old codebase, you could always
create a fork and merge the breaking changes on the next major release.

Creating a second library is a terrible idea for 2 reasons...

You lose the legacy source control history which is (arguably) more valuable
than the current source because it can be used to research solutions to old
problems.

You split the community, which is devastating to the culture and
survivability of an OSS project. Even something as simple as a name change
will have massive negative impacts on the level of contributions. Only the
most popular and actively developed projects can get away with forking the
community.

LibreSSL will likely survive because everything in the world that requires
crypto uses OpenSSL. Even then, that was the absolutely _wrong_ way to go
about things.

The only solid justification for a rename and complete rewrite is if there
are license/copyright issues.

~~~
derefr
> You lose the legacy source control history which is (arguably) more valuable
> than the current source because it can be used to research solutions to old
> problems.

No reason for that. Both projects—the wrapper and the new core—can branch off
the original. Create a commit that removes one half of the files on one side,
and the other half of the files on the other, and make each new commit
"master" of its repo, and now you've got two projects with one ancestor.
Project mitosis.

> You split the community

How so? I'm presuming a scenario here where either 1. you were the sole
maintainer for the old code, and it's become such a Big Ball of Mud that
nothing's getting done; or 2. the maintainer of the old code is someone else
who is really, really bad at their job, and you're "forking to replace" with
community buy-in (OpenSSL, Node.js leading to the io.js fork, gcc circa 1997
with the egcs fork, MySQL leading to MariaDB, etc.).

In both scenarios, development of the old code has already basically _slowed
to a halt_. There _is_ no active community contributing to it; or if there is,
it is with great disgust and trepidation, mostly just engineers at large
companies that have to fix upstream bugs to get their own code working (i.e.
"I'm doing it because they pay me.") There are a lot of privately-maintained
forks internal to companies, too, sharing around patches the upstream just
won't accept for some reason. The ecosystem around the project is _unhealthy_
†.

When you release the new legacy wrapper, it _replaces_ the old library—the
legacy wrapper is now the only supported "release" of the old library. It's
there as a stopgap for Enterprise Clients with effectively-dead projects which
have ossified around the old library's ABI, so these projects can continue to
be kept current with security updates et al. It's _not_ there for anyone to
choose as a target for their new project! No new features will ever be added
to the wrapper. It's a permanent Long-Term Support release, with (elegant,
automatic) backporting of security updates, and that's it. Nobody starting a
project would decide to build on it any more than they'd build on e.g. Apache
1.3, or Ubuntu 12.04.

> Even something as a simple name change will have massive negative impacts on
> the level of contributions.

Names are IP, obviously (so if you're a third party, you _have to_ rename the
project), but they're more than that—names are associated in our brains with
reflexes and conventions for how we build things.

The reason Perl 6, Python 3, etc. have so much trouble with adoption is that
people come into them expecting to be able to reuse the muscle-memory of the
APIs of Perl 5/Python 2. They'd have been _much_ better off marketed as
completely new languages that happened to be package-ecosystem-compatible
with the previous language, like Elixir is to Erlang or Clojure is to Java.

If these releases were accompanied by their creators saying "Python/Perl is
dead, long live _____!" then there'd have been a _much_ more dramatic
switchover to the new APIs. Managers understand "the upstream is dead and we
have to switch" much more easily than they understand "the upstream has a new
somewhat-incompatible major version with some great benefits."

One good example: there's a reason Swift wasn't released as "Objective-C 3.0".
As it is, ObjC is "obviously dead" (even though Apple hasn't said anything to
that effect!) and Swift is "the thing everyone will be using from here on, so
we'd better move over to it." In a parallel reality, we'd have this very slow
shift from ObjC2 to ObjC3 that would never fully complete.

\---

† If the ecosystem were healthy, obviously you don't need the legacy wrapper.
As you say, just release the new library as the new major version of the old
library—or call the new library "foo2", as many projects have done—and tell
people to switch, and they will.

It's easy to find healthy projects like this when you live close enough to the
cutting-edge that all your downstream consumers are still in active
development, possibly pre-1.0 development. The Node, Elixir, Go and Rust
communities look a lot like this right now; any project can just "restart" and
that doesn't trouble anybody. Everyone rewrites bits and pieces of their code
all the time to track their upstreams' fresh-new-hotness APIs. That's a lot of
what people _mean_ when they talk about using a "hip new language": the fact
that they won't have to deal with stupid APIs for very long, because stupid
APIs get _replaced_.

But imagine trying to do the same thing to, say, C#, or Java, or any other
language with Enterprise barnacles. Imagine trying to tell people consuming
Java's DateTime library that "the version of DateTime in Java 9 is now
JodaTime, and everyone has to rewrite their date-handling code to use the
JodaTime API." While the end results would probably have 10x fewer bugs,
because JodaTime is an excellent API whose UX makes the pertinent questions
obvious and gives devs the right intuitions... a rewrite like that just ain't
gonna happen. Java 9 needs a DateTime that looks and acts like DateTime.

------
cantlin
The strategy of proxying real usage to a second code path is incredibly
effective. For months before the relaunch of theguardian.com, we mirrored
traffic from the old site against the new stack to understand how it could
be expected to perform in the real world. Later, of course, we moved real
users over, as incrementally as we possibly could.

The hardest risk to mitigate is that users just won't like your new thing. But
taking bugs and performance bottlenecks out of the picture ahead of time
certainly ups your chances.

~~~
bertr4nd
Out of curiosity - when you've done this type of proxy test, what do you do
about write operations? Do you proxy to a test DB, or do you have your code
neatly factored to avoid writing on the test path (I guess most code I've
worked on that needed a rewrite also wasn't neatly factored :) ).

------
mwcampbell
This is tangential, but given the increasing functionality and maturity of
libgit2, I wonder if it would yet be feasible to replace the Git command-line
program with a new one based on libgit2, and written to be as portable as
libgit2. Then there would be just one Git implementation, across the command
line, GUIs, and web-based services like GitHub. Also, the new CLI could run
natively on Windows, without MSYS.

~~~
solutionyogi
That is my dream. Right now, many Windows GUIs (SmartGit/SourceTree) use
'git.exe' to manage the actual Git repositories.

Once libgit2 is fully mature, I can imagine more GUIs/tools being built to
manage/analyze git repositories.

~~~
leonardinius
I believe the GitHub Windows client is going that route:
[https://github.com/blog/1127-github-for-windows](https://github.com/blog/1127-github-for-windows).

I also remember some blog post going into sync/async details of git.exe vs
libgit2 stuff. Will try to google it.

~~~
MarkSweep
This is the article you speak of:
[http://githubengineering.com/git-concurrency-in-github-deskt...](http://githubengineering.com/git-concurrency-in-github-desktop/)

The AsyncReaderWriterLock mentioned in the blog does not directly show up in
Google, but it appears to be based on the one in this blog post:
[http://blogs.msdn.com/b/pfxteam/archive/2012/02/12/building-...](http://blogs.msdn.com/b/pfxteam/archive/2012/02/12/building-async-coordination-primitives-part-7-asyncreaderwriterlock.aspx)

------
rcthompson
How does Scientist work with code that produces side effects? In the
example, presumably the new and old implementations each create a merge
commit. Maybe these two merge commits are made in in-memory copies of the
repo so that the test result can just be discarded, but what about the
general case, where a function produces an output file or some other
external effect?

~~~
jdmichal
I would think that the operation would either (a) have to be pure or (b) be
executed in two different environments. I think going for (a) is the easier
approach. If you produce an output file, make a pure operation that generates
the contents, then write them out in a subsequent operation. Now you can test
the contents against each other, but only actually write one of them.

Basically, create an intermediary object that represents your state change and
test those. Then "commit" the change from control and discard the one from the
experiment.
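
With Scientist's public API, the pure-generation approach in (a) might look
something like this (the renderer names and file path here are made up):

    require "scientist"

    class ReportWriter
      include Scientist

      def write_report(data, path)
        # Pure step: both implementations only *generate* the contents.
        # `science` runs both, compares them, and returns the control value.
        contents = science "report-contents" do |e|
          e.use { legacy_render(data) } # control: old generator
          e.try { new_render(data) }    # candidate: new generator
        end

        # Side-effecting step: only the control's output is ever written.
        File.write(path, contents)
      end

      private

      # Stand-ins for the real old and new generators.
      def legacy_render(data)
        data.map(&:to_s).join("\n")
      end

      def new_render(data)
        data.map(&:to_s).join("\n")
      end
    end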

------
smg
I am trying to understand why the new merge method needed to be tested
online via an experiment. Both the correctness and the performance of the new
merge method could have been tested offline against snapshots (backups) of
repos. Could a GitHub engineer shed more light here?

~~~
tanoku
Author here. 5 years ago I would have agreed with you and logged e.g. 10
million merge requests to replay them offline. But one thing I've found over
the years (which may seem obvious in retrospect) is that staging environments
are not identical to production, particularly not when it comes to finding
sneaky bugs and performance regressions -- the code doesn't run in the exact
environment it will run in once deployed (it has different input behaviors
and, most importantly, different load and performance characteristics).

The question then becomes "why would you run these experiments offline when
you can run them online?". So we simply do. I personally feel it's a game
changer.

~~~
humanrebar
> why would you run these experiments offline when you can run them online?

It probably doesn't apply in this case, but many bugs cannot be replicated in
the production environment. At least, you don't want to break things in
production to see if your software does the right thing in adverse scenarios.

~~~
allannienhuis
Some would say that's a really good strategy. Keeps you on your toes...

[http://techblog.netflix.com/2012/07/chaos-monkey-released-in...](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html)

------
clebio
Seems like the biggest takeaway is "have good tooling and instrumentation".
I'm working with a complicated legacy production system, trying to rebuild
pieces of it, and we have little or no instrumentation. Even _introducing_
such tooling is a potentially breaking change to production systems. What a
pity.

------
daveguy
Very cool. I like this parallel execution of the original version and the
updated version, with comparisons between the two. They use a Ruby package
developed in-house that has been open-sourced, Scientist. Does anyone know if
there is a similar package for Python (preferably 2.7) development? It seems
like an interesting area in between unit tests and A/B tests.

------
eric_h
> Finally, we removed the old implementation — which frankly is the most
> gratifying part of this whole process.

On average, I get much more satisfaction from removing code than I do from
adding new code. Admittedly, on occasion I'm very satisfied with new code, but
on average, it's the removing that wins my heart.

~~~
LaurentVB
I've long been dreaming of adding the following tagline to my resume: "Fixing
bugs by removing code since 2006"

------
_yosefk
TIL that GitHub used to merge files differently than git, because it used
its own merge implementation based on git's code to make it work on bare
repos. Showcases a benefit of open formats and open source, and showcases a
downside as well (I'd never have guessed it might merge differently).

It's a good thing nobody contributes to my GitHub repos, since no one had the
chance to run into the issue...

------
danielsamuels
I wish they would add the ability to fast-forward merge from pull requests. I
know many large projects (including Django) accept pull requests but don't
merge them on Github simply because of the mess it makes of the history.

~~~
lumpypua
In my team's projects, code review in PRs made a mess of the history (certain
devs in particular :P). We switched to a squash merge based workflow to
address it, git reflow is our particular poison:
[https://github.com/reenhanced/gitreflow](https://github.com/reenhanced/gitreflow)

------
nod
This is inspiring reading. One may not actually need the ability to deploy 60
times a day in order to refactor and experiment this effectively, but it's
clearly a culture that will keep velocity high for the long-term.

~~~
jeffjose
In the pursuit of getting things done, we forget fundamentals, and more
often than not it's the sound fundamentals that come in handy when your
product has grown beyond your original 1- or 2-member tech team.

------
netghost
For operations that don't have any side effects, I can definitely see how you
could use the Science library.

I'm curious though if there are any strategies folks use for experiments that
do have side effects like updating a database or modifying files on disk.

~~~
Nacraile
The first thing to do is to try to minimize the scope of mutating
operations: e.g., decompose a monolithic read-write operation into
independent load, transform, and store steps. You can then easily test as
much as possible (the load and the pure transform) side-by-side. The parts
that must have side effects are still hard, but at least there are fewer of
them. (One variant of this would be to build your code so that you can
always intercept side effects, then block the new code's side effects and
compare them with the old code's.)
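
A sketch of that decomposition in Ruby (hypothetical names; the "database"
here is just a hash):

    # A monolithic read-modify-write, split into three steps. Only the last
    # one mutates, so the load and the (pure) transform can be run and
    # compared side-by-side during an experiment.
    def load_record(db, id)
      db[id]                                    # read-only
    end

    def transform(record)                       # pure
      record.merge(total: record[:items].sum)
    end

    def store_record!(db, id, record)           # the single side effect
      db[id] = record
    end

    db = { 42 => { items: [1, 2, 3] } }
    store_record!(db, 42, transform(load_record(db, 42)))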

The next thing is to take advantage of idempotence (and, by extension, to
make as many of your mutating operations idempotent as possible): nothing is
completely risk-free, but you can at least verify idempotence for either
ordering of new and old code, and if you're factored right, you can run both
paths on the same input and verify that they have the same side effects and
output.

Finally, observe that the new code must in general be backwards-compatible
with the old code, and that both versions need to be able to run concurrently
(because this situation will always exist during deployment): in the worst
case, you can always start with a limited deployment of the new path, which
limits the amount of damage done if the new code is bad.

------
blt
Github sounds like a great place to work.

------
abritishguy
Wow, strange that people weren't reporting these merge issues when they were
clearly impacting people.

~~~
nod
My read of the article implies that they were running the new method on the
side, and comparing the results to the old method that was still running in
production. They got to 100% before they actually pulled the lever on what
customers would use.

Edit: Ah, I see - talking about the Git bugs, not the differences. I'm
actually not surprised that "256 (or a multiple) merge conflicts" was never
noticed (or at least root-caused and fixed) by the entire git community.

Wonderful ability to use a large userbase as a giant fuzzer.

~~~
pilif
I think OP was talking about the issue in git itself that caused merges with
mod-256 conflicts to go through and be committed, including all the merge
error markers.

This happened on their live system (and would have happened on the command
line for local git users), so OP (and incidentally, me too) was wondering how
that wasn't noticed and wasn't causing support issues (it probably was, which
might have been another reason for this refactoring).

~~~
peff
I think it is a combination of two things:

\- the mod-256 conflict bug is exceedingly rare. Keep in mind that this is
mod-256 individual hunk conflicts in a _single file_. Most files in a merge
with conflicts have a handful of hunks. Over all of the testing at GitHub,
only a single merge triggered this bug, and it was on a long repetitive file
with automated changes.

\- the failure case was to quietly accept the merge. The result was obviously
bogus, but didn't look any different than a user accidentally checking in the
merge conflict markers. So if it was happening, I'd suspect it went
undiscovered either because the merge results were never used (e.g., it was a
test-merge to feed the PR "Merge" button status) or the users simply
scratched their heads and fixed it.

------
__jal
Nothing really to contribute or ask, other than to say that I really enjoyed
the writeup. Although I have nothing coming up that would use the code, the
new library sounds really neat. Kudos!

------
dlib
Very interesting, definitely gonna try this out as I have seen similar use
cases.

Any chance GitHub is at some point going to show the specific merge conflicts
for a PR that cannot be merged?

------
openfuture
Humans will always reverberate around truths like this.

The emphasis shift on breaking vs fixing looks like a good example of how
fashion trends in tech create artificial struggles that help new people
understand the "boundaries" of $things.

Fashion's like a tool for teaching via discussion.

Edit: I'm just commenting on what I perceive as a fashionable title, not the
article.

------
jcchee88
When running with Scientist enabled, doesn't that mean you pay the combined
runtime of both the old and new implementations instead of just one?

I could see this being OK in most cases where speed is not a concern, but I
wonder what we can do if we do care about speed?

~~~
e28eta
We could go look at the implementation/documentation of Scientist to
confirm, but nothing says the comparison has to happen on the response
thread. You _could_ fork a new thread to run the unused implementation
without impacting response time, as long as you have the server capacity.
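
Something like this, say (plain Ruby, not Scientist's out-of-the-box
behavior; the merge functions are stand-ins):

    # Serve the control result immediately; run the candidate and the
    # comparison on a background thread so they never add response latency.
    def old_merge(a, b); a + b; end   # stand-in for the old path
    def new_merge(a, b); b + a; end   # stand-in for the new path

    def merge_with_experiment(a, b)
      control = old_merge(a, b)       # the response uses this value
      Thread.new do
        begin
          candidate = new_merge(a, b)
          warn "mismatch for (#{a}, #{b})" unless candidate == control
        rescue => e
          warn "candidate raised: #{e}" # must never affect the response
        end
      end
      control
    end

    merge_with_experiment(1, 2) # => 3; the comparison happens off-thread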

------
cmrx64
Does anyone know what an "O(n) issue" is? I can think of a few possible
meanings in the usage here, but I've never heard it before and they all seem
wrong.

~~~
acveilleux
Using a linked list where the actual access pattern is random and a hash
table would be more suitable is the most obvious one, especially since this
is C code.

Similar things include C-style string functions like strlen() that require
iterating over an array of unknown length. Caching (or, better, avoiding!)
this work can save a lot of computing time. Example from libgit2:
[https://github.com/libgit2/libgit2/commit/7132150ddf7a883c1f...](https://github.com/libgit2/libgit2/commit/7132150ddf7a883c1f12a89c2518c1a07c0dc94c)

In fact, C-style strings are a rich source of O(n) performance issues. And
git is full of strings, like filenames.
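
The access-pattern point translates to higher-level languages too. A toy
Ruby version (contrived numbers, but the same shape as the linked-list vs.
hash-table case):

    require "set"

    haystack = Array.new(50_000) { |i| "file-#{i}" }
    needles  = Array.new(10_000) { |i| "file-#{i * 3}" }

    # O(n) per lookup: Array#include? scans the whole array every time,
    # so this loop is O(n * m) overall.
    slow = needles.count { |n| haystack.include?(n) }

    # O(1) expected per lookup: build a Set once, then probe it.
    index = haystack.to_set
    fast  = needles.count { |n| index.include?(n) }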

~~~
cmrx64
To me, I would have called that an "O(n^2) issue" (or whatever the proper
non-linear order is). The issue isn't that the algorithm is O(n), but that
it's not O(n) when it could be! The other interpretations by andrewaylett and
daveguy seem more accurate (and also agree with my own thinking). But I could
also see it being used to mean "the asymptotic complexity was hiding a huge
constant".

Have you heard your interpretation used in the wild?

