
Rapid release at massive scale - prostoalex
https://code.facebook.com/posts/270314900139291/rapid-release-at-massive-scale/
======
trjordan
One of the undersold (imho) parts of this post is the system that tiers
releases and puts gates between the tiers. When most companies talk about
CI/CD, they mean that master gets deployed to production, full stop, and
rollbacks mean changing the code. In reality, when code hits master there is
ALWAYS a lag while it gets deployed, and it's worth having a system that acts
as the source of truth for that state. Where release engineering gets
interesting is how you handle the happy path vs. a breaking release.

I like that Facebook separated deploy from release. It means you can roll the
release out relatively slowly, checking metrics as you go. Bad metrics mean
blocking the release, which means turning off the feature via a feature flag.
For the rest of the world, I think that would mean halting the release and
notifying the developer.
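
A minimal sketch of that idea in Python. Everything here (the `flag_store`
and `metrics` objects, the ramp percentages, the 5% tolerance) is a made-up
illustration of the pattern, not anyone's actual system:

```python
import time

# Deploy != release: the code is already in production (dark) behind a
# feature flag; "releasing" means ramping the flag up while watching
# metrics. flag_store and metrics are hypothetical stand-ins.

def ramp_release(flag_store, metrics, feature,
                 steps=(1, 5, 25, 100), soak_seconds=300, tolerance=1.05):
    for pct in steps:
        flag_store.set_rollout(feature, pct)    # expose to pct% of users
        time.sleep(soak_seconds)                # let metrics accumulate
        if metrics.error_rate(feature) > tolerance * metrics.baseline(feature):
            flag_store.set_rollout(feature, 0)  # "rollback" = flip the flag off
            raise RuntimeError(f"release of {feature} blocked at {pct}%")
```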

Disclosure: I work with smart people who spend lots of time thinking about
this and writing blog posts like "Deploy != Release":
[https://blog.turbinelabs.io/deploy-not-equal-release-part-one-4724bc1e726b](https://blog.turbinelabs.io/deploy-not-equal-release-part-one-4724bc1e726b)

~~~
rdsubhas
This is true, except it has a huge underlying requirement: all deployments
must be forwards and backwards compatible. That is, a running service must be
able to talk to an older version of itself, and vice versa (and the same goes
for its chain of dependencies). This is a much bigger knowledge investment
than it sounds, and easier said than done.
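
To make the requirement concrete, here's a hedged Python sketch of the
expand/contract pattern; the `db` interface and field names are invented for
illustration:

```python
# Hypothetical sketch of expand/contract. During a rolling deploy, old
# (v1) and new (v2) instances run side by side, so each must tolerate
# the other's data. The db interface and field names are invented.

def write_user_v2(db, user_id, first, last):
    db.put(user_id, {
        "first_name": first,           # new schema ("expand")
        "last_name": last,
        "fullname": f"{first} {last}", # kept so v1 readers still work
    })                                 # drop it later ("contract")

def read_user(record):
    # Tolerate records written by either version.
    if "first_name" in record:
        return record["first_name"], record["last_name"]
    first, _, last = record["fullname"].partition(" ")
    return first, last
```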

It pays off in the end, but it's not worth making it a "criterion for
success" when breaking out from branch-based to trunk-based continuous
delivery; otherwise the move to trunk will most likely never happen.

Shameless plug: at goeuro.com we shifted from branch-based to trunk-based CD
in a short time (<3 months), across a diverse set of services and workloads,
by applying a holistic socio-cultural, technical, and process approach. It
could be of interest if anyone is trying to make the switch:
[https://youtu.be/kLTqcM_FTCw](https://youtu.be/kLTqcM_FTCw)

~~~
bpicolo
> that all deployments are forwards and backwards compatible

This is critical regardless in an SOA, or in anything other than strict
blue-green deployment.

------
jonstewart
It makes sense upon reflection, but something I found interesting was how
they run linters and static analysis tools in parallel with building. I'm
used to build-and-test pipelines where these are done serially, because
what's the point in building if the linter fails, and what's the point in
static analysis if the build fails? But the point is, they can be done in
parallel, and that reduces the latency of feedback to the developer.
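
As a rough sketch of the shape of that, with placeholder commands (flake8,
mypy, and make stand in for whatever tools a project actually runs):

```python
import concurrent.futures
import subprocess

# Run lint, type checks, and the build concurrently instead of serially.
# The commands are placeholders; the point is the latency win, plus the
# developer seeing ALL failures at once instead of one per round-trip.

STEPS = {
    "lint":  ["flake8", "."],
    "types": ["mypy", "src/"],
    "build": ["make", "build"],
}

def run_step(name, cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return name, proc.returncode

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_step, n, c) for n, c in STEPS.items()]
    for fut in concurrent.futures.as_completed(futures):
        name, code = fut.result()
        print(f"{name}: {'ok' if code == 0 else 'FAILED'}")
```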

------
newscracker
Note: Everything in this comment is based on personal experience and
observations as a Facebook user. They're all my opinions too.

In my experience of using Facebook (primarily groups), it is a highly buggy
platform. It's very hard to say that it behaves consistently, or even that
features are really ready before release. This, IMO, implies that
development and testing, as well as rolling out the releases, are all messed
up. This post talks about building a better "conveyor belt", so to speak, for
releasing changes, but if the basic product is buggy and didn't get good
attention in design/dev/test, no improvement in the "conveyor belt" can help
make it awesome.

Standard features that have existed for a long time may or may not work (how
good are the regression tests, then?). Posts and comments in groups sometimes
just disappear (thank goodness an admin activity log was added sometime in
the last several months, so we can stop wondering whether an admin deleted
anything). There's a feature in groups that requires people who want to join
to answer some questions set up by the admins. Most people submit the
answers, but they never get saved, and it's unknown what the trick is to get
the answers to stick (this has been broken for several months now?). New
features aren't always announced.

I see Facebook as a platform that's used to share ephemeral things. So this
level of quality is probably just ok (though I don't believe it justifies the
company's revenues and valuation).

Since I do not conform to Facebook's ridiculous policy on using
"real/authentic names", I don't even venture into contacting support if I see
any issue, lest my presence be obliterated (yes, I try to keep away from
Facebook, but do need it for some important awareness building because there's
a large audience there).

As a platform used by billions, Facebook still has a very long way to go in
being reliable.

------
imaginenore
It's not clear from the text how they deal with severe production bugs. By
the time a bug is found, the master branch is full of new code. So your
bugfix has to deploy along with all that new code?

And with so many people checking in to the master branch, how is it not
permanently broken? With 1,000 devs pushing code, you're bound to have severe
bugs daily.

~~~
Too
It's not like people push straight to master. There is most likely a pull
request system or other form of review tool in between, with both code review
and static analysis.

------
xupybd
I can't be the only one immature enough to read that title as a double
entendre.

~~~
al2o3cr
Clearly, the solution is a middle-out approach. ;)

------
Gravityloss
Is their distributed database code also in this same repo and goes through the
same process?

~~~
oliverzheng
(Former FB eng.) No. Storage and backend services are on a different tier and
release schedule entirely.

------
latchkey
I've had a deployment cycle like this since I started using Google App
Engine... 6 years ago.

~~~
alangpierce
The interesting thing about the article isn't that they're able to release
continuously; there's nothing technically hard about deploying quickly. The
interesting thing is that they're able to make the continuous release system
work with thousands of engineers actively working in the same codebase without
destroying quality. The three-tiered release system, monitoring alerts,
feature flags, and good testing infrastructure seem to be what makes all of
that possible.

~~~
breeny592
Exactly - releasing is the easiest part of the process.

A lot of orgs don't have continuous deployment for reasons such as:

- they don't have a good enough automated testing suite (or at least don't
fully trust it), and thus rely on "sign offs" to have people personally vouch
for quality

- they don't measure production properly (no real error alerts, no way to
measure release success), and often deal with things in a "go or no-go" way

- they don't canary test. To me this one is critical: the only way to get
real production signal is to have real production users, just a sample of
them, actually using the site/platform/app, to see what could go wrong,
especially with new features (a sketch of the idea follows this list)
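
A minimal sketch of the canary idea, with made-up numbers (the 1% sample and
1.5x error-rate threshold are illustrative only):

```python
# Hypothetical canary sketch: route a small, stable sample of users to
# the new build and compare error rates against the rest of the fleet.

CANARY_PERCENT = 1

def bucket(user_id: int) -> str:
    # Stable assignment: a given user always sees the same build.
    return "canary" if user_id % 100 < CANARY_PERCENT else "control"

def canary_healthy(canary_errors, canary_requests,
                   control_errors, control_requests) -> bool:
    canary_rate = canary_errors / max(canary_requests, 1)
    control_rate = control_errors / max(control_requests, 1)
    return canary_rate <= 1.5 * control_rate  # else halt and roll back
```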

A lot of managers I've worked with are shocked whenever I pull out the
"continuous deployment is easy; doing it well is hard" line.

~~~
user5994461
A lot of organizations don't have continuous deployment because they can't
risk breaking everything for any developer who is playing around. When they
want to release, they review everything, test, and go through QA.

Facebook is not important. It has no impact when it's broken.

~~~
latchkey
I helped build a business that did about $80m in gross revenue in the first
year. We launched the initial version in 3 months (which we predicted to
within a week).

Started with 2 engineers (myself and another guy) and grew it to about 15.
Zero QA, Zero DevOps.

We had CI/CD and a full test suite. We deployed from master as many times a
day as we needed / wanted.

It can work if you open your mind to it and you hire the right people who know
what they are doing.

~~~
user5994461
And I was at a business with $800M and 30 employees.

Just because something releases quickly and has no QA doesn't mean it's a
good thing.

The only metric that matters is calls from your users. Facebook doesn't even
have a number to call when it's broken.

------
fmavituna
It's interesting that neither static nor dynamic automated security testing
appears anywhere in their process.

~~~
rdsubhas
when both the delivery (pipelines) and the units going in them (container
images with deployment descriptors) are automated, its really easy and
straight-forward to plug-in a variety of automated checks (e.g.
[https://github.com/coreos/clair](https://github.com/coreos/clair),
organizational policies, governance, etc)
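
As a sketch of what such a plugged-in check might look like (the
`image-scanner` command and its flag are invented stand-ins; a real setup
would call an actual Clair client or similar):

```python
import subprocess
import sys

# Hypothetical pipeline gate: once every unit of deployment is a container
# image moving through an automated pipeline, a vulnerability scan is just
# one more step. "image-scanner" is a stand-in, not a real CLI.

def gate_on_vulnerabilities(image: str) -> None:
    result = subprocess.run(["image-scanner", "--image", image],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"blocking {image}: scanner reported known vulnerabilities")
        print(result.stdout)
        sys.exit(1)

if __name__ == "__main__":
    gate_on_vulnerabilities("registry.example.com/app:latest")
```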

------
nimchimpsky
"engineer productivity remained constant for both Android and iOS, whether
measured by lines of code pushed or the number of pushes."

Both of those metrics are incredibly shit ways to measure productivity.

I guess that's one explanation for why the Facebook app is 200MB+. They've
been super productive with all those lines of code.

A better metric would rely on actual features or bugs, imo.

~~~
Osmose
I think it's obvious that they're bad at a small scale (individually or for
smallish teams, maybe < 30 engineers?), but I don't think they're _obviously_
bad for teams of 50+ engineers.

Also, there is a difference between using them as metrics that you want to
raise vs. metrics you just don't want to see drop or fluctuate wildly over
time.

~~~
0xbear
Some people at Google are actually quite proud of six digit _negative_ line
counts they've contributed. And I think they have every right to be proud of
that.

~~~
Swizec
That’s why you track _changes_ made.

Throughout my career I've found that the number of changes committed
correlates pretty well with features delivered and bugs fixed, and also with
feature and bug size.

Yes there are edge cases when it takes a day of debugging to make a 1-line
fix, but those are rare. Just like it’s very rare to deliver a useful new
feature by changing a single line.

Yes there are also features that are tracked as a real ticket and require a
1-line copy change and nothing else. Nobody thinks doing those makes you hella
productive, it just needs to be done.

As for padding lines and changes: that's what code review is for.
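
For what it's worth, the raw tally is trivial to pull from plain git; a rough
sketch (the three-month window is an arbitrary choice):

```python
import subprocess
from collections import Counter

# Rough sketch of the "changes committed" tally: commits per author over
# a window, straight from git history. As noted above, code review is
# what keeps this number honest against padding.

def commits_by_author(since="3 months ago"):
    emails = subprocess.run(
        ["git", "log", f"--since={since}", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return Counter(emails).most_common()
```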

~~~
0xbear
As to what one should track: one should track _results_. If you've managed to
accomplish something useful with very little code, that is decidedly better
than accomplishing the same with much more code and hundreds of check-ins.
Simplicity is the ultimate sophistication.

~~~
Swizec
> one should track _results_

One should, but companies often care more about effort than results. They can
manage based on effort; they can't manage based on results.

If you spend 2 days getting the same results as somebody else does in 5 days,
guess what, they don't want you milling around those extra 3 days and bringing
morale down. Gotta give you more work!

~~~
0xbear
And worse: in more than one megacorp (including Google in late aughts) I've
seen _complexity_ as a requirement for promotion. Google has tried to get rid
of that de jure, but it remains a requirement in practice, so people do what
they are rewarded for: create complexity.

------
ianamartin
I must be really out of the loop because I just don't see enough of what
Facebook is doing to justify all of this garbage. There's a lot of lip service
in the article about user experience, and I guess there are some changes here
and there, but wtf is happening here? I definitely want and use CI/CD tools
for my team's software, but what the fuck are you really doing when you are
making this many changes per day?

Call me an old fart, but if you are in a situation where you need to make that
many changes per day, you are utterly fucked from almost every angle.

Every aspect of this article sounds to me like people have no idea what they
are trying to do, so they write code and push it, and it goes live. And
everyone is very happy about this, for some insane reason.

No offense to anyone, but this is not a reality I want to live in. And the
article doesn't do much to defend the notion.

~~~
coldtea
The parent makes a point.

Across their whole product line (client apps, HHVM, Flow, whatever), there
might be millions of lines of code.

But what exactly changes year over year on FB itself (the server/client code
of the actual social website) that warrants so many commits?

~~~
cromulent
I would suspect that the engagement level of the site changes. They have the
audience to be able to run massive amounts of experimentation to see how
people use the site and respond to advertising. To the average user, not much
changes except for a feed of news.

~~~
pja
Exactly. Only yesterday, my partner was complaining about how the like icons
on her Facebook page had become animated. Mine were static at the time;
clearly FB had put her in a test bucket for some 'do animated like icons
increase engagement?' test.

FB does this stuff _all the time_.

------
kasperset
The massive scale is getting massive at a massive rate.

------
ernsheong
Hey Facebook, rapid release is great, but you broke the Messages,
Notifications, Quick Help, and caret buttons on my
[https://www.facebook.com](https://www.facebook.com) navigation bar. They've
been broken for a few hours now; clicking on them does absolutely nothing.
Chrome 60, macOS.

Maybe you need to slow down.

~~~
thomasjudge
I see this happen not infrequently. Clearly their values still weight "move
fast & [don't worry too much when we] break things" over reliability. For a
site of this nature that's a somewhat defensible choice, but for those of us
who come from enterprise software, for example, or who expect a reasonably
stable user experience, it's occasionally somewhere between disconcerting and
annoying.

~~~
tehlike
Nope, it just means that they didn't measure that particular metric. It gets
better over time.

