
How Facebook Ships Code - stunr69
http://framethink.wordpress.com/2011/01/17/how-facebook-ships-code/
======
Silhouette
Am I the only person who thinks this whole approach is broken?

We have seen the rise of "devops" recently, and big-name web companies like
Facebook and Google seem to be very proud of how engineer-led they are, how
empowered their developers are, how their product managers don't have much
real authority, how they push code to production ten minutes before it's even
written, and so on.

From the outside, I see systems that are always changing, so users can never
rely on anything working the same way from one visit to the next. I see
organisations with access to sensitive personal data being cavalier at best
about how they handle it. I see a major blunder _every few weeks_ that at best
causes serious irritation to users and at worst risks significant loss of
business and/or legal/regulatory consequences.

And I see huge brands whose egotistical staff don't realise that they are
successful _despite_ evidently not bothering with either real product
management or robust testing, _not because_ of these things, and who don't
seem to notice that they are still relying almost entirely on momentum from
one or two really big successes from early on to maintain their user base and
revenues.

~~~
jrockway
I wouldn't lump Google in there with Facebook. We have very strict controls on
access to user data (we can't even see email addresses in logs), and we have
not adopted the motto "move fast and break things". We do extensive automated
testing and have release processes that are designed to minimize problems in
the event of a bad push. Yes, bugs happen from time to time, but it's software
-- there is no known practical technique to produce bug-free software, so we
have to settle for mostly-bug-free software instead. This isn't being
amateurish or egotistical, it's being realistic.

~~~
Silhouette
I'm not arguing for unrealistic quality levels, and I will acknowledge
immediately that my experience could be atypical.

However, in fairness, if I look at all of the software that I use regularly in
a professional capacity today, then it is _clearly_ Google products that are
the most buggy, and by a very wide margin.

For example, I have a client who uses Google Docs/Drive. We have rarely
managed to hold a meeting without one of our small team struggling to see
either a word processor document or a spreadsheet properly, and that's just
the minor cosmetic or browser incompatibility bugs that keep appearing along
with all those minor UI changes. We have also experienced some much more
serious problems, including corruption/data loss and seeing the entire change
history for some files become inaccessible for no apparent reason. In other
words, it's not just minor UI errors that crept in as the product evolved,
there are evidently fundamental flaws in the underlying architecture that
don't store the data robustly.

Another example: I spent a couple of days last week trying to figure out why a
site that had been working fine for users until recently and had not been
changed at all on our side was suddenly generating bug reports. It turned out
that recent Chrome builds have broken some HTML5 features in multiple ways.
There have been related bug reports in some cases, but they have been closed
as the specific example given no longer seemed to be a problem. Again, the
nature of the problems makes it obvious that these are not just one-liner
issues but fundamental flaws, typically where Chrome is so aggressive with its
cache usage and/or triggering redrawing/relayout that it just plain doesn't
work. And even though the bugs had been reported, obviously the correct root
cause was never identified and fixed.

Another example: I recently spent some time looking into how the Closure Tools
are coming along. Have you tried the examples/demonstrations for the Closure
Library recently? Many of them simply don't work in Gecko-based browsers, and
that would be obvious if anyone working on the project had made even a cursory
attempt to test them for five minutes.

I will finish by once again acknowledging that my experiences here may be
atypical, and that the particular projects I've mentioned that I happen to be
using may not reflect the wider Google culture. But on the evidence before me,
I see an organisation that keeps breaking things in its rush to push new
features out and that demonstrably lacks robust architectures that keep data
safe, effective testing processes before pushing code into production, and
proper issue resolution processes when defects do get reported.

~~~
jrockway
_However, in fairness, if I look at all of the software that I use regularly
in a professional capacity today, then it is clearly Google products that are
the most buggy, and by a very wide margin._

It's more likely that you just spend more time using Google software than
other software. When I worked at BofA, our website was down for an entire week
because of a bad push. No online banking for a week. That's pretty much the
standard for the industry. I don't doubt that there are bugs in Docs or
Chrome, but they are relatively obscure. That's the nature of software, not
every bug is caught by a unit test and users end up seeing them and reporting
them. (Oddly, we use Docs heavily at Google, and the only bug that I've
noticed is the "Your zoom level is not supported" one. I haven't personally
hit any other issues.)

But like I said in my original post, if you know how to develop bug-free
software, I'd love to hear how. Expecting low-cost web apps to be as reliable
as airplane control systems is unrealistic.

~~~
Silhouette
_It's more likely that you just spend more time using Google software than
other software._

Sorry, but I'm really not. In the case of Chrome, for example, I would
routinely test new work on web projects in _all_ the major browsers. I am
writing this with some empirical data in front of me, because I've just
checked the bug tracking systems for a couple of projects I work on to be
sure: for both projects, excluding mobile browsers, Chrome has been
responsible for a clear majority of all defects where the root cause was found
to be a browser bug over the past two years.

 _I don't doubt that there are bugs in Docs or Chrome, but they are relatively
obscure._

Respectfully, if they were that obscure, my colleagues and I wouldn't keep
running into them on multiple projects. I'd agree that the particular symptoms
of any particular bug are usually a corner case: set this option to A and that
option to B and it breaks, but other combinations work OK. It's the way that
several of these bugs have recurring themes that betray both an underlying
architectural weakness and a lack of effective quality control that I find
most unfortunate.

Using Chrome as an example again, it is clearly aggressive with caching and
conservative with repainting, but sometimes that means it is simply not
behaving properly at all. If I set a part of the DOM to display instead of
being hidden and then send an AJAX request, I want my "please wait" message
displayed while the request is running, not afterwards, or indeed not at all
since it probably gets hidden again as soon as the response arrives.

 _That's the nature of software, not every bug is caught by a unit test and
users end up seeing them and reporting them._

And as I mentioned, several of those bugs in Chrome _had_ been reported, and
subsequently closed without the root cause ever being identified and fixed.

 _Expecting low-cost web apps to be as reliable as airplane control systems is
unrealistic._

I don't expect Google's software to be as reliable as airplane control
systems, but somewhere close to as reliable as everyone else's software would
be nice. I appreciate that you're having trouble believing it or reconciling
it with your own experience, and I've already acknowledged that my experience
might be a complete outlier, but I'm looking at several years of _empirical
data_ across multiple completely different projects and development teams and
it is quite clear that the Google products I'm looking at here haven't been
keeping up lately.

~~~
jrockway
_And as I mentioned, several of those bugs in Chrome had been reported, and
subsequently closed without the root cause ever being identified and fixed._

Link?

~~~
Silhouette
See issue 104487 for one recent example. An issue was flagged up where HTML5
videos weren't playing properly when given a poster image but no controls
attribute.

The issue was closed almost immediately, with some obviously hastily written
comments, apparently because no-one could reproduce it in a different version
of Chrome on Linux or Mac. As far as I can see, no-one even tried to reproduce
the other reported failing case on Windows, and there was no attempt at all to
investigate the original bug and determine how it happened and why it was no
longer observable on the platforms tested.

The issue was simply marked "fixed", despite no actual fix having been
identified, rather than giving it a more specific "no longer reproducible"
status.

There are still serious problems with that combination of attributes today.

------
sreyaNotfilc
This may be my favorite quote...

"Engineers handle entire feature themselves — front end javascript, backend
database code, and everything in between. If they want help from a Designer
(there are a limited staff of dedicated designers available), they need to get
a Designer interested enough in their project to take it on. Same for
Architect help. But in general, expectation is that engineers will handle
everything they need themselves."

I actually like this idea. Building my own website as well as working as the
senior engineer during my day job forces me to be involved in all facets of
web development. The jobs are not abstracted. You are expected to know what
you're doing from the front end to the back end. If not, nothing gets done in a
timely manner. I like this method because the front end is tightly woven in
with the business logic of the module that you're working on. In other words,
you know what the code is doing inside and out.

~~~
crazygringo
I dunno. I personally am a full-stack guy.

But when I work with back-end developers who don't know MySQL inside and out,
and they write a query that works fast in development but slow in production
because they didn't realize the column index they specified won't work because
of a string collation incompatibility between tables, and they've never even
_heard_ of this kind of problem before...

Then I wish they'd stick to writing their back-end executable code, and have a
database guy write their query for them, and have a JavaScript expert take
care of the front-end stuff.

It's not a question of intelligence, just a question of experience. You have
to have done a lot of JavaScript coding to realize never to call parseInt(x),
but always parseInt(x, 10), for example. And the number of CSS
incompatibilities between browsers that you need to account for...

Unless someone has a lot of full-stack experience, it doesn't seem that great
that they should be pushing out full-stack code on their own at a place like
Facebook. Maybe code reviews will catch those kinds of things, though.

~~~
underwater
Facebook has good abstractions at all levels of the stack. This allows
engineers to work on all parts of the stack without having to worry about low
level implementation details. I have not written a line of SQL while at
Facebook.

If there is some domain knowledge needed then I can just rope in an engineer
from the appropriate team to review my diff.

It's also worth noting that on product teams engineers tend to gravitate
towards the parts of development they are more comfortable with, or enjoy the
most. They are not forced to work with the full stack just for the sake of it.

------
gouranga
What a crock of shit:

 _after boot camp, all engineers get access to live DB_

I can understand it at a startup or small org, but at an organisation of that
size there should be very tight access control.

Despite what anyone says, the probability that someone does something bad
increases in larger groups. Security should be on a simple need-to-know basis
and nothing else.

I build BIG financial software and we have certain audit requirements, access
control requirements, data protection requirements etc and that is exactly how
it should be.

I'd never put my data near FB. They are simply irresponsible.

~~~
piggity
You finance and billing guys will never understand.

These are _rockstars_.

They work for facebook.

They would never ever type UPDATE users SET email = username ||
'@facebook.com'; WHERE username == 'john.smith';

~~~
jacques_chester
Of course not. Much faster to type

    
    
        UPDATE users SET email = username || '@facebook.com';
    

Then let the users sort out the exceptions.

~~~
alttab
I think he was being sarcastic - you can see the semicolon in the middle of
the command which looks like pretty much what they did with the whole e-mail
debacle. You simply removed the second incomplete statement.

~~~
jacques_chester
I now realise that I am underqualified to work at Facebook. :(

------
epriest
See this question on Quora for some clarifications from people who are or have
been Facebook employees (including myself). The article (particularly the
original version) has a lot of inaccuracies, and is now around 18 months out
of date.

[http://www.quora.com/Facebook-Engineering/How-accurate-is-
th...](http://www.quora.com/Facebook-Engineering/How-accurate-is-the-How-
Facebook-Ships-Code-article-written-by-yeeguy)

------
michaelmartin
I really like seeing this approach listed:

"resourcing for projects is purely voluntary. A PM lobbies group of engineers,
tries to get them excited about their ideas. Engineers decide which ones sound
interesting to work on."

That sounds exactly the same as how Github's engineers work.

It's an awesome concept; no-one can justifiably be bored with their projects
if they chose them. And if you can't get anyone interested in working on the
project, then it's a good indicator it may not be a project worth completing
for the company anyway.

I'm sure there are times when someone has to say "We _need_ someone to do
this", but I'd be curious to hear from someone who works in one of these
environments how common an issue that really is.

~~~
jacques_chester
> _It's an awesome concept; no-one can justifiably be bored with their
> projects if they chose them._

Eventually people lose interest. It's only human. But _The System_ has been
written, it is in production, and it has accrued a healthy stock of user data.

One day, The System breaks. "Only" tens of millions of users -- less than a
percent of all Facebook users -- rely on The System. But they rely on it
utterly.

Who will fix The System?

Of the five engineers who wrote it:

* Jack and Mandeep left to launch myfornaxisnatrr.com

* Wei Li has moved to a different group

* John doesn't want to touch it, he only came on board to help Jack

* Michael was an intern but has since taken a job with Google.

Uh oh! There's nobody around to voluntarily fix an existing system. That's
boring, and there are no incentives for fixing bugs in obscure features
because only launching successful new features gains visibility from higher-
ups.

Guess we'll need some mean old managers to round up a posse.

If they care enough.

Meanwhile, The System has acquired millions of users, cost millions of dollars
to develop and operate and will now abruptly cause tens of millions of
customers to become incredibly frustrated. And at no point has anybody stopped
and asked:

 _"Was this the Right System to build?"_

~~~
angstrom
> That's boring, and there are no incentives for fixing bugs in obscure
> features because only launching successful new features gains visibility
> from higher-ups.

Companies really should find a way to incentivize this. Rewrites are about as
silly. "Write the next generation xyz" translates to "No one here has any
idea how xyz was written, so it's being rewritten instead of modified. We look
forward to rediscovering the same problems we did the first time."

~~~
jacques_chester
> Companies really should find a way to incentivize this.

It can't be done from the top. You need to start with engineers who care about
quality over the long term. It has to come from within each engineer.

When Zuckerberg talked about younger programmers being "better", he probably
meant "more like me". But old farts are just young farts with expensively-
acquired scar tissue.

Most of our most treasured software development lore comes from the lesson
that "move fast, break things" just devolves into "fuck, everything is broken,
fix it fast".

------
Goladus
The Facebook system actually sounds really solid, especially the "boot camp"
thing that so many companies fail to have. However, it's probably pretty
expensive. Choosing between fast, cheap, and good, Facebook is choosing fast
and good. For many, cheap and good but slow is more desirable.

One of the risks to consider with a "devops" oriented approach is that you may
become more dependent on it than you want.

Often, applications split things into a few different categories depending on
how tweakable they need to be. There's code, configuration, app
administration, and data. Code shouldn't need to change often. Configuration
may need to change when the environment changes. App administration (e.g.
creating new accounts) needs to change often, and data is always changing (or
at least growing).

The risk is that developers will design the system so that only developers can
administer it. Configuration, the settings that may need to be tweaked by sys
admins long after the original developers have left the project, may wind up
in the code or sometimes lumped into the database alongside end-user options.
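One way to picture the separation described above is a minimal sketch in which tweakable settings are read from an environment-style map instead of being hard-coded. Everything here is an assumption made for illustration: the `AppConfig` shape, the `DB_HOST` and `MAX_UPLOAD_MB` keys, and the default values are all hypothetical.

```typescript
// Hypothetical settings for an imaginary service; names are illustrative only.
interface AppConfig {
  dbHost: string;
  maxUploadMb: number;
}

// Read settings from an environment-style map, falling back to defaults,
// so an administrator can change them without editing the code itself.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const rawLimit = parseInt(env["MAX_UPLOAD_MB"] ?? "", 10);
  return {
    dbHost: env["DB_HOST"] ?? "localhost",
    // A missing or non-numeric value parses to NaN, so use the default.
    maxUploadMb: isNaN(rawLimit) ? 25 : rawLimit,
  };
}

const config = loadConfig({ DB_HOST: "db.internal.example" });
console.log(config.dbHost, config.maxUploadMb);
```

The point is only that the values a sys admin may need to change live outside the code path, apart from end-user data.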

It's not a reason not to take this approach, just something to consider when
developing internal processes and culture.

------
jsvaughan
Previously on HN: <http://news.ycombinator.com/item?id=2594083>

↪ How Facebook _actually_ pushes updates to the site

I came across this originally on the Etsy dev blog, rather than HN, and that
particular post had some good other stuff about Flickr and Etsy:

[http://codeascraft.etsy.com/2011/06/01/pushing-facebook-
flic...](http://codeascraft.etsy.com/2011/06/01/pushing-facebook-flickr-etsy/)

------
yawgmoth
I like the idea of encouraging a high-performance culture, but I don't think
the 'perform or die' atmosphere would be healthy for many engineers. I know,
idolizing 'rockstar programmers' is a sort of new hotness and I understand
that a company like Facebook wants to have super-talented developers, but
developers grow and learn new tricks as they mature, and they might take more
than six months to do so.

~~~
alttab
True - but those engineers don't work at Facebook. This isn't No Child Left
Behind. This isn't hand-hold time. This is the most expensive and expansive
internet application in the world. If they need time to ramp up, they can do
it on someone else's product and come to Facebook when they're ready.

~~~
wpietri
That sounds impressively macho, but it's an attitude that has long-term
organizational costs. HN just had a great article on how Microsoft's internal
competition deeply harmed the company. The number one complaint I hear from
departing Google engineers is the absurd internal promotions system.

Years ago I did a gig at eBay, and I thought their macho attitude was a giant
source of problems. Plenty of good, sane people were driven off (or driven
mad) by artificially high-pressure situations. Every email about a promotion
mentioned how somebody had worked all night to get something done; they were
promoting more for drama than for skill. And a "this isn't hand-hold time"
attitude was common among senior technical staff, which meant that people
often hid their weaknesses rather than getting the help they needed.

The lesson I learned from that is that software companies that treat normal
circumstances with the intensity of emergencies are gradually cutting their
own throats.

That's a lesson reinforced for me spending a lot of time with a family member
in hospitals last year. Even when survival was on the line, the best doctors
and nurses proceeded with patience and kindness, working to train staff and
improve systems as they went. In an actual emergency they moved like it was an
emergency. But only then. If they can be serene and thoughtful while dealing
with brain tumors, I don't think there's any reason that people at Facebook
have to puff themselves up with self-importance.

~~~
alttab
This is a good lesson. I was straightforward to get a point across, but it is
easy to see how "excellence first" could turn into a poor work environment
without moderation.

It also seems that an "excellence first" culture provokes fears that
people can't learn, or aren't allowed to make mistakes. Tone and delivery are
important here - especially from leadership. My take on excellence first is
you don't have to compromise environment for it. Create an environment that
fosters excellence, where people aren't afraid to make mistakes but are
committed to learning from them. Everything doesn't need to be treated like an
emergency, but there's a level of bullshit on a technical level that shouldn't
be tolerated. Again, each situation needs to be felt out individually and
moderation is always key.

Personally I don't advocate the internal competition aspect of excellence
either. The pitfalls of that are well documented as you have pointed out. I'd
promote based on team success, and THEN individual contributions. If you can't
work as a team then the whole team loses; only once the team succeeds would
you start singling people out to see where the best influences were coming
from.

TLDR - I think there is a middle ground between not running it like a daycare
and running it like a Wall Street trading firm or gulag.

------
DigitalSea
Don't get me wrong, a developer-led company sounds like an awesome idea in
theory, because the developers know the product better than any product
manager ever could, but that isn't a good thing. I've worked with companies
that have ultra-strict testing of features, where many eyes see and use the
code before it is pushed out, and while that process works, it can get in the
way of progress when the politics come out to play.

The whole "move fast, make mistakes" mantra is a bad approach when you're a
high-traffic site like Facebook and risk jeopardising your revenue. A second
of downtime can be very costly. Knowingly pushing out half-tested or
completely untested features might be acceptable if you have a small user
base, you're a new Internet startup, or your logo says alpha or beta, but for
an established brand and company it's stupid.

This is all my own opinion, of course.

~~~
alttab
I would argue the opposite. Facebook is in prime position to move quickly and
break things. They have enough traction, gravity, and brand recognition they
can afford to make mistakes.

If you have a small user base or are just getting started, delivering the best
product experience is the #1 thing you can do for success. Is iterating faster
and making more mistakes and giving the early adopters (ones that champion
your service the most to others) a sub-par experience worth the speed?
Sometimes, yes. All the time? No.

Facebook can change your privacy settings, steal your e-mail address,
recognize you automatically in photos, watch your browsing activity outside of
their walled garden, and sell your data to advertisers. And yet they still
have 500 million+ users.

------
five_star
"very engineering driven culture. ”product managers are essentially useless
here.” is a quote from an engineer. engineers can modify specs mid-process,
re-order work projects, and inject new feature ideas anytime"

Maybe this is why Facebook seems chaotic to users. They keep changing to
whatever design they want without much consideration for how users will feel
about it. Facebook has now become a combination of the features of the other
existing social media sites.

------
krosaen
""" Engineers responsible for testing, bug fixes, and post-launch maintenance
of their own work. there are some unit-testing and integration-testing
frameworks available, but only sporadically used. """

Sounds like a lot of code debt accumulating that could bite them hard down the
road - it's one thing to write and manually verify bug free code, it's another
for a different engineer to make sure he/she doesn't break that code
inadvertently a year later when the original author has moved on to another
project or company. I'm not talking about 100% test coverage; if the smoke
test for a feature breaking is someone noticing while playing with the site,
in the long run it strikes me as a much less efficient way to verify and fix
regressions than using an automated test suite. Writing good tests is hard,
but keeping a product bug free as more and more functionality accumulates
without automated test suites is even harder in my experience.
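To make the contrast with manual checking concrete, here is a minimal sketch of an automated regression check. The feature function `formatDisplayName` and its test cases are invented purely for illustration and have nothing to do with Facebook's actual code or test frameworks.

```typescript
// Hypothetical feature code, invented purely for this illustration.
function formatDisplayName(first: string, last: string): string {
  return `${first.trim()} ${last.trim()}`.trim();
}

// Regression cases: each one records behaviour someone once verified by hand.
const cases: Array<[string, string, string]> = [
  ["Ada", "Lovelace", "Ada Lovelace"],
  ["  Ada ", "", "Ada"],
];

// Re-run every recorded case and collect a message for each mismatch.
function runRegressionChecks(): string[] {
  const failures: string[] = [];
  for (const [first, last, expected] of cases) {
    const actual = formatDisplayName(first, last);
    if (actual !== expected) {
      failures.push(`expected "${expected}" but got "${actual}"`);
    }
  }
  return failures;
}

console.log(runRegressionChecks().length === 0 ? "all checks pass" : "regression found");
```

A different engineer changing `formatDisplayName` a year later gets the failure list immediately, rather than relying on someone noticing while playing with the site.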

~~~
mkjones
This article's about a year and a half old. We have pretty good unit test
coverage on a good chunk of our code (especially core stuff), though
admittedly not everything.

Some groups put particular emphasis on this (e.g. the messages team is great
about testing), and it shows in the reliability of their products. Even
better, they end up building frameworks that make it easier for the rest of
engineering to write tests, and drive the whole ecosystem forward.

~~~
krosaen
Good to hear. Related: a good article by Eric Ries on how in many situations
within a startup technical debt can be used effectively.

[http://www.startuplessonslearned.com/2009/07/embrace-
technic...](http://www.startuplessonslearned.com/2009/07/embrace-technical-
debt.html)

------
peapicker
I stopped reading at "very unique". It is unique, or it isn't. Intensifiers to
'unique' tell me the writing is below par; and I've been correct about this
enough over the years that I stopped bothering.

------
Dybbuk
Well, I'll be joining Facebook in a few weeks. I am a bit of a laid back type
and don't know if I fit into their culture of moving fast.

------
its_so_on
This is the real Facebook secret sauce in convenient flowchart form:

    
    
        What's the most evil thing we can think of doing?
         V 
        Candidate <-----<------- Think of the next-most evil thing we can do
         V                                     ^   
        Can we code that? (no) ------->------->^
         V (yes)                               ^
        Is it legal?                           ^
         V (yes/no)                            ^
        Can we get away with it? (abs. not)--->^
         V (yes/maybe/prob. not)               ^
        Keep shipping!                         ^
         V                                     ^
        Did we get in trouble for it? (no)     ^
         V (yes)                        V      ^
        Claim it was a mistake!         V      ^
         V                              V      ^
        Still in trouble? (no)-> Keep feature->^ 
         V (kind of)                    ^
        Being sued for it? (no) --->--->^
         V (yes)                        ^
        Throw money at lawsuit          ^
         V                              ^
        Did we lose? (yes/no) ----->--->^

~~~
gouranga
That's great - thanks for posting it :)

~~~
gouranga
If you're going to down vote, please at least say why...

~~~
bajsejohannes
Comments on HN tend to be voted down if they are not substantial. To quote
<http://ycombinator.com/newswelcome.html>: Does your comment teach us
anything?

That page also says that a simple thanks may be acceptable, but the consensus
seems to be that those comments are superfluous as well.

------
sodelate
what does this mean?

