
If they're doing up to 50 commits a day and deploying every commit to master automatically, how does that line up with "It makes it much easier to identify bad commits. Instead of having to dig through tens or hundreds of commits to find the cause of a new error, the pool is narrowed down to one, or at most two or three"?

If you commit and then find out in the middle of the day that the latest deploy is having problems, while people are still committing new code, wouldn't that make things much harder to narrow down?




Assuming they're spread out over ~10-12 hours (some crazy morning people, some crazy nocturnal people), that's only ~4-5 commits per hour.

Most problems will be discovered by someone and reported within an hour, and most of those will also show up in a dataset on a system like Scuba - https://www.facebook.com/notes/facebook-engineering/under-th... - so you can identify the first time that particular issue happened.

If you're lucky, it lines up exactly with a commit landing, and you only need to look at that one. Otherwise, due to sampling, maybe you need to look at the two or three commits before your first report/dataset hit. You can also use some intuition to pick out which of the last n commits is the likely cause. A URL generation issue? Probably the commit in the URL generation code. You'd do the same thing with a larger bundled rollout, just over a larger number of commits (50, in the case of a daily push).
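
A rough sketch of that narrowing step (names and timestamps here are purely illustrative; it just assumes you know when each commit was deployed and when the first bad report or dataset hit came in):

    from datetime import datetime, timedelta

    def candidate_commits(commits, first_hit, window=timedelta(hours=1)):
        """commits: list of (sha, deploy_time), sorted by deploy_time.
        Return the commits deployed within `window` before the first bad report."""
        return [(sha, t) for sha, t in commits
                if first_hit - window <= t <= first_hit]

    commits = [
        ("a1b2c3", datetime(2024, 1, 8, 10, 5)),
        ("d4e5f6", datetime(2024, 1, 8, 10, 20)),
        ("0789ab", datetime(2024, 1, 8, 10, 45)),
    ]
    first_report = datetime(2024, 1, 8, 11, 0)

    # With one deploy per commit, the pool is the last two or three commits,
    # not the whole day's 50.
    print(candidate_commits(commits, first_report))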


Why roll out to all servers at once? What if each new commit went out to 1% of servers first, then spread? At worst you need to roll back a small percentage of servers, since you can clearly see which servers are misbehaving.
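
Something like this, roughly (a sketch of the staged-percentage idea only; deploy_to, error_rate, and rollback are stand-ins for whatever deploy and monitoring tooling you actually have):

    import random

    STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet at each stage
    ERROR_THRESHOLD = 0.005             # max tolerable error rate per stage

    def deploy_to(build, hosts):
        print(f"deploying {build} to {len(hosts)} hosts")

    def rollback(build, hosts):
        print(f"rolling back {build} on {len(hosts)} hosts")

    def error_rate(hosts):
        return 0.001  # stub: read this from real monitoring

    def staged_rollout(build, servers):
        servers = list(servers)
        random.shuffle(servers)          # pick canary hosts at random
        deployed = 0
        for fraction in STAGES:
            target = int(len(servers) * fraction)
            deploy_to(build, servers[deployed:target])
            deployed = target
            if error_rate(servers[:deployed]) > ERROR_THRESHOLD:
                # Only a small slice ever ran the bad build; undo and stop here.
                rollback(build, servers[:deployed])
                return False
        return True

    staged_rollout("build-1234", [f"web{i}" for i in range(400)])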



