Very cool to see this on HN. :) I developed a 1hr class in March or April for Facebook's "bootcamp", and kind of just kept going, collecting war stories from smart folks around the industry. If you have any questions or want to suggest a topic to add, fire away.

Thanks. I especially loved the story about the delayed-feedback-causing-oscillations issue.

Is it possible to share "postmortems" of large-scale systems in more detail? This would really help academics who are interested in working on techniques to diagnose such issues. Perhaps log files, history of monitored metrics, etc.?


That's an interesting idea. Normally it goes the other way round. Facebook has a fellowship program, and often collaborates with academia, but usually in "data science", not systems or perf. If you are interested, msg me. Maybe something can happen.

Histograms instead of averages!
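
To make that concrete, here is a minimal sketch with made-up numbers (the latencies, the bucket edges, and the `histogram` helper are all assumptions for illustration, not data from any real system). The average lands on a value that no request actually experienced, while the bucket counts show the two separate populations right away.

```python
from statistics import mean

# Hypothetical request latencies in ms: 90 fast requests, 10 slow ones.
latencies_ms = [10] * 90 + [500] * 10


def histogram(samples, bucket_edges):
    """Count samples per bucket; bucket i covers [edges[i], edges[i + 1])."""
    counts = [0] * (len(bucket_edges) - 1)
    for s in samples:
        for i in range(len(counts)):
            if bucket_edges[i] <= s < bucket_edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Illustrative bucket edges, chosen by hand for this example.
edges = [0, 50, 100, 250, 1000]

average = mean(latencies_ms)
counts = histogram(latencies_ms, edges)

print(f"average: {average} ms")
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"[{lo:>4}, {hi:>4}) ms: {n}")
```

Here the average is 59 ms, even though every request took either 10 ms or 500 ms. The histogram puts 90 requests in the lowest bucket and 10 in the highest, which is the thing you actually want to know.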

