GitLab.com Downtime Postmortem (docs.google.com)
123 points by jobvandervoort on July 8, 2014 | 48 comments

The bit about the large repositories reminds me tangentially of something I read about Amazon: when tuning for performance, they don't look at, say, the median response time. They look at the 99.9% level.

If I recall rightly, their example was a customer searching their old orders. The customers with the slowest response times were their very best customers, and they wanted those people to have at least as good an experience as the median customer.

That definitely changed my attitude to performance tests: now whenever I'm picking metrics, I think hard about the real-world implications of the levels I'm setting.

Very true, and you can be sure we'll pay a bit more attention to the 0.1% biggest repositories from now on.

Yup, AWS uses this metric everywhere (TP99). If 1% of your requests take 1000% more time than everything else, then you need to optimize and ensure this doesn't screw everyone.

Here is a bit more info: http://codesith.blogspot.com/2012/06/tp99.html
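The point generalizes: the median can look healthy while the tail is awful. A minimal sketch of this (the latency numbers are invented, and this floor-index percentile is a simplification of how TP99 is really computed over rolling windows):

```python
def percentile(samples, p):
    """Simple floor-index percentile: the value at position p% of the sorted data."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p / 100.0 * len(ordered)))
    return ordered[idx]

# 99 fast requests and one pathological outlier (milliseconds).
latencies = [20] * 99 + [2000]

print(percentile(latencies, 50))  # median: 20 -- looks fine
print(percentile(latencies, 99))  # TP99: 2000 -- the tail tells the real story
```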

One important learning from the postmortem: always set your servers up to be in UTC rather than any other time zone. Helps debugging and log correlation and eliminates confusion during incidents.

Of course all hardware clocks should be set to UTC, but there's a lot of benefit in using the locale to understand what's going on in relation to the human viewing the log.

If I'm a human waking up at 3AM and I look at the logs for this server, the first thing I'm going to think is "is the time on this server the same as where I am?" followed by "how long ago did these events occur in relation to me?" The easiest way for any human to do this is to compare timezones. If you condition yourself to know the UTC difference of every time zone, this works automatically, but I'd argue most humans are better at estimating the difference between timezones they know, such as how California and New York are three hours apart, and New York and London are five hours apart.

In terms of correlating log events across the globe, you really, really want a log parser and correlating tool. They make your life easier and reduce time and complication during events. Splunk, Loggly, Graylog, Logstash, ELSA, etc. Don't look at your logs by hand. You'll be sitting there all day with 6 split windows looking at mysql, nginx, app logs, kernel logs, mail logs, security logs, blah blah blah, just on one server. When you horizontally scale your app, whether it's 2 or 2,000 servers, you need something to parse and correlate your logs so you can say "show me all logs from 11:00PM to 2:00AM from Web Cluster B", you get to save 10 minutes on your outage, and you don't miss anything.
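The kind of query described above ("all logs from 11:00PM to 2:00AM from Web Cluster B") is essentially a filter over normalized, UTC-stamped records. A toy sketch, with hosts, messages, and timestamps all invented:

```python
from datetime import datetime, timezone

# Hypothetical pre-parsed records: (UTC timestamp, host, message).
records = [
    (datetime(2014, 7, 8, 22, 30, tzinfo=timezone.utc), "web-b1", "slow query"),
    (datetime(2014, 7, 9, 1, 15, tzinfo=timezone.utc), "web-b2", "502 from upstream"),
    (datetime(2014, 7, 9, 3, 0, tzinfo=timezone.utc), "web-a1", "cron run"),
]

def logs_between(records, start, end, host_prefix=""):
    """Records inside [start, end) whose host matches a prefix."""
    return [r for r in records
            if start <= r[0] < end and r[1].startswith(host_prefix)]

# "Show me all logs from 23:00 to 02:00 UTC from cluster B."
window = logs_between(records,
                      datetime(2014, 7, 8, 23, 0, tzinfo=timezone.utc),
                      datetime(2014, 7, 9, 2, 0, tzinfo=timezone.utc),
                      host_prefix="web-b")
print(window)
```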

I think you just made a better case for having everything in UTC. Instead of having to remember multiple offsets, one for each location, you just need to know your current offset to get a sense of relativity (i.e. "ok, that occurred an hour ago") and the correlation across different locations and sources takes care of itself.

I'm not saying that a log aggregator isn't needed; they're still an important part of any system, as your 3rd paragraph clearly explains. But your 2nd paragraph actually makes the case for keeping everything in UTC regardless.

Is that the lesson? Or just that everything should be in the same time zone?

If the company's staff is all in one time zone, I'm inclined to use that for servers, as otherwise people have to mentally juggle two time zones: local and UTC.

That's a good question. UTC is almost certainly still the best candidate even in that case because:

* it's not affected by daylight saving time, which in many candidate timezones causes at least two complex and error-fraught time discontinuities every year, and which in the US requires manual intervention at the whims of Congress.

* UTC is the lingua franca of machine time communication -- if you ever have to, e.g., send data about transaction times, server event logs, order histories, etc. to any third party or API, then because there are dozens of other timezones besides yours, and data normalization is important, they will almost certainly expect ISO 8601 in UTC. Yes, ISO 8601 has time offsets. No, not all libraries implement them properly on either side.

* The first time you ever have logs from two different timezones -- e.g., a service provider log denominated in UTC and yours in PST -- you will hate computers so much that you will likely quit your job, throw your keys on the floor, and go run a hotdog stand down by the pier rather than deal with it. This is an honest response to time and localization issues but think of your children.
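The mixed-timezone pain in the last two points is exactly what normalizing to UTC avoids. A small sketch of that normalization (the timestamps and the Pacific-time zone are invented for illustration; assumes Python 3.9+ for zoneinfo):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ standard library

# A provider's log line in UTC, and our hypothetical log line in Pacific time.
theirs = datetime(2014, 7, 8, 22, 4, 5, tzinfo=timezone.utc)
ours = datetime(2014, 7, 8, 15, 4, 5, tzinfo=ZoneInfo("America/Los_Angeles"))

# Normalize to UTC before comparing, sorting, or exporting.
assert ours.astimezone(timezone.utc) == theirs

# ISO 8601 in UTC -- the form a third party will almost certainly expect.
print(ours.astimezone(timezone.utc).isoformat())  # 2014-07-08T22:04:05+00:00
```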

...and just when you thought it was safe to go back outside, leap seconds rear their ugly head and ruin your day.

This starts out as the way most companies think; then, before you know it, you're big enough to open datacenters in other countries, and you have to deal with timestamp conversion.

Why would you ever need to convert a timestamp? You typically parse a timestamp like "1999-03-29 20:02:04 CEST" and return a number of seconds since a given point in time (e.g. UNIX epoch if it must be). The returned time is obviously +0000 == UTC. If it's not, your language/OS/software sucks, and it's no wonder your hair is on fire.

You typically parse a timestamp like "1999-03-29 20:02:04 CEST" and return a number of seconds since a given point in time (e.g. UNIX epoch if it must be).

There's your conversion. I'm not saying it's not easy, I'm saying you're just pulling yourself out of a hole that your tools were configured to dig.
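To make the conversion under discussion concrete, here is one hedged sketch of parsing that exact stamp. Abbreviations like "CEST" are ambiguous to most parsers, so the sketch maps them to an IANA zone by hand (Europe/Berlin is an assumption; requires Python 3.9+ for zoneinfo):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

stamp = "1999-03-29 20:02:04 CEST"
text, abbrev = stamp.rsplit(" ", 1)

# Zone abbreviations are not unique, so map them to IANA zones explicitly.
zones = {"CET": ZoneInfo("Europe/Berlin"), "CEST": ZoneInfo("Europe/Berlin")}

local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=zones[abbrev])
epoch = local.timestamp()  # seconds since the UNIX epoch, which is defined in UTC
print(int(epoch))
```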

An example of some of the pitfalls involved: http://search.cpan.org/~drolsky/DateTime-1.10/lib/DateTime.p...

Because it's always easier said than done.

Multiple applications in different languages, running on different environments and managed by different teams, pose a real challenge.

It's not like I'm speculating; I've implemented these things. The hardest part is agreeing on TZ symbols (e.g. Indian Standard Time vs. Irish Standard Time). Other than that, it's not very hard to get right.

Very true, we'll update the graph server that was misconfigured.

This is a very open and candid write-up. Very much appreciated by the community. Too many companies try to hide/cover-up their outages, but Gitlab is letting it all hang out, their mess-ups as well as things outside their control. I think that speaks a lot to the character of the company, and shows they really care.

Thanks Alipus, we're trying to be a part of the GitLab community in everything we do. Even if our operations go south we want that to inform others.

GitLab B.V. CEO here, please let us know if you have any questions (you can also leave a comment or suggestion in the doc if you want). This whole real-time postmortem is something we thought of to contribute back after having downtime. Feel free to let us know what you think of it.

It's only after I saw the "B.V." I realized Gitlab is a Dutch company, cool! :)

Just one question, I saw a lot of logs being copy-pasted, aren't you concerned about any security sensitive data leaking out?

Thanks! We are based in the Netherlands but we're mostly a remote company https://about.gitlab.com/2014/07/03/how-gitlab-works-remotel...

We're worried about sensitive data, see my other answer https://news.ycombinator.com/item?id=8003800

I haven't read it but I like the idea in general. One question: aren't you concerned about accidentally exposing security critical info over the doc?

Glad you like the idea. We're worried about exposing sensitive information. In general we're pretty careful about credentials, so that will be OK, and our setup is not a secret (a lot is open source and we share the High Availability details with standard subscribers). But it could happen that we share user information such as project names in the log files by accident. We think there are also benefits to working out in the open and think the trade-off is worth it. We hope people will let us know if they see anything sensitive in the doc so we can quickly remove it.

Thanks, I've now read it. Very nice. I guess having your architecture out in the open is liberating in that sense.

So now two specific questions: 1. So why did you have a big read spike? And 2. are you going to share the TODOs as well? :)

BTW We're using gitlab internally (at EverythingMe) and are very happy about it, especially the flow of fixes and features implemented.

1. We don't know (and the read spike might be an effect rather than the cause)

2. Yes, most of the things after the hashrockets (=>) currently in the doc are TODO's

BTW Awesome to hear that you are happy users of GitLab.

I bet it's an effect rather than a cause. We had some server incident recently that started with a huge write spike on some machines. Turned out it was the nginx error log and the problem itself was something else.

Right now it looks like 1 was caused by an extremely large repo (18 GB) being pushed. Only 0.1% of repos are bigger than 1GB. We're still investigating.

have you tried HackPad instead of docs?

I have yet to understand why HN users like HackPad. At best it's almost as good as Google Docs on some features.

The reason I've seen it used is that it's open source and therefore can be hosted by the entity involved.

It is?!

Where's the source code? Their GitHub page is rather sparse, and the only hint I've found is this tweet:


Sorry, I'm thinking of Etherpad:


The source code of which is here:


Hackpad is apparently a fork of that:


Cool, thanks!

"6. Production filesystem IS NOT mounted at this point

9. gitlab-ctl start starts the GitLab with the staging data in /var/opt/gitlab on the root filesystem that doesn’t have any production data. At this point logs report that the production db doesn’t exist which is correct because we are not on the production file system. No production data has been touched at this point. The Gitlab web UI is not responding (502 error from nginx)"

This is what strikes me as needing addressing. There shouldn't be staging data that is normally hidden by mounting the production filesystem. If the production database isn't there, Postgres should fail to start. The Postgres team is pretty adamant that otherwise bad things can happen; see http://www.postgresql.org/message-id/12168.1312921709@sss.pg...
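One way to implement that kind of guard, as a sketch (the mountpoint path and the wrapper idea are assumptions, not GitLab's actual tooling):

```python
import os
import sys

def assert_mounted(path):
    """Refuse to proceed if the given path is not an actual mountpoint,
    instead of silently falling through to data on the root filesystem."""
    if not os.path.ismount(path):
        sys.exit(f"refusing to start: {path} is not mounted")

# "/" is always a mountpoint, so this call passes; in production you would
# check the real data directory (e.g. /var/opt/gitlab) before running
# anything like `gitlab-ctl start`.
assert_mounted("/")
```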

We fixed part of the root issue by restarting GitLab automatically when we configure a server as master, preventing it from continuing with the staging data. The staging data was created automatically by gitlab-ctl. I agree that this is confusing and created the item: "Why did GitLab start when it shouldn't?" in the postmortem. Thanks for raising this point.

I wonder what their drbd is using as its backing store and why it's needed at all.

Personally, I honestly wouldn't trust running a database on top of DRBD, and I'm unsure whether using a DRBD volume as a database directory will even leave the other replica with correct and usable data in case of an error.

Personally, I would use a real disk or maybe LVM as the store for the database in order to reduce points of failures and remove less tested components from the picture.

Then of course you need a database slave, but it looks like you had that anyways which, again, leads me to question why drbd was even involved.

I used DRBD before it was cool (before it was added into the kernel), and it works great for highly available directories like home dirs, which was my use case. Even then it was pretty solid, and it's very similar to using a high end NAS product like NetApp or EMC with automatic failover, except you don't have to shell out $100K+++ for it.

If you used it for a database then everything committed to disk is immediately available on the partner, which you can spin up instantly with some scripts. And look, your transaction log is exactly where you left it! With a database and built-in replication you could go either way, but there are some applications with persistency that you need to make highly available, and that's another area DRBD would shine.

I used to run a production database on top of DRBD without any problems, in an active-passive failover HA setup.

The major advantage is that no other specialized components (NAS/SAN/etc.) are needed and you still have decent shared storage between the nodes.

Three nodes in such a cluster are highly recommended to resolve/avoid any insane-in-the-split-brain scenarios; DRBD storage needs to be protected from splits and cannot help you resolve them the way a traditional NAS/SAN LUN can.

We had to do some tuning to get the performance we needed, but this was a few years ago and the situation has probably improved since. Databases usually like the vm.dirty_* settings to be very low or 0.
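For reference, that tuning typically lives in sysctl. A purely illustrative fragment (these values are examples of "very low", not recommendations):

```
# /etc/sysctl.conf -- illustrative values only
vm.dirty_background_ratio = 1   # start background writeback almost immediately
vm.dirty_ratio = 2              # cap dirty pages before writers block
```

Applied with `sysctl -p`.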

We use EXT4 on DRBD on LVM on RAID10. We already have DRBD for the git repos and prefer to reuse our experience for the database. There are other reasons too, some of which are articulated in http://wiki.postgresql.org/images/0/07/Ha_postgres.pdf

I gave a presentation yesterday about our HA setup https://docs.google.com/presentation/d/1gIzmp-d5X86jJMQz7Ixs...

Why would you not trust a database server on DRBD? It's a very solid product and using it for database replication/failover is a common pattern. Word on the street is that Amazon RDS uses DRBD for multi-az failover.

Could this be caused by DRBD not reading the changes from the Postgresql database quickly enough? Or is it some issue with the interaction between the software RAID10, DRBD under high postgresql I/O load? What are the respective versions of the kernel, database and DRBD? Are there disk I/O, network I/O and cpu logs available of the time leading up to the crash?

We're not sure. There was a lot of disk locking going on at the time.

Well, it's not all back for us: our repositories are corrupted...

I'm sorry to hear you are experiencing problems. This is the first report we've seen of possible corruption. Please contact support@gitlab.com and note the URLs and commands that do not work for you. Edit: there's another report I just found: https://gitlab.com/gitlab-com/support-forum/issues/2

We're taking a break until 16:00 CEST; many questions are answered already and TODOs are given after the => in the document.

We're back again.

And we're done for today, thanks for stopping by everyone.
