Hacker News new | past | comments | ask | show | jobs | submit login
Lessons we learned when debugging a scaling problem (gitlab.com)
167 points by sahin-boydas 50 days ago | hide | past | web | favorite | 66 comments

I love these sorts of write ups. I wish there was a way to encourage internal transparency like this, as opposed to my least favorite way to be greeted when I let someone know about an issue with their product: "Can you try again?"

Nothing as infuriating as that, it could be a valuable learning lesson for many people just like this!

I would love to see a blog post explaining the non-technical interaction for something like this. Is it a single person that understands all the networking plus the whole software stack, from HA-Proxy to the SSH hosts, or is it a team of people working together on the problem?

I feel like everyone involved would have to be more eager to solve the problem than worrying about assigning blame. Does that work because it's all within the same company or does GitLab make extra effort to dis-incentivize blame shifting and other unproductive behavior?

In my life (all small scale - local vendors) I call people and the only answer I ever get is roughly "our stuff if perfect. It's you."

We definitely take deliberate effort to dis-incentivize blame; it's specifically called out in our values at https://about.gitlab.com/handbook/values/#blameless-problem-... but speaking as an employee, I can confirm that this is my daily experience.

Of course for this specific issue there really wasn't anything or anyone to blame: no-one really did anything wrong at any point. Systems simply get complicated at certain scales and circumstances, and we have to engineer our way around the issues.

Back when I was studying for a cybersecurity course, I had the pleasure of reading a few posts on the Cloudflare blog. There are about 88 pages(!) of blogs at blog.cloudflare.com

A few recent ones might pique your curiosity




> Lesson #4: When you choose specific non-default settings, leave a comment or link to documentation/issues as to why, future people will thank you.

Put it in the commit message. There is no better place to put it. That is the exact point in time at which you have the most information about that exact change. Commit messages are cheap, be liberal with your words and links.

It drives me nuts to read commit messages of the form "fix", "fixed 2", "add test", "new var" and on and on. You took an hour to make the change, take a damn minute to save some future colleague a day of frustration.

I agree with you about the importance of commit messages, but the problem with using them in place of comments is that commit messages become much harder to find as the history becomes longer, as opposed to comments which will always be immediately visible for as long as they are relevant.

This is what git blame is for. Comments are notorious for drifting out of sync with reality.

Git blame only shows you the last person that touched the line, not the person who introduced a breaking change to that line (but it wasn't caught for weeks/months).

Comments are just as cheap as commit messages. Why not both?

Git blame can also show you the last revision which touched a file since some revision. Git log on a file can show you all the revisions in which it was touched. Using these together lets you slice the history in many useful ways, and there are countless gui-based tools (git gui, for example) which also allow you to explore revisions.

None of that is as effortless as looking up at the comment on the previous line.

But which is more effortful to keep accurate.

Whereas the commit is immutably and ideally exactly about what it's about. There is no better time to capture the information in a way that perfectly connects it to what it's describing.

While commit messages are always there, they get lost. You fix a bug in your code, with a commit message. Later, you realize that this function is super reusable, so you move the function to another class in another file. Now, for all intents and purposes, the original commit message will never be seen again.

It's harder to find originals under such circumstances, sure. But I don't see this as a reason for never writing good messages, because a journal of intent is strictly better than a journal of physical amendments.

That something needs to be entirely perfect before being judged better than the status quo is the Nirvana fallacy.

I don't think anyone is arguing for "never writing good messages"; the point is that you can have both good commit messages and good comments, both of which are useful for different reasons, and you don't need to give up one to have the other.

The parent thread was very much about using commit messages instead of comments, and I was arguing against that.

Two rights, on the other hand, don’t make a wrong.

It's more accurate to say that git blame has that information but doesn't make it especially easy to get to.

It shows you the commit that introduced the current version of the line. That's what you want if you want to understand the current version of the line.

I feel like I'm missing something here. I hate this feeling of impending dumb.

A line can have been touched multiple times since the breaking change (simplest example: indentation/formatting)

Got it. It's still possible to keep stepping back through the history of that line to understand its exact history.

I also typically find most files have only a handful of commits of interest in their history, so scoping a git log to the file often gets me within spitting distance of the right commit, if someone took the time to write a good one.

Edit: which goes back to my original point, that the commit is the best time and place to snapshot the meaning of that time and that place.

I agree, but you must also concede that not all developers are as proficient at git as you and might not be aware of such debugging/detective techniques. These same developers tend to favor print statements for debugging instead of an actual debugger. Also, not every developer has the privilege of using a DCVS like git.

For them, a comment in the code could save a lot of time and hassle.

Fair points.

Though I hope people at gitlab use git.

Hi Jacques, your Pivitol is showing :p

Commit messages are good, but the dogmatic anti-comment philosophy that Pivitol follows is overdone. Lots (I mean LOTS) of comments are redundant or can be fixed with better naming or organization. But comments shine when dealing with the "why" behind something. I would rather see a comment that says why a value is set to what it is than have to be a git historian.

Hi Seth :)

Yes, when there's no better alternative, or when it's idiomatic (Go linters) or situationally appropriate (Spring libraries are immaculately commented).

But I emphasise "no better alternative" because I have seen comment rot and it sucks badly. The English language is pretty vast and can be usefully ransacked for names. Keyboards are cheap and autocompletion plentiful, so go hog wild with variable names. There are idioms, patterns, well-understood conventions and so on that allow you to avoid noise.

When you need a why, then absolutely give the why. But usually the need for an explanation is that the code isn't clear on its face and you should exhaust that possibility first.

This concludes today's reading from the Book of Annoyingly Doctrinaire Ex-Labs Pivots. Go in peace.

So this is something I want to fix about commit messages.

If you screw up and forget to add the documentation to the commit message, there is nothing you or I can do about it. And if the commit log is not uniformly useful in this way, then people start to discount the value of the commit messages.

And then I have to explain to people who keep cutting and pasting code around improperly are ruining our commit history. I often get a shrug.

The first part we could do something about, and I hope we do in the next version control system (there's always a next one). All sorts of metadata should be capable of being attached to a code commit. Some of it stuff we have been relegating to third party tools, which means you can't reason about it as a whole.

Did this commit get a green build? Did it have a code review? What PR was it in? Did we run any static analysis? Did we learn after the fact that this code doesn't solve the problem or is a bad solution?

If all of this is at my fingertips, I've got a better chance of getting a bunch of people (myself included) to have a broader understanding of the code, or at least an easier time getting familiar with it when I need to.

Well there's two things there.

One is commit discipline. People care more or they care less, no tool can truly change that.

The other is how we collect information. I'm not convinced that version control systems are the right place for that, but I know others have explored it (eg. Fossil SCM). Github provides what is a de facto data store for a lot of such annotations through their APIs, I imagine Gitlab does similar.

Commit discipline is a wider issue, but building and relying on tools that require humans to never make an error does not scale.

People always screw up their commit messages. And once they do it they stop trying because if you can't be perfect you're just torturing yourself to try. If there's a way to fix them after the fact you can begin demanding that people have more commit discipline and make them go back and fix it if they don't get it right the first time.

I take it as a matter of Broken Window syndrome. You can try to keep them from getting broken but at some point you have to have a way to fix the ones that break.

I didn't mean to give the impression that I don't think better tools are needed, I do. I feel that automation is a gift: what gets automated gets done. Tools create paradigms of working.

What I was driving at was that better tools are one part of the triad of people, tools and ways that people work together with tools. Commit discipline will always be desirable, because no tool can recover a snapshot of my mind at the time I made the commit.

This is a big part of my problem with git.

The existence of a feature is not the same as a solution to the problem. All of these little tools that exist in the git toolspace are not automatic. Every developer has to add it to their already pretty long set of git rituals.

That's bullshit. We can do better than this.

Point me to a git feature that works when I do 'git push' or 'git pull' otherwise I'm not interested.

I'm also not interested in hearing people imply that people aren't smart enough if the tool doesn't suffice, and there's a very, very large concentration of that sort of asshole in the Git community. I don't make my tool decisions based on what I am smart enough or not smart enough to get. I make them based on what I'm willing to teach, my anticipation of how often I will have to be preempted to support others who don't get it, and what I'm willing to juggle when a production issue has the entire management team hand-wringing on a day when I'm already having a bad day.

Give me stupid-simple and dead reliable or it's not helpful.

> Lesson #3: It is polite to log interesting information at default levels and deliberately dropping a connection for any reason is definitely interesting to system administrators.

Agreed. But I suppose part of the reason this feature is present is because so many ssh servers are the targets of attacks, often brute force. So logging these might be problematic.

Speaking of "Rate limiting" -- openssh logging something like "dropped N connections exceeding MaxStartups" once per minute (if N > 0) [at WARN level] seems like a sane compromise.

I can't help but notice that these kind of blogs should be recommended reading for CS/engineering students as well and not just to people who are already working.

The way an issue is to be analyzed (this post is a great example of Systems thinking IMO), what happens at scale and why not to blindly Google your way out of problems are valuable skills any engineer should have and which are unfortunately not taught in many schools.

Isn't it possible to just set the haproxy maxconn to a slightly lower value than what the backend can deal with, and then let the reverse proxy retry once with another backend? Or even queue it for some hundreds of milliseconds before that?

This way you avoid overloading backends. Also, haproxy provides tons of real-time metrics, included the highwatermark for concurrent and queued connections.

What you suggest is what was followed by a large-scale distributed system I worked on. Your suggestion is also in-line with Google' SRE book section on Handling Overload: https://landing.google.com/sre/sre-book/chapters/handling-ov...

> Mandate that batch client jobs use a separate set of batch proxy backend tasks that do nothing but forward requests to the underlying backends and hand their responses back to the clients in a controlled way. Therefore, instead of "batch client → backend," you have "batch client → batch proxy → backend." In this case, when the very large job starts, only the batch proxy job suffers, shielding the actual backends (and higher-priority clients). Effectively, the batch proxy acts like a fuse. Another advantage of using the proxy is that it typically reduces the number of connections against the backend, which can improve the load balancing against the backend.

Other chapters on Load Balancing, Addressing Cascading Failures are related too.

Aaand this is what happens when one comments before reading all the article...

Scheduled tasks are a great way to brown-out your downstream dependencies.

In one instance, MDAM RAID checks caused P99 latency spikes first Sunday of every month [0] (default setting). It caused a lot of pain to our customers until the check was IO throttled, which meant spikes weren't as high, but lasted for a longer time.

Scheduled tasks are a great way to brown-out yourself.

In another case, the client process hadn't set a socket timeout on a blocking tcp connection [1] (default setting), and so it'd run out of worker threads blocked on recv routinely when the server (fronted by reverse-proxy) started rejecting incoming due to overload. Only a restart of the process would recover the client.

Scheduled tasks are a great way to prove HAProxy will scale way better than your backend. Thanks u/wtarreau

Speaking of HAProxy: It fronted a thread-based server processing all sorts of heavy and light queries with segregated thread-pools for various workloads. During peak, proxy would accept connections and queue work faster than the worker threads could handle, and on occasion, the work-queue would grow so big that it not only contained retries of same work scheduled by the desperate clients but a laundry list of work that wasn't valid anymore (processed by some other back-end in the retry path, or simply expired as time-to-service exceeded SLA). Yet, there the back-end was, in a quagmire, chugging through never-ending work, in constant overload when ironically the client wasn't even waiting on the other end and had long closed the connection. The health-checks were passing because, well, that was on a separate thread-pool, with a different work-queue. Smiles all around.

Least conns and the event horizon. Tread carefully.

Least conns load balancing bit us hard on multiple occasions, and is now banned for similar reasons outlined here: https://rachelbythebay.com/w/2018/04/21/lb/

[0] https://serverfault.com/questions/199096/linux-software-raid...

[1] https://stackoverflow.com/questions/667640/how-to-tell-if-a-...

> Scheduled tasks are a great way to brown-out your downstream dependencies.

I've been trying to convince my division to prioritize adding events to our stats dashboard.

Comparing response times to CPU times is just expected level of effort to interpret the graphs. But you don't have any visibility into how a chron job, service restart, rollback, or reboot of a host caused these knock-on effects. And without that data you get these little adrenaline jolts at regular intervals when someone reports a false positive. Especially in pre-prod, where corners get cut on hardware budgets, and deploying a low-churn service may mean it's down for the duration.

This might end up being a thing I have to do myself, since everyone else just nods and says that'd be nice.

I guess it would be nice to have a "fuzzy" cron job setting, so you could specify "run it at a random time during the day/hour/minute"

Someone pointed out in comments on the blog post (https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/#co...) that systemd actually has some good functionality for that.

Elsewhere this is referred to as "jitter". Most systems with regular schedules are built without it and eventually, many will grow into facemelting hordes.

General best practice is to add fuzzing to your scheduled task. So for example wrap it in a shell script that starts with `RANDSLEEP=$((RANDOM + 100)); sleep ${RANDSLEEP:1:2}`. Then it will sleep between 0 and 99 seconds. Or do 1:1 if you run the task every minute so fuzz the start by 0 to 9 seconds.

This seems to be the behaviour of AWS 'rate' Cloudwatch events. Eg. rate(1 hour) or rate(10 minutes) don't necessarily run on the hour. Not sure if they're relative to when you set them up or random though.

Bravo to the people that can navigate such a complex (not an insult) architecture and pull useful information from it! I expect it to be quite tricky to get in between the production environment(s) and capture packets or change logging etc.

I did wonder though: the authors seemed a bit unsure about the relation between MaxStartups and resource usage, wouldn't it be wise to just send an email to the openSSH mailing list or something to not be surprised by possible future problems?

Might be valuable if it reacted the same way in all environments (not always the case). And if someone answered faster (correctly - without sending them down an unrelated rabbit hole too!) then they could have found the answer out by just experimenting.

Don't get me wrong - I'm a huge fan of community support, but sometimes it's faster to just try things and see how the systems reacts.

Not to say you can't do both in parallel; and they might have - you aren't going to detail everything you did in a writeup like this.

As others said, I really enjoy reading write-ups like these. Who needs fiction when you have real world mysteries to solve?

> It would make everyone more comfortable if OpenSSH let us see the how close we were to hitting MaxStartups at any point, rather than having to go in blind and only find out we were close when the limit is breached and connections are dropped.

If they are not afraid to dive into source, it should be fairly easy to add some logging when the value reaches 75% for example. Of course one needs to be careful not to break anything else (issuing thousands of warning logs at the same time that the system is at peak load is probably not a good idea), but it shouldn't be that difficult. Also, I assume this would be something that OpenSSH would be happy to get a PR (or MR, coming from GitLab? ;) ) for.

Nice writeup. What I miss is how they added monitoring for these kinds of errors, so it doesn't escalate to 1.5% of all connections again. But other than that, good job!

Oh, that's a fun story. We use https://github.com/google/mtail to turn log data into metrics exported for prometheus. We were already doing this for our haproxy logs; we just added a stanza to capture a secondary indicator that the problem is occurring (termination_state 'SD', and 0 bytes_read), and export that as a metric. You can see the relevant MR at https://gitlab.com/gitlab-cookbooks/gitlab-mtail/merge_reque... (and https://gitlab.com/gitlab-cookbooks/gitlab-mtail/merge_reque... to fix some bugs)

Then we just hooked it up through the usual sort of prometheus alerting rules. We made it really twitchy, because this combination of logs Should Not Happen if everything is working right, and we want to alert as soon as it starts occurring, so we can bump the limits again (or re-evaluate in general)

Interesting read.

But i bumped into the same problem recently while working on distributed simulation. I was writing an agent for an application which talks to clients over ssh. I started getting the same error when i cranked up machine count to 60.

I googled like every layman and zeroed in on MaxStartups. Looked at the manual to cross verify and fixed the issue in 5 minutes. I agree with what ever he did after he found out the issue but it was a clear case of SSH throwing an error. Why not just look up manual or google instead of wasting time on laborious packet inspection.

Because at their scale, people actually need to understand the issue rather than just applying fixes blindly plucked from the Internet. Don't get me wrong, Google is my go-to as well, but I suspect that they need to dig a little deeper.

Fair enough.

But wouldn't doing investigation on the error provide data point to help debug the issue. They are trying to fix a customer issue and time is also of importance here.

> just applying fixes blindly plucked from the Internet.

I wasn't recommending that.

> Why not just look up manual or google instead of wasting time on laborious packet inspection.

When you've got a load balancing layer, and virtual machines, and interacting with the real world, there's a reasonable chance that the problem is not a (relatively) simple application configuration issue. If you've looked through enough of these, getting a pcap (especially with an identified customer) and looking through it with Wireshark is relatively quick.

I could have worded it better like "why packet inspection to start with".

I love packet inspection, but I always feel that if I have to go to that depth, then we missed an opportunity to have observability (metrics/logging) earlier.

Because pcap lies a lot less than everything else.

They should have led with the bottom line - add a `sleep 7` to any cron job that is going out to the network and set it to start at the top of the hour plus 3 minutes

It's blog posts like this that make me realize I have no idea what I'm doing. Great work on troubleshooting a complex and interesting problem.

I thought this was a pretty good piece, and points to the problems caused by running services at regular intervals.

I remember that few years ago same issue occured to either Github or Bitbucket and they made a blog post about it.

> we have 27 SSH backends, and 18 HAproxy

No autoscaling? This load pattern is a prime candidate for it.

I disagree.

Autoscaling sounds great, but in practice it notices that there is a problem after you already have once, and by the time you've scaled up, there is no problem any more. Which is the worst of all worlds.

For example in this case you'd realize there is a problem at the top of the hour. And 2 minutes later you'd have a bunch of autoscaled instances up, all wondering what the fuss was about. So they scale down again, but then at the top of the next hour the same thing happens again with the same result.

The same thing happens with ad servers. An ad buy goes in for 30 seconds. When it hits, you have a firehose. That then shuts off.

In fact this problem is so common that I recommend against autoscaling unless you have specific reason to believe that it will work for you.

And god help you if that starts your autoscaler to flapping.

Too much CPU load, start some servers. Too little, kill some servers. Oops now CPU is too high, bring them back. No, it's too low...

It reminds me of the old UI bug where you get an off-by-one error for scrollbar logic and so it appears and disappears in a cycle until you resize.

Edit: I think we want something akin to a learning thermostat, but for servers. Figure out that I get a CPU spike at 5:20 on week days and spin some servers up at 5:10. Then spin em down at 8:00 when EST starts to go to bed.

> .. Oops now CPU is too high, bring them back. No, it's too low...

You just described closed-loop control system oscillation. Cause is wrong gain and/or delay parameters. AWS ASG has some knobs for tuning like Cooldown, MaxSize. What you described above is most probably long boot time (delay) problem. Service should be ready in less than 30 secs after boot. To get there, one should bake AMIs, not install afresh all the stuff on boot.

You can get an oscillation without long boot or ramp-up time but they sure don't improve the situation.

Lazy loading of modules is a common cause, but any other resources that are loaded on demand or in the background. But that's more of a situation of thinking you need two additional nodes, getting three, settling back to two after warmup, and then killing them all off again when the spike ends.

I may have said elsewhere that I'm more comfortable with scaling load based on daily and weekly cycle patterns, with perhaps a few exceptions for special events (sales, new initiatives, article publications) and making it easy for someone to dial up or down.

To use a car analogy, get really good at power steering and power brakes before you attempt cruise control, get really good at cruise control before you attempt adaptive cruise control. And don't even talk about self-driving unless you're FAANG.

The load spikes here are predictable. They could spin up right before the extra resources are needed. However, it only masks the problem. Queue the argument for horizontal vs vertical scaling.

> For example .. In fact this problem is so common ..

So whats the problem here? Assuming there are no misconfigurations anymore, and just spiking loads.

The problem is that people try to use autoscaling to handle a spiking load, and it fails abysmally.

In my opinion for those with a spiking load profile, autoscaling should be considered harmful.

As others have said here, autoscaling has gotchas. I won't recount those reasons, but there are two other relevant points:

1) Check out the graph in the 'An illuminating graph' section. The connection rate spikes by a factor of 3 in the space of 5 seconds (and that's actually a consolidated average over a number of hours; the worst spikes at the top of the hour are even bigger). We'd need super-responsive, practically magic autoscaling, or pro-active auto-scaling that does so at known intervals (every hour, 10 minutes, 15 minutes, etc). But given that the actual CPU usage doesn't really vary that much over those timescales (it all sort of evens out once the connections are made and git takes over), auto-scaling just to add more connection slots in SSH would be a poor use of resources, when we can just increase MaxStartups as far as necessary.

2) We do want to autoscale, and will in the medium-term future, because the access patterns at a weekly scale are quite variable and predictable, and we can save a lot of resources by scaling down at the weekends (or even the EMEA evenings) when the bulk of our users go home and stop creating MRs, and it's only bots, CI, and cron jobs (I jest, but it feels like that some days). But not because of the issue described in the blog post

Autoscaling would only work well if it could react quick enough. Since there are peaks in each second, that means a very rapid startup is needed. You would also want to stop rapidly, but some of the ssh sessions are going to be lengthy, so you can't kill instances early.

We don't even have any indication that the load on the system is particularly affected, there's just an arbitrary connection cap. Average load in the cpu graph [1] looks pretty flat, and around 50%; so it's probably already over provisioned depending on their disaster recovery plans and their normal load patterns (didn't see a daily/weekly graph to armchair that)

[1] https://gitlab.com/gitlab-com/gl-infra/infrastructure/upload...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact