
Lessons we learned when debugging a scaling problem - sahin-boydas
https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/
======
cfors
I love these sorts of write ups. I wish there was a way to encourage internal
transparency like this, as opposed to my least favorite way to be greeted when
I let someone know about an issue with their product: "Can you try again?"

Nothing is as infuriating as that; an issue like this could be a valuable
learning lesson for many people!

~~~
donmcronald
I would love to see a blog post explaining the non-technical interaction for
something like this. Is it a single person that understands all the networking
plus the whole software stack, from HAProxy to the SSH hosts, or is it a team
of people working together on the problem?

I feel like everyone involved would have to be more eager to solve the problem
than to worry about assigning blame. Does that work because it's all within
the same company, or does GitLab make an extra effort to disincentivize blame
shifting and other unproductive behavior?

In my life (all small scale - local vendors) I call people, and the only
answer I ever get is roughly "our stuff is perfect. It's you."

~~~
certifiable
We definitely make a deliberate effort to disincentivize blame; it's
specifically called out in our values at
[https://about.gitlab.com/handbook/values/#blameless-
problem-...](https://about.gitlab.com/handbook/values/#blameless-problem-
solving), and speaking as an employee, I can confirm that this is my daily
experience.

Of course for this specific issue there really wasn't anything or anyone to
blame: no-one really did anything wrong at any point. Systems simply get
complicated at certain scales and circumstances, and we have to engineer our
way around the issues.

------
jacques_chester
> _Lesson #4: When you choose specific non-default settings, leave a comment
> or link to documentation/issues as to why, future people will thank you._

Put it _in the commit message_. There is no better place to put it. That is
the exact point in time at which you have the most information about that
exact change. Commit messages are cheap, be liberal with your words and links.

It drives me nuts to read commit messages of the form "fix", "fixed 2", "add
test", "new var" and on and on. You took an hour to make the change, take a
damn minute to save some future colleague a day of frustration.
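
For example, something like this (every detail below is invented; it's the
shape that matters):

    Raise sshd MaxStartups above the default 10:30:100

    The hourly cron rush pushes us past the default cap on concurrent
    unauthenticated connections, so sshd starts randomly dropping new
    clients. See <link to incident issue> and
    https://man.openbsd.org/sshd_config#MaxStartups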

~~~
hinkley
So this is something I want to fix about commit messages.

If you screw up and forget to add the documentation to the commit message,
there is nothing you or I can do about it. And if the commit log is not
uniformly useful in this way, then people start to discount the value of the
commit messages.

And then I have to explain to the people who keep improperly cutting and
pasting code around that they are ruining our commit history. I often just get
a shrug.

The first part we could do something about, and I hope we do in the next
version control system (there's always a next one). All sorts of metadata
should be capable of being attached to a code commit. Some of it is stuff we
have been relegating to third-party tools, which means you can't reason about
it as a whole.

Did this commit get a green build? Did it have a code review? What PR was it
in? Did we run any static analysis? Did we learn after the fact that this code
doesn't solve the problem or is a bad solution?

If all of this is at my fingertips, I've got a better chance of getting a
bunch of people (myself included) to have a broader understanding of the code,
or at least an easier time getting familiar with it when I need to.

~~~
jacques_chester
Well, there are two things there.

One is commit discipline. People care more or they care less; no tool can
truly change that.

The other is how we collect information. I'm not convinced that version
control systems are the right place for that, but I know others have explored
it (e.g. Fossil SCM). GitHub provides what is a de facto data store for a lot
of such annotations through its APIs; I imagine GitLab does something similar.

~~~
hinkley
Commit discipline is a wider issue, but building and relying on tools that
require humans to never make an error does not scale.

People always screw up their commit messages eventually. And once they do,
they stop trying, because if you can't be perfect, trying just feels like
torturing yourself. If there were a way to fix them after the fact, you could
begin demanding more commit discipline and make people go back and fix
messages they didn't get right the first time.

I take it as a matter of broken-windows syndrome: you can try to keep the
windows from getting broken, but at some point you have to have a way to fix
the ones that do break.
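
For what it's worth, git already has a partial answer for the after-the-fact
part: git notes can attach annotations to existing commits without rewriting
history (sketch below; the sha and message are made up):

    # attach an explanation to a commit that is already pushed
    git notes add -m "Context we forgot: raised MaxStartups, see issue link" <sha>
    # the note shows up alongside the commit in git log / git show
    git show <sha>
    # notes live under refs/notes/ and must be pushed explicitly
    git push origin refs/notes/commits

The catch is exactly the discoverability problem above: almost no tooling
surfaces notes by default.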

~~~
jacques_chester
I didn't mean to give the impression that I don't think better tools are
needed, I do. I feel that automation is a gift: what gets automated gets done.
Tools create paradigms of working.

What I was driving at was that better tools are one part of the triad of
people, tools and ways that people work together with tools. Commit discipline
will always be desirable, because no tool can recover a snapshot of my mind at
the time I made the commit.

------
wyldfire
> Lesson #3: It is polite to log interesting information at default levels and
> deliberately dropping a connection for any reason is definitely interesting
> to system administrators.

Agreed. But I suppose part of the reason this feature is present is that so
many SSH servers are the targets of attacks, often brute force, so logging
every dropped connection might be problematic.

Speaking of "rate limiting": OpenSSH logging something like "dropped N
connections exceeding MaxStartups" once per minute (if N > 0), at WARN level,
seems like a sane compromise.

------
vmurthy
I can't help but notice that these kinds of blog posts should be recommended
reading for _CS/engineering students_ as well, not just for people who are
already working.

How to analyze an issue (this post is a great example of systems thinking,
IMO), what happens at scale, and why not to blindly Google your way out of
problems are valuable skills any engineer should have, and they are
unfortunately not taught in many schools.

------
tilolebo
Isn't it possible to just set the haproxy maxconn to a slightly lower value
than what the backend can deal with, and then let the reverse proxy retry once
with another backend? Or even queue it for some hundreds of milliseconds
before that?

This way you avoid overloading backends. Also, haproxy provides tons of real-
time metrics, including the high-water marks for concurrent and queued
connections.
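
Roughly the knobs I mean, sketched out (illustrative names, addresses and
numbers; untested):

    backend ssh
        mode tcp
        option redispatch      # retry the connection on another server
        retries 1
        timeout queue 500ms    # queue briefly instead of piling onto a server
        server ssh1 10.0.0.1:22 check maxconn 90   # a bit below what the box handles
        server ssh2 10.0.0.2:22 check maxconn 90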

~~~
ignoramous
What you suggest is what a large-scale distributed system I worked on did.
Your suggestion is also in line with the Google SRE book's section on Handling
Overload: [https://landing.google.com/sre/sre-book/chapters/handling-
ov...](https://landing.google.com/sre/sre-book/chapters/handling-overload/)

> _Mandate that batch client jobs use a separate set of batch proxy backend
> tasks that do nothing but forward requests to the underlying backends and
> hand their responses back to the clients in a controlled way. Therefore,
> instead of "batch client → backend," you have "batch client → batch proxy →
> backend." In this case, when the very large job starts, only the batch proxy
> job suffers, shielding the actual backends (and higher-priority clients).
> Effectively, the batch proxy acts like a fuse. Another advantage of using
> the proxy is that it typically reduces the number of connections against the
> backend, which can improve the load balancing against the backend._

The chapters on Load Balancing and on Addressing Cascading Failures are
related too.

------
ignoramous
Scheduled tasks are a great way to brown-out your downstream dependencies.

In one instance, mdadm RAID checks caused P99 latency spikes on the first
Sunday of every month [0] (the default schedule). It caused a lot of pain for
our customers until the check was IO-throttled, which meant the spikes weren't
as high but lasted longer.

Scheduled tasks are a great way to brown-out yourself.

In another case, the client process hadn't set a socket timeout on a blocking
TCP connection [1] (the default), so it would routinely run out of worker
threads blocked on recv when the server (fronted by a reverse proxy) started
rejecting incoming connections due to overload. Only a restart of the process
would recover the client.

Scheduled tasks are a great way to prove HAProxy will scale way better than
your backend. Thanks u/wtarreau

Speaking of HAProxy: it fronted a thread-based server processing all sorts of
heavy and light queries, with segregated thread pools for the various
workloads. During peaks, the proxy would accept connections and queue work
faster than the worker threads could handle it, and on occasion the work queue
would grow so big that it contained not only retries of the same work
scheduled by desperate clients, but a laundry list of work that wasn't valid
anymore (processed by some other backend in the retry path, or simply expired
because time-to-service exceeded the SLA). Yet there the backend was, in a
quagmire, chugging through never-ending work, in constant overload, when
ironically the client wasn't even waiting on the other end and had long since
closed the connection. The health checks were passing because, well, those ran
on a separate thread pool, with a different work queue. Smiles all around.

Least conns and the event horizon. Tread carefully.

Least-conns load balancing bit us hard on multiple occasions, and it is now
banned, for reasons similar to those outlined here:
[https://rachelbythebay.com/w/2018/04/21/lb/](https://rachelbythebay.com/w/2018/04/21/lb/)

[0] [https://serverfault.com/questions/199096/linux-software-
raid...](https://serverfault.com/questions/199096/linux-software-raid-runs-
checkarray-on-the-first-sunday-of-the-month-why)

[1] [https://stackoverflow.com/questions/667640/how-to-tell-
if-a-...](https://stackoverflow.com/questions/667640/how-to-tell-if-a-
connection-is-dead-in-python)

~~~
qw
I guess it would be nice to have a "fuzzy" cron job setting, so you could
specify "run it at a random time during the day/hour/minute".
~~~
certifiable
Someone pointed out in comments on the blog post
([https://about.gitlab.com/2019/08/27/tyranny-of-the-
clock/#co...](https://about.gitlab.com/2019/08/27/tyranny-of-the-
clock/#comment-4593761983)) that systemd actually has some good functionality
for that.
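
For reference, the feature in question is (as far as I know)
RandomizedDelaySec= on timer units; a minimal illustrative timer:

    # mytask.timer (illustrative)
    [Unit]
    Description=Hourly job with jitter

    [Timer]
    OnCalendar=hourly
    # delay each activation by a random 0-5 minutes
    RandomizedDelaySec=300

    [Install]
    WantedBy=timers.target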

------
afraca
Bravo to the people who can navigate such a complex (not an insult)
architecture and pull useful information from it! I expect it to be quite
tricky to get into the production environment(s) to capture packets, change
logging, etc.

I did wonder, though: the authors seemed a bit unsure about the relationship
between MaxStartups and resource usage. Wouldn't it be wise to just send an
email to the OpenSSH mailing list or something, so as not to be surprised by
possible future problems?

~~~
EricE
It might be valuable if SSH reacted the same way in all environments (not
always the case), and if someone answered faster (and correctly, without
sending them down an unrelated rabbit hole!) than they could have found the
answer by just experimenting.

Don't get me wrong - I'm a huge fan of community support, but sometimes it's
faster to just try things and see how the system reacts.

Not to say you can't do both in parallel; and they might have - you aren't
going to detail everything you did in a writeup like this.

As others said, I really enjoy reading write-ups like these. Who needs fiction
when you have real world mysteries to solve?

------
amenod
> It would make everyone more comfortable if OpenSSH let us see how close
> we were to hitting MaxStartups at any point, rather than having to go in
> blind and only find out we were close when the limit is breached and
> connections are dropped.

If they are not afraid to dive into the source, it should be fairly easy to
add some logging when the value reaches, say, 75%. Of course one needs to be
careful not to break anything else (issuing thousands of warning logs right
when the system is at peak load is probably not a good idea), but it shouldn't
be that difficult. Also, I assume this is something OpenSSH would be happy to
get a PR (or MR, coming from GitLab? ;) ) for.

Nice writeup. What I missed was how they added monitoring for these kinds of
errors, so it doesn't escalate to 1.5% of all connections again. But other
than that, good job!

~~~
certifiable
Oh, that's a fun story. We use
[https://github.com/google/mtail](https://github.com/google/mtail) to turn log
data into metrics exported for prometheus. We were already doing this for our
haproxy logs; we just added a stanza to capture a secondary indicator that the
problem is occurring (termination_state 'SD', and 0 bytes_read), and export
that as a metric. You can see the relevant MR at [https://gitlab.com/gitlab-
cookbooks/gitlab-mtail/merge_reque...](https://gitlab.com/gitlab-
cookbooks/gitlab-mtail/merge_requests/47) (and [https://gitlab.com/gitlab-
cookbooks/gitlab-mtail/merge_reque...](https://gitlab.com/gitlab-
cookbooks/gitlab-mtail/merge_requests/48) to fix some bugs)

Then we just hooked it up through the usual sort of Prometheus alerting rules.
We made it really twitchy, because this combination of logs Should Not Happen
if everything is working right, and we want to alert as soon as it starts
occurring, so we can bump the limits again (or re-evaluate in general).
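
For the curious, the mtail side of this is tiny. A purely illustrative sketch
(not the actual stanza from the MRs; the regex has to match your real haproxy
log format):

    # count haproxy log lines with termination_state 'SD' and 0 bytes read
    counter haproxy_ssh_premature_disconnects

    /\sSD\s.*\s0\s/ {
      haproxy_ssh_premature_disconnects++
    }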

------
navinsylvester
Interesting read.

But I bumped into the same problem recently while working on distributed
simulation. I was writing an agent for an application that talks to clients
over SSH, and I started getting the same error when I cranked the machine
count up to 60.

I Googled like every layman and zeroed in on MaxStartups, looked at the manual
to cross-verify, and fixed the issue in 5 minutes. I agree with whatever he
did after he found the issue, but it was a clear case of SSH throwing an
error. Why not just look at the manual or Google it instead of wasting time on
laborious packet inspection?
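
For anyone else who lands here, the knob is a one-liner in sshd_config (the
values shown are the stock defaults):

    # format is start:rate:full -- begin refusing 30% of new unauthenticated
    # connections once 10 are pending, rising linearly to 100% refused at 100
    MaxStartups 10:30:100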

~~~
toast0
> Why not just look up manual or google instead of wasting time on laborious
> packet inspection.

When you've got a load balancing layer, and virtual machines, and interacting
with the real world, there's a reasonable chance that the problem is not a
(relatively) simple application configuration issue. If you've looked through
enough of these, getting a pcap (especially with an identified customer) and
looking through it with Wireshark is relatively quick.

~~~
navinsylvester
I could have worded it better, like "why packet inspection to start with?"

~~~
sethammons
I love packet inspection, but I always feel that if I have to go to that
depth, then we missed an opportunity to have observability (metrics/logging)
earlier.

------
mattip
They should have led with the bottom line: add a `sleep 7` to any cron job
that goes out to the network, and set it to start at the top of the hour plus
3 minutes.
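
In crontab terms, literally (the job path is a placeholder):

    # run at 3 minutes past each hour, with a small fixed sleep on top
    3 * * * * sleep 7; /usr/local/bin/network-job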

------
funkjunky
It's blog posts like this that make me realize I have no idea what I'm doing.
Great work on troubleshooting a complex and interesting problem.

------
adamc
I thought this was a pretty good piece, and points to the problems caused by
running services at regular intervals.

------
darkhorn
I remember that a few years ago the same issue occurred at either GitHub or
Bitbucket, and they made a blog post about it.

------
auslander
> we have 27 SSH backends, and 18 HAProxy

No autoscaling? This load pattern is a prime candidate for it.

~~~
btilly
I disagree.

Autoscaling sounds great, but in practice it notices that there is a problem
after you already have one, and by the time you've scaled up, there is no
problem any more. Which is the worst of all worlds.

For example in this case you'd realize there is a problem at the top of the
hour. And 2 minutes later you'd have a bunch of autoscaled instances up, all
wondering what the fuss was about. So they scale down again, but then at the
top of the next hour the same thing happens again with the same result.

The same thing happens with ad servers. An ad buy goes in for 30 seconds. When
it hits, you have a firehose. That then shuts off.

In fact this problem is so common that I recommend against autoscaling unless
you have specific reason to believe that it will work for you.

~~~
hinkley
And God help you if that starts your autoscaler flapping.

Too much CPU load, start some servers. Too little, kill some servers. Oops now
CPU is too high, bring them back. No, it's too low...

It reminds me of the old UI bug where you get an off-by-one error for
scrollbar logic and so it appears and disappears in a cycle until you resize.

Edit: I think we want something akin to a learning thermostat, but for
servers. Figure out that I get a CPU spike at 5:20 on weekdays and spin some
servers up at 5:10. Then spin 'em down at 8:00, when EST starts to go to bed.

~~~
auslander
> .. Oops now CPU is too high, bring them back. No, it's too low...

You just described closed-loop control system oscillation. The cause is wrong
gain and/or delay parameters. AWS ASGs have some knobs for tuning, like
Cooldown and MaxSize. What you described above is most probably a long boot
time (delay) problem. A service should be ready in less than 30 seconds after
boot; to get there, one should bake AMIs, not install everything afresh on
boot.

~~~
hinkley
You can get an oscillation without long boot or ramp-up times, but they sure
don't improve the situation.

Lazy loading of modules is a common cause, as is any other resource that is
loaded on demand or in the background. But that's more a situation of thinking
you need two additional nodes, getting three, settling back to two after
warmup, and then killing them all off again when the spike ends.

I may have said elsewhere that I'm more comfortable with scaling based on
daily and weekly cycle patterns, with perhaps a few exceptions for special
events (sales, new initiatives, article publications), and with making it easy
for someone to dial capacity up or down.
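
In AWS terms that's roughly a scheduled scaling action rather than reactive
autoscaling; an illustrative sketch (names and numbers made up):

    # pre-warm capacity every weekday ahead of the 17:20 spike
    aws autoscaling put-scheduled-update-group-action \
      --auto-scaling-group-name my-asg \
      --scheduled-action-name prewarm-evening \
      --recurrence "10 17 * * 1-5" \
      --desired-capacity 8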

To use a car analogy, get really good at power steering and power brakes
before you attempt cruise control, get really good at cruise control before
you attempt adaptive cruise control. And don't even talk about self-driving
unless you're FAANG.

