Nothing is as infuriating as that. It could be a valuable lesson for many people, just like this!
I feel like everyone involved would have to be more eager to solve the problem than to worry about assigning blame. Does that work because it's all within the same company, or does GitLab make an extra effort to disincentivize blame shifting and other unproductive behavior?
In my life (all small scale, local vendors) I call people, and the only answer I ever get is roughly "our stuff is perfect. It's you."
Of course for this specific issue there really wasn't anything or anyone to blame: no-one really did anything wrong at any point. Systems simply get complicated at certain scales and circumstances, and we have to engineer our way around the issues.
A few recent ones might pique your curiosity
Put it in the commit message. There is no better place to put it. That is the exact point in time at which you have the most information about that exact change. Commit messages are cheap, be liberal with your words and links.
It drives me nuts to read commit messages of the form "fix", "fixed 2", "add test", "new var" and on and on. You took an hour to make the change, take a damn minute to save some future colleague a day of frustration.
Comments are just as cheap as commit messages. Why not both?
Whereas the commit is immutable and, ideally, exactly about what it's about. There is no better time to capture the information in a way that perfectly connects it to what it's describing.
That something needs to be entirely perfect before being judged better than the status quo is the Nirvana fallacy.
Two rights, on the other hand, don’t make a wrong.
I feel like I'm missing something here. I hate this feeling of impending dumb.
I also typically find most files have only a handful of commits of interest in their history, so scoping a git log to the file often gets me within spitting distance of the right commit, if someone took the time to write a good one.
Edit: which goes back to my original point, that the commit is the best time and place to snapshot the meaning of that time and that place.
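To make that concrete, here's a sketch of scoping history to a file (the repo, filename, and commit message below are made up; the throwaway setup just gives the log commands something to chew on):

```shell
# Throwaway demo repo (hypothetical file and message):
cd "$(mktemp -d)"
git init -q
echo "retry = 3" > app.conf
git add app.conf
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "Bump retry count: vendor API rate-limits our bursts"

# History of just this one file:
git log --oneline -- app.conf

# Or even of specific lines within it:
git log -L 1,1:app.conf
```

If the messages are good, the file-scoped log is usually only a handful of commits, and `-L` narrows it further to the exact lines you're staring at.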
For them, a comment in the code could save a lot of time and hassle.
Though I hope people at gitlab use git.
Commit messages are good, but the dogmatic anti-comment philosophy that Pivotal follows is overdone. Lots (I mean LOTS) of comments are redundant or could be fixed with better naming or organization. But comments shine when dealing with the "why" behind something. I would rather see a comment that says why a value is set the way it is than have to be a git historian.
Yes, when there's no better alternative, or when it's idiomatic (Go linters) or situationally appropriate (Spring libraries are immaculately commented).
But I emphasise "no better alternative" because I have seen comment rot and it sucks badly. The English language is pretty vast and can be usefully ransacked for names. Keyboards are cheap and autocompletion plentiful, so go hog wild with variable names. There are idioms, patterns, well-understood conventions and so on that allow you to avoid noise.
When you need a why, then absolutely give the why. But usually the need for an explanation is that the code isn't clear on its face and you should exhaust that possibility first.
This concludes today's reading from the Book of Annoyingly Doctrinaire Ex-Labs Pivots. Go in peace.
If you screw up and forget to add the documentation to the commit message, there is nothing you or I can do about it. And if the commit log is not uniformly useful in this way, then people start to discount the value of the commit messages.
And then I have to explain to the people who keep cutting and pasting code around improperly that they are ruining our commit history. I often get a shrug.
The first part we could do something about, and I hope we do in the next version control system (there's always a next one). All sorts of metadata should be capable of being attached to a commit. Some of it is stuff we have been relegating to third-party tools, which means you can't reason about it as a whole.
Did this commit get a green build? Did it have a code review? What PR was it in? Did we run any static analysis? Did we learn after the fact that this code doesn't solve the problem or is a bad solution?
If all of this is at my fingertips, I've got a better chance of getting a bunch of people (myself included) to have a broader understanding of the code, or at least an easier time getting familiar with it when I need to.
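Git can already approximate some of this with notes, which attach mutable metadata to a commit after the fact without rewriting it. A sketch (the repo setup and note contents are made up):

```shell
# Throwaway demo repo so the notes commands have a commit to hang off:
cd "$(mktemp -d)"
git init -q
echo hello > file.txt
git add file.txt
git -c user.email=demo@example.com -c user.name=demo \
    commit -q -m "Add file"

# Record things learned later (build status, review verdict, "this
# turned out not to fix the bug") without rewriting the commit:
git -c user.email=demo@example.com -c user.name=demo \
    notes add -m "build: green; review: approved; verdict: did not fix the bug"

# The note rides along when you view the commit:
git log -1 --notes
```

The catch, and part of why notes never caught on as the metadata store, is that they aren't transferred by default: you have to push and fetch `refs/notes/*` explicitly.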
One is commit discipline. People care more or they care less, no tool can truly change that.
The other is how we collect information. I'm not convinced that version control systems are the right place for that, but I know others have explored it (e.g., Fossil SCM). GitHub provides a de facto data store for a lot of such annotations through its APIs; I imagine GitLab does something similar.
People always screw up their commit messages. And once they do, they stop trying, because if you can't be perfect, trying just feels like torture. If there were a way to fix them after the fact, you could start demanding more commit discipline and make people go back and fix messages they didn't get right the first time.
I take it as a matter of Broken Window syndrome. You can try to keep them from getting broken but at some point you have to have a way to fix the ones that break.
What I was driving at was that better tools are one part of the triad of people, tools and ways that people work together with tools. Commit discipline will always be desirable, because no tool can recover a snapshot of my mind at the time I made the commit.
The existence of a feature is not the same as a solution to the problem. All of these little tools that exist in the git toolspace are not automatic. Every developer has to add it to their already pretty long set of git rituals.
That's bullshit. We can do better than this.
Point me to a git feature that works when I do 'git push' or 'git pull' otherwise I'm not interested.
I'm also not interested in hearing people imply that people aren't smart enough if the tool doesn't suffice, and there's a very, very large concentration of that sort of asshole in the Git community. I don't make my tool decisions based on what I am smart enough or not smart enough to get. I make them based on what I'm willing to teach, my anticipation of how often I will have to be preempted to support others who don't get it, and what I'm willing to juggle when a production issue has the entire management team hand-wringing on a day when I'm already having a bad day.
Give me stupid-simple and dead reliable or it's not helpful.
Agreed. But I suppose part of the reason this feature is present is because so many ssh servers are the targets of attacks, often brute force. So logging these might be problematic.
Speaking of "Rate limiting" -- openssh logging something like "dropped N connections exceeding MaxStartups" once per minute (if N > 0) [at WARN level] seems like a sane compromise.
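For reference, the knob under discussion uses sshd's three-part `start:rate:full` syntax; the values below are the documented defaults from sshd_config(5):

```
# sshd_config: MaxStartups start:rate:full
# Once 10 unauthenticated connections are pending, refuse 30% of new
# ones (randomly), scaling linearly up to refusing 100% at 100 pending.
MaxStartups 10:30:100
```

Raising the third number is effectively what the article describes; the middle number is the probabilistic early-drop rate that makes the behavior hard to spot in the first place.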
How to analyze an issue (this post is a great example of systems thinking, IMO), what happens at scale, and why you shouldn't blindly Google your way out of problems are valuable skills any engineer should have, and they are unfortunately not taught in many schools.
This way you avoid overloading the backends. Also, haproxy provides tons of real-time metrics, including the high-water mark for concurrent and queued connections.
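A minimal sketch of that pattern (backend name, addresses, and limits are all made up): cap concurrent connections per backend server, and let the excess wait in haproxy's queue instead of piling onto the backend.

```
# haproxy.cfg sketch (hypothetical names and values)
backend git_servers
    balance roundrobin
    server git1 10.0.0.11:22 check maxconn 50
    server git2 10.0.0.12:22 check maxconn 50
```

With per-server `maxconn`, haproxy queues connection 51 rather than forwarding it, and the stats socket exposes the current and max queue depths so you can see the pressure building.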
> Mandate that batch client jobs use a separate set of batch proxy backend tasks that do nothing but forward requests to the underlying backends and hand their responses back to the clients in a controlled way. Therefore, instead of "batch client → backend," you have "batch client → batch proxy → backend." In this case, when the very large job starts, only the batch proxy job suffers, shielding the actual backends (and higher-priority clients). Effectively, the batch proxy acts like a fuse. Another advantage of using the proxy is that it typically reduces the number of connections against the backend, which can improve the load balancing against the backend.
Other chapters, on Load Balancing and Addressing Cascading Failures, are related too.
In one instance, mdadm RAID checks caused P99 latency spikes on the first Sunday of every month (the default schedule). It caused a lot of pain for our customers until the check was IO-throttled, which meant the spikes weren't as high but lasted longer.
Scheduled tasks are a great way to brown-out yourself.
In another case, the client process hadn't set a socket timeout on a blocking TCP connection (the default), so it would routinely run out of worker threads blocked on recv when the server (fronted by a reverse proxy) started rejecting incoming connections due to overload. Only a restart of the process would recover the client.
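That story is about a generic TCP client, but since this thread is about SSH: the analogous client-side knobs in ssh_config look roughly like this (values are illustrative, not recommendations):

```
# ~/.ssh/config sketch: don't block forever on a dead or silent peer
Host *
    ConnectTimeout 10        # give up on an unresponsive connect
    ServerAliveInterval 15   # probe an established but silent connection
    ServerAliveCountMax 3    # declare it dead after 3 missed probes
```

The general lesson is the same either way: any blocking network call needs some bound on how long it will wait.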
Scheduled tasks are a great way to prove HAProxy will scale way better than your backend. Thanks u/wtarreau
Speaking of HAProxy: it fronted a thread-based server processing all sorts of heavy and light queries, with segregated thread pools for various workloads. During peak, the proxy would accept connections and queue work faster than the worker threads could handle, and on occasion the work queue would grow so big that it contained not only retries of the same work, scheduled by desperate clients, but a laundry list of work that wasn't valid anymore (processed by some other backend in the retry path, or simply expired because time-to-service exceeded the SLA). Yet there the backend was, in a quagmire, chugging through never-ending work, in constant overload, when ironically the client wasn't even waiting on the other end and had long since closed the connection. The health checks were passing because, well, those ran on a separate thread pool with a different work queue. Smiles all around.
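For what it's worth, haproxy has knobs aimed at exactly this doomed-work problem; a sketch (backend name is made up, values illustrative):

```
# haproxy.cfg sketch: shed doomed work before the backend sees it
backend workers
    timeout queue 5s        # fail fast (503) instead of queueing past the SLA
    option abortonclose     # drop queued requests whose client already hung up
```

`timeout queue` bounds how stale queued work can get, and `abortonclose` discards requests still in the queue when the client closes its side, which is precisely the "nobody is waiting for this anymore" case described above.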
Least conns and the event horizon. Tread carefully.
Least conns load balancing bit us hard on multiple occasions, and is now banned for similar reasons outlined here: https://rachelbythebay.com/w/2018/04/21/lb/
I've been trying to convince my division to prioritize adding events to our stats dashboard.
Comparing response times to CPU times is just the expected level of effort for interpreting the graphs. But you don't have any visibility into how a cron job, service restart, rollback, or reboot of a host caused these knock-on effects. And without that data you get little adrenaline jolts at regular intervals when someone reports a false positive. Especially in pre-prod, where corners get cut on hardware budgets, and deploying a low-churn service may mean it's down for the duration.
This might end up being a thing I have to do myself, since everyone else just nods and says that'd be nice.
I did wonder, though: the authors seemed a bit unsure about the relation between MaxStartups and resource usage. Wouldn't it be wise to just send an email to the OpenSSH mailing list or something, so as not to be surprised by possible future problems?
Don't get me wrong - I'm a huge fan of community support, but sometimes it's faster to just try things and see how the system reacts.
Not to say you can't do both in parallel; and they might have - you aren't going to detail everything you did in a writeup like this.
As others said, I really enjoy reading write-ups like these. Who needs fiction when you have real world mysteries to solve?
If they are not afraid to dive into the source, it should be fairly easy to add some logging when the value reaches, say, 75%. Of course, one needs to be careful not to break anything else (issuing thousands of warning logs while the system is at peak load is probably not a good idea), but it shouldn't be that difficult. Also, I assume OpenSSH would be happy to get a PR (or MR, coming from GitLab? ;) ) for this.
Nice writeup. What I missed was how they added monitoring for these kinds of errors, so it doesn't escalate to 1.5% of all connections again. But other than that, good job!
Then we just hooked it up through the usual sort of Prometheus alerting rules. We made it really twitchy, because this combination of logs Should Not Happen if everything is working right, and we want to alert as soon as it starts occurring so we can bump the limits again (or re-evaluate in general).
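A rule along those lines might look like the following sketch (the metric name, group name, and labels here are made up, since the actual rule isn't shown in the post):

```yaml
# Hypothetical Prometheus alerting rule for MaxStartups drops
groups:
  - name: ssh
    rules:
      - alert: SshMaxStartupsDropsObserved
        expr: rate(sshd_maxstartups_drops_total[5m]) > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "sshd is dropping connections past MaxStartups"
```

The `> 0` threshold is what makes it "twitchy": any nonzero rate of a should-never-happen event fires the alert.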
But I bumped into the same problem recently while working on distributed simulation. I was writing an agent for an application which talks to clients over SSH. I started getting the same error when I cranked the machine count up to 60.
I googled it like any layman and zeroed in on MaxStartups. I looked at the manual to cross-verify and fixed the issue in 5 minutes. I agree with everything he did after he found the issue, but it was a clear case of SSH throwing an error. Why not just look it up in the manual or Google it instead of spending time on laborious packet inspection?
But wouldn't investigating the error have provided data points to help debug the issue? They were trying to fix a customer issue, and time was also of importance here.
> just applying fixes blindly plucked from the Internet.
I wasn't recommending that.
When you've got a load balancing layer, and virtual machines, and interactions with the real world, there's a reasonable chance that the problem is not a (relatively) simple application configuration issue. If you've looked through enough of these, getting a pcap (especially with an identified customer) and looking through it with Wireshark is relatively quick.
No autoscaling? This load pattern is a prime candidate for it.
Autoscaling sounds great, but in practice it notices that there is a problem after you already have one, and by the time you've scaled up, there is no problem anymore. Which is the worst of all worlds.
For example in this case you'd realize there is a problem at the top of the hour. And 2 minutes later you'd have a bunch of autoscaled instances up, all wondering what the fuss was about. So they scale down again, but then at the top of the next hour the same thing happens again with the same result.
The same thing happens with ad servers. An ad buy goes in for 30 seconds. When it hits, you have a firehose. That then shuts off.
In fact this problem is so common that I recommend against autoscaling unless you have specific reason to believe that it will work for you.
Too much CPU load, start some servers. Too little, kill some servers. Oops now CPU is too high, bring them back. No, it's too low...
It reminds me of the old UI bug where you get an off-by-one error for scrollbar logic and so it appears and disappears in a cycle until you resize.
Edit: I think we want something akin to a learning thermostat, but for servers. Figure out that I get a CPU spike at 5:20 on week days and spin some servers up at 5:10. Then spin em down at 8:00 when EST starts to go to bed.
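The crudest version of that "learning thermostat" is just a schedule. A sketch, assuming an evening spike and a hypothetical `scale.sh` wrapper around whatever scaling API you use:

```
# crontab sketch: pre-warm before the known weekday 5:20 PM spike,
# shrink again when EST winds down (times and capacities made up)
10 17 * * 1-5  /opt/ops/scale.sh --desired 8
0  20 * * 1-5  /opt/ops/scale.sh --desired 3
```

It's not adaptive, but it sidesteps the whole noticed-the-problem-too-late loop for load patterns you already know about.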
You just described closed-loop control system oscillation. The cause is wrong gain and/or delay parameters. AWS ASGs have some knobs for tuning, like Cooldown and MaxSize. What you described above is most probably a long boot time (delay) problem. A service should be ready in less than 30 seconds after boot. To get there, bake AMIs instead of installing everything afresh on boot.
Lazy loading of modules is a common cause, but so is any other resource that's loaded on demand or in the background.
But that's more of a situation of thinking you need two additional nodes, getting three, settling back to two after warmup, and then killing them all off again when the spike ends.
I may have said elsewhere that I'm more comfortable with scaling load based on daily and weekly cycle patterns, with perhaps a few exceptions for special events (sales, new initiatives, article publications) and making it easy for someone to dial up or down.
To use a car analogy, get really good at power steering and power brakes before you attempt cruise control, get really good at cruise control before you attempt adaptive cruise control. And don't even talk about self-driving unless you're FAANG.
So what's the problem here? Assuming there are no misconfigurations anymore, just spiking loads.
In my opinion for those with a spiking load profile, autoscaling should be considered harmful.
1) Check out the graph in the 'An illuminating graph' section. The connection rate spikes by a factor of 3 in the space of 5 seconds (and that's actually a consolidated average over a number of hours; the worst spikes at the top of the hour are even bigger). We'd need super-responsive, practically magic autoscaling, or proactive autoscaling at known intervals (every hour, 10 minutes, 15 minutes, etc.). But given that the actual CPU usage doesn't really vary much over those timescales (it all evens out once the connections are made and git takes over), autoscaling just to add more connection slots in SSH would be a poor use of resources when we can just increase MaxStartups as far as necessary.
2) We do want to autoscale, and will in the medium-term future, because the access patterns at a weekly scale are quite variable yet predictable, and we can save a lot of resources by scaling down at the weekends (or even the EMEA evenings) when the bulk of our users go home and stop creating MRs, leaving only bots, CI, and cron jobs (I jest, but it feels like that some days). But not because of the issue described in the blog post.
We don't even have any indication that the load on the system is particularly affected; there's just an arbitrary connection cap. Average load in the CPU graph looks pretty flat, around 50%, so it's probably already over-provisioned, depending on their disaster recovery plans and normal load patterns (didn't see a daily/weekly graph to armchair that).