Why would you say Twitter is a high-profile place?
I'm lost on this. I've never seen ads on Twitter, and I only read it to check whether someone actually posted whatever tweet a news site is quoting. I have never seen anything being driven by Twitter.
If you pushed something to swap, you didn't have enough RAM to run everything at once, or you have a serious memory leak or the like.
If you can take the latency hit to load what was swapped out back in, and don’t care that it wasn’t ready when you did the batch process, then hey, that’s cool.
What I've had happen way too many times is something like this: the 'colder' data paths on a database server get pushed out under memory pressure, but the pressure doesn't abate before those cold paths get called again (and the kernel rarely pulls those pages back out of swap for no reason). That leads to slowness, which leads to bigger queues of work and more memory pressure, which leads to a doom loop of maxed-out I/O, super high latency, and 'it would have been better off dead'.
These death spirals are particularly problematic because the machine isn't 'dead yet', and may never get so dead that it stops, for instance, accepting TCP connections. That de facto kills services in ways that are harder to detect and repair, and that take far longer to fix, than if the machine had just flat-out died.
This certainly won't happen every time, and if your machine never gets that loaded and always has time to recover before the next burst of work, then hey, maybe it never doom-spirals.
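One way to catch the spiral before it reaches 'better off dead' is to watch the swap-in/out *rate* rather than just swap usage. A minimal sketch using psutil; the thresholds and the alert() hook are made up, tune them for your workload:

```python
# Sketch: sample cumulative swap-in/out counters and alert on sustained
# paging activity instead of waiting for latency graphs to explode.
import time
import psutil  # third-party: pip install psutil

SAMPLE_SECONDS = 10
SUSTAINED_SAMPLES = 6            # ~1 minute of continuous paging
RATE_THRESHOLD = 10 * 1024**2    # 10 MiB/s of swap traffic, arbitrary

def alert(msg: str):
    # hypothetical hook: page someone, or proactively restart the
    # service before it enters the 'alive but useless' state
    print("ALERT:", msg)

def watch_swap():
    prev = psutil.swap_memory()
    hot_samples = 0
    while True:
        time.sleep(SAMPLE_SECONDS)
        cur = psutil.swap_memory()
        # sin/sout are cumulative bytes swapped in/out since boot (Linux)
        rate = ((cur.sin - prev.sin) + (cur.sout - prev.sout)) / SAMPLE_SECONDS
        prev = cur
        hot_samples = hot_samples + 1 if rate > RATE_THRESHOLD else 0
        if hot_samples >= SUSTAINED_SAMPLES:
            alert(f"sustained swap traffic: {rate / 1024**2:.1f} MiB/s")
            hot_samples = 0

if __name__ == "__main__":
    watch_swap()
```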
I do a lot of CI/CD work where the load is just spiky, and it would be a waste of money/resources to shell out for the maximum memory.
Another example would be something like Prometheus: when it crashes and replays the WAL, memory spikes.
Also, it's probably an unsolved problem to tell applications how much memory they are actually allowed to consume. Java alone has the heap, direct buffers, and so on.
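For what it's worth, on Linux under cgroup v2 a process can at least discover its own budget; the unsolved part is getting every allocator inside the process (heap, direct buffers, ...) to respect that one number. A hedged sketch, assuming the unified hierarchy at /sys/fs/cgroup:

```python
# Sketch: discover this process's memory budget under cgroup v2.
# Paths differ on cgroup v1 or non-Linux systems.
from pathlib import Path

def my_memory_limit() -> int | None:
    # /proc/self/cgroup looks like "0::/some/group" under cgroup v2
    rel = Path("/proc/self/cgroup").read_text().strip().split("::")[-1]
    limit_file = Path("/sys/fs/cgroup") / rel.lstrip("/") / "memory.max"
    raw = limit_file.read_text().strip()
    return None if raw == "max" else int(raw)  # None = unlimited

if __name__ == "__main__":
    limit = my_memory_limit()
    print("budget:", "unlimited" if limit is None else f"{limit} bytes")
    # The hard part the comment above points at: nothing forces the JVM's
    # heap, direct buffers, page cache use, etc. to sum to this number.
```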
I have plenty of workloads where I prefer to get a warning alert and act on that, instead of dealing with broken builds and the like.
Because once swap activates, the build takes hours instead of tens of minutes. So it would time out anyway, just after wasting lots of resources. And even if you increase the timeout a lot instead, your machine now has a bunch of things swapped out, so your tests time out, which is even worse.
Yes, killing that part of the build did destroy hours of work. It was still better to disable swap than to try to "ride it out".
Funny that you advocate for manual processes, while I feel we already do too much manually and advocate for the opposite.
The error scenarios are especially bad: they cost you time that no one measures.
Manual processes are also difficult to sync across multiple team members, and you need tooling around them to make sure the manual steps actually happen.
My mantra / priority looks more like this:
1. Try not to do it at all
2. Make it automated
3. Do it manually, with a heartbeat system to catch misses (see the sketch after this list)
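The heartbeat in step 3 can be as dumb as a timestamp file plus a cron'd checker, a classic dead-man's switch. A minimal sketch; the path and threshold are made up:

```python
# Minimal dead-man's switch: the manual (or cron) task touches a
# timestamp file when it completes; this checker runs from cron and
# alerts if the task hasn't run recently.
import sys
import time
from pathlib import Path

STAMP = Path("/var/run/cleanup.last-run")   # hypothetical task stamp
MAX_AGE = 3 * 24 * 3600                     # task should run every ~3 days

def record_run():
    STAMP.write_text(str(time.time()))

def check():
    if not STAMP.exists() or time.time() - float(STAMP.read_text()) > MAX_AGE:
        print("ALERT: cleanup task has not run in time", file=sys.stderr)
        sys.exit(1)   # nonzero exit -> cron mail / wrapper alert

if __name__ == "__main__":
    check() if "--check" in sys.argv else record_run()
```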
I don't want to do things manually; I prefer to be able to go to a beer garden in the summer and stay flexible.
And as a final point: for me, automation is the necessary base for adding additional value at high return. Only with an automation base can you extend it to fix more and more issues automatically. While you fix the full-disk issue a hundred times, I fix it once.
I think you alluded to it wonderfully: a small team of fewer than 10 people will be fine sharing knowledge and doing parts of the pipeline manually. The overhead of creating and maintaining stable automation for edge cases quickly exceeds the time saved.
It's a different story altogether if there are multiple teams that are all supposed to use the same pipeline.
My practical experience is with small teams below 10 people.
As soon as you have a well-understood base system for automation (running code with cron, plus monitoring and alerting), all further automation steps are easy to add to that system.
The initial effort was always worth it.
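As a concrete example of that base system: wrap every cron job so it reports its last success to the monitoring you already have, then one generic alert rule covers every job. A sketch using the Prometheus Pushgateway pattern; the gateway address and job name are placeholders:

```python
# Sketch: a cron wrapper that pushes a last-success timestamp to a
# Prometheus Pushgateway, so one alert rule like
#   time() - job_last_success_seconds > threshold
# covers every wrapped job.
import subprocess
import sys
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_and_report(job_name: str, command: list[str]):
    subprocess.run(command, check=True)  # raises if the job fails
    registry = CollectorRegistry()
    g = Gauge("job_last_success_seconds",
              "Unix time the job last finished OK", registry=registry)
    g.set_to_current_time()
    push_to_gateway("pushgateway.internal:9091", job=job_name,
                    registry=registry)

if __name__ == "__main__":
    run_and_report(sys.argv[1], sys.argv[2:])
```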
And the big issue is that quality is elastic; it degrades silently.
If you need to do something every few days, forget it once, and only find out when someone tells you, did you hurt anyone?
Probably not, but your quality suffered.
We even had a process that was broken for three weeks, and a customer noticed the issue, not us.
Automation was missing, and so were monitoring and alerting.
One solution for a manual process was a Jira plugin that would create a ticket every Monday describing what to do. Half automated; purely manual would again lead to quality issues.
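You don't even need a plugin for that pattern; a cron'd script against Jira's REST API does the same. A sketch where the URL, project key, and credentials are placeholders:

```python
# Sketch: create the weekly "do the manual thing" ticket via Jira's
# REST API instead of a plugin; run it from cron every Monday.
import os
import requests  # third-party: pip install requests

def create_weekly_ticket():
    resp = requests.post(
        "https://jira.example.com/rest/api/2/issue",
        auth=("automation-bot", os.environ["JIRA_TOKEN"]),
        json={"fields": {
            "project": {"key": "OPS"},
            "summary": "Weekly manual maintenance",
            "description": "Checklist: ... (what to do, step by step)",
            "issuetype": {"name": "Task"},
        }},
        timeout=30,
    )
    resp.raise_for_status()  # make cron notice failures

if __name__ == "__main__":
    create_weekly_ticket()
```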