chronid's comments

When coding, you trust the CPU to do correct math. The kernel to properly allocate memory. The network stack to send packets and pass you only the ones with valid checksums.

Much of our world operates on trust (and eventually verify), same as any application does. It's exceedingly costly to do otherwise.


We will never know, but I wonder if it could be a power/signaling or VRM issue - the CPU not getting hot doesn't mean something else on the board hasn't gone out of spec and into catastrophic failure.

Motherboard issues around power/signaling are a pain to diagnose; they emerge as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...


In my last job we ran centralized clusters for all teams. They got X namespaces for their applications, and we made sure they could connect to the databases (handled by another team, though there were discussions of moving them onto dedicated clusters). We had a basic configuration set up for them and offered "internal consultants" to help them onboard. We handled maintenance, upgrades and, if needed, migrations between clusters.

We did not have a cluster just for a single application (with some exceptions, where those applications were incredibly massive in pod count and/or had patterns that required custom handling and pre-emptive autoscaling, which we wrote code for!).
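
A rough sketch of the shape that pre-emptive scaling took (made-up names, schedule and sizes, not our actual code; it assumes the official Python kubernetes client and a cron-style trigger): bump a Deployment's replica count ahead of a known peak instead of waiting for reactive autoscaling to catch up.

    # Hypothetical "pre-emptive autoscaling" sketch: scale up before a known
    # traffic peak. Namespace, deployment name, schedule and sizes are made up.
    from datetime import datetime, timezone

    from kubernetes import client, config

    PEAK_HOURS = range(8, 11)        # assumed morning peak, UTC
    BASELINE, PEAK_REPLICAS = 5, 40  # assumed sizing

    def desired_replicas(now: datetime) -> int:
        return PEAK_REPLICAS if now.hour in PEAK_HOURS else BASELINE

    def main() -> None:
        config.load_incluster_config()  # use load_kube_config() when run off-cluster
        apps = client.AppsV1Api()
        replicas = desired_replicas(datetime.now(timezone.utc))
        # Patch only the scale subresource so the rest of the spec is untouched.
        apps.patch_namespaced_deployment_scale(
            name="checkout",      # hypothetical deployment
            namespace="team-a",   # hypothetical tenant namespace
            body={"spec": {"replicas": replicas}},
        )

    if __name__ == "__main__":
        main()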

Why are so many companies running a cluster for each application? That's madness.


I mean, a bunch of companies that have deployed Kubernetes only have 1 application :)

I migrated one such firm off Kubernetes last year, because for their use case it just wasn't worth it - keeping the cluster upgraded and patched, and their CI/CD pipelines working, was taking as much IT effort as the rest of their development process.


You aren't forced to use a service mesh and complex secrets management schemes. If you add them to the cluster, it's because you value what they offer you. It's the same thing as Kubernetes itself - I'm not sure what people are complaining about; if you don't need what Kubernetes offers, just don't use it.

Go back to good ol' corosync/pacemaker clusters with XML and custom scripts to migrate IPs and set up firewall rules (and if you have someone writing them for you, why don't you have people managing your k8s clusters?).

Or buy something from a cloud provider that "just works" and eventually go down in flames with their Indian call centers doing their best but with limited access to engineering to understand why service X is misbehaving for you and trashing your customers' data. It's trade-offs all the way.


The k/v store offers primitives to make that happen, but for non-critical controllers you don't want to deal with things like that: they can go down and will be restarted (locally by kubelet/containerd) or rescheduled. Whatever resources they monitor will just not be touched until they get restarted.
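
A non-critical controller can be as small as a level-triggered loop like this (a hypothetical sketch using the Python kubernetes client; the label and the "policy" are made up): the desired state lives behind the API server in etcd, not in the process, so a crash just pauses reconciliation until the container comes back.

    import time

    from kubernetes import client, config

    def reconcile(apps, deployment):
        # Hypothetical policy: every labelled Deployment keeps at least 2 replicas.
        if (deployment.spec.replicas or 0) < 2:
            apps.patch_namespaced_deployment_scale(
                name=deployment.metadata.name,
                namespace=deployment.metadata.namespace,
                body={"spec": {"replicas": 2}},
            )

    def main() -> None:
        config.load_incluster_config()  # assumes it runs as a pod in the cluster
        apps = client.AppsV1Api()
        while True:
            # Level-triggered: re-list the watched objects every pass. If this
            # process dies, nothing is lost - kubelet/containerd restarts it and
            # the next pass simply picks up whatever state exists at that point.
            deployments = apps.list_deployment_for_all_namespaces(
                label_selector="example.com/managed=true"  # hypothetical label
            )
            for d in deployments.items:
                reconcile(apps, d)
            time.sleep(30)

    if __name__ == "__main__":
        main()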

Some FAANGs at least (though they may not cover everything) have a "help something is broken but I don't know what to do" team and/or rotation for incident response, staffed on multiple continents to "follow the sun".

But you need to know they exist. :)


I've worked on several such teams (not at FANGy places, but some household names), variously called just the NOC or SOC (early on in my career, the role was also a kind of on-duty Linux admin/computer generalist), Command Center, and Mission Control. It was great fun a lot of the time but the hours got to be tiresome.

I would be very surprised if any enterprise of significant size and IT complexity didn't have an IT incident response team. I'm biased but I think they are a necessity in complex environments where oncall engineers can't possibly even keep track of all their integrators and integrators' integrators, etc. It also helps to have incident commanders who do that job multiple times a week instead of a few times a decade.


I've never worked at a FAANG... but I've been at a Fortune 20 company for the last 9 years. Is there really no system of record for applications?

I can go to a website and type in search terms, URLs and pull up exactly who to contact. Even our generic "help something is broken" group relies on this. There are many names listed so even if the on call person listed is "making dinner", you have their backup, their manager, etc.

I can tag my system as dependent on another and if they have issues I get alerted.


I am simplifying a fair bit, but you are expected to know your direct dependencies (and normally will), pagers have embedded escalation rules with primaries and secondaries, etc. The tooling, once you know what to do, is better in terms of integration and reliability than anything I've seen outside of FAANGs.

Escalation teams are usually reserved for the "oh fuck" situations, like "I don't work on this site but I found it broken" or "hey, I think we are going to lose this availability zone soon" or "I am panicking and have no idea how to manage this incident, please help me".

They're a glue mechanism to prevent silos and paralysis during an event, and they're usually pretty good engineers too.


This has been done forever. Ops teams have had cronjobs to restart misbehaving applications outside business hours since before I started working. In a previous job, the solution for disks filling up on an on-prem VM (no, not databases) was an automatic reimage. I've seen scheduled index rebuilds on Oracle. The list goes on.


> I've seen scheduled index rebuilds on Oracle

If you look into the Oracle DBA handbook, scheduled index rebuilds are somewhat recommended. We do them on weekends on our Oracle instances. Otherwise you will encounter severe performance degradation in tables where data is inserted and deleted at high throughput, leading to fragmented indexes. And since Oracle 12c, with ONLINE REBUILD this is no problem anymore even at peak hours.
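
The weekend job is roughly this shape (an illustrative sketch only: index names, credentials and DSN are made up; it assumes the python-oracledb driver, and your own scheduler decides when it runs):

    # Hypothetical scheduled rebuild of a few hot indexes, run on weekends.
    import oracledb

    HOT_INDEXES = ["ORDERS_PK", "ORDERS_STATUS_IDX"]  # made-up index names

    def rebuild_indexes() -> None:
        # Connection details are placeholders.
        conn = oracledb.connect(user="ops", password="***", dsn="db-host/ORCLPDB1")
        cur = conn.cursor()
        for idx in HOT_INDEXES:
            # ONLINE keeps the index usable by DML while it is being rebuilt.
            cur.execute(f"ALTER INDEX {idx} REBUILD ONLINE")
        conn.close()

    if __name__ == "__main__":
        rebuild_indexes()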


Rebooting Windows IIS instances every night has been a mainstay for most of my career. haha


I’ve got an IIS instance pushing eight years of uptime… auto pool recycling is disabled.


You should look where the economy is growing and where the salaries are growing. It's not uniform at all.

The entire situation (as an EU country citizen who moved to another EU country) and the narratives around it are funny to me because they're the same as the ones going around for years in my birth country.

"Side X should learn they should get better candidates, otherwise people are not going to show up" way of thinking included, which has only led to further decline as the "conservatives" win and make the situation worse taking more and more seats and control in state controlled companies while at the same time pushing their own companies to absorb more and more of the budget. Yeah, not showing up because you did not like the candidate was a great success - if you wanted the decline to accelerate, that is.

Well, good luck US friends, to you and us all.


Even software (at least outside academia) eventually has to fight physics and the thing with the most gravity of all: money.


Doesn't a lot of optimization target making your code and data more cache-friendly, because memory latency (not bandwidth?) absolutely kills performance (among other things like port usage, I guess)?

If something is in L3 it is better for CPU "utilization" than stalling and reaching out to RAM. I guess there are eventually diminishing returns with too much cache, but...
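
A quick way to see it (a toy sketch with NumPy; exact timings vary by machine, array size and cache sizes): the same data and the same amount of arithmetic, but column-wise access strides across cache lines instead of walking memory contiguously.

    import time

    import numpy as np

    N = 4000
    a = np.random.rand(N, N)          # C (row-major) layout by default

    def sum_by_rows(m: np.ndarray) -> float:
        total = 0.0
        for i in range(m.shape[0]):
            total += m[i, :].sum()    # contiguous: prefetcher-friendly
        return total

    def sum_by_cols(m: np.ndarray) -> float:
        total = 0.0
        for j in range(m.shape[1]):
            total += m[:, j].sum()    # stride of N*8 bytes: mostly cache misses
        return total

    for fn in (sum_by_rows, sum_by_cols):
        start = time.perf_counter()
        fn(a)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")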

