"Everyone screws up, you will too, just be honest about it and tell me and it will be good."
It was a job working with this wonky memory-mapped hardware tied to mainframes. There was no "undo": as soon as you wrote to memory, there it was. It was inevitable that you would typo something sometime in an important system.
Finally 3 years later a buddy is talking to me over the cube wall "Hey was that #3 you were working on yesterday?" I of course am typing away while talking and say out loud "3? Um..."
So I type something like -RESET SYSTEM 3-
I meant to type 9, a non-critical system. System 3 was tied to a data replication system that absolutely had to keep running, otherwise all transactions would stop (well, for a bit, until backups took over).
So if you couldn't use an ATM for a huge bank for a little while decades ago (fortunately it was the middle of the night), that was me ;)
I went to my boss the next morning and told him what I did, and he says "This is like your first in 3 years, that's a record or something, it's usually a few within like 6 months. Nice job!"
It was a great place to work, no finger pointing, if you screwed up no big deal, everyone stuck around working with that team for decades.
When there were conference calls it was rarely stated (if ever) who actually did the thing. It was just accepted that it happened and we could discuss how to prevent it and such. "The engineer" or "the support team" and such were common phrases.
Inevitably folks would ask "who was it" and the answer usually was something like "it doesn't matter".
I still screw up, of course, but this attitude has made me feel less paranoid about performance and less like an idiot when I do mess up. The amount of trust a team needs to have to make this work is really the key, IMO.
Blame is such a complicated thing and the results of any mess are always way complicated even without the human factor.
I'm not convinced in any complex system outside of a guy running through a datacenter with a hammer... that anyone can actually "properly" assign blame or fault.
Humans are going to dork up, but there's no point in coming down on someone for all the complexity that leads to a problem. Helping everyone avoid it, on the other hand, has a lot of benefit.
It's the breaking of things and seeing it go wrong in the real world that often boosts one above the level of academic coding and model building, because "holy shit, all that engineering and principles stuff actually matters, and the real world is hard and there are consequences".
But I expect everyone to do it also, because without curiosity and an urge to push the boundaries, you'll be a very mediocre data scientist :p
It wasn't where I wanted to be, but it was also my first real job for a guy who dropped out of college (I was a poor student / shouldn't have gone when I was that age)... so I didn't mind at all.
But it was weird, lots of old east coast customers with old ways. One of our supersmart engineers was a woman. She would be on a call and would tell them their mainframe was configured wrong and straight up some guys would ask for her manager every time.... every time.
So she would come get me, a lowly n00b with a male voice, and I'd pretend to be her manager looking things over and tell them the same thing, word for word, and they'd believe me. Weird world.
One of the wonderful things about them is that they never change. You can sit down and, in a period of weeks or months, read all the black binders containing all of the (usually excellent) documentation, and then you'd know everything about it, forever. Close enough to everything, anyway.
Some people might find this horribly boring, but it meant I could easily partition my time between "the familiar thing that always works the way I expect", and "the experimental stuff on the side". Today, those lines are super blurred; I rarely get the opportunity to become a proper expert on anything at all in the web world, so 100% of my time is "omg something's broken and quick, find the quirky part in the experimental thing that's running in production."
And, speaking of documentation, we just don't have anything like mainframe documentation in modern software. O'Reilly books often come sort of close, or used to, but Unisys for instance had something like a 30-volume set of binders, about 500 or so pages each, containing extremely deep, carefully edited documentation on every single system call. Imagine if your favorite web framework had an entire wikipedia, with a big team dedicated just to testing and reviewing every page for accuracy.
And finally, with modern queues, databases, file systems, and -- much as I'm loath to admit it -- systemd, we're only just now catching up to the disaster recovery procedures that were standard for mainframes in 1995. You could literally walk into the data center and yank the power cord for the mainframe while it was in the middle of running payroll, plug it back in, and it would pick up where it left off without skipping or double-counting a single record.
They had gigantic limitations too. TCO was astronomical by modern standards, so the hardware never got upgraded, so you'd never be able to do any of today's big data stuff on a mainframe. Software development, such as it was, was done in COBOL or JCL or WFL or, maybe, Algol or FORTRAN, and git isn't a thing and a lot of that software has decades of history behind every single line of code.
But it wasn't all bad.
People don't always get things right: they screw up, they do stupid things for good reasons, and sometimes good things for stupid reasons. As a manager I always want folks to be observant and thoughtful, and I try to keep such things not about "who" screwed up but about how that screw-up came to be (the good or stupid reasons), and how one might have thought about the action ahead of time in a way that would have flagged the potential problem.
And the key of all that is making the discussion about how to think so you don't have the problem in the first place, rather than making it a blame-fest on some hapless engineer who chose poorly.
I was fortunate to have a manager early in my career who was very proactive at solving problems and moving forward, not affixing blame. He would say "Ignorance is the natural state before learning, only if it persists in the presence of learning opportunities does it become a problem."
I've always tried to learn by what I observe and what I do, which is why I enjoy Rachel's stories of finding root causes. They teach the principles that needed to be understood prior to the action. All without experiencing the feeling of dread that you've just taken production off line :-).
That being that some commenters feel bad that the author doesn't seem to show her own flaws in the stories.
"What happened the last time production went down?" should produce a quite illuminating answer. Do they go through a detailed root-cause analysis? Do they answer with marketing-speak meant for a legally minimal disclosure? Do they blame "that moron" whom I'm meant to replace?
As it stands here, the official corporate policy is everything happens perfectly until someone who shouldn't be there messes things up, and the problem is best solved with a public and angry firing letter. Quoth our business partner: "We're allowed to change our minds, but you're not allowed to be in error. Even if we give you bad data, you're expected to infer proper data and give us proper output. BTW we're not paying for testing"
Everyone makes mistakes! It's how we learn :-)
Fun fact: none of these VMs had rebooted in that time, or they wouldn't have crashed.
Anyway, back in 2014 or so I dropped a bunch of transmit packet completions. In most cases I also double completed packets which was immediately fatal. Kernels get mad about that sort of thing.
Turns out, not all of the affected VMs died. Some of them lived on with head indices forever unequal to tail indices (until they rebooted).
In 2018 a developer realized there was a potential bug in waiting for VMs entering a quiescent state -- a truly idle networking stack had retired all Tx packets that it had admitted. Having unequal indices was impossible under correct operating conditions. They fixed the glitch.
This change rolled out gradually.
Gradually, the kernel panics appeared.
The change rolled back, halting the impact, but then the analysis began. What had we broken?
Another fun fact: Linux often includes an uptime in dmesg logs.
Slowly a pattern appeared. The dmesg logs included unusually large numbers for uptimes. Plotting these, there was a clear cliff in terms of a minimum uptime. Historical deployment logs showed a noteworthy release at that date, years past. Noteworthy in that it was rolled back for my bug, years prior.
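Spotting that cliff is mostly mechanical once you notice that each dmesg line starts with a `[seconds-since-boot]` timestamp. A sketch of pulling the uptimes out of a pile of logs (the log contents and filenames here are invented for illustration):

```shell
# Sketch: extract the final [seconds-since-boot] timestamp from each
# dmesg log to estimate uptime at crash time. Sample data is made up.
mkdir -p /tmp/dmesg-demo
cat > /tmp/dmesg-demo/host1.log <<'EOF'
[    0.000000] Linux version 4.x (example)
[123456789.123456] BUG: unable to handle kernel NULL pointer dereference
EOF
cat > /tmp/dmesg-demo/host2.log <<'EOF'
[    0.000000] Linux version 4.x (example)
[  98765.500000] some ordinary message
EOF
for f in /tmp/dmesg-demo/*.log; do
  # take the last line, strip the brackets, keep whole seconds
  secs=$(tail -n 1 "$f" | sed 's/^\[ *\([0-9]*\)\..*/\1/')
  days=$((secs / 86400))
  echo "$f: up $days days at time of last message"
done
```

Plot the resulting days-of-uptime per host and a cliff at some minimum value stands out immediately.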
On the plus side, I realized this was almost certainly my fuckup from years prior slightly sooner than anyone else did, so at least I got to call myself out :)
An overnight mission at McMurdo or a server reset without internet access on another device?
I'll take the overnight.
Depending on the time of year, perhaps. What's the longest possible time from sunset to sunrise at those latitudes?
chmod -R is powerful. :-)
> cd /
> mtree -U -f /etc/mtree/BSD.root.dist
> mtree -U -f /etc/mtree/BSD.var.dist
> mtree -U -f /etc/mtree/BSD.include.dist
> mtree -U -f /etc/mtree/BSD.sendmail.dist
> mtree -U -f /etc/mtree/BSD.usr.dist
The thing is that often my screw-up as a junior was worse (short version: broke a key part of the 'boot' system, was detected friday evening, and we had a major scheduled release on monday morning) and it just puts people at ease. I'll tell it in a humorous way as well. It's important (I think) that they don't feel bad about it.
I've often had colleagues join in on the conversation as well. We're human, we'll make mistakes, no need to stress out over it.
EDIT: added a bit more explanation of _why_ I do so.
Two things broke in a visible way during all of this. During testing everything was wired up to my personal account. I managed to spam all my followers with thousands of happy new year tweets in a couple seconds since I wasn’t subject to the rate limiter. I deleted all but one of those, which I left to remind myself that with great power comes great stories of things going wrong.
The other thing was a bit more dramatic, albeit short-lived. The first big test had everyone ready to go. I hit enter on the job, and at the time (maybe still) I had no way to get metrics out of production at a granularity less than one minute. A very worried minute goes by, and then we realize I’ve DDOSed the authentication service. All my fake accounts needed to auth to actually tweet, and naturally they did that first. Since the whole point of the test was for load to all hit in roughly the same second, the auth load also all arrived in the same second. Oops.
We decided that was an unfair test, I spent a few hours getting auth tokens for my fake users, and we tried again. That time everything worked, and we also survived New Year’s... But it was fun getting there.
I ask junior folks what their biggest (technical) screw up has been, in interviews. I think it's a bad sign if they won't admit to it or claim they've never screwed up big-time.
Screw ups lead to uncertainty and research suggests we learn best in uncertainty - https://www.aau.edu/research-scholarship/featured-research-t...
set -g window-style 'fg=red,bg=black'
One other "defensive scripting" trick I frequently use is starting any `rm` command with `ls`, double checking its output (or triple checking if it's a recursive one), and then replacing `ls` with `rm`. It barely takes any extra time if you're proficient with emacs-style readline hotkeys:
C-a M-d rm C-m
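Another way to get the same safety without retyping is a tiny wrapper function; this is just a sketch (the name `saferm` and the `--force` convention are my own invention) that shows what would be deleted and makes destruction opt-in:

```shell
# Sketch of a "look before you leap" rm wrapper. The name saferm and
# the --force flag are invented for this example.
saferm() {
  if [ "$1" = "--force" ]; then
    shift
    rm -r -- "$@"
  else
    echo "Would remove the following; re-run with --force to do it:"
    ls -ld -- "$@"
  fi
}
```

So `saferm /tmp/scratch` only shows the targets, and `saferm --force /tmp/scratch` actually removes them.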
I do the tmux color trick too-- color coded by environment for each bastion.
And I've seen a solution for more secure environments where physical separation was used with the operator having separate monitors/keyboards, and the "important" system having a different color keyboard and monitor frame.
I think the first one in my career was discovering that "killall" does something very different on Solaris from Linux.
You live and you learn, I'd say :)
We'd hook respondents up to a webcam and record their facial expression as they would watch a series of videos. The vendor's emotion recognition machine learning software would then basically assign scores saying that at this second, the viewer expressed xyz emotion.
The project failed for 2 reasons - one was that the theoretical link between what expressions people were presenting, and their actual emotions to a particular piece of media was not fully proven - which meant the model output was not particularly helpful from the beginning.
Secondly, and this is really important: the model was trained on images of Western faces (i.e. white people), and because our target audience, Southeast Asians, emote very differently, a substantial chunk of the output data needed to be trashed (it couldn't process darker faces well, it interpreted a grimace as a smile, etc.).
So there you have it - this was something I should at least have anticipated - and I got in a lot of trouble.
Search SRE had a 5lb bag of shredded money as a "gag gift" that was given to whomever caused the most recent outage that impacted search ads.
I also took CNBC off air briefly, although that was their man's fault as he told me to unplug the wrong video server.
We had our field engineer in doing a PM and he needed a scratch disk, and I said "oh, you can use xxxx" and pointed at the sticky label which had all the disk IDs on it.
Turns out that someone had been using that disk for a big GIS project in Amman, and we ended up wiping six months' work.
That's what differentiates a good engineer from a not-so-good one - they learn from their own mistakes!
One time either I or my boss (I can't recall who) stepped backwards and hit the off button with his head - we had our electrician fit a molly guard after that.
One day we hired an engineer who was a Sikh. Turns out the PDUs were almost exactly at turban (dastar) height. Cue the outage alerts (and the installation of mollyguards).
After two or three hours of exploring, I noticed something weird: it was a sunny LA afternoon, but I felt something like a drop of liquid hit the back of my leg. I kept walking, but felt another drop, so I stopped and checked. Yep, definitely real and definitely liquid. Also, it smelled like vinegar. Where was it coming from? Who would do such a thing, and how?
Perplexed, I walked on, until my bag started emitting a drawn-out Mac startup tone, and I realized just what I'd done. I opened it up, and sure enough: the seal on one of my kombucha bottles had failed, and its entire contents had emptied into my new work laptop.
Tell me about a time you made a mistake that you thought was going to get you fired.
1. Everyone has one. If you don't, you haven't been doing this long enough and I want you to make a couple of those mistakes elsewhere first.
2. If you didn't learn anything from it, you're going to make that and bigger mistakes in your hubris. I'd rather you do that elsewhere.
Tech for 12 years and I've never made a mistake disastrous enough to be fearful for my job. The worst costing ~$20k in hardware (couple server CPUs). Told my manager right after without hesitation (this was also at a startup).
I would not stress that much anyway now if it were to occur. Having been through mass layoffs from startups twice before, you change and become hardier. I will be careful but will never be fearful of employment. Short of doing a Desk Pop, I'm falling asleep every night with both eyes closed. Life is too short as is. Let me go and I will spend my next morning on a nearby beach with a good book.
It's not that they don't fire for mistakes, it's that they work with you to correct your behavior first and they give plenty of warning to someone who is in danger of that.
And I've never been given that warning.
edit: Added the word "technical". I've worked for companies that would fire at the drop of a hat, but they were all retail minimum-wage jobs.
It's also a refreshing reminder that "just because someone is successful and has a great resume doesn't mean they're flawless".
Resumes and LinkedIn profiles are like Instagram posts: enhanced to bring out the best aspects, and with enough photoshop/makeup to hide the worst.
Of course an outage is never caused by one mistake. That mistake was mine, so I felt bad about it. There were also mistakes in code reviews, validation in the part receiving the config, and operational procedures. And then the big one: the company as a whole was in this awkward phase where everyone knew quick global pushes were bad, but there wasn't good common tooling to support doing staged config pushes easily. That was the worst mistake behind dozens if not hundreds of major outages.
I needed to reboot at one point and when I did, it started giving me "boot disk not found". I couldn't get it to boot, at all. It seemed the boot disk was corrupted.
I was literally in a cold sweat for 2 hours, late into the night, until I finally noticed that I had left a diskette in the drive, which was causing the BIOS to try to boot from there first.
I have had plenty of other cases where I actually messed something up. But that feeling you get when you think you have irreparably broken something is so terrible.
Was testing code and pushed a file to FTP 2 days early... vendor picked up, processed file.. the people who signed up in the next 2 days were in the file pushed later... but the vendor already processed the earlier file so they didn't get their metro cards that month
Somehow managed to rebalance underlying components for a Trendpilot ETF monthly instead of quarterly... daily audit that compares the values on NYSE vs in our DB caught it.. lucky for me there was no money in it yet
dropped a table once at lunch time right before taking a bite of my sandwich... did restore table within 10 minutes , didn't eat lunch that day ... lost appetite...
In ETL tool hardcoded something to test.... left it there when running for real
So I format the "hard disk", and for some reason my 3.5" floppy wasn't formatted. So I tried again and again, to no avail, and gave up.
The production manager came in to work Monday morning to a fresh hard drive. Some things were backed up and some things had to be recreated.
The outcome of this necessitated learning a new skill: bypassing passwords.
The first outage where I thought I was going to get fired: I was working on a system that had a single-point-of-failure server, and through a mishap with rsync I accidentally destroyed the contents of /etc. That SPOF also had no backups. (I'm not claiming it was well-designed...) Thankfully the job that depended on that server would not kick off until morning, so my team slowly reconstructed its functions on a separate machine and swapped it in behind the scenes. I helped as much as I could while vibrating with anxiety, and my team was incredibly kind throughout. I was not in fact fired. :-)
The most recent outage I caused? Yesterday! I accidentally rebooted most of the machines in a development cluster. It's a dev system, there's no SLA, on the whole I don't feel horrid, but it definitely ruined a few people's work for an hour. This morning I spent a few minutes putting in a guard rail to prevent that particular mistake again...
If you're in this job long enough, everyone breaks things -- it just happens.
The second time we deployed, I happened to glance at the deployment size immediately after deploying. For about five seconds, our deployment size went from 100 down to 2. The reason for this was simple: The "Replicas" count was specified in the deployment spec, and it was set to the size we used in our staging infra. That had been fine in prod, and was quickly overridden by our autoscaling configuration, but it did cause the Kubernetes infrastructure to take down every existing pod (minus two), then bring up a bunch of new pods very quickly.
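One common mitigation for this (a sketch of the general pattern, not necessarily their actual fix) is to leave `replicas` out of any manifest for a Deployment that an autoscaler owns, so an apply can't stomp on the live count. All names below are placeholders:

```yaml
# Sketch: Deployment spec with no "replicas" field, so applying the
# manifest leaves the autoscaler-managed count alone.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  # replicas intentionally omitted; the HPA owns this value
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: example/frontend:stable
```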
I immediately typed "rm README" and hit enter.
Then I crawled under my desk and wouldn't come out until we'd gotten the file restored from backups. Naturally, it had no useful information in it.
Then there was the time, for no readily apparent reason, where I typed "DELETE * FROM table" (in the dev database). Fine, I thought, it's time to go home, and submit a request to get the DB restored.
It turns out that they kept one (1) day's worth of backups, which they took at 6:30pm or so. I submitted the request at about 6:00pm and the DB guy had already gone home; he did the restore about 7:00am the next morning. Yes, he restored an empty table.
I also screw up all the time in ways that would cause outages, except we have automated tests, tsan/asan, code reviews, a staging environment, various safety checks, experiment gates, pre-mortems, slow rollout procedures, an alert on-duty SWE and on-call SRE, etc.
Today one of my mistakes was caught early in the prod phase of our push. That's much later than I would like but still before it did any real damage. I submitted the bad code last Wednesday and have been out sick with the flu (and caring for my preschool-aged kids) since then, so my awesome team handled my problem for me.
Given what I think is her age (judging from using C64s and whatnot) I'm going to go out on a limb and guess this was "The Sure Thing" with John Cusack and Daphne Zuniga. It's a great movie if you haven't seen it.
Then there was the time I broke e-mail for Global Network Navigator, which was a partnership between O'Reilly and AOL. Lost all e-mail for over a million users on what was then the first nationwide ISP. I also submitted that one to The Register as well, but they haven't published it, at least not yet.
Always a fun one to share. :)
UPDATE Users SET Password=?
We had backups. Selective restore. 7 accounts that were new, not in the backup, got a special flag that required a reset.
Years later I still consider it my biggest screw up. Everything else can be explained by bad processes, documentation, etc but this one is just me being stupid.
Overall one of the craziest days on my life.
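The classic guard against that class of mistake (a generic sketch, not what that shop ran) is to do destructive updates inside a transaction and eyeball the affected-row count before committing:

```sql
-- Sketch: run the UPDATE in a transaction, check the damage, then
-- decide. Table/column names echo the story; the rest is generic SQL.
BEGIN;
UPDATE Users SET Password = '...' WHERE user_id = 42;
-- Most clients report "1 row affected" here; if it instead says
-- 1,000,000 rows, you still have an out:
ROLLBACK;  -- or COMMIT once the count looks right
```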
I became _that_ guy.
My then-workplace didn't always have enough funds, though as an employer they were generally generous especially considering their actual finances. This is relevant to the story because this employer:
1. was very lenient when it came to office attendance. So we frequently worked remotely at odd hours; that was normal. But as a matter of professionalism, I always tried to be conscientious about the hours I put in. Most weeks I probably did more than usual, the merits of which are another discussion entirely.
2. periodically organized events to promote the business. But being short on funds, they didn't have money to hire an actual photographer. So they'd ask me to shoot because I was interested enough in photography to, at the very least, have the gear for it.
The day I became _that_ guy, they had this event I was supposed to shoot, but they communicated the time really badly to me. I expected to be able to do at least three, maybe four, hours of work before I was needed with my camera. This is what I communicated to my TL.
Turns out they needed me _earlier_, such that I only had an hour of work done so far. Again, office culture was lenient about such things so my TL didn't really mind if I left then. The event was some kind of a big deal besides.
I'd generally start my "hours" in the afternoon, way after lunch. So by the time this event was done, it was already pretty late in the evening. I had my dinner and received a message from my TL. Nonverbatim:
"Hey can you update PostgreSQL (9->10) tonight? It shouldn't take too long and here's the steps..."
It was still within my "usual" working hours, but a couple of things that night made this request end in disaster:
1. I was tired from the event. Honest to goodness tired. I should've called it off when I couldn't even entertain myself enough to stay awake waiting for one of the given steps to finish. But I didn't because...
2. I didn't have the heart to beg off on this task when I'd only done one hour of technical/engineering work for the day. To be fair, my TL always abided by the rule "Don't touch prod when tired; you will make things worse". I'm pretty sure he would've understood if I'd explained the state I was in; we could've done it the next night. But when you're tired, and embarrassed at having done only one hour of work for the day so far, your decision making is exceptionally unsound, for lack of a stronger adjective.
Unfortunately the technical bits of this story get fuzzy; it's been two years. But back then we had just migrated to Kubernetes, and a couple of months in, the team was still adjusting their mental models from servers to containers/deployments/statefulsets/pods, from thinking about HDD vs. SSD tradeoffs to Persistent Volume architecture issues. This is also why upgrading Postgres was such an ad hoc process for us then. We simply didn't know better (if something not "ad hoc" even exists).
Part of the instructions was to "delete the old data directory of Postgres" (cue: I have read this in a postmortem before...). Because I was tired and lazy I wrote a script so the update could go without my (much needed!) supervision. The instructions were sound and the deletion would've been safe -- assuming all the steps prior to the deletion finished successfully. They did not, and I did not use `set -e`. Which meant I just deleted all the prod data on the master. I was efficient. The realization woke me up harder than sugar ever did.
To cut this already long story short, I at least had the sense to concede at that point and wake up my TL with the bad news. Much like the rest of this story, what saved me that night came in twos:
1. I at least had the sense to put the site into maintenance mode.
2. I used `rm -rf`, as opposed to issuing DROP statements to psql. Which meant that my fuck-up did not replicate. So we just promoted the replica to master and downgraded the master to replica and monitored replication.
These two together ensured no data loss. Apocalypse canceled. Everyone in the company went to work in the morning none the wiser.
This story actually had a less fortunate sequel but that story is not for me to tell. And besides, I've written long enough.
These days, when running as root, I concentrate hard on every command, asking myself whether this is really exactly the right thing in every respect.
When my hands start shaking, I know my mind is in the right place.
Other places may use similar terminology but OP is at FB.
She's not worked there for nearly 2 years now: https://rachelbythebay.com/w/2018/03/10/free/
The classification can be in terms of impact to the business. E.g. "P1" could be reserved for issues severe enough to prevent the business from functioning as a business (e.g. the bank that cannot process customer transactions, the CDN that cannot distribute content).
P3 might mean some features of a service are broken and 1% of your customers are pissed, but there's a workaround available or the features aren't really critical.
A simple, early one: We used VNC for remote desktop support of line-side production computers. One of my team leads was walking me through what was on the screen and what was going on. I was used to right clicking on these screens to see more of what was going on, but this one happened to be running a script that was interrupted by this. When I right clicked, my team lead freaked out, and the operator on the floor freaked out, and started moving the mouse themselves and clicking everywhere. After a while they started just doing their job again, but shortly after we got a call from a supervisor.
I got a call saying tracking was off on a production conveyor. This means that operators were getting incorrect instructions and work was being recorded on incorrect units. I adjusted tracking to match what I was being told. All good.
I shortly got a call from the same conveyor saying tracking was off. I told them "Yes, I just fixed it". "Well it is wrong now, it was fine a minute ago". So I adjusted it to match what I was being told now. Who knows what that other guy was smoking.
Right as I finished re-adjusting tracking I got another frantic, high energy, expletive filled call saying the tracking was off.
Dear reader you may have guessed what was wrong.
Since I got multiple sets of contradictory information, I decide to go out to the floor. This is what I see (simplified):
Footprint 1,2,3,4, etc
FP1 FP2 FP3 FP4 FP5 FP6 FP7 FP8
008 007 006 ___ 005 004 003 002
You see, the first person was on the first half of the conveyor, and the second person was on the other half. They were both correct, but neither had the full story. There was an empty carrier in the middle of the conveyor.
Last one. One of our weekly ops tasks was to verify that the 3 (three!!) scheduling services agreed with one another and also the production schedule. Unfortunately sometimes we got late breaking schedule changes, like running extra time or extending or moving lunch. On a Friday night/Saturday Morning, I got one such change. We were going to run an hour extra to make up for earlier lost units.
I made the requested change and went back to "compiling code" etc. (Perks of night shift)
Some time later... I get a call on one of my radios (Nextel) saying the lights were off in the back of the shop. I say, "hmm, that's odd" and go to the screen to turn the lights on in that area. I get a call on my other radio, saying the lights were off in the middle of the shop. Oh sh*t. For context, the lights were now off for over 200 pissed off people who just wanted to finish their overnight shift and go home. I continue to press buttons to turn lights on, hindered by the fact that the lighting controllers were on a very very slow daisy chained serial bus. My radios continue to go off with people urgently and excitedly informing me that the [expletive deleted] lights were off. I also got a visit from the Plant Shift Lead (2 or 3 steps down from the plant manager). I was pretty surprised to see her, as I was kind of wedged in a corner with a bookcase blocking half of the entryway to my cubicle.
Anyway, I eventually got the lights turned back on. Looking at the schedule changelog, I had successfully extended the shift, but for the wrong day. I had done it for Saturday, as the clock was past midnight when I edited the schedule. Oops.
These were all relatively early in my career, but I think they're pretty colorful.
To my mind SRE != sysadmin; SRE is a principle of tackling "sysadmin" work as if there were no sysadmins: engaging software solutions and engineering to track recurrent problems with a top-down approach, often with little understanding of high availability in hardware or OS design.
Sysadmin is historically a role of automation and reliability, but working from the bottom up. I (and others) make sure operating systems are not exhausted and that the hardware can support various reliability metrics.
Personally, I think these roles are complementary because an auto-healing system that has a stable platform is going to be more reliable than something that is very over-engineered to deal with hardware faults as a common occurrence.
I don't think title inflation is necessary.
Don't get me started on "DevOps" engineers. It's either rebranded sysadmins doing the same thing but maybe with some CI/CD. Or Developers who have been thrown to the wolves. Hardly anyone is actually using the "there are no fullstack people, only full-stack teams" mantra.
I just lament the truth of your statement. :(
This made me laugh! This is so true.
There is some truth in advertising!
I feel like so many of the posts I read and enjoy could lead with that statement.
However, "even in turds you can sometimes find a peanut". I mean, come on...
Hardly an unreasonable description of hackernews comments.
I'm confused about why folks seem so upset by people expressing this opinion.
At least it's clear that this is just you White-Knighting....
IMO this is only validating the criticism the article levels at comment sections like HN's. You have picked out some random sentence and expressed no more than idle disagreement. Maybe I personally wouldn't compare your comment to a turd, but there's not a whole lot of nutritional value in it either.
Perhaps the reason the article expresses this concern in this specific way is because it is warranted. Because people insist on disassembling articles coming from this domain sentence by sentence and posting comments that really don't say anything helpful or sometimes anything at all.
2. indiscriminately rude towards an entire community?
> Because people insist on disassembling articles coming from this domain sentence by sentence
I feel like I may have walked into something where I don't have much context. I'm not sure what you mean by that.
Also, I find it strange people are so fixated on my criticism, and nobody has commented anything about the praise I made in the very same post.