Trying to sneak in a sketchy .so over the weekend (rachelbythebay.com)
410 points by ingve 10 days ago | 220 comments

I expect this post will resonate with a lot of HN readers. Every tech company I’ve worked for has had at least 1 manager who tries to ship features over the weekend with a ragtag team of developers who don’t understand the system or why that’s a problem.

If they succeeded, the manager would have pointed to the feature as an example of their “hustle” and ability to get things done where others couldn’t.

If they shipped the feature and it crashed the website, the manager would blame the front end team for making a fragile system that couldn’t handle a simple feature.

If they failed or were blocked, they’d point out their working proof-of-concept and blame the front end team for onerous process or politics.

The real litmus test is how the company reacts to that manager after this stunt. If the company sides with the hustle manager, it doesn’t bode well for engineering in the long term. When management rewards shows of effort instead of long-term results and fails to acknowledge negative externalities or selfish managers, you breed more of that behavior.

However, if management sides with engineering and shuts the hustle manager down, you’ve found a good workplace.

Every company over a certain size has a manager who explicitly puts together a ragtag team of cluelessness in order to "get stuff done", because they don't understand the complexity of the stuff. They interpret pushback as a lack of cooperation rather than as the sanity check it is.

Ideally, once they identify themselves by trying to pull the trigger, you can move them out of the company.

The probability of such "get stuff done" stunts goes up as more seemingly pointless, painful, slow, and bureaucratic processes are introduced.

Many of these processes may be necessary, but it's also necessary to explain why, and to make them as fast and painless/frictionless as possible - especially as each single process in isolation may seem reasonable, but when stacked on top of each other, the "get stuff done" approach becomes a lot more tempting.

>especially as each single process in isolation may seem reasonable, but when stacked on top of each other, the "get stuff done" approach becomes a lot more tempting.

Process stacking for me is one of the reasons why it's super painful to work in big companies if you want to get something done. As soon as somebody makes a mistake, they will add a little bit of process to ensure that never happens again.

Individually, as you point out, that makes sense. But if you have to go through a 1,000-item review checklist for a single line of code, then I can assure you that no human will be able to actually think through those 1,000 items. But they will go through the motions to satisfy the process. Then, because they have the checklist, they don't think they have to think about it anymore. They make a mistake. It gets added to the checklist.

I experienced situations where a single code change would take at least a month. This led to people trying to save time on a) tests, b) any kind of refactoring, and c) adapting shared libraries (writing their own implementation instead, because fixing the library would be two code changes, and not just twice the process effort: an actual committee had to approve the library change first).

So a lot of process, IMHO, is the worst thing you can do for your code quality. Checklists are good, but they should be limited to a manageable number (e.g. 10 items; if you want to add something, you have to remove something less important first). It should also not be harder to do the right thing, e.g. centralizing functionality in libraries should be easier than duplicating it.
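A minimal sketch of that capped-checklist rule in Python; the cap size and the displacement check are my invention, just to make the idea concrete:

```python
# Sketch: a review checklist capped at a fixed size. Adding an item beyond
# the cap requires removing a less important one first.

MAX_ITEMS = 10

def add_item(checklist, item, priority, remove_first=None):
    """checklist maps item name -> priority (higher = more important)."""
    if len(checklist) >= MAX_ITEMS:
        if remove_first is None:
            raise ValueError("checklist full: remove a less important item first")
        if checklist[remove_first] > priority:
            raise ValueError("may only displace a less important item")
        del checklist[remove_first]
    checklist[item] = priority
    return checklist
```

The point of the forced trade-off is that the checklist stays short enough that reviewers actually think about each item instead of rubber-stamping it.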

It's a cool thing that software moves so much faster than other processes, like, in my experience, military-grade weapons loading.

Our sub would dock at the pier the day before, everyone but Weapons Department got the day off/in port duty day. Weapons Department would hold an all afternoon walkthrough of the entire process. Manpower locations and roles. Equipment setup and basic operations. Types, quantity, and sequence of weapons to take aboard. Expected timeframe / pace so that no one was expecting to have to hustle to catch up.

And everything was in binders, with plastic strip edged pages and fresh grease pencils issued to everyone managing.

Every one of those steps was a result of "Ok, crap, what do we rewrite to make sure (shudder) THAT NEVER happens again."

And even so, on my fifth loadout party, I still missed a retaining strap and almost helped dump a torpedo in the harbor, except there was already a step right after mine with a separate checkbox that said "Aux handler has checked strap type/quantity/positioning for weapon type."

Procedures are great for the things that need them. And when you have numerous teams/functions scattered about, procedures are even more necessary.

And I do get that a lot of code is not likely to detonate under the company's hull, per se.

Moreover, in this industry, a checklist is usually a reinvention of 18th century manufacturing process. A lot of the "needs checklist" can be transformed and automated into "do the integration tests pass?" Several orders of magnitude faster, as it doesn't need couriers on horseback to carry documents to and fro, or a live human imprimatur. Talk about repeating history...
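As a sketch of what that transformation looks like: each manual checklist item becomes an automated check in the deploy gate. The check names and the `change` dict below are hypothetical stand-ins, not anyone's real process:

```python
# Sketch: manual checklist items rewritten as automated gate checks.
# Each check inspects a hypothetical "change" record and returns pass/fail.

def check_migrations_applied(change):
    # stand-in for "did you run the schema migrations?"
    return change.get("migrations_applied", False)

def check_tests_pass(change):
    # stand-in for "do the integration tests pass?"
    return change.get("tests_passed", False)

def check_feature_flagged(change):
    # stand-in for "is the new behavior behind a flag?"
    return change.get("behind_flag", False)

GATE_CHECKS = [check_migrations_applied, check_tests_pass, check_feature_flagged]

def run_gate(change):
    """Run every automated check; return (ok, list of failed check names)."""
    failures = [c.__name__ for c in GATE_CHECKS if not c(change)]
    return (not failures, failures)
```

Because the gate runs in seconds and names exactly what failed, nobody has to courier a signed-off checklist between teams.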

I tried to introduce that, but you need to change culture even more in that case.

>As soon as somebody makes a mistake, they will add a little bit of process to ensure that never happens again.

To be fair, the other side of the coin is when it is simply not possible to get certain things done, because the need was not anticipated when the processes were designed. If you don't have the company political clout to get these processes amended, your only option is to wait until a customer is negatively affected, in order to drive the point home. Still, hustling (even if it is well-meaning) is of course not an acceptable solution.

Absolutely agree. And it's up to a company to ensure that process isn't pointless or obfuscating the reason for its existence.

The less clarity there is on the "why" the more creative the management will be.

Of course, managers who say "I don't believe that will happen so I'm going to skip this part." should be walked out of the door to their car immediately. :-)

I have seen the rag tag "get stuff done" team that works really well because they are staffed with people who know what they're doing. A team that knows better than to try to ship late Sunday for example, perhaps small and organizationally lax but experienced and disciplined.

The issue can sometimes occur when the manager doesn't know that their rag tag team is not this special case, but actually clueless. Or have not learned to spot the difference, or that there is a difference.

Bingo! The origin of the cargo cult, right there. "Look at that team: gets stuff done, and quickly! Therefore, I shall assemble a team at random, and push for speed at the expense of everything else. The results must surely manifest, for I am following the incantation!"

Incidentally, the "Pirates" of Apple put together the first Mac. Though not exactly rag-tag, but brilliant; just a counter-example, I suppose.

Well, the Mac was designed and developed by Apple engineers hand-picked by Steve Jobs. So it was an official long-term project with backing by the CEO.

> just a counter example

It took me a while to understand why HN'ers revel in "the counter-example."

In mathematical proofs, you only need one counter-example to refute a proof or argument.

Pedantic HN'ers seemingly fail to realize that mathematics and the real world are not the same thing.

I think it's the opposite of what you're imagining. Refutation is generally a claim that the truth of the situation is more nuanced than a simplistic statement would make it seem.

Here you're right to raise the issue, but it seems the comment is merely trying to point out that 'not all [Scottish!?!] rag-tag teams are bad' and to draw attention to some such teams being superb. Which seems a fair comment to me.

>In mathematical proofs, you only need one counter-example to refute a proof or argument.

The things we're talking about here aren't mathematical axioms, they're general trends. One counter-example does not disprove a trend. Every real-life trend has exceptions, and it frequently is interesting to examine the exceptions to see why they bucked the trend.

I don't think that the goal here is refutation; it's adding additional data points to give a more nuanced picture.

Yup, and developers looking for shortcuts or for how to do things the way they are used to (e.g. calling a database directly). I can recall one instance where a guy was asking for SSH access to the production environments, just so he could look at some env variables and logs. We had the best ops team (IMO), who worked with Principles (they were working towards 99.99999% uptime), which simply excluded SSHing into production servers. The developer was told to add something to his application that logged the environment variable, if that's what he really needed. It's a bit more work, but at the same time he had no excuse because new deploys were a matter of minutes.

But anyway, the main lesson I learned there is that as an ops team (or broader, as an IT department) you need to have Principles, capital P. A short set of rules and goals which you can always point to. Like the uptime goal, which excludes / includes a LOT of things right off the bat - access controls (nobody can touch production directly), testing practices, application architecture (stateless), etc.

I don't think asking what environment variables are available for application developers to use on the servers is an unreasonable request. It should probably be part of the platform documentation though. Logs, again there should be a documented safe way to do this (ELK etc.). SSH to production is not the answer though.
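A rough sketch of the "log it instead of SSHing in" approach: the application logs a whitelisted subset of its environment at startup. The variable names here are invented examples, and a real platform would ship these lines to something like ELK rather than stdout:

```python
# Sketch: log a safe, whitelisted subset of the environment at startup
# so developers never need shell access to production to see it.

import logging
import os

# Illustrative names only; never include secrets in this list.
SAFE_ENV_VARS = ("APP_ENV", "DB_HOST", "FEATURE_FLAGS")

def log_environment(env=None, log=None):
    """Log the whitelisted variables; return what was logged for inspection."""
    env = os.environ if env is None else env
    log = log or logging.getLogger("startup")
    seen = {name: env.get(name, "<unset>") for name in SAFE_ENV_VARS}
    for name, value in seen.items():
        log.info("env %s=%s", name, value)
    return seen
```

The whitelist is the important part: it makes "what can developers see?" an explicit, reviewable decision instead of whatever happens to be in a shell session.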

BTW, the night crew pulling a stunt is how the Chernobyl Disaster happened.


They weren't exactly pulling a stunt. They were carrying out a test as planned, just off schedule (because of other operational demands) and the key thing is that neither the test planners nor the operators were aware of the design flaw in the reactor (because the government had declared it a state secret).


These analysts say that Soviet authorities appear to recognize that operator errors at the Chernobyl plant on the night of April 25-26 were not the sole cause of the accident, and that technical flaws in the reactor’s design contributed to the worst accident in the 44-year history of nuclear energy.

In particular, they said, a distinctive feature of the Chernobyl design, which sets it apart from conventional nuclear power plants in most of the world, is its tendency to generate a sudden and uncontrollable burst of power if large steam bubbles, or “voids,” are allowed to form in the reactor core, as they did before the accident.

This peculiarity of the Chernobyl type of graphite reactor, called a positive void effect, is now seen as a decisive factor in the accident, one that transformed successive blunders on the part of Soviet operators over a period of hours into a catastrophe.


Never heard of this before - thanks!

If this interests you, there's a show on HBO by the same name that you should give a go.


Making it easy and self service to do things the right way also helps create a cultural expectation that if you think you need to leave the happy path, you’re probably holding it wrong. Or at least, you brought the blockage/expensive coordination problem on yourself.

Yes, you are correct. But, to play devil's advocate to this, I will also say that there are operations where resistance to any change that will cause more than a trivial amount of work are roadblocked endlessly for no substantive reason except that it will disrupt a cushy routine of just keeping the lights on.

It's not that black and white though. A startup will need to have a different culture than an established company with thousands of users. What is right for a consumer company is not necessarily right for an enterprise company. Companies need to evolve all the time, and such stories are tipping points where such evolution should happen.

Startups aren’t an excuse to get sloppy and break other team’s production environments.

If a startup needed to move quickly, they’d ping the relevant parties at the planning phase and get everyone on the same page.

I was referring to the archetypal “hustle” managers who deliberately try to do an end-run around other teams for their own personal gain.

Startup or enterprise, doesn’t matter. You can’t have management that rewards asymmetrical games that benefit rule-breakers at the expense of everyone else, including the customers.

Here's an example of an end-run. I had a 6 month secondment to the USA at a previous job, and I needed a server to do data analysis on. It would have taken over 6 months to get one provisioned for me because of "process".

I had a laptop which I brought with me from Australia. It wasn't in the asset register in the USA, so I was entitled to a computer. I ordered the most highly specced desktop build available, put it into the empty cubicle next to mine, and spun it up as a development server. It didn't have backups, but that was OK because I never worked with primary data on it and all my work got committed back to Perforce daily.

Strictly it was very much against policy, but policy would have meant I spent 6 months sitting on my hands. My manager "hustled" for me and did an end run around process.

That's fine because it isn't contagious. If your rulebreaking bites you, the entire company's customer base won't even know.

no no no no. I've seen this a hundred times[0].

That service becomes the source for a management report.

That management report contains useful data that the CEO looks at weekly and uses to build his board report.

The original Aussie guy leaves, but leaves the laptop behind because it's not his. He also doesn't document it (because that would get him in trouble for running a server on a laptop).

...Years pass...

The laptop finally dies. The CEO is furious because he can't create his report. He leans on the IT manager. The IT manager has no freaking idea where this report is coming from or who makes it. They lean on the Support team to find out which server produces this report. The Support team drop everything to work out wtf is going on, because this suddenly became their #1 priority.

Eventually, someone finds the decaying husk of the laptop, and works out what's going on. They put together a plan for creating a supported server to do the same thing. It'll take weeks, because they have to provision a server properly through the usual channels. CEO has a rant at the entire IT department for not supporting critical business processes, and not being agile enough to support the business. IT manager takes a beating in the next management reviews. No-one is happy.

[0]usually a rogue spreadsheet rather than server. The worst case I saw was an Excel spreadsheet in a business-critical department running on a user machine with a post-it note on it saying "don't turn this machine off". If the logged-in user name wasn't the same as the temp who had originally built it, the spreadsheet refused to work and the department ground to a halt.

Slippery slope fallacy. We're talking about temporary measures to get things done. If you don't pay back the technical debt, then sure. It always comes back to haunt you.

If you have worked in support, this tends to be less fallacy and more reality. The point is that people cutting corners in the purchasing process are often fed up or in too much of a hurry, so things are not documented the way they should be for production. When these process jumpers leave, the "tech debt" Piper is paid via your support team's/IT's burnout.

I would agree with you, except I've seen it happen so freaking often

I hope you don't mind -- to me that sounds like an off-topic example. What you did, was good for you and the others. Whilst the one you replied to, wrote about doing things that would be good for you, but bad for others:

> ... benefit rule-breakers at the expense of everyone else, including the customers

Agree with that. Rules should be agreed and then followed by all. I was talking about the general rule of whether "ship features over the weekend with a ragtag team of developers" is bad.

At the exact moment you take money or personal information from users, you really had better have a safety/security-first approach and a defined, repeatable, scheduled, abortable deploy procedure. It's not the number of users, it's what it's going to cost each user when you screw up.

> Every tech company I’ve worked for has had at least 1 manager who tries to ship features over the weekend with a ragtag team of developers who don’t understand why that’s a bad idea.

Maybe I've been in startup land for too long, but seems super normal and fine to ship a feature over the weekend if it goes through the regular CI gates - it's tested, peer reviewed, been QAd in staging, etc. Is this not accepted outside of startups?

The hypothetical scenario you’re describing and the story I just read have almost nothing in common except the word “weekend”.

Deployment over the weekend can make a lot of sense in the world of B2B, but there’s a difference between a carefully thought out plan to deploy at a quiet time and sneaking something out when no one is looking.

Shipping over the weekend isn’t always a bad thing when it’s necessary, but it needs to be an all-hands effort. You need to give people a heads up and at least collect the minimum base knowledge to do it right.

In this case, someone was trying to quietly ship things to production on a Sunday without involving the owners of the front end. How would it look for you if some other team crashed your part of the website on a Sunday without even coordinating the change with you first?

My point was that it’s important for companies to not reward selfish behavior from managers who want to make a name for themselves. If you genuinely need to ship a website feature on Sunday, you involve the website team for launch and follow up monitoring. You don’t try to quietly ship it out the door at the risk of breaking other parts of the business, as Rachel explained in the article.

That sounds like a nice smooth pre-production process...

Her story was of an untested feature trying to get injected into the frontend on Sunday to meet a Monday deadline, by someone without the proper access, and with no apparent oversight or process concerns.

I'd do everything I could to block this push as well.

Wouldn’t “everything” in this case be to summon the managers to drop blocks on heads? If you are yourself a manager, summon a more powerful manager that can drop more blocks.

Here's another idea, you can ship late. There has to be a really good reason to meet the Monday deadline and it must be really important to the company. In fact it would be so important that the people and departments would know that it's important. They wouldn't find out about it on the weekend the day before it ships.

Sounds like these people designed a component that wouldn't work on the current infrastructure. That happens, but if there is a big oversight like this, then you should ship late, instead of risk taking everything down.

> seems super normal and fine to ship a feature over the weekend if it goes through the regular CI gates - it's tested, peer reviewed, been QAd in staging, etc.

Multiple of these processes you listed often require humans. You either are asking them to do this during the weekend, which is bad, or you gave them ample time during the week meaning you're ok with steps needed on weekdays, so you can do that for the last step too, with full staff present.

(bugs and urgencies notwithstanding, but that doesn't appear to be what we're discussing)

Depends. If you have a full CD/CI system and use feature flags, you can definitely push such an update into production, having the feature disabled by default but enabled if the requests have special attributes set, or for a specified group of people (A/B).
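A toy sketch of that kind of flag check, with invented flag and group names (a real system would consult a flag service rather than module-level dicts):

```python
# Sketch: a feature shipped to production but disabled by default,
# enabled only via a special request attribute or for allowlisted groups.

DEFAULT_FLAGS = {"new_checkout": False}          # in prod, but off for everyone
ALLOWLIST = {"new_checkout": {"qa-team", "beta-group"}}  # A/B cohorts

def feature_enabled(flag, request):
    """request is a dict sketch with optional 'override_flags' and 'group'."""
    if flag in request.get("override_flags", ()):      # special request attribute
        return True
    if request.get("group") in ALLOWLIST.get(flag, ()):  # allowlisted cohort
        return True
    return DEFAULT_FLAGS.get(flag, False)
```

The deploy itself then becomes low-risk at any hour; turning the feature on is a separate, reversible decision made when people are around.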

My comment is unrelated to what you can do, it's about what you should do. If you can afford weekday humans in the other parts of the process, you can afford a weekday release.

Considering the described situation is someone asking for help understanding how to achieve something (which indicates they don’t have the familiarity with the processes in place to do these things), and they’re asking the team responsible for that infrastructure for the first time on the day before launch, I don’t think one can claim that it’s been well-tested and peer reviewed.

sure, but I'd think some gates on that are standard practice no? Owners needing to approve changes, tests passing, QA'd in a staging env, etc

Sure, but we’re so far away from that scenario here. In that ideal world you describe above, the project leads would have already talked through the requirements and processes involved with the environment and/or service owners and/or release engineering groups. Not all changes are of the sort where “if the tests pass, it’s safe to ship it”, and that requires learning what is acceptable and not, and talking to the relevant people when you’re out of scope.

Environments like those described often have continuous push and automated slow rollouts with health checks, so the idea of doing something on a Sunday isn’t that strange at all.

That said, there’s something to be said for not trying to locally optimize. If you push bad stuff on Sunday, you’re messing up a bunch of people’s well-earned rest and recovery time from work. You push bad stuff on Monday, and everyone’s there to help you fix it without the stress of lost family or other commitment time.

The difference is 24 hours, which likely isn’t going to make or break anything. It’s easy to get sucked into believing things like that matter when they don’t.

I'm at a fintech in NYC and my team generally doesn't release stuff to prod between like 10AM Friday and 10AM Monday.

I haven't had specific conversations with anybody about it, but I think we have all been around the block enough times to have been burned on a few weekends when it really wasn't necessary.

Not a start up at all though, and not a team of twenty somethings with anything to prove by moving fast and breaking things.

Agree with you though that I have seen this at a lot of places. I did a number of phone interviews looking for a more relaxed place in order to end up here.

Deploying stuff over the weekend is standard practice for trading systems, because that's when markets are closed. But obviously those deployments are discussed and planned during the week.

In fact most trading platforms have this huge advantage of not being 24/7 operations.

This handy flowchart hangs on the wall near my company's core infrastructure group [0].

[0] https://www.amazon.com/Should-Deploy-Friday-T-Shirt-Sizes/dp...

That's almost complete... there's another three question flow which gets to yes.

1) Is stuff really badly broken?

2) Was the really bad breakage introduced recently (this could either be an earlier bad rollout, or it could be external factors changing)?

3) Is this requested deploy either a revert of a recently-made bad change or the minimal possible fix/bandaid?

If all three of these are true, then you can do a deploy RIGHT NOW regardless of the calendar.

(recently - because if this is something which has been broken for years then it's unlikely that it suddenly became urgent absent external change - which is already carved out above)
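Those three questions can be written down as a trivial predicate, just to make the "all three must hold" point explicit:

```python
# Sketch of the three-question emergency-deploy flow above as a predicate.

def emergency_deploy_allowed(badly_broken, breakage_is_recent,
                             is_revert_or_minimal_fix):
    """All three must be true to deploy RIGHT NOW regardless of the calendar."""
    return badly_broken and breakage_is_recent and is_revert_or_minimal_fix
```

Any single False sends you back to the normal process, which is the whole point: the exception is narrow by construction.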

So what you're saying is, it's acceptable to deploy on Friday at 5pm if and only if you're undoing the damage caused by some chump deploying on Friday at 4:55pm?

Yes. And then to find out why a DEFCON-1 level revert was necessary at all: that's a deploy that you want to be prepared for, but you never want to use.

Or some data corrupting bullshit that was released on Thursday and it's taken you that long to notice but it's causing ongoing damage.

It really depends on what it is you're shipping.

Once a startup hits some level of maturity, it's unacceptable to be shipping something significant on the weekend (or whenever people aren't around to respond to an issue). Probably post product-market fit, maybe Series B.

I guess it also matters how much your company values work-life balance.

In my experience at all levels, I would say Engineers generally don't have a very good sense of the big picture and of what really matters, and are generally caught up in a lot of details and unneeded complexity that doesn't create value in the manner they think it does.

I remember as a young Eng. getting caught up in the platform holy wars, and then sitting as a PM looking back on it all like I must have been in some cult.

There's truth to the notion that 'it's complicated' and rarely does anything get done in a weekend, but if there is focus, a decent dynamic process, things can move faster.

I worked at one company that had a messaging product; it had a big team of Engineers and things were at a snail's pace. I suggested bringing in a few talented people and starting from scratch as a refactor; they thought I was crazy. A young intern left the team, did it on his own with one other person, and met with enormous success. The company, even after literally watching an intern out-do them, never changed.

Both the old company and the new company are big enough names you've all heard of them I wish I was at liberty to share.

In another project, we were opening up some basic APIs. We did some work with Facebook and they were able to give us a custom API in literally a few days. Our own, simple APIs took 18 months to deliver. The weekly product teams consisted of 10 people rambling on - and the two most important people, the dudes actually writing the code, were not ever present. It was a colossal and shameful waste.

Even though getting a rag-tag bunch of Engineers over the weekend is usually not a good sign (it might actually work for some marathon bug fixes or something), I'm usually sympathetic to the cause.

I’ve faced this a lot, particularly when the management is a couple of layers away from the team doing the work.

It makes me wonder how these organisations don’t collapse under the weight of their ineptitude. Most of the bugs or issues I have to fix are from problems we created by short term hacks. Way beyond simple tech debt.

The engineers are as much at fault as the managers, particularly when it comes to introducing insane complexity to the stack to solve simple problems (how many startups seriously need to invest in tech like gRPC or GraphQL, except to gain cool points?). Management, on the other hand, has no empathy for the people doing the work, so quality dips as we are pressured by both self-imposed and external deadlines which are decided with zero input from engineers.

Half the time we web engineers are building glorified content management systems with some nice design over the top. It’s boring but it’s not a burn out.

Funny that everyone leapt to the conclusion that it was a junior developer. I immediately thought of a senior developer who "gets things done" and gets rewarded for that, but leaves a flaming pile of rubble in their wake. I've seen bad bugs written by junior developers, but I've spent more time cleaning up issues introduced by experienced people for whom time had just reinforced bad habits. I don't know which is more common in general, that's just my personal history.

Who knows which it is. The article says they haven't touched the frontend servers yet, not that they haven't done anything substantial yet, so I don't think we can know unless Rachel decides to comment on that point.

The worst experiences I ever had at a job were due to the most senior developer there. He was basically given this god-like status from management where he could do no wrong, and had free rein of all systems, because he'd been there for so long.

Yes, he was an incredibly capable software developer, but that had led to an ego issue where he thought he knew EVERYTHING, not just developing in his particular stack... so he'd find his way into various systems (software and hardware) he had no experience in, and assume he could just figure it out.

We had half a dozen full-on-disaster level situations in the couple years I was there thanks to this 1 person. Each time it was shrugged off as "an accident" because heaven forbid you actually upset him or hold him accountable.

This sounds a lot like my current situation at work. We're a decently-sized company of 100+ people. About a quarter of them are developers. Developers are divided into teams who are responsible for specific components of our system.

However, there are also a few "architects" who are basically free agents with god-like status. They don't architect anything. Rather, they do a lot of counter-productive things such as attending meetings for specific teams without those teams knowing. Then they make large decisions about the direction of a given feature without documenting anything, let alone letting those teams in on what's going on. It's always a fun surprise and works wonders for our ability to deliver a solution on time /s

Beyond that, they'll go off and develop whatever they feel like and force it through the system. We have a code review process, but if you bring up that problems are present in their code, they will often say that it doesn't matter and that the code just has to go through to meet some unspecified deadlines. They'll get management to skip code reviews if you try to hold them accountable. And of course, this results in things exploding. But then they can clean up those exploded bits, knowing what they were, and come out heroically since they "put out so many fires".

I'll stop ranting. But it really is frustrating, especially when you get shut down by management for bringing up any issues with this chaotic group.

Sounds like every startup ever.

I've observed that pattern with developers who are relatively senior, and have been in the same, smallish place since they graduated. Basically they had no technical feedback ever in their career.

Some developers put their crazy ideas and experiments on github; some developers put those in production.

> I've spent more time cleaning up issues introduced by experienced people for whom time had just reinforced bad habits

In my last job there was this guy who had more than 10 years of dev experience and:

- He tried to convince me it was better to write our own encryption algo instead of using https.

- His MySQL tables used no foreign keys, only int fields.

- He mixed all sorts of naming conventions in his code and databases: English, Spanish, camel case, snake case, kebab, etc.

- He spent 1-3 hours every day getting into the bank web apps, downloading some pdfs, copying and pasting stuff around, and finally producing a report by hand. Every single fucking day. So I developed a small service for extracting financial data from those pdfs into JSON. He only needed to read those JSONs and automate the reporting part. No, he kept making those reports by hand.

The guy was a friend and long-time collaborator of the IT manager, who was completely ignorant in terms of dev.

> His MySQL tables used no foreign keys, only int fields.

This one is potentially justifiable if speed is a higher priority than data integrity.

Most MySQL applications don't use FKs. The world keeps turning.

Besides performance issues, FKs cause ordering issues with restoring tables when necessary.

However, they're great for diagramming tools to show table relationships.

I'm more concerned when I don't see unique indexes on things that need to be unique (often caused by early versions of RoR.) In my experience, this is never a theoretical problem - it always becomes a major operational problem.

Source: MySQL DBA
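For anyone who wants to see that failure mode concretely, here's a tiny demo; it uses sqlite3 only so it's self-contained, but a UNIQUE index behaves the same way in MySQL:

```python
import sqlite3

# In-memory demo of why a unique index matters: without it, the
# duplicate below would slip in silently and surface later as an
# operational problem; with it, the database rejects it up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
except sqlite3.IntegrityError as exc:
    print("duplicate rejected:", exc)
```

Deduplicating after the fact, once application code depends on the duplicates, is far more painful than declaring the index on day one.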

Also, depending on how long ago this was, may not have been as bad as it sounds -- MyISAM doesn't support foreign keys, and it was the default all the way until MySQL 5.5 in 2010.

Ok, but no, this was incompetence. The application he managed had around 10k users that used it a couple of times per year.

That last point kind of revokes his programming credentials. Something that takes that long, done that often, gets automated pretty quickly by most self respecting developers.

Yeah totally. I've never understood why he preferred copying and pasting over actually writing code.

job security and laziness. could be very meditative to do some copy-pasting over your morning coffee/tea

Note also that the developer said it would be a good long term solution to put the thing behind a service. Oftentimes the person isn't ignorant of better practices (though they may be out of practice in applying them). They just differ in which shortcuts they think are acceptable. Or perhaps they've been broken by an environment where doing the right thing has always been blocked by someone, until they no longer even try.

Replying to myself, since I can no longer edit.

Bad "senior" developers are worse than clueless junior developers: they cover their landmines better.

..and have juice with one or more managers

also, their ability to shift blame often is the reason they are a senior developer and not a fired developer.

That was my first assumption. These dangerous situations come up when a manager or developer think they have an asymmetric opportunity:

Ship this feature quickly and collect all the credit, or crash the website in the process and blame the website team.

Week patterns have a long history, and humans have to rest at some point, so I respect the "Nothing ships on Friday [and implicitly Saturday or Sunday]" plan, outside of very young (often self-funded) startups where the half a dozen people involved are thinking about that one project 24/7, so there aren't really "weekends" at first anyway.

However, bigger organisations can get really out of hand with this. They start out with small carve-outs: oh, nobody really works from the start of Christmas week until the first full week of the New Year. Soon it's from Black Friday (though good luck scheduling anything the week before that), and before you know it they're declaring that nothing can even be planned for the months of November, December and January each year.

Now, if you really must have this habit (which I don't recommend), what you need to understand, just as much as the "No shipping on Friday" folks do, is what this does to your perception of time. If you have "Don't ship Friday", then the Tuesday meeting's question "Will it be live on Monday?" must be re-phrased as "Is it ready to ship by Thursday?", and not every manager seems to understand that. But when your policy is "Nothing can be scheduled for our 90-day Winter Risk period", then the question in that meeting "Will we be ready for the January 31st legal deadline?" is actually "Are we 100% sure we'll have fixed this by the end of October this year?", and I have yet to meet anyone who recognises that, despite having a policy which makes it so.

We have a "don't deploy stuff that could go sideways between Dec 1st and Dec 25th" policy because that's when we have the highest a) traffic and b) spend. But the converse is that in Jan/Feb we can sorta go nuts, because all our customers blew their budget in the lead-up to Xmas.

It's really a case of knowing your business and customers.

I have a similar situation with similar timing, and have found it's a good time to get people going on long-running projects. They're not going to be deploying anything to production within a month anyway, so it doesn't block them.

I'd like to provide a counter-argument.

Every software release includes risk. There's risk that there are bugs. There's risk that it will introduce regressions. There's risk it will cause a major crash. We can do things that reduce the risk. We can have recovery plans for negative outcomes. We cannot eliminate all risk.

The reason businesses have black-out deploy dates is so that they can control the risk. Deploying code 24 hours before a critical business day, like black friday is for some businesses, is an entirely different risk/reward ratio and should be treated differently. Even 90 day risk periods can be valid for some businesses, for example seasonal businesses during peak season.

> The reason businesses have black-out deploy dates is so that they can control the risk. Deploying code 24 hours before a critical business day, like black friday is for some businesses, is an entirely different risk/reward ratio and should be treated differently. Even 90 day risk periods can be valid for some businesses, for example seasonal businesses during peak season.

That's a completely valid strategy, but as I explained you need to actually embrace the consequences of this strategy in your thinking. If the next 90 days are a "risk period" then when somebody says "Next week" they actually mean "Three months from now" not "Next week" and everybody in the business needs to start thinking that way. Sometimes that's going to cost a bunch of money.

For example, if you have such a 90-day period covering November, December and January, as in my example, then a system which alerts you to certificates on public-facing web sites expiring in less than 30 days is no good: you can't act for 90 whole days, so that system needs to be re-calibrated to more like 100 days.

Once you accept the 90 day risk period as a reality rather than a convenient fiction it gets very expensive. Make sure the people who insisted on this risk period are paying, because that's the only way they'll learn anything from it. If you don't make them pay for the actual consequences of a 90 day risk period (e.g. hiring a lot more staff, buying more expensive products that don't need intervention) then of course they're going to keep increasing this magic "It's not my fault" period and you'll pay the price in your sanity.

In the early 00's I had to work with a company that set up payment portals. My workplace already used them for existing portals. We needed another for a separate service. It took a few days to get the paperwork signed for the banking end of things, and then we were told by the portal people: "sorry, yesterday was Tuesday. We only go live with changes on the first Tuesday of every month."

"There's a time and a place to dig in your heels and say NO to something. It usually comes when someone else has been reckless. If you do it for everything, then either you're the problem, or you're surrounded by recklessness. I think you can figure out what to do then."

The magic is in telling the difference.

There is a rule of thumb to not ship anything on a Friday.

This was a Sunday!

And it didn’t seem to have been tested, so where was the Dev/UAT/Prod separation?

So many red flags, and I think anyone who has been in development for a while has had some product owner try to break procedures to force whatever pet project that is falling behind to hit some target to get their bonus/gain brownie points. It almost never goes well. But by this time the PO has a scapegoat because it’s no longer their fault it’s not working, it’s production’s fault. (Also it’s not always a PO, it can be a developer or anyone up to c-suite level)


“No but you don't get it, my son is really good at those things, he said he would speed up our Windows machines 10x if I gave him the codes and physical access to the servers during weekends!”

That would be a very bad, very cliche joke if it weren't a true war story (2005, good times). Turns out the 15 y.o. had convinced his father that I was "lame" at my job because I didn't tweak regedit. Important lessons were learned by the three of us that day!

Sounds like a "Dude, do you even regedit Bro?"

I actually laughed.

Wasn’t her implicit point that you wouldn’t need to know which it is, because the answer to both (“you’re the problem” vs widespread recklessness) is to leave?

In the former case, the answer is to change your own behavior. The sentence structure is a bit confusing, but I interpreted that last clause as referring only to the latter case (surrounded by recklessness).

From her other posts on her blog, it's clear that she means her managers consider her a problem because when she predicts trouble she is usually right. When you continuously play Cassandra, it gets tiring for everyone involved: Cassandra, her manager, and the audience viewing the spectacle. Time to find a new job.

The flip question is always, how often was Cassandra wrong? We're after all seeing one side here. A perfect Cassandra is potentially useful, a trigger happy Cassandra is just annoying. Even a perfect Cassandra can be annoying if the issues in the end cause minimal monetary fallout.

One of the funniest stories I've ever heard was about how a junior developer asked a more senior developer a creatively terrifying question:

How do I install half of an RPM?

> half of an RPM

Well, I'm stumped... I wonder if he just needed one tool out of suite?

Maybe trying to work around a dependency issue. Not sure what's in RPMs, but in deb the manifest may require libsomething version exactly 1.1, which will fail if something else requires libsomething >= 1.2.

1 revolution per 30 seconds maybe?

That would be 2 RPMs. Still easy to install!

From what I can tell, with cpio :-D

This comment thread originally consisted of one single reply by someone who, apparently, was not aware of Rachel by the bay and her wonderful blog. That post concluded that calling a fellow worker a "rando" was toxic, and that they wouldn't want to work with the author.

While I agree with Rachel here at a high level, and have been a dedicated follower of her blog, I completely agreed with that comment. You shouldn't be shipping things in the manner described in this post, and you shouldn't be considering your coworker "some rando" and looking down on them for not having the same schedule as you.

People are free to project "looking down on" or contempt onto the phrase "some random person" and the (subsequent) use of "rando" but I felt like it communicated precisely what was going on: this is a large organization, and the person is totally unknown to this sysadmin.

Even aside from whether "rando" is contemptuous, the issue isn't that they don't have the same schedule: it's that they are not respecting the company-wide schedule, nor are they respecting fairly obvious norms of professional software development.

I'm a teacher, and I very much believe in educating people rather than putting them down, but jumping on a single phrase/term when there's nothing else to suggest contempt here strikes me as odd. It's especially odd when the entire culture of sysadminship has a reputation of eye-rolling and begrudging wizardry to protect users from themselves.

I think this is a fair point, and I cannot disagree with what you've said. But, I must reiterate, the tone of this piece changes dramatically if "some random person" becomes "a junior developer".

As far as the company wide/normal schedule goes, at my previous place of employment, major changes were routinely performed (by me) off hours on a Sunday with only relevant personnel on hand. This was primarily for B2B reasons where the vast majority of our clients were doing mission critical things from Monday to Friday. I don't feel that this was the case in these circumstances, which is why I completely agree with what was really said here, and with your reply was well.

I suppose my personal experience in this industry leads me to believe that small snipes like that uncover much deeper contempt than is revealed on the surface.

Agreed rando might just be shorthand for “anyone in the organisation that’s not responsible for production”.

However there’s often a substitution for a more acceptable phrase when you relay similar information to management.

It’s perfectly acceptable in this context of a BOFH type story, which doesn’t care about the identity of the luser in question.

I think there's a legitimate development culture gap. I've had good friends tell me that, at their company, it's totally acceptable to roll things out on a day's notice and you're expected to help make that happen unless you have something more urgent to do. I doubt that's a functional culture for any but the smallest companies, but that still means there are some places where it'd be unacceptably rude to shove someone off and not help them release their code.

There’s a specific creative tone to this article that incorporates these expressions. There is narrator persona who has a style of voice. I think you and the other people who are seemingly offended are missing this?

Agreed, those parts of the post seemed full of contempt, and even though I can empathize with that, it's not a good look. The description of the author's responses in the company chatroom reminds me of highly toxic IRC chatrooms where people sitting idle in them forever verbally abuse newcomers or people asking questions, from some self-righteous angle of superiority, instead of engaging in simple discussion.

It's very very off putting and detracts from an otherwise good post. Makes the author sound like someone who simply has nothing but ire for their fellow employees.

> someone who, apparently, was not aware of Rachel by the bay and her wonderful blog.

Is there some other prestige you should know about her apart from her blog?

Having worked with her? I’m sure many current and former colleagues of hers frequent this site.

Sort of tangential to the article, but... Setting aggressive HTTP timeouts can be quite informative in finding "interesting" HTTP endpoints. Set them, enable tracing, and search for timeouts every morning. At my last job, I found an interesting "charge_cards" endpoint and learned that someone logs in every day, clicks "charge cards" and all billing is handled in the frontend web process instead of the event queue. It was idempotent so it wasn't a problem (just click reload until it returns), but was an interesting bit of visibility into a large system. Knowing that that's how it works makes it less surprising when someone wonders "I noticed cards didn't get charged". Is the person who normally does that on vacation? Aha, that's why. This never caused us any problems so we didn't rewrite it, but it's always nice to have weirdness exposed front and center so you are less surprised when it starts causing an issue.
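The technique is easy to reproduce locally. A self-contained sketch, with an invented slow endpoint and an arbitrary 0.5s cutoff, assuming nothing beyond the Python standard library:

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Pretend 'charge_cards'-style endpoint that does all its work
    inline in the web process, so it responds slowly."""
    def do_GET(self):
        time.sleep(2)  # simulate billing work done in the request path
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client gave up already
    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/charge_cards"

# An aggressive client-side timeout makes the slow endpoint show up
# as an error you can grep for, instead of hiding in the long tail.
timed_out = False
try:
    urllib.request.urlopen(url, timeout=0.5)
except (TimeoutError, socket.timeout, urllib.error.URLError):
    timed_out = True
print("timed out:", timed_out)
server.shutdown()
```

In a real system you'd set that timeout at the tracing/transport layer and grep the resulting timeout errors each morning, as described above.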

It was also interesting to read about a wide deployment of circuit breaking. That's a relatively new concept to me (never ran into it at Google, for whatever reason), but it's another way to prevent weird outages. Last week I upgraded something in my k8s cluster and the API server stopped responding. With the API server unresponsive, I couldn't kill the workload (or even determine if that's what was causing it). I had to kill all the nodes before I could get back in. If the job were communicating to the API server through a circuit breaker, it would have popped after requests started taking too long, and I would have been able to just kill the job instead of having downtime. But, I guess it's still a relatively esoteric concept.

I know this article is about not doing dumb things in production. I just assume those are a given. I'm more interested in the nuts and bolts of day-to-day operation ;)

> That's why some clever person had added a client-side "gate" to the transport. If it saw that too many requests were timing out to some service, they'd just start failing all of them without waiting

I believe this is commonly called the "circuit breaker pattern"

Si. If you already have a distributed rate-limiting service, it could be used to create a more intelligent circuit-breaking service as well I would imagine..
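For anyone unfamiliar, the pattern is small enough to sketch in a few lines. This is a hedged illustration rather than any particular library's API; the threshold and cooldown numbers are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal client-side 'gate': after `threshold` consecutive
    failures, fail fast for `cooldown` seconds rather than keep
    waiting on a service that is already in trouble."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable, so tests can fake time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result

# Two failing calls trip the breaker; the third fails instantly
# without ever touching the backend.
breaker = CircuitBreaker(threshold=2, cooldown=60.0)

def flaky():
    raise TimeoutError("backend too slow")

for _ in range(2):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass

try:
    breaker.call(flaky)
except RuntimeError as exc:
    print(exc)  # -> circuit open: failing fast
```

The fail-fast behaviour is exactly what would have helped with the unresponsive k8s API server: requests stop piling up behind a dead dependency, so the caller stays responsive enough to act.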

Was half expecting these people to be rogue actors trying to install some malware into prod.

Glad I was wrong!

It is really better if they are coworkers trying to install malware?

"Sufficiently advanced incompetence is indistinguishable from malware."

I remember an instance where someone updated and deployed the strong_passwords ruby gem since "it had an update". This was more or less right after it was found out that the v0.7 gem was compromised. Cue rushing to the commit and double-checking that it was using the "fixed" 0.8 and not the 0.7. The developer had no idea that 0.7 was compromised, and had just updated the gem because they could, without looking at the patch notes.

Is this yours? Sounds perfect.

Sounds even better (and more malicious) reversed: "Sufficiently advanced malware is indistinguishable from incompetence". All those debug features "mistakenly" shipped in hardware...

Sketchy .so acted like a denial of service.

That’s malware in my book.

It's accidental malware. That's better, if you're an ethicist, and no better at all, if you're the one having to clean up the mess.

It's also better if you're the company deciding whether you have to press espionage charges or not.

The interesting question is why the ODBC driver needed to be updated and why there was such a hurry. Probably because he/she wanted to see if it worked before business opening... And when he/she was not allowed to do that maintenance during off hours, he/she tested it on a less important internal system. The thing with ODBC drivers is that they deal with things like bigin support, which cover edge cases that you only see in production.

ODBC was not used anywhere prior to this point. The company had its own hardened transports.

The frontend systems, taken as a whole, amounted to a bitchin’ botnet. They’d destroy ordinary services without even working up a sweat.

Using the right mechanisms for comms between them was in everyone’s best interest.

Odbc/SQL newb. I know what begin does, but I don't know why you'd only see it in prod. Surely you should test the same codepaths in staging that you plan to execute in prod?

Generally speaking from experience, most non-production facing testing does not accurately reflect what is actually happening in production in terms of actual queries run and volume of it.

A codepath which is perfectly fine in testing/staging could keel over immediately against a production workload.

That's all well and good, but begin is pretty basic? Like, that's how you do transactions in SQL, right? We're not talking about a codepath that's working in testing and falling over in production. The person I responded to was suggesting a codepath that's not even tested in testing, and I was hoping to understand why begin would be such a codepath.

I meant biginT: for example, prod might have over two billion of something, where local dev does not.
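To make the underlying point concrete: MySQL's signed INT is 32-bit, so a column that's fine in dev stops accepting values once prod crosses roughly two billion. The row count below is hypothetical; the ranges are MySQL's documented ones:

```python
# MySQL signed INT is 32-bit; signed BIGINT is 64-bit.
INT_MAX = 2**31 - 1       # 2147483647
BIGINT_MAX = 2**63 - 1

prod_count = 2_200_000_000  # "over two billion of something" (hypothetical)
assert prod_count > INT_MAX      # an INT column would overflow in prod...
assert prod_count < BIGINT_MAX   # ...while BIGINT has ample headroom
print("INT max:", INT_MAX)
```

This is the classic category of bug that no local dataset will ever reproduce, which is the whole point of the comment above.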

I could imagine some contrived situation where begin could make something fall over in production... Something something rollback logs.

"Quantity has a quality all its own."

- Someone (I heard it attributed to Napoleon, but have seen it attributed to others)

In this case, it's very true -- workloads are really difficult to properly test with respect to production load. A code path that works well in one test can easily buckle under the pressure of a production workload. Stress testing is hard.

This is why we have release management processes


I need blogs like this to remind myself that there in fact are actually sane organisations in the world. It's a glimmering ray of hope against my own experiences, which more and more would suggest the writers pushback in that chat room would just get them considered a problem employee.

My ex-boss use to say... "Don't you like your weekends ?" whenever somebody wanted to deploy something on a Friday :D

It’s about process + team and not about individuals. Far too many times I’ve seen companies over praise/reward certain individuals at the expense of demotivating the team around them that makes things possible.

I’ve found a lot of things solved by a decent review and deploy process.

My previous company just went through SOC2 certification. SOC2 is mostly about change management. Who can make changes to prod? Is every change approved by the right personnel? Is every change auditable? Etc.

Mostly boils down to using git(hub) properly and having a sane deploy/monitoring/revert system.

Doesn’t matter if you’re the CEO or a VP or a yolo manager. You can’t deploy things without oversight. If you want to deploy on the weekend you need approval from the opslead for that system who is responsible for keeping things alive.

On the weekend, no one should be working, even opsleads should only work if they are paged. Only broken things that really affect customers are fixed on weekends. No new features. Period.

ha! last year i was at a shop that had a dev manager who snuck a stored procedure into a prod db. the arch called for an instance per customer, with each using the same sprocs. when one customer suddenly had prod fails, it caused all the teams who had pushed code to burn cycles tracing the issue (the effects were system-wide and this was the biggest customer). we diffed the sprocs and saw what was up. the correct code was restored and the problem got fixed. in the end, the dev manager in question didn't have much of a reason for making this change without telling anyone.

...I'm probably missing something obvious, but what is ".so"?

I have read the post, btw (liked it - I usually like Rachel's posts) - but still don't understand what .so is.

In Unix these are shared object libraries.

In this context, I believe .so is the file extension for extensions that can be loaded by web servers during runtime. I think the person in the story was trying to change the Apache2 configuration of all FE servers to load the wanted client library / .so file.

- https://httpd.apache.org/docs/2.4/mod/mod_so.html

I think it's the shared library file that the person was trying to add to the frontend

`find /usr -name '*.so'` that's enough for you ;)

it's a shared object file. Basically a pre-compiled set of functions.

For example libc is a .so file. In windows, think .DLL.
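If you want to poke at one, Python's ctypes loads a .so at runtime through the same dlopen mechanism a web server uses for its modules. A sketch, assuming a Unix-like system (it falls back to the process's own symbols if the C library can't be located by name):

```python
import ctypes
import ctypes.util

# Find and load the C library's shared object at runtime, e.g.
# "libc.so.6" on Linux or "libc.dylib" on macOS.
libc_name = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_name)  # None -> dlopen(NULL): the process itself

# Call a function exported by the .so, as if it were linked in.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t
print(libc.strlen(b"shared object"))  # -> 13
```

The post's incident was the same idea at the server level: drop a new .so where the frontend loads libraries, and its code runs inside every request-handling process.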

Bringing up problems only shortly before launch sounds damn familiar

On another note, I hate posts from Rachel. My opinion is from reading a dozen posts from her blog by now. It's usually about complaining about fellow colleagues and her employers. I would really hate to be her colleague. I've never seen a post where she admitted she fucked up. It's always about how everyone else is awful and how she is the victim in all this. I really hate this toxic attitude and I would really hate to have a toxic colleague who, instead of grabbing her teammates for coffee and giving them constructive feedback (and keeping company events confidential), shames her colleagues publicly.

I disagree 100%. I have read quite a few posts on her blog as well. Most of them are purely technical; and the failures described there are mostly "Hey cool, this failed" and could be attributed to any person including herself.

In the ones where she specifically blames other people or companies, I mostly see a competent, reasonable person who is fighting an uphill battle against unreasonable management, which is a trope among software developers for a reason. Like this one for example -- do you really think adding (effectively) backdoors for your special service on a Sunday evening constitutes sustainable software development? I'd rather have that "toxic attitude" than some meaningless sunshine-out-the-bottom screed about "team players who get things done".

> "On another note, I hate posts from Rachel."

I don't but then again I spent years on call for a couple of critical services. SREs / sysadmins are the people who have to work overtime on evenings or weekends to restore services when an incident causes downtime, whether accidental or caused by foul-ups by the development team. After a few years of that, having a rather low opinion of colleagues whose routine bad practices or behavior risks downtime for critical services is more than forgivable in my book.

Having been in an SRE-like role my entire engineering career I can empathize with the mindset. I've built pub cloud and provided tier(N) support for it and managed hosting having to fix screw-ups from OTHER companies engineering teams, as well as working in-house roles with SRE responsibilities.

It's an easy mindset to get into and a blog is certainly a valid place to blow off some steam IMHO.. But when you have the knowledge AND access to fix all of your own mistakes, it's easy to forget that everyone makes mistakes and to empathize with the every-engineer. It can be a toxic mindset to let seep into the workplace; better saved for the pub :)

These are anonymous technical critiques of situations that are typical in the life of a sysadmin/RSE/security engineer.

Even if your colleagues are super-smart they will still fuck up some of the time. In a large enough project someone's fucking up all the time and these people have to fix it.

Of course they're grumpy and tend to complain. That's how they stay sane and don't go postal at work :)

So next time you see an SRE or the like grab them for a coffee and tell them how much you appreciate them. It could save your life in the future.

Imho her blog is a valuable resource. Can be used to point out "here's what could happen if you do X" in specific cases and in general as "here are a lot of examples for why things should be done in a non-messy way".

While I don't enjoy the blog either (partly for the reason you cite, and partly for the unspoken assumption behind many posts that FAANG-like companies are a good thing and a good working environment), I wouldn't phrase it like that. There is enough misogyny on the internet that starting a critique with "I hate" is not going to get your point across lucidly.

>partly for the unspoken assumption behind many posts that FAANG-like companies are a good thing and a good working environment

Sorry for sidetracking, but are they not good working environments (note this is a separate question from whether they're a good thing)?

I've been planning to try to get a job at one of the FAANG companies, and have basically assumed it would be a good working environment based on the reputation, the perks offered, and my prior experience at another very large tech (but not exactly FAANG) company, which was that the working environment was generally very good there though I didn't really like how that particular company operated in the marketplace.

> "Sorry for sidetracking, but are they not good working environments (note this is a separate question from whether they're a good thing)?"

Depends heavily on which team you join within a company. FAANGs are not monoliths; they are collections of business units cooperating (or sometimes not) with quite a bit of variation in culture and work-life balance.

They have to offer perks because they will work you very hard. You will learn a lot, but you will almost certainly face a lot of stress too. For some people this isn't worth it.

Sorry if it came across as this way, this has nothing to do with her gender. I would have said the exact same thing even if it was a male author and I mean it.

Good news rarely makes for interesting reading; cf the phrase "nothing to write home about": "the task was planned well in advance, hit some snags during integration, but got caught by automated tests and eventually was sorted out between frontend and backend teams, shipping roughly on time...snore"

Aaand...you're THE ONE. Congrats, I guess?


Why are the comments here so hostile? Seems awfully critical, even for hn.

I (cynically) believe it's due to very low attention span.

At least, I'm guilty of that, I skim articles and read comments. But this post is not skim-able(?) and thus feels a lot more dense than usual.

However, I went back and read the whole thing and it's certainly worth the read.

Pretty sure it’s the domain name.

Also the time of day/day of week.

But mostly the domain name.

For what it's worth, I personally thoroughly enjoy reading your articles. I wouldn't consider myself adept at operations by any means, but your articles are extremely approachable and I always enjoy reading them.

I always click on the article when I see it's from you.

That's really a shame. Of all the colleagues I've had over the years, you still stand out as a beacon of competence, and I learned a ton from both working with you and reading your blog in the years since. Your haters have no idea what they're missing out on :-(

I hope you're joking about the domain name part. When I see your domain name I know to expect some really great tech content, and I'm pretty sure tons of other people are the same way.

HN is always very critical, you'll see it on every article.

I appreciate your dedication to throwing pearls before swine.

Those intelligent and unbiased enough to learn something useful will do so. I agree with and/or have learned from almost everything you've posted. Some of those things mirror my own experiences.

I also didn't see any problem with tone, you were polite and helpful. It seems quite obvious the only problem with your "tone" is your gender. Not that you didn't know this already.

Thanks for your contributions. At least some of us appreciate them and find them valuable.

Just in case the negativity gets tiring, I want to say that you're on my whitelist of domains I always read on HN (which is much smaller than the blacklist of domains I never read).

This being HN, A/B testing is necessary. I suggest 'rossbythebay'

Adding to the other comments, you're one of my favourite SRE war story (and cultural war story) blogs, so thank you for your writings.

Why? Are you not by The Bay anymore?

I also noticed that I had a hard time skimming through this post. I wonder why. Is it the vagueness used in reference to parties involved?

Contrary to the sibling poster I believe (also cynically) that it's because of the sex of the author more than any other facet.

HN (and more broadly the tech industry at large) likes to believe that it is an egalitarian meritocracy but it is unfortunately not true.

I can't believe that to be true; Rachel's posts consistently reach the front page, and additionally you're implicitly calling every person who doesn't like or can't parse the content misogynistic.

Besides, aside from the name it's not like she puts her gender front and center; the reader has to notice it as part of the domain. Her name isn't even in the article.

The vocal minority gets representation in a comments section.

Misogyny is pretty common in tech. It's not an absurd conclusion to jump to.

>you're implicitly calling every person who doesn't like/can't parse the content misogynistic.

What I'm suggesting is that opinionated women tend to draw more ire and negative reactions than opinionated men, for reasons too complex to address in an HN comment.

I'm willing to bet that if the same post had been made on a medium.com blog that didn't include a gendered name ("Rachel") then the comments would be less caustic.

I'm not going to argue the point you're making but instead I'll mention that the opinion being discussed here is not _really_ considered an "opinion" (nor are you opinionated for having it) in sysadmin-land.

(Regarding everything the author does/says/thinks in the article)

Everything she does in the article is exactly what any sane sysadmin would do, it's institutional awareness.

>Everything she does in the article is exactly what any sane sysadmin would do, it's institutional awareness.

My point exactly.

>HN (and more broadly the tech industry at large) likes to believe that it is an egalitarian meritocracy but it is unfortunately not true.

Given the up-votes of the thread and the down-votes of the negative comments (even ones only tangentially negative) that seems an odd conclusion to draw about the whole HN community.


Just reading that, the tone of this person made me cringe. I definitely would never want to work with him/her. I can feel the negativity and attitude in almost every sentence.

The right thing to do would be to help this person achieve their goals while maintaining reliability of the operation. Calling a co-worker who is just trying to get something done a rando is the epitome of toxicity. She wouldn't last 5 minutes with that attitude in my organization.

Please don't cross into personal attack here. Besides being mean and against the rules, it makes for boring reading.


> it makes for boring reading

I didn't find this reader's interpretation of the author's behavior to be boring. I can see how she may come across as overly combative (though I agree with her goal).

Your frequent comments about what you find to be 'boring' are boring.

I mean the word 'boring' in the HN sense of not intellectually interesting. Something can still be exciting even if it's boring in this way. It's boring because there are only so many ways to denounce something as "the epitome of toxicity", "this person made me cringe", and so on, which means the patterns of denunciation get recycled a lot. That makes them predictable, and this site exists for things that aren't predictable.

Another way of putting it is that indignation tends to melt down to a few basic formulations that get repeated a lot, where intellectual curiosity looks for what is new and different in a situation, and then has a new and different reaction to that.

Since we want HN to be a site for intellectual curiosity, it's important not to feed indignation too much. It quickly overwhelms curiosity if allowed to (especially on the internet); therefore we need to moderate it. You're right that the moderation comments are boring; that's because they're also predictable. If it helps at all, they're even more tedious to write than they are to read.

For what it's worth, I think it's ok for those comments to be boring.

The root comment has been hidden, and the mod (who probably hid it) has replied to it stating why it was hidden.

That we can see moderation is a good thing; that moderation is 'boring' is to be expected.

The kind of people who expect to push untested code to a production farm without the slightest intuition that it may impact reliability and operations are unfortunately a fixture of certain organizations, and after having to argue with them for the 100th time, people do get a bit jaded.

Unfair to call the coworker a rando, but these people did bring systems down at literally the first occasion they managed to ship their code, despite having received advice.

I have a similar situation maybe once or twice a year (used to be worse when working in other places) and do my best to defuse the situation while maintaining politeness and etiquette, but this is the tone I'd then use to describe the story to some friends.

Yeah. The author has a good point about _the specific way the co-worker went about things_ being wrong. But there was also a missed opportunity to explain things. I'm assuming we, the blog audience, got a better explanation for "no, just no" than the co-worker did.

See another post by this author for a similar tone: https://news.ycombinator.com/item?id=22277582. Maybe work right now is tough. It sounds that way.

All that said, a solid takeaway is that sometimes people are naive and you need to say no. The next step is: alongside saying no, ask what their aim is. Otherwise you're saying no to the solution without knowing the problem.

Well, yes and no. In large organizations, with complex things that can cause a lot of problems quickly, you do often need to have a few people like this. They are, hopefully, not anyone's boss, but they may be, for example, the DBA in an organization where everyone is hitting the same database and if it gets hung up everything in the company grinds to a halt. The "grizzled veteran", aka grizzly bear.

The usual result is that if someone finds out they have to do something that involves asking the grizzly bear a question, they make sure to do lots of due diligence first, and maybe ask other people who might know the answer, and only bother the grizzly bear if they absolutely have to.

You definitely don't want to have an organization with lots of grizzly bears, but it is also the case that a particularly high-stakes system sometimes does need a grizzly bear paying attention to it, to keep that system from giving everyone a great "learning opportunity" on a regular basis. If that's a less high-stakes system, then that learning opportunity is fine, but if it's something that nearly everyone relies on in a large organization, you cannot afford that many outages.

Not saying you're wrong in your evaluation, just saying that the same personality type is occasionally useful, or even necessary, in certain situations.

Rachel (who is here also as rachelbythebay since 2011) is a very old-school sysadmin in many ways, but she’s earnt her stripes [1]. If she says something is stupid, chances are that it is.

[1] https://medium.com/wogrammer/rachel-kroll-7944eeb8c692

The parent wasn't claiming it wasn't stupid or wrong, they were talking about the tone. It's not quite Linus rejecting a patch with some choice insults, but it's certainly in that direction.

She has seen and written about enough to be entirely justified in her tone.

And she's always got an overall point about how to run things better / improve process / get newbies up to speed and past the 'every commit breaks things'.

I'd take the time to go read the rest of her posts I haven't seen, and definitely learn a bit about surviving sysadmin work in larger organizations in the process.

Linus has also seen and written about a lot. Would you say that he was justified in his tone prior to his apology for his tone?

Granted that his tone and her tone were very different, but, imo, as a reader of her posts, this wasn't necessary. Educating junior devs is more productive than considering them randos.

> Would you say that he was justified in his tone prior to his apology for his tone?

I'll take my downvotes here, but personally I think Linus with that apology crossed the threshold into old age. Quoting Tomasi di Lampedusa:

> The Prince, who had found the town unchanged, was himself found much changed, he who never before would have used such accommodating words; and from then on, invisibly, the decline of his prestige began.

By grumpy sysadmin standards it seems very, very mild, tbh. And certainly nothing like Linus on a bad day.

I do believe that it's still what would today be called exclusive, unwelcoming, in short: "toxic".

I don't mind that way of communicating, but I'm also absolutely fine with Linus' style. The comments in defense of this had a bit of a cultish feel ("she has earned the right to talk that way about people!!111"), but I gather that's more because people know her and may consider her a friend, so their perception is changed.

Her behavior isn’t toxic in the least.

Toxic is playing dumb and bypassing controls on a Sunday to get your boss off your back, while dumping a bucket of shit over everyone who is obliged to support the company.

> Toxic is playing dumb and bypassing controls on a Sunday

It's actually not, that's just going along to get along. Ship enough features, get a promotion. The higher-ups that enable the circus are the toxic ones.

It sounds like the person was told to stop and directed to the people to discuss the correct solution with.

And then it sounds like the person did it anyway, at a later date, and broke shit.

If that person repeated that behavior it is someone I wouldn't want on the team, or at least having access to certain systems.

Trying to add a brand new, non-load-tested, non-hardened binary to a site with actual reliability requirements 12-24h before your deploy is not behavior that demands to be taken seriously. Someone serious would have talked to the reliability team months, if not quarters, ahead of time and learned eg transport and cutout requirements. There are also almost certainly other requirements, eg run books.

Let alone on a Sunday afternoon. The whole thing smells like someone knew there was a process to do X, figured their X wouldn't pass, and tried to bypass the process.

Don't know about the tone. What worries me is the company. There's no procedure for updates that have far-reaching ramifications? Yeah, can see why that would wear you down.

There usually are procedures. Doesn’t stop people trying to work around them anywhere I’ve heard of.

And I’ve yet to find any set of procedures that cover all possible cases. So you still need people to apply good judgement when the procedures don’t cover situations, or if the procedures currently would allow something that shouldn’t be allowed. Like: “no, even though we have a way to safely roll new libraries out, we won’t allow you to add this library to all front-ends, you should follow best practice and use ...”

Well, you’re right. But there’s a bigger issue at play here: why does a lone junior dev have access to deploy randomly to the business-critical services? Is this what devops has become, carte-blanche access across the board? Somehow I thought the norm was better than this.

This was my main takeaway from this post as well.

> The right thing to do would be to help this person achieve their goals while maintaining reliability of the operation.

I think in most places, if your goal is to make major production changes on a weekend, the right answer is ‘no’. Seems like she handled it fairly diplomatically, all things considered. And as it turns out, later on the rando in question broke the internal site. So she was probably right to tell them to step away from the keyboard slowly when they were trying to do it on production.

The author apparently works for Facebook, which makes all of the vague language in the story make a lot more sense.

Would Facebook really let somebody do this to every external frontend server on a Sunday without some boring authorization first? The authorization could simply be denied with a boilerplate response about the right way to do things, at less emotional cost than the author apparently paid.

The author is a veteran in the industry and stories crop up from various aspects of her entire career. I wouldn't attribute any of them to a particular employer unless she indicates it directly or cough indirectly.

It was Sunday. The person wanted to deploy something to meet a deadline the next day. But from what I understand there was still some months of work to do on this component before it was production ready.

I feel it’s totally justified to just make sure that this Sunday they don’t break production. Helping the team get it right can wait until Monday.

Rachel doesn’t say what exactly was being communicated. But maybe her response on the weekend made that clear. In fact that’s what I assumed when I read the article.

Their actual communications with the colleague didn't strike me as toxic at all.

It's a blog post, I took their use of "rando" as an artistic choice in emphasizing this person had utterly neglected to coordinate with any of the stakeholders beforehand, with the intent of going into production the next day.

From the perspective of the stakeholders, the person was effectively a stranger, and that's the crux of the matter being described.


Why? She writes good content.


She's an experienced SRE/sysadmin who writes a really interesting blog, I highly recommend you check the archives. :)


Isn't seagull management where you fly in, shit over everything, then leave? You kind of need to be VP level to pull that off.

I suppose my view of it is, I see a sysadmin admittedly monitoring a situation unfolding that they have no stake in, who then dumps on the situation.

I've never really cared for this style of tech management. What I feel really should have gone down is a call to discuss what is going down (by all means, slay the dragons). Outright stating that this "isn't happening on my watch" or whatever is really counter productive because the employees trying to execute this change surely didn't volunteer on a weekend to be doing it. Process has already been breached and they are in a sticky situation. Now they have a hostile sysadmin pushing on them rather than whoever conceived this half-cocked plan.

Not seagull, because the effect is dump-avoidance. Outside of electrical-fire-in-hair moments, going slow is a virtue. There was no apparent reason to do this specific thing at that time, nor even (evidently) within an identifiable timeframe in the future.


You've got an engineer looking to roll out a new piece of software across the entire frontend fleet of servers late Sunday afternoon. Not only that, it's to add a new transport (network protocol) to a new database, neither of which has been through any sort of analysis to determine whether it was ready or how it behaved when things went bad.

All to support a production launch for the next day.

That should make you squirm. It's the stuff that the company might get away with once, but once you get enough size, it will never work again. The Risk/Reward equation just doesn't match. When these things go bad they go _really_ bad _really_ fast.

Think: the machine is down, you can't connect to it, it stays up for only a minimal amount of time after you reboot it, and then you find out you don't know how to roll back while the person who does is at a football game enjoying many beers.

ODBC is a client API that lets devs write against a fixed interface. An ODBC driver then connects to the database, and each database has its own network transport.

Yes. This post relies on the reader knowing some background to the services/business RbtB supports. She is in the top comments on HN relatively often so it probably made more sense to others more familiar.

I don't know anything about the author, but it read just fine for me. Someone else criticized the style of the writing, but I found that it was like telling a story in the way you would naturally do it face to face.

I would invite anyone who is confused about a part of the story to ask for clarification on jargon or situations.

FWIW I don't know anything about the author, other than having scanned a couple of articles from time to time (I don't recall anything I couldn't glean from reading this article), and it read fine (and was interesting to boot).

Not a customer. A junior dev trying to ship untested code to production on a day off. Then later that low quality code borked an internal system.

We can certainly hope they were junior but I don't believe the post ever says so. Also I believe there were two separate people trying to ship the thing.

FTA: "person who had never done anything to the frontend servers".

I read that as junior, at least for the skillset necessary for the task at hand.

I suppose I've seen a number of senior developers (measured in years of experience, not skills) do wildly inappropriate things both in systems they should be familiar with and in systems where they should have an awareness of their familiarity-deficit.

You can be senior in one technology and a junior (or less) in others.


The article goes into quite a bit of technical depth. I don’t mean this in a negative way, but you may be lacking experience in the areas of work that would contextualize what she’s saying.

I would say that I am adequately experienced, and I also share GP's opinion. I don't think the lack of experience is the issue here.

The article describes in detail the actions the rollout team was trying to perform, the new communication protocol they were introducing, and the timing characteristics of their backend which caused the new protocol to be problematic. I'm not sure how you could get more deeply technical. I guess in principle you could say something like "at 9:52 our monitoring showed error rates above 1%", but who remembers things that specifically?

I think most people would find a single paragraph less enjoyable. Most people enjoy stories, and find them more memorable and impactful than data.

Most isn't the word I would choose when I consider the audience of this article.

Given that none of her posts are one-paragraph technical summaries, I'm not sure why you're imagining her audience is clamoring for one-paragraph technical summaries.

I've been assuming that her blog posts are very detailed, but then have any identifying information carefully removed. Hence talking about specific (companies|teams|coworkers|services) but only by vague labels.

Calling your co-worker a rando? Very unprofessional.

Why not teach people the proper way to release things to production and avoid the name calling and negativity?

I agree that it might be offensive to the actual person being singled out in this article, if they're reading it. It also occurred to me that this person may be fictitious or semi-fictitious, in which case it's not just inoffensive, but accurate, to call them a rando.

For once, she's not being unprofessional. Give her credit for that much.
