Hacker News
The anatomy of a 2AM mental breakdown (zarar.dev)
435 points by recroad 31 days ago | 266 comments



Working as an SRE for a year in a large global company broke me out of the "panic" mode described in this post. To a business, every problem seems like a world-ending event. It's very easy to give in to panic in those situations. However, in reality, it's rarely that bad, and even if it is, you'll probably survive without harm. The key in these situations, and what I try to do (I totally relate to breaking out in a sweat; that still happens to me, and it happened just yesterday), is to take 5-10 minutes before taking any action, sketch the problem out, and think about it as clearly as you can. Fear interferes with your ability to reason rationally. Mashing buttons in a panic can make your problems spiral even worse (I've seen that happen). Disrupting that fear circuit any way you can is important. Splashing my face and hands with extremely cold water is my trick.

Then after you go through a few of these, you'll realize it really isn't too bad and you've dealt with bad situations before and you'll gain the confidence to know you can deal with it, even when there's no one you can reach out to for help.


Bear in mind that when something "breaks" a business can freak out, but they're not freaking out about lots of other arguably more important problems:

- The software they purchased doesn't do anything because it was never staffed or configured correctly.

- Employees as a whole lose thousands of hours a year to terrible UX or meaningless requirements.

- Some capability doesn't actually do anything, but no one cares enough to notice.

- Useless meetings which waste collective employee hours daily.

- Capability serves no purpose, but just exists to meet audit requirements.

- Executives enrich themselves and waste company money constantly.

- Etc.

Downtime doesn't seem any worse than these problems, but it gets far more attention and panic. It feels a bit like terrorism vs. heart disease. It's important not to sweat it, because a company does not care about your sleep or mental health. It will push you as far as it can. I'm not calling companies malevolent, but they're a bit like a bully in that regard: they push you as much as you yield.


The zero-downtime culture is pretty out of control. Your flight can be delayed for hours, your roads can be closed for weeks, your internet can be out for hours because someone cut a line, that's all ok, but someone can't get into your website for 3 minutes to check the status of their order and we all must lose our minds and publicly apologize and explain in detail what happened and why it will never happen again.


The obvious difference is no matter how long the county or city takes to reopen that road - you'll go right back to using it because you don't have an alternate choice.

For a website - particularly ecommerce websites - you have many choices. Rarely is a product only sold on a particular website. Being down when someone is trying to place an order can and does result in losing the sale, and potentially the customer forever. Customer loyalty is often fickle - why would they wait around for you to fix your stuff when they can simply place their order on Amazon or another retail website?


That really depends on what type of service we're talking about. Yeah, if you're doing really high volume sales on products that can be purchased elsewhere for close to the same price and shipping options, then uptime is critical. That's not the situation most of us are in.


> That's not the situation most of us are in.

From an ecommerce website's perspective - it is exactly the situation most of us are in.

Unless you are the manufacturer for your own goods, then you are competing on often-thin margins within a crowded industry with lots of competition where everyone more-or-less offers the same price.

In an age where more and more people don't even consider shopping anywhere but Amazon, being down exactly when that customer decided to grace your website is akin to giving yourself a nice, stinging papercut. Do this often enough and long enough, those papercuts will start to leave scars.

So ya, being down is unacceptable for most ecommerce websites. The website literally generates your revenue, and without it you receive no revenue.


Lots of things can be unacceptable, but not unacceptable enough to warrant being treated as a life-or-death emergency, where klaxons ring out and people get paged at 2AM. So your e-commerce website is down for a few hours. Fine. What's the worst thing that happens? You miss out on a small number of purchases, maybe a few customers are gone for good, and the shareholders reap a little less profit this quarter. Nobody died.

I think this is what OOP meant about "zero-downtime culture" being out of control. Unless you're a hospital keeping people on life support, I don't want to hear about zero downtime.


> What's the worst thing that happens? You miss out on a small number of purchases, maybe a few customers are gone for good, and the shareholders reap a little less profit this quarter.

You build a culture of nobody caring about the singular thing that brings revenue into your organization. You're fine with shareholders taking a hit, because it's not your bank account - but what happens when a company has to let people go because they don't have enough customers and orders to support everyone? Oh, now they're supervillain out-of-touch CEOs, I presume?

> Unless you're a hospital keeping people on life support, I don't want to hear about zero downtime.

Then look for other work? If you are being paid to keep a website online 24/7, and it goes down "for a few hours" and your attitude is "nobody died", then you're simply not cut out for that job - find another or be terminated.

There's not a single person on pager duty (or whatever) that doesn't know what they are getting paid for. Complaining about a job you volunteered (at-will employment!) for is insane.


Way ahead of ya, I don't have such a job and obviously wouldn't accept one since I have such a fundamental objection to that kind of "omg four alarm fire" outlook on work. SREs with pager duty have a more compatible mindset for that and frankly are just built different!


>Unless you are the manufacturer for your own goods, then you are competing on often-thin margins within a crowded industry with lots of competition where everyone more-or-less offers the same price.

That's not true. You don't have to manufacture your own goods to not compete with others at a commodity level. It's quite literally why branding is so important.

But I do find it funny you wound up doing exactly what the above commenter stated, and are acting like this is a world ending event for a small-medium sized ecommerce site to be down for a while.


> It's quite literally why branding is so important.

Most of the goods you purchase are from brands not owned by the retail website. Pick a hobby and consider the prominent brands, then consider the prominent websites people in that hobby often patronize - they are very often not the same.

> and are acting like this is a world ending event for a small-medium sized ecommerce site to be down for a while

Because it kind of is. The smaller you are, the more significant of an impact an outage has. Nobody is going to think "this website I've never heard of totally isn't working right now, so I'll just wait around until they fix it". No, instead, they will go on to their next google search result - or Amazon. This hurts even worse if they clicked one of your ads while your website is down... costing you actual money and not just the sale opportunity. A customer that might have had an expected lifetime of 1-3+ orders now becomes 0.

Websites don't have "business hours", and routine maintenance is not acceptable for customers that shop when it's convenient for themselves. Remember, ecommerce exists as a better way to do over-the-phone and catalog/mail ordering. One of those benefits is you can order whenever you want - so it is in fact important for the website to be online at close to 100% as possible.


> Websites don't have "business hours"

depends what you're selling, and there's always the chance that someone is up at 4am their time, but if the page was down from 11pm to 6am every day for a mainstream physical product, they'd probably not lose too many sales.

unfortunately, losing a sale puts a business into a tizzy, so they tend to react poorly to that thought. websites are in that spot because it's easy enough to leave the computer running unattended overnight that business hours aren't a thing. but watching the traffic flows while on the Internet traffic team at Google, on top of some emotional development, taught me that for a small bespoke business, SLAs don't even have to have any nines for that business to remain in business.

we're biased too, by the very nature of us commenting on an Internet website, but there are still customers out there that don't use the Internet on a daily basis.
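
To put a rough number on the 11pm to 6am example: a site that is dark for 7 hours every night is up about 71% of the time, which is below even "one nine" (90%). A minimal sketch of that arithmetic in Python; the function names are made up for illustration, and the only input taken from the comment is the 7-hour nightly window:

```python
def availability(down_hours_per_day: float) -> float:
    """Fraction of a 24-hour day during which the site is reachable."""
    return (24 - down_hours_per_day) / 24


def has_one_nine(avail: float) -> bool:
    """True if availability reaches the loosest "nines" tier, 90%."""
    return avail >= 0.9


nightly = availability(7)      # closed 11pm to 6am, as in the example above
print(f"{nightly:.1%}")        # prints 70.8%
print(has_one_nine(nightly))   # prints False
```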


I'm about to order some electric bike motors. Probably from ebikes.ca. If their website goes down I'll just wait until it comes back up.

If it was groceries, I'd go to the store.

Unless in-person-purchasing is an option, I'm just going to wait for my preferred vendor. My account's already set up and clearly they have what I want.


> I'm just going to wait for my preferred vendor

By the numbers, most customers are shopping on any particular given ecommerce website for the first time. It's expensive and difficult to gain new customers - and even more difficult to gain repeat customers.

What would you have done if you had not purchased from ebikes.ca before and just happened to stumble across them after a google search? What would you have done if after clicking the ad/link, the page didn't load? Would you try again tomorrow? Probably not... that is the issue. Instead you would be naming some other small business that was online when you needed them to be.


I haven't purchased from them yet.

> What would you have done if after clicking the ad/link, the page didn't load?

archive.org first just to check if it's down temporarily. But if you're a vendor just reselling Alibaba, then there's not a lot of reason to shop from you. But with real value add, you can get customers. That's exclusive products, support/warranty, niche product, etc beyond just "fulfillment".


> if you're a vendor just reselling Alibaba

There's an entire ocean between Alibaba flippers and strongly branded products sold first-party by mega-corps.

Consider the physical space and walk into a Target or Walmart - how many products on the shelf are Alibaba flips, and how many are sold by the manufacturer? Target has house brands, but Dawn Dish Soap is not one of them. You can get Dawn Dish Soap from most stores - if one day Target was randomly closed are you going to sit out front and wait for them to open, or are you going to another store?

Back in the digital space - let's pick golf... how many websites do you think sell Callaway golf clubs? How many of those websites are the actual manufacturer/brand? One? A google search for "buy golf clubs online" yields probably close to 50 websites just on the first page alone... of which 80% appear to be small businesses.

You start clicking through the search results looking for deals - and some of the pages don't load.

You're telling me you're going to go to archive.org and lookup the history of these random websites, then wait for them to come back online just to see if you want to buy something?

I don't buy this argument at all...

> That's exclusive products, support/warranty, niche product, etc beyond just "fulfillment"

People really don't consider or understand supply chain logistics, distribution and its relation with retail... people really like to casually wave their hands and assert resellers/distributors have no space in the world, yet they buy products from resellers/distributors every single day without a moment's thought (Target & Walmart!). The "value add" is making desirable products accessible to regular people.

Regular people aren't going to pony up a $75,000 Purchase Order, fill a shipping container, and wait 6-8 months transit just so they can have a new cell phone case or new bookshelf. That's where distribution networks come in handy. We buy the $75,000 container so that you can buy just one item.

I guess we can't really blame people for not understanding this - people rarely understand how their beef and carrots got to the grocery store either.


If Target's website is down for an hour, I'll just wait.

> sit out front and wait for them to open, or are you going to another store?

Many times I've wanted to get something from my preferred grocery store, but they're closed so I... just went a different time.

Most people aren't going to shop a bunch of random websites to buy "Callaway golf clubs". They're going to buy it from Dick's or Amazon. And if those websites are down the moment they check, they'll just wait an hour because it isn't urgent or figure it's a problem on their end.

Fuck, b&h photo takes their website down a whole day every week!

> I guess we can't really blame people for not understanding this - people rarely understand how their beef and carrots got to the grocery store either.

Thanks for the lesson, I learned how smart you like sounding.


> If Target's website is down for an hour, I'll just wait.

>> sit out front and wait for them to open, or are you going to another store?

You were asked if the physical store was closed - because that's the analog you are deliberately overlooking.

> Thanks for the lesson, I learned how smart you like sounding.

Some people are unwilling to learn. Even worse, some people insist they understand an entire industry, having spent exactly zero time in that industry. The hubris on display is amazing.


You are of course right, but in practice I've seen many more incidents created by doing changes to make things more robust than from simpler incremental product changes that usually are feature flagged and so on. At deeper levels usually there's less ability to do such containment (or is too expensive or takes too long or people are lazy) and so many times I wonder if it's better to do the trade-off or just keep things simple and eat only the "simple" sources of downtime to fix.

For example, the classic thing is to always have a minimum of 3 or 5 nodes for every stateful system. But in some companies, 1 hour of planned downtime on Monday mornings from 7AM to 8AM for operations and upgrades (which you only use when you need it), plus eating the times when the machine actually dies, would be less downtime than all the times you'd go down because of problems related to the very thing that should make you more robust. An incident here because replication lag was too high, an incident there because the quorum-keeping system ran out of space, etc., and you're probably already behind. And then we have Kubernetes. At some point it does make sense, when you have enough people and complexity to deal with this properly, but usually we do it too early.
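
As a back-of-the-envelope check on that trade-off: a one-hour weekly window, even if it were used every single week, caps planned downtime at 52 hours a year, or roughly 99.4% availability. A rough sketch, assuming a 365-day year; the helper name is illustrative and no real incident figures are implied:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def yearly_availability(downtime_hours: float) -> float:
    """Availability over one year given total hours of downtime."""
    return 1 - downtime_hours / HOURS_PER_YEAR


# Worst case for the simple setup: the Monday 7AM-8AM window is used every week.
planned_hours = 52 * 1
print(f"{yearly_availability(planned_hours):.2%}")  # prints 99.41%
```

The clustered setup only comes out ahead if its own incidents (replication lag, a full quorum disk, and so on) add up to less than that, after accounting for the hours the single machine would spend dead.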


Sure. If you have a couple of short outages in a year and say 2% of your customers view the website during that time, half quit and never come back over it, that's it. You've lost 1% of your customers.

You seem to be stating that any loss at all no matter how small is a complete disaster without really qualifying it. Yet one could easily make the point - well, what about the customers you don't serve at all because they want to buy a rucksack and you don't sell them? That number could be far higher than the ones affected by outages.

I think this probably just depends on how big the numbers are to be honest. Yes, if 1% of people are millions of dollars, then sure, we care. Usually it's not anything close to being like that.
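
The arithmetic in the first paragraph above, spelled out as a small sketch. The 2% and one-half figures are the ones from the comment; the function name is hypothetical:

```python
def lost_customer_fraction(exposed: float, churn_rate: float) -> float:
    """Share of all customers lost: those who hit the outage times those who never return."""
    return exposed * churn_rate


loss = lost_customer_fraction(exposed=0.02, churn_rate=0.5)
print(f"{loss:.0%}")  # prints 1%
```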


With the right analysis I can show how your outage increased your business value.


Are you assuming airline employees aren't scrambling when a flight is delayed? Or that road crews don't get called in the middle of the night? A minor power outage gets a crew called no matter what time it is; a major outage will bring in crews from different states, much less power companies.


But none of them are working in panic mode, because doing this work is just an expected part of the job. As OP has pointed out, downtime in other industries is expected and agreed upon as OK. For some reason, because it's on a computer, that does not apply to our industry: we are expected to go into panic mode to fix it and write about how it will never happen again, while stock prices fall and people lose their jobs.


In the case of flights, isn't it rare that it is a single employee or just a couple of employees that are tasked or held responsible for the flight taking off on time?

Funny story though, veering a little OT, but recently I had to haul ass from one end of the airport to the other due to arriving on a delayed flight with a connecting flight with a different airline. As I was about to enter the airplane, after getting my boarding pass scanned, completely drenched in sweat, the flight attendant saw me and said the captain wanted to see me and ushered me into the cockpit. I found myself in the cockpit, where the captain proceeded to talk to me in the local foreign language while I was thinking, "Wtf, isn't it illegal for me to be in here after 9-11?" I then started repeating, in the local language, that I didn't understand them. But that didn't have the desired effect initially and they kept talking. I repeated myself both in their language and English until the captain finally said, in English, "Are you a passenger?" When I said yes, he told me I could sit down.

When I sat down, the only rational explanation that I could come up with was that despite being in civilian clothes I was mistaken for a pilot. Not just any pilot, but one of the pilots that was supposed to be in the cockpit. I could only conjecture that he was a no-show and that they had to bring in a substitute to replace him which caused the plane to be delayed and allowed me to make the flight.

Any commercial pilots are free to correct my theory, which normally should be a very silly one. If it could possibly be true though, being the no-show pilot in that situation might be a little stressful.


You know, I think if there was a little more tendency to explain, and trust in those explanations, people would like airlines, local transportation departments, etc, a whole lot more.

Without information, it is easy to assume incompetence, or worse, just not caring.


Great point. I honestly can't remember the last time that I've had a smooth experience with any kind of business or commercial thing, even when I'm the one paying for services. Bank declines transactions randomly when balance is sufficient, on vendors that have been used before. Orders of equipment over $10k are days late, no explanation provided, no number to speak to a human, just ineffective support robots. Planning days or weeks around other people's problems is pretty normal, but really, that just means that I get to postpone going to the DMV and getting the run-around from them.

Not sure if it's quiet-quitting or what, but the last few years (basically since covid) my consumer experiences in the US are feeling a lot more.. shall we say, European? And I guess that's fine, pretty much everything really can wait. But it's hard to maintain one's own sense of professional urgency to keep things smooth for others when literally nothing is smooth for you.


> Orders of equipment over $10k are days late, no explanation provided, no number to speak to a human, just ineffective support robots.

I would be furious. If I spend $10k I want to be speaking with a human if something goes wrong. I feel like in the US we have given up on decent customer experiences.


This is what labor shortage feels like.


Every company seems to just be phoning it in lately. The number of times I've ordered something from an online retailer, and either 1. the wrong thing arrived, 2. the thing arrived broken/doa, or 3. the thing never arrived, has at least tripled since say 5 years ago. And to a lot of companies' credit, their indifferent but friendly customer support tends to just say "<yawn> just keep whatever we sent you and we'll refund you. You can try to order the thing again, and we'll maybe send you the right, working thing next time. Or maybe not. Who knows?"


"Dialing it in" is getting things working better -> perfectly.

"Phoning it in" is when your heart isn't in it, you're just reading your lines or going through the motions.


Thanks for the reminder. Edited!


Every time the stock market shits itself, the large brokerages seem to "go down"


The world would be a better place if airlines and ISPs worked that way too. (Road work is usually communicated well in advance.)

Downtime for these is critically impacting people, and you're damn right you should explain what went wrong and how you'll avoid this in the future. Because the only reasonable alternative is nationalizing them. "I dun fucked up, but meh" is not an acceptable attitude.


> "I dun fucked up, but meh" is not an acceptable attitude.

I'm cool with it. I don't care that some hard drive went bad and the monitoring system didn't pick up on it because some queue was full. Who cares? If this isn't happening constantly, I assume they're going to learn their lesson.


> "I assume they're going to learn their lesson."

That's pretty much the opposite of "but meh". Absolutely, mistakes happen, they aren't the end of the world, but they need to trigger a learning process. Looking at airlines and ISPs, that process is, in the most charitable viewing, progressing glacially.


>> Bear in mind that when something "breaks" a business can freak out, but they're not freaking out about lots of other arguably more important problems

THIS.

Worked for a huge corporation when I first started contracting as a developer. This was a daily occurrence. Someone would push some stuff to prod and it would break their very large e-commerce site. Sometimes it was minor stuff, other times it would cripple the entire site. Most of the time, nobody GAF at all.

I remember sitting at my desk one day and my senior dev walking by my desk like Lumbergh in Office Space. Coffee in hand and just casually said to me, "Looks like someone broke the build, most of the video game pages are throwing 404's right now, going to lunch, see you soon." and then just walked away.

I also found out one day one of the CDN folks that was running the Akamai account quit and then a month later, we found out one of our secondary content servers had been down since the day he left. When dude left, he never transferred any of his knowledge to the other team members so once they found out the server was down, it took another two days to get in contact with someone at Akamai who was handling our account.

At my previous job if something went down on our site, it was a four alarm fire, war room, and all hands on deck to get it resolved or heads would roll. It was so dysfunctional to work somewhere when something broke or stopped working, nobody was in any hurry to fix it. Several times I just thought, "Is this what they mean when they say "the inmates are running the asylum""?


>> At my previous job if something went down on our site, it was a four alarm fire, war room, and all hands on deck to get it resolved or heads would roll. It was so dysfunctional to work somewhere when something broke or stopped working, nobody was in any hurry to fix it. Several times I just thought, "Is this what they mean when they say "the inmates are running the asylum""?

That's funny, because reading your first sentence I was thinking that was the dysfunctional place. I had a boss that would take problems from 0 to 10 in a flash, when the problem was really a 4 or 5, and it really was not a great environment.


Yea, nobody seems to be able to handle problems that require solid, but moderate, non-urgent effort. The problem is either a 0, where nobody even knows, let alone cares that it's happening or a 10, where it's treated as though everyone's chair is on fire and the company is losing $10M per millisecond.


> It feels a bit like terrorism [..] It’s important not to sweat it

Hmm… interesting thought. It’s interesting how labeling something a terror attack makes the public lose critical thinking and act incoherently for a bit. It’s like a stunning shock response. The very same this author describes.

Do you think some managers intentionally play up the severity of problems to push their teams into this panic mode?

People in customer service know clients definitely do.

It is interesting how these concepts connect.


>Do you think some managers intentionally play up the severity of problems to push their teams into this panic mode?

I'm sure some do. Much more often I observe that the managers themselves are just reacting to the incentive system in their environment -- and that those incentive systems are complex, not-widely-understood, and often built "accidentally." ie, companies are often accidentally incentivizing the wrong behaviors and are often powerless to prevent this.


I am almost convinced most middle management are like small babies. That's not meant derogatively: there's a problem, they have been notified, yet they cannot do anything about it. They need someone else to fix it. So, they cry loud and clear that there is a need to fix this.

It's difficult to understand sometimes, but they are pressured from above, and they don't have the means to fix it by themselves.


Good management is a tool they have to fix the problem. But yes, sometimes they have mismanaged things so much that they can only kick up a fuss or pester people. It causes a lot of stress and is not very effective from my exp.

On a more positive note, other managers will have a rapport with their teams where if the team only learns of a problem, it will be addressed. They only need to act as liaisons — stick something on a kanban board somewhere, and it will get resolved. No loud crying necessary, just business as usual.


not saying you're wrong, because you absolutely aren't, but there's one key difference - usually freakouts are when it's client-facing related in combination with no control and/or information on what's going on and 'they' have to answer customer calls. That's when panic mode is activated.


> It feels a bit like terrorism vs. heart disease.

Fantastic analogy! One is scary but practically non-existent, and the other will bring early death to many people you know.


One of the funnier situations (funny now, wasn't so much the first time I saw this) that I run into at new gigs or contracts is when a business has absolutely zero monitoring or alerting. Go into their backend, it's predictably a dumpster fire. Start plugging in monitoring and the business realizes everything is on fire and PANICS. It's very difficult to explain to someone who definitely doesn't want to hear that it's actually been broken for a long time, they just didn't care enough to notice or to invest in doing it right the first time.


> It's very difficult to explain to someone who definitely doesn't want to hear that it's actually been broken for a long time, they just didn't care enough to notice or to invest in doing it right the first time.

if no one noticed then can you really say it was broken?


Depends - IME yes, lol. Stuff like email campaigns being broken is super hard for the business to detect, as is a lot of random customer-facing stuff that doesn't directly drive revenue. Stuff that is a degradation rather than a total outage will often be tolerated or go unnoticed for quite a long time until it actually blows up.

My favorite anti-pattern ever that happens here is when you notice a bug that's existed for a long time, you fix it, then some other part of the system that had adapted to the buggy behavior over the years breaks, because its way of handling the bug no longer works. Then of course the business comes to you like "why did you break this?"


Errors can go unnoticed or ignored until they cause a fault. For example you may not notice missing bolts on the underside of the bridge.


There are even times to prefer the site being down to being up. If you've been hacked and the site is serving CSAM/malware/leaking PII maliciously, as the SRE on deck, your job is to keep it down!


> It's important not to sweat it, because a company does not care about your sleep or mental health. It will push you as far as it can.

One time a few years ago, a coworker called me to tell me the stress we were under was causing her hair to fall out. I mentioned I'd lost significant weight and wasn't sleeping well, especially since another coworker had quit and now I was shouldering his responsibilities (programming and managing a junior). She said she was going to quit. I wished her the best. Around six months later, I had to quit too. It took around a year for me to get back to normal sleep patterns and for my heart to stop racing in the night.

Oh yeah, another guy from that team had a "cardiac event" due to stress and had to take a month off.


This is such an insane mentality for a rank-and-file employee who does not have significant upside tied to the business. If you're the sole shareholder of the company and there's some technical problem causing it to stop making money, then I can see it having a stressful effect on you and your health. If you're a worker bee who got 0.00000001% of the company's equity (if you're lucky), why oh why would you let the company's problems cause you to miss sleep and have a literal heart attack?


That is a great perspective - thank you


Some of the worst mistakes that I saw were from over-reaction in an active incident.

One of my programming mantras is "no black magic." If I don't understand why something works, then it's not done.

I take this same approach to an incident. If someone can't coherently identify why their suggestion will have an impact, I don't think they should do it. Now there may come a time that you need to just pull the trigger on something, but as I think back I'm not sure that was ever the case in the end.

It was wild to see the top brass—normally very cool and composed—start suggesting arbitrary potential fixes during an incident.


I have a similar mantra - "if you don't know why a fix worked, you may not have fixed it."

I'm willing to throw shit at the wall early in the triaging process, but only when they are low-impact and "simple" things. stuff like -

have we tried clearing cache?

have we checked DNS resolver for errors?

have we restarted the server?

etc. I try to find the "dumb" problems before jumping to some wild fix. In one of the worst outages of my career, a team I was working for tried to do a full database restore, which had never been done in production, based on a guess. At 3am on a Saturday. I push back really hard at stuff like that.


That mantra reminds me of "Any problem that goes away by itself can just as easily come back by itself."


To quote Gene Kranz during the start of problems on Apollo 13: "Let's not make things worse by guessing."


> One of my programming mantras is "no black magic."

This. * 1000

I am very grateful that my earliest training was as an RF bench technician. It taught me how to find and fix really weird problems, without freaking out.

However, it does appear that this particular event may have been caused by reliance on a dependency that may not have merited that reliance.

I'm really, really careful about dependencies. That's an attitude that wins me few friends, in this crowd, but it's been a long time, since I've had 2AM freakouts.


Wow, another RF/electronics tech turned SWE? How many of us do you think there are?


Not too many.

But debugging RF stuff is hairy AF, so it’s a great forge for making problem solvers.


> One of my programming mantras is "no black magic." If I don't understand why something works, then it's not done.

So many people nowadays don't care about knowing how things work, they just push shit and cross their fingers.


It's a good reminder that if things get bad, people will just start burning things to try and appease the gods.


OTOH, a lot of people who really needed to be able to change their event pricing at 2am (+/- two timezones) on jumpcomedy.com were really let down. Probably some of them died.

Imagine how much more the damage would have been if someone was telling this solo developer to stop trying to make 'fetch' happen.


One of my VPs (one of the coolest dudes I know) would often say "slow is smooth and smooth is fast".

You're absolutely right that fear interferes with one's ability to reason rationally but I'd also add that fear is highly infectious. If ICs see their leaders/managers/colleagues panicking, they will often panic too. Thankfully my VP was always level-headed and prioritized clarity over action.


>> One of my VPs (one of the coolest dudes I know) would often say "slow is smooth and smooth is fast".

Apparently you worked for a Navy Seal, so yeah, likely a very cool dude :)


How did you know?


Well known quote, that is often promoted as being used by the Navy Seals. Enter it in your favorite search machine to learn more about the background of the phrase.


And ultimately you carry none of the risk. It's not your company, and the company can (and will) cut you off at a random time. Unless it is your company :)


This is the entire risk: being fired and financially in trouble. I spent 5 years eliminating that risk. Now I don't give a single fuck. There is no fear. They get better, more rational work and I have security.


Was fired, worked out but only through luck (the firing was humane, so credit where credit due). Your advice is spot on: derisk financially, do your best work, but don't care when you get walked out. It's just a job, it doesn't matter. If it matters because you need the job, treat your financial situation like an emergency until you don't need the job. It will be a cold day in hell before I am ever on call or in a pager rotation again. "We're not saving lives, we're just building websites." as a wise old man once told me early in my career.


My manager told me this during the first "emergency" we had where the client was wildly over-reacting.

We're not doing open-heart surgery or launching people into space. They can wait 1-2 hours extra.

Earned a lot of my respect on that day :)


Good manager, rare manager. Worth their weight in gold.


How did you get to this point? I have $2M+ net worth, and while I live somewhere rent controlled, $2M isn't that much in the Bay Area. I'm still scared to lose my job.


What is your $2m generating? For instance, if you invested in high quality dividend stocks, you can get 3% and it should generally grow about what GDP does (which should also include inflation effects). So that's $60k/year, or, after 15% dividend tax plus let's say 5% California state tax, a net of $48k. If you're in somewhere rent-controlled, seems like even with a family you're not going to be on the streets starving.

Alternatively, if you have $2m in stocks that are growing, you could consider the growth amount as potential income, because you could always sell it. Generally the growth is better than 3%, but down times will also hurt worse since you might have to sell at a low.

Or you could purchase a residence with $1m, and buy dividends with the rest. Taxes on the residence will be about $10k, with dividend income of $24k, leaving $14k for food, etc.

You won't be independently wealthy in any of these scenarios (not in the Bay Area, anyway), but in these scenarios you don't actually need much from a job, maybe $40k with a family. So you should be able to save several years' worth of that on a Bay Area salary, which leaves you with quite a long runway.

And if things get really bad, you can move to the middle of nowhere, buy a house for $300k and actually be independently wealthy.
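
The arithmetic in the scenarios above, as a quick Python sketch. It uses only the numbers from this comment (3% yield, 15% + 5% flat tax, about $10k/year in property tax); the function name is made up, and real taxes are not this simple.

```python
def after_tax_income(principal, yield_pct, tax_pct):
    """Annual income from a yield after a flat tax (rates in whole percent)."""
    gross = principal * yield_pct / 100
    return gross * (100 - tax_pct) / 100


TAX_PCT = 15 + 5  # 15% dividend tax + ~5% California state tax, per the comment

# Everything in dividend stocks yielding 3%
all_dividends = after_tax_income(2_000_000, 3, TAX_PCT)

# $1M residence, $1M in dividend stocks, ~$10k/year tax on the residence
half_dividends = after_tax_income(1_000_000, 3, TAX_PCT)
left_for_living = half_dividends - 10_000

print(all_dividends)    # 48000.0
print(half_dividends)   # 24000.0
print(left_for_living)  # 14000.0
```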


if you're only getting 3% with dividends, can I interest you with an HYSA paying 5%?


10 year US treasuries my good friend, no state or local income taxes and HYSA rates will decline as the Fed lowers the federal funds rate.

3% is on par for SCHD, which is a popular ETF dividend fund (that provides income, but also growth). Different risk profile than treasuries though.

https://finance.yahoo.com/quote/SCHD/


We're still under a yield curve inversion, no? 3-month t-bills are paying 5.149% which is more than the rest.


Which will decline as the Fed cuts rates. Longer duration locks in risk free yields longer. Your portfolio goals and allocation will of course drive your strategy.


3% comes from traditional FIRE texts which find it to be the withdrawal rate that would survive all historic recessions.


The 5% HYSA minus the Fed's 2% inflation target is back to 3%. But high quality dividend companies generally raise their dividend each year, plus you get the increase in stock value. So you only get 3% in income, but your total return is much better.


> can I interest you with an HYSA paying 5%

Don't plan a retirement around a HYSA at 5% because those rates will go back down.


HYSA varies with interest rates


Two dead parents worth of houses. Mortgage paid off and a side job. It would have been that without the houses but there is an ex wife involved. Avoid those.

My net worth is less than yours. I could live on your interest fine and just retire.


With a $2m+ net worth you could move to 90% of the country, retire and live off of the interest.


> With a $2m+ net worth you could move to 90% of the country, retire and live off of the interest.

Not really for most scenarios.

For someone who is single and has no kids and plans to stay that way forever and is happy to live a very simple lifestyle, perhaps. Anyone else, no.

You'd still need to buy a house even if you move to a real cheap area, so you don't have 2M in cash. Then, how are you paying for health care? Kids education?


$500k buys a very nice house in many parts of the country.

Health insurance can be purchased from healthcare.gov in largely the same way home insurance works.

$1.5M would get you $60k/year without ever touching your principal.

That being said, waiting even a few years drastically changes the math into your favor. 5 years of a job, plus interest on your retirement fund could easily double it.


To support a sibling post, depends very much on age and circumstances. 50, healthy, no wife or kids and no plan to have them? Might be fine retiring at $2m. 30 with a family? Ehhhh…

Likely-time-remaining is a huge factor in figuring out how much it’ll take to retire at a given level of risk (of running out of money too early).


It's $80k/year at a 4% withdrawal rate. If you were 30 with a family it might be cutting it close (depends on things like student loans, cost of health insurance, car payments, cost of mortgage, etc). Bumping it up to $2.5m would probably make things a lot more comfortable, but the point remains that OP has no reason to worry.
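
For anyone who wants to plug in their own numbers, here is the withdrawal-rate math as a small Python sketch. It treats the rate as a plain rule of thumb and ignores taxes, inflation, and bad market years; the function name is made up.

```python
def annual_withdrawal(portfolio, rate_pct):
    """Yearly spending at a fixed withdrawal rate (rate in whole percent)."""
    return portfolio * rate_pct / 100


for portfolio in (1_500_000, 2_000_000, 2_500_000):
    for rate_pct in (3, 4):
        amount = annual_withdrawal(portfolio, rate_pct)
        print(f"${portfolio:,} at {rate_pct}%: ${amount:,.0f}/year")
```

At 4%, $2M gives the $80k/year figure above and $2.5M gives $100k/year; at the more conservative 3%, $2M drops to $60k/year.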


That's a mindset problem. If you have $2M net worth, if $1M is in cash in a HYSA paying 5% APY you're making $3.5k/month on that alone. I don't know what your costs are; rent, healthcare, partner, kids, other dependents; if you caught long covid and became disabled and couldn't work another full day in your life, you'd still be making $3.5k a month. I don't wish that disability on anyone, but what helps the mental switch on what to do is to pull a year's worth of living expenses out of savings into a cash account, and then just go do that for a year. Don't think about the rest of the pile, just sit with that year's worth of just living and doing no work.


It's not net worth. It's how long you can abide. If you have $2M net worth and $1.99M is tied up in a way you can't spend on bills, that don't mean nothin'. If things got bad, I guess you could move. You could probably buy a house now in fly-over country and just retire, living on interest from conservative investments.


At 2M net worth, the only thing that is making you fearful is your own poor assessment of risk/regret in life.

You are more likely to regret taking large debts or missing out on life milestones than on losing the job or going hungry.


If you're scared about your financial situation with over a $2M net worth either you are prone to paranoia or you're living well above your means.

I'm constantly surprised how many well-off people try to pass themselves off as "one of the poors". My family and friends do okay with a net worth way below that, in the Bay Area. I've worked minimum wage jobs here in the bay, and had co-workers actually struggling to make it to the next paycheck. It really gets on my nerves when someone in software chimes in with "we're all struggling"—No, we are not all struggling on the same level. Guess how much the person making your fast food or working at your local grocery store is making... And think about how they still live in the same city as you.

"$2M isn't that much". Man, I can't imagine what it takes to come to that conclusion.


> If you're scared about your financial situation with over a $2M net worth either you are prone to paranoia or you're living well above your means.

As someone who was laid off and went without a job for 13 months (tech bubble), I am _always_ scared about my financial situation. Sure, if I lose my job, I have the resources to hold out while I find another, but

- It _will_ impact my lifestyle, because I'll need to be more careful about what money I spend

- It _will_ make my family unhappy, for the same reasons

- I've been in the position where it's "do we eat or pay rent next month?", and it sucks. I likely won't get back there, but the fear of getting there persists.

Look, I get it, $2M is a lot of money. But there's a fair number of people that can get to that amount (especially living in an area where a 2br dwelling is easily $1M of that), still be saving _more_ money, and still fear the idea of being out of work for months. Especially if it's a single earner household. And especially if there's children.


Then you are living above your means. Having $2M in assets puts you in the top 5% in the USA, let alone the world. If you can't figure out a way to pay for food with this much money tied up in assets, you should spend some of your savings on a financial advisor.

There are literal minimum wage workers getting by in the bay. I really can't understand this obsession with not wanting to admit that you are well-off.


> There are literal minimum wage workers getting by in the bay.

And there are people that don't have a job and live off the land. Each type of person has different expectations and commitments.

- The person making $80k/year is well off compared to the person making nothing because they can't hold down a job; but they're still reasonable to be worried about losing their job, being unable to pay for their apartment, and being out on the street.

- The person making $150k/year is well off compared to the $80k person, but can have the exact same fears. Maybe they're worried about losing their house because they won't be able to pay their mortgage.

It can be very hard to pay your bills after being out of work for a year, even taking minimum jobs to help make ends meet; because people make commitments based on their earnings. And even _reasonable_ commitments can be rough if you're out of work long enough.

> I really can't understand this obsession with not wanting to admit that you are well-off.

The poster with $2M in net worth never denied they were well off. You're the one that seems to be denying that it's possible to be well off and _also_ be worried about the impact of losing your job. Your stance is... mind boggling to me.


If you're making enough to sustain a $2M net worth, either you are skilled enough to be able to find another job quickly (highly experienced individuals are always in demand) and you are just not confident enough to realize your worth—or, you are not skilled enough to get another job quickly and are being paid above your worth.

In any case, I would suggest saving up an emergency fund to the point that you are not scared of losing your job. I understand that people must make commitments (car payments, mortgages, student loans, etc.) and that the risk of not paying them off is always there. But I would hope that a millionaire of all people has enough collateral for a loan if necessary, and that they would have good enough credit to be approved.

If you have your finances in order, you should not be scared of losing your job. Having actual fear from the thought means that maybe your assets aren't working for you (they are all depreciating assets or the cost of their upkeep is higher than the return). The one thing that can and should terrify you is illness or injury that makes you unable to work.


> If you're making enough to sustain a $2M net worth, either you are skilled enough to be able to find another job quickly (highly experienced individuals are always in demand) and you are just not confident enough to realize your worth—or, you are not skilled enough to get another job quickly and are being paid above your worth.

As someone who was out of work for ~13 months during the dotcom bust; this is absolutely false. Sometimes, things work out such that getting a new job is more complicated than just having a good skill set. And honestly, the idea that "if you can't find another job quickly, you're clearly not very good at your job" is flat out insulting.

> I would suggest saving up an emergency fund to the point that you are not scared of losing your job.

13 months eats a lot of runway.


> And honestly, the idea that "if you can't find another job quickly, you're clearly not very good at your job" is flat out insulting.

To be clear, I was only referring to leadership or management roles (given the context of a millionaire in this industry). At that point in your career I would expect one to have many career connections and/or enough sheer charisma and experience that makes landing a job doable if not a breeze.

It's not meant to be particularly insulting, just realistic.

> As someone who was out of work for ~13 months during the dotcom bust; this is absolutely false.

I didn't consider exceptional events such as the dotcom bust, or the 2008 financial crisis or the Coronavirus pandemic and subsequent inflation. If we include those, I do agree that it makes sense to fear losing your job.

I do concede that we've had multiple "once-in-a-lifetime" industry upsets in a very short while and with that context it could make sense to panic. But in normal circumstances, I still think fearing job loss as a millionaire comes from somewhere more emotional/instinctual than rational.


You intentionally moved to that area, and it even sounds like you own real estate there.

Yes, it will impact your lifestyle, and you may have to be more careful about what money you spend, depending on how you're currently spending it. So what? 99% of people on earth live like that, and most do just fine.

The fact that having to live a non >$2M net worth lifestyle would make your family unhappy sounds concerning.


1. The discussion of what I've gone through, where I am in my life, and how the idea of losing my job (again) is concerning to me... is totally separate from the discussion of someone having $2M in net worth and living in an expensive area. Neither of those are true about me.

2. The vast majority of people, if they suddenly had to live with $0 income for a while, would be concerned, regardless of how much they make now.

3. The vast majority of people, if they suddenly had to live with 1/2 their normal income, would be concerned, regardless of how much they make. People make life decisions based on their income, and having that thrown up in the air generally causes people worry and stress.


> The fact that having to live a non >$2M net worth lifestyle would make your family unhappy sounds concerning.

The phrase "$2M net worth lifestyle" does not mean anything.

A 2M net worth could be a 1.5M house (which in the bay area is a simple nondescript house) and 500K in a 401k (which you can't touch until you're 60, so it's off limits). That's it. Neither of these can be used to pay any monthly bills.

So this person with a 2M net worth (locked in assets) is still 100% dependent on that monthly paycheck to cover all expenses.


It makes especially... interesting reading to those of us from the global South. To make $2 mil at my current salary, I will have to work for about 100-150 years, saving every penny and spending nothing, and I make decent money by our standards. There's even no relatively safe way to make some smart investments (which I imagine is how most of this money was acquired) because of incessant economic crises and non-existence of startup culture.

Man, some people here are completely out of touch with the average Eathlican. I'll have to show this discussion to my friends.

With that kind of money you could just... you know, move pretty much anywhere on the planet and live comfortably for many years, until a good opportunity comes up. Depending on the place, it might last you the entire lifetime (see my salary above).


(Who are Eathlicans? I find no search results.)


Earthling maybe?


I'm not one of these rich people that drives a gold-plated Lamborghini, like my obnoxious boss with his huge collection of supercars.

I just drive a regular, normal Lamborghini - like my father, and his father before him.


> I'm constantly surprised how many well-off people try to pass themselves off as "one of the poors".

The amount of people I've seen on here saying "It's just not an option to quit Google/Meta no matter how awful they are, I have a mortgage to pay!" even when the tech job market was still red-hot is quite incredible.


For a software engineer, or really most professionals, $2M really is not _that_ much in terms of retirement.

It’s absolutely enough to live a frugal retirement on, but not some sort of massive nest egg. You can safely withdraw about $80k/year from that without it diminishing.

Don’t get me wrong, that’s the average income in the USA, but most deep career professionals would be taking a step down in terms of purchasing power. No real source of income means you can’t really finance anything. Major purchases are essentially pulled straight from that nest egg.


Try having a chronic health condition. My ability to continue walking will at some point depend on paying doctors to replace part of my skeleton; that doesn't come cheap in America.


I wasn't speaking to exceptions. There will always be those who require ridiculous amounts of money to live comfortably—this is not the norm.


Hip and knee replacements are amongst the most common surgeries performed. I'm quite unusual in how young my issue struck, but everyone gets older.


Why do you even bother to work if you have 2M ?


Doesn't help you much if you are a family in Bay Area.


Of course I'm a random person on the internet and you haven't asked for my advice.

Just saying -- kids pick up languages real fast, friendship treaties and tax benefits are abundant and with 2M you can shop around for a better lifestyle if you really want to not have that stress and fear, unless you are determined to become filthy rich, I ofc don't know you.


The easiest way to not panic is to always remind yourself, "You don't need to be a hero. It is not your company".

If the company wants to derisk their business, they can hire more people and create systematic safeguards in their services over a long period of time. The return on this investment is theirs. If they don't make this investment, they don't get the returns - simple as that.

If you are afraid of "losing your job" - remember that you might lose your job even if you were the hero because of reasons not really in your control.

The only time to really panic is when your family is under crisis. To prevent your family from going in crisis, you should build safeguards in terms of health, income etc.

Business owners can do the same if they want safeguards. If not, fine, things will just break. And it is NOT an individual engineer's problem to fix all of their poor decisions.


I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.


After a few times I realized that every time I step into blind-ish panic over a mistake, I should get as far as possible from the machines. Otherwise I try one bad idea after the other and create an actual catastrophic failure.


What is this response called? I see it in many people. It’s like trying to catch a boiling mug of coffee when it falls — panic and not thinking leads to severe consequences; inaction would be better.

It seems to relate to fight or flight, as opposed to freezing. All three are not great and can significantly escalate a bad situation.


This is a kind of error-cascade, common for lots of systems.

Maybe more specific though, situations where mistakes cause more dire mistakes are sometimes called "incident pits". Not unknown to engineers, but I think we ripped it off the scuba-divers, mountaineers, and other folks that are interested in studying safety-critical situations. https://www.outdoorswimmingsociety.com/the-incident-pit/


Action bias and it seems rather fundamental to human behavior. You don't want to be seen as doing nothing when there are problems, but of course with a complex system, it can be very difficult to diagnose an issue, so you tend to screw things up if you just rush in. I think my favorite saying in that instance is, "don't just do something, stand there".


That’s a good one. I think my driving instructor told me something similar on my first lesson.

I’m talking about the brain fog more than the action itself, though. I no longer have it since I’ve been trained in emergency field medicine. That training course gave me enough exposure to stress to no longer get into this state of confusion. But I want to help others deal with it and it’s hard to find info online.


In my case it wasn't an emergency reflex. More like a strange curiosity to fiddle with something that was already in failure mode. Every attempt was fuzzy, say for a server, something shows a warning, you attempt a live fix, you try to get into a special superadmin UI you don't master, some operations run flaky, but you keep trying more. Before you know it you locked yourself out or bricked the machine.


I've heard "catch a falling knife" for this.


That's a very common term in stock trading as well! Prices are crashing, and you try to make use of that opportunity to buy stock and make a profit when the stock rebounds. Then the price crashes further, the rebound is not gonna happen in the near future, and your fingers are a bloody mess on the ground...

Another panic reaction is selling off everything at the first sign of trouble. And some people hold on to doomed stocks for way longer than what makes sense (look up "Intel guy" on /r/wallstreetbets)

Day trading requires due diligence and nerves like steel, else you are just making somebody else a little bit richer.


Hmm, I will use this one. Thanks.


Politician's syllogism gets close.

> We must do something. This is something. Therefore, we must do this.

https://en.wikipedia.org/wiki/Politician%27s_syllogism


I think calling it a panic response is appropriate. To me, that evokes a person doing irrational things to try and fix a sudden and distressing situation.


"Splashing my face and hands with extremely cold water is my trick."

Yes, literally cooling down. Combined with conscious breathing. And then go on doing, what you can do.


It's the mammalian diving reflex: https://en.wikipedia.org/wiki/Diving_reflex


My proudest SRE moment was when, in a team that dealt with a key system where seconds meant millions, I was able to find a precursor to the problem, so instead of having to intervene on a page in 3 minutes before disaster, the alarm now had 20 minutes of headroom.

We all slept so much better after that


For me, in situations like this, the fear and anxiety comes from the part of my brain screaming, "but what if I can NEVER fix this?" Which is of course complete nonsense. There is ALWAYS a solution somewhere, even if it's VERY hard to find, or even if it's not the one you want.

My first serious job as an adult was avionics repair. I knew how to do SOME component-level troubleshooting but I wasn't very good at it and always got lost in schematics pretty quickly. When I ran across a broken device that had a problem I couldn't solve within 30 minutes of investigation, my first instinct was to say, "welp, this is beyond my abilities" and ship the card or box off as permanently broken, or try to pawn it off to a more senior tech in the shop.

Fortunately, I had a mentor who disabused me of that tendency. He taught me that EVERY fault has a cause. Every time you think you've exhausted all options, what it really means is that you've only exhausted all CONVENIENT options and you just have to suck it up and get cracking on the harder paths.


I was once the primary party responsible for an accounting bug that would have cost Amazon around 100 million dollars in unsellable inventory. I stayed up for two nights writing the most heinous regexes to scrape logs(1) and fix the months-long discrepancies. Ended up fixing the problem by writing a script that re-submitted an event for every inbounded shipment to every Amazon warehouse for the last few months. I took the next week off to recover physically and mentally.

Afterward I made a promise to myself to never get into such a panic again. But funny enough, when I feel like panic is imminent I just remind myself that whatever problem I have is absolute peanuts compared to that event. So for me at least having one truly bad event helps me put things into context.

(1) the logs had JSON-escaped JSON and I am slightly proud of those regexes because it is the only time I hopefully ever have to use 8 backslashes in a row https://xkcd.com/1638/
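
For the curious, here is one way 8 backslashes in a row can come about, as a Python sketch. The actual log format isn't described here, so the `payload` and `path` keys and the sample value are invented; this only shows how each layer of escaping doubles the count.

```python
import json
import re

original = {"path": "C:\\temp"}          # the value holds ONE literal backslash
inner = json.dumps(original)              # JSON escapes it: 2 backslashes
outer = json.dumps({"payload": inner})    # escaping the JSON again: 4 backslashes

# A regex needs every literal backslash escaped, so matching those
# 4 backslashes takes 8 in a row, even in a raw string.
pattern = re.compile(r"C:\\\\\\\\temp")

print(outer)
print(bool(pattern.search(outer)))

# When the log line is intact, parsing twice is usually less painful than a regex.
recovered = json.loads(json.loads(outer)["payload"])
print(recovered == original)
```

In a language without raw strings, the same pattern would need 16.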


Same experience working for a major streaming provider as an SRE. Even when it's a million dollars per minute, the stress will never help you get the bleeding to stop. The experience is very helpful everywhere else in life: I can handle social, physical, and other stressors better after this exposure. Nothing in business is as bad as your biology can fool you into believing.


For what it's worth, I'm not sure this is a mental breakdown, and it might give the wrong impression to people who do legitimately suffer breakdowns related to stress about tech.

For me it's only happened once. It was an anxiety attack, and I'm very lucky my wife was there to talk me through it and help me understand what was happening. She's had them many times, but it was my first (and thankfully only).

It turns out that this sort of thing happens to people, and that there's nothing wrong with it. It doesn't mean you're defective or weak. That's a really important point to internalize.

Xanax is worth having on hand, since that was what finally ended it for me and I was able to drift off to sleep.

I guess my point is, there's a difference between having intrusive thoughts vs something that debilitates you and that you legitimately can't control, such as an anxiety attack or a panic attack. You won't be getting any work done if those happen, and that's ok.


https://www.webmd.com/mental-health/signs-nervous-breakdown

Not all breakdowns come in the form of panic and anxiety attacks. Those are certainly a way that breakdowns can manifest, but it’s not the only way. Stress manifests in wildly different ways for different people and even stressors.

You weren’t in his head, experiencing what he was experiencing, so it’s pretty much impossible to “diagnose” from the outside. It certainly sounds like he was functionally paralyzed for hours, even if he didn’t have a full blown panic attack.


That's a good point. I didn't mean to gatekeep, which is what I ended up doing. Thank you.

If he was functionally paralyzed for hours, then that absolutely qualifies. I was reading through it and thinking "If you can debug stuff, you're probably not having a mental breakdown" and wanted to highlight that some people do break down (which is why it's called a breakdown) and that it's ok.


If they were staying up for hours debugging, then they were also probably too tired to see the forest for the trees anymore. Simple exhaustion is sufficient to cause people to go on autopilot and follow trails that lead nowhere, or to just stare at nothing for minutes. Unless it's about life or death (or financial ruin), it's smarter to call it a day in such a situation. Or to at least do a powernap.


> Xanax is worth having on hand

Doing a search online seems to indicate that Xanax can be addictive?

https://www.drugs.com/xanax.html

Doesn't seem to be the kind of thing to take lightly?


It really doesn't take much to be effective if you don't take it regularly. It takes over a week for it to be addictive.

One of the potential side-effects of the medication is after you stop taking it your baseline anxiety will be worse for a while as your body resets. Similar to caffeine and feeling tired.

Depending on how long you're trying to use it you may need to slowly increase the dose to offset increased baseline. This isn't really dangerous using for say.. one week every two months.

It is possible to get psychologically dependent by frequently using it instead of strategies you would learn in therapy. This also happens with things like alcohol.

Using medication to deal with occasional severe anxiety or panic attacks is far safer than substituting it with alcohol.


You're correct. Don't take it lightly. But have it on hand for an emergency, if you're prone to panic or anxiety attacks.

Those are relatively rare, so you shouldn't end up taking it at a frequency where addiction can manifest. But everyone's different; keep open communications with your doctor.


Xanax is a benzodiazepine. You get "high" when taking just barely more than instructed. It's addictive. It's abused frequently. I've abused it before. It feels nice.

I would be surprised if you asked your GP "I have mental breakdowns now and then, I'd like some Xanax", and they just handed it out to you. Then again, I don't live in the US, so maybe the doctors there do just dish it out.


Doesn’t usually cause addiction if used in emergencies only and quite a few people in tech do.

I do feel more stressed 12-24 hours after taking benzos though. This leads some people to take a higher dose and you can see where that leads.

But no, no addiction in me or anyone I know using it “as needed”.

Some doctors prescribe it daily for a long period of time, which is reckless if other drugs are available for the same problem, like sertraline for anxiety. But then, some doctors prescribe oxy for pain when ibuprofen will do. Malicious stuff. All the addiction warnings are to steer people away from these drugs in these scenarios.

Benzos 3x a day for a month will be tough to shake. Benzos 3x a month to 3x a year — hard to imagine problems with that.


I think it's just a default warning; any drug can be addictive. I am taking Sertraline for anxiety, and even though I went through it with my psychiatrist about whether it can be addictive (his statement was - no, not really - and for me that was true, I was able to stop taking it without any side effects), there's still that risk.

As with any medical thing, better consult it with a specialist anyway.


Sertraline is an SSRI. Tapering SSRIs can suck, but if you're unlucky, benzodiazepine (like Xanax) tapers can be really, really bad, and really long:

https://wapo.st/4fV2b2Y

> A safe taper typically takes much longer than a week-long detox or 28-day stay in rehab — an average of six to 12 months, experts say.

This is not to downplay SSRI tapers, which can be plenty awful if you are on them for a long time, but quitting benzos too abruptly can give you seizures. You have to be very careful to taper them safely.


I will apologise then - I was so sure Xanax is SSRI too that I didn't even bother checking that to be sure. Thank you for correcting me.


I shouldn't have to drug myself to cope with the stress of constantly half-assed releases, features getting pushed in without thought, and 3AM pagerduty alarms because of these things.

We will likely see a large cohort of people who got into tech in the mid 2000s dying off from stress related disease.


My wife is a critical care nurse, and she used to work at a hospital in an area with a lot of tech workers about five years ago. She saw a lot of heart attacks in men in their late thirties and early forties that she rarely saw elsewhere.


One thing I did not expect about working in enterprise tech companies was the amount of coworkers who popped Xanax regularly.

I've been an anxiety basket case my whole life and the idea that an addictive pill could take it all away terrifies me - I'd be hooked for life.


It depends on the type of anxiety and how you're using the medication. If you use it routinely to handle everyday stress, you're heading down a very bad road. If you have events a few times a year that are extremely stressful, medication can really make a difference.

I had the same fears and I'm not gonna say you should try medication, but I will share what my psychiatrist did. She wouldn't let me have more than fifteen low-dose pills at any one time.

Consider your relationship with alcohol. If you already have a problem with it or completely abstain from it, I would strongly recommend avoiding medication.


It's hard to compete against the people who are "enhanced" with drugs.


I used to occasionally take Xanax for [health] anxiety myself. It doesn't have that kind of effect, it's like a sedative.


I'm the same, and similarly, the idea of taking a pill to make it go away is terrifying.

I have to keep myself very disciplined on the "helpful" medications I do allow into my life because I know how easy it'd be for me to get hooked and start depending on the meds to avoid my problems. Instead I've been working on training myself to see through my nerves.

Alcohol at least has the drawback that the few times I've had it, I've had such an awful experience that I don't associate it with avoiding my problems.


> I'd be hooked for life.

I mean, that's how a lot of prescriptions work anyway. I take daily medication to prevent anxiety and depression. It's not addictive, but I try pretty hard to never miss a day, because anxiety and depression will get in the way of me living my life. Assuming Xanax is safe to take daily (I have no idea), the only difference is that a supply disruption would have side-effects. But if I run out of my medication, I'll get depressed within weeks, almost guaranteed, withdrawals or no.


Xanax is safe short term, but extremely addictive if taken for too long (~1 month?).


Yeah I was actually disappointed to find an article about a routine dependency debugging story. I've felt on the precipice of a breakdown a few times before and was really hoping this would be a more relevant article.


This person's stress was caused by a single line of code in PostHog. This is the reversion: https://github.com/PostHog/posthog-js/pull/1371/commits/7598...

Highlights two lessons. 1. If you ship it, you own it. Therefore the less you ship, the better. Keep dependencies to a minimum. 2. Keep non-critical things out of the critical path. A failing AC compressor should not prevent your engine from running. Very difficult to achieve in the browser, but worth attempting.


Even worse, it appears that PostHog dynamically updates part of their code at runtime, not bundling it at build time. Their docs note an Advanced Option where all dependencies are bundled in the build. I mean, I get why, and maybe I am misunderstanding, but as a user I would expect lazy loading of executable code to be an optimization rather than the default, and used only if fully bundling caused a serious delivery delay.


These are valuable lessons for sure, but then someone from the marketing comes and demands you add PostHog or any other tracking script to the site and won't take no for an answer.


Then communicate clearly the trade-off the project leader is making.


While I agree with you, this sounds like it could lead to a classic case of "that escalation worked, in the sense that it was heard".

Unless someone with decision-making power is going to be woken up at 4AM for a problem, they have very little incentive (intrinsic or extrinsic) to block a project on such nebulous claims as "Reducing dependencies" or "Less things on the critical path", if a business leader has come to them with a request.


Looks like the bug was in a monkey-patched `window.fetch`

https://github.com/PostHog/posthog-js/blob/759829c67fcb8720f...

The biggest lesson here is, if you're writing a popular library that monkey-patches global functions, it needs to be really well tested.

There's a difference between "I'll throw posthog calls in a try/catch just in case" and "With posthog I literally can't make fetch() calls with POST"


I was poking around to understand how this was not caught in a test - any ordinary fetch call could have triggered the error, and besides how poor coverage it has for all the ways `fetch` can be used, it seems excessive mocking may have played a part: https://github.com/PostHog/posthog-js/blob/main/src/__tests_...

The whole fetch and XHR functions are mocked and become no-ops, so obviously this won't catch any issues when interacting with the underlying (native or otherwise) libraries. They have Cypress set up so I don't see why you'd want to mock the browser APIs.


I have seen so many mocked tests where you end up asserting the logic in the mock works; effectively testing 1=1.

The number of issues that can be prevented with an acceptance level test that has a user log in and do one simple interaction is amazing. Where I can convince the powers that be, PRs to main are gated by a build that runs, among others, that simple kind of AC test. If it was merged to main, you _know_ it will not totally break production.

We had regular outages with our internal emailing system at a small e-commerce shop. I stepped in and added one test that actually sent an email to a known sink that we could verify and had that test run pre-deploy. We went to zero email outages. Tests had the occasional flake that auto-retried. Also, if your acceptance tests are flaky, how do you know your software isn't? Bad excuse to avoid acceptance level testing
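For what it's worth, the gate described above can be sketched in a few lines. This is only a synchronous toy under stated assumptions: `NamedCheck`, `passesWithRetry`, and `failedChecks` are hypothetical names, and a real acceptance check (logging in, sending an email to a sink) would be asynchronous and talk to a real system.

```typescript
// A check returns true when the system behaves as expected.
type Check = () => boolean;

// Hypothetical shape for a named acceptance check.
interface NamedCheck {
  name: string;
  run: Check;
}

// Run a check up to maxAttempts times so an occasional flake
// does not block the deploy on its own.
function passesWithRetry(check: Check, maxAttempts: number): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if (check()) return true;
    } catch {
      // A thrown error counts as a failed attempt.
    }
  }
  return false;
}

// Gate: return the names of the checks that still fail after retries.
// An empty result means the deploy may proceed.
function failedChecks(checks: NamedCheck[], maxAttempts: number = 2): string[] {
  return checks
    .filter((c) => !passesWithRetry(c.run, maxAttempts))
    .map((c) => c.name);
}
```

The retry count is deliberately small: a check that needs many retries to pass is telling you something about the software, not just the test.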


Thank you for pointing this out. I did not read the post in detail but I was wondering how could a monitoring library cause the entire application to go down. At worst, I thought, it should have failed to process the monitoring events, assuming that it was integrated in a reasonable way.

PostHog patching a very important global function is a feature that should be well-documented so that the people who are using it are aware of it and can be reasonably expected to have it in mind when debugging these seemingly unexplainable issues.


I think it worked as defined, it hogged the POST requests?


it's common with all these analytics toolkits. When you're monkeying with such a core api, I have no idea how you can actually test everything.

eg heap analytics still (as of this month) bad touches something inside hotwire and randomly breaks hotwire entirely, causing every click to do a full page load. ime, it affects 30-60% of page loads. You can fix it (but thanks for 50+ hours of debugging) by making heap load after all hotwire js.


> I have no idea how you can actually test everything.

While a worthy goal (if unattainable sometimes), in this case, folks would settle for "test anything", which would have caught the regression.


They've been hanging out with the CrowdStrike people.

"Test strategy: yolo"


As others have mentioned, the bug that led to this late-night stress was a one-line change to the PostHog library[0].

I take this as a reminder of the importance of giving precise names to variables. The code

    res = await originalFetch(url, init)
looks harmless enough. But in fact the `url` parameter is not necessarily a URL, as the TypeScript declaration makes clear:

    url: URL | RequestInfo
The problem arises in the case where it is not a URL, but a RequestInfo object, which has been “used up” by the construction of the Request object earlier in the function implementation and cannot be used again here.

It would have been more difficult to overlook the problem with this change if the parameter were named something more precise such as `urlOrRequestInfo`.

(A much more speculative idea is the thought that it is possible to formalise the idea of a value being “used up” using linear types, derived from linear logic, so conceivably a suitable type system could prevent this class of bug.)

[0] https://github.com/PostHog/posthog-js/pull/1351/commits/2497...
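A toy model of the "used up" behaviour, for anyone who wants to poke at it. Everything here is hypothetical: `OneShotRequest`, `send`, `buggyWrapper`, and `saferWrapper` are illustrative names, not PostHog's code and not the Fetch API. The only assumption carried over is that reading a request body consumes it.

```typescript
// Toy stand-in for a request whose body can be read only once.
// Hypothetical: this is not the Fetch API's Request class.
class OneShotRequest {
  readonly url: string;
  private readonly body: string;
  private used = false;

  constructor(url: string, body: string) {
    this.url = url;
    this.body = body;
  }

  // Reading marks the body as used; a second read throws.
  readBody(): string {
    if (this.used) throw new TypeError("body already used");
    this.used = true;
    return this.body;
  }

  // A clone carries its own unread copy of the body.
  clone(): OneShotRequest {
    if (this.used) throw new TypeError("cannot clone a used request");
    return new OneShotRequest(this.url, this.body);
  }
}

// The name says what the value can be, which is the point made above.
type UrlOrRequest = string | OneShotRequest;

// Hypothetical "original" function: it needs an unread body.
function send(urlOrRequest: UrlOrRequest): string {
  if (typeof urlOrRequest === "string") return `GET ${urlOrRequest}`;
  return `POST ${urlOrRequest.url} ${urlOrRequest.readBody()}`;
}

// Buggy wrapper: reads the body for logging, then forwards the same object.
function buggyWrapper(urlOrRequest: UrlOrRequest, log: string[]): string {
  if (typeof urlOrRequest !== "string") log.push(urlOrRequest.readBody());
  return send(urlOrRequest); // throws TypeError when given a request
}

// Safer wrapper: reads a clone for logging and forwards the untouched original.
function saferWrapper(urlOrRequest: UrlOrRequest, log: string[]): string {
  if (typeof urlOrRequest !== "string") log.push(urlOrRequest.clone().readBody());
  return send(urlOrRequest);
}
```

Note that the plain-string case works in both wrappers, which is roughly why a test suite that only ever passes URL strings would not notice the difference.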


The problem with linear/affine type systems is that they have an incredibly high barrier to entry. Just look at ownership semantics in something like Rust. They're not impenetrable (especially with experience), but they're severe enough to be the number one complaint for learners.


Ha, that was a stressful yet funny read. The self flagellation bit hits too close to home though. I run a somewhat successful iOS/MacOS app and pushed a release that completely broke about 350k+ installations. Not entirely my fault but doesn't matter as it's my product.

The cold sweats and shame I felt, man... Plus it's on the App Store so there's the review process to deal with which extends the timeline for a fix. Thankfully, they picked it up for review 30 minutes after submission and approved it in a few minutes.


My first employer as a developer, due to their incompetence, not intelligence, ‘let’ me break our customers’ shit from a young age. As I’ve progressed through my career, and through my transition to leadership, I’ve realised that it was a very valuable experience. I may have stressed about it in the past, but those memories are too distant for me to even reach now. I certainly don’t stress about it now.

I’ll, maybe controversially, sometimes allow my (early-career) team members to break prod, if I can see it happening ahead of time, but am confident that we’ll be able to recover quickly. It’s common knowledge that being given room to fail is important. But many leaders draw the line at failures that actually hit customers. If one finds oneself in the very common and very fortunate position to be building software that isn’t in charge of landing planes or anything similarly important, they should definitely let their team experience prod failures, even if it’s at the expense of so-and-so from Spokane Washington not being able to use their product for a few minutes.


Within my first 6 months at a FAANG, I accidentally took down our service for half the world for 30-60 minutes.

That was honestly one of the best things that could’ve happened to me and I still use the story for new hires to this day. It’s a very humbling reminder that nobody is perfect and we all make mistakes. And I’m still here, so it’s never the end of the world.


Author: Thank you for writing this! I love reading about how people overcome challenges like this, especially under pressure (and usually overnight!). I am better for hearing not just the technical post mortem but also the human perspective that is usually sanitized from stories like this. This is the kind of technical narrative only a small/solo dev or entrepreneur can share freely.


Based on the way you were troubleshooting it, you can tell you're a programmer first. You went to your code, you went to your logs. Both reasonable, both potential causes of the problem. Both ignore the primary clue that you had: it worked on localhost.

As an SRE/devops/platform engineer, or whatever the title of the day is that people want to give, I would have zeroed in on the difference between the working system and the non-working system, either adding and then removing, or removing and then adding back, the differences one at a time until something worked. What I see is two things: 1) you have an environment where it does work; 2) the failing environment was working, then started failing.

Is my method superior to yours? No. It is just being stated to highlight the difference in the way we look at a problem. Both of us zero in on what we know. I know systems, you know code.
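The "one difference at a time" approach can be sketched roughly like this. It is a simplification under loud assumptions: an environment is modelled as a flat key/value `Config`, the `works` callback is a hypothetical stand-in for "deploy it and see", and a single difference is assumed to be responsible for the breakage.

```typescript
// Simplified model of an environment: flat key/value settings.
type Config = Record<string, string>;

// Keys whose values differ between the working and the failing environment.
function differingKeys(good: Config, bad: Config): string[] {
  const keys = Object.keys(good).concat(Object.keys(bad));
  const unique = keys.filter((k, i) => keys.indexOf(k) === i);
  return unique.filter((k) => good[k] !== bad[k]).sort();
}

// Start from the working config and apply the failing environment's
// differences one at a time; return the first key after which it breaks,
// or null if every difference can be applied without breaking anything.
function findBreakingDifference(
  good: Config,
  bad: Config,
  works: (candidate: Config) => boolean,
): string | null {
  const candidate: Config = { ...good };
  for (const key of differingKeys(good, bad)) {
    const value = bad[key];
    if (value === undefined) {
      delete candidate[key]; // the key exists only in the working environment
    } else {
      candidate[key] = value;
    }
    if (!works(candidate)) return key;
  }
  return null;
}
```

With many differences a bisection over the list would need fewer trials, but the linear walk is easier to reason about when you are tired.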


many years ago, i was working as an electronic technician. we had a stack of processor boards (from a Perkin Elmer 7/32) that were removed from service. broken. many different revisions, and only schematics for one revision of each board. i thought it was hopeless. an older wiser tech taught me how.

plug a good board on an extender. run a diagnostic that fails in a loop. using a scope, look at every pin on the connector. write down what you see. replace with a bad board. repeat.

which signals are different? chase them back. if the schematic does not match, get out a voltmeter and your eyes and draw a schematic that reflects how the board is wired.

he called this "good card - bad card".

and it worked. not going to make any claims about cost effectiveness, but we fixed every board. and my troubleshooting skills in digital electronics were greatly improved.

this was a 'fireman' kind of job. waiting for the system to break, so it didn't matter if 2 techs put a week into 1 circuit board.


> This is no good. Let me just try reverting to a version from a month ago. Nothing. Three months ago? Nothing. Still failing. A year ago? Zilch.

Reverting your own code, but still using a broken PostHog update from that same day? For me, the lesson is to make sure that I can revert everything, including dependencies.


It seems that PostHog just always loads the latest version of this piece of itself:

https://github.com/PostHog/posthog/issues/24471#issuecomment...

Though you can opt to bundle it yourself:

https://github.com/PostHog/posthog/issues/24471#issuecomment...


>> It seems that PostHog just always loads the latest version of this piece of itself:

Now there's a supply chain attack vector...


Years ago, IT at the company I was working at force-pushed a browser extension that did this same trick, but the extension vendor in question didn't even bother loading over https.

Edit: the extension's manifest gave it nearly every permission, on every web site, including internal ones


> I definitely want to figure out in detail what happened here so I can add a test to prevent a similar change in future!

Whoa! Good idea!

Could have been worse. At least the change didn't expose a hidden exploit.


Ouch. That just adds insult to injury.


This is the key lesson. Your own deploys mean nothing if you link to another CDN for parts of your application.

You handled it well OP, the silver lining of incidents like this is the grab bag of valuable takeaways!


Great reminder of the people behind services, as well as a nice account of the debugging process. The reality is that pressure doesn't make you debug problems any faster... usually, it interferes with your thinking. You have to try to ignore the consequences and stay as calm as possible.

But most of us have been in some situation similar, if not quite as bad. (Running your own company is going to be uniquely stressful.)


All monitoring comes at a cost and adds complexity. I wish people realized that, I struggle with this in my own team, we keep adding layers upon layer of monitoring, metrics etc.


I’ve definitely come across this genre of issue on many sites in the past when checking out the console when a site is broken. Page is just a blank white screen? Oh, looks like the render function was placed after the init for some 3rd party user monitoring, which crashed because the script didn’t load properly. “Complete Checkout” button just does nothing at all? Oh, looks like the code to take my money runs in a callback to some analytics script that my ad blocker blocked. Oops.


You need to insulate your metrics/monitoring from the critical path so that failures in these providers don't take out your app.
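A minimal sketch of that insulation at the call site. The `Tracker` interface, `isolate`, and `completeCheckout` are hypothetical names for illustration, not any vendor's actual API.

```typescript
// Hypothetical analytics client shape; not any vendor's actual API.
interface Tracker {
  capture(event: string, props?: Record<string, unknown>): void;
}

// Wrap a tracker so that failures inside it are reported and swallowed
// instead of propagating into the caller.
function isolate(
  tracker: Tracker,
  onError: (err: unknown) => void = () => {},
): Tracker {
  return {
    capture(event: string, props?: Record<string, unknown>): void {
      try {
        tracker.capture(event, props);
      } catch (err) {
        onError(err); // report, never rethrow
      }
    },
  };
}

// Critical path: the order completes whether or not tracking works.
function completeCheckout(orderId: string, tracker: Tracker): string {
  const receipt = `receipt-${orderId}`; // stand-in for the real work
  tracker.capture("checkout_completed", { orderId });
  return receipt;
}
```

This only covers failures at the call site. As pointed out elsewhere in the thread, it does nothing when the library breaks a patched global such as `fetch`, so it complements pinning or bundling the dependency rather than replacing it.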


I had these issues before for plenty of things, it just hurts the most when it's something non-essential. I've had outages because silly system updates to slack broke and took things down. I run metrics and such out through logs these days because UDP don't care.


Weirdly I think this is heavily related to social anxiety / shame - as in “everyone will know it was me and point”. This is buried so deep in our brains it’s almost certain to do with herd behaviour.

And it’s (IMO) why anonymity online is usually a bad idea - we need to learn, deep in our bones, that what is said online is the same as standing up in front of the church congregation and reading out our tweets - if you would not in front of the vicar, don’t in front of the planet.


> you would not in front of the vicar, don’t in front of the planet.

Every tyrannical government in history would love this maxim.


Not really. One can be civil without self-censoring. Candidness is not the same as rank rudeness.


These breakdowns happen to everyone and it's really bad when it's just you against the world. I've been lucky that my last few major outages have all been team efforts with anywhere from 2-10 people working on the issue. Albeit, this is a perk of working in a large enterprise.

With more than one person you can bounce ideas off each other and share the pain so to speak. It's highly desirable.


Great post, but kind of buries the lede: PostHog is having a CrowdStrike moment.


PostHog cofounder here. This affected users that did not have a specific version of the JS library pinned and deployed a new version, or were using the snippet, and had network capture enabled, (a feature we introduced very recently and is only enabled on 3% of projects), and had recordings enabled on that particular session (for most customers, only a small percentage of sessions are recorded due to sampling or billing limits)

This outage was definitely disruptive and we shouldn't have let this happen. We will be doing a full post mortem write up, but this affected a small percentage of our users, so the comparison with Crowdstrike isn't fair.


Random guy here. This affected users that

- Used your recommended way of implementing PostHog [1]

- Used a feature of the product

- Used a feature of the product

The comparison to CrowdStrike is not fair, you're right. But this attempt to shed responsibility still leaves a sour taste.

[1] See "This is the simplest way to get PostHog up and running. It only takes a few minutes." from your website, which is the first method suggested when clicking the "installation" tab


Just to be clear, those are AND not OR conditions.

Definitely not trying to shed responsibility here, we messed up and we'll make sure this doesn't happen again.


Not your customer, just a random person on the Internet, but I hope you can see that a lot of that is through luck more than judgement.

I personally would have liked to see a bit more contrition rather than trying to minimise the issue.


>This affected users that did not have a specific version of the JS library pinned and deployed a new version

Par for the course honestly. The amount of garbage that gets called "production" these days is mindboggling. No blue/green or canary deployments, shipping code that has nothing pinned, no clear rollback, etc. This is what happens when anyone can become an EngineerTM after a two week JavaScript boot camp.


No, actually, it's because Posthog explicitly recommends that as the way to do it, makes their standard npm package unpinnable (as it will always lazy load the most recent version of its modules) and calls version pinning via npm an "advanced" installation[1].

The ecosystem has plenty of versioning and best practices, but they do jack squat when you recommend to your customers to bypass them and trust that you'll never break your latest build.

[1] https://posthog.com/docs/libraries/js


Sure, but just because _they_ suggest that you set your website to depend on https://us.i.posthog.com/static/array.js doesn’t mean you’re off the hook for following that (bad) advice.


>No, actually, it's because Posthog explicitly recommends that as the way to do it

Just because a project recommends "curl whatever | bash" to get started doesn't mean it's something you should productionize. You need an engineer that's done more than a bootcamp to understand code pinning, packaging, and deploying in order to ship a supportable, observable system. You're making my point for me.


You're trying to phrase this as if those conditions make it any less bad, but they don't. This affected users that were using the latest version and used... features? Give me a break. Every product has bugs, but trying to downplay the issue after you've just read a distressed user of yours struggle with it is definitely not what you should be doing.


There's certainly a failure to test properly from PostHog, as in they have production features that aren't being tested before a release.

On the other hand the author of the article did the exact same thing. They either pushed a release without testing, or they automatically just pull in the latest version of an external library, without any testing or verification. Now I lean towards this being the latter, as if they pushed a release and then the site broke, they would have considered a rollback. Kinda hard to blame others for failing to do testing that you also didn't do.

Edit: So others have pointed out that PostHog will just pull down the latest version on its own, unless you actively disable that feature. That seems like a brave move.


Yeah, honestly not a good look to come in and “well… actually”. It’s certainly far from a “crowdstrike moment” but tact is still needed when you’ve clearly affected multiple people and their customers with your bug.


Every external service you integrate is adding a small, non-zero, compounding probability of finding yourself in exactly this situation.
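To put a rough number on "compounding": a back-of-the-envelope sketch, assuming each dependency fails independently with the same probability `p` in a given period. Both the independence and the example figures are assumptions for illustration, not measurements.

```typescript
// Probability that at least one of n dependencies fails in a given period,
// assuming each fails independently with probability p.
function probabilityOfAnyFailure(p: number, n: number): number {
  if (p < 0 || p > 1 || n < 0) {
    throw new RangeError("expected 0 <= p <= 1 and n >= 0");
  }
  return 1 - Math.pow(1 - p, n);
}
```

For example, `probabilityOfAnyFailure(0.01, 10)` comes out at about 0.096: ten services that each have a 1% chance of breaking you in a given month add up to nearly a 10% chance that at least one of them does.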


Not to mention the performance hit to your actual customers when it all "works."


Side loading 3rd party scripts in a critical path is asking for problems. Try https://partytown.builder.io/, which runs 3rd party scripts like this in a web worker. I'm not sure it would help in this case. Maybe? Probably couldn't hurt to try.


B&H Photo Video shuts down intentionally for a whole day every single week, and as far as I can tell, they're one of the top retailers of pro/prosumer A/V electronics.


We once planned an international trip with my dad to new york around visiting b&h to buy electronics about a decade ago.

Best memory was walking in there and seeing all the conveyer belts above me shooting packages across.


During the mid-early part of my Ops/SRE career I had a senior who was a mentor to me. I noticed as we dealt with outage after outage he was always calm, cool, and collected, so one day I asked him "<name redacted>, how do you always stay so calm when everything is down?" That's when I found out that before he'd been in tech, he'd been a Force Recon Marine and had been deployed. His answer was "Nothing we do here is life or death, and nobody is shooting at me. It'll be alright, whatever it is."

While I have never experienced anything similar myself, it really helped me to put things in perspective. Since then, I've worked on some critical systems that were actually life or death, but I no longer do. For the /vast/ majority of technology systems, nobody will die if you let the outage last just a few hours longer. The worst case scenario is a financial cost to the company who employs you, which might be your own company. Smart companies de-risk this by getting various forms of business insurance, and that should include you if it's your own company.

So, do everything you can to fix the outage, but approach it with some perspective. It's not life or death, nobody is shooting at you.


I know solo projects always have an infinite list of "nice to haves". But personally I never skimp on vendoring dependencies.

In my experience, not vendoring has _always_ led to breakages that are hard to debug and fix.

Meanwhile, vendoring is quite easy nowadays. Every reasonable package manager, and even npm, can do this near-trivially.


the argument is always "the pr that pulls in the dependency is gross to review with dependency updates" -- and there are ways to mitigate that. I vendor dependencies. My customers want stability and that means a bit more process in managing dependencies. Easy win.


I remember as a junior dev getting a call at around 3am from Indian tech support about a failed deploy I had been heading. I stressed myself about it so much and only later reflecting back do I realize nobody but me cared.

Also funny that the culprit was posthog since I have some past experience with it.


I heard a sleep expert say that during the night your logic and reasoning abilities are greatly reduced. I think it was in relation to dreaming, you don't want to apply much logic to that stuff.

That's why trying to solve problems in the middle of the night just ends up in stress.


Pilots call the wee hours of the morning the "window of circadian low." Unless you routinely sleep during normal daytime hours (for a few weeks at least), the circadian low still nudges your brain and body metabolism toward a lower-energy state EVEN if you are fully rested up until that point.

https://www.faa.gov/documentLibrary/media/Advisory_Circular/...


It also makes sense why I'll wake up in the middle of the night terrified of things that don't bother me during the day. My fears all are much more immediate in the witching hour, and I can't talk them away.


I'm the same. The only thing that has a chance of working is to remind myself that it's the middle of the night and however terrifying this is I'll deal with it in the morning.


Funny enough, I almost exclusively do my best problem solving at night. I'll wake up with solutions, or I'll stay awake and get things done. I'd say 90% of my Master's work was done between midnight and 3am.

But indeed, regardless of time of day, if I just wake up, or am woken up, I'm basically a big, dangerous toddler when it comes to problem solving.

Context and nuance are important, of course. We're all so different.


That often happened for me in grad school as well. Generally the questions I had trouble with on a take home exam would yield to late night inspiration. And if they didn't yield by a semi-reasonable time, I would go to bed and many times, as I was drifting off, an insight would come to me. One memorable time though, that didn't happen and I woke up several times in the middle of the night from stress dreams where I was trying to solve the problem. And when I thought about the dream, nothing I had been doing in it made any logical sense to actually help me with a solution. Fortunately I woke up early and was able to figure it out in the morning. It was not a very restful night of sleep though.


On the other hand, sometimes I find the time right after waking up the clearest, having an empty mental stack. Like having a clear cache.


correct, once i notice myself making stupid mistakes i verbally confirm it with myself and say whelp that's the night. superpower. too many times i've woken up and said what the fuck was i thinking?


I know that it's not a real post-mortem, but this is the opposite of what a good post-mortem looks like.

It includes:

* Blaming the tools (and the author)

* Not focusing on facts in the timeline

* Not considering improvements

But that doesn't make for engaging content, right?

> * At $TIME we observed HTTP POST calls failing

> * At $TIME customers reported inability to make changes to ticket prices and promo codes

> * $PERSON took the following steps to debug...

> * Root cause: an update to a vendor library resulted in cascading failures to the site

> * 5 whys (which might include lack of defensive programming, the use of a CDN without a fixed version, etc. etc.)

> * Next steps: pin the CDN version or pull the dependency into the build, etc.

Actually, that still looks like a pretty good story to me without any of the associated mania.


The trouble with post-mortems is always the things that you can't say.

Are we going to talk about how someone wanted analytics everywhere even though it's expensive and they don't use it, that they chose the vendor and didn't provide time to evaluate it, that they wanted it added now-now-now to production even though it hadn't gone through dev, even though it was unplanned work in a busy sprint with other risky work, and that they wanted it on the critical path despite specific objections from some in engineering?

Not saying any of this was the case here, but this kind of thing happens frequently. While engineering is saying "let's use blameless post-mortems to highlight problems in the system" a lot of other people are thinking "now we get to see who could or couldn't fix a problem we caused, focus on that, and downplay our own role in causing trouble". Devs talking about CDNs and HTTP-POST and engineering processes are super helpful for steering the conversation away from broken business processes.

Easily 80% of PMs I've been involved in probably come down to "we had rules for moving tested code through dev/qa/prod, but we were forced to ignore the process we had agreed to because $PERSON / $DEPARTMENT said so". No one can actually talk about the elephant in the room though, because it's career suicide after you're branded as not a team-player. For all the kids out there.. it's really important to learn to be able to recognize the difference between a PM that's going to be an earnest effort to understand things, vs a PM that is going to be an unhelpful but necessary ritual of humiliation.


Good post-mortems should provide ammo and action items to review existing business processes. If you don't have an advocate in your engineering organization who is working across departments to improve these processes, that may be a sign of organizational immaturity or dysfunction. No organizational function happens in a vacuum, and in the right company, collaboration can help influence the things that engineering doesn't directly control.


Let’s not pretend it’s just PMs. I’ve worked with PMs who fought against shipping fast but ultimately an engineering manager or engineer comes in and forces a fast ship.


It's also not a good fantasy novel or stage play.

It seems pretty silly to look at a blog post that is very obviously not meant to be a technical post-mortem and criticize how it isn't a good one.


A mantra I heard recently that has been helping me with my own 3AM panics was "None of this matters and we're all gonna die". A bit nihilist maybe, but it has been helpful in just kind of removing the weight of the situation.


Good thinking. Sometimes I have to remind my nerdy colleagues about what a true emergency is. A true emergency is when your granny is having a heart attack and has to be rushed to the hospital. When a big corporate website, that does millions of dollars of transactions is down at night it is NOT an emergency. They have to hire staff specifically for night shifts if they are so concerned about the money that they may lose.


Nothing good comes from working at 2am. Company I worked at pushed a major waterfall upgrade one day. Completely tanked the database while attempting to update the schema.

I spent hours on a call with the client's sr. engineer and we eventually came up with a script to fix it. It was after midnight, and my director said: good job, you are tired, I'll run the script, call it a night.

An hour later director ran the wrong script... and then called me.

Client's sr. engineer was legitimately flabbergasted, only time I have ever seen that word apply in real life.

Was a not good, very bad day.


From the PR:

> fetch() broken on August 19: TypeError: ...

Not broken at this version, broken on August 19. This is why I'm terrified of putting anything on the web. It is a dark scary place where runtime dependency on servers that you don't control is considered normal.

One day I'll start my own p2p thing with just a bunch of TUI's and I'll only manage to convince six people to use it each for less than a month and then I'll have to go get a real job again but at least I won't have been at the mercy of PostHog.


> Not broken at this version, broken on August 19. This is why I'm terrified of putting anything on the web. It is a dark scary place where runtime dependency on servers that you don't control is considered normal.

Yeah, that is terrifying. In a nearby comment [1], a PostHog co-founder wrote this affected sites which "did not have a specific version of the JS library pinned and deployed a new version, or were using the snippet". I gather from that that it is possible to pin the version, and this incident highlights the value of doing so.

[1] https://news.ycombinator.com/item?id=41301008


I'd prefer something where such references are resolved by cryptographic hash so that there's never any ambiguity re: what you're actually getting.

Unison does this I believe.
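
A rough sketch of the hash idea, in Python for brevity. It builds a string in the same shape browsers use for Subresource Integrity (algorithm name, a dash, base64 of the digest) and refuses content that no longer matches. The function names and the script body are made up for illustration; this is not how any particular vendor ships their snippet.

```python
import base64
import hashlib


def integrity_value(content: bytes) -> str:
    """Return an SRI-style string: 'sha384-' followed by the base64 digest."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def verify(content: bytes, expected: str) -> bool:
    """Accept content only if it hashes to the value recorded at review time."""
    return integrity_value(content) == expected


# Hypothetical script body standing in for a reviewed third-party file.
reviewed = b"console.log('analytics v1');"
pinned = integrity_value(reviewed)

# A silently changed upstream file hashes differently and is rejected.
changed = b"console.log('analytics v2');"
print(verify(reviewed, pinned), verify(changed, pinned))
```

With this, "what you're actually getting" is decided by the bytes themselves rather than by whatever a URL happens to serve that day.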


During these situations, a prompt message can be very reassuring for users, for example:

"We're investigating an issue affecting $X".

As a user, I can rule out that the issue is at my end. I can focus on other things and I won't add to the stack of emails.

This is one of my biggest frustrations with AWS being slow to update their status page during interruptions. I can spend much of the intervening time frantically debugging my software, only to discover the issue is at their end.


Love the detailed emotional reaction to scrambling to fix an outage. Nothing quite like attempting calm, dispassionate debugging while actively wrestling your own brain


The thing that struck me the most was how wildly unprofessional Paul D'Ambra's comments are in response to the bug report.

And then he rolled out a fix that was broken, too - showing incompetence in development, understanding the problem, and a total failure to do proper QA on the fix.

Royally fucked the pooch twice and he's all "gee golly whillikers!"


Such an entertaining read, it conveyed very well the sense of stress and abandonment felt by the author. It adds to it that this was written fresh and right in the moment, and it feels like an expiation.

I'm struggling to find the lesson to take out of that. Limit your dependencies? Have a safe mode that deactivates everything optional?


Whenever I am dealing with a 3rd party service, I like to write a small adapter for it that bridges the connection point and can keep an eye on a few things.

Primarily I use a code generator to write most of it.

For huge services it may not be practical, but for most it usually provides a heads up if something stops working with an integration.
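
A minimal sketch of what such an adapter could look like, in Python. The class name, the failure threshold, and the choice to swallow errors are all assumptions for illustration, and swallowing only makes sense for an optional service such as analytics. The idea is that every call goes through one place, failures are logged, and after a few failures in a row the integration is switched off instead of taking the rest of the app with it.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("third_party")


class ServiceAdapter:
    """Single connection point for one optional third-party service (hypothetical)."""

    def __init__(self, name: str, call: Callable[..., Any], max_failures: int = 3):
        self.name = name
        self._call = call
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def disabled(self) -> bool:
        # Too many failures in a row: stop calling the service at all.
        return self.consecutive_failures >= self.max_failures

    def invoke(self, *args: Any, **kwargs: Any) -> Optional[Any]:
        """Call the service; return None instead of raising if it misbehaves."""
        if self.disabled:
            return None
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            logger.exception(
                "%s failed (%d in a row)", self.name, self.consecutive_failures
            )
            return None
        # Any success resets the counter.
        self.consecutive_failures = 0
        return result
```

The log line is the "heads up", and the `disabled` flag is a crude safe mode. A real version would probably want a way to re-enable the service after a cool-down.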


This post resembles how I’ve been feeling lately at work, too bad it’s been months like this now!


I can relate to this. I used to be on call for many years and honestly, it destroyed my mental health. In the last company I did it, it felt like falling in a meat grinder for a week. I remember once spending a whole weekend giving support on a bug that was introduced by a recent release. 72 hours of working non stop. Because of that, among other things, I got a severe burnout that took me to the deepest dark place I've ever been.

To this day, I simply refuse to do on call. There's not enough money you could pay me to make me suffer that again.

PS: Fuck you, Rackspace.


More like "The anatomy of a calm and collected response while facing a dire situation thanks to years of expertise, eminem and my sweet wife."


I've been on my end of plenty of operational outages. I don't want to be harsh but this could have been written by one of my colleagues, the type of colleagues that I really wish I didn't work with. Console logging for hours? Randomly disabling things? Sometimes when you feel "imposter syndrome" you shouldn't ignore it and maybe up your game a bit.

In fact, I have dealt with an extremely similar situation where a bunch of calls for one of our APIs were failing silently but only after they had taken card payment transactions. Dealing with the developers of this system was like pulling teeth, after we got them to stop stammering and stop chipping in with their ideas (after half a day with this issue ongoing) it took 10 minutes to find the culprit by simply going through the system task by task until we got to the failing task (confirmation emails were unable to send so the API server failed for the entire order despite payments being taken etc.).

This only required 2 things: knowledge of the system, and a systematic process for fault finding. You would think that developers would have at least the first, being the ones who wrote it, but sometimes even that is a big ask.
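
To make "task by task" concrete, here is a small Python sketch of walking a pipeline in order and stopping at the first step that raises. The step names are invented to mirror the order flow described above; they are not from the actual system.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


def first_failing_step(steps: List[Step]) -> Optional[Tuple[str, Exception]]:
    """Run the steps in order and report the first one that raises."""
    for name, run in steps:
        try:
            run()
        except Exception as exc:
            return name, exc
    return None


# Hypothetical order pipeline; the names and the failure are illustrative only.
def take_payment() -> None:
    pass


def send_confirmation_email() -> None:
    raise ConnectionError("mail server unreachable")


def mark_order_complete() -> None:
    pass


ORDER_STEPS: List[Step] = [
    ("take_payment", take_payment),
    ("send_confirmation_email", send_confirmation_email),
    ("mark_order_complete", mark_order_complete),
]

failure = first_failing_step(ORDER_STEPS)
if failure is not None:
    print(f"first failing step: {failure[0]} ({failure[1]})")
```

Nothing clever here, which is the point: the process is mechanical once you know what the steps are.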

Maybe I'm just burnt out from this industry and incompetent people but... come on... no excuses really.


Agree with everything you said.

And then I’d add: Start with reading the error message. In his panic state, he seems to have thought it was a red herring. Error messages are gold. They give you a concrete thing to work backwards from.


The author needs to relax and Posthog needs more discipline. (and a rename)


Pin your dependency versions! The rollback should've fixed things :(


So Zarar needs to keep the localdev as close to prod as possible, or have a separate pre-prod environment that can run integration tests to catch vital function disruptions.


Your advice is useful for detecting bugs in the code that you release, before you push it to production, but it would not have helped here, because the bug was in Posthog code that was pushed to production asynchronously.

In fact, it would have made debugging this particular issue harder, because the difference in Posthog configuration between dev and prod is what clued the author in on Posthog causing the problem.

To avoid this kind of problem, the solution is to avoid “live” dependencies which can change in production without your testing. Instead, pin all dependencies to fixed versions and host them yourself.
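
As a rough illustration of the pinning half, a check like the following could run in CI and flag any dependency whose version spec is a range or a tag rather than one exact version. It is a Python sketch under stated assumptions: the manifest is a plain name-to-spec mapping, the package names are invented, and the regex only recognises simple `MAJOR.MINOR.PATCH` versions with an optional pre-release suffix.

```python
import re
from typing import Dict, List

# Exact versions such as 1.2.3 or 1.2.3-beta.1; anything else counts as unpinned.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$")


def unpinned(dependencies: Dict[str, str]) -> List[str]:
    """Return the names of dependencies whose spec is not one exact version."""
    return sorted(
        name for name, spec in dependencies.items() if not EXACT_VERSION.match(spec)
    )


# Hypothetical manifest; names and versions are illustrative only.
deps = {
    "analytics-sdk": "^1.150.0",  # caret range: can float to newer releases
    "payments-sdk": "4.2.1",      # exact: pinned
    "ui-kit": "latest",           # tag: whatever is newest at install time
}

print(unpinned(deps))
```

This only covers what you install. A script loaded from someone else's server at runtime is not covered by a manifest at all, which is where hosting it yourself comes in.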


Ah, thanks, I didn't know it was a live dependency.

>> Instead, pin all dependencies to fixed versions and host them yourself.

Indeed.


Having a(n accurate) service graph of all your (internal and external) dependencies is a game changer in troubleshooting issues like this.


Do you need another dependency for that? A dependency to manage dependency hell.


Optimally, that’s already part of your observability stack, and might well be a built-in feature. In a certain scale/landscape it easily pays off.


Love the story. The question for the author of course... what did you learn and how can you keep this from happening again?


Since you were able to think and act, I would not call this a mental breakdown. That kind of thing is very, very different.


I'm so glad I don't have to work with these byzantine JS monstrosities.


I feel your pain. Someone else shot me in the foot. No fun whatsoever.


That's a display of endeavour and persistence. Congratulations.


I would love to read from the author what are the lessonS learned.

Use better tools? Know better your tools? Know better how to debug? Add yet another tool to detect the error?

In all big companies where I worked, at the end of such an event, it boiled down to answering 3 questions:

- what happened?

- why did it happen?

- what do we do so it does not ever happen again?


This also serves as a cautionary tale to small-business web people.

You can start a web service business solo (or with a small handful of folks). But the web doesn't shut down overnight, so either have a plan to get 24-hour support onboarded early or accept that you're going to lose a lot of sleep.

(And if you think that's fun, wait until you trip over a regulatory hurdle and you get to come out of that 2AM code-bash to a meeting with some federal or state agent at 9AM...)


write my own everything gang rise up


Relevant XKCD: https://xkcd.com/2347/


... and that's why you pin dependencies


Nice writing. Art can be useful for helping us cope.


[flagged]


I don't think that's the right TL;DR here. The point is the FUD we sell ourselves is often just not true. Take a breath and face life's challenges without telling yourself "it's over" along the way.


Thank you for this. The article started interesting, but then devolved into a boring all-caps recap of panic thoughts which killed its momentum. I was still interested in the origin of the bug though.


I'm not entirely sure what you expected, given the title


It's not your style; I still found it an entertaining read that conveyed well the prolonged stress the author was feeling.


You can always skip to the end if it bores you. I liked it.


"Maybe PostHog, I have the api_key blanked out locally to reduce costs"

Come on, if POST requests work locally and not on PROD, isn't this an obvious place to start?


The author mentioned significantly more notable prod/dev differences than the posthog API key, which I suspect is where they looked first and second. So no, not an obvious place to start.



