The zero-downtime culture is pretty out of control. Your flight can be delayed for hours, your roads can be closed for weeks, your internet can be out for hours because someone cut a line, that's all ok, but someone can't get into your website for 3 minutes to check the status of their order and we all must lose our minds and publicly apologize and explain in detail what happened and why it will never happen again.
The obvious difference is no matter how long the county or city takes to reopen that road - you'll go right back to using it because you don't have an alternate choice.
For a website - particularly ecommerce websites - you have many choices. Rarely is a product only sold on a particular website. Being down when someone is trying to place an order can and does result in losing the sale, and potentially the customer forever. Customer loyalty is often fickle - why would they wait around for you to fix your stuff when they can simply place their order on Amazon or another retail website?
That really depends on what type of service we're talking about. Yeah, if you're doing really high volume sales on products that can be purchased elsewhere for close to the same price and shipping options, then uptime is critical. That's not the situation most of us are in.
From an ecommerce website's perspective - it is exactly the situation most of us are in.
Unless you are the manufacturer for your own goods, then you are competing on often-thin margins within a crowded industry with lots of competition where everyone more-or-less offers the same price.
In an age where more and more people don't even consider shopping anywhere but Amazon, being down exactly when that customer decided to grace your website is akin to giving yourself a nice, stinging papercut. Do this often enough and long enough, those papercuts will start to leave scars.
So ya, being down is unacceptable for most ecommerce websites. The website literally generates your revenue, and without it you receive no revenue.
Lots of things can be unacceptable, but not unacceptable enough to warrant being treated as a life-or-death emergency, where klaxons ring out and people get paged at 2AM. So your e-commerce website is down for a few hours. Fine. What's the worst thing that happens? You miss out on a small number of purchases, maybe a few customers are gone for good, and the shareholders reap a little less profit this quarter. Nobody died.
I think this is what OOP meant about "zero-downtime culture" being out of control. Unless you're a hospital keeping people on life support, I don't want to hear about zero downtime.
> What's the worst thing that happens? You miss out on a small number of purchases, maybe a few customers are gone for good, and the shareholders reap a little less profit this quarter.
You build a culture of nobody caring about the singular thing that brings revenue into your organization. You're fine with shareholders taking a hit, because it's not your bank account - but what happens when a company has to let people go because they don't have enough customers and orders to support everyone? Oh, now they're supervillain out-of-touch CEOs, I presume?
> Unless you're a hospital keeping people on life support, I don't want to hear about zero downtime.
Then look for other work? If you are being paid to keep a website online 24/7, and it goes down "for a few hours" and your attitude is "nobody died", then you're simply not cut out for that job - find another or be terminated.
There's not a single person on pager duty (or whatever) that doesn't know what they are getting paid for. Complaining about a job you volunteered (at-will employment!) for is insane.
Way ahead of ya, I don't have such a job and obviously wouldn't accept one since I have such a fundamental objection to that kind of "omg four alarm fire" outlook on work. SREs with pager duty have a more compatible mindset for that and frankly are just built different!
> Unless you are the manufacturer for your own goods, then you are competing on often-thin margins within a crowded industry with lots of competition where everyone more-or-less offers the same price.
That's not true. You don't have to manufacture your own goods to not compete with others at a commodity level. It's quite literally why branding is so important.
But I do find it funny you wound up doing exactly what the above commenter stated, and are acting like this is a world ending event for a small-medium sized ecommerce site to be down for a while.
> It's quite literally why branding is so important.
Most of the goods you purchase are from brands not owned by the retail website. Pick a hobby and consider the prominent brands, then consider the prominent websites people in that hobby often patronize - they are very often not the same.
> and are acting like this is a world ending event for a small-medium sized ecommerce site to be down for a while
Because it kind of is. The smaller you are, the more significant of an impact an outage has. Nobody is going to think "this website I've never heard of totally isn't working right now, so I'll just wait around until they fix it". No, instead, they will go on to their next google search result - or Amazon. This hurts even worse if they clicked one of your ads while your website is down... costing you actual money and not just the sale opportunity. A customer that might have had an expected lifetime of 1-3+ orders now becomes 0.
Websites don't have "business hours", and routine maintenance is not acceptable for customers that shop when it's convenient for themselves. Remember, ecommerce exists as a better way to do over-the-phone and catalog/mail ordering. One of those benefits is you can order whenever you want - so it is in fact important for the website to be online as close to 100% of the time as possible.
depends what you're selling, and there's always the chance that someone is up at 4am their time, but if the page was down from 11pm to 6am every day for a mainstream physical product, they'd probably not lose too many sales.
unfortunately, losing a sale puts a business into a tizzy, so they tend to react poorly to that thought. websites are in that spot because it's easy enough to leave a computer running unattended overnight that business hours aren't a thing. but watching the traffic flows while on the Internet traffic team at Google, on top of some emotional development, taught me that for a small bespoke business, the SLA doesn't even need any nines for that business to remain in business.
we're biased too, by the very nature of us commenting on an Internet website, but there are still customers out there that don't use the Internet on a daily basis.
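For a sense of scale on the 11pm-to-6am example above, here is a minimal sketch of the arithmetic. It assumes the usual reading of "nines" (one nine = 90% availability, two nines = 99%, and so on); the function and variable names are made up for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allowed_downtime_hours_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (1 -> 90%, 2 -> 99%, ...)."""
    return HOURS_PER_YEAR / 10 ** nines


# Site closed from 11pm to 6am every day, as in the example above.
nightly_window_hours = 7
nightly_availability = 1 - nightly_window_hours / HOURS_PER_DAY
nightly_downtime_per_year = nightly_window_hours * 365

print(f"availability: {nightly_availability:.1%}")  # 70.8%
# Does the nightly closure exceed even the one-nine (90%) downtime budget?
print(nightly_downtime_per_year > allowed_downtime_hours_per_year(1))  # True
```

In other words, a site that closes every night sits below even a single nine, which is the "no nines" situation described above.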
I'm about to order some electric bike motors. Probably from ebikes.ca. If their website goes down I'll just wait until it comes back up.
If it was groceries, I'd go to the store.
Unless in-person-purchasing is an option, I'm just going to wait for my preferred vendor. My account's already set up and clearly they have what I want.
By the numbers, most customers are shopping on any particular given ecommerce website for the first time. It's expensive and difficult to gain new customers - and even more difficult to gain repeat customers.
What would you have done if you had not purchased from ebikes.ca before and just happened to stumble across them after a google search? What would you have done if after clicking the ad/link, the page didn't load? Would you try again tomorrow? Probably not... that is the issue. Instead you would be naming some other small business that was online when you needed them to be.
> What would you have done if after clicking the ad/link, the page didn't load?
archive.org first just to check if it's down temporarily. But if you're a vendor just reselling Alibaba, then there's not a lot of reason to shop from you. But with real value add, you can get customers. That's exclusive products, support/warranty, niche product, etc beyond just "fulfillment".
There's an entire ocean between Alibaba flippers and strongly branded products sold first-party by mega-corps.
Consider the physical space and walk into a Target or Walmart - how many products on the shelf are Alibaba flips, and how many are sold by the manufacturer? Target has house brands, but Dawn Dish Soap is not one of them. You can get Dawn Dish Soap from most stores - if one day Target was randomly closed are you going to sit out front and wait for them to open, or are you going to another store?
Back in the digital space - let's pick golf... how many websites do you think sell Callaway golf clubs? How many of those websites are the actual manufacturer/brand? One? A google search for "buy golf clubs online" yields probably close to 50 websites just on the first page alone... of which 80% appear to be small businesses.
You start clicking through the search results looking for deals - and some of the pages don't load.
You're telling me you're going to go to archive.org and look up the history of these random websites, then wait for them to come back online just to see if you want to buy something?
People really don't consider or understand supply chain logistics, distribution and its relation with retail... people really like to casually wave their hands and assert resellers/distributors have no space in the world, yet they buy products from resellers/distributors every single day without a moment's thought (Target & Walmart!). The "value add" is making desirable products accessible to regular people.
Regular people aren't going to pony up a $75,000 Purchase Order, fill a shipping container, and wait 6-8 months transit just so they can have a new cell phone case or new bookshelf. That's where distribution networks come in handy. We buy the $75,000 container so that you can buy just one item.
I guess we can't really blame people for not understanding this - people rarely understand how their beef and carrots got to the grocery store either.
If Target's website is down for an hour, I'll just wait.
> sit out front and wait for them to open, or are you going to another store?
Many times I've wanted to get something from my preferred grocery store, but they're closed so I... just went a different time.
Most people aren't going to shop a bunch of random websites to buy "Callaway golf clubs". They're going to buy it from Dick's or Amazon. And if those websites are down the moment they check, they'll just wait an hour because it isn't urgent or figure it's a problem on their end.
Fuck, b&h photo takes their website down a whole day every week!
> I guess we can't really blame people for not understanding this - people rarely understand how their beef and carrots got to the grocery store either.
Thanks for the lesson, I learned how smart you like sounding.
> If Target's website is down for an hour, I'll just wait.
>> sit out front and wait for them to open, or are you going to another store?
You were asked if the physical store was closed - because that's the analog you are deliberately overlooking.
> Thanks for the lesson, I learned how smart you like sounding.
Some people are unwilling to learn. Even worse, some people insist they understand an entire industry, having spent exactly zero time in that industry. The hubris on display is amazing.
You are of course right, but in practice I've seen many more incidents created by making changes meant to improve robustness than by simpler incremental product changes, which are usually feature flagged and so on. At deeper levels there's usually less ability to do such containment (or it is too expensive, takes too long, or people are lazy), and so many times I wonder if it's better to make the trade-off or just keep things simple and eat only the "simple" sources of downtime.
For example the classic thing is to always have minimum of 3 or 5 nodes for every stateful system. But in some companies, 1 hour of planned downtime on Monday mornings at 7AM to 8AM for operations and upgrades (which you only use when you need) + eating the times when the machine actually dies, would be less downtime than all the times you'd go down because of problems related to the very thing that should make you more robust. An incident here because replication lag was too high, an incident there because the quorum keeping system ran out of space etc and you're probably already behind. And then we have kubernetes. At some point it does make sense, when you have enough people and complexity to deal with this properly, but usually we do it too early.
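The comparison above can be made concrete with a rough sketch. Every number below is a made-up assumption for illustration (how often the Monday window is used, how long each incident lasts); nothing here comes from real incident data, so substitute your own history.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours: float) -> float:
    """Fraction of the year the service is up, given total downtime in hours."""
    return 1 - downtime_hours / HOURS_PER_YEAR


# All figures are hypothetical; plug in your own incident history.
# Single node: the 1-hour Monday window is actually used 12 times,
# plus one 8-hour outage when the machine dies.
single_node_downtime = 12 * 1 + 8
# Cluster: four 6-hour incidents caused by the redundancy layer itself
# (replication lag, quorum store out of disk space, and so on).
cluster_downtime = 4 * 6

for name, hours in [("single node", single_node_downtime), ("cluster", cluster_downtime)]:
    print(f"{name}: {hours} h down, {availability(hours):.3%} available")
```

With these assumed inputs the single node comes out slightly ahead, but the result flips easily if the hardware fails more often or the cluster incidents are shorter, which is why the trade-off depends on the company.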
Sure. If you have a couple of short outages in a year and say 2% of your customers view the website during that time, half quit and never come back over it, that's it. You've lost 1% of your customers.
You seem to be stating that any loss at all no matter how small is a complete disaster without really qualifying it. Yet one could easily make the point - well, what about the customers you don't serve at all because they want to buy a rucksack and you don't sell them? That number could be far higher than the ones affected by outages.
I think this probably just depends on how big the numbers are to be honest. Yes, if 1% of people are millions of dollars, then sure, we care. Usually it's not anything close to being like that.
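The back-of-the-envelope estimate above (2% of customers hit an outage, half of them leave for good) can be sketched as follows. The annual revenue figure is a hypothetical placeholder, and the sketch assumes revenue is spread evenly across customers.

```python
def customers_lost_fraction(exposed_fraction: float, churn_fraction: float) -> float:
    """Share of the whole customer base lost: those who saw the outage AND quit over it."""
    return exposed_fraction * churn_fraction


# Figures from the comment above: 2% of customers see an outage, half never return.
lost = customers_lost_fraction(exposed_fraction=0.02, churn_fraction=0.5)
print(f"customers lost: {lost:.1%}")  # 1.0%

# Hypothetical annual revenue, assumed to be spread evenly across customers.
annual_revenue = 500_000
revenue_at_risk = lost * annual_revenue
print(f"revenue at risk: ${revenue_at_risk:,.0f}")
```

Whether that amount justifies the engineering effort to avoid the outages is exactly the "how big are the numbers" question.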
Are you assuming airline employees aren't scrambling when a flight is delayed? Or that road crews don't get called in the middle of the night? A minor power outage gets a crew called no matter what time it is; a major outage will bring in crews not just from other power companies but from other states.
But none of them are working in panic mode, as doing this work is just an expected thing. As OP has pointed out, downtime in other industries is expected and agreed upon as OK. For some reason, because it's on a computer, that does not apply to our industry: we are to go into panic mode to fix it and write about how it will never happen again, stock prices fall, people lose their jobs.
In the case of flights, isn't it rare that it is a single employee or just a couple of employees that are tasked or held responsible for the flight taking off on time?
Funny story though, veering a little OT, but recently I had to haul ass from one end of the airport to the other due to arriving on a delayed flight with a connecting flight with a different airline. As I was about to enter the airplane, after getting my boarding pass scanned, completely drenched in sweat, the flight attendant saw me and said the captain wanted to see me and ushered me into the cockpit. I found myself in the cockpit, where the captain proceeded to talk to me in the local foreign language while I was thinking, "Wtf, isn't it illegal for me to be in here after 9-11?" I then started repeating, in the local language, that I didn't understand them. But that didn't have the desired effect initially and they kept talking. I repeated myself both in their language and English until the captain finally said, in English, "Are you a passenger?" When I said yes, he told me I could sit down.
When I sat down, the only rational explanation that I could come up with was that despite being in civilian clothes I was mistaken for a pilot. Not just any pilot, but one of the pilots that was supposed to be in the cockpit. I could only conjecture that he was a no-show and that they had to bring in a substitute to replace him which caused the plane to be delayed and allowed me to make the flight.
Any commercial pilots are free to correct my theory, which normally should be a very silly one. If it could possibly be true though, being the no-show pilot in that situation might be a little stressful.
You know, I think if there was a little more tendency to explain, and trust in those explanations, people would like airlines, local transportation departments, etc, a whole lot more.
Without information, it is easy to assume incompetence, or worse, just not caring.
Great point. I honestly can't remember the last time that I've had a smooth experience with any kind of business or commercial thing, even when I'm the one paying for services. Bank declines transactions randomly when balance is sufficient, on vendors that have been used before. Orders of equipment over $10k are days late, no explanation provided, no number to speak to a human, just ineffective support robots. Planning days or weeks around other people's problems is pretty normal, but really, that just means that I get to postpone going to the DMV and getting the run-around from them.
Not sure if it's quiet-quitting or what, but the last few years (basically since covid) my consumer experiences in the US are feeling a lot more.. shall we say, European? And I guess that's fine, pretty much everything really can wait. But it's hard to maintain one's own sense of professional urgency to keep things smooth for others when literally nothing is smooth for you.
> Orders of equipment over $10k are days late, no explanation provided, no number to speak to a human, just ineffective support robots.
I would be furious. If I spend $10k I want to be speaking with a human if something goes wrong. I feel like in the US we have given up on decent customer experiences.
Every company seems to just be phoning it in lately. The number of times I've ordered something from an online retailer, and either 1. the wrong thing arrived, 2. the thing arrived broken/doa, or 3. the thing never arrived, has at least tripled since say 5 years ago. And to a lot of companies' credit, their indifferent but friendly customer support tends to just say "<yawn> just keep whatever we sent you and we'll refund you. You can try to order the thing again, and we'll maybe send you the right, working thing next time. Or maybe not. Who knows?"
The world would be a better place if airlines and ISPs worked that way too. (Road work is usually communicated well in advance.)
Downtime for these is critically impacting people, and you're damn right you should explain what went wrong and how you'll avoid this in the future. Because the only reasonable alternative is nationalizing them. "I dun fucked up, but meh" is not an acceptable attitude.
> "I dun fucked up, but meh" is not an acceptable attitude.
I'm cool with it. I don't care that some hard drive went bad and the monitoring system didn't pick up on it because some queue was full. Who cares? If this isn't happening constantly, I assume they're going to learn their lesson.
That's pretty much the opposite of "but meh". Absolutely, mistakes happen, they aren't the end of the world, but they need to trigger a learning process. Looking at airlines and ISPs, that process is, in the most charitable viewing, progressing glacially.