So when the May 1998 riots in Indonesia (https://en.wikipedia.org/wiki/May_1998_riots_of_Indonesia) started happening, we heard some harrowing stories of US employees being abducted, among other things.
Around that same time, the equipment in the Jakarta store started sending high temperature alerts prior to going offline. Our NOC wasn't able to reach anyone in the store.
The alerts were quite accurate: that was one of the many buildings that had been burned down in the riots. I guess it's slightly surprising that electrical power to the equipment in question lasted long enough to allow temperature alerting. Most of our stores back then used satellite for their permanent network connection, so it's possible telecom died prior to the fire reaching the UPC office.
In a couple of prominent places in the home office, there were large cutouts of all of the countries WalMart was in at the time up on the walls. A couple of weeks after this event, the Indonesia one was taken down over the weekend and the others re-arranged.
I remember during the 2000s and 2010s that WalMart in the USA earned a reputation for its inventory primarily consisting of Chinese-made goods. I'm not sure if that reputation goes all the way back to 1998, but it makes me wonder if WalMart was especially targeted by the anti-Chinese element of the Indonesian riots because of it.
Prior to the early 2000s, WalMart had a strong 'buy American' push. It was even in their advertising at the time, and literally written on the walls at the home office in Bentonville.
Realities changed, though, as whole classes of products were more frequently simply not available from the United States, and that policy and advertising approach were quietly dropped.
Just for the hell of it, I did a quick youtube search: "walmart buy american advertisement" and this came up: https://www.youtube.com/watch?v=XG-GqDeLfI4 "Buy American - Walmart Ad". Description says it's from the 1980s, and that looks about right.
The switch had already begun.
Part of what my team (Network Management) did was write code and tools to automate all of the various things that needed to be done with networking gear. A big piece of that was automatically discovering the network. Prior to our auto discovery work, there was no good data source for or inventory of the routers, hubs, switches, cache engines, access points, load balancers, VOIP controllers...you name it.
On the surface, it seems scandalous that we didn't know what was on our own network, but in reality, short of comprehensive and accurate auto discovery, there was no way to keep track of everything, for a number of reasons.
First was the staggering scope: when I left the team, there were 180,000 network devices handling the traffic for tens of millions of end nodes across nearly 5,000 stores, hundreds of distribution centers and hundreds of home office sites/buildings in well over a dozen countries. The main US Home Office in Bentonville, Arkansas was responsible for managing all of this gear, even as many of the international home offices were responsible for buying and scheduling the installation of the same gear.
At any given time, there were a dozen store network equipment rollouts ongoing, where a 'rollout' is having people visit some large percentage of stores intending to make some kind of physical change: installing new hardware, removing old equipment, adding cards to existing gear, etc.
If store 1234 in Lexington, Kentucky (I remember because it was my favorite unofficial 'test' store :) was to get some new switches installed, we would probably not know what day or time the tech to do the work was going to arrive.
ANYway...all that adds up to thousands of people coming in and messing with our physical network, at all hours of the day and night, all over the world, constantly.
Robust and automated discovery of the network was a must, and my team implemented that. The raw network discovery tool was called Drake, named after this guy: https://en.wikipedia.org/wiki/Francis_Drake and the tool that used many automatic and manual rules and heuristics to map the discovered networking devices to logical sites (ie, Store 1234, US) was called Atlas, named after this guy: https://en.wikipedia.org/wiki/Atlas_(mythology)
All of that background aside, the interesting story.
In the late 90s and early 2000s, Drake and Atlas were doing their thing, generally quite well and with only a fairly small amount of care and feeding required. I was snooping around and noticed that a particular site of type International Home Office had grown enormously over the course of a few years. When I looked, it had hundreds of network devices and tens of thousands of nodes. This was around 2001 or 2002, and at that time, I knew that only US Home Office sites should have that many devices, and thought it likely that Atlas had a 'leak'. That is, as Atlas did its recursive site mapping work, sometimes the recursion would expand much further than it should, and incorrectly map things.
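The shape of such a 'leak' is easy to sketch. This is not Atlas itself (which used many automatic and manual rules and heuristics); it's just a toy hop-limited walk in Python, with invented device names, showing how mapping a site by expanding outward from a seed device can swallow a neighboring site's gear if the recursion isn't bounded:

```python
from collections import deque

def map_site(seed: str, neighbors: dict[str, list[str]], max_hops: int) -> set[str]:
    """Walk the discovered neighbor graph outward from a site's seed
    router, stopping after max_hops. Without such a bound (or rules
    that recognize site boundaries), the expansion can 'leak' across
    a WAN link and claim another site's devices."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        device, hops = queue.popleft()
        if hops == max_hops:
            continue  # boundary reached; don't expand further
        for peer in neighbors.get(device, []):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, hops + 1))
    return seen
```

With a tight hop limit, a store maps to just its own router and switches; raise the limit and the walk crosses the WAN link and "grows" by absorbing devices that belong elsewhere.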
After looking at the data, it all seemed fine. So I made some inquiries, and lo and behold, that particular international home office site had indeed been growing explosively.
The site's mapped name was completely unfamiliar to me, at the time at least. You might have heard of it: https://en.wikipedia.org/wiki/Shenzhen
I was seeing fingerprints in our network of WalMart's wholesale switch to sourcing from China.
You're correct: if Drake (the always-running discovery engine) didn't detect a device on a given port over a long enough time, then another program would shut that port down. This was nominally done for PCI compliance, but of course having open, unused ports, especially in the field, is just a terrible security hole in general.
In order to support legit equipment moves, we created a number of tools that the NOC and I believe Field Support could use to re-open ports as needed. I think we eventually made something that authorized in-store people could use too.
As an aside, a port being operationally 'up' wasn't by itself sufficient for us to mark the port as being legitimately used. We had to see traffic coming from it as well.
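The decision rule described above (a port only counts as in use if it's operationally up AND carrying traffic) can be sketched as pure logic. The names here are illustrative, not the actual tooling; the two fields stand in for the SNMP `ifOperStatus` and `ifInOctets` values a poller would collect:

```python
from dataclasses import dataclass

@dataclass
class PortSample:
    """One polling sample for a switch port (names are illustrative)."""
    oper_up: bool    # stands in for SNMP ifOperStatus == up
    in_octets: int   # stands in for the SNMP ifInOctets counter

def should_shut_down(samples: list[PortSample]) -> bool:
    """A port is considered legitimately used only if it was
    operationally up AND its traffic counter advanced between two
    consecutive samples; otherwise it's a shutdown candidate."""
    if len(samples) < 2:
        return False  # not enough history to decide
    saw_traffic = any(
        b.oper_up and b.in_octets > a.in_octets
        for a, b in zip(samples, samples[1:])
    )
    return not saw_traffic
```

A port that's up but whose counters never move over the whole window gets flagged, which matches the "up isn't sufficient, we had to see traffic" rule.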
You mentioned elsewhere that you're working with a big, legacy Perl application, porting it to Python. 99% of the software my team at WalMart built was in Perl. (: I'd be curious to know, if you can share, what company/product you were working on.
Walmart must have been an interesting place to work during the late 90s, early 2000s - I imagine that most everywhere they had to solve problems at scale before scale was considered a thing. I'd be very interested to see how the solutions created in that period match to best-practices today, especially since outside of the telecom or perhaps defense worlds there probably wasn't much prior art.
As for the Perl application, I probably shouldn't say since I'm still employed at the same company and I know coworkers who read HN. If you're interested, DM me and I can at least provide the company name and some basic details.
Definitely. (: I didn't hold it against the hands on workers like yourself. Even (and perhaps especially) back then, WalMart was a challenging, difficult and aggressive partner.
> working on networking gear at Walmart meant that you were out at 11pm at night
That sounds about right; the scheduling I was directly aware of was very fast paced. Our Network Engineering Store Team pushed and pushed and pushed, just as they were pushed and pushed and pushed.
> Walmart must have been an interesting place to work during the late 90s, early 2000
Yup. Nowhere I've worked before or since had me learning as much or getting nearly as much done. It was an amazingly positive experience for me and my team, but not so positive for a lot of others.
> I imagine that most everywhere they had to solve problems at scale before scale was considered a thing.
Sometimes I imagine writing a book about this, because it's absolutely true, all over Information Systems Division.
For a time in the early 2000s, we were, on average, opening up a new store every day, and a typical new store would have two routers, two VOIP routers, two cache engines, between 10 and 20 switches, two or four wireless access point controllers and dozens of AP endpoints. That was managed by one or two people on the Network Engineering side, so my team (Network Management) wrote automation that generated the configs, validated connections, uploaded configs, etc etc etc. (Not one or two people per store: one or two people for ALL of the new stores.)
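As a rough illustration of that config-generation step: the template, hostname convention, and addressing scheme below are entirely invented for the sketch (not WalMart's actual scheme), but they show the basic idea of deriving a per-store device config from nothing more than the store number:

```python
ROUTER_TEMPLATE = """\
hostname {hostname}
interface Vlan1
 ip address {gateway} 255.255.255.0
 no shutdown
"""

def store_router_config(store: int, country: str = "us") -> str:
    """Derive a per-store router config from the store number.
    The naming and addressing scheme here is hypothetical."""
    # Invented scheme: pack the store number into the 2nd/3rd octets.
    return ROUTER_TEMPLATE.format(
        hostname=f"{country}-s{store:05d}-rtr1",
        gateway=f"10.{store // 256 % 256}.{store % 256}.1",
    )
```

Once configs are deterministic functions of inventory data like this, validating connections and uploading configs can be automated the same way, which is what makes one or two people per rollout (rather than per store) feasible.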
The networking equipment was managed by a level of automation that is pretty close to what one sees inside of Google or Facebook today, and we were doing it 20 years ago.
> ... telecom ... prior art ...
John Chambers, the long-time CEO of Cisco, was at the time on WalMart's board of directors. He was always a bit of a tech head, and so when he came to Bentonville for board meetings, he'd often come and visit us in Network Engineering.
Around 2001-2002, we were chatting with him and he asked why we weren't using CiscoWorks to manage our network (https://en.wikipedia.org/wiki/Cisco_Prime — back then it was mostly focused on network monitoring and, to a lesser extent, config management). We chuckled and told him that there was no way CiscoWorks could scale to even a fraction of our network. He asked what we used, and of course we showed him the management system we'd written.
He was so impressed that he went back to San Jose, selected a group of Cisco Works architects, had them sign NDAs, and sent them to Bentonville, Arkansas for a month. The intent was to have them evaluate our software with an eye toward packaging it up and re-selling it.
Those meetings were interesting, but ultimately fruitless. The Cisco Works architects were Ivory Tower Java People. The first thing they wanted to see was our class hierarchy. We laughed and said we had scores of separate and very shallow classes, all written in Perl, C and C++.
Needless to say, they found the very 'rough and ready' way our platform was designed to be shocking and unpalatable. They went back and told Chambers that there was literally no way our products could be tied together.
> ... match to best-practices today ...
Professionally, I've been doing basically the same kinds of things since then, and I'll say that while our particular methods and approaches were extremely unusual, the high level results would meet or perhaps exceed what one gets with 'best practices' seen today.
Not because we were any smarter or better, but because we had no choice but to automate and automate effectively. At that scale, at that rate of change, at those uptime requirements, 'only' automating 99% would be disastrous.
FWIW, my brain was going "book! book! book! book! book!" back at the top-level comment, and the beeper may have got slightly overloaded and broke as I continued reading. :)
As a sidenote, the story about "the CEO vs the architects" was very fascinating: the CEO could see the end-to-end real-world value of what you'd built, but the architects couldn't make everything align. In a sense the CEO was more flexible than the architects, even though stereotypes might suggest the opposite.
Also, the sentiment about your unusual methodology exceeding current best practice makes me wonder whether you achieved so-called "environmental enlightenment" - where everything clicks and just works and makes everyone who touches the system a 5x developer - or whether the environment simply had to just work really really well. Chances are the former is what everyone wishes they'll find one day, while the latter (incredibly complex upstream demands that are not going to go away anytime soon and which require you to simply _deliver_) definitely seems like the likelier explanation for why the system worked, regardless of the language it was written in - it was the product of a set of requirements that would not accept anything else.
Hmm. Now I think about that a bit and try and apply it to "but why is current best practice worse", I was musing the other day about how a lot of non-technical environments don't apply tech in smart ways to increase their efficiency, because their fundamental lack of understanding of technology means they go to a solutions provider, get told "this will cost $x,xxx,xxx", don't haggle because they basically _can't_, and of course don't implement the tech. I wonder if the ubiquitification (that seems to be a word) of so-called "best practices" in an area doesn't function in a similar way, where lack of general awareness/understanding/visibility in an area means methodology and "practices" (best or not) aren't bikeshedded to death, and you can just innovate. (Hmm, but then I start wondering about how highly technically competent groups get overtaken by others... I think I'll stop now...)
Some interesting facts to know for those who don't dig into it. Walmart:
- has 80+ internal apps, mostly variants but still unique
- runs k8s inside of Distribution Centers
- maintains a fleet of >180k mobile devices in the US alone
- has a half-dozen data centers in the US
- has most International infrastructure separate from US Stores'
I've got some stories of my own, maybe I'll post them in a bit.
Wow, that's a hell of a change! When I left in 2009, there were exactly two datacenters: NDC and EDC. Not surprising really.
From where I was sitting, the best era was definitely 1997-2004 or so. ISD really went down hill, pretty quickly, in my last five years there, for many different reasons.
Really though, I feel truly awful for anyone affected by this. The post recommends implementing a disaster recovery plan. The truth is that most people don't have one. So, let's use this post to talk about Disaster Recovery Plans!
Mine: I have 5 servers at OVH (not at SBG) and they all back up to Amazon S3 or Backblaze B2, and I also have a dedicated server (also OVH/Kimsufi) that gets the backups. I can redeploy in less than a day on fresh hardware, and that's good enough for my purposes. What's YOUR Disaster Recovery Plan?
If OVH is totally down, and the fail over IP doesn't work, I have a fairly low TTL on the DNS.
I backup the database state to S3 every day.
Since I'm truly paranoid, I have an Intel NUC at my house that also replicates the DB. I like knowing that I have a complete backup of my entire business within arm's reach.
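For anyone building a similar nightly-dump setup, the part worth writing down is the rotation policy: which old copies to keep. A minimal grandfather-father-son style sketch in Python; the retention parameters are invented for illustration, not anyone's actual policy:

```python
from datetime import date, timedelta

def backups_to_keep(dates: list[date], today: date,
                    dailies: int = 7, weeklies: int = 4) -> set[date]:
    """Given the dates of existing nightly dumps, keep the last
    `dailies` days of backups plus the newest backup from each of
    the last `weeklies` ISO weeks."""
    keep = set()
    newest_per_week: dict[tuple, date] = {}
    for d in sorted(dates, reverse=True):
        if (today - d).days < dailies:
            keep.add(d)
        week = d.isocalendar()[:2]          # (ISO year, ISO week)
        newest_per_week.setdefault(week, d)  # first seen is newest
    for week in sorted(newest_per_week, reverse=True)[:weeklies]:
        keep.add(newest_per_week[week])
    return keep
```

Everything not in the returned set is safe to prune from S3/B2, which keeps storage bounded while still leaving restore points that reach back a month.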
I also run our entire production system on one beefy machine at OVH, and replicate to a similar machine at Hetzner. In case of a failure, we just change DNS, which has a 1 hour TTL. We've needed to do an unplanned fail-over only once in over 10 years.
And like you, I have an extra replica at the office, because it feels safe having a physical copy of the data literally at hand.
What's most relevant, in your case, would you say? Evil maid attack, or browser zero day, or spear phishing? An insider? Or something else?
(And how did you arrive at that threat model, if I may ask?)
Again, I may be completely wrong, but why would you not use AWS/GCP? Even if it's complexity, Amazon has Lightsail; or if it's cost, I thought DigitalOcean was one of the only reputable business-grade VPS providers.
I just can't imagine many situations where a VPS would be superior to embracing the cloud and using cloud functions, containers, instances with autoscaling/load balancers etc.
I think small businesses like smaller, simpler providers instead of big clouds. It's a different philosophy: if you're afraid of extreme centralisation of the internet, it makes sense.
They also don't charge for bandwidth, although some of those other providers have a generous free bandwidth and cheap overage.
At OVH? If so, their US data centers don't seem to have that option.
Not that I need it. The largest database I run could easily fit in RAM on a reasonably sized dedicated box.
I didn't know.
It's a 4 core 8 thread Xeon with 3x 1TB SATA with 32GB of ECC RAM IIRC (E3-SAT-1-32, got it during a sale with a price that is guaranteed as long as I keep renewing it)
The thing is great: I can run a bunch of VMs on it, and it runs my websites and email.
Overall to get something comparable elsewhere I would be paying 3 to 4 times as much.
I would consider $50 a month or less low end pricing. ¯\_(ツ)_/¯
But I assume they are less known in the US.
OVH is one of the better deals for bare metal, but there are even better ones for bandwidth. You have to shop around a lot.
Also be sure you have a recovery plan... even with the big providers. These days risks include not only physical stuff but some stupid bot shutting you off because it thinks you violated TOS or is reacting to a possibly malicious complaint.
We had a bot at AWS gank some test systems once because it thought we were cryptocurrency mining with free credits. We weren’t, but we were doing very CPU intensive testing. I’ve heard of this and worse happening elsewhere. DDOS detector and IDS bots are particularly notorious.
It has twice the revenue, and is the third largest hosting provider in the world.
In any case, they aren't "primarily for personal use".
I've been making this point for a long time. Both of those AS spaces are legendary for the volume of dodgy packets they barf at the rest of us.
..and they do support all of those things you just listed, not just API-backed bare metal.
Their sub-brand soyoustart has older servers (that are still perfectly fine), roughly E3 Xeon/16-32GB/3x2TB to 4x2TB for $40/m ex vat.
Their other sub brand kimsufi for personal servers has Atom low-power bare metal with 2TB HDD (in reality it is advertised 500GB/1TB, but they don't really have any of those in stock left, if your drive fails they replace it with a 2T - so far this has been my exp) for $5.
All of this is powered by automation, you don't really get any support and you are expected to be competent. If your server is hacked you get PXE-rebooted into a rescue system and can scp/rsync off your contents before your server is reinstalled. OS installs, reboots, provisioning are all automated, there's essentially no human contact.
PS: Scaleway, in Paris, used to offer $2 bare metal (ultra low voltage, weaker than an Atom, 2GB ram), but pulled all their cheap machines, raised prices on existing users, and rebranded as enterprisey. The offer was called 'kidechire'
It is kind of interesting that on the US side everyone is in disbelief, or like "why not use AWS" - while most of the European market knows of OVH, Hetzner, etc.
My own reason for using OVH? It's affordable and I would not have gotten many projects (and the gaming community I help out with) off the ground otherwise. I can rent bare metal with NVMe, and several terabytes of RAM for less than my daily wage for the whole month, and not worry about per-GB billing or attacks. In the gaming world you generally do not ever want to use usage based billing - made the mistake of using Cloudfront and S3 once and banned script kiddies would wget-loop the largest possible file from the most expensive region botnet repeatedly in a money-DoS.
I legitimately wouldn't have been able to do my "for-fun-and-learning" side projects (no funding, no accelerator credits, ...) without someone like them. The equivalent of a digitalocean $1000/m VM is about $100 on OVH.
Like kimsufi equivalent brand "isgenug"
32c xeon/256GB ECC/500GB SSD 8TB HDD is $100/m at OVH. The difference is amusing.
Yann, CEO at Scalingo
(This is 2011, I think it looks fancier now)
Edit: Seems like they stopped publishing videos for that datacenter, but this seems to be a 2013 video of the datacenter that burned down:
Why not use AWS/GCP? From my personal point of view: as a French citizen, I'm more and more convinced that I can't completely trust the (US) big boys for my own safety. Trump showed that "US interest" is far more important than "customer interest" or even "ally interest". And moreover, Google is showing quite regularly that it's not a reliable business partner (AWS looks better in this respect).
Interesting, any examples?
As a customer (or maybe an "involuntary data provider"), I do as much as I can to avoid Google being my SPOF, not technically (it's really technically reliable) but on the business side. I had to set up my own mail server just to avoid any risk of a google-ban, for example... just in case.
I won't use Google Authenticator for the same reason. I'm happy to have left Google Photos some years ago, to avoid problems of Google shutting it down. And the list could go on...
As a business, I like to program Android apps, but the Google Store is really a risk too. There's the risk of having any Google account blacklisted because some algorithm thought I did something wrong. And no appeal.
Maybe all this doesn't apply to GCP customers. Maybe GCP customers have a direct human line, with someone to really help and the capacity to do it. Or maybe it's just Google: as long as it works, enjoy. If it doesn't, go to (algorithmic) hell.
If my money and/or job depended on having something running without (or with minimal) disruption I would be as paranoid as you, too.
BTW - Some people call this business recovery plan, not plain paranoia ;-)
So they might have their primary and replica databases at different DCs from the same hosting provider, and only their nightly backup to a different provider. Four copies to four different providers is a step above three copies with two providers!
A large enterprise would probably be using a filesystem with periodic snapshots, or streaming their redo log to a backup, to protect against a fat-fingered DBA deleting the wrong thing. Of course, filesystem snapshots provide no protection against loss of DC or wrongful hosting account termination, so you might not count them as true backup copies.
e.g., if you are an AWS customer, have your back ups in S3 and use simple replication to sync that to either GCS or Azure, where you can get the same level of compliance attestation as from AWS.
I’ve really struggled working with others that either are both ignorant and apathetic about the business’s ability to deal with risk or believe that it’s their job to keep putting duct tape over the duct tape that breaks multiple times a day while users struggle.
I like seeing these comments reminding others to a wear seat belt or have backups for their backups, but I don’t know whether I should care more about reliability. I work in an environment that’s a constant figurative fire.
I also like to spend time with my family. I know it’s just a job, and it would be even if I were the only one responsible for it; that doesn’t negate the importance of reliability, but there is a balance.
If you are dedicated to reliability, don’t let this deter you. Some have a full gas tank, which is great.
> ... [F]inance is fundamentally about moving money and risk through a network. 
Your employer has taken on many, many risks as part of their enterprise. If every risk is addressed the company likely can't operate profitably. In this context, your business needs to identify every risk, weigh the likelihood and the potential impact, decide whether to address or accept the risk, and finally, if they decide to address the risk, whether to address it in-house or outsource it.
You’ve identified a risk that is currently being “accepted” by your employer, one that you’d like to address in-house. Perhaps they’ve taken on the risk unintentionally, out of ignorance.
As a professional the best I can do is to make sure that the business isn’t ignorant about the risk they’ve taken on. If the risk is too great I might even leave. Beyond that I accept that life is full of risks.
 Gary Gensler, “Blockchain and money”, Introduction https://ocw.mit.edu/courses/sloan-school-of-management/15-s1...
Apathetic people though, that can be really tough going. It’s just that way “because”. Or my favourite “oh we don’t have permission to change that”, how about we make the case and get permission? _horrified looks_ sometimes followed by pitch forks.
At the end of the day you have a budget (of any kind) and a list of priorities on which to spend it. It's up to you or your management to set a reasonable budget, and to set the right priorities. If they refuse, leave or you'll just burn the candle at both ends and just fade out.
A backup on its own is of little worth if it's never used. When a backup is used to bring something back, the time spent down can be reduced. That's what reliability is: keeping things usable and functioning more often than not.
It could also provide a burglar a fantastic opportunity to pivot into career in data breaches.
And as noted, encrypted backups would be resistant to casual interdiction, or even strongly-motivated attempts. Data loss is the principal risk mitigated by off-site, on-hand backups.
There is nothing magical about data centers making them safe while your local copy isn't.
Is this a serious comment? My house is not certified as being compliant with any security standards. Here's the list that the 3rd party datacenter we use is certified as compliant with:
The data centers we operate ourselves are audited against several of those standards too. I guess you're right that there's nothing magic about security controls, but it has nothing to do with trust. Sensitive data should generally never leave a secure facility, outside of particularly controlled circumstances.
You are entirely missing the point by quoting the compliance programs followed by AWS, whose sole business is being a third-party hoster.
For most business, what you call sensitive data is customers and orders listing, payment history, inventory if you are dealing in physical goods and HR related files. These are not state secrets. Encryption and a modicum of physical security go a long way.
I personally find the idea that you shouldn't store a local backup of this kind of data out of security concern entirely laughable. But that's me.
This attitude that your data isn't very important, so it's fine to not be very concerned about its security, while not entirely uncommon, is something most organisations try to avoid when choosing vendors. It's something consumers are generally unconcerned about, until a breach occurs, and The Intercept write an article about it. At which point I'm sure all the people ITT who are saying it's fine to take your production database home would be piling on with how stupid the company was for doing ridiculous things like taking a copy of their production database home.
I said there was nothing magical about data centers security, a point I stand with.
It's all about proper storage (encryption) and physical security. Obviously, the physical security of an AWS data center will be tighter than your typical SME's, but in a way which is of no significance to storing backups.
> This attitude that your data isn’t very important
You are once again missing the point.
It's not that your data isn't important. It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.
The benefits of having easily available backups by far trump the utterly far-fetched idea that someone might break into your office to steal your encrypted backups.
In the SME space some things are "different", and if you've not worked there it can be hard to get one's head around it:
A client of mine was burgled some years ago.
Typical small business, offices on an industrial estate with no residential housing anywhere nearby. Busy in the daytime, quiet as the grave during the night. The attackers came in the wee small hours, broke through the front door (the locks held, the door frame didn't), which must have made quite a bit of noise. The alarm system was faulty and didn't go off (later determined to be a 3rd party alarm installer error...)
All internal doors were unlocked, PCs and laptops were all in plain sight, servers in the "comms room" - that wasn't locked either.
The attacker(s) made a cursory search at every desk, and the only thing that was taken at all was a light commercial vehicle which was parked at the side of the property, its keys had been kept in the top drawer of one of the desks.
The guy who looked after the vehicle - and who'd lost "his" ride - was extremely cross, everyone else (from the MD on downwards) felt like they'd dodged a bullet.
Physical security duly got budget thrown at it - stable doors and horses, the way the world usually turns.
These solutions are not one-size-fits-all. What works for a small startup isn't appropriate for a 100+ person company.
Lots of people plan for specific elements they can imagine and forget other equally or even more important things they are going to need in a disaster. Check out how many organisations that doubtless have 24/7 IT support in case a web server goes down somehow had no plan for what happens if it's unsafe for their 500 call centre employees to sit in tiny cubicles answering phones all day even though pandemic respiratory viruses are so famously likely that Gates listed them consistently as the #1 threat.
Let's take an example which might lead to bankruptcy. A typical answer to a major disaster (let's say your main and sole building burning, as a typical case) for an SME would be to cease activity, furlough employees and stop or defer every payment you can while you claim insurance and assess your options. Well, none of these things are obvious to do, especially if all your archives and documents just burnt. If you think about it (which you should), you will quickly realise that you at least need a way to contact all your employees, your bank and your counsel (which would most likely be the accountant certifying your results rather than a lawyer if you are an SME in my country) offsite. That's the heart of disaster planning: having solutions at the ready for what was easy to foresee so you can better focus on what wasn't.
Yes it is. (Though it's better, as GP suggested, as a final layer of a plan and not the only layer.)
> Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.
Insolvency isn't bankruptcy. Becoming insolvent is a consequence, sure. Bankruptcy absolutely does help you deal with that impact, that's rather the point of it.
Bankruptcy when dealt with correctly is a process not an end.
If everything else fails, it's better to file for bankruptcy while there is still something to recover with the help of others than to burn everything to ashes because of your vanity.
At least that's how I understood parent's comment.
There are two types of bankruptcies in the US used most often by insolvent businesses: Chapter 7, and Chapter 11.
A Chapter 7 bankruptcy is what most people in other countries think of when they hear "bankruptcy" - it's the total dissolution of a business and liquidation of its assets to satisfy its creditors. A business does not survive a Chapter 7. This is often referred to as a "bankruptcy" or "liquidation" in other countries.
A Chapter 11 bankruptcy, on the other hand, is a process by which a business is given court protection from its creditors and allowed to restructure. If the creditors are satisfied with the reorganisation plan (which may include agreeing to change the terms of outstanding debts), the business emerges from Chapter 11 protection and is allowed to continue operating. Otherwise, if an agreement can't be reached, the business may end up in Chapter 7 and get liquidated. Most countries have an equivalent to a Chapter 11, but the name for it varies widely. For example, Canada calls it a "Division 1 Proposal," Australia and the UK call it "administration," and Ireland calls it "examinership."
Since there's a lot of international visitors to HN I just thought I'd jump in and provide a bit of clarity so we can all ensure we're using the same definition of "bankruptcy." A US Chapter 7 bankruptcy is not a plan, it's the game over state. A US Chapter 11 bankruptcy, on the other hand, can definitely be a strategic maneuver when you're in serious trouble, so it can be part of the plan (hopefully far down the list).
I wondered why you would plan for an event which will "destroy" your company anyway.
Yes, that's why "Go bankrupt" is not a plan which was the entire point of my reply. That's like saying that your disaster recovery plan is "solve the disaster".
Fast Fashion for example often employs workers in more or less sweatshop conditions close to the customers (this makes commercial sense, if you make the hot new items in Bangladesh you either need to expensively air freight them to customers or they're going to take weeks to arrive after they're first ordered - there's a reason it isn't called "Slow fashion"). These jobs are poorly paid, many workers have dubious right-to-work status, weak local language skills, may even be paid in cash - and so if you tell them they must come in, none of them are going to say "No".
In fact the slackening off in R for the area where my sister lives (today the towering chimneys and cavernous brick factories are just for tourists, your new dress was made in an anonymous single story building on an industrial estate) might be driven more by people not needing to own new frocks every week when they've been no further than their kitchen in a month than because it would actually be illegal to staff their business - if nobody's buying what you make then suddenly it makes sense to take a handout from the government and actually shut rather than pretend making mauve turtleneck sweaters or whatever is "essential".
For these major routes, there are typically at least bi-weekly voyages scheduled, so for this kind of distance, you can expect about 11 days pretty uniformly distributed +-2 days, if you pay to get on the next ship.
This may mean (committing to) paying for the spot on the ship when your pallet is ready for pickup at the factory (not when it arrives at the port) and using low-delay overland trucking services.
Which operate e.g. in lockstep with the port processing to get your pallet on the move within half a day of the container being unloaded from the ship, ideally having containers pre-sorted at the origin to match truck routes at the destination.
So they can go on a trailer directly from the ship and rotate drivers on the delivery tour, spending only a few minutes at each drop-off.
Because those can't rely on customers to be there and get you unloaded in less than 5 minutes, they need locations they can unload at with on-board equipment. They'd notify the customer with a GPS-based ETA display, so the customer can be ready and immediately move the delivery inside.
Rely on 360-degree "dashcam" coverage and encourage the customer to have the drop-off point under video surveillance, just to easily handle potential disputes. Have the delivery person use some suitable high-res camera with a built-in light to get some full-surface-coverage photographic evidence of the condition it was delivered in.
I'd guess with a hydraulic lift on the trailer's back and some kind of folding manual pallet jack stuck on that (fold-up) lift. They drive up to the location, unlock the pallet jack, unfold the lift, lower the lift almost to the ground, detach the pallet jack to drop it the last inch/few cm to the ground, pull the jack out, lower the lift the rest of the way, drive it onto the lift, open the container, go up with the pallet jack, drive the pallets (one by one) for this drop-off out of the container and leave them on the ground, close and lock the container, re-arm the jack's hooks, shove the jack back under the slightly-lowered folding lift, make it hook back in, fold it up, lock the hooking mechanism (against theft at a rest stop - short meal and toilet breaks exist, but showering can be delayed for up to 2 nights), fold it all the way up, and drive on to their next drop-off point.
So yes, getting insurance can be a good idea to offset some losses you may have, as long as they are somewhat limited compared to your company's overall assets and income. But as soon as the insurance payout matches a significant part of your net worth, the insurance might not save you.
As such, it makes sense to make the level of risk you plan to accept (by not being insured against it and not mitigating) a conscious economic decision rather than pretending you've covered everything.
The big example of this which springs to mind is business interruption cover - it's ruinously expensive so it's extremely unusual to have the max cover the market might be prepared to offer. It's a pure economic decision.
Usually you'd have to show your homework (offers from insurance companies proving that it really is unaffordable). I totally get the trade-off, and the fact that if the business couldn't exist while properly insured, plenty of companies will simply take their chances.
We also both know that in case something like that does go wrong everybody will be looking for a scapegoat, so for the CEO's own protection it is quite important to play such things by the book, on the off chance the risk one day does materialize.
This has killed quite a few otherwise very viable companies, it is fine to take risks as long as you do so consciously and with full approval of all stakeholders (or at least: a majority of all stakeholders). Interesting effects can result: a smaller investor may demand indemnification, then one by one the others also want that indemnification and ultimately the decision is made that the risk is unacceptable anyway (I've seen this play out), other variations are that one shareholder ends up being bought out because they have a different risk appetite than the others.
Fires in DCs aren't rare at all; I know of at least three, one of them in a building where I had servers. This one seems to be worse than the other two. Datacenters concentrate a lot of flammable stuff, run a ton of current through it, and do so 24x7. The risk of a fire is definitely not imaginary, which is why most DCs have fire suppression mechanisms; whether those work as advertised depends on the nature of the fire. An exploding on-prem transformer took out a good chunk of EV1's datacenter in the early 2000s, and it wasn't so much the fire that caused problems for their customers as the fact that someone got injured (or even died, I don't recall exactly), and it took a long time before the investigation was completed and the DC was released to the owners again.
Being paranoid and having off-site backups is what allowed us to be back online before the fire was out. If not for that I don't know if our company would have survived.
Looks glowing red to me.
Container DCs were a big thing for a while. Even Google did a whole PR thing about how they used them.
You can see more details here: https://baxtel.com/data-center/ovh-strasbourg-campus
Hot spare on a different continent with replicated data along with a third box just for backups. The backup box gets offsite backups held in a safe with another redundant copy in another site in another safe.
Restores are tested quarterly.
Keep backups of backups. Once bitten, twice shy.
Probably this is the most important part of your plan. It's not the backup that matters; it's the restore. And if you don't practice it from time to time, it's probably not going to work when you need it.
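That "practice the restore" point can be made concrete with even a tiny scripted drill. A minimal sketch, assuming tar-based archives - the `customers.csv` name and the paths are placeholders, and a fake archive is built inline just so the example is self-contained:

```shell
#!/bin/sh
# Minimal restore drill, as a function so it can run from cron: restore
# the archive into a scratch dir and fail loudly if a known-critical
# file is missing or empty.
drill() {
    archive="$1"
    scratch=$(mktemp -d)
    tar -C "$scratch" -xzf "$archive"
    if test -s "$scratch/customers.csv"; then
        echo "restore check passed"
    else
        echo "restore check FAILED" >&2
        rm -rf "$scratch"
        return 1
    fi
    rm -rf "$scratch"
}

# Self-contained demo: build a tiny fake backup, then drill against it.
# In a real quarterly test the archive comes from the offsite store.
work=$(mktemp -d)
echo "id,name" > "$work/customers.csv"
tar -C "$work" -czf "$work.tar.gz" .
drill "$work.tar.gz"
rm -rf "$work" "$work.tar.gz"
```

On a real schedule the check would be whatever proves the data is actually usable (row counts, checksums), not just that a file exists.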
Just be cautious about data locality laws (not likely to affect you as joe average, more for businesses)
This was so we remained legal in all of the countries BT worked in, which required a lot of behind-the-scenes work to make sure we didn't serve "illegally encrypted" data.
For example, the Russian data residency law states that a copy of the data must be stored domestically, not that it can't be replicated outside the country.
The UAE has poorly written laws that have different regulations for different types of data - including fun stuff like only being subject to specific requirements if the data enters a 270 acre business park in Dubai.
Don't even get me started on storing encrypted data in one country and the keys in another...
Turns out, our disaster recovery plan is pretty good.
Datacenter burned down and I still was up 4 hours later in another data center with zero data loss. Good times.
Documentation / SOPs that have been tested thoroughly by various team members are really important. It helps work out any kinks in interpretation, syntax errors etc.
It does feel a little ridiculous at the time for all the effort involved, but incidents like this show why it's so important.
As an immediate plan, the 2-3 business-critical systems are replicating their primary storage to systems in a different datacenter. This allows us to kick off the configuration management in a disaster, and we need somewhere between 1 and 4 hours to set up the necessary application servers and middleware to get critical production running again.
Regarding backups, backups are archived daily to 2 different borg repo hosts on different cloud providers. We could lose an entire hoster to shenanigans and the damage would be limited to ~2 days of data loss at worst. Later this year, we're also considering exporting some of these archives to our sister team, so they can place a monthly or weekly backup on tape in a safe in order to have a proper offline backup.
Regarding restores - there are daily automated restore tests for our prod databases, which are then used for a bunch of other tests after anonymization. On top, we've built most database handling on top of the backup/restore infra in order to force us to test these restores during normal business processes.
As I keep saying, installing a database is not hard. Making backups also isn't hard. Ensuring you can restore backups, and ensuring you are not losing backups almost regardless of what happens... that's hard and expensive.
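For readers unfamiliar with borg, a daily cycle like the one described above might look roughly like this. The hostnames, repo paths, and retention numbers are made up for illustration, and the `run` wrapper only prints the commands so the sketch is inert - replace the echo with `"$@"` to execute for real:

```shell
#!/bin/sh
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

DATE=$(date -u +%Y-%m-%d)

# One dated archive per day, pushed to two independent repo hosts, so
# losing an entire hoster costs at most a couple of days of data.
for repo in "borg@host-a:backups" "borg@host-b:backups"; do
    run borg create "$repo::daily-$DATE" /srv/data
    # Bounded history so the repos don't grow forever.
    run borg prune --keep-daily 14 --keep-weekly 8 "$repo"
done
```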
* All my services are dockerized and have gitlab pipelines to deploy on a kubernetes cluster (RKE/K3s/baremetal-k8s)
* git repo's containing the build scripts/pipelines are replicated on my gitlab instance and multiple work computers (laptop & desktop)
* Data and databases are regularly dumped and stored in S3 and my home server
* Most of the infrastructure setup (AWS/DO/Azure, installing kubernetes) is in Terraform git repositories. And a bit of Ansible for some older projects.
Because of the above, if anything happens all I need to restore a service is a fresh blank VM/dedicated machine or a cloud account with a hosted Kubernetes offering. From there it's just configuring terraform/ansible variables with the new hosts and executing the scripts.
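Sketched as a single recovery script, assuming a Terraform + Ansible + GitLab setup like the one above (all names are placeholders, and `run` only prints the commands so nothing here actually executes):

```shell
#!/bin/sh
run() { echo "+ $*"; }   # inert; replace echo with "$@" to execute

# 1. Recreate the infrastructure from the Terraform repo.
run terraform init
run terraform apply -auto-approve

# 2. Configure the fresh hosts from the Ansible repo.
run ansible-playbook -i inventory/new-hosts.ini site.yml

# 3. Trigger the GitLab pipelines to redeploy the services.
run git push gitlab main
```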
The SBG fire illustrates the importance of geographical redundancy. Just because the datacenters have different numbers at the end doesn't mean that they won't fail at the same time. Apart from a large fire or power outage, there are lots of things that can take out several datacenters in close vicinity at the same time, such as hurricanes and earthquakes.
pretty much a textbook use-case for zfs with some kind of snapshot-rolling utility. Snap every hour, send backups once a day, prune your backups according to some timetable. Transfer as incrementals against the previous stored snapshot. Plus you get great data integrity checking on top of that.
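A rough shape of that snap-and-send cycle - dataset and host names are placeholders, the previous-snapshot lookup is shown as a comment, and `run` only prints the zfs commands rather than executing them:

```shell
#!/bin/sh
run() { echo "+ $*"; }   # inert; replace echo with "$@" on a real pool

DATASET="tank/data"

# Timestamped names sort chronologically, so the previous snapshot is
# easy to pick out of `zfs list` later.
snapname() {
    printf '%s@auto-%s\n' "$DATASET" "$(date -u +%Y%m%d-%H%M)"
}

NEW=$(snapname)
run zfs snapshot "$NEW"

# On a real pool the incremental base comes from the snapshot list:
#   PREV=$(zfs list -H -t snapshot -o name -s creation "$DATASET" | tail -2 | head -1)
PREV="$DATASET@auto-20210309-0000"

# Send only the delta since the previous snapshot; in practice the
# stream is piped to the backup box:
#   zfs send -i "$PREV" "$NEW" | ssh backupbox zfs receive -F backup/data
run zfs send -i "$PREV" "$NEW"
```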
"but linus said..."
Yes, I still don't understand him, as he calls himself a "filesystem guy". I also don't understand why no one ever mentions NILFS2.
The draw of ZFS is that it's the copy-on-write filesystem with 10 zillionty hours of production experience that says that it works. And that's why BTRFS is not a direct substitute either. Or Hammer2. There are lots of things that could be cool; the question is whether you are willing to run them in production.
There is a first-mover advantage in filesystems (that occupy a given design and provide a given set of capabilities). At some point a winner sucks most of the oxygen out of the atmosphere here. There is maybe space for a second place winner (btrfs), there isn't a spot for a fourth-place winner.
It's very easy to use spare storage in various places to do backups this way, as ssh, gpg and cron are everywhere, and you don't need to install any complicated backup solutions or trust the backup storage machines much.
All you have to manage centrally is private keys for backup encryption, and CA for signing the ssh keys + some occasional monitoring/tests.
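As a sketch of how little machinery that needs - the recipient, host, and paths below are placeholders, and the actual cron line is shown as a comment:

```shell
#!/bin/sh
# The only moving part worth factoring out: a per-host, per-day archive name.
backup_name() {
    printf '%s-%s.tar.gpg\n' "$(hostname)" "$(date -u +%Y%m%d)"
}

# The whole backup is one cron pipeline: tar the tree, encrypt to the
# backup public key *before* anything leaves the machine, stream over
# ssh. The storage box only ever sees ciphertext, so it needs little trust.
#   0 3 * * *  tar -C /srv/data -cf - . \
#                | gpg --encrypt --recipient backup@example.org \
#                | ssh spare-box "cat > backups/$(backup_name)"
echo "next archive: $(backup_name)"
```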
Restic has a single binary that takes care of everything. It feels more modern and seems to work really well. Never had any issue restoring from it.
Just one data point. Stick to whatever works for you. But important to test not only your backups, but also restores!
I've found duplicity to be a little simplistic and brittle. Purging old backups is also difficult, you basically have to make a full backup (i.e. non-incremental) before you can do that, which increases bandwidth and storage cost.
Restic looks great feature-wise, but still feels like the low-level component you'd use to build a backup system, not a backup system in itself. It's also pre-1.0.
GPG is a bit quirky but I do regularly check my backups and restores (if once every few months counts as regular).
It's brilliant - it has worked like a charm on FreeBSD, Windows, and an RPi with Linux for over 2 years.
Have moved to using rdiff-backup over SSH.
I use S3 API compatible object storage platforms for remote backup. E.g. BackBlaze B2. I wrote about my backup scripts for FreeNAS (jail that runs s3cmd to copy files to B2) here: https://www.shogan.co.uk/cloud-2/cheap-s3-cloud-backup-with-...
For Kubernetes I use velero, which can be configured with an S3 storage backend target: https://www.shogan.co.uk/kubernetes/kubernetes-backup-on-ras...
My employer has a single rack of servers at HQ. It's positioned at a very specific angle with an AC unit facing it, their exact positions are marked out on the floor in tape. The servers contain VMs that most employees work on, our git repository, issue trackers, and probably customer admin as well. They say they do off-site backups, but honestly, when (not if) that thing goes it'll be a pretty serious impact on the business. They don't like people keeping their code on their take-home laptop either (I can't fathom how my colleagues work and how they can stand working in a large codebase using barebones vim over ssh), but I've employed some professional disobedience there.
I'm not sure how normal this strategy is outside of container land but I like just using scripts, they are simple and transparent - if you take time and care to write them well.
They basically install and configure packages using sed or heredocs with a few user prompts here and there for setting up domains etc.
If you are constantly tweaking stuff this might not suit you, but if you know what you need and only occasionally do light changes (which you must ensure the scripts reflect) then this could be an option for you.
It does take some care to write reliable clear bash scripts, and there are some critical choices like `set -e` so that you can walk away and have it hit the end and know that it didn't just error in the middle without you noticing.
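A minimal skeleton of that style, with the error-handling choices spelled out (the `ensure_line` helper and the config line are illustrative placeholders, not from the comment above):

```shell
#!/usr/bin/env bash
# Strict mode: exit on error, on unset variables, and on failures inside
# pipelines, so the script can't sail past a failed step unnoticed.
set -euo pipefail

# Say where it died instead of failing silently.
trap 'echo "failed at line $LINENO" >&2' ERR

# Idempotent config edit: append the line only if it isn't there yet,
# so re-running the script is safe.
ensure_line() {
    local line="$1" file="$2"
    grep -qxF "$line" "$file" || echo "$line" >> "$file"
}

# Demo against a temp file (path and setting are placeholders).
CONF=$(mktemp)
ensure_line "max_connections = 100" "$CONF"
ensure_line "max_connections = 100" "$CONF"   # no-op on the second run
cat "$CONF"   # the line appears exactly once
rm -f "$CONF"
```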
Our goal and the reason we have a lot of stuff backed up on-prem is to have our most time-critical operations back up within a couple of hours - unless the building is destroyed, in which case that's a moot point and we'll take what we can get.
A dev wiped our almost-monolithic sales/manufacturing/billing/etc MySQL database a month or two ago. (I have been repeatedly overruled on the topic of taking access to prod away from devs) We were down for around an hour. Most of that time was spent pulling gigs of data out of the binlog without also wiping it all again. Because our nightly backups had failed a couple weeks prior - after our most recent monthly "glance at it".
Currently trying to reduce that "less than a day", though. Recently discovered "ReaR" (Relax and Recover) from RedHat, and it sounds really nice for bare-metal servers. Not everybody runs virtualized/cloud (being able to recover from VM snapshots is really a plus). Let's share experiences :)
It's not the best in terms of Disaster Recovery Plan but we accept that level of risk.
I was affected: my personal VPS is safe but down, and I don't know anything about the other VPSes I was managing. I have the backups, and right now I'd love for them to just set me up a new VPS so I can restore the backups and get the services back up.
But I have a friend who potentially lost important uni work hosted on his Nextcloud instance... on SBG2.
A rough reminder that backups are really important, even if you are just an individual
They both run postfix + dovecot, so mail is synced via dovecot replication. Data is rsync-ed daily, and everything has ZFS snapshots. MySQL is not set up for replication - my home internet breaks often enough to cause serious issues - so instead I drop everything every day, import a full dump from the main server, and do a local dump as backup on both sides.
I don't have automatic failover set up.
But when you already use ZFS you can do a very speedy full backup with:
mysql << EOF
FLUSH TABLES WITH READ LOCK;
system zfs snapshot data/db@snapname
UNLOCK TABLES;
EOF
You do have backup servers.
(My servers aren't in SBG either - phew!)
There's a whole class of (mostly non-technical) risks that you solve for when you do this.
If anything happens with your payment method (fails and you don't notice in time; all accounts frozen for investigation), OVH account (hacked, suspended), OVH itself (sudden bankruptcy?), etc, then at least you have _one_ other copy. It's not stuff that's likely to happen, but the cost of planning for it at least as far as "haven't completely lost all my data even if it's going to be a pain to restore" here is relatively minimal.
Rolling backups with a month retention to box using rsync.
It creates a network drive to box by default when I boot my desktop.
I have some scripts for putting production DBs in test and for pulling them locally when I want them.
Everything is managed with Ansible and Terraform (on DO side), so I could probably get everything back up and running in less than an hour if needed.
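For anyone wanting to replicate the rolling-retention part of the setup above: dated directories plus hard-links keep a month of dailies cheap. Paths are placeholders, the rsync line is shown as a comment (hard-links require the target filesystem to support them), and only the prune step actually runs here:

```shell
#!/bin/sh
# Stand-in for the mounted network drive; a temp dir so the sketch runs.
ROOT=$(mktemp -d)
TODAY=$(date -u +%Y-%m-%d)
mkdir -p "$ROOT/$TODAY"

# Real daily sync: unchanged files hard-link against yesterday's copy,
# so a month of dailies costs roughly one full copy plus the churn.
#   rsync -a --delete --link-dest="$ROOT/$YESTERDAY" "$HOME/" "$ROOT/$TODAY/"

# Retention: drop dated directories older than 31 days.
find "$ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +31 -exec rm -rf {} +

ls "$ROOT"    # today's directory survives the prune
rm -rf "$ROOT"
```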
That makes it sound like you didn't try/practice. I imagine that in a real-life scenario things will be a little more painful than in one's imagination.
Even if it is possible to trigger and test, actually using the recovery mechanism may have some high cost, either monetary or maybe losing some small amount of data. These mechanisms should almost always be an additional layer of defense and only be invoked in case of true catastrophe.
In both cases, the mechanisms should be tested as thoroughly as possible, either through artificial environments that can simulate improbable scenarios or, in the latter case, on a small test environment to minimize cost.
I personally have almost all of the software running in containers with an orchestrator on top (Docker Swarm in my case, others may also use Nomad, Kubernetes or something else). That way, rescheduling services on different nodes becomes less of a hassle in case of any one of them failing, since i know what should be running and what configuration i expect it to have, as well as what data needs to be persisted.
At the moment i'm using Time4VPS ( affiliate link: https://www.time4vps.com/?affid=5294 ) for the stuff that needs decent availability and because they're cheaper than almost all of the alternatives i've looked at (DigitalOcean, Vultr, Scaleway, AWS, Azure) and that matters to me.
Now, in case the entire data centre disappears, all of my data would still be available on a few HDDs under my desk (which are then replicated to other HDDs with rsync locally), given that i use BackupPC for incremental scheduled backups with rsync: https://backuppc.github.io/backuppc/
For simplicity, the containers also use bind mounts, so all of the data is readable directly from the file system, for example, under /docker (not really following some of the *nix file system layout practices, but this works for me because it's really easy to tell where the data that i want is).
I actually had to migrate over to a new node a while back, took around 30 minutes in total (updating DNS records included). Ansible can also really help with configuring new nodes. I'm not saying that my setup would work for most people or even anything past startups, but it seems sufficient for my homelab/VPS needs.
- containers are pretty useful for reproducing software across servers
- knowing exactly which data you want to preserve (such as /var/lib/postgresql/data/pgdata) is also pretty useful, even though a lot of software doesn't really play nicely with the idea
- backups and incremental backups are pretty doable even without relying on a particular platform's offerings, BackupPC is more than competent and buying HDDs is far more cost effective than renting that space
- automatic failover (both DNS and moving the data to a new node) seems complicated, as does using distributed file systems; those are probably useful but far beyond what i actually want to spend time on in my homelab
- you should still check your backups
Prayer and hope, usually.
> Printer flammability
Eerie timing: do they possibly suspect some bad cables?
Why not? Cables rated lower than the load they are carrying are a prime cause of electrical fires. If the load is too high for long enough, the insulation melts away, and if other material is close enough to catch fire then that's the ball game. It's a common cause of home electrical fires - some lamp with poor wiring sets the drapes on fire, etc. Wouldn't think a data center would have flammable curtains though.
http://travaux.ovh.net/?do=details&id=47840 - the earliest one I found was back in December.
At best you might be missing out on some SLA refunds, but at worst it could be disastrous for a business. I've been on the wrong side of an update-by-hand status system from a hosting provider before, and it wasn't fun.
Agreed, though. A fake status page is worse than no status page. I don't mind if the status page states that it's manually updated every few hours as long as it's honest. But don't make it look like it's automated when it's not.
> Unable to display this content due to missing consent.
By law, we are required to ask your consent to show the content that is normally displayed here.
I've usually seen this with embedded videos rather than comments.
My gut says yes.
They are not the only ones though. All too common. Well, it's tricky to set this up properly. The only proper way would be to use external infra for the status page.
So nobody chooses to make an honest status page.
But, they're usually manual affairs because sometimes the system is broken even when the healthcheck looks ok, and sometimes writing the healthcheck is tricky, and always you want the status page disconnected from the rest of the system as much as possible.
It is a challenge to get 'update the status page' into the runbook. Especially for runbooks you don't review often (like the one for the building is on fire, probably).
Luckily my status page was not quite public; we could show a note when people were trying to write a customer service email in the app; if you forget to update that, you get more email, but nobody posts the system is down and the status page says everything is ok.
I can appreciate an honest mistake though, like the status page server cron is hosted in the same cluster that caught fire and hence it burnt down and can't update the page anymore.
> The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. If your production is in Strasbourg, we recommend to activate your Disaster Recovery Plan
What more could you want?
What's the point of a status page then if it does not show you the status? I don't want to be chasing down twitter handles and support pages during an outage.
I wonder if a server fire would cause Amazon to go to status red. So far anything and everything has fallen under yellow.
No traffic whatsoever between sbg-g1 and sbg-g2 and their peers.
I think this is probably linked to a manual reporting system, and they have bigger fish to fry at the moment than updating this status page.
If they don't have any servers anymore, how can they be down ;)
Probably the best they could do at the moment.