Back in the late 90s, I implemented the first systematic monitoring of WalMart Stores' global network, including all of the store routers, hubs (not switches yet!) and 900 MHz access points.
Did you know that WalMart had some stores in Indonesia? They did until 1998.
Around that same time, the equipment in the Jakarta store started sending high temperature alerts prior to going offline. Our NOC wasn't able to reach anyone in the store.
The alerts were quite accurate: that was one of the many buildings that had been burned down in the riots. I guess it's slightly surprising that electrical power to the equipment in question lasted long enough to allow temperature alerting. Most of our stores back then used satellite for their permanent network connection, so it's possible telecom died prior to the fire reaching the UPC office.
In a couple of prominent places in the home office, there were large cutouts of all of the countries WalMart was in at the time up on the walls. A couple of weeks after this event, the Indonesia one was taken down over the weekend and the others re-arranged.
Thanks for sharing this interesting story. Part of my family immigrated from Indonesia due to those riots, but I was unaware up until today of the details covered by the Wikipedia article you linked.
I remember during the 2000s and 2010s that WalMart in the USA earned a reputation for its inventories primarily consisting of Chinese-made goods. I'm not sure if that reputation goes all the way back to 1998, but it makes me wonder if WalMart was especially targeted by the anti-Chinese element of the Indonesian riots because of it.
I can't recall (and probably didn't know at the time...it was far from my area) where products were sourced for the Indonesia stores.
Prior to the early 2000s, WalMart had a strong 'buy American' push. It was even in their advertising at the time, and literally written on the walls at the home office in Bentonville.
Realities changed, though, as whole classes of products were more frequently simply not available from the United States, and that policy and advertising approach were quietly dropped.
Just for the hell of it, I did a quick youtube search: "walmart buy american advertisement" and this came up: https://www.youtube.com/watch?v=XG-GqDeLfI4 "Buy American - Walmart Ad". Description says it's from the 1980s, and that looks about right.
What the hell, here's another story. The summary to catch your attention: in the early 2000s, I first became aware of WalMart's full scale switch to product sourcing from China by noting some very unusual automated network to site mappings.
Part of what my team (Network Management) did was write code and tools to automate all of the various things that needed to be done with networking gear. A big piece of that was automatically discovering the network. Prior to our auto discovery work, there was no good data source for or inventory of the routers, hubs, switches, cache engines, access points, load balancers, VOIP controllers...you name it.
On the surface, it seems scandalous that we didn't know what was on our own network, but in reality, short of comprehensive and accurate auto discovery, there was no way to keep track of everything, for a number of reasons.
First was the staggering scope: when I left the team, there were 180,000 network devices handling the traffic for tens of millions of end nodes across nearly 5,000 stores, hundreds of distribution centers and hundreds of home office sites/buildings in well over a dozen countries. The main US Home Office in Bentonville, Arkansas was responsible for managing all of this gear, even as many of the international home offices were responsible for buying and scheduling the installation of the same gear.
At any given time, there were a dozen store network equipment rollouts ongoing, where a 'rollout' is having people visit some large percentage of stores intending to make some kind of physical change: installing new hardware, removing old equipment, adding cards to existing gear, etc.
If store 1234 in Lexington, Kentucky (I remember because it was my favorite unofficial 'test' store :) was to get some new switches installed, we would probably not know what day or time the tech doing the work was going to arrive.
ANYway...all that adds up to thousands of people coming in and messing with our physical network, at all hours of the day and night, all over the world, constantly.
Robust and automated discovery of the network was a must, and my team implemented that. The raw network discovery tool was called Drake, named after this guy: https://en.wikipedia.org/wiki/Francis_Drake and the tool that used many automatic and manual rules and heuristics to map the discovered networking devices to logical sites (ie, Store 1234, US) was called Atlas, named after this guy: https://en.wikipedia.org/wiki/Atlas_(mythology)
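To give a rough feel for the shape of those two tools: discovery is essentially a graph walk outward from seed devices, and site mapping is a pile of rules applied to whatever the walk finds. The sketch below is illustrative Python, not the actual Drake/Atlas code (the real tooling was Perl/C/C++ and vastly more involved); the hostname convention and the neighbor lookup are invented stand-ins.

    import re
    from collections import deque
    from typing import Callable, Dict, Iterable, Set

    def discover(seeds: Iterable[str],
                 neighbors_of: Callable[[str], Iterable[str]]) -> Set[str]:
        """Walk the network outward from seed devices, never revisiting one."""
        seen: Set[str] = set()
        queue = deque(seeds)
        while queue:
            device = queue.popleft()
            if device in seen:
                continue
            seen.add(device)
            queue.extend(n for n in neighbors_of(device) if n not in seen)
        return seen

    # Hypothetical naming convention: "st1234-us-sw01" -> (Store 1234, US).
    SITE_RULE = re.compile(r"^st(?P<store>\d+)-(?P<country>[a-z]{2})-", re.I)

    def map_to_site(hostname: str) -> str:
        m = SITE_RULE.match(hostname)
        return f"Store {m['store']}, {m['country'].upper()}" if m else "UNMAPPED"

    if __name__ == "__main__":
        fake_topology: Dict[str, list] = {   # stand-in for live CDP/SNMP data
            "st1234-us-rtr01": ["st1234-us-sw01", "st1234-us-sw02"],
            "st1234-us-sw01": [], "st1234-us-sw02": [],
        }
        for dev in sorted(discover(["st1234-us-rtr01"],
                                   lambda d: fake_topology.get(d, []))):
            print(dev, "->", map_to_site(dev))

In those terms, a mapping 'leak' (more on that below) is just the walk, or a rule it feeds, matching further than it should.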
All of that background aside, the interesting story.
In the late 90s and early 2000s, Drake and Atlas were doing their thing, generally quite well and with only a fairly small amount of care and feeding required. I was snooping around and noticed that a particular site of type International Home Office had grown enormously over the course of a few years. When I looked, it had hundreds of network devices and tens of thousands of nodes. This was around 2001 or 2002, and at that time, I knew that only US Home Office sites should have that many devices, and thought it likely that Atlas had a 'leak'. That is, as Atlas did its recursive site mapping work, sometimes the recursion would expand much further than it should, and incorrectly map things.
After looking at the data, it all seemed fine. So I made some inquiries, and lo and behold, that particular international home office site had indeed been growing explosively.
In the early 2000s I was working as a field engineer installing/replacing/fixing network equipment for Walmart at all hours. It's pretty neat to hear the other side of the process! If I remember correctly there was some policy that would automatically turn off switch ports that found new, unrecognized devices active on the network for an extended period of time, which meant store managers complaining to me about voip phones that didn't function when moved or replaced.
Ah neat, so you were an NCR tech! (I peeked at your comment history a bit.) My team and broader department spent a lot of hours working with, sometimes not in the most friendly terms, people at different levels in the NCR organization.
You're correct, if Drake (the always running discovery engine) didn't detect a device on a given port over a long enough time, then another program would shut that port down. This was nominally done for PCI compliance, but of course having open, un-used ports especially in the field is just a terrible security hole in general.
In order to support legit equipment moves, we created a number of tools that the NOC and I believe Field Support could use to re-open ports as needed. I think we eventually made something that authorized in-store people could use too.
As an aside, a port being operationally 'up' wasn't by itself sufficient for us to mark the port as being legitimately used. We had to see traffic coming from it as well.
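To make that concrete, here's a toy Python sketch of the per-port decision, not the real tooling (that was Perl and hooked into the discovery engine's polling); the snmp_get helper, the grace period and the bookkeeping are illustrative, though the OIDs are the standard IF-MIB ones.

    import time
    from typing import Callable, Dict, Tuple

    IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"   # IF-MIB ifOperStatus (1 = up)
    IF_IN_OCTETS   = "1.3.6.1.2.1.2.2.1.10"  # IF-MIB ifInOctets

    GRACE_SECONDS = 14 * 24 * 3600  # how long a port may sit idle before shutdown

    def should_shut(snmp_get: Callable[[str, str], int],
                    switch: str, if_index: int,
                    state: Dict[Tuple[str, int], Tuple[int, float]]) -> bool:
        """Return True if the port has been idle past the grace period."""
        key = (switch, if_index)
        oper = snmp_get(switch, f"{IF_OPER_STATUS}.{if_index}")
        octets = snmp_get(switch, f"{IF_IN_OCTETS}.{if_index}")
        prev_octets, last_active = state.get(key, (octets, time.time()))
        if oper == 1 and octets != prev_octets:
            state[key] = (octets, time.time())   # up AND passing traffic: in use
            return False
        state[key] = (octets, last_active)       # idle: keep the old timestamp
        return time.time() - last_active > GRACE_SECONDS

A real version also has to cope with counter wraps, device reboots and the legitimate-move workflow mentioned above.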
You mentioned elsewhere that you're working with a big, legacy Perl application, porting it to Python. 99% of the software my team at WalMart built was in Perl. (: I'd be curious to know, if you can share, what company/product you were working on.
The NCR/Walmart relationship was fairly strained during my tenure. Given the sheer number of stores/sites that Walmart had and NCR's own problems, it was not always possible to provide the quality of service people might expect, especially with a smile. From a FE perspective, working on networking gear at Walmart meant that you were out at 11pm at night (typically after working 10-11 hours already and spending an hour or two onsite waiting for the part to arrive via courier) and your primary concern was to get the job done and get back home. The worst was to plug a switch in, watch it not power up, and realize you'd need to be back in the same spot three or four hours later to try again.
Walmart must have been an interesting place to work during the late 90s, early 2000s - I imagine that most everywhere they had to solve problems at scale before scale was considered a thing. I'd be very interested to see how the solutions created in that period match to best-practices today, especially since outside of the telecom or perhaps defense worlds there probably wasn't much prior art.
As for the Perl application, I probably shouldn't say since I'm still employed at the same company and I know coworkers who read HN. If you're interested, DM me and I can at least provide the company name and some basic details.
> The NCR/Walmart relationship was fairly strained ...
Definitely. (: I didn't hold it against the hands on workers like yourself. Even (and perhaps especially) back then, WalMart was a challenging, difficult and aggressive partner.
> working on networking gear at Walmart meant that you were out at 11pm at night
That sounds about right; the scheduling I was directly aware of was very fast paced. Our Network Engineering Store Team pushed and pushed and pushed, just as they were pushed and pushed and pushed.
> Walmart must have been an interesting place to work during the late 90s, early 2000
Yup. Nowhere I've worked before or since had me learning as much or getting nearly as much done. It was an amazingly positive experience for me and my team, but not so positive for a lot of others.
> I imagine that most everywhere they had to solve problems at scale before scale was considered a thing.
Sometimes I imagine writing a book about this, because it's absolutely true, all over Information Systems Division.
For a time in the early 2000s, we were, on average, opening up a new store every day, and a typical new store would have two routers, two VOIP routers, two cache engines, between 10 and 20 switches, two or four wireless access point controllers and dozens of AP endpoints. That was managed by one or two people on the Network Engineering side, so my team (Network Management) wrote automation that generated the configs, validated connections, uploaded configs, etc etc etc. (Not one or two people per store: one or two people for ALL of the new stores.)
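The config generation part of that is conceptually just templating plus trustworthy inventory data; here's a minimal, made-up sketch of the pattern (the real thing was Perl and fed by actual inventory and addressing data; everything below is invented for illustration):

    from string import Template

    ROUTER_TEMPLATE = Template("""\
    hostname st${store}-${country}-rtr${unit}
    !
    interface Loopback0
     ip address ${loopback} 255.255.255.255
    !
    snmp-server community ${community} RO
    """)

    def store_router_config(store: int, country: str, unit: int,
                            loopback: str, community: str) -> str:
        return ROUTER_TEMPLATE.substitute(store=store, country=country, unit=unit,
                                          loopback=loopback, community=community)

    if __name__ == "__main__":
        # Entirely made-up addressing scheme, for illustration only.
        print(store_router_config(1234, "us", 1, "10.12.34.1", "notpublic"))

The hard part isn't the template; it's keeping the inventory, validation and upload steps around it reliable enough that one or two people can safely handle every new store.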
The networking equipment was managed by a level of automation that is pretty close to what one sees inside of Google or Facebook today, and we were doing it 20 years ago.
> ... telecom ... prior art ...
John Chambers, the long-time CEO of Cisco, was at the time on WalMart's board of directors. He was always a bit of a tech head, and so when he came to Bentonville for board meetings, he'd often come and visit us in Network Engineering.
Around 2001-2002, we were chatting with him and he asked why we weren't using Cisco Works (https://en.wikipedia.org/wiki/Cisco_Prime) to manage our network; back then it was mostly focused on network monitoring and, to a lesser extent, config management. We chuckled and told him that there was no way Cisco Works could scale to even a fraction of our network. He asked what we used, and of course we showed him the management system we'd written.
He was so impressed that he went back to San Jose, selected a group of Cisco Works architects, had them sign NDAs, and sent them to Bentonville, Arkansas for a month. The intent was to have them evaluate our software with an eye toward packaging it up and re-selling it.
Those meetings were interesting, but ultimately fruitless. The Cisco Works architects were Ivory Tower Java People. The first thing they wanted to see was our class hierarchy. We laughed and said we had scores of separate and very shallow classes, all written in Perl, C and C++.
Needless to say, they found the very 'rough and ready' way our platform was designed to be shocking and unpalatable. They went back and told Chambers that there was literally no way our products could be tied together.
> ... match to best-practices today ...
Professionally, I've been doing basically the same kinds of things since then, and I'll say that while our particular methods and approaches were extremely unusual, the high level results would meet or perhaps exceed what one gets with 'best practices' seen today.
Not because we were any smarter or better, but because we had no choice but to automate and automate effectively. At that scale, at that rate of change, at those uptime requirements, 'only' automating 99% would be disastrous.
FWIW, my brain was going "book! book! book! book! book!" back at the top-level comment, and the beeper may have got slightly overloaded and broke as I continued reading. :)
Yes please.
As a sidenote, the story about "the CEO vs the architects" was very fascinating: the CEO could see the end-to-end real-world value of what you'd built, but the architects couldn't make everything align. In a sense the CEO was more flexible than the architects, despite the fact that stereotypes might suggest the opposite.
Also, the sentiment about your unusual methodology exceeding current best practice makes me wonder whether you achieved so-called "environmental enlightenment" - where everything clicks and just works and makes everyone who touches the system a 5x developer - or whether the environment simply had to just work really really well. Chances are the former is what everyone wishes they'll find one day, while the latter (incredibly complex upstream demands that are not going to go away anytime soon and which require you to simply _deliver_) definitely seems like the likelier explanation for why the system worked, regardless of the language it was written in - it was the product of a set of requirements that would not accept anything else.
Hmm. Now I think about that a bit and try and apply it to "but why is current best practice worse", I was musing the other day about how a lot of non-technical environments don't apply tech in smart ways to increase their efficiency, because their fundamental lack of understanding in technology means they go to a solutions provider, get told "this will cost $x,xxx,xxx", don't haggle because they basically _can't_, and of course don't implement the tech. I wonder if the ubiquitification (that seems to be a word) of so-called "best practices" in an area doesn't function in a similar way, where lack of general awareness/understanding/visibility in an area means methodology and "practices" (best or not) aren't bikeshed to death, and you can just innovate. (Hmm, but then I start wondering about how highly technically competent groups get overtaken by others... I think I'll stop now...)
I love hearing stories from "old" Walmart. I was a Walmartian from 2017 to 2019, and I still miss my co-workers. (Shout-out to the Mobile Client team.)
Some interesting facts to know for those who don't dig into it. Walmart:
- has 80+ internal apps, mostly variants but still unique
- runs k8s inside of Distribution Centers
- maintains a fleet of >180k mobile devices in the US alone
- has a half-dozen data centers in the US
- has most International infrastructure separate from US Stores'
I've got some stories of my own, maybe I'll post them in a bit.
Wow, that's a hell of a change! When I left in 2009, there were exactly two datacenters: NDC and EDC. Not surprising really.
From where I was sitting, the best era was definitely 1997-2004 or so. ISD really went down hill, pretty quickly, in my last five years there, for many different reasons.
Really though, I feel truly awful for anyone affected by this. The post recommends implementing a disaster recovery plan. The truth is that most people don't have one. So, let's use this post to talk about Disaster Recovery Plans!
Mine: I have 5 servers at OVH (not at SBG) and they all back up to Amazon S3 or Backblaze B2, and I also have a dedicated server (also OVH/Kimsufi) that gets the backups. I can redeploy in less than a day on fresh hardware, and that's good enough for my purposes. What's YOUR Disaster Recovery Plan?
I'm at OVH as well (in the BHS datacenter, fortunately). I run my entire production system on one beefy machine. The apps and database are replicated to a backup machine hosted with Hetzner (in their Germany datacenter). I also run a tiny VM at OVH which proxies all traffic to Hetzner. I use a failover IP to point at the big rig at OVH. If the main machine fails, I move the failover IP to the VM, which sends all traffic to Hetzner.
If OVH is totally down, and the fail over IP doesn't work, I have a fairly low TTL on the DNS.
I backup the database state to S3 every day.
Since I'm truly paranoid, I have an Intel NUC at my house that also replicates the DB. I like knowing that I have a complete backup of my entire business within arm's reach.
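(For the curious: the daily database-to-S3 piece really can be a tiny cron job. A minimal sketch, assuming Postgres with pg_dump on the box and boto3 credentials already configured; bucket and database names are placeholders, and you'd want to encrypt the dump before uploading.)

    import datetime
    import subprocess

    import boto3

    DB_NAME = "appdb"                      # placeholder
    BUCKET = "example-offsite-backups"     # placeholder

    def nightly_backup() -> None:
        stamp = datetime.date.today().isoformat()
        dump_path = f"/var/backups/{DB_NAME}-{stamp}.dump"
        # Custom-format dump so pg_restore can do selective/parallel restores.
        subprocess.run(["pg_dump", "-Fc", DB_NAME, "-f", dump_path], check=True)
        boto3.client("s3").upload_file(dump_path, BUCKET, f"postgres/{stamp}.dump")

    if __name__ == "__main__":
        nightly_backup()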
I also run our entire production system on one beefy machine at OVH, and replicate to a similar machine at Hetzner. In case of a failure, we just change DNS, which has a 1 hour TTL. We've needed to do an unplanned fail-over only once in over 10 years.
And like you, I have an extra replica at the office, because it feels safe having a physical copy of the data literally at hand.
Same but with a regular offline physical copy (cheap NAS). One of my worries is malicious destruction of the backups if anything worms its way into my network.
Which is why "off" is still a great security tool. A copy on a non-powered device, even if that device is attached to the network, is immune to worms. There is something to be said for a NAS solution that requires a physical act to turn on and perform an update.
Hetzner has storage boxes and auto snapshots, so even if someone deletes the backups remotely there are still snapshots which they can't get unless they have control panel access.
My threat model is someone who would have full access to my computer without me knowing. So they could over time get access to passwords, modify my OS to MITM yubikeys... Overly cautious, most likely, but it doesn't cost me much more.
Mostly a browser zero day or some kind of malicious Linux package that gets distributed. I don't think I have a profile that would make people bother with physical attacks.
Not done any research into it, but I always thought OVH was supposed to be a very budget VPS service primarily for personal use rather than business. Although I thought it was akin to having a Raspberry Pi plugged in at home.
Again, I may be completely wrong, but why would you not use AWS/GCP? If it's complexity, Amazon has Lightsail; if it's cost, I thought DigitalOcean was one of the only reputable business-grade VPS providers.
I just can't imagine many situations where a VPS would be superior to embracing the cloud and using cloud functions, containers, instances with autoscaling/load balancers etc.
You can't imagine it, yet a big chunk of the independent internet runs on small VPS servers. There isn't much difference between DO and OVH, Hetzner, Vultr, Linode... not sure why DO would be better. I mean, it's a US company doing marketing right. That's the difference. Plus OVH/Hetzner have only EU locations.
I think small businesses like smaller, simpler providers instead of big clouds. It's a different philosophy; if you are afraid of extreme centralisation of the internet, it makes sense.
I can think of a lot of big differences. For one, you can get much larger machines at OVH and Hetzner with fancy storage configurations for your database if desired (e.g. Optane for your indices, magnetic drives for your transaction log, and RAIDed SSDs for the tables)
They also don't charge for bandwidth, although some of those other providers have a generous free bandwidth and cheap overage.
I didn't realize they had US datacenters before now. It's possible that's no longer an option. It was on the largest servers in the Montreal datacenter when I specced that out.
Much cheaper and better performance at the high end. Doesn't compete at all at the low-end, except through their budget brand Kimsufi. I don't see them really as targeting the same market.
I rent a server from OVH for $32 a month. It's their So You Start line... doesn't come with fancy enterprise support and the like.
It's a 4 core 8 thread Xeon with 3x 1TB SATA with 32GB of ECC RAM IIRC (E3-SAT-1-32, got it during a sale with a price that is guaranteed as long as I keep renewing it)
The thing is great, I can run a bunch of VM's on it, it runs my websites and email.
Overall to get something comparable elsewhere I would be paying 3 to 4 times as much.
I would consider $50 a month or less low end pricing. ¯\_(ツ)_/¯
Yeah, I forgot they also have the so you start brand. It's probably more expensive than the majority of what digital ocean sells, but there is some overlap for sure.
I don't know about OVH but Hetzner beats DO at the lower end: for $5/month you get 2 CPUs vs 1, 2 GB RAM vs 1, 40 GB disk vs 25 and 20 TB traffic vs 1. They have an even lower-end package for 2.96 Euro/month as well.
OVH has at least one large North American datacenter in Beauharnois, located just south of Montreal. I've used them before for cheap dedicated servers. They may have others.
If all you need is compute, storage, and a pipe, all the big cloud providers are a total ripoff and you should look elsewhere. The big ones only make sense if you are leveraging their managed features or if you need extreme elasticity with little chance of a problem scaling up in real time.
OVH is one of the better deals for bare metal, but there are even better ones for bandwidth. You have to shop around a lot.
Also be sure you have a recovery plan... even with the big providers. These days risks include not only physical stuff but some stupid bot shutting you off because it thinks you violated TOS or is reacting to a possibly malicious complaint.
We had a bot at AWS gank some test systems once because it thought we were cryptocurrency mining with free credits. We weren’t, but we were doing very CPU intensive testing. I’ve heard of this and worse happening elsewhere. DDOS detector and IDS bots are particularly notorious.
Twice the revenue of DigitalOcean still puts it < $1B ARR, or am I missing something? I can’t see how that’s the third largest in the world, or does your definition of “hosting provider” exclude clouds?
OVH is one of the largest providers in the world. They run a sub brand for personal use (bare metal for $5/m, hardware replacements in 30 min or less usually).
...and they do support all of those things you just listed, not just API-backed bare metal.
Their sub-brand soyoustart has older servers (that are still perfectly fine), roughly E3 Xeon/16-32GB/3x2TB to 4x2TB for $40/m ex vat.
Their other sub brand kimsufi for personal servers has Atom low-power bare metal with 2TB HDD (in reality it is advertised 500GB/1TB, but they don't really have any of those in stock left, if your drive fails they replace it with a 2T - so far this has been my exp) for $5.
All of this is powered by automation, you don't really get any support and you are expected to be competent. If your server is hacked you get PXE-rebooted into a rescue system and can scp/rsync off your contents before your server is reinstalled. OS installs, reboots, provisioning are all automated, there's essentially no human contact.
PS: Scaleway, in Paris, used to offer $2 bare metal (ultra low voltage, weaker than an Atom, 2GB ram), but pulled all their cheap machines, raised prices on existing users, and rebranded as enterprisey. The offer was called 'kidechire'
--
It is kind of interesting that on the US side everyone is in disbelief, or like "why not use AWS" - while most of the European market knows of OVH, Hetzner, etc.
My own reason for using OVH? It's affordable and I would not have gotten many projects (and the gaming community I help out with) off the ground otherwise. I can rent bare metal with NVMe and several terabytes of RAM for less than my daily wage for the whole month, and not worry about per-GB billing or attacks. In the gaming world you generally do not ever want to use usage-based billing - I made the mistake of using CloudFront and S3 once, and banned script kiddies would repeatedly wget-loop the largest possible file from the most expensive region with a botnet in a money-DoS.
I legitimately wouldn't have been able to do my "for-fun-and-learning" side projects (no funding, no accelerator credits, ...) without someone like them. The equivalent of a digitalocean $1000/m VM is about $100 on OVH.
Edit: Seems like they stopped publishing videos for that datacenter, but this seems to be a video of the datacenter that burned down, from 2013:
https://www.youtube.com/watch?v=Y47RM9zylFY
OVH STARTED as a budget VPS service some 20 years ago... but they have grown a lot over the last 6-7 years, adding more "cloud" services and capabilities, even if not on par with the main players...
Why not use AWS/GCP? From my personal point of view: as a French citizen, I'm more and more convinced that I can't completely trust the (US) big boys for my own safety. Trump showed that "US interest" is far more important than "customer interest" or even "ally interest". And moreover, Google is showing quite regularly that it's not a reliable business partner (AWS looks better on this front).
Yeah, I was thinking about all the horror stories that can be found on this site.
As a customer (or maybe an "involuntary data provider"), I do as much as I can to avoid Google being my SPOF, not technically (it's really technically reliable) but on the business side. I had to set up my own mail server just to avoid any risk of a Google ban, for example... just in case.
I won't use Google Authenticator for the same reason. I'm happy to have left Google Photos some years ago, to avoid problems of Google shutting it down. And the list could go on...
As a business, I like to program Android apps, but the Google Store is really a risk too: the risk of having any Google account blacklisted because some algorithm thought I did something wrong. And no appeal.
Maybe all this doesn't apply to GCP customers. Maybe GCP customers have a direct human line, with someone to really help and the capacity to do it. Or maybe it's just Google: as long as it works, enjoy. If it doesn't, go to (algorithmic) hell.
Nope. I was at a company with a $1M dedicated spend contract w/ GCP and what that got us was support through a VAR. It then became the VAR's job to file support tickets that took two weeks to get the response "oh well that's not how we do it at Google. Have you read these docs you already said you read and can you send logs you already sent?" instead of my job to do that.
Enterprise-level projects often have only light protection against wrongful hosting account termination, reasoning that spending a lot of money and having an account manager keeps them safe from clumsy automated systems.
So they might have their primary and replica databases at different DCs from the same hosting provider, and only their nightly backup to a different provider. Four copies to four different providers is a step above three copies with two providers!
A large enterprise would probably be using a filesystem with periodic snapshots, or streaming their redo log to a backup, to protect against a fat-fingered DBA deleting the wrong thing. Of course, filesystem snapshots provide no protection against loss of DC or wrongful hosting account termination, so you might not count them as true backup copies.
This is why you should have a “Cloud 3-2-1” backup plan. Have 3 copies of your data, two with your primary provider, and 1 with another.
e.g., if you are an AWS customer, have your backups in S3 and use simple replication to sync them to either GCS or Azure, where you can get the same level of compliance attestation as from AWS.
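A rough sketch of that cross-provider leg, if you'd rather not depend on any one vendor's replication features: copy whatever is new in the S3 backup bucket over to a GCS bucket. Bucket names are placeholders; a real job would add integrity checks, retention rules and streaming for large objects.

    import boto3
    from google.cloud import storage

    S3_BUCKET = "example-backups-s3"       # placeholder
    GCS_BUCKET = "example-backups-gcs"     # placeholder

    def sync_s3_to_gcs() -> None:
        s3 = boto3.client("s3")
        gcs_bucket = storage.Client().bucket(GCS_BUCKET)
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET):
            for obj in page.get("Contents", []):
                blob = gcs_bucket.blob(obj["Key"])
                if not blob.exists():          # copy only what's missing
                    body = s3.get_object(Bucket=S3_BUCKET, Key=obj["Key"])["Body"]
                    blob.upload_from_string(body.read())   # fine for modest objects

    if __name__ == "__main__":
        sync_s3_to_gcs()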
It's not paranoia if you're right. All of the risks GP is protecting against are things that happen to someone every day, and they should be seen like wearing the seat belt in a car.
I have a reliability and risk avoidance mindset, but I’ve had to stand back because my mental gas tank for trying to keep things going is near empty.
I’ve really struggled working with others that either are both ignorant and apathetic about the business’s ability to deal with risk or believe that it’s their job to keep putting duct tape over the duct tape that breaks multiple times a day while users struggle.
I like seeing these comments reminding others to a wear seat belt or have backups for their backups, but I don’t know whether I should care more about reliability. I work in an environment that’s a constant figurative fire.
I also like to spend time with my family. I know it’s just a job, and it would be even if I were the only one responsible for it; that doesn’t negate the importance of reliability, but there is a balance.
If you are dedicated to reliability, don’t let this deter you. Some have a full gas tank, which is great.
> ... [F]inance is fundamentally about moving money and risk through a network. [1]
Your employer has taken on many, many risks as part of their enterprise. If every risk is addressed the company likely can't operate profitably. In this context, your business needs to identify every risk, weigh the likelihood and the potential impact, decide whether to address or accept the risk, and finally, if they decide to address the risk, whether to address it in-house or outsource it.
You’ve identified a risk that is currently being “accepted” by your employer, one that you’d like to address in-house. Perhaps they’ve taken on the risk unintentionally, out of ignorance.
As a professional the best I can do is to make sure that the business isn’t ignorant about the risk they’ve taken on. If the risk is too great I might even leave. Beyond that I accept that life is full of risks.
This resonates with me. I notice my gas tank rarely depletes because of technology. It doesn't matter how brain dead the 00's oracle forms app with absurd unsupported EDI submission excel thinga-ma-bob that requires a modem ... <fill in the rest of the dumpster fire as your imagination deems>. Making a tech stack safe is a fun challenge.
Apathetic people though, that can be really tough going. It’s just that way “because”. Or my favourite “oh we don’t have permission to change that”, how about we make the case and get permission? _horrified looks_ sometimes followed by pitch forks.
Reliability is there to keep your things running smoothly during normal operations. Backups are there for when you reach the end of your reliability rope. Neither is really a good replacement for the other. The most reliable systems will still fail eventually, and the best of backups can't run your day to day operations.
At the end of the day you have a budget (of any kind) and a list of priorities on which to spend it. It's up to you or your management to set a reasonable budget, and to set the right priorities. If they refuse, leave or you'll just burn the candle at both ends and just fade out.
When a backup is used to bring something back, the time spent down may be decreased. When it is, that is reliability: we keep things usable and functioning more often than not.
Are your domains at OVH too? If yes, I'd consider changing this: this morning the manager was quite flooded and the DNS service was down for some time...
For small firms, CEO / CTO maintaining off-sites at a residence is reasonable and not an uncommon practice. As with all security / risk mitigation practices, there is a balance of risks and costs involved.
And as noted, encrypted backups would be resistant to casual interdiction, or even strongly-motivated attempts. Data loss being the principal risk mitigated by off-site, on-hand backups.
> There is nothing magical about data centers making them safe while your local copy isn't.
Is this a serious comment? My house is not certified as being compliant with any security standards. Here's the list that the 3rd party datacenter we use is certified as compliant with:
The data centers we operate ourselves are audited against several of those standards too. I guess you're right that there's nothing magic about security controls, but it has nothing to do with trust. Sensitive data should generally never leave a secure facility, outside of particularly controlled circumstances.
You are entirely missing the point by quoting the compliance programs followed by AWS, whose sole business is being a third-party hoster.
For most businesses, what you call sensitive data is customer and order listings, payment history, inventory if you are dealing in physical goods, and HR-related files. These are not state secrets. Encryption and a modicum of physical security go a long way.
I personally find the idea that you shouldn't store a local backup of this kind of data out of security concern entirely laughable. But that's me.
This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.
This attitude that your data isn't very important, so it's fine to not be very concerned about its security, while not entirely uncommon, is something most organisations try to avoid when choosing vendors. It's something consumers are generally unconcerned about, until a breach occurs, and The Intercept write an article about it. At which point I'm sure all the people ITT who are saying it's fine to take your production database home would be piling on with how stupid the company was for doing ridiculous things like taking a copy of their production database home.
> This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.
I said there was nothing magical about data centers security, a point I stand with.
It's all about proper storage (encryption) and physical security. Obviously, the physical security of an AWS data center will be tighter than your typical SME's, but in a way which is of no significance to storing backups.
> This attitude that your data isn’t very important
You are once again missing the point.
It's not that your data isn't important. It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.
The benefits of having easily available backups by far trump the utterly far fetched idea that someone might break into your office to steal your encrypted backups.
> It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.
In the SME space some things are "different", and if you've not worked there it can be hard to get one's head around it:
A client of mine was burgled some years ago.
Typical small business, offices on an industrial estate with no residential housing anywhere nearby. Busy in the daytime, quiet as the grave during the night. The attackers came in the wee small hours, broke through the front door (the locks held, the door frame didn't), which must have made quite a bit of noise. The alarm system was faulty and didn't go off (later determined to be a 3rd party alarm installer error...)
All internal doors were unlocked, PCs and laptops were all in plain sight, servers in the "comms room" - that wasn't locked either.
The attacker(s) made a cursory search at every desk, and the only thing that was taken at all was a light commercial vehicle which was parked at the side of the property, its keys had been kept in the top drawer of one of the desks.
The guy who looked after the vehicle - and who'd lost "his" ride - was extremely cross, everyone else (from the MD on downwards) felt like they'd dodged a bullet.
Physical security duly got budget thrown at it - stable doors and horses, the way the world usually turns.
Once you're big enough to afford a CISO, you're likely big enough to afford office space with decent physical security to serve as a third replicated database site to complement your two datacenters.
These solutions are not one-size-fits-all. What works for a small startup isn't appropriate for a 100+ person company.
Not in my experience. Worked at some small shops that were lightyears ahead in terms of policy, procedures and attitude compared to places I've worked with 50k+ employees globally.
Large organisations tend not to achieve security compliance with overly sophisticated systems of policy and controls. They tend to do it using bureaucracy, which while usually rather effective at implementing the level of control required, will typically leave a lot to be desired in regards to UX and productivity. Small organisations tend to ignore the topic entirely until they encounter a prospective client or regulatory barrier that demands it. At which point they may initially implement some highly elegant systems. Until they grow large enough that they all devolve into bureaucratic mazes.
I'm aware, but that's not been my experience. I've been in large places where there's been a laissez-faire attitude because it was "another team's job", and general bikeshedding over smaller features because the bigger-picture security wasn't their area, or it was forced by diktat from above to use X because they're on the board, whilst X is completely unfit for purpose. There's no pushback.
However I've worked at small ISPs where we took security extremely seriously. Appropriate background checks and industry policy, but more so the attitude... we wanted to offer customers security because we had pride in our work.
If you are a corporate entity of some kind, the final layer of your plan should always be "Go bankrupt". You can't successfully recover from every possible disaster and you shouldn't try to. In the event of a sufficiently unlikely event, your business fails and every penny spent attempting the impossible will be wasted, move on and let professional administrators salvage what they can for your creditors.
Lots of people plan for specific elements they can imagine and forget other equally or even more important things they are going to need in a disaster. Check out how many organisations that doubtless have 24/7 IT support in case a web server goes down somehow had no plan for what happens if it's unsafe for their 500 call centre employees to sit in tiny cubicles answering phones all day even though pandemic respiratory viruses are so famously likely that Gates listed them consistently as the #1 threat.
"Go bankrupt" is not a plan. Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.
Let's take an example which might lead to bankruptcy. A typical answer to a major disaster (let's say your main and sole building burning, as a typical case) for an SME would be to cease activity, furlough employees and stop or defer every payment you can while you claim insurance and assess your options. Well, none of these things are obvious to do, especially if all your archives and documents just burnt. If you think about it (which you should), you will quickly realise that you at least need a way to contact all your employees, your bank and your counsel (which would most likely be the accountant certifying your results rather than a lawyer if you are an SME in my country) offsite. That's the heart of disaster planning: having solutions at the ready for what was easy to foresee so you can better focus on what wasn't.
Yes it is. (Though it's better, as GP suggested, as a final layer of a plan and not the only layer.)
> Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.
Insolvency isn't bankruptcy. Becoming insolvent is a consequence, sure. Bankruptcy absolutely does help you deal with that impact, that's rather the point of it.
Bankruptcy when dealt with correctly is a process not an end.
If everything else fails it's better to file for bankruptcy while there is still something to recover with the help of others than to burn everything to ashes because of your vanity.
At least that's how I understood parent's comment.
As a quick interlude, since this may be confusing to non-US readers: bankruptcy in the United States in the context of business usually refers to two concepts, whereas in many other countries it refers to just one.
There are two types of bankruptcies in the US used most often by insolvent businesses: Chapter 7, and Chapter 11.
A Chapter 7 bankruptcy is what most people in other countries think of when they hear "bankruptcy" - it's the total dissolution of a business and liquidation of its assets to satisfy its creditors. A business does not survive a Chapter 7. This is often referred to as a "bankruptcy" or "liquidation" in other countries.
A Chapter 11 bankruptcy, on the other hand, is a process by which a business is given court protection from its creditors and allowed to restructure. If the creditors are satisfied with the reorganisation plan (which may include agreeing to change the terms of outstanding debts), the business emerges from Chapter 11 protection and is allowed to continue operating. Otherwise, if an agreement can't be reached, the business may end up in Chapter 7 and get liquidated. Most countries have an equivalent to a Chapter 11, but the name for it varies widely. For example, Canada calls it a "Division 1 Proposal," Australia and the UK call it "administration," and Ireland calls it "examinership."
Since there's a lot of international visitors to HN I just thought I'd jump in and provide a bit of clarity so we can all ensure we're using the same definition of "bankruptcy." A US Chapter 7 bankruptcy is not a plan, it's the game over state. A US Chapter 11 bankruptcy, on the other hand, can definitely be a strategic maneuver when you're in serious trouble, so it can be part of the plan (hopefully far down the list).
> Bankruptcy when dealt with correctly is a process not an end.
Yes, that's why "Go bankrupt" is not a plan which was the entire point of my reply. That's like saying that your disaster recovery plan is "solve the disaster".
Going bankrupt is a plan. However, it is a somewhat more involved one than it sounds, at first. That's why there should be a corporate lawyer advising on stuff like company structure, liabilities, continuance of pension plans, ordering and reasons for layoffs, etc.
It's not quite that simple, the data you might have may be needed for compliance or regulatory reasons. Having no backup strategy might make you personally liable depending on the country!
The more insecure your workers, the easier it is to get them to come in, regardless of what the supposed rules may or may not be.
Fast Fashion for example often employs workers in more or less sweatshop conditions close to the customers (this makes commercial sense, if you make the hot new items in Bangladesh you either need to expensively air freight them to customers or they're going to take weeks to arrive after they're first ordered - there's a reason it isn't called "Slow fashion"). These jobs are poorly paid, many workers have dubious right-to-work status, weak local language skills, may even be paid in cash - and so if you tell them they must come in, none of them are going to say "No".
In fact the slackening off in R for the area where my sister lives (today the towering chimneys and cavernous brick factories are just for tourists, your new dress was made in an anonymous single story building on an industrial estate) might be driven more by people not needing to own new frocks every week when they've been no further than their kitchen in a month than because it would actually be illegal to staff their business - if nobody's buying what you make then suddenly it makes sense to take a handout from the government and actually shut rather than pretend making mauve turtleneck sweaters or whatever is "essential".
Just to clarify: trans-Atlantic shipments take a week port-to-port, e.g. Newark, NJ, USA to Antwerp, Belgium. (Bangladesh to Italy via the Suez Canal looks like a 2-week voyage, or 3 weeks to the US west coast. Especially the latter would probably have quite a few stops on the way along the Asian coast.)
You get better economics than shipping via air-freight from one full pallet and up. Overland truck transport to and from the port is still cheaper than air freight, at least in the US and central Europe.
For these major routes, there are typically at least bi-weekly voyages scheduled, so for this kind of distance, you can expect about 11 days pretty uniformly distributed +-2 days, if you pay to get on the next ship.
This may mean (committing to) paying for the spot on the ship when your pallet is ready for pickup at the factory (not when it arrives at the port) and using low-delay overland trucking services.
Which operate e.g. in lockstep with the port processing to get your pallet on the move within half a day of the container being unloaded from the ship, ideally having containers pre-sorted at the origin to match truck routes at the destination.
So they can go on a trailer directly from the ship and rotate drivers on the delivery tour, spending only a few minutes at each drop-off.
Because those can't rely on customers to be there and get you unloaded in less than 5 minutes, they need locations they can unload at with on-board equipment. They'd notify the customer with a GPS-based ETA display, so the customer can be ready and immediately move the delivery inside.
Rely on 360-degree "dashcam" coverage and encourage the customer to have the drop-off point under video surveillance, just to easily handle potential disputes. Have the delivery person use some suitable high-res camera with a built-in light to get some full-surface-coverage photographic evidence of the condition it was delivered in.
I'd guess with a hydraulic lift on the trailer's back and some kind of folding manual pallet jack stuck on that (fold-up) lift, so they drive up to the location, unlock the pallet jack, un-fold the lift, lower the lift almost to the ground, detach the pallet jack to drop it the last inch/few cm to the ground, pull the jack out, lower the lift the rest of the way, drive it on to the lift, open the container, get up with the pallet jack, drive the pallets (one-by-one) for this drop-off out of the container and leave them on the ground, close and lock the container, re-arm the jack's hooks, shove it jack back under the slightly-lowered folding lift, make it hook back in, fold it up, lock the hooking mechanism (against theft at a rest stop (short meal and toilet breaks exist, but showering can be delayed for the up to 2 nights)), fold it all the way up, and go on to drive to their next drop-off point.
Not really, the insurance won't make things right in an instant. They will usually compensate you financially, but often only after painstaking evaluation of all circumstances, weighing their chances in court to get out of paying you and maybe a lengthy court battle and a race against your bankruptcy.
So yes, getting insurance can be a good idea to offset some losses you may have, as long as they are somewhat limited compared to your company's overall assets and income. But as soon as the insurance payout matches a significant part of your net worth, the insurance might not save you.
There are always uninsurable events and for large enough companies/risks there are also liquidity limits to the size of coverage you can get from the market even for insurable events.
As such, it makes sense to make the level of risk you plan to accept (by not being insured against it and not mitigating) a conscious economic decision rather than pretending you've covered everything.
As long as you have outside shareholders you can decide that. If you do you'd be surprised about how they will respond to an attitude like that. After all: you can decide the levels of risk that you personally are comfortable with leading to extinguishing of the business, but a typical shareholder is looking at you to protect their investment and not insuring against a known risk which at some point in time materializes is an excellent way to find yourself in the crosshairs of a minority shareholder lawsuit against a (former) company executive.
In my work life I am a professional investor, so I've been through the debate on insure/prepare or not many times. It's always an economic debate when you get into "very expensive" territory (cheap and easy is different obviously).
The big example of this which springs to mind is business interruption cover - it's ruinously expensive so it's extremely unusual to have the max cover the market might be prepared to offer. It's a pure economic decision.
Yes, but it is an informed decision and typically taken at the board level; very few CEOs who are not 100% owners would be comfortable with the decision to leave an existential risk uncovered without full approval of all those involved, which is kind of logical.
Usually you'd have to show your homework (offers from insurance companies proving that it really is unaffordable). I totally get the trade-off, and the fact that if the business could not exist if it was properly insured that plenty of companies will simply take their chances.
We also both know that in case something like that does go wrong everybody will be looking for a scapegoat, so for the CEO's own protection it is quite important to play such things by the book, on the off chance the risk one day does materialize.
Absolutely - but that's kind of my point. You should make the decision consciously. The corporate governance that goes around that is the company making that decision consciously.
And this is the heart of the problem: a lot of times these decisions are made by people who shouldn't be making them or they aren't made at all, they are just made by default without bring the fact that a decision is required to the level of scrutiny normally associated with such decisions.
This has killed quite a few otherwise very viable companies, it is fine to take risks as long as you do so consciously and with full approval of all stakeholders (or at least: a majority of all stakeholders). Interesting effects can result: a smaller investor may demand indemnification, then one by one the others also want that indemnification and ultimately the decision is made that the risk is unacceptable anyway (I've seen this play out), other variations are that one shareholder ends up being bought out because they have a different risk appetite than the others.
It's true: most companies do not have a disaster recovery plan, and many of them confuse a breach protocol with a disaster recovery plan ('we have backups').
Fires in DCs aren't rare at all; I know of at least three, one of those in a building where I had servers. This one seems to be worse than the other two. Datacenters tend to concentrate a lot of flammable stuff, throw a ton of current through it and do so 24x7. The risk of a fire is definitely not imaginary, which is why most DCs have fire suppression mechanisms. Whether those work as advertised depends on the nature of the fire. An exploding on-prem transformer took out a good chunk of EV1's datacenter in the early 2000s, and it wasn't so much the fire that caused problems for their customers, but the fact that someone got injured (or even died, I don't recall exactly), and it took a long time before the investigation was completed and the DC was released to the owners again.
Being paranoid and having off-site backups is what allowed us to be back online before the fire was out. If not for that I don't know if our company would have survived.
No, SBG2 was a building in the "tower design", as is SBG3 behind it. The containers in the foreground are SBG1, from the time when OVH didn't know if Straßburg was going to be a permanent thing.
Funnily enough, I think it was the fire risk that caused them to ditch the idea and move to their current design. Though I know modular design is highly likely to be used by all players as edge nodes spring up worldwide.
It was also that the container had literally no advantages. It was just a meme that did not survive rational analysis. The building in which the datacenter is located is the simplest, cheapest part of the design. Dividing it up into a bunch of inconveniently-sized rectangles solves nothing.
Got burned once (no pun intended), learned my lesson.
Hot spare on a different continent with replicated data along with a third box just for backups. The backup box gets offsite backups held in a safe with another redundant copy in another site in another safe.
Probably this is the most important part of your plan. It's not the backup that matters; it's the restore. And if you don't practice it from time to time, it's probably not going to work when you need it.
A few years ago I worked on the British Telecom Worldwide intranet team and we had a matrix mapping various countries' encryption laws.
This was so we remained legal in all of the countries BT worked in, which required a lot of behind-the-scenes work to make sure we didn't serve "illegally encrypted" data.
yeah, there's lots of countries with regulations that certain data can't leave the geographical boundary of the country. Often, it is the most sensitive data.
These laws generally don't work how people think they do.
For example, the Russian data residency law states that a copy of the data must be stored domestically, not that it can't be replicated outside the country.
The UAE has poorly written laws that have different regulations for different types of data - including fun stuff like only being subject to specific requirements if the data enters a 270 acre business park in Dubai.
Don't even get me started on storing encrypted data in one country and the keys in another...
Also, stupid things not to forget: make sure your DNS provider is independent, otherwise you won't be able to point to your new server (or have a secondary DNS provider). Make sure any email required for 2FA or for communicating with the hosting service managing your infrastructure isn't running on that same infrastructure.
We test rolling over the entire stack to another AWS DR region (just one we don't normally use) from S3 backups, etc. We do this annually and try to introduce some variations to the scenarios. It takes us about 18 hours realistically.
Documentation / SOPs that have been tested thoroughly by various team members are really important. It helps work out any kinks in interpretation, syntax errors etc.
It does feel a little ridiculous at the time for all the effort involved, but incidents like this show why it's so important.
As an immediate plan, the 2-3 business-critical systems are replicating their primary storage to systems in a different datacenter. This allows us to kick off the configuration management in a disaster, and we need somewhere between 1 and 4 hours to set up the necessary application servers and middleware to get critical production running again.
Regarding backups, backups are archived daily to 2 different borg repo hosts on different cloud providers. We could lose an entire hoster to shenanigans and the damage would be limited to ~2 days of data loss at worst. Later this year, we're also considering exporting some of these archives to our sister team, so they can place a monthly or weekly backup on tape in a safe in order to have a proper offline backup.
Regarding restores - there are daily automated restore tests for our prod databases, which are then used for a bunch of other tests after anonymization. On top, we've built most database handling on top of the backup/restore infra in order to force us to test these restores during normal business processes.
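To give an idea of the shape of such a restore test (a generic sketch, not our actual pipeline; paths, database and table names are placeholders): restore the latest dump into a throwaway database and fail loudly if a basic sanity query doesn't pass.

    import subprocess

    DUMP_PATH = "/srv/backups/latest/appdb.dump"   # placeholder
    SCRATCH_DB = "restore_test"

    def run(*cmd: str) -> str:
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    def restore_test() -> None:
        run("dropdb", "--if-exists", SCRATCH_DB)
        run("createdb", SCRATCH_DB)
        run("pg_restore", "--no-owner", "-d", SCRATCH_DB, DUMP_PATH)
        rows = run("psql", "-At", "-d", SCRATCH_DB,
                   "-c", "SELECT count(*) FROM orders;")
        assert int(rows.strip()) > 0, "restore produced an empty orders table"

    if __name__ == "__main__":
        restore_test()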
As I keep saying, installing a database is not hard. Making backups also isn't hard. Ensuring you can restore backups, and ensuring you are not losing backups almost regardless of what happens... that's hard and expensive.
* All my services are dockerized and have gitlab pipelines to deploy on a kubernetes cluster (RKE/K3s/baremetal-k8s)
* git repos containing the build scripts/pipelines are replicated on my gitlab instance and multiple work computers (laptop & desktop)
* Data and databases are regularly dumped and stored in S3 and my home server
* Most of the infrastructure setup (AWS/DO/Azure, installing kubernetes) is in Terraform git repositories. And a bit of Ansible for some older projects.
Because of the above, if anything happens all I need to restore a service is a fresh blank VM/dedicated machine or a cloud account with a hosted Kubernetes offering. From there it's just configuring terraform/ansible variables with the new hosts and executing the scripts.
One of my backup servers used to be in the same datacenter as the primary server. I only recently moved it to a different host. It's still in the same city, though, so I'm considering other options. I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental. But with Backblaze B2 being so cheap, I might as well just upload tarballs to B2. As long as I have the data, the servers can be redeployed in a couple of hours at most.
The SBG fire illustrates the importance of geographical redundancy. Just because the datacenters have different numbers at the end doesn't mean that they won't fail at the same time. Apart from a large fire or power outage, there are lots of things that can take out several datacenters in close vicinity at the same time, such as hurricanes and earthquakes.
> I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental.
Pretty much a textbook use case for ZFS with some kind of snapshot-rolling utility. Snap every hour, send backups once a day, prune your backups according to some timetable. Transfer as incrementals against the previous stored snapshot. Plus you get great data-integrity checking on top of that.
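A bare-bones version of that rotation, assuming a dataset called tank/data and a backup host reachable over ssh (names are placeholders; tools like sanoid/syncoid automate the same idea):

# hourly, from cron: take a snapshot named after the timestamp
zfs snapshot tank/data@auto-$(date +%Y%m%d-%H%M)

# daily: send only the delta since the last snapshot the backup box already has
# (the first run needs a full send: zfs send tank/data@SNAP | ssh backup zfs receive backup/data)
PREV=$(ssh backup zfs list -H -t snapshot -o name -s creation backup/data | tail -1 | cut -d@ -f2)
LAST=$(zfs list -H -t snapshot -o name -s creation tank/data | tail -1 | cut -d@ -f2)
zfs send -i "tank/data@$PREV" "tank/data@$LAST" | ssh backup zfs receive backup/data

# prune: destroy local snapshots older than your retention window (sketch only)
# zfs destroy tank/data@auto-20240101-0000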
With all due respect here - I've never heard of it either, and that's not what you want with a filesystem.
The draw of ZFS is that it's the copy-on-write filesystem with 10 zillionty hours of production experience that says it works. And that's why btrfs is not a direct substitute either. Or Hammer2. There are lots of things that could be cool; the question is whether you are willing to run them in production.
There is a first-mover advantage in filesystems (that occupy a given design and provide a given set of capabilities). At some point a winner sucks most of the oxygen out of the atmosphere here. There is maybe space for a second-place winner (btrfs); there isn't a spot for a fourth-place winner.
I use tarballs because they allow me not to trust the backup servers. ssh is set up such that the backup servers' keys are certified to run only a single backup script that returns the encrypted data, and nothing else.
It's very easy to use spare storage in various places to do backups this way, as ssh, gpg and cron are everywhere, and you don't need to install any complicated backup solutions or trust the backup storage machines much.
All you have to manage centrally is the private keys for backup encryption and the CA for signing the ssh keys, plus some occasional monitoring/tests.
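A rough sketch of the pattern (the CA name, script path and recipient key are made up for illustration, and the certificate must sit next to the backup server's key):

# on the CA machine: sign the backup server's key so it can only ever run the dump script
ssh-keygen -s backup_ca -I backup-host-1 -n backupuser \
    -O force-command=/usr/local/bin/backup-dump.sh \
    -O no-port-forwarding -O no-pty backup_host_1.pub

# /usr/local/bin/backup-dump.sh on the machine being backed up:
# stream a tarball to stdout, encrypted to a key whose private half the backup server never sees
tar -C / -cz etc var/www var/lib/important-app \
    | gpg --encrypt --recipient backup@example.org

# on the backup server, from cron: the forced command runs no matter what is requested
ssh -i backup_host_1 backupuser@target.example > /backups/target-$(date +%F).tar.gz.gpg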
I thought so too for a long while - until I tried to restore something (just to test things) and wasn't able to... it might have been specific to our GPG setup or an older version or something... but I decided to switch to restic and am much happier now.
Restic has a single binary that takes care of everything. It feels more modern and seems to work really well. Never had any issue restoring from it.
Just one data point - stick to whatever works for you. But it's important to test not only your backups, but also your restores!
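For reference, the basic restic loop is small enough to show; the repo location and paths here are placeholders:

export RESTIC_REPOSITORY=sftp:backup@host.example:/srv/restic-repo
export RESTIC_PASSWORD_FILE=/etc/restic/password

restic init                                         # once, to create the repository
restic backup /etc /srv/data                        # incremental, deduplicated snapshots
restic check                                        # verify repository integrity
restic restore latest --target /tmp/restore-test    # actually exercise the restore path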
I've been using Duplicati forever. The fact that it's C# is a bit of a pain (some distros don't have recent Mono), but running it in Docker is easy enough. Being able to check the status of backups and restore files from a web UI is a huge plus, so is the ability to run the same app on all platforms.
I've found duplicity to be a little simplistic and brittle. Purging old backups is also difficult: you basically have to make a full backup (i.e. non-incremental) before you can do that, which increases bandwidth and storage cost.
Restic looks great feature-wise, but still feels like the low-level component you'd use to build a backup system, not a backup system in itself. It's also pre-1.0.
Interesting, I will check Restic out, I’ve heard other good things about it. Duplicity is a bit of a pain to set up and Restic’s single binary model is more straightforward (Go is a miracle). Thanks for the recommendation!
GPG is a bit quirky but I do regularly check my backups and restores (if once every few months counts as regular).
Ditto. Moved to rclone after having a bunch of random small issues with Duplicity that on their own weren't major but made me lose faith in something that's going to be largely operating unsupervised except for a monthly check-in.
Self-hosted Kubernetes and a FreeNAS storage system at home, plus a couple of VMs in the cloud. I've got a mixed strategy, but it covers everything, with copies in remote locations.
Personal: I run a webserver for some websites (WordPress + XenForo). I've set up a cronjob that creates a backup of /var/www, /etc and a MySQL database dump, then uploads it to an S3 bucket (with automatic Glacier archiving after X period). It should be fairly straightforward to rent a new server and set things back up. I still dislike having to set up a webserver + PHP manually, though; I don't get why that hasn't been streamlined yet.
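The cronjob amounts to roughly this (bucket name and credentials file are placeholders; the Glacier transition is a lifecycle rule on the bucket itself):

#!/bin/sh
# /etc/cron.daily/site-backup - tar the web root and config, dump the DB, push to S3
set -eu
STAMP=$(date +%F)
TMP=$(mktemp -d)

tar -czf "$TMP/www-$STAMP.tar.gz" /var/www /etc
mysqldump --defaults-extra-file=/root/.my.cnf --all-databases | gzip > "$TMP/db-$STAMP.sql.gz"

aws s3 cp "$TMP/www-$STAMP.tar.gz" "s3://my-backup-bucket/$STAMP/"
aws s3 cp "$TMP/db-$STAMP.sql.gz"  "s3://my-backup-bucket/$STAMP/"
rm -rf "$TMP"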
My employer has a single rack of servers at HQ. It's positioned at a very specific angle with an AC unit facing it; their exact positions are marked out on the floor in tape. The servers contain the VMs that most employees work on, our git repository, issue trackers, and probably customer admin as well. They say they do off-site backups, but honestly, when (not if) that thing goes, it'll be a pretty serious impact on the business. They don't like people keeping their code on their take-home laptops either (I can't fathom how my colleagues work, or how they can stand working in a large codebase using barebones vim over ssh), but I've employed some professional disobedience there.
Basically the same (offsite backups), but the details are in the what and how, which is subjective. For my purposes I decided that offsite backups should only comprise user data, and that all server configuration be 100% scripted, with some interactive parts to speed up any customization, including recovering backups. I also have my own backup servers rather than using a service, and implement immutable incremental backups with rotated ZFS snapshots (this is way simpler than it sounds). I can highly recommend ZFS as an extremely reliable incremental backup solution, but you must enable block-level deduplication and expect it to gobble up all the server's RAM to be effective (that's why I dedicate a server to it and don't need masses of cheap slow storage). The backup server itself is restorable by script and only relies on having at least one of the mirrored block devices intact, which I make a local copy of occasionally.
I'm not sure how normal this strategy is outside of container land, but I like just using scripts; they are simple and transparent, if you take the time and care to write them well.
This sounds like what I want to do for the new infrastructure I'm setting up in one of OVH's US-based data centers. Are you running on virtual machines or bare metal? What kind of scripting or config management are you using?
VPS, although there's no dependency on any VPS-manager stuff, so I don't see an issue with running on bare metal. No config managers, just bash scripts.
They basically install and configure packages using sed or heredocs with a few user prompts here and there for setting up domains etc.
If you are constantly tweaking stuff this might not suit you, but if you know what you need and only occasionally do light changes (which you must ensure the scripts reflect) then this could be an option for you.
It does take some care to write reliable, clear bash scripts, and there are some critical choices like `set -e`, so that you can walk away, let it run to the end, and know that it didn't just error out in the middle without you noticing.
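A stripped-down example of the style being described (package choice, domain prompt and paths are invented for illustration):

#!/usr/bin/env bash
# provision.sh - set up a small web host; stop on the first error instead of continuing silently
set -euo pipefail

read -rp "Primary domain: " DOMAIN

apt-get update && apt-get install -y nginx certbot

# write config with a heredoc rather than a templating engine
cat > /etc/nginx/sites-available/"$DOMAIN" <<EOF
server {
    listen 80;
    server_name $DOMAIN;
    root /var/www/$DOMAIN;
}
EOF
ln -sf /etc/nginx/sites-available/"$DOMAIN" /etc/nginx/sites-enabled/

# small in-place tweaks with sed
sed -i 's/# server_tokens off;/server_tokens off;/' /etc/nginx/nginx.conf
systemctl reload nginx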
Servers are at a mix of "cloud" providers, and on-site. Most data (including system configs!) is backed up on-site nightly, and to B2 nightly with historical copies - and critical data is also live-replicated to our international branches. (Some "meh" data is backed up only to B2, like our phone logs; we can get most of the info from our carrier anyway).
Our goal and the reason we have a lot of stuff backed up on-prem is to have our most time-critical operations back up within a couple of hours - unless the building is destroyed, in which case that's a moot point and we'll take what we can get.
A dev wiped our almost-monolithic sales/manufacturing/billing/etc. MySQL database a month or two ago (I have been repeatedly overruled on the topic of taking access to prod away from devs). We were down for around an hour. Most of that time was spent pulling gigs of data out of the binlog without also wiping it all again - because our nightly backups had failed a couple of weeks prior, just after our most recent monthly "glance at it".
Less than a day for disaster recovery on fresh hardware? Same in my case. As you say, good enough for most purposes, but I'm also looking for improvement. I have offsite realtime replicas for data and MariaDB, and offsite nightly backups (a combo of rsnapshot, lsyncd, MariaDB multi-source replication, and a post-install script that sets up almost everything in case you have to recover on bare metal, i.e. no available VM snapshots).
Currently trying to reduce that "less than a day", though. I recently discovered "ReaR" (Relax-and-Recover) from Red Hat, and it sounds really nice for bare-metal servers. Not everybody runs virtualized/in the cloud (being able to recover from VM snapshots is really a plus). Let's share experiences :)
We have two servers at OVH (RBX and GRA, not SBG). I make backups of all containers and VMs every day and keep the last three, plus one from each month. Backups are stored on a separate OVH storage disk and also downloaded to a NAS on-premise. In case of a disaster, we'd have to rent a new server, reprovision the VMs and containers and restore the backups. About two days of work to make sure everything works fine, and we could lose about 24 hours of data.
It's not the best disaster recovery plan, but we accept that level of risk.
Nothing too crazy, just a simple daily cron to sync user data and database dumps on our OVH boxes to Backblaze and rsync.net. This simple setup has already saved our asses a few times.
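Roughly like this (remote names, bucket and paths are placeholders; the "b2" remote is configured beforehand with `rclone config`):

#!/bin/sh
# /etc/cron.daily/offsite-sync - push user data and DB dumps to two independent providers
set -eu

mysqldump --all-databases | gzip > /srv/backups/db-$(date +%F).sql.gz

# rsync.net (plain ssh/rsync target)
rsync -az /srv/userdata /srv/backups user@usw-s001.rsync.net:offsite/

# Backblaze B2 via rclone
rclone sync /srv/backups b2:my-offsite-bucket/backups
rclone sync /srv/userdata b2:my-offsite-bucket/userdata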
Most people/companies don't have the money to set up those disaster plans. These plans require you to have a similar server ready to go, plus a backup solution like Amazon S3.
I was affected: my personal VPS is safe but down, and I don't know anything about the other VPSes I was managing. I have the backups, and right now I'd love for them to just set me up a new VPS so I can restore the backups and get the services back up.
I only have a personal server running in Hetzner but it's mirrored onto a tiny local computer at home.
They both run Postfix + Dovecot, so mail is synced via Dovecot replication. Data is rsync-ed daily, and everything has ZFS snapshots. MySQL is not set up for replication - my home internet breaks often enough to cause serious issues - so instead I drop everything every day and import a full dump from the main server, and do a local dump as a backup on both sides.
Not saying that you should never do a full mysql dump. Nor that you should not ensure that you can import a full dump.
But when you already use ZFS you can do a very speedy full backup with:
mysql << EOF
-- hold a global read lock so the on-disk state is consistent for the snapshot
FLUSH TABLES WITH READ LOCK;
-- 'system' runs a shell command from the mysql client while the lock is still held
system zfs snapshot data/db@snapname
UNLOCK TABLES;
EOF
Transfer the snapshot off-site (and test!), either as a simple file copy (the snapshot ensured a consistent database) or, a little more advanced, with zfs send/receive. This is much quicker and more painless than mysqldump, especially with sizeable databases.
Do you even need to flush the tables and grab a read lock while taking the ZFS snapshot? My understanding was that since ZFS snapshots are point-in-time consistent, taking a snapshot without flushing tables or grabbing a read lock would be safe; restoring from that snapshot would be like rebooting after losing power.
I think you are correct. But then you risk data loss, just as you would with an unclean shutdown.
I much prefer to have a known clean state which all things considered should be a safer bet.
Just like some are OK running without fsync.
I don't have to "backup servers" for a long time now. I have an Ansible playbook to deploy and orchestrate services, which, in turn, are mostly dockerized. So my recovery plan is to turn on "sorry, maintenance" banner via CDN, spin up a bunch of new VPSes, run Ansible scenario for deployment and restore database from hidden replica or latest dump.
My recovery plan: tarball & upload to Object Store. I'm going to check out exactly how much replication the OVH object store offers, and see about adding a second geographic location, and maybe even a second provider, tomorrow.
If your primary data is on OVH, I'd look at using another company's object store if feasible (S3, B2, etc). If possible, on another payment method. (If you want to be really paranoid, something issued under another legal entity.)
There's a whole class of (mostly non-technical) risks that you solve for when you do this.
If anything happens with your payment method (fails and you don't notice in time; all accounts frozen for investigation), OVH account (hacked, suspended), OVH itself (sudden bankruptcy?), etc, then at least you have _one_ other copy. It's not stuff that's likely to happen, but the cost of planning for it at least as far as "haven't completely lost all my data even if it's going to be a pain to restore" here is relatively minimal.
I have three servers (1 OVH - different location, 2 DO). The only thing I back up is the DB, which is synced daily to S3. There's a rule to automatically delete files after 30 days, to handle GDPR and to stop the bucket and its costs from spiralling out of control.
Everything is managed with Ansible and Terraform (on DO side), so I could probably get everything back up and running in less than an hour if needed.
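The 30-day cleanup is just an S3 lifecycle rule; roughly how it can be set with the AWS CLI (bucket name and prefix are placeholders):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-db-dumps",
      "Status": "Enabled",
      "Filter": { "Prefix": "db-dumps/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-db-backup-bucket \
    --lifecycle-configuration file://lifecycle.json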
That makes it sound like you didn't try/practice. I imagine that in a real-life scenario things will be a little more painful than in one's imagination.
Exactly. Having a plan is only part of it. Good disaster plans do dry runs a couple of times a year (the time change is always a convenient reminder). If you rehearse the recovery when you're not panicked, you have a better chance of not skipping a step when the timing is much more crucial. Also, some sort of guide with the steps laid out procedurally is a great idea.
I don't think this is necessarily true for all parts of a disaster plan. Some mechanisms may be untestable because it is unknown how to actually trigger them (think certain runtime assertions, but on a larger scale).
Even if it is possible to trigger and test, actually using the recovery mechanism may have a high cost, either monetary or perhaps losing some small amount of data. These mechanisms should almost always be an additional layer of defense and only be invoked in case of true catastrophe.
In both cases, the mechanisms should be tested as thoroughly as possible, either through artificial environments that can simulate improbable scenarios or, in the latter case, on a small test environment to minimize cost.
I haven't ever deleted everything and timed how long it takes to get it all up and running again, but I have tested that it works by spinning up new machines and moving everything over to them (it was easier than running "sudo apt-get dist-upgrade").
Here's what I do for my homelab setup, which has a few machines running locally and some VPSes "in the cloud":
I personally have almost all of the software running in containers with an orchestrator on top (Docker Swarm in my case; others may use Nomad, Kubernetes or something else). That way, rescheduling services on different nodes becomes less of a hassle if any one of them fails, since I know what should be running, what configuration I expect it to have, and what data needs to be persisted.
At the moment I'm using Time4VPS (affiliate link: https://www.time4vps.com/?affid=5294) for the stuff that needs decent availability, because they're cheaper than almost all of the alternatives I've looked at (DigitalOcean, Vultr, Scaleway, AWS, Azure), and that matters to me.
Now, if the entire data centre disappears, all of my data would still be available on a few HDDs under my desk (which are then replicated to other HDDs locally with rsync), given that I use BackupPC for incremental scheduled backups with rsync: https://backuppc.github.io/backuppc/
For simplicity, the containers also use bind mounts, so all of the data is readable directly from the file system, for example under /docker (not really following the *nix file system layout practices, but this works for me because it's really easy to tell where the data I want lives).
I actually had to migrate to a new node a while back; it took around 30 minutes in total (updating DNS records included). Ansible can also really help with configuring new nodes. I'm not saying that my setup would work for most people or even anything past a startup, but it seems sufficient for my homelab/VPS needs.
My conclusions:
- containers are pretty useful for reproducing software across servers
- knowing exactly which data you want to preserve (such as /var/lib/postgresql/data/pgdata) is also pretty useful, even though a lot of software doesn't really play nicely with the idea (see the sketch after this list)
- backups and incremental backups are pretty doable even without relying on a particular platform's offerings, BackupPC is more than competent and buying HDDs is far more cost effective than renting that space
- automatic failover (both DNS and moving the data to a new node) seems complicated, as does using distributed file systems; those are probably useful but far beyond what I actually want to spend time on in my homelab
- you should still check your backups
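To illustrate the first two points above, one of those bind-mounted services looks roughly like this in Docker Swarm terms (service name, image tag and paths are made up; only /docker/postgres needs to land in the backups, and the bind source has to exist on whichever node the task runs on):

# everything the service needs to survive a node loss lives under /docker/postgres
docker service create \
    --name db \
    --mount type=bind,source=/docker/postgres,target=/var/lib/postgresql/data \
    --env PGDATA=/var/lib/postgresql/data/pgdata \
    postgres:15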
A status update on the OVH tracker for a different datacenter (LIM-1 / Limburg) says "We are going to intervene in the rack to replace a large number of power supply cables that could have an insulation defect." [0][1] The same type of issue is "planned" in BHS [3] and GRA [2].
Eerie timing: do they possibly suspect some bad cables?
>Eerie timing: do they possibly suspect some bad cables?
Why not? Cables rated lower than the load they are carrying are a prime cause of electrical fires. If the load is too high for long enough, the insulation melts away, and if other flammable material is close enough to catch fire, that's the ball game. It's a common cause of home electrical fires: some lamp with poor wiring catches the drapes on fire, etc. You wouldn't think a data center would have flammable curtains, though.
This could definitely be the cause. When I was a teen I witnessed such a fire once. A friend had a heater but couldn't find a mains cable, and eventually decided to disconnect the mains cable from his radio and use that. After a few minutes the insulation on the mains cable melted away, the cable turned glowing red, and it started burning the table it was on. Fortunately we were not asleep and got it under control quickly. Lesson learned.
We have several bare-metal servers in GRA/Gravelines and RBX/Roubaix; 3 weeks ago we had 3 hours of downtime in RBX because they were replacing power cords without prior notification.
Maybe they were aware this could happen, and were in the process of fixing it.
I feel like there should be a place to report infrastructure suppliers with misleading status pages, some kind of crowdsourced database. Without this information, you only find out that they are misleading when something goes very wrong.
At best you might be missing out on some SLA refunds, but at worst it could be disastrous for a business. I've been on the wrong side of an update-by-hand status system from a hosting provider before and it wasn't fun.
Agreed, though. A fake status page is worse than no status page. I don't mind if the status page states that it's manually updated every few hours as long as it's honest. But don't make it look like it's automated when it's not.
Wtf is this disclaimer on Down Detector for? (Navigate to the OVH page.) It sits in front of user comments, I think:
> Unable to display this content due to missing consent.
> By law, we are required to ask your consent to show the content that is normally displayed here.
They are not the only ones though. All too common. Well, it's tricky to set this up properly. The only proper way would be to use external infra for the status page.
It's not difficult to make a status page with minimal false negatives. Throw up a server on another host that shows red when it doesn't get a heartbeat. But then instead you end up with false positives. And people will use false positives against you to claim refunds against your SLA.
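A trivially small version of that heartbeat idea (hostnames, file paths and the 5-minute threshold are arbitrary; the endpoint on the status host is assumed to just touch a file per host on each request):

# monitored host, every minute from cron
curl -fsS "https://status.example.net/beat?host=web1" >/dev/null

# status host, also from cron: regenerate a static page from heartbeat freshness
NOW=$(date +%s)
LAST=$(stat -c %Y /var/lib/status/heartbeat-web1 2>/dev/null || echo 0)
if [ $((NOW - LAST)) -gt 300 ]; then
    echo "web1: possible outage - no heartbeat for 5+ minutes" > /var/www/html/status.txt
else
    echo "web1: ok" > /var/www/html/status.txt
fi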
As someone who maintained a status page (poorly), I'm sorry on behalf of all status pages.
But they're usually manual affairs, because sometimes the system is broken even when the healthcheck looks OK, sometimes writing the healthcheck is tricky, and you always want the status page disconnected from the rest of the system as much as possible.
It is a challenge to get 'update the status page' into the runbook. Especially for runbooks you don't review often (like the one for the building is on fire, probably).
Luckily my status page was not quite public; we could show a note when people were trying to write a customer-service email in the app. If you forget to update that, you get more email, but nobody posts that the system is down while the status page says everything is OK.
Yep. I guess what could be done is a two-tiered status page: an automated health check that shows "possible outage, we're investigating", followed by a manual update (although some would say it looks lame to follow up with "nah, false positive", which is probably why this setup is rare).
Well, it sucks to catch fire and I care for the employees and the firemen, but if their status page is a lie then I have a whole lot less sympathy for the business. That's shady business and they should feel bad.
I can appreciate an honest mistake though, like if the cron job that updates the status page is hosted in the same cluster that caught fire, and hence burnt down and can't update the page anymore.
Is the status page relevant though? At the very least, OVH immediately made a status announcement on their support page and they've been active on Twitter. I don't see anything shady here. From their support page:
> The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. If your production is in Strasbourg, we recommend to activate your Disaster Recovery Plan
What's the point of a status page then if it does not show you the status? I don't want to be chasing down twitter handles and support pages during an outage.
It seems to be a static site, which is reasonable since it aggregates a lot of data and would encounter high load precisely when something goes wrong, so generating it live without caching is not viable. So maybe the server that normally updates it is down too (not that that would be a good excuse)?