

Ask HN: How do you backup 200TB across 15 servers onto a single 50TB server? - nanch

I'm in a situation where I have a lot of data spread across lots of servers and I would like to be able to restore the data on a single server if it fails.

Suppose each server has less than 40TB.

The solution I'm thinking of is like RAID, but at a server level; like a simple XOR of each server's data onto a single server. In the case of a server failure, XOR each existing server and rebuild the data from the failed server.

Has anybody done this?
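
Roughly, here is the idea as a toy sketch (hypothetical server names and tiny in-memory strings; real data would have to be streamed in fixed-size blocks and padded to a common length):

    # Toy sketch of server-level XOR parity. Illustrative only: in practice
    # each server's snapshot would be streamed block by block, not held in RAM.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte strings together."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Snapshots from each live server, padded to the same length.
    servers = {
        "server01": b"data on server 1".ljust(32, b"\0"),
        "server02": b"data on server 2".ljust(32, b"\0"),
        "server03": b"data on server 3".ljust(32, b"\0"),
    }

    # The backup server stores only the XOR of all snapshots.
    parity = xor_blocks(list(servers.values()))

    # If one server dies, XOR the parity with every surviving snapshot
    # to rebuild the lost data.
    lost = servers.pop("server02")
    rebuilt = xor_blocks([parity] + list(servers.values()))
    assert rebuilt == lost

The nice property is that the parity only needs to be as big as the largest single server (under 40TB here), so a 50TB box can cover all 15 servers, as long as no more than one of them is lost at a time.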
======
Zenst
Firstly, unless you have a lot of duplication in that 200TB, you can't.

There are two questions you need to ask yourself:

1) What is the daily amount of change per server, and with that the total
amount you need to back up daily (or at whatever frequency you need)?

2) What is the initial uniqueness of the data being backed up? Is it mostly
core OS and system files, with the same standard local copies of data tables?

With that you can work out what you are trying to do from a needs perspective.
It may be a case of imaging the servers locally and automating differential
backups. But if compression and deduplication don't apply, and/or the data
changes frequently, then you will not be able to fit that data into a smaller
space. You may only get 2:1 or less compression (it is really data-dependent),
and to fit 200TB onto 50TB you need at least 4:1, and realistically better,
across all of that data.

Look into the problem and define it better; with that, a solution will be
easier. It may even turn out that you cannot do what you want with 50TB, and
might actually need 400TB to back up those 200TB of servers, due to a high
volume of data changes and a need to go back further than the last backup
copy, perhaps a month. Define the business needs and the impact, and with
those the cost and budget. This is very much a case of working to a limitation
when that limitation is not defined; it may well be artificial, and defeat the
objective at a cost that exceeds just adding more storage.

In short, define the problem better with regard to the data and the business
needs for backup: hourly/real-time, daily, or weekly; the type of data; and
how dynamic that data is. It may be an easy case if really only 1TB of data
changes across all servers each week; we just do not know.

Also, do not go about writing your own backup solution from scratch. Replacing
what you have is worth a look if you really know what you are doing, and you
may not. Do not reinvent wheels; there are many options and solutions out
there, but until the true problem is defined, no easy solution can be given.

Heck, we do not even know whether these are Windows, Linux, AIX, BSD, or a
combination of different servers, and the options change with that. Also
factor in the offsite aspect of backup: one electrical zap from a storm, or a
fire, could mess things up if that backup server sits right next to the
servers it is backing up. There are many things to consider, but above all,
sit down and define the problem in more detail. The devil is always in the
detail, and a better definition of the problem allows for easier solutions
and, overall, less work as well as smarter work. It is a good lesson to learn;
just don't learn it the hard way by writing a backup system from a problem
statement and design scribbled on the back of a fag packet.

------
nwh
Sounds like an outrageously dangerous place to roll your own system. What
you're describing is some sort of Reed-Solomon type data recovery, though I've
absolutely no idea how that would work in practice. Wouldn't you be spending
all of your time updating the parity information?

~~~
nanch
Yes, similar to Reed-Solomon/tornado codes in theory (being able to recover
from lost data), but closer to RAID 5 in practice. I'm familiar with tornado
codes, but that's beyond the scope of v1 for this.

Updating parity wouldn't be as much of a concern because real-time backups are
not a requirement: getting a "snapshot" from each server at a given point in
time would be enough, and we'd be able to recover to that snapshot.
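
(Even if we did later want to keep the parity live instead of snapshot-based, the usual RAID-5 read-modify-write trick would apply at the server level too: a changed block can be folded into the parity using only the old block, the new block, and the parity itself, without touching the other servers. A rough sketch, with made-up values:)

    # Incremental parity update: parity_new = parity_old XOR block_old XOR block_new.
    # Only the writing server and the backup box are involved.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    block_old = b"old contents...."
    block_new = b"new contents...."
    other     = b"another server.."        # a block from some other server

    parity_old = xor(block_old, other)     # parity before the write
    parity_new = xor(xor(parity_old, block_old), block_new)

    # Same result as recomputing the parity from scratch.
    assert parity_new == xor(block_new, other)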

For v1 I would assume no data is ever deleted from a server, so recovery from
a single server failure would be possible (but this is where additional
redundancy via RS/tornado codes would be useful).

------
tomfitz
tahoe-lafs ([https://tahoe-lafs.org/trac/tahoe-lafs](https://tahoe-lafs.org/trac/tahoe-lafs))
is "an open source, secure, decentralized, fault-tolerant, peer-to-peer
distributed data store and distributed file system."

It has the server-level RAID you desire. The FAQ states: ``You know how with
RAID-5 you can lose any one drive and still recover? And there is also
something called RAID-6 where you can lose any two drives and still recover.
Erasure coding is the generalization of this pattern: you get to configure how
many drives you could lose and still recover. You can choose how many drives
(actually storage servers) will be used in total, from 1 to 256, and how many
storage servers are required to recover all the data, from 1 to however many
storage servers there are. We call the number of total servers N and the
number required K, and we write the parameters as "K-of-N".''
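
To get a feel for how K-of-N works, here is a toy sketch of erasure coding via polynomial interpolation over a small prime field. It is purely illustrative (tahoe-lafs actually uses its zfec library, and a real codec works over GF(2^8) with proper byte packing), but it shows the "any K of the N shares recover the data" property:

    # Toy K-of-N erasure coding over GF(257). Needs Python 3.8+ for pow(x, -1, p).

    P = 257  # small prime, so every byte value 0..255 is a field element

    def lagrange_eval(points, x):
        """Evaluate, at x, the unique polynomial through the (xi, yi) points, mod P."""
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P
        return total

    def encode(data, n):
        """data: K byte values. Returns N shares (x, y); any K of them recover data."""
        points = list(enumerate(data))              # the data sits at x = 0..K-1
        return [(x, lagrange_eval(points, x)) for x in range(n)]

    def decode(shares, k):
        """shares: any K of the N (x, y) pairs."""
        return [lagrange_eval(shares, x) for x in range(k)]

    data = [0x13, 0x37]                  # K = 2
    shares = encode(data, 4)             # N = 4: lose any 2 shares and still recover
    assert decode([shares[1], shares[3]], 2) == data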

It's a very active open source project, with full-time contributors (I
think?), funded by their commercial arm,
[https://leastauthority.com/](https://leastauthority.com/) .

------
manglav
Have you looked at FlexRAID? I am not too knowledgeable about the depths of
this topic, but I know it can back up over SMB/FTP. It's a very interesting
engine as well, and if you decide to roll your own, you might get some
inspiration from it.

[http://www.flexraid.com/](http://www.flexraid.com/)

~~~
nanch
This looks very close to what I'm looking for, thanks!

------
chris_va
Your proposed scheme might work, but the odds of a failure it can't recover
from are high (like, what if 2 servers go out instead of one?).

How hard is your constraint? Can you just use an aggressive Reed-Solomon codec
with a ~1.5x blowup factor (so, 100TB of extra space instead of 50TB) and
spread the chunks around to each of the machines?
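
For example (illustrative numbers only, not a recommendation): with the 15 existing machines, a 10-of-15 code is exactly a 1.5x expansion, so 200TB of data becomes roughly 300TB of encoded chunks, about 20TB landing on each machine, and any 5 machines can fail simultaneously without losing data.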

~~~
nanch
Failure of 2 servers is a risk for the proposed design. For a v2, the system
could use Reed-Solomon on the backup server to add redundancy and support the
simultaneous failure of N servers.

Yes, a distributed design with the backup blocks distributed across servers
would be adequate as well, even without the RS blowup factor.

~~~
chris_va
This is your best bet: distribute the blocks to all servers evenly, and use RS
to encode your blocks to limit the storage space requirements and make
recovery straightforward.

Also, make sure you don't buy all of your hard drives from the same
manufacturer. Most of these types of systems assume random hard drive
failures... it turns out drives can have highly correlated failures.

------
bobx11
Hadoop is like RAID for servers.

