Ask HN: How do you backup 200TB across 15 servers onto a single 50TB server?
5 points by nanch on Nov 2, 2013 | 10 comments
I'm in a situation where I have a lot of data spread across lots of servers and I would like to be able to restore the data on a single server if it fails.

Suppose each server has less than 40TB.

The solution I'm thinking of is like RAID, but at a server-level; like a simple XOR of each server's data onto a single server. In the case of a server failure, XOR each existing server and rebuild the data from the failed server.
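
Concretely, something like this is what I have in mind - a rough sketch only; the chunk size and the one-file-per-snapshot layout are just assumptions:

    # Rough sketch of server-level XOR parity (RAID-5 style, parity on one box).
    # Assumes each server's snapshot can be read as one big byte stream; the
    # file-per-snapshot layout and chunk size here are illustrative only.

    CHUNK = 1 << 20  # 1 MiB at a time

    def xor_bytes(a, b):
        # XOR two byte strings, zero-padding the shorter one.
        if len(a) < len(b):
            a, b = b, a
        b = b.ljust(len(a), b"\x00")
        return bytes(x ^ y for x, y in zip(a, b))

    def build_parity(snapshot_paths, parity_path):
        # XOR all server snapshots together, chunk by chunk, into one parity file.
        streams = [open(p, "rb") for p in snapshot_paths]
        try:
            with open(parity_path, "wb") as out:
                while True:
                    chunks = [s.read(CHUNK) for s in streams]
                    if not any(chunks):
                        break
                    parity = b""
                    for c in chunks:
                        parity = xor_bytes(parity, c)
                    out.write(parity)
        finally:
            for s in streams:
                s.close()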

Has anybody done this?



Sounds like an outrageously dangerous place to roll your own system. What you're describing is some sort of Reed-Solomon-type data recovery, though I've absolutely no idea how that would work in practice. Wouldn't you be spending all of your time updating the parity information?


Yes, similar to Reed-Solomon/Tornado codes in theory - being able to recover from lost data - but closer to RAID 5 in practice. I'm familiar with Tornado codes, but that's beyond the scope of v1 for this.

Updating parity wouldn't be as much of a concern because real-time backups are not a requirement - getting a "snapshot" from each server at a given point in time would be enough, and we'd be able to recover to that snapshot.

For v1 I would assume no data is ever deleted from a server, so recovery from a single server failure would be possible (but this is where additional redundancy would be useful via RS/Tornado codes).
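
Recovery would then be the same XOR pass in reverse, reusing the xor_bytes/CHUNK helpers from the sketch in my question above (again just a sketch; you'd also need to record each snapshot's original length so the rebuilt file can be truncated):

    def recover_failed(parity_path, surviving_paths, output_path):
        # Rebuild the failed server's snapshot by XOR-ing the parity file with
        # every surviving snapshot. The result comes out zero-padded to the
        # length of the longest snapshot, so truncate it to the recorded size.
        streams = [open(p, "rb") for p in [parity_path] + list(surviving_paths)]
        try:
            with open(output_path, "wb") as out:
                while True:
                    chunks = [s.read(CHUNK) for s in streams]
                    if not any(chunks):
                        break
                    acc = b""
                    for c in chunks:
                        acc = xor_bytes(acc, c)
                    out.write(acc)
        finally:
            for s in streams:
                s.close()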


tahoe-lafs, https://tahoe-lafs.org/trac/tahoe-lafs , is "an open source, secure, decentralized, fault-tolerant, peer-to-peer distributed data store and distributed file system."

It has the server-level RAID you desire. The FAQ states: "You know how with RAID-5 you can lose any one drive and still recover? And there is also something called RAID-6 where you can lose any two drives and still recover. Erasure coding is the generalization of this pattern: you get to configure how many drives you could lose and still recover. You can choose how many drives (actually storage servers) will be used in total, from 1 to 256, and how many storage servers are required to recover all the data, from 1 to however many storage servers there are. We call the number of total servers N and the number required K, and we write the parameters as "K-of-N"."
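
To make the K-of-N idea concrete, here's a toy hand-rolled version over a small prime field - purely an illustration of the principle, not Tahoe's actual code or API (Tahoe builds on its zfec library): the K data values determine a polynomial, and any K of the N shares pin it down again.

    # Toy K-of-N erasure coding over GF(257), only to show the idea; real
    # implementations such as zfec use binary fields and are far faster.
    P = 257

    def lagrange_eval(points, x):
        # Evaluate the unique degree < K polynomial through `points` at x, mod P.
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
        return total

    def make_shares(data, n):
        # Systematic encoding: shares 1..K are the data itself, K+1..N are parity.
        k = len(data)
        points = list(enumerate(data, start=1))
        return points + [(x, lagrange_eval(points, x)) for x in range(k + 1, n + 1)]

    def recover(any_k_shares, k):
        # Any K surviving shares reconstruct the original K data values.
        pts = list(any_k_shares)[:k]
        return [lagrange_eval(pts, x) for x in range(1, k + 1)]

    data = [104, 105, 33]                      # K = 3 values
    shares = make_shares(data, 5)              # "3-of-5": lose any 2, still fine
    assert recover([shares[1], shares[3], shares[4]], 3) == data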

It's a very active open source project, with full-time contributors (I think?), funded by their commercial arm, https://leastauthority.com/ .


Have you looked at FlexRAID? I am not too knowledgeable on the depths of this topic, but I know it can back up over SMB/FTP. It's a very interesting engine as well, and if you decide to roll your own, you might get some inspiration.

http://www.flexraid.com/


This looks very close to what I'm looking for, thanks!


Firstly, unless you have a lot of duplication in that 200TB, you can't.

Two questions you need to ask yourself:

1) What is the daily amount of change per server, and with that the total amount that you need to back up daily (or at whatever frequency you need)?

2) What is the initial uniqueness in the data being backed up? Mostly core OS and system files, with the same standard local copies of data tables?

With that you can work out what you are trying to do from a needs perspective. It may be a case of imaging servers locally and doing automated differential backups. But if compression and deduplication don't help, and/or the data changes at high frequency, then you will not be able to fit that data into a smaller space. You may only get 2:1 or less compression - it is really data-centric - and you would need at least 4:1, and realistically better, across all that data.

Look into the problem and define it better; with that, a solution will be easier. It may turn out you cannot do what you need - you may even need 400TB to back up those 200TB of servers, due to high-volume data changes and the need to go back further than the last backup copy, perhaps as far as a month. Define the business needs and impact, and with that the cost and budget, as this is very much a case of working to a limitation that has not been defined and may well be artificial, defeating the objective at a cost that exceeds just adding more storage.

In short, define the problem better with regard to the data and the business needs for backup - hourly/real-time, daily, weekly - and the type of data and how dynamic it is. It may turn out to be easy if really only 1TB of data changes across all servers each week; we just do not know.

Then do not go about writing your own backup solution from scratch. Replacing what you have is worth a look if you really know what you are doing - and you may not. Do not reinvent wheels; there are many options and solutions out there, but until the true problem is defined, no easy solution can be given.

Heck, we do not even know if these are Windows, Linux, AIX, BSD or a combination of different servers, and with that the options change. Also factor in the offsite aspect of backup: one electrical zap from a storm, or a fire, could mess things up if that backup server is sat next to the servers it is backing up. So many things to consider, but above all sit down and define the problem in more detail. Sadly the devil is always in the detail, and a better definition of the problem allows for easier solutions and overall less work, as well as smarter work. It is a good lesson to learn - just don't learn it the hard way by designing a backup system on the back of a fag packet.


Your proposed scheme might work, but the odds of a system failure are extremely high (like, what if 2 servers go out instead of one?).

How hard is your constraint? Can you just use an aggressive Reed-Solomon codec with a ~1.5x blowup factor (so, 100TB of extra space instead of 50TB) and spread the chunks around to each of the machines?
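
Back-of-the-envelope for that suggestion (the 10-of-15 split is just one example parameter set that gives ~1.5x):

    # Rough arithmetic for the ~1.5x suggestion; K and N here are just examples.
    K, N = 10, 15                  # need any 10 of 15 chunks per stripe
    data_tb = 200
    print(f"expansion factor: {N / K:.2f}x")                          # 1.50x
    print(f"extra space: {data_tb * (N / K) - data_tb:.0f}TB")        # 100TB
    print(f"survivable losses per stripe: {N - K}")                   # 5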


Failure of 2 servers is a risk for the proposed design. For a v2, the system could use Reed-Solomon on the backup server to add redundancy and support the simultaneous failure of N servers.

Yes, a distributed design with the backup blocks spread across the servers would be adequate as well, even without the RS blowup factor.


This is your best bet: distribute blocks to all servers evenly, and use RS to encode your blocks to limit storage space requirements and make recovery straightforward.

Also, make sure you don't buy all of your hard drives from the same manufacturer. Most of these types of systems assume random hard drive failures... it turns out drives can have highly correlated failures.


Hadoop is like RAID for servers.



