A few people are extending the concept to other tools -- that should be available at http://fileutils.io/ relatively soon.
We also had another tool written on top of https://github.com/hpc/libcircle that would gather metadata on a few hundred-million files in a few hours (we had to limit the speed so it wouldn't take down the filesystem). For a slimmed down version of that tool, take a look at https://github.com/hpc/libdftw
And it's interesting and useful for scientific computing where you already have an MPI environment and distributed/parallel filesystems. However, it's not really applicable to this workload, as the paper itself says.
There is a provision in most file systems to use links (symlinks,
hardlinks, etc.). Links can cause cycles in the file tree, which
would result in a traversal algorithm going into an infinite loop.
To prevent this from happening, we ignore links in the file tree
during traversal. We note that the algorithms we propose in
the paper will duplicate effort proportional to the number of
hardlinks. However, in real world production systems, such as
in LANL (and others), for simplicity, the parallel filesystems
are generally not POSIX compliant, that is, they do not use
hard links, inodes, and symlinks. So, our assumption holds.
The reason this cp took such a large amount of time was the desire to preserve hardlinks, plus the repeated resizing of the hashtable used to track the device and inode of the source and destination files.
Sure, but if you read that article you walk away with a sense of "that's a lot of files to copy." And the GP built a tool for jobs 2-3 orders of magnitude larger?! Clearly there are tradeoffs forced on you at that size...
Author of the paper here. The file operations are distributed strictly without links; otherwise we could make no guarantees that work wouldn't be duplicated, or even that the algorithm would terminate. We were lucky in that the parallel file system itself wasn't POSIX, so we didn't have to make our tools POSIX either.
NB, that PDF seems to have a number of formatting glitches, e.g., the first sentence: "The amount of scienti c data". Numerous others as well, under both xpdf and evince.
How about this for a better cp strategy to deal with hardlinks:
1. Calculate the hash of /sourcedir/some/path/to/file
2. Copy the file to /tempdir/$hash if it doesn't exist yet
3. Hard-link /destdir/some/path/to/file to /tempdir/$hash
4. Repeat until you run out of source files
5. Recursively delete /tempdir/
This should give you a faithful copy with all the hard-links with constant RAM at the cost of CPU to run all the hashing. If you're smart about doing steps 1 and 2 together it shouldn't require any additional I/O (ignoring the extra file metadata).
Edit: actually this won't recreate the same hardlink structure, it will deduplicate any identical files, which may not be what you want. Replacing the hashing with looking up the inode with stat() would actually do the right thing. And that would basically be an on-disk implementation of the hash table cp is setting up in memory.
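For concreteness, the per-file core of that scheme would be something along these lines (a rough bash sketch; sha256sum is an arbitrary choice, /sourcedir, /destdir and /tempdir are placeholders, and as the edit notes, keying on a content hash deduplicates identical files rather than reproducing the original hard-link structure):

    src=/sourcedir/some/path/to/file
    hash=$(sha256sum "$src" | cut -d' ' -f1)                  # step 1
    [ -e "/tempdir/$hash" ] || cp -p "$src" "/tempdir/$hash"  # step 2
    mkdir -p "/destdir/$(dirname "${src#/sourcedir/}")"
    ln "/tempdir/$hash" "/destdir/${src#/sourcedir/}"         # step 3

Swap the sha256sum line for a stat -c %i (inode) lookup and you get the hard-link-faithful variant described in the edit.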
Let's assume we change your algorithm to use inode+device as key instead of the content hash, to make it behave as it 'should'.
> constant RAM
Well, definitely not constant space. The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.
You're trading a hash table implemented in user-space for a hash table or b-tree or similar implemented in kernel-space.
The advantage of the user-space solution is that you can optimize it for the job at hand. A trivial implementation would use a binary-encoded inode+device number pair (probably 2*8 bytes) and a pointer to the first path encountered (8 bytes), where the path is only accessed in matching cases. In fact, 17e9/432e6=39.35 B/file, pretty spot on. Seems like there's some room for optimization.
The kernel would need to store the file name string in addition to its hash or b-tree entry. This is unlikely to use less memory. Since it will presumably be randomly accessed, reading back from disk will not be more efficient than swapping either.
Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1. Since if most files only have a link count of 1, then cp has no reason to hold them in the hash table, as they can't be encountered again. Unless it does so anyway for the case where hardlinks are being created while cp is running. If so, it should probably have a command line flag to turn this safety feature off.
> In fact, 17e9/432e6=39.35 B/file, pretty spot on. Seems like there's some room for optimization.
It does not suggest there is a possible optimization. cp needs to store the pointer and the path itself in memory, so it is more than 8 bytes. ~39 bytes/file means the average path length was 15 bytes (8-byte source device + 8-byte source inode + 8-byte pointer + ~15-byte path = ~39 bytes). I guess one theoretical optimization would be to add to Linux a system call to open a file by device+inode number. The pointer+path would then be replaced by a fixed 8-byte dest device + 8-byte dest inode (so 32 bytes/file). Or, since the destination is only one filesystem in this case, you would just need an 8-byte source device + 8-byte source inode + 8-byte dest inode (24 bytes/file). In this latter case, the 17 GB hash table would have shrunk to 10 GB, which could have helped completely avoid the swap in the OP's case.
Edit: correct, I do not account for unused space in the hash table, for simplicity. So read all my numbers assuming a ±10-30% range. However, it seems that the OP who quoted 432e6 files meant 432e6 file names, not inodes. So if the average inode had, say, 4.3 hardlinks (IOW 4.3 names), the 17GB hashtable only had 100e6 files in it, so 170 bytes/file, so 170-8-8-8 = 146-byte pathnames on average (±10-30%), which sounds more reasonable.
Yes, you're right, I made a jump in my thinking; I wanted to make the point that you could exclude the storage for the paths from the working set size, if there is only rarely a match. Problem is we don't have a number for the working set size (except that it must be between 10 GB and 17 GB), and my brain just used the 17 GB.
But your calculation is missing the empty space in the hash table. When using open addressing (with the table being a flat array), there must be something like 10-30% empty slots (and in any case, twice the space during rehashing). That would leave just about 39.35 - 8 * 3 / 0.8 = 9.35 bytes per path. So, likely cp already optimizes by using per-device hash tables at least, then 39.35 - 8 * 2 / 0.8 = 19.35 bytes will remain for the path. The pointer could be shortened to 5 or perhaps even 4 bytes. Paths could be stored as an 'inverted' tree (linked lists of segments while sharing equal tails), which should be very natural since cp gets them piecewise from readdir anyway, so the 19.35 bytes just need to describe a parent path pointer (5 bytes?) plus some shared storage and the space for the filename.
Probably there are better hash table algorithms for low memory use, too. (Using buckets in arrays (instead of linked lists as is often done), and resize those on every append, trading CPU for space?) I wonder if there's a way to avoid temporary doubling of the hash table storage during resizing.
Edit: I've checked the GNU coreutils-8.23 sources, oddly cp is not using any of those optimizations; it uses both dev and ino as a key, concatenates every filename to its parent path and stores that (and a normal pointer in the table), and uses linked lists for the buckets in its hash tables.
Edit 2: a different solution that should really lower RAM requirements a lot would be to separate the walking of the source tree from the copying. When walking the source tree, just create a list of dev/ino/path; then sort this list according to dev/ino (there are optimized sort algorithms that keep RAM usage small, I remember them being in Knuth's TAOCP); then walk the list, while linking all the paths with the same dev/ino which you will now encounter in a group.
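A minimal bash sketch of that walk/sort/link idea, using GNU find and sort(1) (paths are placeholders; directory creation is handled crudely, and symlinks, metadata and filenames containing newlines are glossed over):

    cd /sourcedir || exit 1
    # pass 1: record device, inode, link count and path for every file
    find . -type f -printf '%D %i %n %p\n' | sort -k1,1n -k2,2n > /tmp/filelist
    # pass 2: paths sharing a device/inode pair are now adjacent; copy the
    # first one of each group and hard-link the rest to it
    prev=""
    while read -r dev ino nlink path; do
        mkdir -p "/destdir/$(dirname "$path")"
        if [ "$dev $ino" != "$prev" ]; then
            cp -p "$path" "/destdir/$path"
            first="/destdir/$path"
            prev="$dev $ino"
        else
            ln "$first" "/destdir/$path"
        fi
    done < /tmp/filelist
    # (files with a link count of 1 never need to enter the list at all)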
>The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.
Yeah I mentioned it was just an on-disk implementation of cp's in-memory hash table.
>Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1
Almost certainly. It was a backup machine with rsnapshot. Most files are just hardlinks to the same file from the previous backup. For his specific case it would probably be easier to code up a domain-specific solution. He probably knows the hard-linked files are all in a regular form like /basepath/YYYY-MM-DD/path/to/file. So he could just iterate /basepath/*/path/to/file and do the correct copy/hardlinks on the destination.
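If the layout really is that regular, one common way to do the snapshot-by-snapshot migration looks roughly like this (a sketch only, not necessarily what the parent had in mind; /basepath and /dest are made up, and it assumes the snapshots sort chronologically and were built rsnapshot-style, i.e. "same size and mtime as the previous snapshot" means "same hard-linked file"):

    mkdir -p /dest
    prev=""
    for snap in /basepath/*/; do
        name=$(basename "$snap")
        if [ -z "$prev" ]; then
            cp -a "$snap" "/dest/$name"                 # first snapshot: a real copy
        else
            cp -al "/dest/$prev" "/dest/$name"          # start as hard links to the previous copy
            rsync -a --delete "$snap" "/dest/$name/"    # then rewrite only what actually differs
        fi
        prev=$name
    done

Since rsync by default replaces a changed file with a new one rather than writing in place, the hard links survive only where the content really is shared, which is the same trick rsnapshot itself uses to build the snapshots.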
> The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.
For further reference here's a modified setup that won't use any resources that wouldn't be used anyway:
1. stat() /sourcedir/some/path/to/file and find out the hard-link count and the inode number
2. if the hardlink count is 1 copy to destination and this file is done (go to 7)
3. if the hardlink count is >1 copy /srcdir/path/to/file to /tempdir/$inodenumber if it doesn't exist yet
4. stat() /tempdir/$inodenumber and find out the destination hardlink count
5. if the hardlink count is the same as the source hardlink count move /tempdir/$inodenumber to destination (we're creating the last hard-link for this file)
6. if the hardlink count is less than source count hard-link /tempdir/$inodenumber to destination
7. Repeat until you run out of source files
8. Recursively remove /tempdir/ (if there are any files left there it means there were hardlinks to files in /sourcedir/ from other parts of the filesystem)
By doing this we're using the destination filesystem to keep the map only for the files with a hard-link count > 1, and doing it without using any more resources than will be in use by the end of the copy anyway, as long as there aren't any hard links to the same files from outside /sourcedir. If there are, we'll be using one additional hard link for each of them by the end of the copy.
The reason I prefer this to hand-rolling a solution is that it reuses well-tested code (hard links in the filesystem) and, more importantly, doesn't introduce any resource limits that aren't native to cp anyway, such as having enough memory/swap to hold the map. Basically the contract becomes "if the copy can fit in the destination filesystem, cp will complete".
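Here's a rough bash sketch of those eight steps (GNU stat assumed for %h/%i; /sourcedir and /destdir are placeholders; directories, symlinks, error handling and multiple source filesystems are all glossed over):

    mkdir -p /destdir/.tempdir
    copy_one() {
        src=$1
        dst="/destdir/${src#/sourcedir/}"
        mkdir -p "$(dirname "$dst")"
        nlink=$(stat -c %h "$src"); ino=$(stat -c %i "$src")   # step 1
        if [ "$nlink" -eq 1 ]; then                            # step 2: unique file
            cp -p "$src" "$dst"
            return
        fi
        tmp="/destdir/.tempdir/$ino"
        [ -e "$tmp" ] || cp -p "$src" "$tmp"                   # step 3: first sighting of this inode
        if [ "$(stat -c %h "$tmp")" -eq "$nlink" ]; then       # step 4
            mv "$tmp" "$dst"                                   # step 5: last link, move into place
        else
            ln "$tmp" "$dst"                                   # step 6: add another hard link
        fi
    }
    find /sourcedir -type f | while IFS= read -r f; do copy_one "$f"; done   # step 7
    rm -rf /destdir/.tempdir                                   # step 8: leftovers = outside links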
> By doing this we're using the destination filesystem to keep the map only for the files with a hard-link count > 1
That's the assumption anyway (cp already works that way); we're talking about this only because the OP had hard links everywhere.
> /tempdir/$inodenumber
That's what I was talking about in my former post. This does use resources: the folder "tempdir" takes up additional space (it is in addition to the file's content, inode and the other paths). It either needs a b-tree, hash-table or similar in kernel space (on disk, and cached in RAM). Every entry needs to store $inodenumber (which you should mean to be the inode number of the source file, not the target, because you need to map from the source inode, so you can't even argue that it is stored anyway already), plus the storage in the index (b-tree or whatever that makes up the directory listing), just like the user-space solution, but additionally at least also the length of the file name and the actual inode number of the target file.
You're not better off (almost surely worse), and you can't get around the fact that the basic algorithm, which is copying every file while walking through the source tree and checking it immediately, is what is at fault: it needs random access (every iteration needs a look up to a random spot), which is why RAM is required and any kind of disk access (be it swap, or just reloading directory index storage from the file system) is out.
That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now, BTW. It has a --tmp option to give a non-default location for temporary file storage (used for the linking table), you can point that to a place on the target file system or wherever else you like, which I think should answer your call for resource limitation. How well that algorithm (sorting (and linear append and read) instead of random access) works with regards to RAM usage, I still have to test.
> That's what I was talking about in my former post. This does use resources: the folder "tempdir" takes up additional space (it is in addition to the file's content, inode and the other paths).
See what I did again. I'm only creating /tempdir/$inodenumber if the hard-link count is >1 and once the destination hard-link count reaches the same count as the source it gets moved. So if there aren't any hard-links outside sourcedir this doesn't use more resources than you're going to use anyway at the end of the copy.
> That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now
If I'm reading it right it requires doing a sort of the full file list. This means you need to use enough space to list all the files and their inode number even if there are no hardlinks. Your disk usage is O(total files) whereas mine is O(inodes with hardlinks) and thus is always lower.
> I'm only creating /tempdir/$inodenumber if the hard-link count is >1
Which (again) is always, in the case of the OP.
> and once the destination hard-link count reaches the same count as the source it gets moved
Yes I've thought of that optimization; I haven't checked whether cp does it already, might well be.
Thing is, the OP said he's got nlink counts of about 20. This means, if the links are evenly spread out over the file system, that you see the last link when you're about 95% of the way through. That won't help much.
> If I'm reading it right it requires doing a sort of the full file list.
Correct.
> This means you need to use enough space to list all the files and their inode number even if there are no hardlinks.
My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted, instead copy them immediately or put them into a list that doesn't need sorting.
And, the whole point of using the sort approach is not to make the list or index smaller (that's simply not possible). The point is that it works with lower RAM requirements. Because sorting works well with serial access and little random access, using a list much bigger than RAM works well.
PS. I wonder how much smaller the table gets when you always have nlinks==2 (the best case for the "done with the inode" optimization). The size of the table will follow some curve over the duration of the program; I'm not sure what it is exactly, perhaps a sine half wave? I expect the max of the curve to be about half way of the max of the unoptimized variant.
Is it a pity to lose the possibility for that optimization by using my algorithm? Disk space is cheaper than RAM. Dropping most of the RAM usage and instead using twice as much in the form of disk still seems worthwhile.
The "done with the inode" optimization will work better if hard linked paths appear close to each other. So perhaps sometimes still useful. It should be possible to come up with a hybrid: go with the hash table until it reaches a certain size; then convert the hash table into the start of the list, and continue by appending to the list. I'm not sure this is worth the more complex programming, though. If it looks nice just to avoid creation of a temp file in the common case, then a less complex alternative may be to store the list in the heap until a certain size (only then dump to file), otherwise still use my algorithm. That way swapping will behave nicer, too, in case part of the cp heap needs to be swapped out because of other system activity.
PPS. note that we're assuming that all the hard links are within the source tree that's being copied. If the files also have a path outside the source path, then the optimization won't work as you never reach zero.
> My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted, instead copy them immediately or put them into a list that doesn't need sorting.
If you do that then your extra storage requirements are now O(paths to inodes with nlinks > 1) which is still strictly higher than O(inodes with nlinks>1). You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking, and doing an extra operation (the sort). I'd be very surprised if that was faster, and unless you have an on-disk sort it will still be RAM+swap limited instead of just disk-space limited.
> O(paths to inodes with nlinks > 1) .. O(inodes with nlinks>1)
Yes, that's right, I ignored that part in the calculation above.
By using the filename+parent pointer approach for path compression, that could be improved. (The current version of megacp also stores all necessary stat information (size, uid, gid, perms) in the tempfile to avoid a second stat call; this could be dropped if it makes matters worse.)
Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.
> You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking
Right now I'm doing exactly the same number of system calls as the cp approach needs. The second pass doesn't do readdir or, currently, even stat on the source file (that's why I saved the stat info above; which is more beneficial would need to be tested). It may be beneficial to sort the sorted-by-inode file back into sorted-by-path after the inode cluster recognition step to get original path locality during the copy phase (the included test/genfiles script already does that); I'll have to test that.
> unless you have an on-disk sort it will still be RAM+swap limited
Using an external (out-of-core) sort is the whole point, I've been saying that from the start. And I remember that Knuth wrote about external sorting, so I conclude that efficient algorithms exist, and I quite expect that GNU sort(1) already sorts big files efficiently (limited testing seems to confirm that: files several times the size of RAM sort just fine in the low single-digit minutes).
> I'd be very surprised if that was faster
You're ignoring how much slower disk seek is compared to serial reading/writing. A lot of similar optimizations are done in OSes, databases etc. for that reason. I guess you'll argue next that SSDs don't have such high seek times (and that everybody should be using them), and while those points are right, SSDs are still a lot slower for random versus serial access, so whether the overhead of the sorting approach is warranted on SSDs remains to be tested. In any case, I very much expect it to be faster in the OP's scenario, which was with spinning disks (note how it took him many days; I expect the sort for the same data to take a few hours, or quite likely less than an hour given his RAID array).
> Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.
> Using an external (out-of-core) sort is the whole point, I've been saying that from the start. And I remember that Knuth wrote about external sorting, so I conclude that efficient algorithms exist, and I quite expect that GNU sort(1) already sorts big files efficiently (limited testing seems to confirm that: files several times the size of RAM sort just fine in the low single-digit minutes).
The cp algorithm already solves this just fine and my proposed reimplementation on-disk solves the RAM usage problem. You're trying to make a strictly worse algorithm work. It both requires more storage and is heavier than it needs to be (full sort vs just a map). What do you expect to gain from it?
If you cp your data onto a Plan9 machine, what results is pretty much exactly the process you've outlined.
Plan9's default filesystem is made up of two parts: Fossil, and Venti.
- Fossil is a content-addressable on-disk object store. Picture a disk "formatted as" an S3 bucket, where the keys are strictly the SHAsums of the values.
- Venti is a persistent graph database that holds what would today be called "inode metadata." It presents itself as a regular hierarchical filesystem. The "content" property of an inode simply holds a symbolic path, usually to an object in a mounted Fossil "bucket."
When you write to Venti, it writes the object to its configured Fossil bucket, then creates an inode pointing to that key in that bucket. If the key already existed in Fossil, though, Fossil just returns the write as successful immediately, and Venti gets on with creating the inode.
Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. (Microsoft attempted this with WinFS, but mysteriously failed.) Is it just inertia? Why are we still creating new filesystems (e.g. btrfs) that don't follow this design?
For my understanding: what happens if you open a file, change one byte and close it again? Since the SHAsum of the contents has changed, is the entire file now copied?
Only the block where the byte resides. A block is typically 512 to 4096 bytes. (So it's not that unlike a normal drive, where you also have to rewrite an entire sector even if just a byte changed.)
Venti doesn't know about files, it only knows about blocks of data. It's a key/value store where the key is sha-1 and the value is a block(blob) of data.
The filesystem running on top of Venti will ask Venti to store the new block where you changed the byte and the filesystem will update the metadata that assembles all blocks to a file.
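A toy illustration of block-level content addressing (nothing to do with the real Venti code; 4 KiB blocks, SHA-1 keys, and /store is a made-up directory):

    store_blocks() {
        tmp=$(mktemp -d)
        split -b 4096 "$1" "$tmp/blk."                      # cut the file into fixed-size blocks
        for b in "$tmp"/blk.*; do
            key=$(sha1sum "$b" | cut -d' ' -f1)
            [ -e "/store/$key" ] || cp "$b" "/store/$key"   # identical blocks are stored only once
            echo "$key"                                     # the file is now this ordered list of keys
        done
        rm -rf "$tmp"
    }

Change one byte of the input and only one block hashes to a new key; every other key already resolves to a block in the store.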
1. Plan9 has no hard links, so if you copy a Unix directory tree to a Plan9 machine you'd lose all the hard-link info.
2. Venti doesn't use Fossil or inodes. Venti is just a content-addressable storage system, not a fileserver/filesystem.
3. Fossil is a fileserver.
> Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. Is it just inertia?
The penalty for doing content addressed filesystems is of course the CPU usage. btrfs probably has most of the benefits without the CPU cost with its copy-on-write semantics.
Note that what you describe (and my initial process) has different semantics than hard links. What you get is shared storage, but if you write to one of the files only that one gets changed, whereas with hard links both files change.
In effect, hard links (of mutable files) are a declaration that certain files have the same "identity." You can't get this with plain Venti-on-Fossil, but it's a problem with Fossil (objects are immutable), not with Venti.
Venti-on-Venti-on-Fossil would work, though, since Venti just creates imaginary files that inherit their IO semantics from their underlying store, and this should apply recursively:
1. create two nodes A and B in Venti[1] that refer to one node C in Venti[2], which refers to object[x] with key x in Fossil.
2. Append to A in Venti[1], causing a write to C in Venti[2], causing a write to object[x] Fossil, creating object[y] with key y.
3. Fossil returns y to Venti[2]; Venti[2] updates C to point to object[y] and returns C to Venti[1]; Venti[1] sees that C is unchanged and does nothing.
Now A and B both effectively point to object[y].
(Note that you don't actually have to have two Venti servers for this! There's nothing stopping you from having Venti nodes that refer to other Venti nodes within the same projected filesystem--but since you're exposing these nodes to the user, you get the "dangers" of symbolic links, where e.g. moving them breaks the things that point to them. For IO operations they have the semantics of hard links, though, instead of needing to be special-cased by filesystem-operating syscalls.)
Content-addressable systems trade CPU and memory for disk space. If you expect duplication to be low, you are usually better off with a background scrubber.
Be careful not to create too many files in /tempdir (create a hierarchy of subdirs). Filesystems slow way down when they get "too many" files in one dir.
But take a good look at a tarpipe before applying 'cp' to such a large set of files.
>Be careful not to create too many files in /tempdir (create a hierarchy of subdirs). Filesystems slow way down when they get "too many" files in one dir.
Yeah, you probably want something like /tempdir/srcfilesystem/NN/NN/NN/NN to handle the issue you mention and recursive copies that cross filesystem boundaries. Besides that tempdir will need to be on the same filesystem as destdir so the hard copy will work, and if destdir has been setup as a multiple filesystem target you'll need to keep a tempdir instance for each destination filesystem. Plenty of corner cases to think about.
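For instance (bash, with arbitrary padding width and fan-out, just to show the shape):

    ino=1234567890
    printf -v p '%012d' "$ino"                                 # zero-pad to a fixed width
    dir="/tempdir/src0/${p:0:2}/${p:2:2}/${p:4:2}/${p:6:2}"
    mkdir -p "$dir"
    tmp="$dir/${p:8}"      # -> /tempdir/src0/00/12/34/56/7890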
This has the slight flaw that you create two copies instead of one for each file (double write IO, double number of inodes) if there aren't many hardlinks.
I don't think that's an acceptable performance penalty in most situations.
That won't happen. I'm suggesting using hard-links so you only write the file once to one inode, you just create two files pointing to that inode. You can even optimize that a bit and not even create the two files when the inode has a hard-link count of 1.
Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.
A wonderful trend I've noticed in Free/Open Source software lately is proudly claiming that a program is "Valgrind clean." It's a decent indication that the program won't do anything silly with memory during normal use, like leak it. (There's also a notable upswing in the number of projects using static analyzers on their code and fixing legitimate problems that turn up, which is great, too!)
While you can certainly just let the OS reclaim all of your process's allocated memory at exit time, you're technically (though intentionally) leaking memory. When it becomes too hard to separate the intentional leaks from the unintentional leaks, I'd wager most programmers will just stop looking at the Valgrind reports. (I suppose you could wrap free() calls in "#ifdef DEBUG ... #endif" blocks and only run Valgrind on debug builds, but that seems ugly.)
A more elegant solution is to use an arena/region/zone allocator and place potentially large data structures (like cp's hard link/inode table) entirely in their own arenas. When the time comes to destroy one of these data structures, you can destroy its arena with a single function call instead of walking the data structure and free()ing it piece by piece.
Unfortunately, like a lot of useful plumbing, there isn't a standard API for arena allocators, so actually doing this in a cross-platform way is painful:
• Windows lets you create multiple heaps and allocate/free memory in them (HeapCreate(), HeapDestroy(), HeapAlloc(), HeapFree(), etc.).
• OS X and iOS come with a zone allocator (malloc_create_zone(), malloc_destroy_zone(), malloc_zone_malloc(), malloc_zone_free(), etc.).
• glibc doesn't have a user-facing way to create/destroy arenas (though it uses arenas internally), so you're stuck using a third-party allocator on Linux to get arena support.
• IRIX used to come with an arena allocator (acreate(), adelete(), amalloc(), afree(), etc.), so if you're still developing on an SGI Octane because you can't get enough of that sexy terminal font, you're good to go.
Adding some kind of arena-allocation library to both the build & runtime dependencies solely to keep valgrind happy, with no actual improvement in functionality or performance, doesn't seem like a great tradeoff on the software engineering front. I'd rather see work on improving the static analysis. For example if some memory is intended to be freed at program cleanup, Valgrind could have some way of being told, "this is intended to be freed at program cleanup". Inserting an explicit (and redundant) deallocation as the last line of the program just to make the static analyzer happy is a bit perverse.
(That is, assuming that you don't need portability to odd systems that don't actually free memory on process exit.)
I don't see why you assume arenas would be added "solely to keep valgrind happy". Arenas have better performance when allocating a high number of small chunks, because an arena can make better performance trade-offs for this use case than the general malloc allocator.
Yes, we need a standard arena call(s). As you said, glibc already uses them internally. From the malloc(7) man page:
    In a multithreaded application in which threads simultaneously allocate
    and free memory, there could be contention for these mutexes. To scalably
    handle memory allocation in multithreaded applications, glibc creates
    additional memory allocation arenas if mutex contention is detected. Each
    arena is a large region of memory that is internally allocated by the
    system (using brk(2) or mmap(2)), and managed with its own mutexes.
The proper thing to do is include valgrind.h and use the macros to determine if it's running under valgrind. Alternatively to accommodate other tools, hide it behind an env var. We should make it as easy to test production binaries as possible.
The advantage of the #ifdef DEBUG method is that it's possible to ensure that all of your allocations are clean at the object level, not just at the arena level.
In the limit, you could make one arena for your entire program and then deallocate it on process exit (essentially what the OS gives you by default). Using smaller and smaller arenas lets you see that at a finer and finer granularity. In the other limit you're back to the per-object allocations.
Using an arena to avoid per-object leak checking is good if you're absolutely sure that there are no leaks in that block of code. You're essentially hiding any leaks within the arena from valgrind.
But that seems dangerous, because it makes it harder to use valgrind for what it's good at: finding leaks that you're unaware of, especially in code that you 'know' is good.
> The advantage of the #ifdef DEBUG method is that it's possible to ensure that all of your allocations are clean at the object level, not just at the arena level.
One of the points of an arena allocator is to not have to be clean at object level because everything is intended to be freed anyway in the end. For example, one HTTP request or one copy operation fit perfectly for an arena allocator: the system is allowed to "leak" because it's known that the operation won't last that long within the lifetime of the process and all memory related to the operation won't be needed later. (Or cross-operation objects could be allocated from a different pool.)
This may be a little off topic, but I used to think RAID 5 and RAID 6 were the best RAID configs to use. They seemed to offer the best bang for the buck. However, after seeing how long it took to rebuild an array after a drive failed (over 3 days), I'm much more hesitant to use those RAID levels. I much prefer RAID 1+0 even though the overall cost is nearly double that of RAID 5. It's much faster, and there is no rebuild process if the RAID controller is smart enough: you just swap failed drives, and the RAID controller automatically utilizes the backup drive and then mirrors onto the new drive. Just much faster and much less prone to multiple drive failures killing the entire RAID.
This can not be stressed strongly enough. There is never a case when RAID5 is the best choice, ever [1]. There are cases where RAID0 is mathematically proven more reliable than RAID5 [2]. RAID5 should never be used for anything where you value keeping your data. I am not exaggerating when I say that very often, your data is safer on a single hard drive than it is on a RAID5 array. Please let that sink in.
The problem is that once a drive fails, during the rebuild, if any of the surviving drives experience an unrecoverable read error (URE), the entire array will fail. On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%. Enterprise SAS drives are typically rated 1 URE in 10^15, so you improve your chances ten-fold. Still an avoidable risk.
RAID6 suffers from the same fundamental flaw as RAID5, but the probability of complete array failure is pushed back one level, making RAID6 with enterprise SAS drives possibly acceptable in some cases, for now (until hard drive capacities get larger).
I no longer use parity RAID. Always RAID10 [3]. If a customer insists on RAID5, I tell them they can hire someone else, and I am prepared to walk away.
I haven't even touched on the ridiculous cases where it takes RAID5 arrays weeks or months to rebuild, while an entire company limps inefficiently along. When productivity suffers company-wide, the decision makers wish they had paid the tiny price for a few extra disks to do RAID10.
In the article, he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives (Dell calls them "near-line SAS", just an enterprise SATA), there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.
Note that the 10^14 figure is only what the HDD mfgs publish, and it has been the same for something like a decade. It's a nice, safe, conservative figure that seems impressively high to IT Directors and accountants, and yet it's low enough that HDD mfgs can easily fall back on it as a reason why a drive failed to meet a customer's expectations.
In reality you'll find that drives perform significantly better than that, arguably orders of magnitude better.
That said, I'm still not a fan of RAID5. Both rebuild speed and the probability of batch failures of drives (if one drive fails, the probability of another drive failing is significantly higher if they came from the same batch) make it too risky a prospect for my comfort.
The 10^14 bit error rate must be false, or else routine ZFS scrubs would be producing documented read errors. I'm inclined to believe the entire math here is wrong as well.
What I don't understand about the URE comes from [2]. If you have a 12TB RAID 5 array and you need to rebuild, and 10^14 means you approach a URE at around 12TB of data as the article says, what causes it to hit 12TB? Each disk has its own 10^14, which is 100TB. If you had 12TB across 4x3TB disks, each disk should still have a lot to go through.
10^14 is bits. When you divide 10^14 by 8 you get 12.5 trillion bytes, or 12.5 TB.
If you have 4x 4TB drives in RAID5, and one fails, then in order to rebuild with a replacement drive, you have to read all data from all surviving drives (3 x 4TB = 12TB).
They call it "Non-recoverable Read Errors per Bits Read" and then list "1 sector per 10^15". So for every 10^15 bits read they expect 1 sector to be unreadable.
OK, I see now, but if it needs to read 12TB from all drives together, that's still far from the 12.5TB-per-drive limit. That's where I'm confused.
Say we have a department of a company with 36 employees, and one pair of dice. We decide that if any person out of the entire department rolls a 12, then everyone in the department will be fired. The chance of rolling a 12 is 1/36. It doesn't matter if one person keeps rolling the dice, or if they take turns, the chances of everyone being fired are close to 100%.
The same is true for a disk array. Each read operation is an independent event (for the purpose of doing this math). The chance of one URE happening is 1/(10^14) for every bit read. It doesn't matter which disk it happens on. When it happens, the entire array is failed.
Also 12.5 TB is not a hard limit, just an average. The URE could happen on the very first read operation, or you might read 100 TB without a URE.
>On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%.
10^14 bits is 12.5 TB, so on average, the chance of 12TB being read without a single URE is very low, and the probability the array fails to rebuild is close to 100%. I was estimating 10^14 bits to be about 12TB, so the probability is actually 12/12.5 = 96% chance of failure.
>...he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives...there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.
A RAID6 with two failed drives is effectively the same situation as a RAID5 with one failed drive. In order to rebuild one failed drive, the RAID controller must read all data from every surviving drive to recreate the failed drive. In this case, there are 10x 4TB surviving drives, meaning 40TB of data must be read to rebuild. Because these drives are presumably enterprise quality, I am assuming they are rated to fail reading one sector for every 10^15 bits read (10^15 bits = 125 TB). So it's actually 40/125 = 32% chance of failure if you try to rebuild.
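For completeness, 40/125 is strictly the expected number of UREs during the rebuild; treating each bit read as an independent trial at the published rate, the chance of hitting at least one works out to

    P(>=1 URE) = 1 - (1 - 10^-15)^(3.2 x 10^14)  ≈  1 - e^(-0.32)  ≈  27%

so the linear figure slightly overstates the risk (and 12TB against a 10^-14 drive comes out around 1 - e^(-0.96) ≈ 62% rather than literally "close to 100%"). Either way, it's not a gamble you want to take on a rebuild.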
The main reason is not only the rebuild time (which indeed is horrible for a loaded system), but also the performance characteristics that can be terrible if you do a lot of writing.
We, for example, more than doubled the throughput of Cassandra by dropping RAID5; the saving of 2 disks per server had in the end required buying almost twice as many servers.
The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file. In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.
Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.
Add checksumming, and you probably would like to take a holiday whilst it copies the data :-)
if you want to copy everything and there's nothing at the target:
rsync --whole-file --ignore-times
that should turn off the metadata checks and the rsync block checksum algorithm entirely and transfer all of the bits at the source to the dest without any rsync CPU penalty.
also, for this purpose it looks like -H is required to preserve hard links, which the man page notes:
"Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H."
it'd be mildly interesting to see a speed test between rsync with these options and cp.
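putting those options together, the command for such a test would presumably look something like this (untested; all flags are documented in the rsync man page, and the paths are placeholders):

    rsync -aH --whole-file --ignore-times /sourcedir/ /destdir/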
there are also utilities out there to "post-process" and re-hardlink everything that is identical, so that a fast copy without preserving hardlinks and then a slow de-duplication step would get you to the same endpoint, but at an obvious expense.
I use rsync to replicate 10TB+ volumes without any problems. It's also very fast to catch up on repeat runs, so you can ^C it without being too paranoid.
It sounds like OP had large hard-link counts, since he was using rsnapshot. It's likely that he simply wouldn't have had the space to copy everything over to the new storage without hard-links.
You cannot exaggerate what rsync can do... I use it to back up my almost-full 120GB Kubuntu system disk daily, and it goes through 700k files in ~2-3 minutes. Oh, and it does it live, while I'm working.
To be fair: the case described here is two orders of magnitude larger than your use case in both dimensions.
And the critical aspect from a performance perspective was where the hash table became too large to fit into memory. Performance of all sorts goes pear-shaped when that happens.
rsync is pretty incredible, but, well, quantity has a quality all its own, and the scale involved here (plus the possible in-process disk failure) likely wasn't helping much.
Yep, my answer when cp or scp is not working well is always to break out rsync, even if I'm just copying files over to an empty location.
I've had good luck with the -H option, though it is slower than without the option. I have never copied a filesystem with nearly as many files as the OP with -H; the most I've done is probably a couple million, consisting of 20 or so hardlinked snapshots with probably 100,000 or 200,000 files each. rsync -H works fine for that kind of workload.
> In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.
I regularly rsync machines with millions of files. On a slow portable 2.5" 25MB/sec USB2 connection it's never taken me more than 1hr on completely cold caches to verify that no file needs to be copied. With caches being hot, it's a few minutes. And on faster drives it's faster still.
Unless you are doing something weird and force it to checksum, checksumming mostly kicks in on files that have actually changed and can be transferred efficiently in parts (i.e. network; NOT on local disks). In other cases, it's just a straight copy.
Have you actually used rsync in the setting you describe, or at least read its manual?
> The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file.
- In this scenario, the receiving side is empty, so there is no need to check every single file.
- Since 3.0, rsync walks the dirtree while copying (unless you use some special options). So, in some sense, rsync is already "deep syncing inside the FS tree", as you put it.
rsync since version 3 no longer does the full tree traversal up front. Version 3 isn't in RHEL/CentOS 5 but is in 6, and is really trivial to compile by hand anyway. That's made life a lot easier for me in the past.
Some of the other features in rsync would seem to make it a more logical fit for this task anyway, e.g. it's resumable, it supports on-the-fly compression (though the value of that depends on the types of files being transferred), native logging etc etc.
Damn, I could use some advice there. Right now I have a system which uses rsync to back up some files. It takes about 8 hours to go over 7+ million files (~3.5TB). I'd love to speed this up. I should mention that the copy is done over curlftpfs :(.
Assuming it is reading and writing almost simultaneously, that is ~125Mbyte/sec on each physical drive, which isn't bad considering the latency of the read-from-three-or-four-then-checksum-then-write-to-the-other-one-or-two process, especially if the array was in use at the time.
There are a few ways to improve RAID rebuild time (see http://www.cyberciti.biz/tips/linux-raid-increase-resync-reb... for example) but be aware that they tend to negatively affect performance of other use of the array while the build is happening (that is, you are essentially telling the kernel to give more priority to the rebuild process than it normally would when there are other processes accessing the data too).
In my experience working with Debian, the rebuild is sensitive to disk usage. If you start using the disks, the rebuild will slow down intentionally so as to not choke the drives. This could explain why it was fast for you but not for the parent.
Edit: To clarify, I was using mdadm, not hardware raid.
Yeah, I question that rebuild. If the array is being used along with the rebuild, typically the rebuild process gets put to a lower priority. I've seen IBM enterprise RAID 5 storage arrays take forever to rebuild if they were the main storage for a high transaction database.
You can tune the speed when the disks are in use; I usually bump it to around 60MB/s as long as there isn't a live database on the disk. The default is something insanely slow, and it impacts reads/writes almost as much as the higher value.
My largest ZFS pool is currently ~64TB (3 x 10x3TB raidz2 vdevs).
The pool has ranged from 85%-95% full (it's mostly at 85% now and used mostly for reads).
Resilvering one drive usually takes < 48 hours. Last time took ~32 hours.
Something else cool:
When I was planning this out I wrote a little script to calculate the chance of failure.
With a drive APF of 10% (which is pretty high), and a rebuild speed of 50MB/sec (very low compared to what I typically get) I have a 1% chance of losing the entire pool over a 46.6 year period. If I add 4 more raidz2 10X3TB VDEVS that would drop to 3.75 years.
What is APF? And why not use the typical URRE rate to calculate your stats?
I'm always mystified at how stupid our storage systems are. Even very expensive SAN solutions from EMC and the like are just... stupid. We've got loads of metrics on every drive, but figuring out that those things should be aggregated and subjected to statistical analysis just seems to have not been done yet.
What I really want is a "pasture" system - a place I can stick old drives of totally random sizes and performance characteristics and have the system figure out where to put the data in order to maintain a specific level of reliability. Preferably backed by an online database that tracks the drive failure rate of every drive on every deployed system, noting patterns in 'bad batches' of certain models and the like. If one of my drives would have to beat 3-standard-deviations odds to survive for the next week, move the damn data to somewhere better. And if you've got 2 150GB drives and 1 300GB drive, then each block on those drives has a rating of 2.0 - adjusted for the age and metrics of the drive.
Oh well, maybe when I retire in 30 years storage systems will still be as stupid as they've remained for the past 30 years and I'll have another project to add to the pile I don't have time to work on now.
btrfs at least offers the flexibility to use a collection of mismatched drives and ensure that N copies of the data exist across separate devices. Once its parity-based RAID modes are stable, it should be able to retain that flexibility while offering a redundant option for cases where N=2 wastes too much space. That, combined with regular scrubbing, should suffice for moderately sized arrays.
Treating drives differently based on their expected failure rates seems like it would only matter for very large arrays trying to keep the redundancy as low as possible.
It really depends on the load and whether your raidz has 1, 2, or 3 parity drives. From what I've experienced, resilvering the new drive in a raidz seems to happen at about 30% of the maximum ZFS throughput under medium load. And even better than many RAID5/6 implementations: it only resilvers enough to store your data, not the entire drive.
in my opinion, raid 10 is by far the best if you have to support the operations of a production site in a real world environment. just keep spare controllers and drives handy. it's just a matter of swapping out failures.
three cost levers: capacity, speed, and downtime.
basically, at some point you will begin to realize the cheapest cost is the capacity itself; everything else is an order of magnitude more difficult to justify. downtime and slowness are simply unacceptable, whereas buying more stuff is a perfectly acceptable and quite feasible solution in an enterprise environment.
I work with a number of medium-sized businesses, and you can clearly tell which have had and which haven't had major downtime due to these types of hardware failures.
Many people who have never experienced issues have no interest in even forming a basic DR plan, but the ones who have just start throwing money at you the moment you mention it.
Buying more stuff is WAY cheaper than slowness and downtime, and penny pinching on this count is going to cause so much additional heartache for what will in the end cost more anyway.
At one time there was; it was called the Internet. The archive still exists, but it's been made harder to browse through due to being jumbled up with javascript and cat gifs.
False choice, isn't it? I mean, the complaint isn't that the public now has sites with massive javascript and related technologies. The complaint is that it has muscled out useful sites that did not use those technologies. And it should be heavily noted that the heavy muscles that have pushed out many of these sites is not necessarily "the public."
Kind of funny to say that on a text-only JS-free site that seems to be alive and well, linking to an article on an old-school mailing list archive site. :)
Oh, certainly. It's just that I resonate with the sentiment that these sites aren't the majority.
Even this site, honestly, is less than easy to deal with on a recurring basis. (Consider: it's hard to remember which was the top story three days ago at noon.) Specifically, sometimes I lose a story because I refresh and something plummeted off the page. Hard to have any idea how far to "scroll back" to see it.
It's also an astute observation that is incredibly true! Most sites now incorporate JavaScript.
I remember when it was "new-fangled" (and DHTML anyone?) and everyone put those annoying cursor trackers that trailed blobs from where your cursor was.
What I mainly remember is two things: suddenly everyone had to do client-side form checking to prevent people from typing dashes in a phone number field, and everyone tried to imitate Java without actually using Java. It seems like any way you could enhance a page by combining mouseover events, gifs and loading content dynamically you would throw into a page, even if it made the user experience a total mess. Which of course never happens today.
My favorite memory is helping to write some of the first "griefers" using JS. If you were unlucky enough to stumble onto our page you had a window literally jumping around your screen with pop-ups til the end of time. There was no way to exit the browser until you'd clicked it all and it'd never end. Just getting your mouse near the 'X' would make the window jump around again. Of course, that was a mild annoyance compared to loading fonts and images so big they'd consume all the available RAM and make the machine fall over from swapping in about 15 seconds.
I have those memories too, on Windows 95 or 98? I remember IE4 being released and it could do a lot more than IE3, plus it was more "forgiving" with sloppy HTML compared to Netscape.
I would usually use the tarpipe mentioned already by others for this sort of thing (although I probably wouldn't do 432 million files in one shot):
(cd $SOURCE && tar cf - .) | (mkdir -p $DEST && cd $DEST && tar xf -)
Another option which I just learned about through reading some links from this thread is pax (http://en.wikipedia.org/wiki/Pax_%28Unix%29), which can do it with just a single process:
(mkdir -p $DEST && cd $SOURCE && pax -rw . $DEST)
Both will handle hard links fine, but pax may have some advantages in terms of resource usage when processing huge numbers of files and tons of hard links.
I wasn't suggesting blindly using a tar pipe for as many files as the OP was trying to address. My main point was that the tar pipe (locally or over ssh) is superior to cp/scp for the sorts of things many of us do regularly (as discussed elsewhere in this thread), such as copying tens of thousands to low millions of mostly smallish files, locally or remotely, while preserving the structure of links and file metadata. pax looks like a simpler alternative to the tar pipe that accomplishes the same thing in a single command, although it does still use a hash table internally.
I wonder if using something other than a hash table would be worthwhile for cases like the OP though, or if there is a tool that has that option -- the memory overhead of empty buckets and resizing seem like serious downsides when you're trying to handle as many files as the OP.
It's going to scale just like you'd imagine it would. All the people saying "oh, tar was built for this" obviously haven't actually tried replicating the experiment using tar.
I've written a program that attempts to deal with the given situation gracefully: instead of using a hash table, it creates a temporary file with a list of inode/device/path entries, then sorts this according to inode/device, then uses the sorted list to perform the copying/hardlinking. The idea is that sorting should work with much lower RAM requirements than the size of the file being sorted, because of data locality: unlike the random accesses of the hash table, it can work with big chunks at a time, at least when done right (a bit hand-wavy, I know; this is external sorting, which I remember Knuth having written about, though I haven't had the chance to recheck yet). The program uses the system sort command, which hopefully implements this well already.
The program stupidly calls "cp" right now for every individual file copy (not the hard linking), just to get the script done quickly; it's easy to replace that with something that saves the fork/exec overhead. Even so, it might be faster than the swapping hash table if the swap is on a spinning disk. Also read the notes in the --help text. I.e. this is a work in progress as a basis to test the idea; it will be easy to round off the corners if there's interest.
PS. the idea of this is to make copying work well with the given situation on a single machine, unlike the approach taken by the dcp program mentioned by fintler, which seems to rely on a cluster of machines.
So it was all the files in one go, presumably with `cp -r`?
What about doing something with find/xargs/i-dunno to copy all the files, but break 'em into batches so you aren't asking cp to do its bookkeeping for so many files in one process? Would that work better? Or worse in other ways?
The main issue is that there's no API to get the list of files hard-linked together: the only way is to check all the existing files and compare inodes. If you're doing a plain copy over 2 filesystems, you cannot choose which number the target inode will be, so you need to keep a map between inode numbers, or between inodes and file names ("cp" does the latter).
It should work much better provided you can easily divide it into batches where you know they don't have any shared hardlinked files. Otherwise you will duplicate those files wasting some space. That may be fine as well though.
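A naive batched version with GNU find/xargs/cp might look like this (sketch only; --parents recreates the directory structure under the target, and as noted above, hard-linked files that land in different batches simply get copied twice):

    cd /sourcedir &&
    find . -type f -print0 | xargs -0 -n 100000 cp -p --parents -t /destdir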
Unix could really use a way to get all the paths that point to a given inode. These days that shouldn't really cost all that much and this issue comes up a lot in copying/sync situations. Here's the git-annex bug report about this:
Wow, it's not every day I hear about a filesystem feature that Windows has and Linux doesn't. (On a recent windows system: fsutil hardlink list <path> -- you can try any random exe or DLL in system32 for an example of a hard link.)
I forget what the api for that looks like if I ever knew. Might be private.
I am surprised, usually Linux is way ahead of Windows on shiny filesystem stuff.
Linux just has more filesystems, and sadly a lot of them have various flaws. I'm surprised when people are surprised that Linux isn't some completely superior technical marvel. BSD and Unix systems have been more advanced for decades.
Everyone on Linux still uses tar for god's sake, even though zip can use the same compression algorithms people use on tarballs, and zip actually stores an index of its files rather than 'cat'ing each record on top of the next like an append-only tape archive. (Obviously there are better formats than 'zip' for any platform, but it's just strange that nobody has moved away from tar)
tar is good enough for many uses, so people did not move on.
And it doesn't help that tar.gz / tar.bz2 compresses way better than zip in most cases (thanks to using a single compression context, rather than a new one for each file; and also compressing the filenames in the same context), and that it carries ownership and permission information with it - whereas zip doesn't.
The HVSC project, who try to collect every single piece of music ever created on a Commodore C64, distribute their archive as a zip-within-a-zip. The common music file is 1k-4k, goes down to ~500-1000 bytes zipped; The subdirectory+filename are often 100 bytes with a lot of redundancy that zip doesn't use, so they re-zip. Had they used .tar.gz or .tar.bz2, the second stage would not be needed.
I don't know. I do know there are tens of thousands of hardlinked files under system32, and the command I listed above can enumerate links pretty quickly.
I found an issue in cp that caused 350% extra memory usage for the original bug reporter; fixing it would have kept his working set at least within RAM.
> Wanting the buffers to be flushed so that I had a complete logfile, I gave cp more than a day to finish disassembling its hash table, before giving up and killing the process. ... Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.
Does anyone know what the 'tear down' part is about? If it's about erasing the hashtable from memory, what takes so long? I would expect that to be very fast: you don't have to write zeros to it all, you just tell your GC or memory manager to mark it as free.
Looking at the code, it looks like deallocating a hash table requires traversing the entire table, because there is malloc()'d memory associated with each hash entry, so each entry has to be visited and free()'d. From hash_free() in coreutils hash.c:
  for (bucket = table->bucket; bucket < table->bucket_limit; bucket++)
    {
      /* Each bucket heads a chain of overflow entries, each of which
         was malloc()'d separately and must be visited to be freed.  */
      for (cursor = bucket->next; cursor; cursor = next)
        {
          next = cursor->next;
          free (cursor);
        }
    }
Whereas if you just don't bother to deallocate the table before the process exits, the OS will reclaim the whole memory block without having to walk a giant data structure. That's a fairly common situation in C programs that do explicit memory management of complex data structures in the traditional malloc()/free() style. Giant linked lists and graph structures are another common culprit, where you have to pointer-chase all over the place to free() them if you allocated them in the traditional way (vs. packing them into an array or using a userspace custom allocator for the bookkeeping).
... and both malloc() and free() need to maintain (update) their own data structures, which in many implementations (don't know about the current glibc one) are stored interspersed with the data. Walking a hash table and calling free() on each and every item, even if it doesn't seem so, actually dirties a large number of memory pages (which could be truly randomly scattered because it is a hash table), which then need to be re-written into the swap space on swap-out.
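A self-contained way to see the effect (a toy sketch, not cp's code): build a large linked structure, then either walk it with free() or just exit and let the kernel do the reclaiming.

    /* Toy demonstration of the trade-off discussed above: allocate a large
       linked structure, then either free() every node (set FREE_BEFORE_EXIT
       in the environment) or simply return and let the kernel reclaim the
       whole address space at once.  Bump the node count past RAM if you
       want to reproduce the swap-thrashing behaviour. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *next; char payload[40]; };

    int main(void)
    {
        struct node *head = NULL;
        for (long i = 0; i < 10L * 1000 * 1000; i++) {
            struct node *n = malloc(sizeof *n);
            if (!n) break;
            n->next = head;
            head = n;
        }

        if (getenv("FREE_BEFORE_EXIT")) {
            /* Visits every node; once the list has been swapped out,
               each free() can mean a random page-in. */
            while (head) {
                struct node *next = head->next;
                free(head);
                head = next;
            }
        }
        /* Otherwise: just exit. */
        return 0;
    }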
You could, but it's fairly uncommon to do that kind of "custom allocator" style in C, especially in traditional Unix utility programming. It seems to be more of a C++ thing culturally in my experience, though you do find it in C in "big" programs, like scientific-computing code.
Others have already mentioned that it isn't really necessary, but for some historical context this wasn't always true. When I first started programming computers, apps (written in languages that aren't garbage collected) on your typical consumer level computer/OS had to be really careful with freeing up all allocated memory before exiting or that memory would be gone forever (as far as the OS was concerned when it came to reallocating it to other apps) until the user rebooted the computer.
Advances coming from the combination of more advanced operating systems and CPU features like MMUs have made this a non-issue in most cases (can still be an issue on embedded, etc).
This was a particular problem on the Amiga system, among others. There were attempts to retro-fit resource tracking to the OS, but this was made all-but-impossible by the fact that a process can malloc() a chunk of memory, write a message in it, and then hand a pointer to that message to another process, expecting the other process to dealloc() it. This made for wonderfully cheap inter-process communication (for instance, the filesystem would load blocks from disc and pass them as messages to the process wanting them).
If it's the last thing you do before you exit the process, it isn't necessary, because the OS will reclaim your process's memory in one fell swoop. I believe that's what the linked post is advocating 'cp' should do. (At least on modern systems that's true; maybe there are some exotic old systems where not freeing your data structures before exit causes permanent memory leaks?)
It's seen as good C programming practice to free() your malloc()s, though, and it makes extending programs easier if you have that functionality, since what was previously the end of program can be wrapped in a higher-level loop without leaking memory. But if you really are exiting for sure, you don't have to make the final free-memory call. It can also be faster to not do any intermediate deallocations either: just leave everything for the one big final deallocation, as a kind of poor-man's version of one-generation generational GC. Nonetheless many C programmers see it somehow as a bit unclean not to deallocate properly. Arguably it does make some kind of errors more likely if you don't, e.g. if you have cleanup that needs to be done that the OS doesn't do automatically, you now have different kinds of cleanup routines for the end-of-process vs. not-end-of-process case.
I tend to do this in my C programs because in development I usually have malloc() wrapped so that any block that hasn't been free()'d is reported at exit() time. This kind of check for lost pointers is usually so cheap that you use it even if you never expect to run on a system without decent memory management.
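Something along these lines, say (xmalloc/xfree are invented names, not coreutils functions):

    /* Minimal sketch of a development-time wrapper: count live allocations
       and report anything that was never freed when the process exits. */
    #include <stdio.h>
    #include <stdlib.h>

    static size_t live_blocks;

    static void report_leaks(void)
    {
        if (live_blocks)
            fprintf(stderr, "leak check: %zu block(s) never freed\n", live_blocks);
    }

    static void *xmalloc(size_t n)
    {
        void *p = malloc(n);
        if (p)
            live_blocks++;
        return p;
    }

    static void xfree(void *p)
    {
        if (p)
            live_blocks--;
        free(p);
    }

    int main(void)
    {
        atexit(report_leaks);
        char *a = xmalloc(16);
        char *b = xmalloc(32);
        xfree(a);
        (void)b;    /* b is deliberately leaked; the atexit hook reports it */
        return 0;
    }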
As an aside, GNU libc keeps (or at least used to keep, I haven't checked in years) the pointers used by malloc()/free() next to the blocks themselves, which gives really bad behavior when freeing a large number of blocks that have been pushed out to swap: you wind up bringing pages back in just to free them, because the memory manager's working set is the size of all allocated memory. Years ago I wrote a replacement that avoided this, just to speed up Netscape's horrible performance when it resized the bdb 1.85 databases it used to track browser history. The browser would just "go away", thrashing the disk for hours, and killing it just returned you to a state where it would decide to resize again an hour or so after a restart. Using LD_PRELOAD to load a malloc that kept its bookkeeping away from the allocated blocks changed hours to seconds.
It's not necessary per se, but many programmers do it out of good habits. It becomes much harder to e.g. find memory leaks in your code if you never free the memory you haven't leaked. And normally the extra overhead it entails is imperceptible.
Obviously many programmers do not think to (or cannot easily) test their code with 40 TBs of data.
If you wrote cp in such a way that it's a reusable C library that other programs might use, then I can understand why they do this. It still doesn't make much sense for the command-line version to call this. Of course, it might run on some weird platform where the OS doesn't reclaim the memory of a process on exit(?).
Once your hash table and its linked lists have been swapped to disk, each pointer dereference causes a random disk access, which are orders of magnitude slower than RAM.
So it's a hash table that's about twice the size of physical memory. The entries in the hash table are usually pointers to memory that you need to deallocate (char* filenames in this case, maybe some other stuff too), so if you're a good citizen you iterate through them all and deallocate (mark it as free) each one as you go. It would seem there were some pathological swap issues doing that. It would be interesting to see if a tree behaved better in this type of situation. The allocator has tables of its own in memory too; if you have to swap those in to find the memory you're about to deallocate, swapping out the user data structures to make room for the allocator's structures, or something whacky like that, the slowdown kind of makes sense.
His patch just skips that step and ends the program.
From reading the thread, what happens is that cp before it exits tries to return all memory to the operating system by explicitly freeing it (this is not necessary on modern operating systems as far as I'm aware). What I draw from it is that since it's so big, parts of the structure of the hash table are on the disk, so it will have to swap in almost all of the table to be able to explicitly free everything.
I don't understand it, but the OP says this, implying the author agrees with you for any modern system, but not on 'old systems without working memory management':
> And unless old systems without working memory management must be supported, I don't see any harm in simply removing the call to the forget_all function towards the end of cp.c.
This is systems C; the memory is manually managed with malloc(). You can't simply nil the reference and expect a GC to come through and clean it up for you: you have to issue a free() for each malloc() that was called. While it might be possible to write a custom disposal routine that does that efficiently for the structure, the simpler way is to just use the normal functions you use for removing single items and call them against everything. Doing that will not be as efficient, because those functions keep the structure in a consistent state and don't perform bulk operations, but they have already been written and tested, and in normal use cases they won't be a bottleneck.
> You can't simply nil the reference and expect GC to come through and clean it for you
This is exactly what happens when your process exits and its private memory space is freed in its entirety. Hence the discussion. There's no need to clean up before exit, except if you want to (i) use a leak checker; or (ii) run on a machine with no MMU.
I appreciate that he had the foresight to install more RAM and configure more swap. I would hate to be days into a transfer and have the OOM killer strike.
The difficulty is that you are using a filesystem hierarchy to 'copy files' when you actually want to do a volume dump (block copy). Use XFS and xfsdump, or ZFS and zfs send, to achieve this.
Copy with hard link preservation is essentially like running dedupe, except that you know ahead of time how many dupes there are. Dedupe is often very memory intensive, and even well-thought-out implementations don't support keeping bookkeeping structures on disk.
"Normally I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."
I was simplifying... dump backs up inodes not blocks. Some inodes point to file data and some point to directory data. Hard links are references to the same inode in multiple directory entries, so when you run xfsrestore, the link count increments as the FS hierarchy is restored.
xfsdump/zfs send are file system aware, unlike dd, and can detect fs corruption (ZFS especially having extensive checksums). In fact, any info cp sees about corruption comes from the FS code parsing the FS tree.
However, except on zfs/btrfs, data block corruption will pass unnoticed. And in my experience, when you have bad blocks, you have millions of them -- too many to manually fix. As this causes a read hang, it is usually better to dd copy the fs to a clean disk, set to replace bad blocks with zeros, then fsck/xfs_repair when you mount, then xfsdump.
If the risk of keeping the system running while the array rebuilt was deemed too high, I would have just gone with a dd/ddrescue of the remaining disks onto new disks and then moved on from there.
+1 for mentioning ZFS. It's really quite amazing. Almost like futuristic alien technology compared to the other freely available file systems.
Given the issues encountered it would make sense to split this into two passes: 1. identify files with bad blocks and log; 2. do the block-level copy. The disadvantages of file-level copy weigh heavier than the doing this in a single step.
In light of experience would it perhaps be helpful after all to use a block-level copy (such as Partclone, PartImage, or GNU ddrescue) and analyze later which files have the bad blocks?
I see that the choice of a file-level copy was deliberate: "I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."
A while back I had a hard drive die with an unrecoverable error on the superblock (ext3) and a few other places. I sent the hard drive to a professional recovery firm who were as incompetent as incompetent can be. I got the duff drive back and decided to have a go myself. Recovering the superblock was easy - there are extra copies of that. I then did a full map of the drive listing bad blocks, and did a reverse map to work out which files contained those blocks. Luckily, the only file that contained any bad blocks was a swap file, so I recovered absolutely everything. Other bad blocks were in unallocated space. Ext3 is actually a sane enough disc layout that it is possible to work out what is going on manually. I would have been stuffed if it had been a more complicated filesystem.
Also, the article makes no mention of analyzing unrecoverable files, i.e. how errors from the cp operations were handled. And with this many files, that would not be feasible without an error log file.
So going with a simple block copy should suffice IMHO.
I'm the OP, so I can shed a bit of light on that: Dell's support suggested a file-level copy when I asked them what they recommended (but I'm not entirely sure they understood the implications). Also, time was not a big issue.
I did keep a log file with the output from cp, and it clearly identified the filenames for the inodes with bad blocks. Actually, I'm not sure how dd would handle bad blocks.
I was about to bet on a "read, fail, repeat, skip" cycle for dd's behaviour but, looking into the coreutils source code at https://github.com/goj/coreutils/blob/master/src/dd.c, if I'm not mistaken, dd does not try to be intelligent and just uses a zeroed-out buffer, so it would return 0's for unreadable blocks.
In Windows-land, the default copy is pretty anemic, so probably most people avoid it for serious work.
I'd probably use robocopy from the command line. And if I was being lazy, I'd use the Teracopy GUI.
I think my limit for a single copy command has been around 4TB with robocopy--and that was a bunch of large media files, instead of smaller more numerous files. Maybe there's a limit I haven't hit.
I've used FastCopy for GUI based larger transfers, it's open source and can handle larger datasets well in my experience. It also doesn't choke on >MAX_PATH paths. Haven't had problems with it. Supposedly it's the fastest tool around...
The only slight issue is that the author is Japanese so the English translations aren't perfect plus the comments in the source are in Japanese.
Today there are N applications. "We can't increase MAX_PATH because it will break existing applications!"
Tomorrow there are N+M applications. "We can't increase MAX_PATH because it will break existing applications!"
Repeat forever.
Any time you are faced with a hard technical decision like this, the pain will always be least if you make the change either:
1. During another transition (e.g. 16-bit to 32-bit, or 32-bit to 64-bit). Microsoft could have required all 64-bit Windows apps to adopt a larger MAX_PATH, among other things.
2. Right NOW, because there will never be an easier time to make the change. The overall pain to all parties will only increase over time.
The MAX_PATH limitation can be avoided by using the \\?\ prefix (which also disables some kinds of path normalization, like converting / to \) or by using paths relative to the current directory.
Unfortunately, most tools (including Windows Explorer and other Microsoft software) still don't work with paths longer than MAX_PATH.
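For example (a hedged sketch; the path below is only a placeholder), a program can opt in per call by handing CreateFileW an extended-length path directly:

    /* Sketch: open a path longer than MAX_PATH by passing the \\?\ prefix
       straight to CreateFileW, which bypasses the legacy path normalization. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* With \\?\ the path must be absolute and use backslashes, since
           normalization (including / -> \ conversion) is disabled. */
        const wchar_t *path = L"\\\\?\\C:\\some\\very\\deeply\\nested\\path\\file.txt";

        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
            return 1;
        }
        /* ... read from h as usual ... */
        CloseHandle(h);
        return 0;
    }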
It happens on pretty much every network share, especially when you address it with a UNC path ("\\server\share\...") instead of mapping it as a drive.
If you have non-Windows users using the share it's pretty much guaranteed, since they don't have a problem with it - only Windows chokes on this.
Even Windows users manage to somehow create such paths (files) occasionally. Then you need to break out cmd.exe or somesuch because Explorer can't touch them.
Once you get above about a million files on NTFS, life becomes very slow. If you use a DVCS and check out a bunch of projects with a bunch of history, you can easily get to a point where you can't do too much with the drive. This was true as of 7/2008, perhaps 8/2012 have different/better tunings.
Number of files on the drive shouldn't matter that much, actually. In a single folder is a different matter, of course (for that you definitely want to disable 8.3 compatibility). But the only way I can imagine right now where the number of files on the drive makes a difference would be where you crammed the drive almost full, resulting in a very fragmented MFT (normally the file system tries to keep space for growing the MFT, but not if you try to fill that space with data, too).
When I was using Windows, I personally used Robocopy to make personal backups to external drives. I was happy when MS started including it in Windows.
The email states that file-based copy operations were used in favor of dd due to suspected block errors. Two questions come to mind:
1. I've not used dd on failing media, so I'm not sure of the behavior. Will it plow through a file with block-read failures or halt?
2. There's the ddrescue utility, which is specifically intended for reading from nonreliable storage. Seems that this could have offered another means for addressing Rasmus's problem. It can also fill in additional data on multiple runs across media, such that more complete restores might be achieved.
https://www.gnu.org/software/ddrescue/ddrescue.html
OP said "I went for a file-level copy because then I'd know which files contained the bad blocks". When you copy the block device with ddrescue (dd doesn't have logic to work around the bad sectors and the only sensible action for it is thus to stop, but don't take my word for it), the result will just have zeroes in the places where bad blocks were, and, assuming the filesystem structure is good enough (you should run fsck on it), will give you files with zones of zeroes. But you won't know which files without either comparing them to a backup (which you won't have by definition if you're trying to recover) or with a program that verifies every file's structure (which won't exist for the general case). Whereas cp will issue error messages with the path of the file in question. So the OP's decision makes sense.
I've played with ddrescue very lightly. From the GNU webpage linked above, it appears it creates logfiles which can be examined:
Ddrescuelog is a tool that manipulates ddrescue logfiles, shows logfile contents, converts logfiles to/from other formats, compares logfiles, tests rescue status, and can delete a logfile if the rescue is done. Ddrescuelog operations can be restricted to one or several parts of the logfile if the domain setting options are used.
That might allow for identification of files with bad sectors.
That would need either a hook to the kernel or a file system parser.
Even if you manage to do that, I'm not sure it would be a good idea to continue to use a file system that has lost sectors, even after fsck. Are you sure fsck is fixing every inconsistency? Are there any automatic procedures in place that guarantee that the fsck algorithms are in sync with the actual file system code? (Answer anew for any file system I might be using.) You definitely should do backups by reading the actual files, not the underlying device; perhaps in this case it could be OK (since it was a backup itself already, hence a copy of live data; but then, if the OP bothered enough to recover the files, maybe he'll bother enough to make sure they stay recovered?)
And hopefully you would have written up a similar essay on the oddball experiences you had with rsync, which is even more stateful than cp and even more likely to have odd interactions when used outside its comfort zone.
Ditto for tricks like: (cd $src; tar cf - .) | (cd $dst; tar xf -).
Pretty much nothing is going to work in an obvious way in a regime like this. That's sort of the point of the article.
Or maybe not. He mentions rsnapshot in the article, which uses rsync under the hood. This implies rsync would have a very good chance of handling a large number of hardlinks... since it created them in the first place.
That doesn't follow. If backups are for multiple machines to a big file server, the backup machine will have a much larger set of files than those that come from an individual machine. Further, each backup "image" compares the directory for the previous backup to the current live system. Generally it looks something like this:
1. Initial backup or "full backup" - copy the full targeted filesystem to the time indexed directory of the backup machine.
2. Sequential backups:
a. On the backup machine, create a directory for the new time and mirror the directory structure of the previous time within it.
b. Hard link the files in the new structure to those in the previous backup (which may be links themselves, back to the last full backup).
c. rsync the files to the new backup directory. Anything that needs to be transferred results in rsync transferring the file to a new location, then moving it into the proper place. This unlinks the filename from the previous version and replaces it with the full version.
So yeah, the result of this system over a few machines and a long-timeframe backup system is way more links on the backup machine than any iteration of the backup will ever actually use.
But rsnapshot created that large number of hardlinks incrementally. Without trying it, there's no indication that rsync handles this any better; in fact, with default settings rsync doesn't preserve hardlinks at all, precisely because it has to build a quite expensive data structure for them.
I'm surprised that cp even worked. I wouldn't have had the balls to just use cp as it could just have failed.
My strategy would probably have been to cp or rsync over the latest backup (so that new incremental backups could immediately be created again) and then incrementally rsync the previous backups with hardlink detection.
> The number of hard drives flashing red is not the same as the number of hard drives with bad blocks.
This is the real take-away. Monitor your drives. At the very least enable SMART, and also regularly run a read over the full underlying drive (SMART won't see and log blocks that are on their way out and need retries for successful reads unless you actually try to read those blocks).
That won't completely make you safe, but it'll greatly reduce the risk of other drives failing during a rebuild by increasing the chance you get advanced warning that problems are building up.
Glad someone noticed it (I'm the OP). Reading the drives systematically is called "Patrol Read" and is often enabled by default, but you can tweak the parameters.
The later replies regarding the size of the data structures cp is using are also worth reading. This is a case where pushing the command farther can make you think harder about the computations being done.
I doubt it, since you're using a pipe. It's not like you're actually storing the whole archive anywhere.
Also, and just to be clear, I did not invent this idea of using tar for this. I remember having read it in a Unix manual when I was learning about this OS.
And even if using tar is not a good idea, it should certainly be considered and as such I don't understand why OP doesn't even mention it in his post.
The pipe is not the issue; you actually are storing the full paths of all files in the archive somewhere (namely in RAM) so that you're able to decide whether a given file you're looking at is an alternative hardlink for a file you have already sent.
Yes, using tar for copying is not new and has its use cases, but this is not one of them.
The filesystem keeps track of the number of links for each file. So tar does not have to memorize the inode for every file, only for those that have more than one link. So unless a large number of files have more than one hard link, this is probably not a problem.
We will never know though, considering the author did not try it. Neither did he explain why he did not, despite the fact that it is the recommended Unix procedure.
From the original post: "We use rsnapshot (..) so most of the files have a high link count."
> it is the recommended Unix procedure
I don't think it is "recommended procedure", not today. At least on GNU/Linux, cp is just fine. It may have been different two decades ago for some reason: I remember a graphical file manager (called "dfm") that used a piped tar back in the late '90s [1]. Installing open source utilities was pretty common back then to get better programs onto closed-source Unixes, and tar was probably high on the list to install whereas cp less so, so tar might in fact have been better than the preinstalled cp. I haven't seen anyone in the public technical (hence primarily open source) community advocating it even for portability reasons, so I can only think of it as "perhaps historically relevant, if at all".
Your point is valid, though, that one could have tested it here. See rbh42's answer on this point. Although I'm not sure he's right about both tar instances requiring a hash table: if tar transmits the hard link as a reference from a previously transmitted path to the new one, then the receiving side has no need for the table. And my quick look at the transmitted data suggests it does contain the source path for the link. The question then becomes whether tar has a better-optimized hash table (not very likely), or uses a better algorithm (see my post about "megacopy" for an attempt at the latter).
[1] It actually had a problem: it didn't put "--" between the options and the path, and promptly failed when I moved an item with a file name starting with a minus: tar interpreted it as an option, and in addition the file manager didn't properly check the exit codes and simply unlinked the source even though it wasn't copied. The tools before GNU may really have been wanting: its failure to use "--" may have been caused by some contemporary non-GNU tar not supporting it. Anyway, of course I moved on to another file manager. Those bits of mine were sacrificed to the god of superior recommended Unix procedures or something.
Maybe this is naive, but wouldn't it have made more sense to do a bunch of smaller cp commands? Like sweep through the directory structure and do one cp per directory? Or find some other way to limit the number of files copied per command?
No, because then it wouldn't have replicated the hardlink structure of the original tree. That was the goal, and also the bit that causes the high resource consumption.
A problem with cp (and rsync, tar, and Linux in general) is that there is read-ahead within single files, but no read-ahead for the next file in the directory. So it doesn't make full use of the available IOPS capacity.
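One way an application could add that itself (a sketch under the assumption that posix_fadvise with POSIX_FADV_WILLNEED is available, as it is on Linux): hint the kernel about the next file before finishing the current one.

    /* Sketch of cross-file read-ahead: ask the kernel to start faulting in
       the next file's pages while the current one is still being processed.
       Illustrative only, not what cp/rsync/tar actually do. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Ask the kernel to begin reading 'path' into the page cache. */
    static void prefetch(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return;
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        close(fd);      /* readahead already queued; cached pages remain */
    }

    int main(int argc, char **argv)
    {
        /* usage: ./prefetch-demo file1 file2 ...
           prefetch file i+1 while "processing" file i */
        for (int i = 1; i < argc; i++) {
            if (i + 1 < argc)
                prefetch(argv[i + 1]);
            printf("processing %s\n", argv[i]);  /* the real copy loop would go here */
        }
        return 0;
    }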
I'm guessing he means that RAID is designed as a technology to increase the availability of your data and the person in this story is using it as a backup solution. Google 'RAID backup' to see why it's such a bad idea to use it for that purpose.
He specifically used cp because he was worried about data loss from using something like dd. But if he'd been doing regular backups to tape, he could have just restored the backup from tape with no fuss or problems. It's only because he was using tools that aren't suited to what he was trying to do that he ran into problems.
Thanks for the link! The article says that, due to the read error rate and the size of today's disks, RAID 5 and RAID 6 have (kind of) lost their purpose.
XFS really shines when you're dealing with large volumes, especially with RAID. There are certain features, like striped allocation, that mean you end up getting faster performance, because the filesystem has a better idea of how blocks are laid out on the disks.
I personally still consider XFS a very mature and reliable filesystem, both in terms of utility programs and the kernel implementation. If I remember correctly, it was ported to Linux from SGI's IRIX, where it had been used for decades. It also was the default fs for RedHat/CentOS for a long time, so it might still have stuck at many shops.
Here's my anecdotal data point, on which I base my personal belief:
From about ten to six years ago, when I was doing sysadmin work at a university, building storage systems from commodity parts for experimental bulk data, we first had a load of unreliable early RAID/SATA(?) adapters, and those made ext3 and reiserfs (I think...) oops the kernel when the on-disk structure went bad, whereas XFS just put an "XFS: remounted FS readonly due to errors" in the kernel log. That experience made XFS my default filesystem, up until recently when I started switching to btrfs. (Of course, we fixed the hardware errors too... :-) )
Also, from that time, I got to use xfsdump/xfsrestore for backups and for storing filesystem images, and they never once failed on me.
As a blithe, new-Linux user (3.5 years), I was bumfuzzled when I saw RHEL/CentOS 7 switched from ext4 to XFS, figuring it to be some young upstart stealing the crown from the king. Then I did some Googling and figured out that XFS is as old as ext2! I'm looking forward to discovering how tools like xfs* can make my life easier.
ext2/3/4 have terrible large-directory performance. If you create 100,000 files in a directory in an ext2/3/4 filesystem, enumerating said files (ls, readdir, etc.) becomes unbearably slow -- and the slowness persists even after you delete all but one of the files in the directory.
The lure of "ohhh shiny!" is prevalent in tech. I love to use bleeding edge stuff, play with new toys etc, but that is purely for home and personal stuff. Never for work.
I want my work life to be utterly boring. Nothing breaking. Everything behaving exactly as expected.
I will happily trade a bit of performance and a lack of some shiny features for a system or component that works reliably. The less time I have to spend fighting stuff the more time I can spend tuning, automating or improving what I do have to look after.
XFS performs well. It's reliable, reasonably fast (they trade some speed for durability/reliability), and it just works.
About rsync: if you just use -a, it does not copy hard links correctly:
-a, --archive
[...] Note that -a does not preserve hardlinks, because
finding multiply-linked files is expensive. You must
separately specify -H.
If you do specify -H, rsync does keep track of hard links, but presumably at the cost of keeping a data structure similar to cp:
-H, --hard-links
This tells rsync to look for hard-linked files in the source and
link together the corresponding files on the destination. Without
this option, hard-linked files in the source are treated as though
they were separate files. [...]
Of course, rsync could be more efficient at keeping track of the hard links than cp, but there's no reason to believe a priori that it would be.
Darn, thanks! I've been using rsync for incremental backups (themselves hard-link-based) for years, but neglecting to do this. That is to say, snapshot N+1 shares structure with snapshot N to save space via hard links created by --link-dest. But files may still end up replicated, because a missing -H is orthogonal to --link-dest.
You're right to recommend a tarpipe. I've had to copy several very large BackupPC storage pools in the past, and a tarpipe is the most reliable way to do it. (The only downside to BackupPC IMO...)
For future reference for other folks, the command would look something like this:
cd /old-directory && tar czvflpS - . | tar -C /new-directory -xzvf -
Tarpipes are especially neat because they can work well over ssh (make sure you have ssh configured for passwordless login, any prompt at all will bone the tarpipe):
cd /old-directory && tar czvflpS - . | ssh -i /path/to/private-key user@host "tar -C /new-directory -xzvf -"
...but tarpipe-over-ssh is not very fast. I have a note that says, "36 hours for 245G over a reasonable network" (probably 100Mb).
Disk-to-disk SATA or SAS without ssh in between would be significantly faster.
Be cautious about using tar pipes with compression. I've had experiences where the tar pipe will see the tarball being sent over and truncate the inner tarball. The compression in SSH tends to be good enough. My other suggestion for SSH is to change the algorithm used; rc4, while insecure for most things, is normally good enough, and if not, blowfish tends to be a bit faster than AES.
Also, re: bonage, I agree that it "shouldn't", but it definitely did. From my sysadmin notes file:
> The tar operation kicks off before ssh engages; having ssh ask for a password seems to intermittently cause problems with the tar headers on the receiving end. (It _shouldn't_, but it seems to.)
On moderately fast networks, instead of -z, which uses a single-threaded gzip implementation, parallel gzip may still give big improvements: http://zlib.net/pigz/
No giant bookkeeping datastructures that end up with the process thrashing in swap, because tar is designed to work with so many files that you don't want to keep the metadata in memory.
That's only true if the bookkeeping were irrelevant and unnecessary in the first place. I imagine the scenario where your copy gets stopped in the middle is worse when you use the tar method.
It's not as if cp gives you anything more useful in that case. The bookkeeping that caused the problem was a hashtable of files that have been copied, necessary to handle hard links. It would be interesting to see how tar handles that; if it can't do any better, then the root comment and mine above are wrong.
tarpipes can handle gnarly piles of hardlinked files without sweating. BackupPC is free software for network backups that builds a massively hardlinked file pool; tarpipes are the only thing that can reliably copy BackupPC pools from one location to another.
Not so. Both the sending and the receiving tar processes will need a data structure keeping track of which inodes they've already processed. They can skip inodes with a link count of 1, but if all inodes have multiple links (as in this case), the overhead will be twice that of a single cp process.
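The "skip inodes with a link count of 1" check is cheap, since the link count comes back with the stat data anyway. A tiny illustrative sketch:

    /* Illustrative only: a file needs an entry in the hardlink-tracking
       table only if it is a regular file with more than one link. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int needs_tracking(const char *path)
    {
        struct stat st;
        if (lstat(path, &st) != 0)
            return 0;
        return S_ISREG(st.st_mode) && st.st_nlink > 1;
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%s: %s\n", argv[i],
                   needs_tracking(argv[i]) ? "track (multiple links)"
                                           : "no bookkeeping needed");
        return 0;
    }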
Rsync has similar issues when you are copying hard links. It has to keep a table of all the inodes/filename pairs it sees so it can detect hard links and relink them on the destination. This can be big if you have a ton of hard links (say in a backup situation).
Rsync at least acts reasonably idempotent, so you can just run it again if it gets interrupted, which is usually why I use it for large copies.
I don't remember off the top of my head if tar handles hard links—it may be inappropriate for this usage.
If you are often restarting rsync jobs, or they fail over low-speed WAN connections,
    --partial    keep partially transferred files
is your friend, especially if you are trying to sync directories with multi-GB files. Also, if your media is bad or there is a failing piece of network gear along the path,
    --bwlimit=KBPS    limit I/O bandwidth; KBytes per second
can keep things running smoothly. It even works against disks; I have used it where an array would die if one attempted to read from it too quickly.
Not much of a trick, the first tar serializes the entire directory structure including all the metadata into a stream of bytes, from which the second tar reassembles it.
Basically the same as tar's standard usage, except you don't store that stream as a tar file/image anywhere.
Typically when copying from one machine to another, I've had the best results by far with tar | tar, but I'm not sure how well that would fare with tons of hard links like this. I think it's possible, but the default behavior would just gobble up umpteen copies of the same file.
This. I guess anyone who began using UNIX systems back in the SunOS days or earlier would never think of using cp for this job. Well, mostly because the vendor-provided cp usually sucked badly (no recursive copy), but also because tar has always been such a great, reliable tool.
The only disadvantage of tar is its peculiar syntax, but that's something you get used to.
OK, fair point. cpio doesn't have global data structures that grow with the size of the operation, which cp does have. rsync prior to 3.0 also had to scan the entire job before it would begin the transfer, which can take forever. cpio operates on a stream, in the form of find | cpio, so cpio only sees one file from the stream at a time. find, at least GNU find, is also written such that its resources don't vary with the amount of output.
rsync is a bit of a kitchen sink, and cp is too simple. cpio is just the thing, and was designed for this exact use case.
> I browsed the net for other peoples' experience with copying many files and quickly decided that cp would do the job nicely.
After 20 years you no longer google how to copy files.
Edit: Reading on he talks about strace and even reading cp's source code which makes it even weirder that he had to google how to do this...
Edit2: Comments! Took only ten downvotes before someone bothered to explain what I was doing wrong, but now there are three almost simultaneously. I guess those make a few good points. I'd still think cp ought to handle just about anything especially given its ubiquitousness and age, but I see the point.
And to clarify: I'm not saying the author is stupid or anything. It's just weird to me that someone with that much experience would google something which on the surface sounds so trivial, even at 40TB.
Because the man is wise. He also didn't kill a job that appeared to be hung, he started reading the code to figure out why and determined that it would in fact, complete.
Because it turns out cp _barely_ managed to do the job, and he was almost wrong despite googling. (It did the job, I don't know about 'nicely')
Just assuming that, oh, cp copies files, of course it can copy a few terabytes no problem... is something that you might do, but not something someone with 20 years of experience with Unix would do.
I'm not yet 20 years into this game, but the further I go, the more I'm finding I need to be willing to swallow my pride and check in with the current state of best practices.
Whenever you get to extreme cases, there's always going to be something subtle. I'm sure he didn't google "how to copy files", but looked for people who'd worked with many terabytes of data. Reminds me of a nice story by Reginald Braithwaite: http://raganwald.com/2013/03/26/the-interview.html
Because the typical use case of copying a few files does not really compare to having to copy 43 terabytes of critical data that may or may not have corruption. It's nothing but wise to double-check.
https://github.com/hpc/dcp
We got an IEEE paper out of it:
http://conferences.computer.org/sc/2012/papers/1000a015.pdf