Hacker News new | past | comments | ask | show | jobs | submit login
My experience with using cp to copy 432 million files (39 TB) (gnu.org)
826 points by nazri1 on Sept 11, 2014 | hide | past | favorite | 263 comments

I wrote a little copy program at my last job to copy files in a reasonable time frame on 5PB to 55PB filesystems.


We got an IEEE paper out of it:


A few people are continuing the concept to other tools -- that should be available at http://fileutils.io/ relatively soon.

We also had another tool written on top of https://github.com/hpc/libcircle that would gather metadata on a few hundred-million files in a few hours (we had to limit the speed so it wouldn't take down the filesystem). For a slimmed down version of that tool, take a look at https://github.com/hpc/libdftw

And it's interesting and useful for scientific computing where you already have an MPI environment and distributed/parallel filesystems. However, it's not really applicable to this workload, as the paper itself says.

There is a provision in most file systems to use links (symlinks, hardlinks, etc.). Links can cause cycles in the file tree, which would result in a traversal algorithm going into an infinite loop. To prevent this from happening, we ignore links in the file tree during traversal. We note that the algorithms we propose in the paper will duplicate effort proportional to the number of hardlinks. However, in real world production systems, such as in LANL (and others), for simplicity, the parallel filesystems are generally not POSIX compliant, that is, they do not use hard links, inodes, and symlinks. So, our assumption holds.

The reason this cp took such large amounts of time was the desire to preserve hardlinks and the resize of the hashtable used to track the device and inode of the source and destination files.

Sure, but if you read that article you walk away with a sense of thats a lot of files to copy. And the GP built a tool for jobs 2-3 orders of magnitude larger?! Clearly there are tradeoffs forced on you at that size...

Author of the paper here. The file operations are distributed strictly without links, otherwise we could make no guarantees that work wouldn't be duplicated, or even that the algorithm would terminate. We were lucky in that because the parallel file system itself wasn't POSIX, so we didn't have to make our tools POSIX either.

man! and here I am feeling like a champ conquering NTFS's long file name limitations :/

Is your conquering public? We have issues occasionally, and a toolkit would be nice :)

couple nitpicks:

Check return value of malloc

You don't need \ when breaking function parameters

NB, that PDF seems to have a number of formatting glitches, e.g., the first sentence: "The amount of scienti c data". Numerous others as well, under both xpdf and evince.

There is an fi ligature between the i and the c, perhaps those PDF renderers don't support them? Could be some sort of font loading issue.

Perhaps. That's only one of many similar issues.

How about this for a better cp strategy to deal with hardlinks:

1. Calculate the hash of /sourcedir/some/path/to/file

2. Copy the file to /tempdir/$hash if it doesn't exist yet

3. Hard-link /destdir/some/path/to/file to /tempdir/$hash

4. Repeat until you run out of source files

5. Recursively delete /tempdir/

This should give you a faithful copy with all the hard-links with constant RAM at the cost of CPU to run all the hashing. If you're smart about doing steps 1 and 2 together it shouldn't require any additional I/O (ignoring the extra file metadata).

Edit: actually this won't recreate the same hardlink structure, it will deduplicate any identical files, which may not be what you want. Replacing the hashing with looking up the inode with stat() would actually do the right thing. And that would basically be an on-disk implementation of the hash table cp is setting up in memory.

Let's assume we change your algorithm to use inode+device as key instead of the content hash, to make it behave as it 'should'.

> constant RAM

Well, definitely not constant space. The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.

You're trading a hash table implemented in user-space for a hash table or b-tree or similar implemented in kernel-space.

The advantage of the user-space solution is that you can optimize it for the job at hand. A trivial implementation would use a binary-encoded inode+device number pair (probably 2*8 bytes) and a pointer to the first path encountered (8 bytes), where the path is only accessed in matching cases. In fact, 17e9/432e6=39.35 B/file, pretty spot on. Seems like there's some room for optimization.

The kernel would need to store the file name string in addition to its hash or b-tree entry. This is unlikely to use less memory. Since it will presumably be randomly accessed, reading back from disk will not be more efficient than swapping either.

Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1. Since if most files only have a link count of 1, then cp has no reason to hold them in the hash table, as they can't be encountered again. Unless it does so anyway for the case where hardlinks are being created while cp is running. If so, it should probably have a command line flag to turn this safety feature off.

In fact, 17e9/432e6=39.35 B/file, pretty spot on. Seems like there's some room for optimization

It does not suggest there is a possible optimization. cp needs to store the pointer and the path itself in memory, so it is more than 8 bytes. ~39 bytes/file means the average path length was 15 bytes (8-byte source device + 8-byte source inode + 8-byte pointer + ~15-byte path = ~39 bytes). I guess one theoretical optimization would be to add to Linux a system call to open a file by device+inode number. So the pointer+path would be replaced by a fixed 8-byte dest device + 8-byte dest inode (so 32 bytes/file). Or since the destination is only one filesystem in this case, you would just need 8-byte source device + 8-byte source inode + 8-byte dest inode (24 bytes/file). In this latter case, the 17 GB hash table would have shrunk to 10 GB, which could have help completely avoid the swap in the OP's case.

Edit: correct, I do not account for unused space in the hash table, for simplicity. So read all my numbers assuming a ±10-30% range. However it seems that the OP who quoted 432e6 files meant 432e6 file names, not inodes. So if the average inode had, say, 4.3 hardlinks (IOW 4.3 names), it means the 17GB hashtable only had 100e6 files in it, so 170 bytes/file, so 170-8-8-8 = 146-byte pathnames on average (±10-30%), with sounds more reasonable.

Yes, you're right, I made a jump in my thinking; I wanted to make the point that you could exclude the storage for the paths from the working set size, if there is only rarely a match. Problem is we don't have a number for the working set size (except that it must be between 10 GB and 17 GB), and my brain just used the 17 GB.

But your calculation is missing the empty space in the hash table. When using open addressing (with the table being a flat array), there must be something like 10-30% empty slots (and in any case, twice the space during rehashing). That would leave just about 39.35 - 8 * 3 / 0.8 = 9.35 bytes per path. So, likely cp already optimizes by using per-device hash tables at least, then 39.35 - 8 * 2 / 0.8 = 19.35 bytes will remain for the path. The pointer could be shortened to 5 or perhaps even 4 bytes. Paths could be stored as an 'inverted' tree (linked lists of segments while sharing equal tails), which should be very natural since cp gets them piecewise from readdir anyway, so the 19.35 bytes just need to describe a parent path pointer (5 bytes?) plus some shared storage and the space for the filename.

Probably there are better hash table algorithms for low memory use, too. (Using buckets in arrays (instead of linked lists as is often done), and resize those on every append, trading CPU for space?) I wonder if there's a way to avoid temporary doubling of the hash table storage during resizing.

Edit: I've checked the GNU coreutils-8.23 sources, oddly cp is not using any of those optimizations; it uses both dev and ino as a key, concatenates every filename to its parent path and stores that (and a normal pointer in the table), and uses linked lists for the buckets in its hash tables.

Edit 2: a different solution that should really lower RAM requirements a lot would be to separate the walking of the source tree from the copying. When walking the source tree, just create a list of dev/ino/path; then sort this list according to dev/ino (there are optimized sort algorithms that keep RAM usage small, I remember them being in Knuth's TAOCP); then walk the list, while linking all the paths with the same dev/ino which you will now encounter in a group.

> Edit 2: a different solution

See "megacopy" in my top level comment.

>The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.

Yeah I mentioned it was just an on-disk implementation of cp's in-memory hash table.

>Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1

Almost certainly. It was a backup machine with rsnapshot. Most files are just hardlinks to the same file from the previous backup. For his specific case it would probably be easier to code up a domain-specific solution. He probably knows the hard-linked files are all in a regular form like /basepath/YYYY-MM-DD/path/to/file. So he could just iterate /basepath/*/path/to/file and do the correct copy/hardlinks on the destination.

> The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.

For further reference here's a modified setup that won't use any resources that wouldn't be used anyway:

1. stat() /sourcedir/some/path/to/file and find out the hard-link count and the inode number

2. if the hardlink count is 1 copy to destination and this file is done (go to 7)

3. if the hardlink count is >1 copy /srcdir/path/to/file to /tempdir/$inodenumber if it doesn't exist yet

4. stat() /tempdir/$inodenumber and find out the destination hardlink count

5. if the hardlink count is the same as the source hardlink count move /tempdir/$inodenumber to destination (we're creating the last hard-link for this file)

6. if the hardlink count is less than source count hard-link /tempdir/$inodenumber to destination

7. Repeat until you run out of source files

8. Recursively remove /tempdir/ (if there are any files left there it means there were hardlinks to files in /sourcedir/ from other parts of the filesystem)

By doing this we're using the destination filesystem to keep the map for only for the files with hard-link > 1 and doing it without using any more resources than will be used by the end of the copy as long as there aren't any hardlinks to the same files outside /sourcedir. If there are we'll be using an additional hard-link for each of them by the end of the copy.

The reason I prefer this to hand-rolling a solution is that it reuses well tested code (hard-links in the filesystem) and more importantly doesn't introduce any resource limits that aren't native to cp anyway, such such as having enough memory/swap to hold the map. Basically the contract becomes "if the copy can fit in the destination filesystem, cp will complete".

> By doing this we're using the destination filesystem to keep the map for only for the files with hard-link > 1

That's the assumption anyway (cp already works that way); we're talking about this only because the OP had hard links everywhere.

> /tempdir/$inodenumber

That's what I was talking about in my former post. This does use resources: the folder "tempdir" takes up additional space (it is in addition to the file's content, inode and the other paths). It either needs a b-tree, hash-table or similar in kernel space (on disk, and cached in RAM). Every entry needs to store $inodenumber (which you should mean to be the inode number of the source file, not the target, because you need to map from the source inode, so you can't even argue that it is stored anyway already), plus the storage in the index (b-tree or whatever that makes up the directory listing), just like the user-space solution, but additionally at least also the length of the file name and the actual inode number of the target file.

You're not better off (almost surely worse), and you can't get around the fact that the basic algorithm, which is copying every file while walking through the source tree and checking it immediately, is what is at fault: it needs random access (every iteration needs a look up to a random spot), which is why RAM is required and any kind of disk access (be it swap, or just reloading directory index storage from the file system) is out.

That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now, BTW. It has a --tmp option to give a non-default location for temporary file storage (used for the linking table), you can point that to a place on the target file system or wherever else you like, which I think should answer your call for resource limitation. How well that algorithm (sorting (and linear append and read) instead of random access) works with regards to RAM usage, I still have to test.

> That's what I was talking about in my former post. This does use resources: the folder "tempdir" takes up additional space (it is in addition to the file's content, inode and the other paths).

See what I did again. I'm only creating /tempdir/$inodenumber if the hard-link count is >1 and once the destination hard-link count reaches the same count as the source it gets moved. So if there aren't any hard-links outside sourcedir this doesn't use more resources than you're going to use anyway at the end of the copy.

> That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now

If I'm reading it right it requires doing a sort of the full file list. This means you need to use enough space to list all the files and their inode number even if there are no hardlinks. Your disk usage is O(total files) whereas mine is O(inodes with hardlinks) and thus is always lower.

> I'm only creating /tempdir/$inodenumber if the hard-link count is >1

Which (again) is always, in the case of the OP.

> and once the destination hard-link count reaches the same count as the source it gets moved

Yes I've thought of that optimization; I haven't checked whether cp does it already, might well be.

Thing is the OP said he's got an nlinks of about 20. This means, if they are evenly spread out over the file system, that you see the last link when you're about 95% through. That won't help much.

> If I'm reading it right it requires doing a sort of the full file list.


> This means you need to use enough space to list all the files and their inode number even if there are no hardlinks.

My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted, instead copy them immediately or put them into a list that doesn't need sorting.

And, the whole point of using the sort approach is not to make the list or index smaller (that's simply not possible). The point is that it works with lower RAM requirements. Because sorting works well with serial access and little random access, using a list much bigger than RAM works well.

PS. I wonder how much smaller the table gets when you always have nlinks==2 (the best case for the "done with the inode" optimization). The size of the table will follow some curve over the duration of the program; I'm not sure what it is exactly, perhaps a sine half wave? I expect the max of the curve to be about half way of the max of the unoptimized variant.

Is it a pity to lose the possibility for that optimization by using my algorithm? Disk space is cheaper than RAM. Dropping most of the RAM usage and instead using twice as much in the form of disk still seems worthwhile.

The "done with the inode" optimization will work better if hard linked paths appear close to each other. So perhaps sometimes still useful. It should be possible to come up with a hybrid: go with the hash table until it reaches a certain size; then convert the hash table into the start of the list, and continue by appending to the list. I'm not sure this is worth the more complex programming, though. If it looks nice just to avoid creation of a temp file in the common case, then a less complex alternative may be to store the list in the heap until a certain size (only then dump to file), otherwise still use my algorithm. That way swapping will behave nicer, too, in case part of the cp heap needs to be swapped out because of other system activity.

PPS. note that we're assuming that all the hard links are within the source tree that's being copied. If the files also have a path outside the source path, then the optimization won't work as you never reach zero.

> My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted, instead copy them immediately or put them into a list that doesn't need sorting.

If you do that then your extra storage requirements are now O(paths to inodes with nlinks > 1) which is still strictly higher than O(inodes with nlinks>1). You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking, and doing an extra operation (the sort). I'd be very surprised if that was faster, and unless you have an on-disk sort it will still be RAM+swap limited instead of just disk-space limited.

> O(paths to inodes with nlinks > 1) .. O(inodes with nlinks>1)

Yes, that's right, I ignored that part in the calculation above.

By using the filename+parent pointer approach for path compression, that could be improved. (The current version of megacp also stores all necessary stat information (size, uid, gid, perms) in the tempfile to avoid a second stat call; this could be dropped if it makes matters worse.)

Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.

> You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking

Right now I'm doing exactly the same number of system calls as the cp approach needs. The second pass doesn't do readdir or, currently, even stat on the source file (that's why I saved the stat info above; which is more beneficial would need to be tested). It may be beneficial to sort the sorted-by-inode file back into sorted-by-path after the inode cluster recognition step to get original path locality during the copy phase (the included test/genfiles script already does that); I'll have to test that.

> unless you have an on-disk sort it will still be RAM+swap limited

Using an online algorithm is the whole point, I've been saying that from the start. And I remember that Knuth wrote about them so I conclude that they exist for sort, and I quite expect that GNU sort(1) is already sorting big files efficiently (limited testing seems to confirm that, files several times the size of RAM sort just fine in the low single digit minute count).

> I'd be very surprised if that was faster

You're ignoring how much slower disk seek is compared to serial reading/writing. A lot of similar optimizations are done in OSes, databases etc. for that reason. I guess you'll argue next that SSDs don't have such high seek times (and that everybody should be using them), and while those points are right SSDs are still a lot slower for random versus serial access, so whether the overhead of the sorting approach is warranted on SSDs remains to be tested. In any case, I very much expect it to be faster in the OP's scenario which was with spinning disks (note how it took him many days; I expect the sort for the same data to take few hours or quite likely less than an hour given his RAID array).

> Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.

>Using an online algorithm is the whole point, I've been saying that from the start. And I remember that Knuth wrote about them so I conclude that they exist for sort, and I quite expect that GNU sort(1) is already sorting big files efficiently (limited testing seems to confirm that, files several times the size of RAM sort just fine in the low single digit minute count).

The cp algorithm already solves this just fine and my proposed reimplementation on-disk solves the RAM usage problem. You're trying to make a strictly worse algorithm work. It both requires more storage and is heavier than it needs to be (full sort vs just a map). What do you expect to gain from it?

If you cp your data onto a Plan9 machine, what results is pretty much exactly the process you've outlined.

Plan9's default filesystem is made up of two parts: Fossil, and Venti.

- Fossil is a content-addressable on-disk object store. Picture a disk "formatted as" an S3 bucket, where the keys are strictly the SHAsums of the values.

- Venti is a persistent graph database that holds what would today be called "inode metadata." It presents itself as a regular hierarchical filesystem. The "content" property of an inode simply holds a symbolic path, usually to an object in a mounted Fossil "bucket."

When you write to Venti, it writes the object to its configured Fossil bucket, then creates an inode pointing to that key in that bucket. If the key already existed in Fossil, though, Fossil just returns the write as successful immediately, and Venti gets on with creating the inode.

Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. (Microsoft attempted this with WinFS, but mysteriously failed.) Is it just inertia? Why are we still creating new filesystems (e.g. btrfs) that don't follow this design?

For my understanding: what happens if you open a file, change one byte and close it again? Since the SHAsum of the contents has changed, is the entire file now copied?

Only the block where the byte resides. A block is typically 512b to 4096b. (So it's not that unlike a normal drive, where you also have to rewrite an entire sector even if just a byte changed)

Venti doesn't know about files, it only knows about blocks of data. It's a key/value store where the key is sha-1 and the value is a block(blob) of data.

The filesystem running on top of Venti will ask Venti to store the new block where you changed the byte and the filesystem will update the metadata that assembles all blocks to a file.

1. Plan9 has no hard links so if you copy unix dir. tree to a plan9 machine you'd lose all the hard link info. 2. Venti doesn't use fossil or inodes. Venti is just content addressable storage system; not a fileserver/system. 3. Fossil is a fileserver.

Ah, I was going off an explanation I was given myself at one point--but this makes more sense, thanks.

> Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. Is it just inertia?

The penalty for doing content addressed filesystems is of course the CPU usage. btrfs probably has most of the benefits without the CPU cost with its copy-on-write semantics.

Note that what you describe (and my initial process) is a different semantic than hard-links. What you get is shared storage but if you write to one of the files only that one gets changed. Whereas with hardlinks both files change.

In effect, hard links (of mutable files) are a declaration that certain files have the same "identity." You can't get this with plain Venti-on-Fossil, but it's a problem with Fossil (objects are immutable), not with Venti.

Venti-on-Venti-on-Fossil would work, though, since Venti just creates imaginary files that inherit their IO semantics from their underlying store, and this should apply recursively:

1. create two nodes A and B in Venti[1] that refer to one node C in Venti[2], which refers to object[x] with key x in Fossil.

2. Append to A in Venti[1], causing a write to C in Venti[2], causing a write to object[x] Fossil, creating object[y] with key y.

3. Fossil returns y to Venti[2]; Venti[2] updates C to point to object[y] and returns C to Venti[1]; Venti[1] sees that C is unchanged and does nothing.

Now A and B both effectively point to object[y].

(Note that you don't actually have to have two Venti servers for this! There's nothing stopping you from having Venti nodes that refer to other Venti nodes within the same projected filesystem--but since you're exposing these nodes to the user, your get the "dangers" of symbolic links, where e.g. moving them breaks the things that point to them. For IO operations they have the semantics of hard links, though, instead of needing to be special-cased by filesystem-operating syscalls.)

You seem to be confusing venti and fossil.

Can you explain further? I am not a plan9 expert, by any means, but I'm stuck at where GP made the confusion. Thanks!

He just swapped the names I think - Venti is the block store, Fossil is the file system layer.

> Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. Is it just inertia

They are. Pretty much all operating systems provides a block layer, on top of which filesystems are normally layered.

Though the block layer isn't content addressed, it's just indexed (store block number 55, read block number 9. etc.)

Content addressable systems trade CPU and memory with disk space. If you expect duplications to be low, you are usually better off with a background scrubber.

ZFS sits on top of zpools.

Using the /sourcedir inode number sounds right.

Be careful not to create too many files in /tempdir (create a hierarchy of subdirs). Filesystems slow way down when they get "too many" files in one dir.

But take a good look at a tarpipe before applying 'cp' to such a large set of files.

>Be careful not to create too many files in /tempdir (create a hierarchy of subdirs). Filesystems slow way down when they get "too many" files in one dir.

Yeah, you probably want something like /tempdir/srcfilesystem/NN/NN/NN/NN to handle the issue you mention and recursive copies that cross filesystem boundaries. Besides that tempdir will need to be on the same filesystem as destdir so the hard copy will work, and if destdir has been setup as a multiple filesystem target you'll need to keep a tempdir instance for each destination filesystem. Plenty of corner cases to think about.

This has the slight flaw that you create two copies instead of one for each file (double write IO, double number of inodes) if there aren't many hardlinks.

I don't think that's an acceptable perfomance penalty in most situations.

That won't happen. I'm suggesting using hard-links so you only write the file once to one inode, you just create two files pointing to that inode. You can even optimize that a bit and not even create the two files when the inode has a hard-link count of 1.

Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.

A wonderful trend I've noticed in Free/Open Source software lately is proudly claiming that a program is "Valgrind clean." It's a decent indication that the program won't doing anything silly with memory during normal use, like leak it. (There's also a notable upswing in the number of projects using static analyzers on their code and fixing legitimate problems that turn up, which is great, too!)

While you can certainly just let the OS reclaim all of your process's allocated memory at exit time, you're technically (though intentionally) leaking memory. When it becomes too hard to separate the intentional leaks from the unintentional leaks, I'd wager most programmers will just stop looking at the Valgrind reports. (I suppose you could wrap free() calls in "#ifdef DEBUG ... #endif" blocks and only run Valgrind on debug builds, but that seems ugly.)

A more elegant solution is to use an arena/region/zone allocator and place potentially large data structures (like cp's hard link/inode table) entirely in their own arenas. When the time comes to destroy one of these data structures, you can destroy its arena with a single function call instead of walking the data structure and free()ing it piece by piece.

Unfortunately, like a lot of useful plumbing, there isn't a standard API for arena allocators, so actually doing this in a cross-platform way is painful:

• Windows lets you create multiple heaps and allocate/free memory in them (HeapCreate(), HeapDestroy(), HeapAlloc(), HeapFree(), etc.).

• OS X and iOS come with a zone allocator (malloc_create_zone(), malloc_destroy_zone(), malloc_zone_malloc(), malloc_zone_free(), etc.).

• glibc doesn't have a user-facing way to create/destroy arenas (though it uses arenas internally), so you're stuck using a third-party allocator on Linux to get arena support.

• IRIX used to come with an arena allocator (acreate(), adelete(), amalloc(), afree(), etc.), so if you're still developing on an SGI Octane because you can't get enough of that sexy terminal font, you're good to go.

Adding some kind of arena-allocation library to both the build & runtime dependencies solely to keep valgrind happy, with no actual improvement in functionality or performance, doesn't seem like a great tradeoff on the software engineering front. I'd rather see work on improving the static analysis. For example if some memory is intended to be freed at program cleanup, Valgrind could have some way of being told, "this is intended to be freed at program cleanup". Inserting an explicit (and redundant) deallocation as the last line of the program just to make the static analyzer happy is a bit perverse.

(That is, assuming that you don't need portability to odd systems that don't actually free memory on process exit.)

I don't see why you assume arenas would be added "solely to keep valgrind happy". Arenas have better performance when allocating a high number of small chunks, because an arena can make better performance trade-offs for this use case than the general malloc allocator.

Yes, we need a standard arena call(s). As you said, glibc already uses them internally. From the malloc(7) man page:

In a multithreaded application in which threads simultaneously allocate and free memory, there could be contention for these mutexes. To scalably handle memory allocation in multithreaded applications, glibc creates additional memory allocation arenas if mutex contention is detected. Each arena is a large region of memory that is internally allocated by the system (using brk(2) or mmap(2)), and managed with its own mutexes.

The "#ifdef DEBUG..." option is pretty much exactly what they did:


The proper thing to do is include valgrind.h and use the macros to determine if it's running under valgrind. Alternatively to accommodate other tools, hide it behind an env var. We should make it as easy to test production binaries as possible.

The advantage of the #ifdef DEBUG method is that it's possible to ensure that all of your allocations are clean at the object level, not just at the arena level.

In the limit, you could make one arena for your entire program and then deallocate it on process exit (essentially what the OS gives you by default). Using smaller and smaller arenas lets you see that at a finer and finer granularity. In the other limit you're back to the per-object allocations.

Using an arena to avoid per-object leak checking is good if you're absolutely sure that there are no leaks in that block of code. You're essentially hiding any leaks within the arena from valgrind.

But that seems dangerous, because it makes it harder to use valgrind for what it's good at: finding leaks that you're unaware of, especially in code that you 'know' is good.

The advantage of the #ifdef DEBUG method is that it's possible to ensure that all of your allocations are clean at the object level, not just at the arena level.

One of the points of an arena allocator is to not have to be clean at object level because everything is intended to be freed anyway in the end. For example, one HTTP request or one copy operation fit perfectly for an arena allocator: the system is allowed to "leak" because it's known that the operation won't last that long within the lifetime of the process and all memory related to the operation won't be needed later. (Or cross-operation objects could be allocated from a different pool.)

This may be a little off topic, but I used to think RAID 5 and RAID 6 were the best RAID configs to use. It seemed to offer the best bang for buck. However, after seeing how long it took to rebuild an array after a drive failed (over 3 days), I'm much more hesitant to use those RAIDS. I much rather prefer RAID 1+0 even though the overall cost is nearly double that of RAID 5. It's much faster, and there is no rebuild process if the RAID controller is smart enough. You just swap failed drives, and the RAID controller automatically utilizes the back up drive and then mirrors onto the new drive. Just much faster and much less prone to multiple drive failures killing the entire RAID.

This can not be stressed strongly enough. There is never a case when RAID5 is the best choice, ever [1]. There are cases where RAID0 is mathematically proven more reliable than RAID5 [2]. RAID5 should never be used for anything where you value keeping your data. I am not exaggerating when I say that very often, your data is safer on a single hard drive than it is on a RAID5 array. Please let that sink in.

The problem is that once a drive fails, during the rebuild, if any of the surviving drives experience an unrecoverable read error (URE), the entire array will fail. On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%. Enterprise SAS drives are typically rated 1 URE in 10^15, so you improve your chances ten-fold. Still an avoidable risk.

RAID6 suffers from the same fundamental flaw as RAID5, but the probability of complete array failure is pushed back one level, making RAID6 with enterprise SAS drives possibly acceptable in some cases, for now (until hard drive capacities get larger).

I no longer use parity RAID. Always RAID10 [3]. If a customer insists on RAID5, I tell them they can hire someone else, and I am prepared to walk away.

I haven't even touched on the ridiculous cases where it takes RAID5 arrays weeks or months to rebuild, while an entire company limps inefficiently along. When productivity suffers company-wide, the decision makers wish they had paid the tiny price for a few extra disks to do RAID10.

In the article, he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives (Dell calls them "near-line SAS", just an enterprise SATA), there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.

[1] http://www.smbitjournal.com/2012/11/choosing-a-raid-level-by...

[2] http://www.smbitjournal.com/2012/05/when-no-redundancy-is-mo...

[3] http://www.smbitjournal.com/2012/11/one-big-raid-10-a-new-st...

Note that the 10^14 figure is only what the HDD mfgs publish, and it has been the same for something like a decade. It's a nice, safe, conservative figure that seems impressively high to IT Directors and accountants, and yet it's low enough that HDD mfgs can easily fall back on it as a reason why a drive failed to meet a customers expectations.

In reality you'll find that drives perform significantly better than that, arguably orders of magnitude better.

That said, I'm still not a fan of RAID5. Both rebuild speed and probability of batch failures of drives (if one drive fails, the probability of another drive failing is significantly higher if they came from the same batch), make it too risky a prospect for my comfort.

10^14 bit error rate is false or routine ZFS scrubs would produce documented read errors. I'm inclined to believe the entire math here is wrong as well.

What I don't understand about the URE is from [2]. If you have a 12TB raid 5 array and you need to rebuild. If 10^14 approaches an URE at around 12TB of data as the article says. What causes it to hit 12TB? Each disk has 10^14 which is 100TB. If you had 12TB from 4x3TB disks it should have alot to go through.

10^14 is bits. When you divide 10^14 by 8 you get 12.5 trillion bytes, or 12.5 TB.

If you have 4x 4TB drives in RAID5, and one fails, then in order to rebuild with a replacement drive, you have to read all data from all surviving drives (3 x 4TB = 12TB).

Here is an example from a manufacturer:


They call it "Non-recoverable Read Errors per Bits Read" and then list "1 sector per 10^15". So for every 10^15 bits read they expect 1 sector to be unreadable.

OK i see now but if it needs to read 12TB from all drives together then that's still far from 12.5TB per drive which is the limit. That's where I am confused.

Say we have a department of a company with 36 employees, and one pair of dice. We decide that if any person out of the entire department rolls a 12, then everyone in the department will be fired. The chance of rolling a 12 is 1/36. It doesn't matter if one person keeps rolling the dice, or if they take turns, the chances of everyone being fired are close to 100%.

The same is true for a disk array. Each read operation is an independent event (for the purpose of doing this math). The chance of one URE happening is 1/(10^14) for every bit read. It doesn't matter which disk it happens on. When it happens, the entire array is failed.

Also 12.5 TB is not a hard limit, just an average. The URE could happen on the very first read operation, or you might read 100 TB without a URE.

I think your calculation on failing an array rebuild is wrong. Can you show how you got those numbers?

Sure, there were two statements I made.

>On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%.

10^14 bits is 12.5 TB, so on average, the chance of 12TB being read without a single URE is very low, and the probability the array fails to rebuild is close to 100%. I was estimating 10^14 bits to be about 12TB, so the probability is actually 12/12.5 = 96% chance of failure.

>...he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives...there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.

A RAID6 with two failed drives is effectively the same situation as a RAID5 with one failed drive. In order to rebuild one failed drive, the RAID controller must read all data from every surviving drive to recreate the failed drive. In this case, there are 10x 4TB surviving drives, meaning 40TB of data must be read to rebuild. Because these drives are presumably enterprise quality, I am assuming they are rated to fail reading one sector for every 10^15 bits read (10^15 bits = 125 TB). So it's actually 40/125 = 32% chance of failure if you try to rebuild.

There is even a community for it: BAARF - http://www.miracleas.com/BAARF/

The main reason is not only the rebuild time (which indeed is horrible for a loaded system), but also the performance characteristics that can be terrible if you do a lot of writing.

We for example more than doubled the throughput of Cassandra by dropping RAID5. So the saving of 2 disks per server in the end required buying almost twice as many servers.

Wouldn't rsync of been a better and more reliable choice for this?

The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file. In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.

Add checksumming, and you probably would like to take a holiday whilst it copies the data :-)

if you want to copy everything and there's nothing at the target:

rsync --whole-file --ignore-times

that should turn off the metadata checks and the rsync block checksum algorithm entirely and transfer all of the bits at the source to the dest without any rsync CPU penalty.

also for this purpose it looks like -H is also required to preserve hard links which the man page notes:

"Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H."

be mildly interesting to see a speed test between rsync with these options and cp.

there are also utilities out there to "post-process" and re-hardlink everything that is identical, so that a fast copy without preserving hardlinks and then a slow de-duplication step would get you to the same endpoint, but at an obvious expense.

I use rsync to replicate 10TB+ volumes without any problems. It's also very fast to catch up on repeat runs, so you can ^C it without being too paranoid.

It sounds like OP had large hard-link counts, since he was using rsnapshot. It's likely that he simply wouldn't have had the space to copy everything over to the new storage without hard-links.

You cannot exaggerate what rsync can do... I use it to backup my almost full 120gb Kubuntu system disk daily, and it goes through 700k files in ~ 2-3 minutes. Oh, and it does it live, while I'm working.

To be fair: the case described here is two magnitudes larger in both directions than your use case.

And the critical aspect from a performance perspective was where the hash table became too large to fit into memory. Performance of all sorts goes pear-shaped when that happens.

rsync is pretty incredible, but, well, quantity has a quality all its own, and the scale involved here (plus the possible in-process disk failure) likely wasn't helping much.

Yep, my answer when cp or scp is not working well is always to break out rsync, even if I'm just copying files over to an empty location.

I've had good luck with the -H option, though it is slower than without the option. I have never copied a filesystem with nearly as many files as the OP with -H; the most I've done is probably a couple million, consisting of 20 or so hardlinked snapshots with probably 100,000 or 200,000 files each. rsync -H works fine for that kind of workload.

> In a relatively small FS tree is just fine, but when you start having GBs and GBs and tens of thousands of files, it becomes somewhat impractical.

I regularly rsync machines with millions of files. On a slow portable 2.5" 25MB/sec USB2 connection it's never taken me more than 1hr on completely cold caches to verify that no file needs to be copied. With caches being hot, it's a few minutes. And on faster drives it's faster still.

Unless you are doing something weird and force it to checksum, checksumming mostly kicks in on files that have actually changed and can be transferred efficiently in parts (i.e. network; NOT on local disks). In other cases, it's just a straight copy.

Have you actually used rsync in the setting you describe, or at least read its manual?

> The problem with rsync is that it would have to traverse all the FS tree and check every single file on both sides for the timestamp and the file.

- In this scenario, the receiving side is empty, so there is no need to check every single file.

- Since 3.0, rsync walks the dirtree while copying (unless you use some special options). So, in some sense, rsync is already "deep syncing inside the FS tree", as you put it.

rsync since version 3 has stopped doing the full tree traversal It's not in RHEL/CentOS 5 but is in 6, and is really trivial to compile by hand anyway. That's made life a lot easier for me in the past.

Some of the other features in rsync would seem to make it a more logical fit for this task anyway, e.g. it's resumable, it supports on-the-fly compression (though the value of that depends on the types of files being transferred), native logging etc etc.

WE use 10 rsyncs in parallel to copy .5pb in less than 3 days.

CPU is cheap.

Have done the same, found that rsync seems to not natively parallelize itself, so spread across 20 cores, it really screamed.

Damn, I could use some advice there. Right now I have a system which uses rsync to backup some files. It takes about 8 hours to go over 7+ million files (~3.5TB). I'd love to speed up this. I should mention that copy is done over curlftpfs :(.


aint' nobody got time fo dat!

I had a RAID 6 have a failed disk a couple weeks ago... 8 x 3TB drives... 18TB usable. Took 7 hours to rebuild.

Disks were connected via SATA 3 and was using mdadm

Assuming it is reading and writing almost simultaneously that is ~125Mbyte/sec on each physical drive, which isn't bad considering the latency of the read-from-three-or-four-then-checksum-then-write-to-the-other-one-or-two process if you if the array was was in use at the time.

There are a few ways to improve RAID rebuild time (see http://www.cyberciti.biz/tips/linux-raid-increase-resync-reb... for example) but be aware that they tend to negatively affect performance of other use of the array while the build is happening (that is, you are essentially telling the kernel to give more priority to the rebuild process than it normally would when there are other processes accessing the data too).

In my experience working with Debian, the rebuild is sensitive to disk usage. If you start using the disks, the rebuild will slow down intentionally so as to not choke the drives. This could explain why it was fast for you but not for the parent.

Edit: To clarify, I was using mdadm, not hardware raid.

Yeah, I question that rebuild. If the array is being used along with the rebuild, typically the rebuild process gets put to a lower priority. I've seen IBM enterprise RAID 5 storage arrays take forever to rebuild if they were the main storage for a high transaction database.

I you can tune the speed when the disks are in use; I usually bump it to around 60MB/s as long as there isn't a live database on the disk; the default is something insanely slow, and it impacts read/writes almost as much as the higher value.

How does the rebuild time of ZFS's raidz compare to RAID5/6?

My largest ZFS pool is currently ~64TB ( 3 X 10 3TB (raidz2) ) The pool has ranged from 85%-95% full (it's mostly at 85% now and used mostly for reads).

Resilvering one drive usually takes < 48 hours. Last time took ~32 hours.

Something else cool: When I was planning this out I wrote a little script to calculate the chance of failure. With a drive APF of 10% (which is pretty high), and a rebuild speed of 50MB/sec (very low compared to what I typically get) I have a 1% chance of losing the entire pool over a 46.6 year period. If I add 4 more raidz2 10X3TB VDEVS that would drop to 3.75 years.

What is APF? And why not use the typical URRE rate to calculate your stats?

I'm always mystified at how stupid our storage systems are. Even very expensive SAN solutions from EMC and the like area just... stupid. We've got loads of metrics on every drive, but figuring out that those things should be aggregated and subjected to statistical analysis just seems to have not been done yet.

What I really want is a "pasture" system - a place I can stick old drives of totally random sizes and performance characteristics and have the system figure out where to put the data in order to maintain a specific level of reliability. Preferrably backed by an online database that tracks the drive failure rate of every drive on every deployed system, noting patterns in 'bad batches' of certain models and the like. If one of my drives would have to beat 3-standard-deviations odds to survive for the next week, move the damn data to somewhere better. And if you've got 2 150GB drives and 1 300GB drive, then each block on those drives has a rating of 2.0 - adjusted for the age and metrics of the drive.

Oh well, maybe when I retire in 30 years storage systems will still be as stupid as they've remained for the past 30 years and I'll have another project to add to the pile I don't have time to work on now.

btrfs at least offers the flexibility to use a collection of mismatched drives and ensure that N copies of the data exist across separate devices. Once it's parity-based RAID modes are stable, it should be able to retain that flexibility while offering a redundant option for cases where N=2 wastes too much space. That combined with regular scrubbing should suffice for moderately sized arrays.

Treating drives differently based on their expected failure rates seems like it would only matter for very large arrays trying to keep the redundancy as low as possible.

I have a 24TB RAIDZ3 array. Losing a disk out of that takes about 12-18 hours to rebuild.

It is generally superior because it can skip blank blocks. RAID is byte-by-byte.

It really depends on the load and if your raidz has 1,2, or 3 spare drives. From what I've experienced, re-slivering a the new drive in a raidz seems to happen at about 30% of the maximum ZFS throughput under medium load. And even better than many RAID5/6 implementations - it only re-slivers enough to store your data and not just the entire drive.

If you want to burn even more disks, 6+0 or 5+0 can be a nice way to cut down on the worst car senerio of losing data.

in my opinion, raid 10 is by far the best if you have to support the operations of a production site in a real world environment. just keep spare controllers and drives handy. it's just a matter of swapping out failures.

three cost levers: capacity, speed, and downtime.

basically, at some point you will begin to realize the cheapest cost is the capacity itself, everything else is an order of magnitude more difficult to justify. downtime and slowness are simply unacceptable, whereas buying more stuff is perfectly acceptable and quite feasible solution in an enterprise environment.


I work with a number of medium sized businesses of which you can clearly tell who have had and who havent had major downtime due to these types of hardware failures.

Many people who have never experienced issues have no interest in even forming a basic DR plan, but the ones who have just start throwing money at you the moment you mention it.

Buying more stuff is WAY cheaper than slowness and downtime, and penny pinching on this count is going to cause so much additional heartache for what will in the end cost more anyway.

These are the types of stories I love. I just learned a boat load in 5 minutes.

Is there maybe an archive website dedicated to these kind of stories?

At one time there was; it was called the Internet. The archive still exists, but it's been made harder to browse through due to being jumbled up with javascript and cat gifs.

It's true. We should have never let the public on the Internet. It has been downhill since then.

False choice, isn't it? I mean, the complaint isn't that the public now has sites with massive javascript and related technologies. The complaint is that it has muscled out useful sites that did not use those technologies. And it should be heavily noted that the heavy muscles that have pushed out many of these sites is not necessarily "the public."

Kind of funny to say that on a text-only JS-free site that seems to be alive and well, linking to an article on an old-school mailing list archive site. :)

Oh, certainly. I just can resonate with the sentiment that these sites aren't the majority.

Even this site, honestly, is less than easy to deal with on a recurring basis. (Consider, hard to remember which was the top story three days ago at noon.) Specifically, sometimes I lose a story because I refresh and something plummeted off the page. Hard to have any idea how far to "scroll back" to see it.

"due to being jumbled with javascript..." - This made my day.

It's also an astute observation that is incredibly true! Most sites now incorporate JavaScript.

I remember when it was "new-fangled" (and DHTML anyone?) and everyone put those annoying cursor trackers that trailed blobs from where your cursor was.

Uh oh, is it nostalgia time already?

What I mainly remember is two things: suddenly everyone had to do client-side form checking to prevent people from typing dashes in a phone number field, and everyone tried to imitate Java without actually using Java. It seems like any way you could enhance a page by combining mouseover events, gifs and loading content dynamically you would throw into a page, even if it made the user experience a total mess. Which of course never happens today.

My favorite memory is helping to write some of the first "griefers" using JS. If you were unlucky enough to stumble onto our page you had a window literally jumping around your screen with pop-ups til the end of time. There was no way to exit the browser until you'd clicked it all and it'd never end. Just getting your mouse near the 'X' would make the window jump around again. Of course, that was a mild annoyance compared to loading fonts and images so big they'd consume all the available RAM and make the machine fall over from swapping in about 15 seconds.

I have those memories too, on Windows 95 or 98? I remember IE4 being released and it could do a lot more than IE3, plus it was more "forgiving" with sloppy HTML compared to Netscape.

I remember those myriads of mouseover buttons.

Raymond Chen's The Old New Thing comes to mind, http://blogs.msdn.com/b/oldnewthing/. Mostly windows though.

If not, there should be.

yarchive is pretty good. http://www.yarchive.net/comp/index.html

I would usually use the tarpipe mentioned already by others for this sort of thing (although I probably wouldn't do 432 million files in one shot):

  (cd $SOURCE && tar cf - .) | (mkdir -p $DEST && cd $DEST && tar xf -)
Another option which I just learned about through reading some links from this thread is pax (http://en.wikipedia.org/wiki/Pax_%28Unix%29), which can do it with just a single process:

  (mkdir -p $DEST && cd $SOURCE && pax -rw . $DEST)
Both will handle hard links fine, but pax may have some advantages in terms of resource usage when processing huge numbers of files and tons of hard links.

You know how tar handles hardlinks, right? By creating a giant hash table of every file.

I wasn't suggesting to blindly use a tar pipe for as many files as the OP was trying to address. My main point was that the tar pipe (locally or over ssh) is superior to cp/scp for the sorts of things that many of us do regularly (as discussed elsewhere in this thread), such as copying tens of thousands to low millions of mostly smallish files locally or remotely and wanting to preserve the structure of links and file metadata, and that pax looks like a simpler alternative to the tar pipe that accomplishes the same thing in a single command, although it does still use a hash table internally.

I wonder if using something other than a hash table would be worthwhile for cases like the OP though, or if there is a tool that has that option -- the memory overhead of empty buckets and resizing seem like serious downsides when you're trying to handle as many files as the OP.

How's that going to scale with memory? In-memory hash tables were the downfall of cp here.

It's going to scale just like you'd imagine it would. All the people saying "oh, tar was built for this" obviously haven't actually tried replicating the experiment using tar.

Pretty much as I'd suspected.

I've written a program that attempts to deal with the given situation gracefully: instead of using a hash table, it creates a temporary file with a list of inode/device/path entries, then sorts this according to inode/device, then uses the sorted list to perform the copying/hardlinking. The idea is that sorting should work well with much lower RAM requirements than the size of the file to be sorted (due to data locality, unless the random accesses with the hash, it will be able to work with big chunks, at least when done right (a bit hand-wavy, I know, this is called an "online algorithm" and I remember Knuth having written about those, haven't had the chance to recheck yet); the program is using the system sort command, which is hopefully implementing this well already).

The program stupidly calls "cp" right now for every individual file copy (not the hard linking), just to get the script done quickly, it's easy to replace that with something that saves the fork/exec overhead; even so, it might be faster than the swapping hash table if the swap is on a spinning disk. Also read the notes in the --help text. I.e. this is a work in progress as a basis to test the idea, it will be easy to round off the corners if there's interest.


PS. the idea of this is to make copying work well with the given situation on a single machine, unless the approach taken by the dcp program mentioned by fintler which seems to rely on a cluster of machines.

There may also be some more discussion about this on the mailing list: http://lists.gnu.org/archive/html/coreutils/2014-09/msg00013...

So it was all the files in one go, presumably with `cp -r`?

What about doing something with find/xargs/i-dunno to copy all the files, but break em into batches so you aren't asking cp to do it's bookkeeping for so many files in one process? Would that work better? Or worse in other ways?

This page may be useful:


The main issue is that there's no api to get the list of files hard linked together: the only way is to check all the existing files and compare inodes. If you're doing a plain copy over 2 fs, you cannot choose which number the target inode will be, so you need to keep a map between inode numbers, or between inodes and file names ("cp" does the later).

pedrocr's comment above suggests a good solution:

1. Copy each file from the source volume to a single directory (e.g. /tmp) on the target volume, named for the source volume inode number.

(edit: I suggest using a hierarchy of dirs to avoid the "too many dentry's" slowdown)

2. If the file has already been copied, it will already exist in /tmp - looking up the inode is a vanilla directory lookup

3. Create a hard link from /tmp to the actual path of the file

4. When all the files have been created on the target volume, delete the inode numbers in /tmp

It should work much better provided you can easily divide it into batches where you know they don't have any shared hardlinked files. Otherwise you will duplicate those files wasting some space. That may be fine as well though.

The idea that you would need to do this seems fundamentally broken to me.

How do you keep hard links?

Unix could really use a way to get all the paths that point to a given inode. These days that shouldn't really cost all that much and this issue comes up a lot in copying/sync situations. Here's the git-annex bug report about this:


Seems like this is actually a feature in VxFS:

https://sort.symantec.com/public/documents/sfha/6.0/linux/pr... http://sfdoccentral.symantec.com/sf/5.0/aix/manpages/vxfs/vx...

I wonder if any modern filesystems (xfs/ext4/btrfs/zfs) can actually enable tracking this.

Wow, it's not every day I hear about a filesystem feature that Windows has and Linux doesn't. (On a recent windows system: fsutil hardlink list <path> -- you can try any random exe or DLL in system32 for an example of a hard link.)

I forget what the api for that looks like if I ever knew. Might be private.

I am surprised, usually Linux is way ahead of Windows on shiny filesystem stuff.

Linux just has more filesystems, and sadly a lot of them have various flaws. I'm surprised when people are surprised that Linux isn't some completely superior technical marvel. BSD and Unix systems have been more advanced for decades..

Everyone on Linux still uses tar for god's sake, even though zip can use the same compression algorithms people use on tarballs, and zip actually stores an index of its files rather than 'cat'ing each record on top of the next like an append-only tape archive. (Obviously there are better formats than 'zip' for any platform, but it's just strange that nobody has moved away from tar)

tar is good enough for many uses, so people did not move on.

And it doesn't help that tar.gz / tar.bz2 compresses way better than zip in most cases (thanks to using a single compression context, rather than a new one for each file; and also compressing the filenames in the same context), and that it carries ownership and permission information with it - whereas zip doesn't.

The HVSC project, who try to collect every single piece of music ever created on a Commodore C64, distribute their archive as a zip-within-a-zip. The common music file is 1k-4k, goes down to ~500-1000 bytes zipped; The subdirectory+filename are often 100 bytes with a lot of redundancy that zip doesn't use, so they re-zip. Had they used .tar.gz or .tar.bz2, the second stage would not be needed.

Is that really a feature the file system provides? Or does that command simply walk the MFT and finds all hardlinks to the same file record?

As far as I know NTFS does not store what hard links point to a file.

I don't know. I do know there are tens of thousands of hardlinked files under system32, and the command I listed above can enumerate links pretty quickly.

I found an issue in cp that caused 350% extra mem usage for the original bug reporter, which fixing would have kept his working set at least within RAM.


> Wanting the buffers to be flushed so that I had a complete logfile, I gave cp more than a day to finish disassembling its hash table, before giving up and killing the process....Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.

Does anyone know what the 'tear down' part is about? If it's about erasing the hashtable from memory, what takes so long? I would expect that to be very fast: you don't have to write zeros to it all, you just tell your GC or memory manager to mark it as free.

Looking at the code, it looks like deallocating a hash table requires traversing the entire table, because there is malloc()'d memory associated with each hash entry, so each entry has to be visited and free()'d. From hash_free() in coreutils hash.c:

    for (bucket = table->bucket; bucket < table->bucket_limit; bucket++)
        for (cursor = bucket->next; cursor; cursor = next)
            next = cursor->next;
            free (cursor);
Whereas if you just don't bother to deallocate the table before the process exits, the OS will reclaim the whole memory block without having to walk a giant data structure. That's a fairly common situation in C programs that do explicit memory management of complex data structures in the traditional malloc()/free() style. Giant linked lists and graph structures are another common culprit, where you have to pointer-chase all over the place to free() them if you allocated them in the traditional way (vs. packing them into an array or using a userspace custom allocator for the bookkeeping).

... and both malloc() and free() need to maintain (update) their own data structures, which in many implementations (don't know about the current glibc one) are stored interspersed with the data. Walking a hash table and calling free() on each and every item, even if it doesn't seem so, actually dirties a large number of memory pages (which could be truly randomly scattered because it is a hash table), which then need to be re-written into the swap space on swap-out.

Isn't it a terribly inefficient design to malloc hash entries individually? It would make more sense to use a pool / arena for that.

You could, but it's fairly uncommon to do that kind of "custom allocator" style in C, especially in traditional Unix utility programming. It seems to be more of a C++ thing culturally in my experience, though you do find it in C in "big" programs, like scientific-computing code.

Why exactly is it necessary to to free each hash entry instead of exiting the process?

Others have already mentioned that it isn't really necessary, but for some historical context this wasn't always true. When I first started programming computers, apps (written in languages that aren't garbage collected) on your typical consumer level computer/OS had to be really careful with freeing up all allocated memory before exiting or that memory would be gone forever (as far as the OS was concerned when it came to reallocating it to other apps) until the user rebooted the computer.

Advances coming from the combination of more advanced operating systems and CPU features like MMUs have made this a non-issue in most cases (can still be an issue on embedded, etc).

This was a particular problem on the Amiga system, among others. There were attempts to retro-fit resource tracking to the OS, but this was made all-but-impossible by the fact that a process can malloc() a chunk of memory, write a message in it, and then hand a pointer to that message to another process, expecting the other process to dealloc() it. This made for wonderfully cheap inter-process communication (for instance, the filesystem would load blocks from disc and pass them as messages to the process wanting them).

If it's the last thing you do before you exit the process, it isn't necessary, because the OS will reclaim your process's memory in one fell swoop. I believe that's what the linked post is advocating 'cp' should do. (At least on modern systems that's true; maybe there are some exotic old systems where not freeing your data structures before exit causes permanent memory leaks?)

It's seen as good C programming practice to free() your malloc()s, though, and it makes extending programs easier if you have that functionality, since what was previously the end of program can be wrapped in a higher-level loop without leaking memory. But if you really are exiting for sure, you don't have to make the final free-memory call. It can also be faster to not do any intermediate deallocations either: just leave everything for the one big final deallocation, as a kind of poor-man's version of one-generation generational GC. Nonetheless many C programmers see it somehow as a bit unclean not to deallocate properly. Arguably it does make some kind of errors more likely if you don't, e.g. if you have cleanup that needs to be done that the OS doesn't do automatically, you now have different kinds of cleanup routines for the end-of-process vs. not-end-of-process case.

I tend to do this in my C programs because in development usually have malloc() wrapped so that if any block hasn't been free()'ed it's reported at exit() time. This kind of check for lost pointers is usually so cheap that you use it even if you never expect to run on a system without decent memory management.

As an aside, GNU libc keeps ( or at least used to keep, I haven't checked in years ) the pointers used by malloc()/free() next to the blocks themselves, which gives really bad behavior when freeing a large number of blocks that have been pushed out to swap--you wind up bringing in pages in order to free them because the memory manager's working set is the size of all allocated memory. Years ago I wrote a replacement that avoided this just to speed up Netscape's horrible performance when it re-sized the bdb1.85 databases it used to track browser history. The browser would just "go away" thrashing the disk for hours and killing it just returned you to a state where it would decide to resize again an hour or so after a restart. Using LD_PRELOAD to use a malloc that kept it's bookkeeping away from the allocated blocks changed hours to seconds.

Is cleaning up after yourself memory-wise considered part of the POSIX standard?

It's not necessary per se, but many programmers do it out of good habits. It becomes much harder to e.g. find memory leaks in your code if you never free the memory you haven't leaked. And normally the extra overhead it entails is imperceptible.

Obviously many programmers do not think to (or cannot easily) test their code with 40 TBs of data.

For one it makes automated leak detection tools like Valgrind useless.

As others have commented, it's not necessary.

If you wrote cp in such a way that it's a reusable C library that other programs might use, then I can understand why they do this. It still doesn't make much sense for the command-line version to call this. Of course, it might run on some weird platform where the OS doesn't reclaim the memory of a process on exit(?).

OP says the system was swapping.

Once your hash table and its linked lists have been swapped to disk, each pointer dereference causes a random disk access, which are orders of magnitude slower than RAM.

So it's a hash table that's about twice the size of physical memory. The entries in the hashtable are usually pointers to memory that you need to deallocate (char* filenames in this case, maybe some other stuff too) so if you're a good citizen you iterate through them all and deallocate (mark is as free) them as you go. It would seem there were some pathological swap issues doing that. It would be interesting to see if a tree behaved better in this type of situation. The allocator has tables of its own in memory too, if you have to swap those out to find the memory that your then going to deallocate then swap out all the user data structures to swap in the allocator's structures to do the deallocation or something whacky like that, it kind of makes sense.

His patch just skips that step and ends the program.

From reading the thread, what happens is that cp before it exits tries to return all memory to the operating system by explicitly freeing it (this is not necessary on modern operating systems as far as I'm aware). What I draw from it is that since it's so big, parts of the structure of the hash table are on the disk, so it will have to swap in almost all of the table to be able to explicitly free everything.

I don't understand it, but the OP says this, implying the author agrees with you for any modern system, but not on 'old systems without working memory management'

> And unless old systems without working memory management must be supported, I don't see any harm in simply removing the call to the forget_all function towards the end of cp.c.

How would such a system even work? Any abnormal process termination or bugs would mean eventually the system becomes unusable.

The system would slowly leak memory until you reboot.

Yep, I remember programming on such a system (Amiga) in the early 90's! Fun times.

This is system C the memory is manually managed with malloc(). You can't simply nil the reference and expect GC to come through and clean it for you, you have to issue a free() for each malloc() that was called. While it might be possible to write a custom disposal routine that does that efficiently for the structure, the simpler way is to just use the normal functions you use for removing single items and call it against everything. Doing that will not be as efficient because those functions keep the structure in a consistent state and don't perform bulk operations, but they have already been written and tested, and in normal use cases they won't be a bottle neck.

> You can't simply nil the reference and expect GC to come through and clean it for you

This is exactly what happens when your process exits and its private memory space is freed in its entirety. Hence the discussion. There's no need to clean up before exit, except if you want to (i) use a leak checker; or (ii) run on a machine with no MMU.

It's funny that ii is even an issue because Linux does not support platforms without MMUs without some serious modifications.

But the program being discussed is (of course) GNU cp. That program is not in any way designed for, limited to, or intended only for, use in Linux.

So what Linux supports and doesn't support in terms of memory management really has no bearing on what GNU cp can or cannot do, in general.

I appreciate that he had the foresight to install more ram and configure more swap. I would hate to be days into a transfer and have the OOM killer strike.

The difficulty is that you are using a filesystem hierarchy to 'copy files' when you actually want to do a volume dump (block copy). Use XFS and xfsdump, or ZFS and zfs send, to achieve this.

Copy with hard link preservation is essentially like running dedupe except that you know ahead of time how many dupes there are. Dedupe is often very memory intensive, and even well thought out implementations don't support keeping book keeping structures on disk.

"Normally I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."

I was simplifying... dump backs up inodes not blocks. Some inodes point to file data and some point to directory data. Hard links are references to the same inode in multiple directory entries, so when you run xfsrestore, the link count increments as the FS hierarchy is restored.

xfsdump/zfs send are file system aware, unlike dd, and can detect fs corruption (ZFS especially having extensive checksums). In fact, any info cp sees about corruption comes from the FS code parsing the FS tree.

However, except on zfs/btrfs, data block corruption will pass unnoticed. And in my experience, when you have bad blocks, you have millions of them -- too many to manually fix. As this causes a read hang, it is usually better to dd copy the fs to a clean disk, set to replace bad blocks with zeros, then fsck/xfs_repair when you mount, then xfsdump.

dd conv=noerror,sync,notrunc bs=512 if=/dev/disk of=diskimg

See Also: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US... http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Me...

If the risk of keeping the system running while the array rebuilt was deemed to high, I would have just gone with a dd/ddrescue of the remaining disks onto new disks and then moved on from there.

+1 for mentioning ZFS. It's really quite amazing. Almost like futuristic alien technology compared to the other freely available file systems.

Given the issues encountered it would make sense to split this into two passes: 1. identify files with bad blocks and log; 2. do the block-level copy. The disadvantages of file-level copy weigh heavier than the doing this in a single step.

That's probably what I'll do if I run into a problem like that another time (I'm the OP).

why is block level preferable?

In light of experience would it perhaps be helpful after all to use a block-level copy (such as Partclone, PartImage, or GNU ddrescue) and analyze later which files have the bad blocks?

I see that the choice of a file-level copy was deliberate: "I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."

A while back I had a hard drive die with an unrecoverable error on the superblock (ext3) and a few other places. I sent the hard drive to a professional recovery firm who were as incompetent as incompetent can be. I got the duff drive back and decided to have a go myself. Recovering the superblock was easy - there are extra copies of that. I then did a full map of the drive listing bad blocks, and did a reverse map to work out which files contained those blocks. Luckily, the only file that contained any bad blocks was a swap file, so I recovered absolutely everything. Other bad blocks were in unallocated space. Ext3 is actually a sane enough disc layout that it is possible to work out what is going on manually. I would have been stuffed if it had been a more complicated filesystem.

Also there is no mention of unrecoverable file analysis like error handling of cp operations in the article. And with this many files it would not be feasible without using an error log file.

So going with a simple block copy should suffice IMHO.

I'm the OP, so I can shred a bit of light on that: Dell's support suggested a file-level copy when I asked them what they recommended (but I'm not entirely sure they understood the implications). Also, time was not a big issue.

I did keep a log file with the output from cp, and it clearly identified the filenames for the inodes with bad blocks. Actually, I'm not sure how dd would handle bad blocks.

Thank you for clarification.

I was about to bet on "read fail repeat skip" cycle for dd's behaviour but, looking into coreutil's source code at https://github.com/goj/coreutils/blob/master/src/dd.c , if I'm not mistaken , dd does not try to be intelligent and just uses a zeroed out buffer so It would return 0's for unreadable blocks.


In Windows-land, the default copy is pretty anemic, so probably most people avoid it for serious work.

I'd probably use robocopy from the command line. And if I was being lazy, I'd use the Teracopy GUI.

I think my limit for a single copy command has been around 4TB with robocopy--and that was a bunch of large media files, instead of smaller more numerous files. Maybe there's a limit I haven't hit.

> Teracopy

I've used FastCopy for GUI based larger transfers, it's open source and can handle larger datasets well in my experience. It also doesn't choke on >MAX_PATH paths. Haven't had problems with it. Supposedly it's the fastest tool around...

The only slight issue is that the author is Japanese so the English translations aren't perfect plus the comments in the source are in Japanese.

But, can it handle the real stress test: filenames with colons?

">MAX_PATH paths"

How does this happen?

Technical debt that keeps on giving.

Today there are N applications. "We can't increase MAX_PATH because it will break existing applications!"

Tomorrow there are N+M applications. "We can't increase MAX_PATH because it will break existing applications!"

Repeat forever.

Any time you are faced with a hard technical decision like this, the pain will always be least if you make the change either:

1. During another transition (e.g. 16-bit to 32-bit, or 32-bit to 64-bit). Microsoft could have required all 64-bit Windows apps to adopt a larger MAX_PATH, among other things.

2. Right NOW, because there will never be an easier time to make the change. The overall pain to all parties will only increase over time.

This article has a pretty nice overview: http://msdn.microsoft.com/en-us/library/windows/desktop/aa36...

The MAX_PATH limitation can be avoided by using the \\?\ prefix (which also disables some kinds of path normalization, like converting / to \) or by using paths relative to the current directory.

Unfortunately, most tools (including Windows Explorer and other Microsoft software) still don't work with paths longer than MAX_PATH.

It happens on pretty much every network share, especially when you address it with an UNC path ("\\server\share\...") instead of mapping it as a drive.

If you have non-Windows users using the share it's pretty much guaranteed, since they don't have a problem with it - only Windows chokes on this.

Even Windows users manage to somehow create such paths (files) occasionally. Then you need to break out cmd.exe or somesuch because Explorer can't touch them.

Once you get above about a million files on NTFS, life becomes very slow. If you use a DVCS and check out a bunch of projects with a bunch of history, you can easily get to a point where you can't do too much with the drive. This was true as of 7/2008, perhaps 8/2012 have different/better tunings.

Number of files on the drive shouldn't matter that much, actually. In a single folder is a different matter, of course (for that you definitely want to disable 8.3 compatibility). But the only way I can imagine right now where the number of files on the drive makes a difference would be where you crammed the drive almost full, resulting in a very fragmented MFT (normally the file system tries to keep space for growing the MFT, but not if you try to fill that space with data, too).

When I was using Windows, I personally used Robocopy to make to personal backups to external drives. I was happy when MS started including it in Windows.

Another lesson to be learnt is that it's nice to have the source code for the tools we are using.

The email states that file-based copy operations were used in favor of dd due to suspected block errors. Two questions come to mind:

1. I've not used dd on failing media, so I'm not sure of the behavior. Will it plow through a file with block-read failures or halt?

2. There's the ddrescue utility, which is specifically intended for reading from nonreliable storage. Seems that this could have offered another means for addressing Rasmus's problem. It can also fill in additional data on multiple runs across media, such that more complete restores might be achieved. https://www.gnu.org/software/ddrescue/ddrescue.html

OP said "I went for a file-level copy because then I'd know which files contained the bad blocks". When you copy the block device with ddrescue (dd doesn't have logic to work around the bad sectors and the only sensible action for it is thus to stop, but don't take my word for it), the result will just have zeroes in the places where bad blocks were, and, assuming the filesystem structure is good enough (you should run fsck on it), will give you files with zones of zeroes. But you won't know which files without either comparing them to a backup (which you won't have by definition if you're trying to recover) or with a program that verifies every file's structure (which won't exist for the general case). Whereas cp will issue error messages with the path of the file in question. So the OP's decision makes sense.

I've played with ddrescue very lightly. From the GNU webpage linked above, it appears it creates logfiles which can be examined:

Ddrescuelog is a tool that manipulates ddrescue logfiles, shows logfile contents, converts logfiles to/from other formats, compares logfiles, tests rescue status, and can delete a logfile if the rescue is done. Ddrescuelog operations can be restricted to one or several parts of the logfile if the domain setting options are used.

That might allow for identification of files with bad sectors.

That would need either a hook to the kernel or a file system parser.

Even if you manage to do that, I'm not sure it would be a good idea to continue to use a file system that has lost sectors, even after fsck. Are you sure fsck is fixing any inconsistency? Are there any automatic procedures in place that guarantee that the fsck algorithms are in sync with the actual file system code? (Answer anew for any file system I might be using.) You definitely should do backups by reading the actual files, not the underlying device; perhaps in this case it could be OK (since it was a backup itself already, hence a copy of life data; but then if OP bothered enough to recover the files, maybe he'll bother enough to make sure they stay recovered?)

For that many files I probably would've used rsync between local disks. shrug

And hopefully you would have written up a similar essay on the oddball experiences you had with rsync, which is even more stateful than cp and even more likely to have odd interactions when used outside its comfort zone.

Ditto for tricks like:(cd $src; tar cf - .) | (cd $dst; tar xf -).

Pretty much nothing is going to work in an obvious way in a regime like this. That's sort of the point of the article.

Or maybe not. He mentions rsnapshot in the article, which uses rsync under the hood. This implies rsync would have a very good chance of handling a large number of hardlinks... since it created them in the first place.

That doesn't follow. If backups are for multiple machines to a big file server, the backup machine will have a much larger set of files than those that come from an individual machine. Further, each backup "image" compares the directory for the previous backup to the current live system. Generally it looks something like this:

1. Initial backup or "full backup" - copy the full targeted filesystem to the time indexed directory of the backup machine.

2. Sequential backups:

a. on the backup machine, create a directory for the new time, create a mirror directory structure of the previous time.

b. hard link the files in the new structure to those in the previous backup (which may be links themselves, back to the last full backup.

c. rsync the files to the new backup directory. Anything that needs to be transfered results in rsync transfering the file to a new directory, the moving it into the proper place. This unlinks the filename from the previous version and replaces it with the full version.

So yeah, the result of this system over a few machines and a long-timeframe backup system is way more links on the backup machine than any iteration of the backup will ever actually use.

Yes, it has more links, I realize, but this still doesn't mean it wouldn't work. Give it a shot and report back. (Hah.)

But rsnapshot created that large number of hardlinks incrementally. Without trying it there's no indication that rsync handles that better, especially as rsync with default settings doesn't preserve hardlinks as rsync builds a quite expensive data structure for them.

I'm surprised that cp even worked. I wouldn't have had the balls to just use cp as it could just have failed.

My strategy would probably have been to cp or rsync over the latest backup (so that new incremental backups could be immediately be created again) and then incrementally rsync'ing the previous backups with hardlink detection.

> The number of hard drives flashing red is not the same as the number of hard drives with bad blocks.

This is the real take-away. Monitor your drives. At very least enable SMART, and also regularly run a read on the full underlying drive (SMART won't see and log blocks that are on the way out so need retries for successful reads, unless you actually try to read those blocks).

That won't completely make you safe, but it'll greatly reduce the risk of other drives failing during a rebuild by increasing the chance you get advanced warning that problems are building up.

Glad someone noticed it (I'm the OP). Reading the drives systematically is called "Patrol Read" and is often enabled by default, but you can tweak the parameters.

The later replies regarding the size of the data structures cp is using are also worth reading. This is a case where pushing the command farther can make you think harder about the computations being done.

On Unix, isn't it considered bad practice to use cp in order to copy a large directory tree?

IIRC, the use of tar is recommended.

Something like:

    $ (cd $origin && tar cf - *) | (cd $destination && tar xvf - )

Use && there, not ; - consider the result if either of the cd commands fails.


Would have been even worse in this case due to the requirement to preserve hard links.

tar can preserve hardlinks. In fact, tar is really the best solution in this case, using the streaming between two tar commands.

As discussed in detail elswhere abive: the problem is not with the ability to preserve hardlinks, but the resources, namely RAM, required to do so.

Two tar command would in fact be much worse, because they'd require twice as many resources.

I doubt it, since you're using a pipe. It's not like you're actually storing the whole archive anywhere.

Also, and just to be clear, I did not invent this idea of using tar for this. I remember having read it in a Unix manual when I was learning about this OS.

And even if using tar is not a good idea, it should certainly be considered and as such I don't understand why OP doesn't even mention it in his post.

The pipe is not the issue; you actually are storing the full paths of all files in the archive somewhere (namely in RAM) so that you're able to decide whether a given file you're looking at is an alternative hardlink for a file you have already sent.

Yes, using tar for copying is not new and has its use cases, but this is not one of them.

Certainly not for all files.

The filesystem keeps a track of the number of nodes links for each file. So tar does not have to memorize the inode for each file, but only for those who have duplicates. So unless a large amount of files have more than one hard link, this is probably not a problem.

We will never know though, considering the author did not try it. Neither did he explain why he did not, despite the fact that it is the recommended Unix procedure.

From the original post: "We use rsnapshot (..) so most of the files have a high link count."

> it is the recommended Unix procedure

I don't think it is "recommended procedure", not today. At least on GNU/Linux cp is just fine. It may have been different two decades ago for some reason: I remember a graphical file manager (called "dfm") that used piped tar back in the late 90ies [1]. I know that installing open source utilities was pretty common back then to get better programs on closed source Unixes, and tar was probably high on the priorities to install whereas cp less so, and hence tar might have in fact been better than the preinstalled cp. I haven't seen anyone in the public technical (hence primarily open source) community advocating that even for portability reasons, so I can only think of it as "perhaps historically relevant, if at all".

Your point is valid though that one could have tested it here. See rbh42's answer on this point. Although I'm not sure he's right about both tar instances requiring a hash table: if tar submits the hard linking as from a previously submitted path to the new one, then there's no need for the table. And my quick look at the transmitted data seems to contain the source path for the link. The question then becomes whether tar has a better optimized hash table (not very likely), or uses a better algorithm (see my post about "megacopy" for an attempt at the latter).

[1] It actually had a problem: it didn't put "--" between the options and the path, and promptly failed when I moved an item that had a file name starting with a minus: it was interpreted by tar as option, and in addition the file manager didn't properly check the exit codes and simply unlinked the source even though it wasn't copied. The tools before GNU may really have been wanting: his failure to use "--" may have been caused by some contemporary non-GNU tar not supporting it. Anyway, of course I moved on to another file manager. Those bits of mine were sacrificed to the god of superior recommended Unix procedures or something.

Wouldn't it somehow have to keep about the same data as cp for preserving hard links?

> While rebuilding, the replacement disk failed, and in the meantime another disk had also failed.

I feel the pain. I went thru the same hell a few months ago.

Another lesson: routinely scrub your RAID arrays.

On debian-based systems, /etc/cron.d/mdadm will already do this on the first Sunday of the month.

I wonder how well rsync would have fared here.

Rsync can die just from scanning the whole directory tree of files first.

The incremental option(enabled by default) introduced in rsync 3.0 greatly reduces the need for scanning the whole directory structure.

If it's changing? You'd snapshot (assuming either LVM or BtrFS) then rsync of the snapshot.

Maybe this is naive, but wouldn't it have made more sense to do a bunch of smaller cp commands? Like sweep through the directory structure and do one cp per directory? Or find some other way to limit the number of files copied per command?

No, because then it wouldn't have replicated the hardlink structure of the original tree. That was the goal, and also the bit that causes the high resource consumption.

A problem with cp (and rsync, tar, and linux in general) is there is read-ahead within single files, but no read-ahead for the next file in the directory. So it doesn't make full use of the available IOPS capacity.

This is not, not, not how one should be using RAID.

The math is clear that in sufficiently large disk systems, RAID5, RAID6, and friends, are all insufficient.

Can you elaborate?

I'm guessing he means that RAID is designed as a technology to increase the availability of your data and the person in this story is using it as a backup solution. Google 'RAID backup' to see why it's such a bad idea to use it for that purpose.

He specifically used cp because he was worried about data loss from using something like dd. But if he'd been doing regular backups to tape, he could have just restored the backup from tape with no fuss or problems. It's only because he was using tools that aren't suited to what he was trying to do that he ran into problems.

Thanks for the link! The article says, that due to the read error rate and the size of today's disks, RAID 5 and RAID 6 have (kind of) lost their purpose.

Yep, mathematically no longer safe.

>We use XFS


XFS really shines when you're dealing with large volumes, especially with RAID. There are certain features, like striped allocation, that mean you end up getting faster performance, because the filesystem has a better idea of how blocks are laid out on the disks.

I personally still consider XFS a very mature and reliable filesystem. Both in terms of utility programs and kernel implementation. If I remember correctly, it was ported to linux from SGI/Irix where it was used for decades. It also was the default fs for RedHat/centos for a long time, so it might still have stuck at many shops.

Heres my anecdotal datapoint on which I base my personal believe:

From about 10-6 years ago, when I was doing sysadmin-work at university building storage-systems from commodity parts for experimental bulk data, we first had a load of not-reliably working early raid/sata(?) adapters, and those made ext3 and reiserfs (I think...) oops the kernel when the on-disk structure went bad. Whereas XFS just put a "XFS: remounted FS readonly due to errors" in the kernel logfile. That experience made XFS my default filesystem up to recently when I started to switch to btrfs. (of course, we fixed the hardware-errors, too... :-) )

Also, from that time, I got to use xfsdump/xfsrestore for backups and storage of fs-images which not even once failed on me.

As a blithe, new-Linux user (3.5 years), I was bumfuzzled when I saw RHEL/CentOS 7 switched from ext4 to XFS, figuring it to be some young upstart stealing the crown from the king. Then I did some Googling and figured out that XFS is as old as ext2! I'm looking forward to discovering how tools like xfs* can make my life easier.

Most likely because it would take a looong time to fsck 40TB using ext3/4.

ext2/3/4 have terrible large-directory performance. If you create 100,000 files in a directory in an ext2/3/4 filesystem, enumerating said files (ls, readdir, etc.) becomes unbearably slow -- and the slowness persists even after you delete all but one of the files in the directory.

Tested by time (I'm the OP).

> Tested by time (I'm the OP).

This is so important, and so often overlooked.

The lure of "ohhh shiny!" is prevalent in tech. I love to use bleeding edge stuff, play with new toys etc, but that is purely for home and personal stuff. Never for work.

I want my work life to be utterly boring. Nothing breaking. Everything behaving exactly as expected. I will happily trade a bit of performance and a lack of some shiny features for a system or component that works reliably. The less time I have to spend fighting stuff the more time I can spend tuning, automating or improving what I do have to look after.

XFS performs well. It's reliable, reasonably fast (they trade some speed for durability/reliability), and just works

Rsync seems a better tool for this. Can be run multiple times and it will just copy missing blocks.

it reminds me of crash only software.

I would probably have used tar|tar for this, or rsync.

About rsync: if you just use -a, it does not copy hard links correctly:

  -a, --archive
  [...]  Note that -a does not preserve hardlinks, because 
  finding multiply-linked files is expensive. You must 
  separately specify -H.
If you do specify -H, rsync does keep track of hard links, but presumably at the cost of keeping a data structure similar to cp:

  -H, --hard-links
  This tells rsync to look for hard-linked files in the source and 
  link together the corresponding files on the destination. Without 
  this option, hard-linked files in the source are treated as though 
  they were separate files. [...]
Of course, rsync could be more efficient at keeping track of the hard links than cp, but there's no reason to believe a priori that it would be.

Darn, thanks! I've been using rsync for incremental backups (themselves hard-link-based) for years, but neglecting to do this. That is to say, snapshot N+1 shares structure with backup N to save space (via hard links) via --link-dest. But files may be replicated because -H (missing) is orthogonal of --link-dest.

You can use fdupes (or various others similar to it) to find and automatically hardlink the duplicates.

You're right to recommend a tarpipe. I've had to copy several very large BackupPC storage pools in the past, and a tarpipe is the most reliable way to do it. (The only downside to BackupPC IMO...)

For future reference for other folks, the command would look something like this:

    cd /old-directory && tar czvflpS - . | tar -C /new-directory -xzvf -
Tarpipes are especially neat because they can work well over ssh (make sure you have ssh configured for passwordless login, any prompt at all will bone the tarpipe):

    cd /old-directory && tar czvflpS - . | ssh -i /path/to/private-key user@host "tar -C /new-directory -xzvf -"
...but tarpipe-over-ssh is not very fast. I have a note that says, "36 hours for 245G over a reasonable network" (probably 100Mb).

Disk-to-disk SATA or SAS without ssh in between would be significantly faster.

Be cautious about using tar pipes with compression. I've had experiences where the tar pipe will see the tarball being sent over and truncate the inner tarball. The compression in SSH tends to be good enough. My other suggestion for SSH is to change the algorithm used; rc4, while insecure for most things, is normally good enough, if not, blowfish tends to be a bit faster than AES.

The prompt goes to stderr, the pipe only pipes stdout, so a prompt should not cause excessive bonage, as long as you're there to respond to it.

Also, don't use -z locally, or even over a moderately fast network. The compression is not that fast and almost always makes things slower.

Good to know!

Also, re: bonage, I agree that it "shouldn't", but it definitely did. From my sysadmin notes file:

> The tar operation kicks off before ssh engages; having ssh ask for a password seems to intermittently cause problems with the tar headers on the receiving end. (It _shouldn't_, but it seems to.)

On modereately fast networks, instead of -z which uses a single threaded gzip implementation, parallel gzip may still give big improvements: http://zlib.net/pigz/

What's the benefit of using a tarpipe locally?

No giant bookkeeping datastructures that end up with the process thrashing in swap, because tar is designed to work with so many files that you don't want to keep the metadata in memory.

How does tar keep track of every hardlink without a giant hash table? It doesn't. It's not magic.

This is true if the bookkeeping was irrelevant and unnecessary. I imagine the scenario where your copy gets stopped in the middle is worse when you use the tar method.

It's not as if cp gives you anything more useful in that case. The bookkeeping that caused the problem was a hashtable of files that have been copied, necessary to handle hard links. Would be interesting how tar does that; if it can't do any better then the root comment and mine above are wrong.

tarpipes can handle gnarly piles of hardlinked files without sweating. BackupPC is free software for network backups that builds a massively hardlinked file pool; tarpipes are the only thing that can reliably copy BackupPC pools from one location to another.

Not so. Both the sending and the receiving tar processes will need a data structure keeping track of which inodes they've already processed. They can skip inodes with a link count of 1, but if all inodes have multiple links (as in this case), the overhead will be twice that of a single cp process.

Rsync has similar issues when you are copying hard links. It has to keep a table of all the inodes/filename pairs it sees so it can detect hard links and relink them on the destination. This can be big if you have a ton of hard links (say in a backup situation).

Rsync at least acts reasonably idempotent, so you can just run it again if it gets interrupted, which is usually why I use it for large copies.

I don't remember off the top of my head if tar handles hard links—it may be inappropriate for this usage.

If you are often restarting rsync jobs or they fail over low speed WAN connections

    --partial               keep partially transferred files
Is your friend. Esp if you are trying to sync directories with multi GB files. Also if your media is bad or there is a failing piece of network gear along the path

    --bwlimit=KBPS          limit I/O bandwidth; KBytes per second
Can keep things running smoothly. It even works against disks, I have used it where an array would die if one attempted to read from it too quickly.

Wouldn't it only need to track the inodes with a link count above 1?

I'm unfamiliar with tar|tar, can you provide some incite into this trick?

Not much of a trick, the first tar serializes the entire directory structure including all the metadata into a stream of bytes, from which the second tar reassembles it.

Basically the same as tar's standard usage, except you don't store that stream as a tar file/image anywhere.

Typically when copying from one machine to another, I've had the best results by far with tar | tar, but I'm not sure how well that would fare with tons of hard links like this. I think it's possible, but the default behavior would just gobble up umpteen copies of the same file.

This. I guess anyone who began using UNIX systems back in the SunOS days or earlier would never think of using cp for this job. Well, mostly because the vendor-provided cp usually sucked badly (no recursive copy), but also because tar has always been such a great, reliable tool.

The only disadvantage of tar is its peculiar syntax, but that's something you get used to.

I would certainly have used cpio.

Down voted because you disagree? Or you don't know what cpio is? Either way please discuss instead of or in addition to voting.

Your post has no information as to why you'd use cpio.

OK, fair point. cpio doesn't have global data structures that grow with the size of the operation, which cp does have. rsync prior to 3.0 also had to scan the entire job before it would begin the transfer, which can take forever. cpio operates on a stream, in the form of find | cpio , so cpio only sees one file from the stream at a time. Find, at least GNU find, is also written such that its resources don't vary with the amount of output.

rsync is a bit of a kitchen sink, and cp is too simple. cpio is just the thing, and was designed for this exact use case.

Does your setup preserve hard links? Because this is from where OP's problems originated.

Ok, In this case why use a file to file copy like tar,cp, or rsync instead of a block copy like dd?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact