
My experience with using cp to copy 432 million files (39 TB) - nazri1
http://lists.gnu.org/archive/html/coreutils/2014-08/msg00012.html
======
fintler
I wrote a little copy program at my last job to copy files in a reasonable
time frame on 5PB to 55PB filesystems.

[https://github.com/hpc/dcp](https://github.com/hpc/dcp)

We got an IEEE paper out of it:

[http://conferences.computer.org/sc/2012/papers/1000a015.pdf](http://conferences.computer.org/sc/2012/papers/1000a015.pdf)

A few people are continuing the concept to other tools -- that should be
available at [http://fileutils.io/](http://fileutils.io/) relatively soon.

We also had another tool written on top of
[https://github.com/hpc/libcircle](https://github.com/hpc/libcircle) that
would gather metadata on a few hundred-million files in a few hours (we had to
limit the speed so it wouldn't take down the filesystem). For a slimmed down
version of that tool, take a look at
[https://github.com/hpc/libdftw](https://github.com/hpc/libdftw)

~~~
laymil
It's interesting and useful for scientific computing, where you already have
an MPI environment and distributed/parallel filesystems. However, it's not
really applicable to this workload, as the paper itself notes:

 _There is a provision in most file systems to use links (symlinks, hardlinks,
etc.). Links can cause cycles in the file tree, which would result in a
traversal algorithm going into an infinite loop. To prevent this from
happening, we ignore links in the file tree during traversal. We note that the
algorithms we propose in the paper will duplicate effort proportional to the
number of hardlinks. However, in real world production systems, such as in
LANL (and others), for simplicity, the parallel filesystems are generally not
POSIX compliant, that is, they do not use hard links, inodes, and symlinks.
So, our assumption holds._

The reason this cp took so long was the need to preserve hardlinks, together
with the repeated resizing of the hashtable used to track the device and inode
numbers of the source and destination files.

~~~
encoderer
Sure, but if you read that article you walk away with a sense of _that's a lot
of files to copy_. And the GP built a tool for jobs 2-3 orders of magnitude
larger?! Clearly there are tradeoffs forced on you at that size...

------
pedrocr
How about this for a better cp strategy to deal with hardlinks:

1\. Calculate the hash of /sourcedir/some/path/to/file

2\. Copy the file to /tempdir/$hash if it doesn't exist yet

3\. Hard-link /destdir/some/path/to/file to /tempdir/$hash

4\. Repeat until you run out of source files

5\. Recursively delete /tempdir/

This should give you a faithful copy with all the hard links, with constant
RAM usage, at the cost of the CPU needed to run all the hashing. If you're
smart about doing steps 1 and 2 together, it shouldn't require any additional
I/O (ignoring the extra file metadata).

Edit: actually this won't recreate the same hardlink structure, it will
deduplicate any identical files, which may not be what you want. Replacing the
hashing with looking up the inode with stat() would actually do the right
thing. And that would basically be an on-disk implementation of the hash table
cp is setting up in memory.
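
For what it's worth, a minimal C sketch of that inode-keyed variant might look
like the following (copy_file_contents() is a hypothetical helper, and error
handling is mostly omitted): stat() each source file, use a /tempdir entry
named after its device and inode numbers as the on-disk equivalent of cp's
hash table, copy the data there only the first time an inode is seen, and
hard-link every destination path to it.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* hypothetical helper: copy file contents from src to dst */
    extern int copy_file_contents(const char *src, const char *dst);

    int copy_preserving_hardlinks(const char *src, const char *dst,
                                  const char *tmpdir)
    {
        struct stat st;
        char key[4096];

        if (lstat(src, &st) != 0 || !S_ISREG(st.st_mode))
            return -1;

        /* on-disk "hash table" entry, keyed by (device, inode) */
        snprintf(key, sizeof key, "%s/%llu_%llu", tmpdir,
                 (unsigned long long) st.st_dev,
                 (unsigned long long) st.st_ino);

        if (access(key, F_OK) != 0          /* first time this inode is seen */
            && copy_file_contents(src, key) != 0)
            return -1;

        return link(key, dst);              /* every path becomes one more hard link */
    }

Deleting /tempdir at the end just drops one extra link per inode, so the
destination tree ends up with the same link structure as the source.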

~~~
derefr
If you cp your data onto a Plan9 machine, what results is pretty much exactly
the process you've outlined.

Plan9's default filesystem is made up of two parts: Fossil, and Venti.

\- Fossil is a content-addressable on-disk object store. Picture a disk
"formatted as" an S3 bucket, where the keys are strictly the SHAsums of the
values.

\- Venti is a persistent graph database that holds what would today be called
"inode metadata." It presents itself as a regular hierarchical filesystem. The
"content" property of an inode simply holds a symbolic path, usually to an
object in a mounted Fossil "bucket."

When you write to Venti, it writes the object to its configured Fossil bucket,
then creates an inode pointing to that key in that bucket. If the key already
existed in Fossil, though, Fossil just returns the write as successful
immediately, and Venti gets on with creating the inode.

Honestly, I'm terribly confused why all filesystems haven't been broken into
these two easily-separable layers. (Microsoft attempted this with WinFS, but
mysteriously failed.) Is it just inertia? Why are we still creating new
filesystems (e.g. btrfs) that don't follow this design?

~~~
pedrocr
_> Honestly, I'm terribly confused why all filesystems haven't been broken
into these two easily-separable layers. Is it just inertia?_

The penalty for doing content-addressed filesystems is of course the CPU
usage. btrfs probably gets most of the benefits without the CPU cost via its
copy-on-write semantics.

Note that what you describe (and my initial process) has different semantics
from hard links. What you get is shared storage, but if you write to one of
the files only that one gets changed, whereas with hardlinks both files change.

~~~
derefr
In effect, hard links (of mutable files) are a declaration that certain files
have the same "identity." You can't get this with plain Venti-on-Fossil, but
it's a problem with Fossil (objects are immutable), not with Venti.

Venti-on-Venti-on-Fossil would work, though, since Venti just creates
imaginary files that inherit their IO semantics from their underlying store,
and this should apply recursively:

1\. create two nodes A and B in Venti[1] that refer to one node C in Venti[2],
which refers to object[x] with key x in Fossil.

2\. Append to A in Venti[1], causing a write to C in Venti[2], causing a write
to object[x] Fossil, creating object[y] with key y.

3\. Fossil returns y to Venti[2]; Venti[2] updates C to point to object[y] and
returns C to Venti[1]; Venti[1] sees that C is unchanged and does nothing.

Now A and B both effectively point to object[y].

(Note that you don't actually have to have two Venti servers for this! There's
nothing stopping you from having Venti nodes that refer to other Venti nodes
within the same projected filesystem--but since you're exposing these nodes to
the user, you get the "dangers" of symbolic links, where e.g. moving them
breaks the things that point to them. For IO operations they have the
semantics of hard links, though, instead of needing to be special-cased by
filesystem-operating syscalls.)

~~~
ori_b
You seem to be confusing venti and fossil.

~~~
theworst
Can you explain further? I am not a plan9 expert, by any means, but I'm stuck
at where GP made the confusion. Thanks!

~~~
yungchin
He just swapped the names I think - Venti is the block store, Fossil is the
file system layer.

------
rwg
_Disassembling data structures nicely can take much more time than just
tearing them down brutally when the process exits._

A wonderful trend I've noticed in Free/Open Source software lately is proudly
claiming that a program is "Valgrind clean." It's a decent indication that the
program won't do anything silly with memory during normal use, like leak
it. (There's also a notable upswing in the number of projects using static
analyzers on their code and fixing legitimate problems that turn up, which is
great, too!)

While you can certainly just let the OS reclaim all of your process's
allocated memory at exit time, you're technically (though intentionally)
leaking memory. When it becomes too hard to separate the intentional leaks
from the unintentional leaks, I'd wager most programmers will just stop
looking at the Valgrind reports. (I suppose you could wrap free() calls in
"#ifdef DEBUG ... #endif" blocks and only run Valgrind on debug builds, but
that seems ugly.)

A more elegant solution is to use an arena/region/zone allocator and place
potentially large data structures (like cp's hard link/inode table) entirely
in their own arenas. When the time comes to destroy one of these data
structures, you can destroy its arena with a single function call instead of
walking the data structure and free()ing it piece by piece.
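
As a rough illustration (a minimal bump-style arena, not any particular
platform's API), the whole thing can be a couple of dozen lines of C; entries
for something like cp's hard link/inode table come out of large blocks owned
by the arena, and tearing the table down is a handful of free() calls instead
of one per entry:

    #include <stdlib.h>

    struct arena_block {
        struct arena_block *next;
        size_t used, cap;
        char data[];
    };

    struct arena { struct arena_block *head; };

    void *arena_alloc(struct arena *a, size_t n)
    {
        n = (n + 15) & ~(size_t) 15;                      /* keep allocations aligned */
        if (!a->head || a->head->cap - a->head->used < n) {
            size_t cap = n > (1 << 20) ? n : (1 << 20);   /* grow in 1 MiB blocks */
            struct arena_block *b = malloc(sizeof *b + cap);
            if (!b)
                return NULL;
            b->next = a->head;
            b->used = 0;
            b->cap = cap;
            a->head = b;
        }
        void *p = a->head->data + a->head->used;
        a->head->used += n;
        return p;
    }

    void arena_destroy(struct arena *a)                   /* frees everything at once */
    {
        struct arena_block *b = a->head, *next;
        while (b) {
            next = b->next;
            free(b);
            b = next;
        }
        a->head = NULL;
    }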

Unfortunately, like a lot of useful plumbing, there isn't a standard API for
arena allocators, so actually doing this in a cross-platform way is painful:

• Windows lets you create multiple heaps and allocate/free memory in them
(HeapCreate(), HeapDestroy(), HeapAlloc(), HeapFree(), etc.).

• OS X and iOS come with a zone allocator (malloc_create_zone(),
malloc_destroy_zone(), malloc_zone_malloc(), malloc_zone_free(), etc.).

• glibc doesn't have a user-facing way to create/destroy arenas (though it
uses arenas internally), so you're stuck using a third-party allocator on
Linux to get arena support.

• IRIX used to come with an arena allocator (acreate(), adelete(), amalloc(),
afree(), etc.), so if you're still developing on an SGI Octane because you
can't get enough of that sexy terminal font, you're good to go.

~~~
_delirium
Adding some kind of arena-allocation library to both the build & runtime
dependencies _solely_ to keep valgrind happy, with no actual improvement in
functionality or performance, doesn't seem like a great tradeoff on the
software engineering front. I'd rather see work on improving the static
analysis. For example if some memory is intended to be freed at program
cleanup, Valgrind could have some way of being told, "this is intended to be
freed at program cleanup". Inserting an explicit (and redundant) deallocation
as the last line of the program just to make the static analyzer happy is a
bit perverse.

(That is, assuming that you don't need portability to odd systems that don't
actually free memory on process exit.)

~~~
andreasvc
I don't see why you assume arenas would be added "solely to keep valgrind
happy". Arenas perform better when allocating a large number of small chunks,
because an arena can make trade-offs for this use case that a general-purpose
malloc allocator can't.

------
mililani
This may be a little off topic, but I used to think RAID 5 and RAID 6 were the
best RAID configs to use. It seemed to offer the best bang for buck. However,
after seeing how long it took to rebuild an array after a drive failed (over 3
days), I'm much more hesitant to use those RAID levels. I much prefer RAID
1+0, even though the overall cost is nearly double that of RAID 5. It's much
faster, and there is no rebuild process if the RAID controller is smart
enough. You just swap failed drives, and the RAID controller automatically
utilizes the backup drive and then mirrors onto the new drive. Just much
faster and much less prone to multiple drive failures killing the entire RAID.

~~~
halfcat
This cannot be stressed strongly enough. There is never a case when RAID5 is
the best choice, ever [1]. There are cases where RAID0 is mathematically
proven more reliable than RAID5 [2]. RAID5 should never be used for anything
where you value keeping your data. I am not exaggerating when I say that very
often, your data is safer on a single hard drive than it is on a RAID5 array.
Please let that sink in.

The problem is that once a drive fails, during the rebuild, if any of the
surviving drives experience an unrecoverable read error (URE), the entire
array will fail. On consumer-grade SATA drives that have a URE rate of 1 in
10^14, that means if the data on the surviving drives totals 12TB, the
probability of the array failing rebuild is close to 100%. Enterprise SAS
drives are typically rated 1 URE in 10^15, so you improve your chances ten-
fold. Still an avoidable risk.

RAID6 suffers from the same fundamental flaw as RAID5, but the probability of
complete array failure is pushed back one level, making RAID6 with enterprise
SAS drives possibly acceptable in some cases, for now (until hard drive
capacities get larger).

I no longer use parity RAID. Always RAID10 [3]. If a customer insists on
RAID5, I tell them they can hire someone else, and I am prepared to walk away.

I haven't even touched on the ridiculous cases where it takes RAID5 arrays
weeks or months to rebuild, while an entire company limps inefficiently along.
When productivity suffers company-wide, the decision makers wish they had paid
the tiny price for a few extra disks to do RAID10.

In the article, he has 12x 4TB drives. Once two drives failed, assuming he is
using enterprise drives (Dell calls them "near-line SAS", just an enterprise
SATA), there is a 33% chance the entire array fails if he tries to rebuild. If
the drives are plain SATA, there is almost no chance the array completes a
rebuild.

[1] [http://www.smbitjournal.com/2012/11/choosing-a-raid-level-
by...](http://www.smbitjournal.com/2012/11/choosing-a-raid-level-by-drive-
count/)

[2] [http://www.smbitjournal.com/2012/05/when-no-redundancy-is-
mo...](http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-
reliable/)

[3] [http://www.smbitjournal.com/2012/11/one-big-raid-10-a-new-
st...](http://www.smbitjournal.com/2012/11/one-big-raid-10-a-new-standard-in-
server-storage/)

~~~
Forlien
I think your calculation on failing an array rebuild is wrong. Can you show
how you got those numbers?

~~~
halfcat
Sure, there were two statements I made.

> _On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that
> means if the data on the surviving drives totals 12TB, the probability of
> the array failing rebuild is close to 100%._

10^14 bits is 12.5 TB, so on average, the chance of 12TB being read without a
single URE is very low, and the probability the array fails to rebuild is
close to 100%. I was estimating 10^14 bits to be about 12TB, so the
probability is actually 12/12.5 = 96% chance of failure.

> _...he has 12x 4TB drives. Once two drives failed, assuming he is using
> enterprise drives...there is a 33% chance the entire array fails if he tries
> to rebuild. If the drives are plain SATA, there is almost no chance the
> array completes a rebuild._

A RAID6 with two failed drives is effectively the same situation as a RAID5
with one failed drive. In order to rebuild one failed drive, the RAID
controller must read all data from every surviving drive to recreate the
failed drive. In this case, there are 10x 4TB surviving drives, meaning 40TB
of data must be read to rebuild. Because these drives are presumably
enterprise quality, I am assuming they are rated to fail reading one sector
for every 10^15 bits read (10^15 bits = 125 TB). So it's actually 40/125 = 32%
chance of failure if you try to rebuild.
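
For what it's worth, here is that arithmetic spelled out under a simple model
(UREs occurring at exactly the rated rate, every surviving byte read once
during the rebuild). The 12/12.5 and 40/125 ratios above are, in this model,
the expected number of UREs during the rebuild; treating each bit as an
independent trial instead turns those into roughly 62% and 27% chances of
hitting at least one:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* (bits read during rebuild, URE probability per bit) */
        double cases[][2] = {
            { 12e12 * 8, 1e-14 },   /* 12 TB survivors, consumer SATA: 1 URE per 10^14 bits */
            { 40e12 * 8, 1e-15 },   /* 40 TB survivors, enterprise SAS: 1 URE per 10^15 bits */
        };

        for (int i = 0; i < 2; i++) {
            double expected = cases[i][0] * cases[i][1];   /* ~0.96 and ~0.32 UREs */
            double p_fail = 1.0 - exp(-expected);          /* ~62% and ~27% if bits were independent */
            printf("expected UREs: %.2f   P(at least one URE): %.0f%%\n",
                   expected, 100.0 * p_fail);
        }
        return 0;
    }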

------
vhost-
These are the types of stories I love. I just learned a boatload in 5
minutes.

~~~
3rd3
Is there maybe an archive website dedicated to these kinds of stories?

~~~
breadbox
At one time there was; it was called the Internet. The archive still exists,
but it's been made harder to browse through due to being jumbled up with
javascript and cat gifs.

~~~
jayvanguard
It's true. We should have never let the public on the Internet. It has been
downhill since then.

~~~
taeric
False choice, isn't it? I mean, the complaint isn't that the public now has
sites with massive javascript and related technologies. The complaint is that
it has muscled out useful sites that did not use those technologies. And it
should be heavily noted that the muscle that has pushed out many of these
sites is not necessarily "the public."

~~~
thinkling
Kind of funny to say that on a text-only JS-free site that seems to be alive
and well, linking to an article on an old-school mailing list archive site. :)

~~~
taeric
Oh, certainly. I just resonate with the sentiment that these sites aren't
the majority.

Even this site, honestly, is less than easy to deal with on a recurring basis.
(Consider: it's hard to remember which story was at the top three days ago at noon.)
Specifically, sometimes I lose a story because I refresh and something
plummeted off the page. Hard to have any idea how far to "scroll back" to see
it.

------
calvins
I would usually use the tarpipe mentioned already by others for this sort of
thing (although I probably wouldn't do 432 million files in one shot):

    
    
      (cd $SOURCE && tar cf - .) | (mkdir -p $DEST && cd $DEST && tar xf -)
    

Another option which I just learned about through reading some links from this
thread is pax
([http://en.wikipedia.org/wiki/Pax_%28Unix%29](http://en.wikipedia.org/wiki/Pax_%28Unix%29)),
which can do it with just a single process:

    
    
      (mkdir -p $DEST && cd $SOURCE && pax -rw . $DEST)
    

Both will handle hard links fine, but pax may have some advantages in terms of
resource usage when processing huge numbers of files and tons of hard links.

~~~
tedunangst
You know how tar handles hardlinks, right? By creating a giant hash table of
every file.

~~~
dredmorbius
How's that going to scale with memory? In-memory hash tables were the downfall
of cp here.

~~~
tedunangst
It's going to scale just like you'd imagine it would. All the people saying
"oh, tar was built for this" obviously haven't actually tried replicating the
experiment using tar.

~~~
dredmorbius
Pretty much as I'd suspected.

------
pflanze
I've written a program that attempts to deal with the given situation
gracefully: instead of using a hash table, it creates a temporary file with a
list of inode/device/path entries, then sorts this according to inode/device,
then uses the sorted list to perform the copying/hardlinking. The idea is that
sorting should work well with much lower RAM requirements than the size of the
file to be sorted: thanks to data locality, and unlike the random accesses of
a hash table, it can work on big chunks at a time, at least when done right (a
bit hand-wavy, I know; this is an external sorting problem of the kind I
remember Knuth having written about, though I haven't had the chance to
recheck yet). The program uses the system sort command, which hopefully
already implements this well.
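
A rough C sketch of that replay step, assuming records of the form
"dev inode source-path" have already been dumped for every source file and run
through sort(1); dest_of() and copy_file_contents() are hypothetical helpers:

    #include <stdio.h>
    #include <unistd.h>

    /* hypothetical helpers */
    extern int copy_file_contents(const char *src, const char *dst);
    extern const char *dest_of(const char *src);   /* map a source path to its destination path */

    /* replay records of the form "<dev> <inode> <source-path>", already run
       through sort(1) so that all paths sharing an inode are adjacent */
    void replay_sorted_records(FILE *sorted)
    {
        char line[8192], first_dst[8192] = "";
        unsigned long long prev_dev = 0, prev_ino = 0;
        int have_prev = 0;

        while (fgets(line, sizeof line, sorted)) {
            unsigned long long dev, ino;
            char src[8192];

            if (sscanf(line, "%llu %llu %8191[^\n]", &dev, &ino, src) != 3)
                continue;

            const char *dst = dest_of(src);
            if (have_prev && dev == prev_dev && ino == prev_ino) {
                link(first_dst, dst);              /* same inode: just add another name */
            } else {
                copy_file_contents(src, dst);      /* new inode: copy the data once */
                snprintf(first_dst, sizeof first_dst, "%s", dst);
                prev_dev = dev;
                prev_ino = ino;
                have_prev = 1;
            }
        }
    }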

The program stupidly calls "cp" right now for every individual file copy (not
the hard linking), just to get the script done quickly; it's easy to replace
that with something that saves the fork/exec overhead. Even so, it might be
faster than the swapping hash table if the swap is on a spinning disk. Also
read the notes in the --help text. In other words, this is a work in progress,
a basis to test the idea; it will be easy to round off the corners if there's
interest.

[https://github.com/pflanze/megacopy](https://github.com/pflanze/megacopy)

PS. The idea of this is to make copying work well in the given situation on a
single machine, unlike the approach taken by the dcp program mentioned by
fintler, which seems to rely on a cluster of machines.

There may also be some more discussion about this on the mailing list:
[http://lists.gnu.org/archive/html/coreutils/2014-09/msg00013...](http://lists.gnu.org/archive/html/coreutils/2014-09/msg00013.html)

------
jrochkind1
So it was all the files in one go, presumably with `cp -r`?

What about doing something with find/xargs/i-dunno to copy all the files, but
break them into batches so you aren't asking cp to do its bookkeeping for so
many files in one process? Would that work better? Or worse in other ways?

~~~
xchg_ax_ax
This page may be useful:

[http://unix.stackexchange.com/questions/44247/how-to-copy-
di...](http://unix.stackexchange.com/questions/44247/how-to-copy-directories-
with-preserving-hardlinks)

The main issue is that there's no API to get the list of files hard-linked
together: the only way is to check all the existing files and compare inodes.
If you're doing a plain copy between two filesystems, you cannot choose which
number the target inode will get, so you need to keep a map between inode
numbers, or between inodes and file names ("cp" does the latter).

~~~
sounds
pedrocr's comment above suggests a good solution:

1\. Copy each file from the source volume to a single directory (e.g. /tmp) on
the target volume, named for the source volume inode number.

(edit: I suggest using a hierarchy of dirs to avoid the "too many dentries"
slowdown)

2\. If the file has already been copied, it will already exist in /tmp -
looking up the inode is a vanilla directory lookup

3\. Create a hard link from /tmp to the actual path of the file

4\. When all the files have been created on the target volume, delete the
inode-named files in /tmp

------
pedrocr
Unix could really use a way to get all the paths that point to a given inode.
These days that shouldn't really cost all that much and this issue comes up a
lot in copying/sync situations. Here's the git-annex bug report about this:

[https://git-annex.branchable.com/bugs/Hard_links_not_synced_...](https://git-
annex.branchable.com/bugs/Hard_links_not_synced_in_direct_mode/)

~~~
asveikau
Wow, it's not every day I hear about a filesystem feature that Windows has and
Linux doesn't. (On a recent Windows system: _fsutil hardlink list <path>_ \--
you can try any random exe or DLL in system32 for an example of a hard link.)

I forget what the API for that looks like, if I ever knew. It might be private.

I am surprised, usually Linux is way ahead of Windows on shiny filesystem
stuff.

~~~
peterwwillis
Linux just has more filesystems, and sadly a lot of them have various flaws.
I'm surprised when people are surprised that Linux isn't some completely
superior technical marvel. BSD and Unix systems have been more advanced for
decades.

Everyone on Linux still uses _tar_ for god's sake, even though zip can use the
same compression algorithms people use on tarballs, and zip actually stores an
index of its files rather than 'cat'ing each record on top of the next like an
append-only tape archive. (Obviously there are better formats than 'zip' for
any platform, but it's just strange that nobody has moved away from tar)

~~~
beagle3
tar is good enough for many uses, so people did not move on.

And it doesn't help that tar.gz / tar.bz2 compresses way better than zip in
most cases (thanks to using a single compression context, rather than a new
one for each file; and also compressing the filenames in the same context),
and that it carries ownership and permission information with it - whereas zip
doesn't.

The HVSC project, which tries to collect every single piece of music ever
created on a Commodore C64, distributes its archive as a zip-within-a-zip. A
typical music file is 1k-4k and goes down to ~500-1000 bytes zipped; the
subdirectory+filename are often 100 bytes with a lot of redundancy that zip
doesn't exploit, so they re-zip. Had they used .tar.gz or .tar.bz2, the second
stage would not be needed.

------
pixelbeat
I found an issue in cp that caused 350% extra memory usage for the original
bug reporter; fixing it would at least have kept his working set within RAM.

[http://lists.gnu.org/archive/html/coreutils/2014-09/msg00014...](http://lists.gnu.org/archive/html/coreutils/2014-09/msg00014.html)

------
gwern
> Wanting the buffers to be flushed so that I had a complete logfile, I gave
> cp more than a day to finish disassembling its hash table, before giving up
> and killing the process....Disassembling data structures nicely can take
> much more time than just tearing them down brutally when the process exits.

Does anyone know what the 'tear down' part is about? If it's about erasing the
hashtable from memory, what takes so long? I would expect that to be very
fast: you don't have to write zeros to it all, you just tell your GC or memory
manager to mark it as free.

~~~
mjn
Looking at the code, it looks like deallocating a hash table requires
traversing the entire table, because there is malloc()'d memory associated
with each hash entry, so each entry has to be visited and free()'d. From
hash_free() in coreutils hash.c:

    
    
        for (bucket = table->bucket; bucket < table->bucket_limit; bucket++)
          {
            for (cursor = bucket->next; cursor; cursor = next)
              {
                next = cursor->next;
                free (cursor);
              }
          }
    

Whereas if you just don't bother to deallocate the table before the process
exits, the OS will reclaim the whole memory block without having to walk a
giant data structure. That's a fairly common situation in C programs that do
explicit memory management of complex data structures in the traditional
malloc()/free() style. Giant linked lists and graph structures are another
common culprit, where you have to pointer-chase all over the place to free()
them if you allocated them in the traditional way (vs. packing them into an
array or using a userspace custom allocator for the bookkeeping).

~~~
ritchiea
Why exactly is it necessary to free each hash entry instead of just exiting
the process?

~~~
mjn
If it's the last thing you do before you exit the process, it isn't necessary,
because the OS will reclaim your process's memory in one fell swoop. I believe
that's what the linked post is advocating 'cp' should do. (At least on modern
systems that's true; maybe there are some exotic old systems where not freeing
your data structures before exit causes permanent memory leaks?)

It's seen as good C programming practice to free() your malloc()s, though, and
it makes extending programs easier if you have that functionality, since what
was previously the end of the program can be wrapped in a higher-level loop
without leaking memory. But if you really are exiting for sure, you don't have
to make the final free-memory call. It can also be faster to not do any
intermediate deallocations either: just leave everything for the one big final
deallocation, as a kind of poor-man's version of one-generation generational
GC. Nonetheless many C programmers see it somehow as a bit unclean not to
deallocate properly. Arguably it does make some kind of errors more likely if
you don't, e.g. if you have cleanup that needs to be done that the OS _doesn
't_ do automatically, you now have different kinds of cleanup routines for the
end-of-process vs. not-end-of-process case.

~~~
epmos
I tend to do this in my C programs because in development I usually have
malloc() wrapped so that any block that hasn't been free()'d is reported at
exit() time. This kind of check for lost pointers is usually so cheap that you
use it even if you never expect to run on a system without decent memory
management.
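
A minimal sketch of that kind of wrapper (real versions usually also record
the allocation site of every block that is never freed): count live
allocations and report the balance from an atexit() handler.

    #include <stdio.h>
    #include <stdlib.h>

    static long live_allocs;

    static void report_leaks(void)
    {
        if (live_allocs != 0)
            fprintf(stderr, "leak check: %ld allocation(s) never freed\n",
                    live_allocs);
    }

    void *debug_malloc(size_t n)
    {
        static int registered;
        if (!registered) {
            atexit(report_leaks);
            registered = 1;
        }
        void *p = malloc(n);
        if (p)
            live_allocs++;
        return p;
    }

    void debug_free(void *p)
    {
        if (p)
            live_allocs--;
        free(p);
    }

    /* in a project-wide debug header, something like:
           #define malloc(n)  debug_malloc(n)
           #define free(p)    debug_free(p)            */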

As an aside, GNU libc keeps (or at least used to keep; I haven't checked in
years) the bookkeeping used by malloc()/free() next to the blocks themselves,
which gives really bad behavior when freeing a large number of blocks that
have been pushed out to swap--you wind up bringing in pages in order to free
them because the memory manager's working set is the size of all allocated
memory. Years ago I wrote a replacement that avoided this just to speed up
Netscape's horrible performance when it re-sized the bdb1.85 databases it used
to track browser history. The browser would just "go away" thrashing the disk
for hours and killing it just returned you to a state where it would decide to
resize again an hour or so after a restart. Using LD_PRELOAD to use a malloc
that kept its bookkeeping away from the allocated blocks changed hours to
seconds.

------
sitkack
I appreciate that he had the foresight to install more RAM and configure more
swap. I would hate to be days into a transfer and have the OOM killer strike.

------
angry_octet
The difficulty is that you are using a filesystem hierarchy to 'copy files'
when you actually want to do a volume dump (block copy). Use XFS and xfsdump,
or ZFS and zfs send, to achieve this.

Copying with hard-link preservation is essentially like running dedupe, except
that you know ahead of time how many dupes there are. Dedupe is often very
memory intensive, and even well-thought-out implementations don't support
keeping bookkeeping structures on disk.

~~~
steveh73
"Normally I'd have copied/moved the files at block-level (eg. using dd or
pvmove), but suspecting bad blocks, I went for a file-level copy because then
I'd know which files contained the bad blocks."

~~~
angry_octet
I was simplifying... dump backs up inodes, not blocks. Some inodes point to
file data and some point to directory data. Hard links are references to the
same inode in multiple directory entries, so when you run xfsrestore, the link
count increments as the FS hierarchy is restored.

xfsdump/zfs send are file system aware, unlike dd, and can detect fs
corruption (ZFS especially having extensive checksums). In fact, any info cp
sees about corruption comes from the FS code parsing the FS tree.

However, except on zfs/btrfs, data block corruption will pass unnoticed. And
in my experience, when you have bad blocks, you have millions of them -- too
many to manually fix. As this causes a read hang, it is usually better to dd
copy the fs to a clean disk, set to replace bad blocks with zeros, then
fsck/xfs_repair when you mount, then xfsdump.

dd conv=noerror,sync,notrunc bs=512 if=/dev/disk of=diskimg

See Also: [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-
US...](http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-
repair.html)
[http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Me...](http://xfs.org/index.php/Reliable_Detection_and_Repair_of_Metadata_Corruption)

~~~
Rapzid
If the risk of keeping the system running while the array rebuilt was deemed
too high, I would have just gone with a dd/ddrescue of the remaining disks onto
new disks and then moved on from there.

+1 for mentioning ZFS. It's really quite amazing. Almost like futuristic alien
technology compared to the other freely available file systems.

------
minopret
In light of experience, would it perhaps be helpful after all to use a block-
level copy (such as Partclone, PartImage, or GNU ddrescue) and analyze later
which files have the bad blocks?

I see that the choice of a file-level copy was deliberate: "I'd have
copied/moved the files at block-level (eg. using dd or pvmove), but suspecting
bad blocks, I went for a file-level copy because then I'd know which files
contained the bad blocks."

~~~
fsniper
Also, the article doesn't mention how unrecoverable files were identified,
i.e. how errors from the cp operations were handled. With this many files,
that would not be feasible without an error log file.

So going with a simple block copy should suffice IMHO.

~~~
rbh42
I'm the OP, so I can shed a bit of light on that: Dell's support suggested a
file-level copy when I asked them what they recommended (but I'm not entirely
sure they understood the implications). Also, time was not a big issue.

I did keep a log file with the output from cp, and it clearly identified the
filenames for the inodes with bad blocks. Actually, I'm not sure how dd would
handle bad blocks.

~~~
fsniper
Thank you for the clarification.

I was about to bet on a "read fail repeat skip" cycle for dd's behaviour, but
looking into coreutils' source code at
[https://github.com/goj/coreutils/blob/master/src/dd.c](https://github.com/goj/coreutils/blob/master/src/dd.c),
if I'm not mistaken, dd does not try to be intelligent and just uses a
zeroed-out buffer, so it would return 0s for unreadable blocks.

------
IvyMike
Interesting.

In Windows-land, the default copy is pretty anemic, so probably most people
avoid it for serious work.

I'd probably use robocopy from the command line. And if I was being lazy, I'd
use the Teracopy GUI.

I think my limit for a single copy command has been around 4TB with robocopy--
and that was a bunch of large media files, instead of smaller more numerous
files. Maybe there's a limit I haven't hit.

~~~
noinsight
> Teracopy

I've used FastCopy for larger GUI-based transfers; it's open source and can
handle larger datasets well in my experience. It also doesn't choke on
>MAX_PATH paths. Haven't had problems with it. Supposedly it's the fastest
tool around...

The only slight issue is that the author is Japanese, so the English
translations aren't perfect, plus the comments in the source are in Japanese.

~~~
gizmo686
">MAX_PATH paths"

How does this happen?

~~~
xenadu02
Technical debt that keeps on giving.

Today there are N applications. "We can't increase MAX_PATH because it will
break existing applications!"

Tomorrow there are N+M applications. "We can't increase MAX_PATH because it
will break existing applications!"

Repeat forever.

Any time you are faced with a hard technical decision like this, the pain will
always be least if you make the change either:

1\. During another transition (e.g. 16-bit to 32-bit, or 32-bit to 64-bit).
Microsoft could have required all 64-bit Windows apps to adopt a larger
MAX_PATH, among other things.

2\. Right NOW, because there will never be an easier time to make the change.
The overall pain to all parties will only increase over time.

------
pmontra
Another lesson to be learnt is that it's nice to have the source code for the
tools we are using.

------
dredmorbius
The email states that file-based copy operations were used instead of dd due
to suspected block errors. Two questions come to mind:

1\. I've not used dd on failing media, so I'm not sure of the behavior. Will
it plow through a file with block-read failures or halt?

2\. There's the ddrescue utility, which _is_ specifically intended for reading
from unreliable storage. Seems that this could have offered another means for
addressing Rasmus's problem. It can also fill in additional data on multiple
runs across media, such that more complete restores might be achieved.
[https://www.gnu.org/software/ddrescue/ddrescue.html](https://www.gnu.org/software/ddrescue/ddrescue.html)

~~~
pflanze
OP said "I went for a file-level copy because then I'd know which files
contained the bad blocks". When you copy the block device with ddrescue (dd
doesn't have logic to work around the bad sectors and the only sensible action
for it is thus to stop, but don't take my word for it), the result will just
have zeroes in the places where bad blocks were, and, assuming the filesystem
structure is good enough (you should run fsck on it), will give you files with
zones of zeroes. But you won't know which files without either comparing them
to a backup (which you won't have by definition if you're trying to recover)
or using a program that verifies every file's structure (which won't exist for
the general case). Whereas cp will issue error messages with the path of the
file in question. So the OP's decision makes sense.

~~~
dredmorbius
I've played with ddrescue very lightly. From the GNU webpage linked above, it
appears it creates logfiles which can be examined:

 _Ddrescuelog is a tool that manipulates ddrescue logfiles, shows logfile
contents, converts logfiles to /from other formats, compares logfiles, tests
rescue status, and can delete a logfile if the rescue is done. Ddrescuelog
operations can be restricted to one or several parts of the logfile if the
domain setting options are used._

That might allow for identification of files with bad sectors.

~~~
pflanze
That would need either a hook to the kernel or a file system parser.

Even if you manage to do that, I'm not sure it would be a good idea to
continue to use a file system that has lost sectors, even after fsck. Are you
sure fsck is fixing any inconsistency? Are there any automatic procedures in
place that guarantee that the fsck algorithms are in sync with the actual file
system code? (Answer anew for any file system I might be using.) You
definitely should do backups by reading the actual files, not the underlying
device; perhaps in this case it could be OK (since it was a backup itself
already, hence a copy of live data; but then if OP bothered enough to recover
the files, maybe he'll bother enough to make sure they stay recovered?)

------
icedchai
For that many files I probably would've used rsync between local disks.
_shrug_

~~~
ajross
And hopefully you would have written up a similar essay on the oddball
experiences you had with rsync, which is even more stateful than cp and even
more likely to have odd interactions when used outside its comfort zone.

Ditto for tricks like: (cd $src; tar cf - .) | (cd $dst; tar xf -).

Pretty much nothing is going to work in an obvious way in a regime like this.
That's sort of the point of the article.

~~~
icedchai
Or maybe not. He mentions rsnapshot in the article, which uses rsync under the
hood. This implies rsync would have a _very_ good chance of handling a large
number of hardlinks... since it created them in the first place.

~~~
sophacles
That doesn't follow. If backups are for multiple machines to a big file
server, the backup machine will have a much larger set of files than those
that come from an individual machine. Further, each backup "image" compares
the directory for the previous backup to the current live system. Generally it
looks something like this:

1\. Initial backup or "full backup" \- copy the full targeted filesystem to
the time-indexed directory of the backup machine.

2\. Sequential backups:

a. on the backup machine, create a directory for the new time, create a mirror
directory structure of the previous time.

b. hard link the files in the new structure to those in the previous backup
(which may be links themselves, back to the last full backup).

c. rsync the files to the new backup directory. Anything that needs to be
transferred results in rsync transferring the file to a new directory, then
moving it into the proper place. This unlinks the filename from the previous
version and replaces it with the full version.

So yeah, the result of running this system over a few machines and a long
backup timeframe is way more links on the backup machine than any single
iteration of the backup will ever actually use.

~~~
icedchai
Yes, it has more links, I realize, but this still doesn't mean it wouldn't
work. Give it a shot and report back. (Hah.)

------
dspillett
_> The number of hard drives flashing red is not the same as the number of
hard drives with bad blocks._

This is the real take-away. Monitor your drives. At the very least enable
SMART, and also regularly run a read on the full underlying drive (SMART won't
see and log blocks that are on the way out and so need retries for successful reads,
unless you actually try to read those blocks).

That won't completely make you safe, but it'll greatly reduce the risk of
other drives failing during a rebuild by increasing the chance you get
advance warning that problems are building up.

~~~
rbh42
Glad someone noticed it (I'm the OP). Reading the drives systematically is
called "Patrol Read" and is often enabled by default, but you can tweak the
parameters.

------
mturmon
The later replies regarding the size of the data structures cp is using are
also worth reading. This is a case where pushing the command further can make
you think harder about the computations being done.

------
grondilu
On Unix, isn't it considered bad practice to use cp in order to copy a large
directory tree?

IIRC, the use of tar is recommended.

Something like:

    
    
        $ (cd $origin && tar cf - *) | (cd $destination && tar xvf - )

~~~
dmckeon
Use && there, not ; - consider the result if either of the cd commands fails.

~~~
grondilu
fixed

------
sauere
> While rebuilding, the replacement disk failed, and in the meantime another
> disk had also failed.

I feel the pain. I went thru the same hell a few months ago.

------
maaku
Another lesson: routinely scrub your RAID arrays.

~~~
jewel
On debian-based systems, /etc/cron.d/mdadm will already do this on the first
Sunday of the month.

------
0x0
I wonder how well rsync would have fared here.

~~~
sitkack
Rsync can die just from scanning the whole directory tree of files first.

~~~
chadcatlett
The incremental recursion option (enabled by default) introduced in rsync 3.0
greatly reduces the need to scan the whole directory structure up front.

------
ccleve
Maybe this is naive, but wouldn't it have made more sense to do a bunch of
smaller cp commands? Like sweep through the directory structure and do one cp
per directory? Or find some other way to limit the number of files copied per
command?

~~~
caf
No, because then it wouldn't have replicated the hardlink structure of the
original tree. That was the goal, and also the bit that causes the high
resource consumption.

------
Andys
A problem with cp (and rsync, tar, and Linux in general) is that there is
read-ahead within single files, but no read-ahead for the next file in the
directory, so it doesn't make full use of the available IOPS capacity.
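
One way an application could compensate is to hint the kernel about the next
file while the current one is still being copied, e.g. with
posix_fadvise(POSIX_FADV_WILLNEED). A rough sketch, where copy_file() is a
hypothetical helper:

    #include <fcntl.h>
    #include <unistd.h>

    /* hypothetical helper: copy the already-open fd to its destination */
    extern void copy_file(int fd, const char *path);

    /* copy a list of files, asking the kernel to prefetch the next one
       while the current one is being copied */
    void copy_with_prefetch(char **paths, int n)
    {
        if (n <= 0)
            return;

        int next_fd = open(paths[0], O_RDONLY);

        for (int i = 0; i < n; i++) {
            int fd = next_fd;

            if (i + 1 < n) {
                next_fd = open(paths[i + 1], O_RDONLY);
                if (next_fd != -1)
                    /* hint: the whole next file will be needed soon */
                    posix_fadvise(next_fd, 0, 0, POSIX_FADV_WILLNEED);
            }

            if (fd != -1) {
                copy_file(fd, paths[i]);
                close(fd);
            }
        }
    }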

------
davidu
This is not, not, not how one should be using RAID.

The math is clear that in sufficiently large disk systems, RAID5, RAID6, and
friends, are all insufficient.

~~~
lysium
Can you elaborate?

~~~
davidu
[http://www.zdnet.com/blog/storage/why-raid-5-stops-
working-i...](http://www.zdnet.com/blog/storage/why-raid-5-stops-working-
in-2009/162)

~~~
lysium
Thanks for the link! The article says that, due to the read error rate and the
size of today's disks, RAID 5 and RAID 6 have (kind of) lost their purpose.

~~~
davidu
Yep, mathematically no longer safe.

------
dbbolton
>We use XFS

Why?

~~~
cnvogel
I personally still consider XFS a very mature and reliable filesystem, both in
terms of utility programs and kernel implementation. If I remember correctly,
it was ported to Linux from SGI/IRIX, where it had been used for decades. It
also was the default fs for Red Hat/CentOS for a long time, so it might still
have stuck around at many shops.

Here's my anecdotal data point, on which I base my personal belief:

From about ten to six years ago, when I was doing sysadmin work at a
university, building storage systems from commodity parts for experimental
bulk data, we first had a load of unreliable early RAID/SATA(?) adapters, and
those made ext3 and reiserfs (I think...) oops the kernel when the on-disk
structure went bad. XFS, by contrast, just put an "XFS: remounted FS readonly
due to errors" message in the kernel logfile. That experience made XFS my
default filesystem up until recently, when I started to switch to btrfs. (Of
course, we fixed the hardware errors too... :-) )

Also, from that time, I got to use xfsdump/xfsrestore for backups and storage
of fs images, and they never once failed on me.

~~~
Eiriksmal
As a blithe new Linux user (3.5 years), I was bumfuzzled when I saw
RHEL/CentOS 7 switch from ext4 to XFS, figuring it to be some young upstart
stealing the crown from the king. Then I did some Googling and figured out
that XFS is as old as ext _2_! I'm looking forward to discovering how tools
like xfs* can make my life easier.

------
limaoscarjuliet
Rsync seems like a better tool for this. It can be run multiple times, and it
will just copy whatever is missing.

------
nraynaud
It reminds me of crash-only software.

------
gaius
I would probably have used tar|tar for this, or rsync.

~~~
thaumaturgy
You're right to recommend a tarpipe. I've had to copy several very large
BackupPC storage pools in the past, and a tarpipe is the most reliable way to
do it. (The only downside to BackupPC IMO...)

For future reference for other folks, the command would look something like
this:

    
    
        cd /old-directory && tar czvflpS - . | tar -C /new-directory -xzvf -
    

Tarpipes are especially neat because they can work well over ssh (make sure
you have ssh configured for passwordless login; any prompt at all will bone
the tarpipe):

    
    
        cd /old-directory && tar czvflpS - . | ssh -i /path/to/private-key user@host "tar -C /new-directory -xzvf -"
    

...but tarpipe-over-ssh is not very fast. I have a note that says, "36 hours
for 245G over a reasonable network" (probably 100Mb).

Disk-to-disk SATA or SAS without ssh in between would be significantly faster.

~~~
LeoPanthera
The prompt goes to stderr and the pipe only pipes stdout, so a prompt should
not cause excessive bonage, as long as you're there to respond to it.

Also, don't use -z locally, or even over a moderately fast network. The
compression is not that fast and almost always makes things slower.

~~~
thaumaturgy
Good to know!

Also, re: bonage, I agree that it "shouldn't", but it definitely did. From my
sysadmin notes file:

> The tar operation kicks off before ssh engages; having ssh ask for a
> password seems to intermittently cause problems with the tar headers on the
> receiving end. (It _shouldn't_, but it seems to.)

------
RexM
Is this where a new cp fork comes about called libracp?

------
brokentone
Feels like a similar situation to this:
[http://dis.4chan.org/read/prog/1109211978/21](http://dis.4chan.org/read/prog/1109211978/21)

------
lucb1e
> 20 years experience with various Unix variants

> I browsed the net for other peoples' experience with copying many files and
> quickly decided that cp would do the job nicely.

After 20 years you no longer google how to copy files.

Edit: Reading on, he talks about strace and even reading cp's source code,
which makes it even weirder that he had to google how to do this...

Edit2: Comments! It took only ten downvotes before someone bothered to explain
what I was doing wrong, but now there are three almost simultaneously. I guess
those make a few good points. I'd still think cp ought to handle just about
anything, especially given its ubiquity and age, but I see the point.

And to clarify: I'm not saying the author is stupid or anything. It's just
_weird_ to me that someone with that much experience would google something
which on the surface sounds so trivial, even at 40TB.

~~~
sitkack
Because the man is wise. He also didn't kill a job that appeared to be hung;
he started reading the code to figure out why, and determined that it would,
in fact, complete.

