We got an IEEE paper out of it:
A few people are continuing the concept to other tools -- that should be available at http://fileutils.io/ relatively soon.
We also had another tool written on top of https://github.com/hpc/libcircle that would gather metadata on a few hundred-million files in a few hours (we had to limit the speed so it wouldn't take down the filesystem). For a slimmed down version of that tool, take a look at https://github.com/hpc/libdftw
There is a provision in most file systems to use links (symlinks,
hardlinks, etc.). Links can cause cycles in the file tree, which
would result in a traversal algorithm going into an infinite loop.
To prevent this from happening, we ignore links in the file tree
during traversal. We note that the algorithms we propose in
the paper will duplicate effort proportional to the number of
hardlinks. However, in real world production systems, such as
in LANL (and others), for simplicity, the parallel filesystems
are generally not POSIX compliant, that is, they do not use
hard links, inodes, and symlinks. So, our assumption holds.
The reason this cp took such large amounts of time was the desire to preserve hardlinks and the resize of the hashtable used to track the device and inode of the source and destination files.
Check return value of malloc
You don't need \ when breaking function parameters
1. Calculate the hash of /sourcedir/some/path/to/file
2. Copy the file to /tempdir/$hash if it doesn't exist yet
3. Hard-link /destdir/some/path/to/file to /tempdir/$hash
4. Repeat until you run out of source files
5. Recursively delete /tempdir/
This should give you a faithful copy with all the hard-links with constant RAM at the cost of CPU to run all the hashing. If you're smart about doing steps 1 and 2 together it shouldn't require any additional I/O (ignoring the extra file metadata).
Edit: actually this won't recreate the same hardlink structure, it will deduplicate any identical files, which may not be what you want. Replacing the hashing with looking up the inode with stat() would actually do the right thing. And that would basically be an on-disk implementation of the hash table cp is setting up in memory.
> constant RAM
Well, definitely not constant space. The only reason for it to use less RAM than the hash table is if it uses disk instead of RAM, and that's what happened in op's post, too, by way of swap.
You're trading a hash table implemented in user-space for a hash table or b-tree or similar implemented in kernel-space.
The advantage of the user-space solution is that you can optimize it for the job at hand. A trivial implementation would use a binary-encoded inode+device number pair (probably 2*8 bytes) and a pointer to the first path encountered (8 bytes), where the path is only accessed in matching cases. In fact, 17e9/432e6=39.35 B/file, pretty spot on. Seems like there's some room for optimization.
The kernel would need to store the file name string in addition to its hash or b-tree entry. This is unlikely to use less memory. Since it will presumably be randomly accessed, reading back from disk will not be more efficient than swapping either.
Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1. Since if most files only have a link count of 1, then cp has no reason to hold them in the hash table, as they can't be encountered again. Unless it does so anyway for the case where hardlinks are being created while cp is running. If so, it should probably have a command line flag to turn this safety feature off.
It does not suggest there is a possible optimization. cp needs to store the pointer and the path itself in memory, so it is more than 8 bytes. ~39 bytes/file means the average path length was 15 bytes (8-byte source device + 8-byte source inode + 8-byte pointer + ~15-byte path = ~39 bytes). I guess one theoretical optimization would be to add to Linux a system call to open a file by device+inode number. So the pointer+path would be replaced by a fixed 8-byte dest device + 8-byte dest inode (so 32 bytes/file). Or since the destination is only one filesystem in this case, you would just need 8-byte source device + 8-byte source inode + 8-byte dest inode (24 bytes/file). In this latter case, the 17 GB hash table would have shrunk to 10 GB, which could have help completely avoid the swap in the OP's case.
Edit: correct, I do not account for unused space in the hash table, for simplicity. So read all my numbers assuming a ±10-30% range. However it seems that the OP who quoted 432e6 files meant 432e6 file names, not inodes. So if the average inode had, say, 4.3 hardlinks (IOW 4.3 names), it means the 17GB hashtable only had 100e6 files in it, so 170 bytes/file, so 170-8-8-8 = 146-byte pathnames on average (±10-30%), with sounds more reasonable.
But your calculation is missing the empty space in the hash table. When using open addressing (with the table being a flat array), there must be something like 10-30% empty slots (and in any case, twice the space during rehashing). That would leave just about 39.35 - 8 * 3 / 0.8 = 9.35 bytes per path. So, likely cp already optimizes by using per-device hash tables at least, then 39.35 - 8 * 2 / 0.8 = 19.35 bytes will remain for the path. The pointer could be shortened to 5 or perhaps even 4 bytes. Paths could be stored as an 'inverted' tree (linked lists of segments while sharing equal tails), which should be very natural since cp gets them piecewise from readdir anyway, so the 19.35 bytes just need to describe a parent path pointer (5 bytes?) plus some shared storage and the space for the filename.
Probably there are better hash table algorithms for low memory use, too. (Using buckets in arrays (instead of linked lists as is often done), and resize those on every append, trading CPU for space?) I wonder if there's a way to avoid temporary doubling of the hash table storage during resizing.
Edit: I've checked the GNU coreutils-8.23 sources, oddly cp is not using any of those optimizations; it uses both dev and ino as a key, concatenates every filename to its parent path and stores that (and a normal pointer in the table), and uses linked lists for the buckets in its hash tables.
Edit 2: a different solution that should really lower RAM requirements a lot would be to separate the walking of the source tree from the copying. When walking the source tree, just create a list of dev/ino/path; then sort this list according to dev/ino (there are optimized sort algorithms that keep RAM usage small, I remember them being in Knuth's TAOCP); then walk the list, while linking all the paths with the same dev/ino which you will now encounter in a group.
See "megacopy" in my top level comment.
Yeah I mentioned it was just an on-disk implementation of cp's in-memory hash table.
>Now what I'm wondering is whether the OP really had a filesystem with basically every file having a hard link count >1
Almost certainly. It was a backup machine with rsnapshot. Most files are just hardlinks to the same file from the previous backup. For his specific case it would probably be easier to code up a domain-specific solution. He probably knows the hard-linked files are all in a regular form like /basepath/YYYY-MM-DD/path/to/file. So he could just iterate /basepath/*/path/to/file and do the correct copy/hardlinks on the destination.
For further reference here's a modified setup that won't use any resources that wouldn't be used anyway:
1. stat() /sourcedir/some/path/to/file and find out the hard-link count and the inode number
2. if the hardlink count is 1 copy to destination and this file is done (go to 7)
3. if the hardlink count is >1 copy /srcdir/path/to/file to /tempdir/$inodenumber if it doesn't exist yet
4. stat() /tempdir/$inodenumber and find out the destination hardlink count
5. if the hardlink count is the same as the source hardlink count move /tempdir/$inodenumber to destination (we're creating the last hard-link for this file)
6. if the hardlink count is less than source count hard-link /tempdir/$inodenumber to destination
7. Repeat until you run out of source files
8. Recursively remove /tempdir/ (if there are any files left there it means there were hardlinks to files in /sourcedir/ from other parts of the filesystem)
By doing this we're using the destination filesystem to keep the map for only for the files with hard-link > 1 and doing it without using any more resources than will be used by the end of the copy as long as there aren't any hardlinks to the same files outside /sourcedir. If there are we'll be using an additional hard-link for each of them by the end of the copy.
The reason I prefer this to hand-rolling a solution is that it reuses well tested code (hard-links in the filesystem) and more importantly doesn't introduce any resource limits that aren't native to cp anyway, such such as having enough memory/swap to hold the map. Basically the contract becomes "if the copy can fit in the destination filesystem, cp will complete".
That's the assumption anyway (cp already works that way); we're talking about this only because the OP had hard links everywhere.
That's what I was talking about in my former post. This does use resources: the folder "tempdir" takes up additional space (it is in addition to the file's content, inode and the other paths). It either needs a b-tree, hash-table or similar in kernel space (on disk, and cached in RAM). Every entry needs to store $inodenumber (which you should mean to be the inode number of the source file, not the target, because you need to map from the source inode, so you can't even argue that it is stored anyway already), plus the storage in the index (b-tree or whatever that makes up the directory listing), just like the user-space solution, but additionally at least also the length of the file name and the actual inode number of the target file.
You're not better off (almost surely worse), and you can't get around the fact that the basic algorithm, which is copying every file while walking through the source tree and checking it immediately, is what is at fault: it needs random access (every iteration needs a look up to a random spot), which is why RAM is required and any kind of disk access (be it swap, or just reloading directory index storage from the file system) is out.
That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now, BTW. It has a --tmp option to give a non-default location for temporary file storage (used for the linking table), you can point that to a place on the target file system or wherever else you like, which I think should answer your call for resource limitation. How well that algorithm (sorting (and linear append and read) instead of random access) works with regards to RAM usage, I still have to test.
See what I did again. I'm only creating /tempdir/$inodenumber if the hard-link count is >1 and once the destination hard-link count reaches the same count as the source it gets moved. So if there aren't any hard-links outside sourcedir this doesn't use more resources than you're going to use anyway at the end of the copy.
> That's why I suggest to use a different algorithm. See my post about "megacopy". That should already be usable by now
If I'm reading it right it requires doing a sort of the full file list. This means you need to use enough space to list all the files and their inode number even if there are no hardlinks. Your disk usage is O(total files) whereas mine is O(inodes with hardlinks) and thus is always lower.
Which (again) is always, in the case of the OP.
> and once the destination hard-link count reaches the same count as the source it gets moved
Yes I've thought of that optimization; I haven't checked whether cp does it already, might well be.
Thing is the OP said he's got an nlinks of about 20. This means, if they are evenly spread out over the file system, that you see the last link when you're about 95% through. That won't help much.
> If I'm reading it right it requires doing a sort of the full file list.
> This means you need to use enough space to list all the files and their inode number even if there are no hardlinks.
My algorithm works with the nlink==1 optimization just as well: just don't put those files into the file list that needs to be sorted, instead copy them immediately or put them into a list that doesn't need sorting.
And, the whole point of using the sort approach is not to make the list or index smaller (that's simply not possible). The point is that it works with lower RAM requirements. Because sorting works well with serial access and little random access, using a list much bigger than RAM works well.
PS. I wonder how much smaller the table gets when you always have nlinks==2 (the best case for the "done with the inode" optimization). The size of the table will follow some curve over the duration of the program; I'm not sure what it is exactly, perhaps a sine half wave? I expect the max of the curve to be about half way of the max of the unoptimized variant.
Is it a pity to lose the possibility for that optimization by using my algorithm? Disk space is cheaper than RAM. Dropping most of the RAM usage and instead using twice as much in the form of disk still seems worthwhile.
The "done with the inode" optimization will work better if hard linked paths appear close to each other. So perhaps sometimes still useful. It should be possible to come up with a hybrid: go with the hash table until it reaches a certain size; then convert the hash table into the start of the list, and continue by appending to the list. I'm not sure this is worth the more complex programming, though. If it looks nice just to avoid creation of a temp file in the common case, then a less complex alternative may be to store the list in the heap until a certain size (only then dump to file), otherwise still use my algorithm. That way swapping will behave nicer, too, in case part of the cp heap needs to be swapped out because of other system activity.
PPS. note that we're assuming that all the hard links are within the source tree that's being copied. If the files also have a path outside the source path, then the optimization won't work as you never reach zero.
If you do that then your extra storage requirements are now O(paths to inodes with nlinks > 1) which is still strictly higher than O(inodes with nlinks>1). You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking, and doing an extra operation (the sort). I'd be very surprised if that was faster, and unless you have an on-disk sort it will still be RAM+swap limited instead of just disk-space limited.
Yes, that's right, I ignored that part in the calculation above.
By using the filename+parent pointer approach for path compression, that could be improved. (The current version of megacp also stores all necessary stat information (size, uid, gid, perms) in the tempfile to avoid a second stat call; this could be dropped if it makes matters worse.)
Also, quite likely using a fast compression algorithm like lzop on the file would make disk access faster and disk storage requirements lower.
> You're also doing two passes, one to enumerate the files, another to do the copying and hardlinking
Right now I'm doing exactly the same number of system calls as the cp approach needs. The second pass doesn't do readdir or, currently, even stat on the source file (that's why I saved the stat info above; which is more beneficial would need to be tested). It may be beneficial to sort the sorted-by-inode file back into sorted-by-path after the inode cluster recognition step to get original path locality during the copy phase (the included test/genfiles script already does that); I'll have to test that.
> unless you have an on-disk sort it will still be RAM+swap limited
Using an online algorithm is the whole point, I've been saying that from the start. And I remember that Knuth wrote about them so I conclude that they exist for sort, and I quite expect that GNU sort(1) is already sorting big files efficiently (limited testing seems to confirm that, files several times the size of RAM sort just fine in the low single digit minute count).
> I'd be very surprised if that was faster
You're ignoring how much slower disk seek is compared to serial reading/writing. A lot of similar optimizations are done in OSes, databases etc. for that reason. I guess you'll argue next that SSDs don't have such high seek times (and that everybody should be using them), and while those points are right SSDs are still a lot slower for random versus serial access, so whether the overhead of the sorting approach is warranted on SSDs remains to be tested. In any case, I very much expect it to be faster in the OP's scenario which was with spinning disks (note how it took him many days; I expect the sort for the same data to take few hours or quite likely less than an hour given his RAID array).
>Using an online algorithm is the whole point, I've been saying that from the start. And I remember that Knuth wrote about them so I conclude that they exist for sort, and I quite expect that GNU sort(1) is already sorting big files efficiently (limited testing seems to confirm that, files several times the size of RAM sort just fine in the low single digit minute count).
The cp algorithm already solves this just fine and my proposed reimplementation on-disk solves the RAM usage problem. You're trying to make a strictly worse algorithm work. It both requires more storage and is heavier than it needs to be (full sort vs just a map). What do you expect to gain from it?
Plan9's default filesystem is made up of two parts: Fossil, and Venti.
- Fossil is a content-addressable on-disk object store. Picture a disk "formatted as" an S3 bucket, where the keys are strictly the SHAsums of the values.
- Venti is a persistent graph database that holds what would today be called "inode metadata." It presents itself as a regular hierarchical filesystem. The "content" property of an inode simply holds a symbolic path, usually to an object in a mounted Fossil "bucket."
When you write to Venti, it writes the object to its configured Fossil bucket, then creates an inode pointing to that key in that bucket. If the key already existed in Fossil, though, Fossil just returns the write as successful immediately, and Venti gets on with creating the inode.
Honestly, I'm terribly confused why all filesystems haven't been broken into these two easily-separable layers. (Microsoft attempted this with WinFS, but mysteriously failed.) Is it just inertia? Why are we still creating new filesystems (e.g. btrfs) that don't follow this design?
Venti doesn't know about files, it only knows about blocks of data. It's a key/value store where the key is sha-1 and the value is a block(blob) of data.
The filesystem running on top of Venti will ask Venti to store the new block where you changed the byte and the filesystem will update the metadata that assembles all blocks to a file.
The penalty for doing content addressed filesystems is of course the CPU usage. btrfs probably has most of the benefits without the CPU cost with its copy-on-write semantics.
Note that what you describe (and my initial process) is a different semantic than hard-links. What you get is shared storage but if you write to one of the files only that one gets changed. Whereas with hardlinks both files change.
Venti-on-Venti-on-Fossil would work, though, since Venti just creates imaginary files that inherit their IO semantics from their underlying store, and this should apply recursively:
1. create two nodes A and B in Venti that refer to one node C in Venti, which refers to object[x] with key x in Fossil.
2. Append to A in Venti, causing a write to C in Venti, causing a write to object[x] Fossil, creating object[y] with key y.
3. Fossil returns y to Venti; Venti updates C to point to object[y] and returns C to Venti; Venti sees that C is unchanged and does nothing.
Now A and B both effectively point to object[y].
(Note that you don't actually have to have two Venti servers for this! There's nothing stopping you from having Venti nodes that refer to other Venti nodes within the same projected filesystem--but since you're exposing these nodes to the user, your get the "dangers" of symbolic links, where e.g. moving them breaks the things that point to them. For IO operations they have the semantics of hard links, though, instead of needing to be special-cased by filesystem-operating syscalls.)
They are. Pretty much all operating systems provides a block layer, on top of which filesystems are normally layered.
Though the block layer isn't content addressed, it's just indexed (store block number 55, read block number 9. etc.)
Be careful not to create too many files in /tempdir (create a hierarchy of subdirs). Filesystems slow way down when they get "too many" files in one dir.
But take a good look at a tarpipe before applying 'cp' to such a large set of files.
Yeah, you probably want something like /tempdir/srcfilesystem/NN/NN/NN/NN to handle the issue you mention and recursive copies that cross filesystem boundaries. Besides that tempdir will need to be on the same filesystem as destdir so the hard copy will work, and if destdir has been setup as a multiple filesystem target you'll need to keep a tempdir instance for each destination filesystem. Plenty of corner cases to think about.
I don't think that's an acceptable perfomance penalty in most situations.
A wonderful trend I've noticed in Free/Open Source software lately is proudly claiming that a program is "Valgrind clean." It's a decent indication that the program won't doing anything silly with memory during normal use, like leak it. (There's also a notable upswing in the number of projects using static analyzers on their code and fixing legitimate problems that turn up, which is great, too!)
While you can certainly just let the OS reclaim all of your process's allocated memory at exit time, you're technically (though intentionally) leaking memory. When it becomes too hard to separate the intentional leaks from the unintentional leaks, I'd wager most programmers will just stop looking at the Valgrind reports. (I suppose you could wrap free() calls in "#ifdef DEBUG ... #endif" blocks and only run Valgrind on debug builds, but that seems ugly.)
A more elegant solution is to use an arena/region/zone allocator and place potentially large data structures (like cp's hard link/inode table) entirely in their own arenas. When the time comes to destroy one of these data structures, you can destroy its arena with a single function call instead of walking the data structure and free()ing it piece by piece.
Unfortunately, like a lot of useful plumbing, there isn't a standard API for arena allocators, so actually doing this in a cross-platform way is painful:
• Windows lets you create multiple heaps and allocate/free memory in them (HeapCreate(), HeapDestroy(), HeapAlloc(), HeapFree(), etc.).
• OS X and iOS come with a zone allocator (malloc_create_zone(), malloc_destroy_zone(), malloc_zone_malloc(), malloc_zone_free(), etc.).
• glibc doesn't have a user-facing way to create/destroy arenas (though it uses arenas internally), so you're stuck using a third-party allocator on Linux to get arena support.
• IRIX used to come with an arena allocator (acreate(), adelete(), amalloc(), afree(), etc.), so if you're still developing on an SGI Octane because you can't get enough of that sexy terminal font, you're good to go.
(That is, assuming that you don't need portability to odd systems that don't actually free memory on process exit.)
In a multithreaded application in which threads
simultaneously allocate and free memory, there could be contention for these mutexes. To scalably handle memory allocation in multithreaded applications, glibc creates additional memory allocation arenas if mutex contention is detected. Each arena is a large region of memory that is internally allocated by the system
(using brk(2) or mmap(2)), and managed with its own mutexes.
In the limit, you could make one arena for your entire program and then deallocate it on process exit (essentially what the OS gives you by default). Using smaller and smaller arenas lets you see that at a finer and finer granularity. In the other limit you're back to the per-object allocations.
Using an arena to avoid per-object leak checking is good if you're absolutely sure that there are no leaks in that block of code. You're essentially hiding any leaks within the arena from valgrind.
But that seems dangerous, because it makes it harder to use valgrind for what it's good at: finding leaks that you're unaware of, especially in code that you 'know' is good.
One of the points of an arena allocator is to not have to be clean at object level because everything is intended to be freed anyway in the end. For example, one HTTP request or one copy operation fit perfectly for an arena allocator: the system is allowed to "leak" because it's known that the operation won't last that long within the lifetime of the process and all memory related to the operation won't be needed later. (Or cross-operation objects could be allocated from a different pool.)
The problem is that once a drive fails, during the rebuild, if any of the surviving drives experience an unrecoverable read error (URE), the entire array will fail. On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%. Enterprise SAS drives are typically rated 1 URE in 10^15, so you improve your chances ten-fold. Still an avoidable risk.
RAID6 suffers from the same fundamental flaw as RAID5, but the probability of complete array failure is pushed back one level, making RAID6 with enterprise SAS drives possibly acceptable in some cases, for now (until hard drive capacities get larger).
I no longer use parity RAID. Always RAID10 . If a customer insists on RAID5, I tell them they can hire someone else, and I am prepared to walk away.
I haven't even touched on the ridiculous cases where it takes RAID5 arrays weeks or months to rebuild, while an entire company limps inefficiently along. When productivity suffers company-wide, the decision makers wish they had paid the tiny price for a few extra disks to do RAID10.
In the article, he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives (Dell calls them "near-line SAS", just an enterprise SATA), there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.
In reality you'll find that drives perform significantly better than that, arguably orders of magnitude better.
That said, I'm still not a fan of RAID5. Both rebuild speed and probability of batch failures of drives (if one drive fails, the probability of another drive failing is significantly higher if they came from the same batch), make it too risky a prospect for my comfort.
If you have 4x 4TB drives in RAID5, and one fails, then in order to rebuild with a replacement drive, you have to read all data from all surviving drives (3 x 4TB = 12TB).
Here is an example from a manufacturer:
They call it "Non-recoverable Read Errors per Bits Read" and then list "1 sector per 10^15". So for every 10^15 bits read they expect 1 sector to be unreadable.
The same is true for a disk array. Each read operation is an independent event (for the purpose of doing this math). The chance of one URE happening is 1/(10^14) for every bit read. It doesn't matter which disk it happens on. When it happens, the entire array is failed.
Also 12.5 TB is not a hard limit, just an average. The URE could happen on the very first read operation, or you might read 100 TB without a URE.
>On consumer-grade SATA drives that have a URE rate of 1 in 10^14, that means if the data on the surviving drives totals 12TB, the probability of the array failing rebuild is close to 100%.
10^14 bits is 12.5 TB, so on average, the chance of 12TB being read without a single URE is very low, and the probability the array fails to rebuild is close to 100%. I was estimating 10^14 bits to be about 12TB, so the probability is actually 12/12.5 = 96% chance of failure.
>...he has 12x 4TB drives. Once two drives failed, assuming he is using enterprise drives...there is a 33% chance the entire array fails if he tries to rebuild. If the drives are plain SATA, there is almost no chance the array completes a rebuild.
A RAID6 with two failed drives is effectively the same situation as a RAID5 with one failed drive. In order to rebuild one failed drive, the RAID controller must read all data from every surviving drive to recreate the failed drive. In this case, there are 10x 4TB surviving drives, meaning 40TB of data must be read to rebuild. Because these drives are presumably enterprise quality, I am assuming they are rated to fail reading one sector for every 10^15 bits read (10^15 bits = 125 TB). So it's actually 40/125 = 32% chance of failure if you try to rebuild.
The main reason is not only the rebuild time (which indeed is horrible for a loaded system), but also the performance characteristics that can be terrible if you do a lot of writing.
We for example more than doubled the throughput of Cassandra by dropping RAID5. So the saving of 2 disks per server in the end required buying almost twice as many servers.
Then, you need to come up with other creative solutions, like deep syncing inside the FS tree, etc. Fun times.
Add checksumming, and you probably would like to take a holiday whilst it copies the data :-)
rsync --whole-file --ignore-times
that should turn off the metadata checks and the rsync block checksum algorithm entirely and transfer all of the bits at the source to the dest without any rsync CPU penalty.
also for this purpose it looks like -H is also required to preserve hard links which the man page notes:
"Note that -a does not preserve hardlinks, because finding multiply-linked files is expensive. You must separately specify -H."
be mildly interesting to see a speed test between rsync with these options and cp.
there are also utilities out there to "post-process" and re-hardlink everything that is identical, so that a fast copy without preserving hardlinks and then a slow de-duplication step would get you to the same endpoint, but at an obvious expense.
And the critical aspect from a performance perspective was where the hash table became too large to fit into memory. Performance of all sorts goes pear-shaped when that happens.
rsync is pretty incredible, but, well, quantity has a quality all its own, and the scale involved here (plus the possible in-process disk failure) likely wasn't helping much.
I've had good luck with the -H option, though it is slower than without the option. I have never copied a filesystem with nearly as many files as the OP with -H; the most I've done is probably a couple million, consisting of 20 or so hardlinked snapshots with probably 100,000 or 200,000 files each. rsync -H works fine for that kind of workload.
I regularly rsync machines with millions of files. On a slow portable 2.5" 25MB/sec USB2 connection it's never taken me more than 1hr on completely cold caches to verify that no file needs to be copied. With caches being hot, it's a few minutes. And on faster drives it's faster still.
Unless you are doing something weird and force it to checksum, checksumming mostly kicks in on files that have actually changed and can be transferred efficiently in parts (i.e. network; NOT on local disks). In other cases, it's just a straight copy.
Have you actually used rsync in the setting you describe, or at least read its manual?
- In this scenario, the receiving side is empty, so there is no need to check every single file.
- Since 3.0, rsync walks the dirtree while copying (unless you use some special options). So, in some sense, rsync is already "deep syncing inside the FS tree", as you put it.
Some of the other features in rsync would seem to make it a more logical fit for this task anyway, e.g. it's resumable, it supports on-the-fly compression (though the value of that depends on the types of files being transferred), native logging etc etc.
CPU is cheap.
Disks were connected via SATA 3 and was using mdadm
There are a few ways to improve RAID rebuild time (see http://www.cyberciti.biz/tips/linux-raid-increase-resync-reb... for example) but be aware that they tend to negatively affect performance of other use of the array while the build is happening (that is, you are essentially telling the kernel to give more priority to the rebuild process than it normally would when there are other processes accessing the data too).
Edit: To clarify, I was using mdadm, not hardware raid.
Resilvering one drive usually takes < 48 hours. Last time took ~32 hours.
Something else cool:
When I was planning this out I wrote a little script to calculate the chance of failure.
With a drive APF of 10% (which is pretty high), and a rebuild speed of 50MB/sec (very low compared to what I typically get) I have a 1% chance of losing the entire pool over a 46.6 year period. If I add 4 more raidz2 10X3TB VDEVS that would drop to 3.75 years.
I'm always mystified at how stupid our storage systems are. Even very expensive SAN solutions from EMC and the like area just... stupid. We've got loads of metrics on every drive, but figuring out that those things should be aggregated and subjected to statistical analysis just seems to have not been done yet.
What I really want is a "pasture" system - a place I can stick old drives of totally random sizes and performance characteristics and have the system figure out where to put the data in order to maintain a specific level of reliability. Preferrably backed by an online database that tracks the drive failure rate of every drive on every deployed system, noting patterns in 'bad batches' of certain models and the like. If one of my drives would have to beat 3-standard-deviations odds to survive for the next week, move the damn data to somewhere better. And if you've got 2 150GB drives and 1 300GB drive, then each block on those drives has a rating of 2.0 - adjusted for the age and metrics of the drive.
Oh well, maybe when I retire in 30 years storage systems will still be as stupid as they've remained for the past 30 years and I'll have another project to add to the pile I don't have time to work on now.
Treating drives differently based on their expected failure rates seems like it would only matter for very large arrays trying to keep the redundancy as low as possible.
three cost levers: capacity, speed, and downtime.
basically, at some point you will begin to realize the cheapest cost is the capacity itself, everything else is an order of magnitude more difficult to justify. downtime and slowness are simply unacceptable, whereas buying more stuff is perfectly acceptable and quite feasible solution in an enterprise environment.
I work with a number of medium sized businesses of which you can clearly tell who have had and who havent had major downtime due to these types of hardware failures.
Many people who have never experienced issues have no interest in even forming a basic DR plan, but the ones who have just start throwing money at you the moment you mention it.
Buying more stuff is WAY cheaper than slowness and downtime, and penny pinching on this count is going to cause so much additional heartache for what will in the end cost more anyway.
Even this site, honestly, is less than easy to deal with on a recurring basis. (Consider, hard to remember which was the top story three days ago at noon.) Specifically, sometimes I lose a story because I refresh and something plummeted off the page. Hard to have any idea how far to "scroll back" to see it.
I remember when it was "new-fangled" (and DHTML anyone?) and everyone put those annoying cursor trackers that trailed blobs from where your cursor was.
What I mainly remember is two things: suddenly everyone had to do client-side form checking to prevent people from typing dashes in a phone number field, and everyone tried to imitate Java without actually using Java. It seems like any way you could enhance a page by combining mouseover events, gifs and loading content dynamically you would throw into a page, even if it made the user experience a total mess. Which of course never happens today.
My favorite memory is helping to write some of the first "griefers" using JS. If you were unlucky enough to stumble onto our page you had a window literally jumping around your screen with pop-ups til the end of time. There was no way to exit the browser until you'd clicked it all and it'd never end. Just getting your mouse near the 'X' would make the window jump around again. Of course, that was a mild annoyance compared to loading fonts and images so big they'd consume all the available RAM and make the machine fall over from swapping in about 15 seconds.
I remember those myriads of mouseover buttons.
(cd $SOURCE && tar cf - .) | (mkdir -p $DEST && cd $DEST && tar xf -)
(mkdir -p $DEST && cd $SOURCE && pax -rw . $DEST)
I wonder if using something other than a hash table would be worthwhile for cases like the OP though, or if there is a tool that has that option -- the memory overhead of empty buckets and resizing seem like serious downsides when you're trying to handle as many files as the OP.
The program stupidly calls "cp" right now for every individual file copy (not the hard linking), just to get the script done quickly, it's easy to replace that with something that saves the fork/exec overhead; even so, it might be faster than the swapping hash table if the swap is on a spinning disk. Also read the notes in the --help text. I.e. this is a work in progress as a basis to test the idea, it will be easy to round off the corners if there's interest.
PS. the idea of this is to make copying work well with the given situation on a single machine, unless the approach taken by the dcp program mentioned by fintler which seems to rely on a cluster of machines.
There may also be some more discussion about this on the mailing list: http://lists.gnu.org/archive/html/coreutils/2014-09/msg00013...
What about doing something with find/xargs/i-dunno to copy all the files, but break em into batches so you aren't asking cp to do it's bookkeeping for so many files in one process? Would that work better? Or worse in other ways?
The main issue is that there's no api to get the list of files hard linked together: the only way is to check all the existing files and compare inodes. If you're doing a plain copy over 2 fs, you cannot choose which number the target inode will be, so you need to keep a map between inode numbers, or between inodes and file names ("cp" does the later).
1. Copy each file from the source volume to a single directory (e.g. /tmp) on the target volume, named for the source volume inode number.
(edit: I suggest using a hierarchy of dirs to avoid the "too many dentry's" slowdown)
2. If the file has already been copied, it will already exist in /tmp - looking up the inode is a vanilla directory lookup
3. Create a hard link from /tmp to the actual path of the file
4. When all the files have been created on the target volume, delete the inode numbers in /tmp
I wonder if any modern filesystems (xfs/ext4/btrfs/zfs) can actually enable tracking this.
I forget what the api for that looks like if I ever knew. Might be private.
I am surprised, usually Linux is way ahead of Windows on shiny filesystem stuff.
Everyone on Linux still uses tar for god's sake, even though zip can use the same compression algorithms people use on tarballs, and zip actually stores an index of its files rather than 'cat'ing each record on top of the next like an append-only tape archive. (Obviously there are better formats than 'zip' for any platform, but it's just strange that nobody has moved away from tar)
And it doesn't help that tar.gz / tar.bz2 compresses way better than zip in most cases (thanks to using a single compression context, rather than a new one for each file; and also compressing the filenames in the same context), and that it carries ownership and permission information with it - whereas zip doesn't.
The HVSC project, who try to collect every single piece of music ever created on a Commodore C64, distribute their archive as a zip-within-a-zip. The common music file is 1k-4k, goes down to ~500-1000 bytes zipped; The subdirectory+filename are often 100 bytes with a lot of redundancy that zip doesn't use, so they re-zip. Had they used .tar.gz or .tar.bz2, the second stage would not be needed.
As far as I know NTFS does not store what hard links point to a file.
Does anyone know what the 'tear down' part is about? If it's about erasing the hashtable from memory, what takes so long? I would expect that to be very fast: you don't have to write zeros to it all, you just tell your GC or memory manager to mark it as free.
for (bucket = table->bucket; bucket < table->bucket_limit; bucket++)
for (cursor = bucket->next; cursor; cursor = next)
next = cursor->next;
Advances coming from the combination of more advanced operating systems and CPU features like MMUs have made this a non-issue in most cases (can still be an issue on embedded, etc).
It's seen as good C programming practice to free() your malloc()s, though, and it makes extending programs easier if you have that functionality, since what was previously the end of program can be wrapped in a higher-level loop without leaking memory. But if you really are exiting for sure, you don't have to make the final free-memory call. It can also be faster to not do any intermediate deallocations either: just leave everything for the one big final deallocation, as a kind of poor-man's version of one-generation generational GC. Nonetheless many C programmers see it somehow as a bit unclean not to deallocate properly. Arguably it does make some kind of errors more likely if you don't, e.g. if you have cleanup that needs to be done that the OS doesn't do automatically, you now have different kinds of cleanup routines for the end-of-process vs. not-end-of-process case.
As an aside, GNU libc keeps ( or at least used to keep, I haven't checked in years ) the pointers used by malloc()/free() next to the blocks themselves, which gives really bad behavior when freeing a large number of blocks that have been pushed out to swap--you wind up bringing in pages in order to free them because the memory manager's working set is the size of all allocated memory. Years ago I wrote a replacement that avoided this just to speed up Netscape's horrible performance when it re-sized the bdb1.85 databases it used to track browser history. The browser would just "go away" thrashing the disk for hours and killing it just returned you to a state where it would decide to resize again an hour or so after a restart. Using LD_PRELOAD to use a malloc that kept it's bookkeeping away from the allocated blocks changed hours to seconds.
Obviously many programmers do not think to (or cannot easily) test their code with 40 TBs of data.
If you wrote cp in such a way that it's a reusable C library that other programs might use, then I can understand why they do this. It still doesn't make much sense for the command-line version to call this. Of course, it might run on some weird platform where the OS doesn't reclaim the memory of a process on exit(?).
Once your hash table and its linked lists have been swapped to disk, each pointer dereference causes a random disk access, which are orders of magnitude slower than RAM.
His patch just skips that step and ends the program.
> And unless old systems without working memory management must be supported, I don't see any harm in simply removing the call to the forget_all function towards the end of cp.c.
This is exactly what happens when your process exits and its private memory space is freed in its entirety. Hence the discussion. There's no need to clean up before exit, except if you want to (i) use a leak checker; or (ii) run on a machine with no MMU.
So what Linux supports and doesn't support in terms of memory management really has no bearing on what GNU cp can or cannot do, in general.
[...] Note that -a does not preserve hardlinks, because
finding multiply-linked files is expensive. You must
separately specify -H.
This tells rsync to look for hard-linked files in the source and
link together the corresponding files on the destination. Without
this option, hard-linked files in the source are treated as though
they were separate files. [...]
For future reference for other folks, the command would look something like this:
cd /old-directory && tar czvflpS - . | tar -C /new-directory -xzvf -
cd /old-directory && tar czvflpS - . | ssh -i /path/to/private-key user@host "tar -C /new-directory -xzvf -"
Disk-to-disk SATA or SAS without ssh in between would be significantly faster.
Also, don't use -z locally, or even over a moderately fast network. The compression is not that fast and almost always makes things slower.
Also, re: bonage, I agree that it "shouldn't", but it definitely did. From my sysadmin notes file:
> The tar operation kicks off before ssh engages; having ssh ask for a password seems to intermittently cause problems with the tar headers on the receiving end. (It _shouldn't_, but it seems to.)
Rsync at least acts reasonably idempotent, so you can just run it again if it gets interrupted, which is usually why I use it for large copies.
I don't remember off the top of my head if tar handles hard links—it may be inappropriate for this usage.
--partial keep partially transferred files
--bwlimit=KBPS limit I/O bandwidth; KBytes per second
Basically the same as tar's standard usage, except you don't store that stream as a tar file/image anywhere.
The only disadvantage of tar is its peculiar syntax, but that's something you get used to.
rsync is a bit of a kitchen sink, and cp is too simple. cpio is just the thing, and was designed for this exact use case.
Copy with hard link preservation is essentially like running dedupe except that you know ahead of time how many dupes there are. Dedupe is often very memory intensive, and even well thought out implementations don't support keeping book keeping structures on disk.
xfsdump/zfs send are file system aware, unlike dd, and can detect fs corruption (ZFS especially having extensive checksums). In fact, any info cp sees about corruption comes from the FS code parsing the FS tree.
However, except on zfs/btrfs, data block corruption will pass unnoticed. And in my experience, when you have bad blocks, you have millions of them -- too many to manually fix. As this causes a read hang, it is usually better to dd copy the fs to a clean disk, set to replace bad blocks with zeros, then fsck/xfs_repair when you mount, then xfsdump.
dd conv=noerror,sync,notrunc bs=512 if=/dev/disk of=diskimg
+1 for mentioning ZFS. It's really quite amazing. Almost like futuristic alien technology compared to the other freely available file systems.
I see that the choice of a file-level copy was deliberate: "I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."
So going with a simple block copy should suffice IMHO.
I did keep a log file with the output from cp, and it clearly identified the filenames for the inodes with bad blocks. Actually, I'm not sure how dd would handle bad blocks.
I was about to bet on "read fail repeat skip" cycle for dd's behaviour but, looking into coreutil's source code at https://github.com/goj/coreutils/blob/master/src/dd.c , if I'm not mistaken , dd does not try to be intelligent and just uses a zeroed out buffer so It would return 0's for unreadable blocks.
In Windows-land, the default copy is pretty anemic, so probably most people avoid it for serious work.
I'd probably use robocopy from the command line. And if I was being lazy, I'd use the Teracopy GUI.
I think my limit for a single copy command has been around 4TB with robocopy--and that was a bunch of large media files, instead of smaller more numerous files. Maybe there's a limit I haven't hit.
I've used FastCopy for GUI based larger transfers, it's open source and can handle larger datasets well in my experience. It also doesn't choke on >MAX_PATH paths. Haven't had problems with it. Supposedly it's the fastest tool around...
The only slight issue is that the author is Japanese so the English translations aren't perfect plus the comments in the source are in Japanese.
How does this happen?
Today there are N applications. "We can't increase MAX_PATH because it will break existing applications!"
Tomorrow there are N+M applications. "We can't increase MAX_PATH because it will break existing applications!"
Any time you are faced with a hard technical decision like this, the pain will always be least if you make the change either:
1. During another transition (e.g. 16-bit to 32-bit, or 32-bit to 64-bit). Microsoft could have required all 64-bit Windows apps to adopt a larger MAX_PATH, among other things.
2. Right NOW, because there will never be an easier time to make the change. The overall pain to all parties will only increase over time.
The MAX_PATH limitation can be avoided by using the \\?\ prefix (which also disables some kinds of path normalization, like converting / to \) or by using paths relative to the current directory.
Unfortunately, most tools (including Windows Explorer and other Microsoft software) still don't work with paths longer than MAX_PATH.
If you have non-Windows users using the share it's pretty much guaranteed, since they don't have a problem with it - only Windows chokes on this.
Even Windows users manage to somehow create such paths (files) occasionally. Then you need to break out cmd.exe or somesuch because Explorer can't touch them.
1. I've not used dd on failing media, so I'm not sure of the behavior. Will it plow through a file with block-read failures or halt?
2. There's the ddrescue utility, which is specifically intended for reading from nonreliable storage. Seems that this could have offered another means for addressing Rasmus's problem. It can also fill in additional data on multiple runs across media, such that more complete restores might be achieved.
Ddrescuelog is a tool that manipulates ddrescue logfiles, shows logfile contents, converts logfiles to/from other formats, compares logfiles, tests rescue status, and can delete a logfile if the rescue is done. Ddrescuelog operations can be restricted to one or several parts of the logfile if the domain setting options are used.
That might allow for identification of files with bad sectors.
Even if you manage to do that, I'm not sure it would be a good idea to continue to use a file system that has lost sectors, even after fsck. Are you sure fsck is fixing any inconsistency? Are there any automatic procedures in place that guarantee that the fsck algorithms are in sync with the actual file system code? (Answer anew for any file system I might be using.) You definitely should do backups by reading the actual files, not the underlying device; perhaps in this case it could be OK (since it was a backup itself already, hence a copy of life data; but then if OP bothered enough to recover the files, maybe he'll bother enough to make sure they stay recovered?)
Ditto for tricks like:(cd $src; tar cf - .) | (cd $dst; tar xf -).
Pretty much nothing is going to work in an obvious way in a regime like this. That's sort of the point of the article.
1. Initial backup or "full backup" - copy the full targeted filesystem to the time indexed directory of the backup machine.
2. Sequential backups:
a. on the backup machine, create a directory for the new time, create a mirror directory structure of the previous time.
b. hard link the files in the new structure to those in the previous backup (which may be links themselves, back to the last full backup.
c. rsync the files to the new backup directory. Anything that needs to be transfered results in rsync transfering the file to a new directory, the moving it into the proper place. This unlinks the filename from the previous version and replaces it with the full version.
So yeah, the result of this system over a few machines and a long-timeframe backup system is way more links on the backup machine than any iteration of the backup will ever actually use.
I'm surprised that cp even worked. I wouldn't have had the balls to just use cp as it could just have failed.
My strategy would probably have been to cp or rsync over the latest backup (so that new incremental backups could be immediately be created again) and then incrementally rsync'ing the previous backups with hardlink detection.
This is the real take-away. Monitor your drives. At very least enable SMART, and also regularly run a read on the full underlying drive (SMART won't see and log blocks that are on the way out so need retries for successful reads, unless you actually try to read those blocks).
That won't completely make you safe, but it'll greatly reduce the risk of other drives failing during a rebuild by increasing the chance you get advanced warning that problems are building up.
IIRC, the use of tar is recommended.
$ (cd $origin && tar cf - *) | (cd $destination && tar xvf - )
Two tar command would in fact be much worse, because they'd require twice as many resources.
Also, and just to be clear, I did not invent this idea of using tar for this. I remember having read it in a Unix manual when I was learning about this OS.
And even if using tar is not a good idea, it should certainly be considered and as such I don't understand why OP doesn't even mention it in his post.
Yes, using tar for copying is not new and has its use cases, but this is not one of them.
The filesystem keeps a track of the number of nodes links for each file. So tar does not have to memorize the inode for each file, but only for those who have duplicates. So unless a large amount of files have more than one hard link, this is probably not a problem.
We will never know though, considering the author did not try it. Neither did he explain why he did not, despite the fact that it is the recommended Unix procedure.
> it is the recommended Unix procedure
I don't think it is "recommended procedure", not today. At least on GNU/Linux cp is just fine. It may have been different two decades ago for some reason: I remember a graphical file manager (called "dfm") that used piped tar back in the late 90ies . I know that installing open source utilities was pretty common back then to get better programs on closed source Unixes, and tar was probably high on the priorities to install whereas cp less so, and hence tar might have in fact been better than the preinstalled cp. I haven't seen anyone in the public technical (hence primarily open source) community advocating that even for portability reasons, so I can only think of it as "perhaps historically relevant, if at all".
Your point is valid though that one could have tested it here. See rbh42's answer on this point. Although I'm not sure he's right about both tar instances requiring a hash table: if tar submits the hard linking as from a previously submitted path to the new one, then there's no need for the table. And my quick look at the transmitted data seems to contain the source path for the link. The question then becomes whether tar has a better optimized hash table (not very likely), or uses a better algorithm (see my post about "megacopy" for an attempt at the latter).
 It actually had a problem: it didn't put "--" between the options and the path, and promptly failed when I moved an item that had a file name starting with a minus: it was interpreted by tar as option, and in addition the file manager didn't properly check the exit codes and simply unlinked the source even though it wasn't copied. The tools before GNU may really have been wanting: his failure to use "--" may have been caused by some contemporary non-GNU tar not supporting it. Anyway, of course I moved on to another file manager. Those bits of mine were sacrificed to the god of superior recommended Unix procedures or something.
I feel the pain. I went thru the same hell a few months ago.
The math is clear that in sufficiently large disk systems, RAID5, RAID6, and friends, are all insufficient.
He specifically used cp because he was worried about data loss from using something like dd. But if he'd been doing regular backups to tape, he could have just restored the backup from tape with no fuss or problems. It's only because he was using tools that aren't suited to what he was trying to do that he ran into problems.
Heres my anecdotal datapoint on which I base my personal believe:
From about 10-6 years ago, when I was doing sysadmin-work at university building storage-systems from commodity parts for experimental bulk data, we first had a load of not-reliably working early raid/sata(?) adapters, and those made ext3 and reiserfs (I think...) oops the kernel when the on-disk structure went bad. Whereas XFS just put a "XFS: remounted FS readonly due to errors" in the kernel logfile. That experience made XFS my default filesystem up to recently when I started to switch to btrfs. (of course, we fixed the hardware-errors, too... :-) )
Also, from that time, I got to use xfsdump/xfsrestore for backups and storage of fs-images which not even once failed on me.
This is so important, and so often overlooked.
The lure of "ohhh shiny!" is prevalent in tech. I love to use bleeding edge stuff, play with new toys etc, but that is purely for home and personal stuff. Never for work.
I want my work life to be utterly boring. Nothing breaking. Everything behaving exactly as expected.
I will happily trade a bit of performance and a lack of some shiny features for a system or component that works reliably. The less time I have to spend fighting stuff the more time I can spend tuning, automating or improving what I do have to look after.
XFS performs well. It's reliable, reasonably fast (they trade some speed for durability/reliability), and just works