
Efficient File Copying on Linux - eklitzke
https://eklitzke.org/efficient-file-copying-on-linux
======
dom0
So after the last blog post by The Author, which mainly showed The Author's
lack of understanding, we have another article from The Author highlighting
that he does indeed not understand the things he writes blog posts about: an
incorrect rationale and assumptions about a 128 KiB block size being optimal,
no accounting for the absence of readahead on virtual device files, and of
course no mention of any of the FD-splicing alternatives in a post titled
"Efficient ...", nor of the approaches involving memory mappings and explicit
prefetching on said mappings.

I don't want to be overly dismissive or arrogant here, but this post pretty
much boils down to "128 KiB is optimal because that number appears somewhere
else, too, and that other spot even has something to do with I/O".

~~~
akanet
While your criticisms about what the author's explanation lacks or does not
consider are appreciated, I think this is a wonderful opportunity for you to
put forth your own understanding! I know very little about the topic and thus
found the article interesting. I would happily learn more.

~~~
ars
Larger block sizes are faster but take more memory, and they interleave badly
with other tasks since each operation is all-or-nothing: a large block can
take a while, and nothing else can read or write while it's in flight.

That's about it.

You want a block size such that one write takes around 10 ms to hit the disk.
So around 128 KiB to 1 MiB, depending on the underlying hardware.
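
The tradeoff above can be sketched with a plain read/write copy loop where the
buffer size is the tunable knob; the helper name and its default are mine, not
from the article:

```python
import os


def copy_with_bufsize(src, dst, bufsize=128 * 1024):
    """Copy src to dst using read/write calls of at most `bufsize` bytes.

    A larger bufsize means fewer syscalls, but each call holds more memory
    and keeps the device busy longer before anything else gets a turn.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(bufsize)
            if not buf:
                break
            fout.write(buf)
```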

------
ars
The explanation concludes that readahead is the reason a 128 KB buffer is
fastest in the benchmark, while the benchmark uses /dev/zero and /dev/null,
which don't have readahead.

You need to redo this article using actual reads and writes. Try it with both
a quiet machine and a semi-busy one.

~~~
Scaevolus
I suspect this has more to do with the balance between L2 cache sizes and
syscall overhead than anything else. Dumping nothing into nothing at 40GB/s is
unacceptable, I _need_ it to be 60GB/s!

With actual I/O devices the difference should be negligible. Sequential reads
and writes are well optimized at every level, from the CPU to the HDD.

------
JoshTriplett
I'd be interested to see how this compares to 1) mmapping both files and using
memcpy, 2) mmapping the source and making a single call to write passing the
whole buffer, and 3) copy_file_range.
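
Option 2 fits in a few lines; this is a hypothetical helper, not something
from the article, and it assumes the source file is non-empty (mmap rejects
zero-length mappings):

```python
import mmap


def copy_via_mmap(src, dst):
    """mmap the source read-only and hand the whole mapping to a single
    write() on the destination (option 2 above)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        with mmap.mmap(fin.fileno(), 0, prot=mmap.PROT_READ) as m:
            fout.write(m)  # one write() call covering the whole file
```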

~~~
pskocik
I once benchmarked reads with `mmap` vs `read`
([http://stackoverflow.com/a/39196499/1084774](http://stackoverflow.com/a/39196499/1084774))
. mmap starts winning big time once the file is >= 16KiB. Copying should have
similar performance characteristics.
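
A stripped-down version of that comparison (my own sketch, without the timing
harness) just pulls every byte of the file through each mechanism:

```python
import mmap


def read_via_read(path, bufsize=64 * 1024):
    """Read the whole file with plain read() calls; returns bytes seen."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            total += len(buf)
    return total


def read_via_mmap(path):
    """Read the whole file through a read-only mapping; returns bytes seen."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            return len(bytes(m))  # touching every page via one copy
```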

------
jquast
The statvfs system call indicates the preferred block size of the filesystem.
It is a very large value on zfs, for example.

[https://docs.python.org/2/library/statvfs.html#statvfs.F_BSI...](https://docs.python.org/2/library/statvfs.html#statvfs.F_BSIZE)
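
In Python 3 the same value is exposed as the `f_bsize` field of `os.statvfs`
(the Python 2 `statvfs` module linked above has since been removed); for
example:

```python
import os


def preferred_block_size(path):
    """Return the filesystem's preferred I/O block size for `path`,
    as reported by statvfs (the f_bsize field)."""
    return os.statvfs(path).f_bsize
```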

------
valarauca1
I doubt these benchmarks are relevant anymore, as Linux now has a syscall
dedicated to copying files.

So you never even have to leave the page cache, let alone copy into user
space.

Of course, it was implemented post-4.0, so I doubt glibc supports it, and
therefore the whole world pretends it doesn't exist.

------
LeoPanthera
The most efficient way to copy a large number of small files is often to use a
tarpipe. What block size does "tar" use? And for that matter, "nc", as a
tarpipe through nc is a super fast way to move data between machines.

~~~
jquast
I don't know the value, but there is a noticeable improvement when using a
program like mbuffer or pv to place a memory buffer in the middle of the
pipeline, especially over the network with nc.

~~~
JdeBP
You might want to compare against pax in -r -w mode, too.

------
amelius
For even faster copying on the same device, use a copy-on-write (COW)
filesystem.

(I wonder though what API the "cp" command would use to accomplish that).

~~~
icebraining
GNU cp uses ioctl() with the FICLONE flag when you run "cp --reflink" on a COW
filesystem like Btrfs.

[http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html](http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html)
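
For illustration, the ioctl can be issued from Python as well; the FICLONE
request number below is the Linux value of _IOW(0x94, 9, int), and on a
filesystem without reflink support the call simply fails (this sketch is mine,
not how coreutils does it):

```python
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int) on Linux


def try_reflink(src, dst):
    """Attempt a reflink copy of src into dst. Returns True if the
    filesystem cloned the extents, False if it doesn't support FICLONE
    (e.g. EOPNOTSUPP on ext4)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        try:
            fcntl.ioctl(fout.fileno(), FICLONE, fin.fileno())
            return True
        except OSError:
            return False
```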

~~~
simcop2387
> GNU cp uses ioctl() with the FICLONE flag when you run "cp --reflink" on a
> COW filesystem like Btrfs.

> [http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html](http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html)

This unfortunately doesn't yet work with zfs on Linux. One of my only gripes
with it.

~~~
rincebrain
It's not clear to me that it will ever work outside of special cases on any
ZFS implementation - see also ryao's two comments here [1], but briefly, ZoL
seems to primarily think this would be feasible to implement in a not
backward-compatible or upgradable-on-old-datasets way, using the dedup
mechanisms, and ZFS's dedup is...well, it's a four-letter word in a number of
ZFS circles.

[1] -
[https://github.com/zfsonlinux/zfs/issues/405#issuecomment-50...](https://github.com/zfsonlinux/zfs/issues/405#issuecomment-50782850)

~~~
simcop2387
I'm curious about dedup being a four-letter word in a number of ZFS circles. I
hadn't seen anything bad about it, other than it being heavy on RAM and
possibly CPU.

~~~
rincebrain
So, dedup has a nontrivial memory overhead, and it can't really be turned off
short of recreating the datasets in the pool after it's turned on. (ZFS has no
operations that change data blocks retroactively, so a new compression type,
checksum, or dedup setting only applies to data written afterwards; undoing it
means copying the data again after changing the setting.) It can also have a
significant write performance impact depending on your setup.

See [1] for an example of how bad this can be.

A number of people run dedup in production workloads. You just need to
understand that it's a big and weirdly-shaped hammer, and can hit you in the
head if you swing it without being ready for it.

[1] -
[https://forums.freebsd.org/threads/31184/](https://forums.freebsd.org/threads/31184/)

------
jaimex2
Good explanation, thank you. So when using dd or other copy tools, should
setting a block size of 128 KiB also be the best choice?

------
heinrich5991
What about `copy_file_range(2)`?

~~~
kevinoid
It looks like the FICLONE ioctl was chosen over the copy_file_range syscall[1]
for use in coreutils due to concerns about copy_file_range not preserving
holes.[2] I agree; it would be interesting to know how the two compare, both
in filesystem implementation coverage and in performance.

1\. [https://lwn.net/Articles/659523/](https://lwn.net/Articles/659523/)

2\.
[https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24399](https://debbugs.gnu.org/cgi/bugreport.cgi?bug=24399)
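
The hole concern can be made concrete: a hole-aware copier walks the file
with lseek(SEEK_DATA)/lseek(SEEK_HOLE) and only copies the data extents. A
small extent detector (my own sketch; on filesystems without hole support,
SEEK_HOLE just reports the end of the file, so the whole file counts as one
data extent):

```python
import os


def data_extents(path):
    """Yield (start, end) byte ranges of `path` that actually contain
    data, using lseek with SEEK_DATA/SEEK_HOLE."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: no more data past this offset
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield (start, end)
            offset = end
    finally:
        os.close(fd)
```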

