

How to transfer large amounts of data via network - phelm
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html

======
moe
Having transferred petabytes of data in tens of millions of files over the
past months, let me assure you there's only one tool that you really need:
GNU parallel.

Whether you copy the individual files with ftp, scp or rsync is largely
irrelevant. The network is always your ultimate bottleneck. Using a slower
copy-tool just means having to set a slightly higher concurrency in order to
max it out.
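
Roughly something like this works, with hosts, paths, and the job count being
placeholders you'd tune for your own setup:

    # push many files with several concurrent rsync streams;
    # -R/--relative recreates each file's path under the destination
    cd /data/src
    find . -type f | parallel -j 8 rsync -R {} user@dest:/data/dst/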

~~~
batbomb
For bulk transfer of many files, and especially transfers over a local/nearby
network, that may hold true, but as a general practice, and especially for
serial/one-off transfers of very large files, GNU parallel won't help. However,
all the tools mentioned multiplex connections: they are better for
transferring individual files, and you can also use them in parallel.

A combination of your network, CPU threads, frame size, etc. is your ultimate
bottleneck when you are transferring very large files.

We've transferred exabytes using bbcp and GridFTP. bbcp is very easy to use
once it's set up.

At some point, you run into much different issues when trying to routinely
transfer very large files across continents at speeds greater than 10Gbps.
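
For what it's worth, a typical bbcp invocation looks something like this
(hosts and paths are placeholders; the stream count and window size are
things you tune to your particular link):

    # -s: number of parallel TCP streams, -w: per-stream window size,
    # -P 2: print a progress line every 2 seconds
    bbcp -P 2 -s 16 -w 8m bigfile.tar user@remote.example.org:/scratch/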

~~~
anon4
For bulk transfer, the absolute fastest I've seen is piping tar through netcat
and doing the reverse on the receiving end; on a 10-gigabit LAN that results
in transfers at HDD speed. That was between my personal machines with
consumer-grade SATA hard disks. The situation probably changes once you add
hops and have multiple disks to read from at once.
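
For reference, the pipeline is roughly the following (port and hostname are
placeholders, and the exact listen flags differ between netcat variants):

    # receiving end: listen and unpack
    nc -l 7000 | tar -xpf - -C /data

    # sending end: pack and stream
    tar -cf - /data | nc receiver.example.org 7000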

------
bwross
The primary advantage GridFTP has over simply using tar+netcat for performance
is that GridFTP can multiplex transfers over multiple TCP connections. This is
helpful as long as the endpoint systems limit the per-connection buffer size
to some value less than the bandwidth-delay product (BDP) between them. If
you've got to bug sysadmins to get GridFTP set up for you on both endpoints,
you might as well just ask them to increase the maximum TCP buffer size to
match the BDP.

EDIT: Sorry, "multiplex" is not the right word to describe that. It's more
like GridFTP "stripes" files across multiple connections; it divides the file
into chunks, sends the chunks over parallel connections, and reassembles the
file at the destination.
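
To put numbers on it: the BDP is just bandwidth times round-trip time, e.g.
10 Gbit/s x 100 ms = 125 MB, so a single connection needs roughly a 125 MB
buffer to keep that pipe full. On Linux the relevant caps look something like
this (the 128 MB figures are sized for that example, not a recommendation):

    # hard caps on what a socket may request via setsockopt
    sysctl -w net.core.rmem_max=134217728
    sysctl -w net.core.wmem_max=134217728
    # TCP autotuning limits: min / default / max, in bytes
    sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"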

------
jefurii
Never underestimate the bandwidth of a station wagon full of tapes hurtling
down the highway.

~~~
batbomb
This only holds true so long as all data is on tape and you don't need a
replica before sending it off, in case someone hits your petamobile.

That's because your bandwidth is limited by your tape library, the drives, the
network to your tape library, the latency of retrieving/writing/copying to
tape, and a few other things.

~~~
wmf
That's why the proverbial "tapes" should not be actual tapes but full Hadoop
nodes. Just unrack them and go (or ship an entire rack).
[http://research.microsoft.com/apps/pubs/default.aspx?id=64570](http://research.microsoft.com/apps/pubs/default.aspx?id=64570)

~~~
swatow
That was 13 years ago. Seems like the economics have changed in favor of the
internet by now.

------
rdtsc
I like the tar+netcat mentioned towards the bottom for LAN transfer. That
usually goes much faster than rsync or scp.

The reason I haven't looked at other tools is that I only do this
intermittently and always reach for the tools already installed on the system.

------
joshAg
If you have to regularly transfer large amounts of data over a network, it
might be worth looking into a wan optimization product like Riverbed's
Steelhead, Silverpeak's VX/NX lines, or Bluecoat Mach 5, or one of the other
vendors' solutions.

Yeah, you could try and roll it yourself, since really it just comes down to
compressing and deduplicating what you send over the wire, but doing that well
and also making it simple to use is not a trivial problem. Why reinvent the
wheel badly?
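
The compression half is easy enough to approximate yourself when the data
isn't already compressed; something like this over ssh (compression level and
hosts are placeholders). The deduplication half is the part that's genuinely
hard to roll on your own:

    # compress on the wire, decompress on the far side
    tar -cf - /data | zstd -T0 -3 | ssh user@remote 'zstd -d | tar -xpf - -C /data'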

~~~
epistasis
There aren't many situations where these types of products help, in my
experience, especially for the kind of data that's going to be transferred
between UCI and the Broad. Enterprise compute data, cached webpages, etc.,
may have a good amount of deduplication potential.

But for actual "data," meaning measurements and the like, these products will
achieve nothing. The data itself almost never has any duplicated chunks, and
if there are petabytes of it, it's almost certainly stored in some sort of
compressed format already.

~~~
semi-extrinsic
We had to explain this repeatedly to several vendors the last time we were
buying a small-ish (30 TB) file server. They seemed very skeptical of the idea
that we were storing lots of data in compressed binary formats.

~~~
epistasis
I think it's one of those situations where, for most vendors' customers, buying
more hardware is far cheaper than hiring smart programmers. But for academic
situations, there's a surplus of clever programmers with low wages, and not
nearly enough money for hardware. So in "enterprise" the solution is to shove
everything into SQL databases and just buy a ton more compute and disk to
manage the extra inefficiencies, whereas academic situations have not had that
luxury.

As data science progresses, the number of enterprisey large-data situations
will also decrease, I think.

------
noedig
This is a good site to visit if you have these kinds of data transfer issues:
[http://fasterdata.es.net](http://fasterdata.es.net)

------
mschuster91
And once you involve Windows, especially with the mentioned "ZOT files", Samba
becomes a massive bottleneck...

