
Ask HN: How do you handle transferring large files over the internet? - lucasch
I've often heard that the best way to send large amounts of data is to actually send it via snail mail. But if that is not an option, what is the best way to send, say, 500GB over the internet? Past experience preferred, but creative solutions welcome!
======
Piskvorrr
If, for some reason, you are averse to rsync (there are plenty of Windows
clients), then BitTorrent is second-best: it works just as well for
transferring private data; just turn off all the metadata broadcast options
and make sure that encryption is required. (Not so easy for incremental
transfers, though.)

~~~
lucasch
We considered rsync but were wondering if there were more specialized tools
available. We figured that those who work in the scientific community would
have a way to transfer their large data sets between institutions.

~~~
ColinWright
We transfer large files containing raw radar data, and moderately sized files
containing databases of target movements and track information.

We use _rsync._

When I worked for the local university we had to transfer data between
machines to run experimental parallel programs on so-called "big data."

We used _rsync._

~~~
lucasch
Ah, OK, that's good to hear. Have you considered using multipath transport
protocols with something like rsync? I am curious whether it could benefit this
situation. MPTCP sounds like an interesting protocol if you control both
hosts. [https://www.multipath-tcp.org/](https://www.multipath-tcp.org/)

~~~
ColinWright
We were/are always restricted by intermediate limits on throughput, so it's
never been useful or interesting to consider alternatives.

YMMV, but if you want to improve throughput, consider carefully the path your
data has to take. But _rsync_ is rock-solid, well-understood, mature,
and does exactly what it is intended to do.

------
teddyc
I would use scp with the -C option to compress the data for the first attempt
to transfer the file. Any subsequent attempts to transfer the file should use
rsync (this includes any updates to the file).

Another option is to use a cloud data storage provider (Dropbox, Box, Google
Drive, etc) and install their software that keeps files in sync. Then you can
just put the file in a local folder and let their software sync it to the
cloud. After sync is complete, you email a link to share the file. Of course,
their software is probably just a GUI for rsync.

~~~
Piskvorrr
Why not use rsync for the first transfer as well? There's `-z` for the same
effect...

As for commercial providers - I've seen rate limits (never did it saturate the
link) and size limits (0.5 TB will cost you extra). Moreover, the data will go
through an intermediate hop, which is slow (Dropbox starts downloading only
when upload completes, doubling your transfer time), plus your data is
now...somewhere (which could be a regulatory issue, depending on the data).

------
wmf
[http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html](http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html)

------
niftich
I assume you're transferring between two hosts that you control. Some sane
options are:

\- rsync

\- SFTP-over-SSH, or SFTP over a dedicated VPN

\- private, encrypted torrents

~~~
Piskvorrr
SFTP gets tricky on resume; rsync and BitTorrent have built-in data integrity checks.

------
lastofus
Aspera is often used in the film industry to move large video files around. My
understanding is their software takes over some layers of the network stack to
make sure the pipe stays saturated.

[http://asperasoft.com/](http://asperasoft.com/)

~~~
atsaloli
Yep, when I worked in the film industry, we used Aspera or GridFTP (open
source).

------
soulbadguy
Do you need to transmit 500GB every time, or just a diff from a previous
dataset? If it's the latter, using the send/receive functionality of a
file system with snapshots and incremental backups (ZFS, btrfs, etc.) can be
significantly faster than using pure rsync. Rsync would need to scan the
complete 500GB of data to find the blocks to send, while send/receive can
compute the diff much faster.
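As a rough sketch of the send/receive approach (the pool, dataset, snapshot, and host names below are placeholders, both ends must run ZFS, and the commands need root plus an existing pool, so this is illustration only):

```shell
# One-time: snapshot the dataset and send it in full to the remote pool.
zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh user@remote zfs receive backup/data

# Later: snapshot again and send only the blocks that changed
# between the two snapshots (-i = incremental send).
zfs snapshot tank/data@tuesday
zfs send -i tank/data@monday tank/data@tuesday | \
    ssh user@remote zfs receive backup/data
```

Because the diff is computed from snapshot metadata, there is no full scan of the dataset, which is the speed advantage described above.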

~~~
Piskvorrr
Negligible W/R/T transfer time. While using native capabilities of ZFS is
awesome, you're now locked into a particular FS at both sides of the transfer
(this may or may not be an issue).

(Also, there's a patch for rsync that allows you to force computing the
checksum in batch, not on each invocation; but that's getting into hairy
territory that's rarely needed - I used it in exactly _one_ case so far)

------
kennell
More details please.

* Are we talking a few giant files or thousands of small-ish files?

* Who is on the receiving end? Technical people who can be expected to run some command-line stuff, or your average Joe?

* In what scenarios must the solution work? What OSes? Is it acceptable to install extra software, or must it work out of the box?

etc. etc. etc.

------
CarolineW
rsync

~~~
Piskvorrr
Indeed. I've been trying all sorts of weird stuff, but this takes the cake.
Ubiquitous, rock-solid, sane. Plus, no worrying "is it done yet? Do I have the
latest version?" Just let it run (again) - this makes it rather foolproof.

~~~
kraemate
Does rsync auto-resume after failed connections?

~~~
CarolineW
I've written a script that every minute tests to see if the appropriate
_rsync_ command is running. If not, it simply runs it again. In that way it
effectively restarts and gracefully resumes.

This Google search:

[https://www.google.co.uk/search?q=rsync+auto-restart+failed+...](https://www.google.co.uk/search?q=rsync+auto-restart+failed+connection)

returns this link:

[http://superuser.com/questions/302842/resume-rsync-over-ssh-...](http://superuser.com/questions/302842/resume-rsync-over-ssh-after-broken-connection)

which contains this script:

    #!/bin/bash
    
    while [ 1 ]
    do
        rsync -avz --partial source dest
        if [ "$?" = "0" ] ; then
            echo "rsync completed normally"
            exit
        else
            echo "Rsync failure. Backing off and retrying..."
            sleep 180
        fi
    done

The comment says:

    When the connection dies, rsync will quit
    with a non-zero exit code. This script simply
    keeps re-running rsync, letting it continue
    until the synchronisation completes normally.

That's pretty much what I've done.

~~~
kraemate
Wow, thanks for the script. Surprising in its simplicity. I would have thought
this use-case was popular enough to warrant specialized tools etc. Especially
in the scientific community where they transfer large files.

~~~
CarolineW
It's simple enough that it's the sort of thing I type out in 30 seconds and
there it is. No need for specialist tools - finding the tool, remembering how
to use it, working out the right parameters ...

Easier, faster, and more flexible just to write the script. It's what I do.

------
Piskvorrr
BTW, not relevant in your case, but a friend working for a large video company
regularly drives a truckload of 10 TB tapes around the continent - bandwidth
is still an issue at _that_ scale.

~~~
lucasch
Amazing. If only there were a way to get around the limitations at that level.

~~~
Piskvorrr
Well...two decades ago, 500 GB of data would have been moved as freight, too.
Data expands to fill _any_ available capacity - I don't think there is any way
around _that_.

~~~
atsaloli
About 9 years ago, while working as a system administration consultant, I had
a gig to fly a portable hard drive with about 360 GB from LA to St. Louis as
part of a migration of a web application. It was faster than the network
connections available to my client at the time. I remember calculating the
throughput...

I asked why don't you just FedEx it? It's too important, the client said, and
we know and trust you.

It was funny, I had it in a laptop bag, and didn't let go of it except to go
through the security scanner... the only thing missing was the handcuff
connecting it to my wrist. :)

~~~
Piskvorrr
Well...you are probably not going to throw a hard drive at a client's door and
run. A delivery guy might, as it's just another cardboard package, not
priceless data (might be easier now with SSDs).

Indeed, getting a _trustworthy_ courier service is so hard that actually
sending an in-house employee is worthwhile, even though their hourly rates
make this extremely expensive: you are removing tens of abstraction layers
while preserving a high degree of control ("Oh, we might have run it over with a
truck. And accidentally put it on a plane to New Zealand on hop #3. And they
can't seem to find it there.")

~~~
atsaloli
Fair enough. :)

