
I'm a biologist by training. Eventually my research hit a data wall (my simulations produced too much data for my storage and processing system). I had read the Google papers on GFS, MapReduce, and Bigtable, and decided to go work there. I got hired onto an SRE (production ops) team and spent my 20% time learning how production data processing works at scale.

After a few years I understood things better and moved my pipelines to MapReduce, and I built a bigger simulator (Exacycle). It was easy to process 100+ TB datasets in an hour or so. It wasn't a lot of work, really. We converted external data formats to protobufs and stored them in various container files. Then we ported the code that computed various parameters from the simulation to MapReduce.
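
For anyone curious what that kind of port looks like in miniature, here is an illustrative sketch (not the actual Exacycle code; the record layout and the per-frame metric are made up for the example): a mapper emits a per-frame value keyed by trajectory, a shuffle groups values by key, and a reducer aggregates them.

    # Toy map/shuffle/reduce over simulation frames, standing in for the real
    # MapReduce-over-protobuf pipeline described above. Record layout and the
    # radius-of-gyration metric are hypothetical examples.
    from collections import defaultdict

    def radius_of_gyration(coords):
        # coords: list of (x, y, z) tuples; unweighted Rg for illustration.
        n = len(coords)
        cx = sum(p[0] for p in coords) / n
        cy = sum(p[1] for p in coords) / n
        cz = sum(p[2] for p in coords) / n
        return (sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
                    for p in coords) / n) ** 0.5

    def mapper(record):
        # record: (trajectory_id, frame_index, coordinates)
        traj_id, _frame, coords = record
        yield traj_id, radius_of_gyration(coords)

    def reducer(traj_id, values):
        vals = list(values)
        return traj_id, sum(vals) / len(vals)  # mean Rg per trajectory

    def run(records):
        shuffled = defaultdict(list)
        for rec in records:                    # "map" phase
            for key, value in mapper(rec):
                shuffled[key].append(value)    # "shuffle" phase
        return [reducer(k, v) for k, v in shuffled.items()]  # "reduce" phase

    if __name__ == "__main__":
        frames = [("traj0", 0, [(0, 0, 0), (1, 0, 0), (0, 1, 0)]),
                  ("traj0", 1, [(0, 0, 0), (2, 0, 0), (0, 2, 0)])]
        print(run(frames))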

I took this knowledge, looked at the market, and heard "storing genomic data is hard". After some research, I found that storing genomic data isn't hard at all. People spend all their time complaining about storage and performance, but when you look, they're using tin can telephones and wind-up toy cars. This is because most scientists specialize in their science, not in data processing. So, based on this, I built a product called 'Google Cloud Genomics' which stores petabytes of data (some public, some private for customers). Our customers love it: they do all their processing in Google Cloud, with fast access to petabytes of data. We've turned something that required them to hire expensive sysadmins and data scientists into something their regular scientists can just use (for example, from BigQuery or Python).
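
To make "regular scientists can just use it" concrete, here is a minimal sketch of the Python/BigQuery path. The project, dataset, and table names are placeholders (not a real public dataset); the point is just that counting variants per chromosome is a plain SQL query, not a sysadmin project.

    # Minimal sketch: querying a variants table from Python with the BigQuery
    # client library. Project/dataset/table names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    query = """
        SELECT reference_name, COUNT(*) AS n_variants
        FROM `my-project.genomics.variants`
        GROUP BY reference_name
        ORDER BY n_variants DESC
    """

    for row in client.query(query).result():
        print(row.reference_name, row.n_variants)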

One of the things that really irked me about genomic data is that for several years people were predicting exponential growth of sequencing and similar rates of storage needs. They made ludicrous projections and complained that not enough hard drives were being made to store their forthcoming volumes. Oh, and the storage cost too much, too. Well, the reality is that genomic data doesn't have enough value to archive it for long periods (sorry, folks, for those that believe it: your BAM files aren't valuable enough to justify even the incredibly low rates storage providers charge). Also, we can just order more hard drives; Seagate produces drives to meet demand, so if there is a real demand signal and money behind it, the drives will be made. Actual genomic data is tiny compared to cat videos.

The real issue is that most researchers don't have the tools or incentives to properly collect, store, and use big data. Until that is fixed, the field will continue in a crisis.




Question from ignorance: how do you get "petabytes of data" into the Google Cloud in a reasonable time? I find copying a mere few TB can take days and that's on a local network not over the internet.


The AWS Snowball service (https://aws.amazon.com/blogs/aws/aws-importexport-snowball-t...) can transfer 1 petabyte per week. Amazon ships you storage appliances, you copy your data onto them, and then you ship them back to Amazon, which loads the data into S3.


There's also Snowmobile: 100 PB of storage in a shipping container, which can be filled within 10 days.

https://aws.amazon.com/snowmobile


I'd also be interested to hear this.

I'm running a project that's 10 GB in size, and uploading the data to AWS S3 was absurdly slow.

Did you find any way to speed up the upload? 10 GB was painful enough; I can't imagine uploading terabytes.


I don't work in this specific field, but did previously, during the first decade of this century, in broadcast video distribution.

At the time, UDP-based tools such as Aspera[1], Signiant[2] and FileCatalyst[3] were all the rage for punting large amounts of data over the public Internet.

[1] http://asperasoft.com/

[2] http://www.signiant.com/

[3] http://filecatalyst.com/


Aspera is the current winner in bioinformatics. The European Bioinformatics Institute and the US NCBI are both big users of it, mainly for INSDC (GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.

For UniProt, a smaller dataset, we just use it to clone servers and data from Switzerland to the UK and US at 1 GB/s over the wide-area internet.

Very fast, and quite affordable.


I used Aspera for a while, but plain old HTTP over commodity networks works fine if you balance your transfers over many TCP connections.
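
As a concrete (hypothetical) example of the many-connections approach on S3: boto3's transfer manager splits a large object into parts and uploads them concurrently. The bucket, key, paths, and tuning numbers below are placeholders; adjust them for your link.

    # Sketch: multipart upload with parallel connections via boto3's transfer
    # manager. Bucket/key/paths and the tuning values are placeholders.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # use multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
        max_concurrency=16,                    # 16 parallel connections
    )

    s3 = boto3.client("s3")
    s3.upload_file("simulation.tar", "my-bucket", "raw/simulation.tar",
                   Config=config)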


Jim Kent wrote a small program, paraFetch: basically an FTP client that parallelized uploads. It worked reasonably well, speeding things up maybe 10x. You can get it somewhere on the UCSC web site in his software repository, though it involves compiling the C code.


For GCS, the gsutil program can saturate a 1 Gb NIC using "gsutil -m cp -R".


The fastest way to upload is to ship hard drives in an airplane.


OK -- Tanenbaum's "station wagon full of tapes" updated for the 21st century.


Tanenbaum always forgot to include the time spent writing and reading the tapes. Typical 10 TB hard drives (which most people use for data interchange instead of tapes) only have ~100 MB/sec of bandwidth (about the same as a 1 Gbit NIC).
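
A quick back-of-envelope using the numbers quoted above:

    # Writing (and later reading back) a 10 TB drive at ~100 MB/s costs about
    # a day on each end, before any shipping time.
    capacity_bytes = 10e12      # 10 TB drive
    throughput = 100e6          # ~100 MB/s sequential
    hours = capacity_bytes / throughput / 3600
    print(f"{hours:.0f} hours per drive, one direction")  # ~28 hours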


I have worked with biologists in the past and tried to show them how to improve their data processing, for example by sticking things in a database rather than clogging the network file system with millions of small files. The majority don't seem to take any interest.
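
A minimal sketch of that suggestion, using nothing fancier than SQLite (the table layout here is just an example): one database file instead of millions of tiny result files on the network filesystem.

    # One SQLite file instead of millions of tiny result files.
    import sqlite3

    conn = sqlite3.connect("results.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sample_id TEXT,
            metric    TEXT,
            value     REAL
        )
    """)
    rows = [("S001", "coverage", 31.7), ("S001", "gc_content", 0.41)]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()

    for sample, metric, value in conn.execute(
            "SELECT * FROM readings WHERE sample_id = ?", ("S001",)):
        print(sample, metric, value)
    conn.close()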


You should have told them they'd publish twice as many papers in high-impact journals if they could improve their data processing. Then show them their competitor's paper where they did just that.

That works well.


Can you say what the breakdown is between happy "institutional" (read: universities, research institutes, etc) and "industry" (read: private companies) customers? This seems great, except that federal grant funding used to make it really hard to use stuff like that.



