
I'm a biologist by training. Eventually my research hit a data wall (my simulations produced too much data for my storage and processing system). I had read the Google papers on GFS, MapReduce, and Bigtable, and decided to go work there. I got hired onto an SRE (production ops) team and spent my 20% time learning how production data processing works at scale.

After a few years I understood things better and moved my pipelines to MapReduce, and I built a bigger simulator (Exacycle). It was easy to process 100+ TB datasets in an hour or so. It wasn't a lot of work, really. We converted external data formats to protobufs and stored them in various container files. Then we ported the code that computed various parameters from the simulation to MapReduce.
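
For anyone curious what that kind of port looks like in miniature, here is an illustrative sketch (not the actual Exacycle code; the record layout and the per-frame metric are made up for the example): a mapper emits a per-frame value keyed by trajectory, a shuffle groups values by key, and a reducer aggregates them.

    # Toy map/shuffle/reduce over simulation frames, standing in for the real
    # MapReduce-over-protobuf pipeline described above. Record layout and the
    # radius-of-gyration metric are hypothetical examples.
    from collections import defaultdict

    def radius_of_gyration(coords):
        # coords: list of (x, y, z) tuples; unweighted Rg for illustration.
        n = len(coords)
        cx = sum(p[0] for p in coords) / n
        cy = sum(p[1] for p in coords) / n
        cz = sum(p[2] for p in coords) / n
        return (sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
                    for p in coords) / n) ** 0.5

    def mapper(record):
        # record: (trajectory_id, frame_index, coordinates)
        traj_id, _frame, coords = record
        yield traj_id, radius_of_gyration(coords)

    def reducer(traj_id, values):
        vals = list(values)
        return traj_id, sum(vals) / len(vals)  # mean Rg per trajectory

    def run(records):
        shuffled = defaultdict(list)
        for rec in records:                    # "map" phase
            for key, value in mapper(rec):
                shuffled[key].append(value)    # "shuffle" phase
        return [reducer(k, v) for k, v in shuffled.items()]  # "reduce" phase

    if __name__ == "__main__":
        frames = [("traj0", 0, [(0, 0, 0), (1, 0, 0), (0, 1, 0)]),
                  ("traj0", 1, [(0, 0, 0), (2, 0, 0), (0, 2, 0)])]
        print(run(frames))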

I took this knowledge, looked at the market, and heard "storing genomic data is hard". After some research, I found that storing genomic data isn't hard at all. People spend all their time complaining about storage and performance, but when you look, they're using tin can telephones and wind-up toy cars. This is because most scientists specialize in their science, not in data processing. So, based on this, I built a product called 'Google Cloud Genomics' which stores petabytes of data (some public, some private for customers). Our customers love it: they do all their processing in Google Cloud, with fast access to petabytes of data. We've turned something that required them to hire expensive sysadmins and data scientists into something their regular scientists can just use (for example, from BigQuery or Python).
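
To make "regular scientists can just use it" concrete, here is a minimal sketch of the Python/BigQuery path. The project, dataset, and table names are placeholders (not a real public dataset); the point is just that counting variants per chromosome is a plain SQL query, not a sysadmin project.

    # Minimal sketch: querying a variants table from Python with the BigQuery
    # client library. Project/dataset/table names below are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    query = """
        SELECT reference_name, COUNT(*) AS n_variants
        FROM `my-project.genomics.variants`
        GROUP BY reference_name
        ORDER BY n_variants DESC
    """

    for row in client.query(query).result():
        print(row.reference_name, row.n_variants)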

One of the things that really irked me about genomic data is that for several years people were predicting exponential growth of sequencing and similar rates of storage needs. They made ludicrous projections and complained that not enough hard drives were being made to store their forthcoming volumes. Oh, and the storage cost too much, too. Well, the reality is that genomic data doesn't have enough value to archive it for long periods (sorry, folks, for those that believe it: your BAM files aren't valuable enough to justify even the incredibly low rates storage providers charge). Also, we can just order more hard drives; Seagate produces drives to meet demand, so if there is a real demand signal and money behind it, the drives will be made. Actual genomic data is tiny compared to cat videos.

The real issue is that most researchers don't have the tools or incentives to properly collect, store, and use big data. Until that is fixed, the field will continue in a crisis.




Question from ignorance: how do you get "petabytes of data" into the Google Cloud in a reasonable time? I find copying a mere few TB can take days and that's on a local network not over the internet.


The AWS Snowball service (https://aws.amazon.com/blogs/aws/aws-importexport-snowball-t...) can transfer 1 petabyte per week. Amazon ships you storage appliances, you copy your data onto them, and then you ship them back to Amazon, which loads the data into S3.


There's also Snowmobile: 100 PB of storage in a shipping container, which can be filled within 10 days.

https://aws.amazon.com/snowmobile


I'd also be interested to hear this.

I'm running a project that's 10 GB in size, and uploading the data to AWS S3 was absurdly slow.

Did you find any way to speed up the upload? 10 GB was painful enough; I can't imagine uploading terabytes.


I don't work in this specific field, but did previously, during the first decade of this century, in broadcast video distribution.

At the time, UDP-based tools such as Aspera[1], Signiant[2] and FileCatalyst[3] were all the rage for punting large amounts of data over the public Internet.

[1] http://asperasoft.com/

[2] http://www.signiant.com/

[3] http://filecatalyst.com/


Aspera is the current winner in bioinformatics. The European Bioinformatics Institute and the US NCBI are both big users of it, mainly for INSDC (GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.

For UniProt, a smaller dataset, we just use it to clone servers and data from Switzerland to the UK and US at 1 GB/s over the wide-area internet.

Very fast, and quite affordable.


I used Aspera for a while, but plain old HTTP over commodity networks works fine if you balance your transfers over many TCP connections.
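
As a concrete (hypothetical) example of the many-connections approach on S3: boto3's transfer manager splits a large object into parts and uploads them concurrently. The bucket, key, paths, and tuning numbers below are placeholders; adjust them for your link.

    # Sketch: multipart upload with parallel connections via boto3's transfer
    # manager. Bucket/key/paths and the tuning values are placeholders.
    import boto3
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # use multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
        max_concurrency=16,                    # 16 parallel connections
    )

    s3 = boto3.client("s3")
    s3.upload_file("simulation.tar", "my-bucket", "raw/simulation.tar",
                   Config=config)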


Jim Kent wrote a small program, paraFetch: basically an FTP client that parallelized uploads. It worked reasonably well, speeding things up maybe 10x. You can get it somewhere on the UCSC web site in his software repository, though it involves compiling the C code.


For GCS, the gsutil program can saturate a 1 Gb NIC using "gsutil -m cp -R".


The fastest way to upload is to ship hard drives in an airplane.


OK -- Tanenbaum's "station wagon full of tapes" updated for the 21st century.


Tanenbaum always forgot to include the time spent writing and reading the tapes. Typical 10 TB hard drives (which most people use for data interchange instead of tapes) only have ~100 MB/sec of bandwidth (about the same as a 1 Gbit NIC).
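
A quick back-of-envelope using the numbers quoted above:

    # Writing (and later reading back) a 10 TB drive at ~100 MB/s costs about
    # a day on each end, before any shipping time.
    capacity_bytes = 10e12      # 10 TB drive
    throughput = 100e6          # ~100 MB/s sequential
    hours = capacity_bytes / throughput / 3600
    print(f"{hours:.0f} hours per drive, one direction")  # ~28 hours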


I have worked with biologists in the past and tried to show them how to improve their data processing, for example by sticking things in a database rather than clogging the network file system with millions of small files. The majority don't seem to take any interest.
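
A minimal sketch of that suggestion, using nothing fancier than SQLite (the table layout here is just an example): one database file instead of millions of tiny result files on the network filesystem.

    # One SQLite file instead of millions of tiny result files.
    import sqlite3

    conn = sqlite3.connect("results.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sample_id TEXT,
            metric    TEXT,
            value     REAL
        )
    """)
    rows = [("S001", "coverage", 31.7), ("S001", "gc_content", 0.41)]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()

    for sample, metric, value in conn.execute(
            "SELECT * FROM readings WHERE sample_id = ?", ("S001",)):
        print(sample, metric, value)
    conn.close()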


You should have told them they'd publish twice as many papers in high-impact journals if they could improve their data processing. Then show them their competitor's paper where they did just that.

That works well.


Can you say what the breakdown is between happy "institutional" (read: universities, research institutes, etc) and "industry" (read: private companies) customers? This seems great, except that federal grant funding used to make it really hard to use stuff like that.



