
Big Data Coming in Faster Than Biomedical Researchers Can Process It - happy-go-lucky
http://www.npr.org/sections/health-shots/2016/11/28/503035862/big-data-coming-in-faster-than-biomedical-researchers-can-process-it
======
Balgair
One point: most bio-type people have not had multivariable calculus, and many
have not had calculus at all. So it's not really that they _can't_ process it
fast enough; it's that their techniques for processing it are formulaic. They
figured out how to do that t-test (or something else) once, and they stick
with it, because they really don't know the math behind it.

Also, though there is a TON of 'data' coming in, most of it is not useful. For
example, I have a 500 GB file of a stack of .tiff images per fish that I have
imaged in a confocal microscope. I have a GFP filter on the scope, so only the
green channel of the .tiff files carries signal; the red and blue channels are
just background noise. Also, most of the image is the dish the fish are in. I
tickle the fish, they flick their tails, and I see this all at 120 fps. Now, I
measure the angle through which each fish flicks its tail, all in 3-D, because
that's what the scope records in. I have a half TB per fish to comb through,
and I have ~20 fish, so ~10 TB. At the end, I get a single graph comparing the
fish with some gene to those without it, and I have 10 TB of 'data' left over.
Yeah, someone _could_ comb through it all and find something else to look at.
But I forgot to record the precise temperatures, the orientation of the fish,
the fish that I knew later died, etc. I had all that in my head. And, hey, what
do you know, the p-value is ~0.45, so there is no 'real' difference between the
fish and we can't include this in a paper. Now all that 'data' is being kept on
a drive on some computer somewhere and is counted toward the lab's budget for
shared storage. It's not really 'data' anymore, in that it isn't useful for
advancing anyone's knowledge (it counts as practice, I guess), but it still
clogs up space.
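
For a sense of scale, the whole 10 TB eventually boils down to something like
this (a toy sketch; the tifffile dependency, file paths, channel order, and the
angle measurement are all placeholders for the real segmentation):

    import numpy as np
    import tifffile                    # placeholder dependency for reading the stacks
    from scipy import stats

    def max_tail_angle(stack_path):
        """Toy stand-in: reduce one fish's image stack to a single tail-flick angle."""
        stack = tifffile.imread(stack_path)     # e.g. (frames, z, y, x, channels)
        green = stack[..., 1]                   # keep only the GFP channel (assumed index)
        # ... real segmentation and 3-D angle fitting would go here ...
        return float(green.mean())              # stand-in for the measured angle

    mutant    = [max_tail_angle(f"mut_fish_{i:02d}.tif") for i in range(1, 11)]  # illustrative paths
    wild_type = [max_tail_angle(f"wt_fish_{i:02d}.tif") for i in range(1, 11)]

    t, p = stats.ttest_ind(mutant, wild_type)
    print(f"t = {t:.2f}, p = {p:.3f}")          # p ~ 0.45 here meant "nothing to report"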

~~~
shas3
Totally on point regarding multivariate relationships! The problem is not that
they do t-tests (and other Stats-101 black-box stuff), but that they stop
there. To many of them, even the existence of multivariate effects is beyond
imagination.

So, inference to many biomedical folks is just 1-dimensional. Big data has a
long way to go to penetrate fields where people cannot think in more than 1
dimension!
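
As a concrete illustration of the difference, here is a minimal sketch (the
data and covariates are made up) of moving from a lone t-test to a model that
also accounts for the variables the parent comment mentioned, like temperature:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Made-up data: tail-flick angle by genotype, plus a covariate that a
    # plain t-test would silently ignore.
    df = pd.DataFrame({
        "tail_angle":  [32.1, 28.4, 40.2, 35.7, 30.9, 38.3],
        "genotype":    ["mut", "mut", "wt", "wt", "mut", "wt"],
        "temperature": [26.0, 27.5, 26.2, 28.1, 27.0, 26.8],
    })

    # Genotype effect estimated while adjusting for temperature.
    model = smf.ols("tail_angle ~ C(genotype) + temperature", data=df).fit()
    print(model.summary())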

~~~
xapata
The best way to get published is to use a variation on a method that was used
many times before. Since the novelty you're focused on is something
biological, you stick with the same statistical methods that have gotten
published for the last decade.

Unfortunately, there just doesn't seem to be much tenure-juice from innovating
in statistical methods for most life-science fields. Not all, of course.
Science moves slowly.

------
dekhn
I'm a biologist by training. Eventually my research hit a data wall (my
simulations produced too much data for my storage and processing system). I
had read the Google papers on GFS, MapReduce, and Bigtable, and decided to go
work there. I got hired onto an SRE (production ops) team and spent my 20%
time learning how production data processing works at scale.

After a few years I understood stuff better and moved my pipelines to
MapReduce. And I built a bigger simulator (Exacycle). It was easy to process
100+ TB datasets in an hour or so. It wasn't a lot of work, really. We
converted external data formats to protobufs and stored them in various
container files. Then we ported the code that computed various parameters from
the simulation to MapReduce.
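
The shape of that pipeline, sketched here with Apache Beam's Python SDK rather
than the internal tooling I used (the file pattern and the parse step are
placeholders), is roughly:

    import apache_beam as beam

    def parse_frame(raw_bytes):
        # Placeholder: in the real pipeline this would deserialize a protobuf
        # message describing one simulation frame.
        run_id, value = raw_bytes.decode().split(",", 1)
        return run_id, float(value)

    with beam.Pipeline() as pipeline:
        (pipeline
         | "Read"   >> beam.io.ReadFromTFRecord("gs://my-bucket/frames-*")  # illustrative container files
         | "Parse"  >> beam.Map(parse_frame)
         | "PerRun" >> beam.combiners.Mean.PerKey()                         # one summary value per run
         | "Write"  >> beam.io.WriteToText("gs://my-bucket/params"))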

I took this knowledge, looked at the market, and heard "storing genomic data
is hard". After some research, I found that storing genomic data isn't hard at
all. People spend all their time complaining about storage and performance,
but when you look, they're using tin-can telephones and wind-up toy cars. This
is because most scientists specialize in their science, not in data
processing. So, based on this, I built a product called 'Google Cloud
Genomics', which stores petabytes of data (some public, some private for
customers). Our customers love it: they do all their processing in Google
Cloud, with fast access to petabytes of data. We've turned something that
required them to hire expensive sysadmins and data scientists into something
their regular scientists can just use (for example, from BigQuery or Python).
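
As a rough sketch of what "just use" looks like from Python (the project and
table names below are placeholders, not a real public dataset):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")       # placeholder project

    query = """
        SELECT reference_name, COUNT(*) AS n_variants
        FROM `my-project.genomics.variants`               -- placeholder table
        GROUP BY reference_name
        ORDER BY n_variants DESC
    """

    for row in client.query(query).result():
        print(row.reference_name, row.n_variants)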

One of the things that really irked me about genomic data is that for several
years people were predicting exponential growth of sequencing and similar
rates of storage needs. They made ludicrous projections and complained that
not enough hard drives were made to store their forthcoming volumes. Oh, and
the storage cost too much, too. Well, the reality is that genomic data doesn't
have enough value to archive for long periods (sorry, folks, for those that
believe it: your BAM files don't have enough value for you to pay the
incredibly low rates storage providers charge!). Also, we can just order more
hard drives; Seagate produces drives to meet demand, so if there is a real
demand signal and money behind it, the drives will be made. Actual genomic
data is tiny compared to cat videos.

The real issue is that most researchers don't have the tools or incentives to
properly collect, store, and use big data. Until that is fixed, the field will
continue in a crisis.

~~~
ams6110
Question from ignorance: how do you get "petabytes of data" into the Google
Cloud in a reasonable time? I find copying a mere few TB can take days, and
that's on a local network, not over the internet.

~~~
fnbr
I'd also be interested to hear this.

I'm running a project that's 10 GB in size, and uploading the data to AWS S3
was absurdly slow.

Did you find any way to speed up the upload? 10 GB was painful enough; I can't
imagine uploading terabytes.

~~~
phillc73
I don't work in this specific field, but did previously, during the first
decade of this century, in broadcast video distribution.

At the time, UDP-based tools such as Aspera[1], Signiant[2] and
FileCatalyst[3] were all the rage for punting large amounts of data over the
public Internet.

[1] [http://asperasoft.com/](http://asperasoft.com/)

[2] [http://www.signiant.com/](http://www.signiant.com/)

[3] [http://filecatalyst.com/](http://filecatalyst.com/)

~~~
jerven
Aspera is the current winner in bioinformatics. The European Bioinformatics
Institute and the US NCBI are both big users of it, mainly for INSDC
(GenBank/ENA/DDBJ) and SRA (Short Read Archive) uploads.

For UniProt, a smaller dataset, we just use it to clone servers and data from
Switzerland to the UK and US at 1 GB/s over the wide-area internet.

Very fast, and quite affordable.

~~~
dekhn
I used Aspera for a while, but plain old HTTP over commodity networks works
fine if you spread your transfers across many TCP connections.
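
For S3 specifically, you can get the same effect with boto3's transfer config
(a sketch; the bucket and file names below are made up):

    import boto3
    from boto3.s3.transfer import TransferConfig

    # Split large objects into parts and upload them over many connections at once.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # use multipart for files > 64 MB
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=32,                     # parallel TCP connections
    )

    s3 = boto3.client("s3")
    s3.upload_file("reads.bam", "my-genomics-bucket", "runs/reads.bam", Config=config)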

------
danso
It seems like there's a good opportunity for skilled data scientists and
engineers to make a real difference here. I do think that laypersons (to both
medicine and engineering) assume that practitioners in medicine and biology
have mastered mundane things like data pipelines, because you have to be so
smart to be in medicine/biology, but my limited experience has been more along
the lines of what Neil Saunders describes as the inspiration for his
coding+bioinformatics blog:

[https://nsaunders.wordpress.com/about-2/about/](https://nsaunders.wordpress.com/about-2/about/)

> You may be wondering about the title of this blog.

> _Early in my bioinformatics career, I gave a talk to my department. It was
> fairly basic stuff – how to identify genes in a genome sequence, standalone
> BLAST, annotation, data munging with Perl and so on. Come question time, a
> member of the audience raised her hand and said:_

> _“It strikes me that what you’re doing is rather desperate. Wouldn’t you be
> better off doing some experiments?”_

> _It was one of the few times in my life when my jaw literally dropped and
> swung uselessly on its hinges. Ultimately though, her question did make a
> great blog title._

edit: To add an anecdote that I believe I read on HN: regarding the huge trove
of DNA and other health data provided by the U.S. government, a commenter said
that the reason it was all on FTP was that professors couldn't download large
datasets via their web browsers, or some such technical hiccup.

I won't say that putting data on the Web makes it automatically more
accessible, but data discovery through FTP requires a bit of scripting skill
that I imagine the average biomedical scientist does not have.
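
(For the curious, the scripting involved is only a few lines of Python; the
host below is NCBI's public FTP server, but the directory and file names are
illustrative.)

    from ftplib import FTP

    # Anonymous login to NCBI's public FTP server.
    ftp = FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()

    ftp.cwd("genomes")                     # illustrative directory
    print(ftp.nlst()[:10])                 # peek at the first ten entries

    # Download one file (path is illustrative).
    with open("README.txt", "wb") as fh:
        ftp.retrbinary("RETR README.txt", fh.write)

    ftp.quit()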

~~~
brandonb
I agree--the impact of a great engineer working in healthcare is very high,
particularly if you partner with medical experts.

We're a small startup that has partnered with UCSF Cardiology to detect
abnormal heart rhythms, and other conditions, using deep learning on Apple
Watch heart rate data:

    
    
      https://wsj.com/articles/new-study-seeks-to-use-deep-learning-to-detect-heart-disease-1458240739
    
      https://blog.cardiogr.am/three-challenges-for-artificial-intelligence-in-medicine-dfb9993ae750
    
      https://a16z.com/2016/10/20/cardiogram/
    

We have about 10B sensor data points so far. If you're a machine learning
engineer and interested in working on this type of problem, feel free to email
me: brandon@cardiogr.am.

~~~
daveguy
I'm curious. What is your definition of "machine learning engineer"? Are you
talking mostly about feature engineering, or something deeper? If so, what?

~~~
brandonb
In our case, we're applying deep learning to sensor data, so much of the day-
to-day work of a machine learning engineer is experimenting with new neural
architectures rather than feature engineering by hand. For example, we're
using or interested in techniques like:

    
    
      * semi-supervised sequence learning (we have a paper in a NIPS workshop next week on applying sequence autoencoders to health data, for example)
    
      * deep generative models
    
      * variational RNNs
    

From a day-to-day perspective, we use tools like Tensorflow and Keras, similar
to most AI research labs. In general, we try to act as a software startup that
happens to work in healthcare, rather than as what you might think of as a
traditional biotech or medical device startup.
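
For a rough idea of the kind of model involved, a sequence autoencoder in
Keras looks something like this (window length and layer sizes are
illustrative, not our actual architecture):

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    timesteps, features = 128, 1     # e.g. 128 heart-rate samples per window (illustrative)

    inputs  = keras.Input(shape=(timesteps, features))
    encoded = layers.LSTM(32)(inputs)                          # compress the window to a 32-d vector
    decoded = layers.RepeatVector(timesteps)(encoded)          # unroll back to the sequence length
    decoded = layers.LSTM(32, return_sequences=True)(decoded)
    decoded = layers.TimeDistributed(layers.Dense(features))(decoded)

    autoencoder = keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")

    # Train on unlabeled windows; the learned encoder can then feed a supervised model.
    windows = np.random.rand(256, timesteps, features).astype("float32")   # placeholder data
    autoencoder.fit(windows, windows, epochs=1, batch_size=32)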

Does that help answer your question?

------
searine
Yeah, that happens when you prioritize buying sequencers and building genome
centers while pretending that analysts grow on trees.

The post-docs and graduate students who do the heavy lifting on all of these
projects don't make a living wage.

They can't raise a family, buy a house, or save for the future. The people in
charge made them indentured servants and now those leaders are going to reap
the whirlwind.

~~~
toufka
Yep. Money is strangely spent. Incentives and historical attitudes of the
field regarding money are hard to change.

$500k for a new microscope? No problem! It's a fancy four-channel live
microscope. So it takes four 1 MB images per frame, and you're running a
10-minute experiment taking an image 5 times per second. That's about 12 GB
per image set. And you take 10-15 replicates per experiment under 3-5
different conditions. The data for that experiment, which undergirds a 6-7
figure grant, is now stored on a $100 3 TB USB disk from Best Buy. Oh, and
trying to process that 12 GB image set over USB 2.0 using MATLAB on a
student's personal MacBook Air is horribly inefficient, but the student really
has no other option.
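
(Sanity-checking that figure:)

    frame_mb = 4 * 1                   # four channels, ~1 MB each, per frame
    frames   = 5 * 10 * 60             # 5 fps for a 10-minute experiment
    print(frame_mb * frames / 1000)    # ~12 GB per image set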

The students collecting the data and storing it on their local HD or laptop
hard drive have no place to archive their data even if they wanted to. There
are no repositories capable of generically storing that kind of huge data that
needs to be frequently accessed at the price the students/labs are willing to
pay (nothing).

And this says nothing about the code every student reinvents in MATLAB to do
basic scientific analysis. Or worse, does not reinvent, and instead reuses
20-year-old code written by some long-forgotten student who wanted to try
their hand at a 'new' programming language like IDL.

The 'students' and postdocs are paid nearly minimum wage to do the high-tech
biomedical research. There are no computer scientists to be seen, because they
would be fools to give up making 5-8x the money across the street at Twitter.

On the other hand, the scientists know their work really well, and it will
take a truly integrated team to solve these issues. A computer scientist can't
just come over for a day and write up an app to help out. The code will have
to make scientific assumptions and must be custom for many/most projects. But
it's very hard to build a capable team when the market salary for certain
kinds of team members is multiples of another on the same team.

~~~
IndianAstronaut
I am surprised they don't use something like S3 to scale up the storage
without buying hardware.

~~~
jerven
From the student's perspective, S3 over university wifi is not "better" than
local USB 2.0 hard drives. It's actually a risk, budget-wise.

------
zo7
I've heard about this a lot from people I know who work in healthcare. It
seems that one could make a successful business simply by hiring a bunch of
data scientists to offer analytics and data processing services for
healthcare, but what's preventing that? Is there a lack of expertise, funding,
too much regulation, or something else?

~~~
op00to
There's a bunch of issues:

\- Real, bespoke biomedical analysis is not trivial in effort, cost, or time.
There are biomedical analysis systems-in-a-box (look at
[https://galaxyproject.org](https://galaxyproject.org)), but that's just
canned analysis. To make real breakthroughs, you need rigorous analysis that
requires years of experience to be able to perform.

\- It's easier to get money to collect the data than it is to effectively
steward the data you collect. In a past life, I ran a biomedical research
computing facility, and everyone got plenty of money for new sequencers, mass
specs, and other fancy instruments. They got plenty of money for collecting
all kinds of data. No one would ever add money to their grants to actually
STORE the data. They would literally put the data on USB hard drives bought
from Best Buy and leave them in file cabinets and on desks. There was
absolutely nothing I could do about this, and so I quit.

\- Research is balkanized to hell. Even though I ran the scientific computing
for 20 research labs, each research lab was its own fiefdom. They could decide
to obey or disobey my policies at will, since they controlled their own
funding. You can imagine what happened when I proposed turning on quotas
(~100 TB per lab, to start!). Rather than work with my team to figure out how
to share resources, people would just jump off my high-speed facility, buy a
shitty cheap JBOD from Dell for their analysis, and store their archives on
shitty cheap USB hard drives from Best Buy. The funniest part was that if a
hard drive failed and the data couldn't be restored, in theory the principal
investigators could get into real legal trouble. No one seemed to worry.

There are a few biomedical research institutes that "get" scientific data
stewardship (Broad and Scripps, for example), but for the most part,
biomedical research computing is a total clusterfuck, and I couldn't get out
of there fast enough for the far saner land of tech companies.

~~~
swuecho
I agree. A lot of labs do not want to hire a good developer or data scientist,
or they do not have the money to, even though they spend thousands on data
collection.

Check the jobs here; most of them are postdoc level.
[https://www.biostars.org/t/Jobs/](https://www.biostars.org/t/Jobs/)

Postdoc level means you get about $50k-80k, even in the Bay Area.

The situation is really bad.

~~~
p10_user
I believe money is definitely a big part of it. If you have the skills needed
to help manage and analyze "big" data (big as in too big to realistically
handle in Excel, which is the limit for most biologists), you can easily earn
much more somewhere else.

~~~
collyw
Partially. I worked with bioinformatics labs until recently. Career
progression is limited as they treat a software engineer as a technician, and
nothing more. They don't appreciate the value you bring unless you are
publishing papers (certainly in the last two institutes I worked in).

------
nonbel
To most here:

Try to make sure you aren't just helping people publish papers for the sake of
it. As soon as you sense that, stop working with them. It is a very, very bad
thing.

------
sien
This isn't just in Biomedical fields.

In Earth Sciences, Astronomy, and presumably quite a few other fields, there
is a staggering amount of data coming in, and the number of people who have
the required domain knowledge, math knowledge, and programming knowledge is
not growing as fast as the data is. Teams can and do help, but, well, there is
lots of work out there.

~~~
IndianAstronaut
The LSST is a sky-survey telescope project that is expected to generate 15 TB
of data a day. That is a staggering amount of data coming in.

The data processing pipelines need to be very efficient, and the questions
clear, to tackle this. Not to mention the processing algorithms for this data.

------
mavsman
This isn't very surprising to me. Data is relatively cheap, and (effective)
analysis of it is where the value lies. That's typically the case, isn't it?

That said, this is a good reminder that data scientists are in high demand and
can make a difference.

------
dcgoss
If you're interested in working on an open source project involving big data,
machine learning and cancer, check out cognoma.org. It is sponsored by UPenn's
greenelab.com

------
yahyaheee
Anyone know of open datasets one could play around with?

~~~
abetusk
The Harvard Personal Genome Project has over 200 whole genomes and, I think,
over 500 genotyping datasets (23andMe and the like), released under a CC0
license with sporadic phenotype data [1]. Open Humans [2] has a bunch of data
with a convenient API [3]. openSNP has a lot of genotyping data (23andMe etc.)
available for download [4].

For a more comprehensive list check out one of the many "Awesome Public
Datasets" [5] (biology section).

[1] [https://my.pgp-hms.org/public_genetic_data](https://my.pgp-hms.org/public_genetic_data)

[2] [https://www.openhumans.org/members/](https://www.openhumans.org/members/)

[3] [https://www.openhumans.org/public-data-api/](https://www.openhumans.org/public-data-api/)

[4] [https://opensnp.org/](https://opensnp.org/)

[5] [https://github.com/caesar0301/awesome-public-datasets#biology](https://github.com/caesar0301/awesome-public-datasets#biology)

~~~
tetron
Incidentally, Arvados ([http://arvados.org](http://arvados.org)) is the
software used to host the Harvard PGP data, and is a free software platform
for managing large scale storage and analysis aimed at scientific workloads.

------
aisofteng
Isn't Watson Health in exactly this space?

~~~
searine
Watson can only categorize what is already known about genomes, given existing
research.

What we need are people who can ask questions and think critically. Good,
hypothesis-driven science is how we discover entirely new concepts and
mechanisms.

------
mjevans
If that much data is being collected, it's time to start asking what we're
looking for within it, and whether the rate of collection, retention, and
range of inputs is worth it.

Maybe it only makes sense to store sections which deviate in a significant way
from a range of error (lossy compression).

Maybe some of those inputs just don't make sense for the questions being
asked.

A concrete rationale for why the data should be kept needs to be presented,
and THAT is what should drive the funding to back that need.

~~~
dnautics
Genomic data is basically stored as "diffs" from the reference genome. For
humans, that's "some guy from Buffalo", as a UCSC professor put it.
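
In practice that "diff" is usually a VCF record: a position plus the reference
and alternate alleles (the record below is illustrative):

    # A VCF record stores a variant as a position plus REF/ALT alleles, i.e. a
    # diff against the reference assembly rather than a whole sequence.
    record = "1\t10177\trs367896724\tA\tAC\t100\tPASS\t."   # illustrative line

    chrom, pos, rsid, ref, alt = record.split("\t")[:5]
    print(f"{rsid}: {ref} -> {alt} at {chrom}:{pos}")
    # Only sites that differ from the reference are stored; matching bases are implied.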

~~~
mjevans
My take was that this was mostly 'telemetry' data from all of the vital signs
monitors hooked up to a patient.

