
Open source effort to bring open biomedical data together – anyone interested? - nikolamilosevic
http://inspiratron.org/blog/2015/12/12/open-source-effort-to-bring-all-open-biomedical-data-together/
======
jerven
I am very supportive of the idea in general. I think the author
underestimates some of the reasons for the diversity in data sources,
schemas, etc. They are often different because the data itself is different.

The linked data/semantic web approaches are slowly eating away at the unneeded
data diversity, with small shared standards for common data and unique
schemas for unique data. The single-endpoint solution unfortunately does not
scale; it would be the AOL of biomedical data, or Entrez if you prefer. To be
truly open and free, anyone should be able to contribute their data and tools.
That means decentralized infrastructure, which means confusion and information
that is difficult to find. However, as academia and research are decentralized,
their IT infrastructure must match that reality. This leads to infrastructure
that can integrate on demand, such as that made possible by the SPARQL SERVICE
keyword.
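For illustration, on-demand integration with the SERVICE keyword might look like the sketch below. The endpoint URL and predicates are only an example (UniProt does operate a public SPARQL endpoint, but the exact property path here is an assumption):

```sparql
# Hypothetical federated query: join local triples with a remote
# endpoint via SERVICE. Vocabulary and endpoint are illustrative.
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?name
WHERE {
  # Local dataset: proteins of interest
  ?protein a up:Protein .
  # On-demand integration with a remote endpoint
  SERVICE <https://sparql.uniprot.org/sparql> {
    ?protein up:recommendedName/up:fullName ?name .
  }
}
LIMIT 10
```

The point is that no single central database is required: each party keeps its own endpoint, and queries federate across them at run time.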

Open source has always been an important part of bio-IT and that is not going
to change. But a single source is not the answer to our problems. We need to
make it easier to find information, but most importantly need to make it
easier to answer questions with the data that is available.

~~~
nikolamilosevic
Linked data and SPARQL are definitely a plausible solution, and the
infrastructure can be decentralized. There is some resistance to these
technologies in part of the community, because people are not used to them
and, compared to some other data stores, they tend to be a bit slower, but
that's another discussion. I do not have anything against these technologies.
What I currently don't like is that there are a lot of resources that are
technically open source and free, but they are buried somewhere on the
internet, sometimes hard to find, and it takes quite a lot of time to review
all existing resources. What I wanted to recommend is one central umbrella
organization that would be (1) a platform for collaboration in the biomedical
field, (2) a central endpoint to all major existing projects, possibly with
internal review that arranges projects into maturity levels, so it is
relatively easy to judge how much you can "trust" a project or its data, (3) a
central repository for open source NLP, data curation, and semantic web tools,
and (4) a relevant body that could propose and work on standards for data
curation that take into account all field-specific needs.

~~~
x1k
You have seriously underestimated 1) the effort needed to develop and maintain
such resources – your best hope is to work with government-funded institutes;
2) resistance from the conventions of a particular research field – you can
rarely bend how people in a field work on things; 3) cultural differences
between biologists/doctors and programmers – biologists/doctors think very
differently, which is frequently overlooked by programmers; 4) bureaucracy –
everyone thinks he/she is the best; when you work with top groups to make
things happen, you will find how problematic it is; 5) technical challenges –
as you care about phenotype data: there are no good ways to integrate various
phenotypes from multiple sources.

Everyone in biomedical research dreams about integrated resources. I have
heard multiple people advocating SPARQL as well. If it had been that easy,
this would have occurred years ago. In the real world, no one is even close.
If you want to attract collaborators, learn from Linus: say you have a working
prototype and demonstrate how wonderful it is. Your ideas are cheap. The
difficult part is a clear roadmap to make it happen.

~~~
dekhn
I agree strongly with this. I started out in biomedicine many years ago with
the same aims as the OP, but after a lot of experience, I think that
announcing the database resource is just the first step, it's an easy one, and
all the hard problems are the ones listed by x1k.

What I see happening in large orgs with lots of machine learning resources is
the development of new techniques to generate large amounts of homogeneous
phenotypic data across many measurement modalities. These large orgs have
biologist-doctors: the small number of people cross-trained well enough to
move between the two fields with ease. These orgs have gathered enough
resources to compel the leading researchers to work with them, and they're
starting to publish interesting papers.

~~~
x1k
Yes, I think those large companies may finally have a slim chance to
revolutionize data integration, but it is too early to tell yet. We will see.

------
michaelmachine
Hey, Mike from DrugBank here. Send me an email at mike@drugbank.ca to chat
more. You should take a look at
[https://www.openphacts.org/](https://www.openphacts.org/) which has a similar
goal to this project. I think one problem is this:
[http://xkcd.com/927/](http://xkcd.com/927/)

~~~
pgroth
Mike - DrugBank is brilliant.

One of the co-architects of Open PHACTS here. A pointer for developers is
dev.openphacts.org. All the source and data are open.

While I agree that standards are hard, I think the major issue is sustaining
these things. You need some set of people to code and curate, even if it's
small, and you need good uptime/support to gain community trust.

------
nairboon
Quite an interesting idea, so it'd be like extending the scope of GA4GH beyond
'just' genomics? [https://genomicsandhealth.org/work-products-demonstration-
pr...](https://genomicsandhealth.org/work-products-demonstration-
projects/genomics-api)

~~~
eggie
The Global Alliance for Genomics and Health is a similar idea, but not
designed to be completely open nor linked. They don't acknowledge the semantic
web. Practically, it is implemented as CORBA using JSON. Everything is an API,
and an implicit data model (in JSON) is then produced for each type of
concept. AFAIK security is a huge limitation here. The idea of GA4GH is that
data silos can communicate some things with each other, but not personally
identifiable information.

I work on the project but find it pretty uninspiring. It presents a dark image
of the future in which a handful of large tech companies control all of our
biomedical data and we have to beg them to allow us to share it. I guess that
sounds pretty similar to the present. Just switch bio and social and here we
are.

~~~
dekhn
I don't think that GA4GH is literally using CORBA. The data model is not
implicit, it's explicit (there is a schema). The "data silos can communicate
some things ... but not personally identifiable information" constraint is
placed on the alliance by the legal system. As for the semantic web, every bio
project I've seen that adopted the semantic web ultimately failed -- the
semantic web seems like a great idea, but attempts to implement it fully, to
the point where it's useful for research, always fail. So I think they're
focusing on areas where they are likely to succeed (collection and processing
of large amounts of raw and derived data using pretty conventional processes,
but at a much larger scale, with a solid authentication and access mechanism).

~~~
eggie
> I don't think that GA4GH is literally using CORBA.

It's not literally CORBA, but people who spent time implementing literal CORBA
in a bioinformatics context (for instance, the original author of
[https://github.com/bioperl/bioperl-corba-
server](https://github.com/bioperl/bioperl-corba-server)) have noted that the
design patterns and discussions followed by the GA4GH are pretty similar to
those held at the EBI when they attempted to unify everything using CORBA.

> The data model is not implicit, it's explicit (there is a schema).

There is a schema, but the semantics of the data model are encoded in the
comments of the schema. Without hooking into some kind of ontological basis it
doesn't seem possible to avoid this.

> As for the semantic web, every bio project I've seen which adopted the
> semantic web ultimately failed -- the semantic web seems like a great idea,
> but attempts to fully implement to the point where it's useful for research
> always fail.

I'm aware of at least one group in the GA4GH that uses RDF internally, then
converts it into the custom schemas produced by the group in order to maintain
compatibility with the top-down designs of the project. I believe this is the
phenotype group. These are the people most interested in what the author of
the linked page is describing, and they have decided to use the technology you
believe is doomed to fail. But they aren't failing. As far as I can tell from
their presentations, they are one of the two or three groups in the project
that have produced a functioning system.
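As a toy sketch of that "RDF internally, custom schema externally" pattern (this is not GA4GH code; the URIs and record layout are hypothetical, though the HP identifiers are real Human Phenotype Ontology terms):

```python
import json

# A tiny in-memory RDF-style graph: a set of (subject, predicate, object)
# triples. The ex: URIs are illustrative only, not an actual GA4GH vocabulary.
triples = {
    ("ex:patient1", "ex:hasPhenotype", "HP:0001250"),
    ("ex:patient1", "ex:hasPhenotype", "HP:0001263"),
    ("HP:0001250", "rdfs:label", "Seizure"),
    ("HP:0001263", "rdfs:label", "Global developmental delay"),
}

def label(term):
    """Look up the rdfs:label of a term in the triple store, if any."""
    for s, p, o in triples:
        if s == term and p == "rdfs:label":
            return o
    return None

def to_custom_schema(patient):
    """Flatten the graph into the kind of per-concept JSON record an API serves."""
    phenotypes = sorted(
        o for s, p, o in triples
        if s == patient and p == "ex:hasPhenotype"
    )
    return {
        "id": patient,
        "phenotypes": [{"term": t, "label": label(t)} for t in phenotypes],
    }

record = to_custom_schema("ex:patient1")
print(json.dumps(record, indent=2))
```

The graph stays the source of truth; the JSON view is derived from it on demand, which is roughly the compatibility strategy described above.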

It's very easy to say that hard things are impossible. This tends to keep them
that way. I doubt we have any other viable option for building large
distributed knowledge systems. The fact that these don't exist does not mean
they are impossible to construct, but simply that no one has managed to do so
yet. People leveled the same kinds of arguments against neural networks until
a few years ago, saying that they were a nice idea but destined to fail
because they were too hard.

> So I think they're focusing on areas where they are likely to succeed
> (collection and processing of large amounts of raw and derived data using
> pretty conventional processes, but at a much larger scale, with a solid
> authentication and access mechanism).

The scales we're talking about are not even an order of magnitude above that
which existing techniques allow. So I agree that they will succeed insofar as
they simply adopt these existing community-driven standards and slap access
control on top. However, in terms of generating new data models for genomics,
I'm not so convinced that the centralized design and API-based approach which
they are taking will work. I guess we will have to meet back here in a few
years and see what happened.

~~~
x1k
For NN, we have a clear target: for example, beat HMMs on speech recognition.
To achieve that, you write a tool and evaluate it on some standard test data
sets. You don't need to interact with many parties. NN is only technically
hard. For GA4GH, things are quite different. Technically it is hard, but much
simpler than NN in my view. What is hard is 1) we lack a target and 2) the
communication between developers and users. People don't know what we need,
don't know what the right approach is, and don't know how to evaluate success.
They have changed course back and forth, and still have clashes between
programmers and those with more biological background.

------
nikolamilosevic
Since this sparked quite an interest, I have created a mailing list and wiki.
For more information about the idea, you can read here
[http://inspiratron.org/blog/2015/12/18/starting-an-effort-
to...](http://inspiratron.org/blog/2015/12/18/starting-an-effort-to-bring-
biomedical-data-and-tools-together/) where you can find the links to mailing
list and wiki. I believe that would be a more appropriate place to collect
all efforts that currently exist, index them, and perhaps try to integrate
them and make them interoperable. Please join the mailing list.

------
eggie
I'm curious if the author knows about DisGeNET:
[https://en.m.wikipedia.org/wiki/DisGeNET](https://en.m.wikipedia.org/wiki/DisGeNET).

The whole idea of the semantic web is exactly what the author is getting at.
I'm curious why it is so rarely regarded as a serious basis for an effort
like the one the author is promoting.

------
IndianAstronaut
Didn't Galaxy want to do this as well?

------
afandian
ContentMine.org seems to have similar aims.

