

Open Biology Data - elkingtowa
https://www.broadinstitute.org//scientific-community/data

======
folli
I consider myself an enthusiastic bioinformatician, but one thing that drives
me up the wall on a regular basis is the thousands of "standardized" formats
used to archive and share data.

A small example: a genome is commonly saved in FASTA format (a simple text
file containing the nucleotide sequence, with a header line that starts with
a greater-than symbol; one of the few cases where only one format is
currently in use, maybe just because no one was creative enough to invent
something different). The annotation of its genes (position, direction, name,
function, etc.) is most commonly stored as GFF (of which several versions,
such as GTF, GFF2 and GFF3, are still in use), BED, or GenBank. Those formats
only describe the tabular layout in which the information is stored. Whether
the actual name of a gene is stored under the identifier "ID" or "name" or
"gene_name" or whatever is still up to whichever undergrad got the honor of
submitting the annotation that day.
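
To make the naming problem concrete, here is a minimal Python sketch of the kind of workaround this forces on you. It reads the ninth (attributes) column of a GFF3 or GTF line and tries a list of candidate keys until one matches. The key list and its order (NAME_KEYS) and the function names are my own assumptions for illustration, not part of any specification, and the parser ignores edge cases such as semicolons inside quoted GTF values.

```python
from urllib.parse import unquote

# Candidate attribute keys that may hold a gene's name, in order of preference.
# This list is an assumption for illustration; real files may use other keys.
NAME_KEYS = ("Name", "gene_name", "gene", "name", "ID", "gene_id")


def parse_attributes(column9):
    """Parse the attributes column of a GFF3 or GTF line into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        eq = field.find("=")
        sp = field.find(" ")
        if eq != -1 and (sp == -1 or eq < sp):
            # GFF3 style: key=value, with percent-encoded values
            key, value = field.split("=", 1)
            value = unquote(value)
        else:
            # GTF style: key "value"
            key, _, value = field.partition(" ")
            value = value.strip().strip('"')
        attributes[key.strip()] = value
    return attributes


def gene_name(line):
    """Return the first matching name-like attribute of a feature line, or None."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    attributes = parse_attributes(columns[8])
    for key in NAME_KEYS:
        if key in attributes:
            return attributes[key]
    return None


if __name__ == "__main__":
    gff3 = "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=BRCA2"
    gtf = 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG01"; gene_name "TP53";'
    print(gene_name(gff3))
    print(gene_name(gtf))
```

Every lab ends up maintaining its own version of that key list, which is exactly the problem.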

~~~
davecap1
Hey folli, I'm the CTO of SolveBio (https://www.solvebio.com). We're a
startup that's trying to make it super easy to access biological data
(specifically for clinical genetic test pipelines). We're well aware of the
"file format problem", and we're trying to tackle it by building easy-to-use
APIs into our datasets. This allows for "format-less" retrieval of data
(basically JSON), which can then be exported into pretty much any format
through a set of helper tools.
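
As a rough illustration of the "JSON in, any format out" idea, here is a minimal Python sketch. The record layout (chrom, start, end, strand, name, with 1-based inclusive coordinates) and the helper names to_bed and to_gff3 are hypothetical and made up for this example; they are not SolveBio's actual API or schema.

```python
import json

# Hypothetical JSON record; the field names are assumptions for illustration.
# Coordinates are assumed to be 1-based and inclusive, as in GFF3.
RECORD_JSON = '{"chrom": "chr1", "start": 100, "end": 900, "strand": "+", "name": "BRCA2"}'


def to_bed(record):
    """Render a record as a BED6 line (0-based start, end exclusive)."""
    return "\t".join([
        record["chrom"],
        str(record["start"] - 1),  # convert the 1-based start to 0-based
        str(record["end"]),        # a 1-based inclusive end equals a 0-based exclusive end
        record["name"],
        "0",                       # score column, unused here
        record["strand"],
    ])


def to_gff3(record):
    """Render a record as a GFF3 gene line (1-based, inclusive coordinates).

    Assumes the name contains no characters that GFF3 requires to be escaped.
    """
    return "\t".join([
        record["chrom"],
        ".",                       # source
        "gene",                    # feature type
        str(record["start"]),
        str(record["end"]),
        ".",                       # score
        record["strand"],
        ".",                       # phase
        "Name=" + record["name"],
    ])


if __name__ == "__main__":
    record = json.loads(RECORD_JSON)
    print(to_bed(record))
    print(to_gff3(record))
```

The point is that the coordinate conventions and the attribute naming live in the exporters, so the stored record only has to be right once.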

We're looking for frustrated bioinformaticians to work with us and help
develop our platform, so feel free to shoot me an email if you're interested.

