
Timeline: Organisms that have had their genomes sequenced - lelf
http://www.yourgenome.org/facts/timeline-organisms-that-have-had-their-genomes-sequenced
======
jamessb
Note that this is a very incomplete list. For more complete overviews, see the
Quick Guide to Sequenced Genomes [1], the NCBI's list of genome sequences [2],
or the Wikipedia lists of sequenced bacterial [3] and eukaryotic [4] genomes
(and their related lists).

[1]:
[http://www.genomenewsnetwork.org/resources/sequenced_genomes...](http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml)

[2]:
[https://www.ncbi.nlm.nih.gov/genome/browse/](https://www.ncbi.nlm.nih.gov/genome/browse/)

[3]:
[https://en.wikipedia.org/wiki/List_of_sequenced_bacterial_ge...](https://en.wikipedia.org/wiki/List_of_sequenced_bacterial_genomes)

[4]:
[https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_g...](https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes)

------
BioGeek
This list highlights the most important organisms that have had their genomes
sequenced, but look at [1] to see the full list of 22,244 organisms that have
been sequenced so far.

Most human pathogens, like Staphylococcus aureus, Streptococcus pneumoniae,
Escherichia coli, Salmonella enterica, Mycobacterium tuberculosis, ... have
several thousand assemblies each.

[1]
[https://www.ncbi.nlm.nih.gov/genome/browse/](https://www.ncbi.nlm.nih.gov/genome/browse/)

~~~
fatboy93
Many are just deconvolutions of microbiomes.

Not that I'm against it as I'm very much into metagenomics, but many of these
genome assemblies can barely qualify as draft genomes.

Another important point here is strain-level resolution, which the depth of
sequencing has afforded us. For example, if you were to look at staph or any
other human pathogen, you'd find tens of distinct strains at a bare minimum.

------
deathanatos
When we say that we've "sequenced a genome" — what does that mean, exactly?
Just that we've scanned/digitized the DNA sequence of a single member of that
species? (Since of course the DNA of all members of a species isn't the same,
otherwise, we'd all look alike.) Do we ever sample more than one member of a
species? How many members of various species have we sampled? (While I'd
expect them to be similar, of course, I'd also expect the interesting things
to be in the differences.)

~~~
08-15
> "sequenced a genome" — what does that mean, exactly?

It means we got the sequence of As, Cs, Gs and Ts; but in practice, we don't
even achieve that.

The human genome has approximately 3 billion nucleotides in 23 chromosomes. So
you would think a "sequenced genome" is 23 strings over the alphabet {A,C,G,T}
with a length of 130 million or so each. However, it is only possible to
sequence a few hundred nucleotides at a time. So you end up with a jigsaw
puzzle of tiny pieces, some containing errors, some missing, some exact
duplicates of each other. Putting these back together correctly isn't even
possible; doing it approximately is still difficult.

So in practice, a sequenced genome is a sequence of medium size strings,
arranged in the mostly correct order, with many small and a few large gaps, so
they form a good chunk of the long strings that would really be a genome. And
even if all of that was completely correct, we still wouldn't know what most
of these nucleotides mean.

The first "draft" of a genome is usually even worse. If you simply dump DNA
fragments from a mammal onto a modern-day sequencer (Illumina HiSeq), and run
software to kind-of solve the jigsaw puzzle, you end up with a million pieces,
each about 2000 nucleotides in length, in no particular order. Since most
genes by far are longer than that, I'd say it's fraudulent to call something
like that a genome, but it's being done all the time.
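The jigsaw metaphor above can be sketched as a toy greedy overlap assembler: repeatedly merge the two reads that overlap the most. Everything here (the genome string, read length, minimum overlap) is invented for illustration; real assemblers also have to cope with sequencing errors, repeats, and billions of reads.

```python
# Toy greedy overlap assembly: join "puzzle pieces" by their overlaps.
# The genome and reads below are made up for this example.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    """Merge reads until no pair overlaps by at least `min_len` bases."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_a, best_b = 0, None, None
        for a in reads:
            for b in reads:
                if a is not b:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_a, best_b = n, a, b
        if best_n == 0:   # no overlaps left: disjoint contigs (gaps in the draft)
            break
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_n:])
    return reads

genome = "ATGGCGTGCAATTGCC"
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 4)]
print(greedy_assemble(reads))   # ['ATGGCGTGCAATTGCC']
```

With error-free, evenly spaced reads the greedy merge recovers the original string; with repeats or missing reads it stops early and leaves the fragmented contigs described above.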

> Do we ever sample more than one member of a species?

Yes, but not in the same way. Instead of (approximately) solving the
impossible jigsaw puzzle again (that's called de-novo assembly), we tend to
compare the puzzle pieces from a new individual against the big picture we got
from the first genome (that's called mapping). This way, we can see small
differences (single substitutions, small insertions and deletions), which is
most of the variation present in a species. What we can't see is large-scale
variation, like duplicated genes, or rearrangements of large pieces of the
genome, or insertions of retroviruses. We know these exist; it's just more
expensive to look for them. (All of this is changing as sequencing becomes
cheaper.)
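The mapping idea can be sketched as: slide each read along the reference, keep the position with the fewest mismatches, and report the mismatching columns as candidate substitutions. The reference and reads are invented for this example; real mappers (BWA, bowtie2, minimap2) also handle indels, quality scores, and genome-scale indexes.

```python
# Sketch of mapping reads from a new individual against a reference
# and collecting small differences (substitutions only).
# All sequences are invented for this example.

reference = "ATGGCGTGCAATTGCCATTA"

def map_read(read, ref, max_mismatches=2):
    """Return (position, mismatches) of the best placement, or (None, _)."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(ref) - len(read) + 1):
        mm = sum(1 for r, g in zip(read, ref[pos:pos + len(read)]) if r != g)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

def call_substitutions(reads, ref):
    """Report (position, ref_base, read_base) for each mismatch seen."""
    variants = set()
    for read in reads:
        pos, _ = map_read(read, ref)
        if pos is None:                      # read didn't map anywhere
            continue
        for i, (r, g) in enumerate(zip(read, ref[pos:pos + len(read)])):
            if r != g:
                variants.add((pos + i, g, r))
    return sorted(variants)

# The middle read carries one substitution: T -> A at reference position 12.
reads = ["ATGGCGTG", "GCAATAGC", "GCCATTA"]
print(call_substitutions(reads, reference))   # [(12, 'T', 'A')]
```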

Multiple individuals of a species get sequenced when there is interest.
Probably tens of thousands of humans have been sequenced by now. For model
organisms, the number is much lower. For pathogens, where everything is much
easier anyway, the number could easily be higher.

~~~
viewtransform
When they compare the sequenced genomes of two species, say mouse and human,
do they compare the entire genomes or just the known coding regions ? Are
there similar regions in mouse and human that we have no clue what they are
about ?

~~~
08-15
What's compared is usually the "homologous regions". That means literally
"similar because of common descent", but in practice, it means "similar enough
that we think it's homologous". Since parts of genomes can become duplicated,
we sometimes have more than one homologous piece in a different species, and
then it's typically excluded from the comparison, because we don't know which
of the two pieces to compare to.

As a practical example, it has been reported that humans and chimpanzees "are
98.6% identical". That's shorthand for "in the detectably homologous parts of
their genomes, 98.6% of the nucleotides of chimpanzees and humans are
identical". This ignores about 25% of the chimpanzee genome and 20% of the
human genome, for which no homologous sequence is known. It also ignores small
insertions and deletions, because it's unclear how to define similarity in the
presence of those.
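The "percent identical" shorthand amounts to: take an alignment of the homologous regions, skip the columns containing indels, and count matching columns. The two aligned toy sequences below are invented; gaps are written `-` as in standard alignment formats.

```python
# Percent identity over an alignment, ignoring indel columns, the way
# the quoted human-chimp figure does. The aligned sequences are made up.

def percent_identity(aln_a, aln_b):
    """Identity over gap-free columns of two equal-length aligned strings."""
    ungapped = [(a, b) for a, b in zip(aln_a, aln_b)
                if a != '-' and b != '-']
    matches = sum(1 for a, b in ungapped if a == b)
    return 100.0 * matches / len(ungapped)

seq_a = "ATG-CGTACGTTAGC"   # one insertion relative to seq_a,
seq_b = "ATGGCGTACGATA-C"   # one deletion, one substitution
print(round(percent_identity(seq_a, seq_b), 1))   # 92.3
```

Note how two columns (the indels) simply drop out of both the numerator and the denominator, which is exactly why such figures understate large-scale differences.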

There are definitely "similar regions" we have no clue about. Comparisons are
definitely not restricted to protein coding regions; we typically compare at
least whole genes. That includes introns, which sometimes contain regulatory
elements, but most of the time nobody knows what, if anything, they do. The
same goes for the untranslated regions of genes (5'\- and 3'-UTR), which
definitely contain regulatory elements, but for the most part aren't
understood. And there are probably homologous intergenic regions, too, and we
know nothing about their function.

I don't remember exact numbers for human-mouse comparisons, but I don't need
them to know that we end up comparing incomprehensible stuff: We "understand"
1.5% of the genome (the protein-coding pieces). That's shallow
understanding, but at least something. We could pretend that we also
understand any piece of the genome that contains a few scattered regulatory
elements. Counting very generously, that might be 15% of the genome. The
remainder is completely mysterious. In the human-chimpanzee comparison, ~70%
of the genomes are comparable. Clearly, we have no idea what most of that
stuff does.

------
Aardwolf
Does anyone know how expensive it is to fully sequence your own genome today
(not genotyping but full sequencing)? And how complete is it, would every bit
be correct?

------
supersan
This is super interesting for someone who has no knowledge of the field. It
has piqued my interest as to how they do it, what the number of bases is, why
mice have more bases than us, etc. Very nice presentation-wise, even if it is
incomplete (as other comments state).

~~~
alextheparrot
I can address some of these.

Sequencing today is done mostly using computational methods. Think of DNA as a
couple of long strings (the number of bases is effectively the character count
of those strings; each string is a "chromosome" in higher organisms), so the
problem is how we read these long, physical strings. It turns out that
parallel processing is way more effective, so we break the really long strings
(millions of characters) into much, much smaller strings (often hundreds of
characters) that overlap. Because the strings overlap, we can construct a good
portion of the actual sequence computationally by exploiting this overlapping
feature of our small strings.

The physical way they do this is by using machines (Think GPU vs CPU) that are
effectively a bunch of parallel microscopes specialized to read those short
strings and by "attaching" colors to each of the characters (DNA bases).
Initial DNA sequencing methods lacked both the computational and physical
devices to do this, so they were done by hand. The move from doing sequencing
by hand to doing it computationally is why we see the significant increase in
characters read (Number of bases).

Your last comment I think is the most interesting, as it effectively asks "Why
do mice have a larger string size than us, which means they contain more
information on an absolute level?". The answer is just because. The number of
bases, or even the number of blocks of information that produce proteins
(These blocks are called genes, and a protein is another chemical construct
that mainly focuses on doing actions in the cell), is not strongly correlated
with the complexity of the organism. The key is how those bases interact, not
necessarily how many there are.

If you have any more questions or need some clarification I'd love to address
them, it is a wonderful time to be alive.

~~~
valine
This is really interesting. For someone who knows nothing about the subject,
how were DNA strands physically read at a low level before computational
methods? I was under the impression DNA is too small to see without an
electron microscope. You mention reading dna by hand, and I'm really
interested in how that is done.

~~~
alextheparrot
Given a string, I can easily discern one characteristic, which is length.
That's because the length of the string is tied to how "massive" it is and
thus when I push on things that are more massive they move more slowly. That's
the general idea behind gel separation.

Now, I just need a way to make all the combinations of substrings starting
from the first position (0 => 1, 0 => 2, etc.). This is a bit more difficult
to explain and chemically intensive, but let's assume for each character (C)
we have another character (C') that is pretty much the same thing. The key
difference, however, is that C' is marked (Radioactivity or with something
that lights up) AND that it doesn't allow any more characters to be added on.
If each distinct C' is a different color, we can now distinguish between our
different substrings, based entirely on the last character. We know that our
strings are ordered by size, so we can construct our original sequence based
on the terminal member of the substrings.
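The scheme above can be simulated in a few lines: generate every prefix of the template (each ending in a labelled terminator C'), sort the fragments by length as the gel would, and read the labelled last base off each band. The template string is invented for the example.

```python
# Simulation of chain-termination ("Sanger-style") readout as described
# above. The template sequence is made up for this example.
import random

template = "GATTACA"

# Each fragment is a prefix of the template whose last base is the
# labelled terminator (the C' character in the comment above).
fragments = [template[:n] for n in range(1, len(template) + 1)]
random.shuffle(fragments)        # the reaction yields them in no order

# "Gel separation": order fragments by length (mass), then read the
# labelled terminal base of each band, shortest to longest.
read_sequence = "".join(f[-1] for f in sorted(fragments, key=len))
print(read_sequence)             # GATTACA
```

Sorting by length is the computational stand-in for the gel: more massive fragments migrate more slowly, so band order encodes fragment length.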

You can imagine this process being done by hand, it works for that. However,
it doesn't scale well to the millions and billions of base pairs we need in
the modern day.

As a fun aside, protein sequences were originally determined in a way pretty
much the inverse of this. For a given protein string, remove the first element
with chemistry. Then, try to figure out what you removed. Now take your string
of size N - 1 and repeat, until you have determined each character. This
method ended up not being tractable for DNA because of chemical differences.
Also, a lot of protein sequencing is done in a similar way to DNA sequencing,
in that we break up, shatter may be a better word, the protein. We then try to
construct the original protein based on how it shatters (like reconstructing a
window based on knowing where the pieces fall and where the baseball came
from).

------
mrtron
2016 Beaver [1].

[1] [https://www.utoronto.ca/news/happy-150th-canada-u-t-research...](https://www.utoronto.ca/news/happy-150th-canada-u-t-researchers-first-world-sequence-beaver-s-genome)

------
vijayr
What determines the complexity of sequencing these? The number of pairs?

~~~
Ovah
Today there exist a multitude of different genome sequencing techniques, and
distinct complexities are associated with each method. However, the number of
pairs is seldom the main complexity today.

Sanger sequencing was one of the first methods of sequencing, and employs
_linear_ sequencing: the synthesis of strands of increasing length. With the
advent of the Human Genome Project, Celera instead came up with the idea of
_fragmenting_ the genome, amplifying the fragments, sequencing the fragments,
and matching them together using bioinformatics. The complexity here lies in
that much of the DNA is repetitive (such as microsatellites), which makes it
hard to 'phase' the genome. As such, a short 20-nucleotide sequence may be
present in many parts of the genome, which makes it hard to generate a 100%
complete, connected genome.

Today, Illumina sequencing is the major sequencing platform (~85% of market
share). It relies on the fragmentation of DNA into ~300 bp fragments. By
synthesising the complementary strand of each fragment with fluorescent
nucleotides, we may employ lasers to detect (sequence) the nucleotides of the
fragments. Here we have the same problem as with shotgun sequencing: that we
have many repeats in the DNA sequence.
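The repeat problem is easy to demonstrate: a short read that falls entirely inside a repeated region fits the genome at more than one position, so its true origin cannot be decided. The genome and read below are invented for the example.

```python
# Why repeats break short-read assembly and mapping: the same subsequence
# occurs at several genome positions, so a read containing only that
# subsequence places ambiguously. Sequences are made up for this example.

genome = "ATCGTTTTGGCATTTTGACC"   # "TTTTG" occurs twice
read = "TTTTG"

positions = [i for i in range(len(genome) - len(read) + 1)
             if genome[i:i + len(read)] == read]
print(positions)   # [4, 12] -- two equally good placements
```

Longer reads (the PacBio/hybrid approach mentioned below) resolve this by spanning the whole repeat plus unique flanking sequence on both sides.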

To remedy this, error-prone sequencing methods with long read lengths, such as
IonTorrent/PacBio/etc., may be employed. These long reads may then act as a
map for stitching together the more precise short reads. This is called
'hybrid' sequencing.

Other sequencing methods, such as pyrosequencing, have the inherent problem of
not being able to discern too many (5+) of the same nucleotide in a row. Other
methods are primer-based (i.e. they need to know a short subsequence of the
DNA beforehand). This is problematic if we want to perform a de-novo whole
genome sequence. Note: Illumina does not rely on primers, and may be deployed
directly on unknown sequences, unlike pyrosequencing/Sanger sequencing.

------
roadnottaken
Would be nice to list the approximate number of genes, too.

~~~
liberalsurfer
Agreed, number of bases gives little information about the complexity of an
organism. Genes is a better metric.

~~~
08-15
"Number of characters gives little information about the complexity of a
program. Functions is a better metric."

When the human genome was sequenced (it was a ten-year project), lots of
pseudo-philosophers predicted the number of genes that would be found, with
numbers ranging from thousands to millions. With the genome in hand, we found
about 25,000 genes, which make up a few percent (1.5-3%) of the genome. We
then looked more closely, and now we can't even agree on the definition of
"gene" anymore, which is one reason why I can't give an exact percentage.

It turns out that one gene can produce more than one protein, or sometimes
none. There are lots of genes whose only function (assuming "function" is even
well defined) is to regulate other genes, a bit like a program could have
functions of higher order. We're slowly grasping the fact that we have no idea
how complex the genome and the machine interpreting it really are. Number of
genes really is no metric at all.

------
major505
You missed Xylella fastidiosa. It was sequenced in 2000.

