Hacker News
Timeline: Organisms that have had their genomes sequenced (yourgenome.org)
31 points by lelf 1 hour ago | 8 comments





Note that this is a very incomplete list. For more complete overviews, see the Quick Guide to Sequenced Genomes [1], the NCBI's list of genome sequences [2], or the Wikipedia lists of sequenced bacterial [3] and eukaryotic [4] genomes (and their related lists).

[1]: http://www.genomenewsnetwork.org/resources/sequenced_genomes...

[2]: https://www.ncbi.nlm.nih.gov/genome/browse/

[3]: https://en.wikipedia.org/wiki/List_of_sequenced_bacterial_ge...

[4]: https://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_g...

You missed Xylella fastidiosa. It was sequenced in 2000.

This list highlights the most important organisms that have had their genomes sequenced, but look at [1] to see the full list of 22244 organisms that are currently sequenced.

Most human pathogens, like Staphylococcus aureus, Streptococcus pneumoniae, Escherichia coli, Salmonella enterica, Mycobacterium tuberculosis, ... have several thousand assemblies each.

[1] https://www.ncbi.nlm.nih.gov/genome/browse/

What determines the complexity of sequencing these? The number of pairs?

Today there is a multitude of genome sequencing techniques, each with its own complexities. However, the number of base pairs is seldom the main difficulty today.

Sanger sequencing was one of the first sequencing methods, and it reads linearly: it synthesises strands of increasing length. During the Human Genome Project, Celera instead took the shotgun approach: fragment the genome, amplify the fragments, sequence the fragments, and match them together using bioinformatics. The complexity here is that much of the DNA is not particularly unique (e.g. microsatellites). As such, a short 20-nucleotide sequence may be present in many parts of the genome, so it is often hard to generate a 100% complete, connected genome.
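To make the uniqueness problem concrete, here is a minimal Python sketch that lists every position where a short fragment occurs in a genome. The toy genome and the function name `occurrences` are made up for illustration; real genomes are far larger and real tools use indexed search rather than a linear scan.

```python
def occurrences(genome: str, fragment: str) -> list[int]:
    """Return every 0-based start position where fragment occurs (overlaps included)."""
    positions = []
    start = genome.find(fragment)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping hits are found too.
        start = genome.find(fragment, start + 1)
    return positions


# Hypothetical toy genome: the same 20-nt microsatellite-like repeat sits in two places.
repeat_20nt = "CA" * 10
toy_genome = "GGATC" + repeat_20nt + "TTGACC" + repeat_20nt + "AGT"

hits = occurrences(toy_genome, repeat_20nt)
print(f"20-nt fragment found at {len(hits)} positions: {hits}")
```

A read consisting only of that 20-nt fragment cannot be placed unambiguously, which is exactly what makes the final assembly hard to close.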

Today, Illumina is the major sequencing platform. It relies on fragmenting the DNA into ~300 bp fragments. By synthesising the complementary strand of each fragment with fluorescent nucleotides, we are able to sequence each fragment. Here we have the same complexity as with shotgun sequencing: a fragment's sequence may occur in multiple parts of the full DNA sequence.

To remedy this, error-prone long-read sequencing methods such as PacBio and similar platforms may be employed to generate long reads. These long reads may then act as a map for stitching together the more precise short reads.
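A rough Python sketch of the "long read as a map" idea: each accurate short read is placed at the offset of a noisy long read where it mismatches least, and the short reads are then ordered by that offset. All names and sequences here are hypothetical, and the sketch assumes the long read contains only substitution errors; real long reads also have insertions and deletions, so real hybrid assemblers use proper alignment instead of Hamming distance.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_offset(long_read: str, short_read: str) -> int:
    """Offset in long_read where short_read fits with the fewest mismatches."""
    n = len(short_read)
    offsets = range(len(long_read) - n + 1)
    return min(offsets, key=lambda off: hamming(long_read[off:off + n], short_read))


def order_by_long_read(long_read: str, short_reads: list[str]) -> list[str]:
    """Sort short reads by where they best fit on the long read."""
    return sorted(short_reads, key=lambda read: best_offset(long_read, read))


# Toy data (made up): the long read has two substitution errors,
# the short reads are exact but arrive in arbitrary order.
noisy_long_read = "ATTGCGTACCTTACCAGTCA"
short_reads = ["AGCAGTCA", "ATGGCGTA", "CCTTAGCA", "CGTACCTT"]

print(order_by_long_read(noisy_long_read, short_reads))
```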

Other sequencing methods, such as pyrosequencing, have the inherent problem of not being able to resolve long runs (more than about 5) of the same nucleotide in a row.
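As an illustration, this small Python sketch flags the runs of identical nucleotides that would be problematic under that limit. The default threshold of 5 is taken from the figure above and is only a rough assumption; the function name `long_homopolymers` is made up.

```python
from itertools import groupby


def long_homopolymers(seq: str, max_resolvable: int = 5) -> list[tuple[int, str, int]]:
    """Return (start, base, length) for every run longer than max_resolvable."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length > max_resolvable:
            runs.append((pos, base, length))
        pos += length
    return runs


# Toy sequence (made up) with one run of seven T's.
print(long_homopolymers("ACGTTTTTTTGA"))
```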

The other big complexity is the self-similarity of the genome. To sequence, the genome is duplicated, then sharded physically into many overlapping tiles around an average length, each tile starting at a different position. Each sequence is determined in parallel, and then the shards have to be reassembled computationally (really, this is a post-sequencing process, but is critical to being able to call the result a "sequence" rather than a bag of reads).
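A toy Python sketch of that shard-then-reassemble process, using a greedy "merge the largest overlap first" strategy. This is an illustration under simplifying assumptions, not how production assemblers work: reads are error-free, all the same length, taken from one strand, and tiled at a fixed step. The names `shred`, `overlap`, and `greedy_assemble` and the toy genome are made up.

```python
def shred(genome: str, read_len: int, step: int) -> list[str]:
    """Cut the genome into overlapping, fixed-length tiles."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair with the largest overlap until none is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


toy_genome = "ATGGCGTACCTTAGCAGTCA"  # made up, no internal repeats
reads = shred(toy_genome, read_len=8, step=4)
print(greedy_assemble(reads))
```

On a repeat-free toy genome like this one the reads collapse back into a single contig; the interesting failures start when the genome is self-similar.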

If the genome in question contains a lot of regions that are similar to each other, the algorithms that do the assembly will get confused.
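One way to see why, sketched in Python: if a repeat is at least k-1 bases long, two genomes that differ only in the order of the segments between the repeat copies produce exactly the same set of k-mers, so an assembler looking at k-mers alone cannot tell them apart. The sequences and names below are made up for the illustration.

```python
from collections import Counter


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Count every k-mer (length-k substring) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical 6-base repeat; with k = 5 it satisfies len(REPEAT) >= k - 1.
REPEAT = "TTAGGC"
genome_a = "AAC" + REPEAT + "GATC" + REPEAT + "CCGT" + REPEAT + "GAA"
genome_b = "AAC" + REPEAT + "CCGT" + REPEAT + "GATC" + REPEAT + "GAA"

k = 5
same_sequence = genome_a == genome_b
same_kmers = kmer_spectrum(genome_a, k) == kmer_spectrum(genome_b, k)
print(f"same sequence: {same_sequence}, same {k}-mer spectrum: {same_kmers}")
```

Longer reads (or larger k) that span the whole repeat plus some flanking sequence on both sides break the tie, which is why read length matters so much for repetitive genomes.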

There are regions with sparse coverage due to higher concentrations of C-G bases. This can make alignment results less reliable. Plasmids may interfere with alignment or need to be isolated out before or during library prep. And you want to detect and discard any members of the population with any evidence of cancer, genetic disease, etc.
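For the C-G point, a minimal Python sketch that scans a sequence in windows and reports the ones with a high GC fraction, which are the regions where one might expect coverage to thin out. The window size, step, and 0.7 threshold are arbitrary assumptions for the example, not values from any particular pipeline, and the function names are made up.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc_rich_windows(
    genome: str, window: int = 10, step: int = 5, threshold: float = 0.7
) -> list[tuple[int, float]]:
    """Return (start, gc_fraction) for each window at or above the threshold."""
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if gc >= threshold:
            hits.append((start, gc))
    return hits


# Toy sequence (made up): an AT-rich half followed by a GC-rich half.
print(gc_rich_windows("ATATATATATGCGCGCGCGC", window=10, step=10))
```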

And you need good population coverage. What's a normal variant? Newer methods propose a graph alignment instead of just trying to build a single sequence reference genome.

Yes, mostly the number of base pairs. Large repeats can also cause issues when finishing a genome.

