
It remains difficult to observe genome variation in transposon content. The situation is improving as we get longer single-molecule reads, as these let us reach through these sequences into bits of DNA that let us anchor the position of transposons against genomes which we've already sequenced.
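
To make the anchoring idea concrete, here is a toy sketch, not how real pipelines do it (those use a proper aligner that tolerates read errors). It assumes an error-free read whose first and last `k` bases each occur exactly once in the reference; `unique_position` and `locate_insertion` are hypothetical names made up for this illustration.

```python
from typing import Optional, Tuple


def unique_position(text: str, kmer: str) -> Optional[int]:
    """Return the start of kmer in text, or None if it is absent or repeated."""
    first = text.find(kmer)
    if first == -1 or text.find(kmer, first + 1) != -1:
        return None
    return first


def locate_insertion(reference: str, read: str, k: int = 8) -> Optional[Tuple[int, str]]:
    """Return (reference_position, inserted_sequence), or None if nothing is found.

    Assumption: the read is error-free and both of its ends are unique in the
    reference. A read that ends inside a repeat cannot be anchored this way.
    """
    if len(read) < 2 * k:
        return None
    left_start = unique_position(reference, read[:k])
    right_start = unique_position(reference, read[-k:])
    if left_start is None or right_start is None:
        return None
    right_end = right_start + k
    ref_span = right_end - left_start   # reference bases covered by the read
    extra = len(read) - ref_span        # bases in the read with no reference match
    if ref_span < 2 * k or extra <= 0:
        return None
    # Walk in from the left anchor until the read stops matching the reference.
    i = 0
    while i < ref_span and read[i] == reference[left_start + i]:
        i += 1
    # The rest of the read must match the reference after the inserted bases.
    if read[i + extra:] != reference[left_start + i:right_end]:
        return None
    return left_start + i, read[i:i + extra]
```

A short read that starts and ends inside the inserted sequence has no unique ends to anchor with, which is why read length matters so much here.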

I think some people may have the idea that we can observe whole genomes easily, but consider the case of repeats like transposons. Half of the human genome is made up of these, but we still have trouble seeing when and where they are active. A new insertion of a big piece of DNA can have a much larger phenotypic effect than a little SNP, and yet our observational methods make the latter much easier to see than the former. It seems that structural variation in genomes is a likely place to find at least a partial solution to the missing heritability problem posed by the GWAS community.

https://en.wikipedia.org/wiki/Missing_heritability_problem




> I think some people may have the idea that we can observe whole genomes easily

I think a lot of people do. Certainly I did, before I spent a year working in a genomics institute and had the opportunity for close observation of the massive uncertainty produced by short reads, and the concomitant complexity of the methods required to develop useful information from them.

At that time, long reads were just beginning to look reasonably attainable with new approaches from Pacific Biosciences and other challengers to Illumina's market share; I'd be curious to know whether there's been any major movement in the half decade or so since I returned to industry.


The biggest development in single-molecule sequencing (aside from the steady improvement in PacBio's methods) is the arrival of the Oxford Nanopore devices on the market. These are also pretty rough around the edges, but they suggest a future in which labs have direct access to long-read, low-cost sequencing. Also, techniques like the one from 10x Genomics allow large-scale haplotype resolution, which is another missing component of a true "whole genome" sequence (humans and many other creatures carry more than one copy of the genome, and the copies differ).


It's been a few years since I was in genetics. What device currently gives the longest single reads? If I remember right, we were getting the best from the Roche 454 and the Illumina HiSeq, but the Ion Torrent was better at other things.


Probably the nanopore readers, but the reads are sometimes up to 30% wrong, which means you need multiple reads anyway to get a consensus.
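
A rough sketch of why stacking reads helps, under loud assumptions: the reads are already aligned, all the same length, and errors are independent substitutions only (real nanopore errors include indels and are not independent). Both function names are made up for this example.

```python
from collections import Counter
from math import comb


def majority_consensus(reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    if not reads:
        raise ValueError("need at least one read")
    if any(len(r) != len(reads[0]) for r in reads):
        raise ValueError("reads must be aligned to the same length")
    # Ties go to whichever base appears first in the column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def prob_majority_in_error(p, n):
    """Chance that more than half of n independent reads are wrong at one site.

    This is a pessimistic bound on a wrong consensus call, since it treats
    every erroneous read as agreeing on the same wrong base.
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n // 2 + 1, n + 1))


# Three reads of the (assumed) true sequence ACGTACGTAC, each 30% wrong,
# with the errors at different positions in each read.
reads = ["TCGTACCTAG", "ACCTAGGTTC", "AGGAACGAAC"]
print(majority_consensus(reads))
for depth in (3, 9, 15):
    print(depth, prob_majority_in_error(0.3, depth))
```

The bound drops as the depth grows, which is the "multiple reads" cost mentioned above.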

Ion Torrent has very short reads (250bp), but I think they have 1kb working, at least in the lab.


Are there any (perhaps prohibitively expensive) techniques that give high fidelity for transposons? This seems like a problem well suited to machine learning, given a dataset of paired high quality and low quality sequence data.



> I think some people may have the idea that we can observe whole genomes easily

By and large, we can. Yes there are minute aspects which lack perfect resolution, but we are well into the sub-1000 dollar whole-genome age.

Particularly in humans, you can get a highly accurate reading of your variation across your entire genome, including insertions and deletions which are even easier to spot.

Of course there is cryptic variation, but it is only in really obscure genetic diseases and complex phenotypes that we start to run into trouble, and both can be addressed by simply sequencing a larger population of individuals.


> Particularly in humans, you can get a highly accurate reading of your variation across your entire genome, including insertions and deletions which are even easier to spot.

Insertions and deletions are usually harder to spot with the short reads that make up the data that you'll get back from your typical "sub-1000 dollar whole genome". The reason is simple. There are vastly more possible insertions and deletions than SNPs, and all of these must be considered by algorithms in order to detect them. Worse, as the length of the insertion or deletion grows to a reasonable fraction of the length of your reads, it becomes impossible to resolve the event without considering an untenable space of possible indels and opening yourself to spurious matching.
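
Back-of-envelope arithmetic for the size of that space, as a sketch: at a single reference position there are 3 possible SNPs, one possible deletion per length, and 4^L possible inserted sequences of length L. The function name and the choice to count per position are my own framing, not taken from any particular variant caller.

```python
def candidate_alleles_per_site(max_indel_length):
    """Count distinct alternative alleles at one reference position.

    SNPs: 3 alternative bases.
    Deletions: one candidate per length from 1 to max_indel_length.
    Insertions: 4**L candidate sequences for each length L.
    """
    insertions = sum(4**length for length in range(1, max_indel_length + 1))
    return {"snps": 3, "deletions": max_indel_length, "insertions": insertions}


for max_len in (1, 5, 10, 50):
    print(max_len, candidate_alleles_per_site(max_len))
```

Even with a 10bp cap there are more than a million possible insertions per site against 3 SNPs. Real callers prune this by only considering alleles that some read actually supports, but that only works when a read spans the whole event.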

Those cheap genomes have a serious blind spot: they don't easily yield information about the large-scale structural variation (indels, copy number, inversions, translocations) that is apparently very important to evolution. I believe the field has blinded itself to the importance of large variation simply because it is hard to observe. Recent papers based on long read data have started to respond to this assumption in a serious way (https://www.biorxiv.org/content/early/2016/09/24/076562).

> Yes there are minute aspects which lack perfect resolution

In the context of humans there is ample evidence that the things we are missing with short reads are not minute, but are rather an enormous elephant in the room; see https://biosci-batzerlab.biology.lsu.edu/Publications/Sudman... and http://science.sciencemag.org/content/349/6253/aab3761. They report that some genomic regions are expanding by up to 50-fold between individuals. Some whole human populations feature quarter-megabase duplications not present in other groups. The scope of these studies is actually very narrow, with only hundreds of individuals being considered. I would be surprised if this is anything more than the tip of the iceberg, and incredibly surprised if this turns out to be a minute detail.



