
So how do you actually isolate one chromosome to sequence it?



Their github has lots of information about what they do:

https://github.com/nanopore-wgs-consortium/chm13


Step 1: The researchers use a diploid cell line in which the X chromosome is entirely homozygous, i.e. both copies of the X chromosome are identical. This is part of why they chose to look at the X chromosome.

> To circumvent the complexity of assembling both haplotypes of a diploid genome, we selected the effectively haploid CHM13hTERT cell line for sequencing (abbr. CHM13)

Incidentally, they do capture the other chromosomes in this process:

> Several chromosomes were captured in two contigs, broken only at the centromere (Fig 1a).

> https://www.biorxiv.org/content/10.1101/735928v3.full.pdf

Step 2: Follow a DNA prep procedure that yields long stretches of DNA (though not entire chromosome-length molecules) and amplify (make multiple copies of) the mixture, per this reference:

> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/

Step 3: Run the mixture through a nanopore sequencer (essentially a hole a few nanometres across), reading the change in current in response to the different bases, including methylated bases.

Step 4: Repeat this many times to get multiple reads of each region of the genome:

> In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50× coverage, 1.6 Gb/flow cell, SNote 2). Half of all sequenced bases were contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the longest validated read was 1.04 Mb.
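The coverage figures in that quote are just total sequenced bases divided by genome size. A minimal sketch of the arithmetic, assuming a genome size of roughly 3.1 Gb (that value is my assumption, chosen because it reproduces the quoted 50× and 25×; the paper may use a slightly different number):

```python
# Back-of-the-envelope check of the coverage numbers quoted above.
# GENOME_SIZE is an assumption (about 3.1 Gb), not a figure from the paper.
GENOME_SIZE = 3.1e9


def coverage(total_bases, genome_size=GENOME_SIZE):
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size


total_bases = 155e9      # 155 Gb across all flow cells (from the quote)
long_read_bases = 78e9   # bases in reads of 70 kb or longer (from the quote)
flow_cells = 98

print(round(coverage(total_bases)))              # 50
print(round(coverage(long_read_bases)))          # 25
print(round(total_bases / flow_cells / 1e9, 1))  # 1.6 (Gb per flow cell)
```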

Step 5: Overlay the data from the long measurements:

> Once we had collected sufficient sequencing coverage for de novo assembly, we combined 39× of the ultra-long reads with 70× coverage of previously generated PacBio data 18 and assembled the CHM13 genome using Canu 19. This initial assembly totaled 2.90 Gbp with half of the genome contained in contiguous sequences (contigs) of length 75 Mbp or greater (NG50), which exceeds the continuity of the reference genome GRCh38 (75 vs. 56 Mbp NG50).
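For anyone unfamiliar with NG50: sort the contigs from longest to shortest and add up their lengths until you pass half of the genome size; the length of the contig that gets you there is the NG50. (N50 is the same thing measured against the total assembly length rather than the genome size.) A small sketch with made-up contig lengths, not the real assembly:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of genome_size.
    Returns 0 if the contigs cover less than half the genome."""
    target = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(contig_lengths):
    """Same statistic, measured against the assembly's own total length."""
    return ng50(contig_lengths, sum(contig_lengths))


# Toy contig lengths (arbitrary units), purely illustrative.
contigs = [50, 30, 10, 10]
print(n50(contigs))         # 50
print(ng50(contigs, 140))   # 30
```

Because NG50 is measured against a fixed genome size, it lets you compare assemblies of different total length, which is how the 75 vs. 56 Mbp comparison with GRCh38 works.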

> The read was placed in the location of the assembly having the most unique markers in common with the read. Alignments were further filtered to exclude short and low identity alignments. This process was repeated after each polishing round, with new unique markers and alignments recomputed after each round.
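The idea of placing a read by unique markers can be shown with a toy example. This is not the authors' pipeline, just a sketch that treats k-mers occurring exactly once in the assembly as the "unique markers" and assigns a read to the contig it shares the most of them with (the real method places reads at locations within the assembly and also filters alignments by length and identity). All names and sequences here are made up:

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unique_markers(assembly, k):
    """Map each k-mer that occurs exactly once in the whole assembly
    to the name of the contig it comes from."""
    counts = Counter()
    for seq in assembly.values():
        counts.update(kmers(seq, k))
    return {
        kmer: name
        for name, seq in assembly.items()
        for kmer in kmers(seq, k)
        if counts[kmer] == 1
    }


def place_read(read, markers, k):
    """Return the contig sharing the most unique markers with the read,
    or None if the read contains no unique marker at all."""
    hits = Counter(markers[km] for km in set(kmers(read, k)) if km in markers)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Two toy contigs that end in the same repeat but have different flanks.
assembly = {
    "ctgA": "ACCGTTGATTACAGATTACA",
    "ctgB": "TGGCAAGATTACAGATTACA",
}
markers = unique_markers(assembly, k=5)

print(place_read("GTTGATTACAGATTA", markers, k=5))  # ctgA (spans the flank)
print(place_read("GATTACAGATTACA", markers, k=5))   # None (repeat only)
```

The second read shows why long reads matter: a read that falls entirely inside a repeat carries no unique marker and cannot be placed.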

Step 6: Check the data against the reference genome:

> The corrected contigs were then ordered and oriented relative to one another using the optical map and assigned to chromosomes using the human reference genome.

> The final assembly consists of 2.94 Gbp in 590 contigs with a contig NG50 of 72 Mbp. We estimate the median consensus accuracy of this assembly to be >99.99%.
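To put ">99.99%" in perspective, consensus accuracy is often quoted as a Phred-scaled quality value, QV = -10 * log10(error rate). A quick sketch of the conversion, using the quoted accuracy as if it were exact (the paper gives it as a lower bound on the median, so read these as rough bounds, not results from the paper):

```python
import math


def phred_qv(accuracy):
    """Phred-scaled quality value for a given per-base accuracy."""
    return -10 * math.log10(1 - accuracy)


accuracy = 0.9999      # ">99.99%" treated as exactly 99.99%
assembly_size = 2.94e9  # 2.94 Gbp, from the quote

print(round(phred_qv(accuracy)))               # 40, i.e. about Q40
print(round((1 - accuracy) * assembly_size))   # 294000 expected errors at most
```

So even at 99.99% accuracy, a genome-sized assembly can still carry errors on the order of hundreds of thousands of bases.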

Essentially, this work closes up difficult-to-read gaps in the reference genome ( https://en.wikipedia.org/wiki/Reference_genome#Human_referen... )


Great write up! Thanks!

Regarding step 1, how can any human have an entirely homozygous X chromosome?

Also, or rather: why not just use a male, who has only one X chromosome?


There are long regions on the Y chromosome that are very similar to the X chromosome, which would make the analysis difficult:

https://en.wikipedia.org/wiki/Pseudoautosomal_region


In the good old (?) days, some chromosomal libraries were constructed using flow sorting. I'm not sure how often this is used nowadays for genome/chromosome sequencing projects.


One minor correction: in step 2 the DNA is not amplified, as this would reduce the fragment length and also lose the methylation information.


Good point. Here they're starting from a cell line, so presumably they just start with as much DNA as they can get from the cells. Amplification is usually needed in other scenarios where the sample is more limited, though from what I've read, nanopore sequencing doesn't need much DNA.


I don’t believe you do.



