
Researchers generate complete human X chromosome sequence - mglauco
https://www.genome.gov/news/news-release/NHGRI-researchers-generate-complete-human-x-chromosome-sequence
======
ReaLNero
Dumb question: is there an x_chromosome.txt with the sequence in order? Why do
geneticists not talk about it this way?

~~~
kevinwuhoo
There is! You can find the current "agreed upon" human genome reference
segmented by chromosome here:
[https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/).
(It's not the assembly that's described in the article here.)

People do talk about the genome and its elements using the location by
chromosome number and range like you'd describe an index in a string. There
has even been special notation developed to do so [1]. However, it depends on
_how_ you're looking at biology.

I think an analogy would be: you can describe all code as machine code, but
when there are higher level abstractions you wouldn't choose to do so.

[1]:
[https://en.wikipedia.org/wiki/Locus_(genetics)](https://en.wikipedia.org/wiki/Locus_\(genetics\))
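
A rough sketch of the string-index analogy in code (the sequence and coordinates here are made up, not real genome data):

```python
# Toy illustration: a genomic locus like "chrX:9-12" is essentially a
# slice into one long string. Coordinates are 1-based and inclusive,
# following the common genomics convention.
def fetch_locus(genome, locus):
    """Look up a region given a 'chrom:start-end' locus string."""
    chrom, coords = locus.split(":")
    start, end = (int(x) for x in coords.split("-"))
    return genome[chrom][start - 1:end]  # convert to 0-based slicing

genome = {"chrX": "ACGTACGTTAGCCGATTACA"}  # made-up sequence
print(fetch_locus(genome, "chrX:9-12"))  # TAGC
```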

------
skunkworker
The article doesn’t mention it, but now I’m curious what kind of read length
and error rate they are achieving. This could have huge impacts across all
sequencing.

~~~
rcthompson
It looks like they're using Oxford nanopore and PacBio sequencing technologies
for the long reads. These are two up-and-coming sequencing technologies
focused on extremely long reads. My understanding of both is that their error
rates on individual base pairs are too high to reliably determine the actual
sequence on their own (something like 15% error rates). Typically the long
reads from these technologies are used as a "scaffold" to resolve the large-
scale structure of a DNA sequence, while another sequencing technology,
usually Illumina, is used to resolve the actual sequence. (Illumina produces
short reads, but it produces a _lot_ of them, and the error rate is much
lower, about 1%-5%.) In addition, since PacBio and Oxford Nanopore are very
different technologies, I'm guessing that they probably have different "error
profiles", so they probably partially cover for each others' deficiencies when
you use both of them at the same time.

Note: Don't take any of the specific numbers above as gospel. These
technologies develop extremely quickly, so it's quite likely that my knowledge
of typical error rates is out of date.

In any case, here's the relevant quote from the original link (to phys.org),
before it was changed to the less technical press release, which doesn't
mention any specific technologies used:

"The new project built on that effort, combining nanopore sequencing with
other sequencing technologies from PacBio and Illumina, and optical maps from
BioNano Genomics. Using these technologies, the team produced a whole-genome
assembly that exceeds all prior human genome assemblies in terms of
continuity, completeness, and accuracy, even surpassing the current human
reference genome by some metrics."
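
A back-of-the-envelope calculation (my own, not from the paper) illustrates why a high per-base error rate rules out reading the exact sequence directly but still leaves signal for scaffolding:

```python
# Probability that a k-base window contains zero errors, given a
# per-base error rate p: (1 - p) ** k. Illustrative numbers only.
def error_free_fraction(p, k):
    return (1 - p) ** k

# At a 15% error rate, almost no 20-base windows come through clean...
print(round(error_free_fraction(0.15, 20), 4))  # 0.0388
# ...but at 1% most do, which is why accurate short reads are used to
# "polish" the bases once the long reads resolve the layout.
print(round(error_free_fraction(0.01, 20), 4))  # 0.8179
```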

~~~
vikramkr
I don't think you can call PacBio up-and-coming at this point, but nanopore
certainly.

And those error rate examples are way, way too high - Illumina is closer to
Q30, which is a 1/1000 error rate [0]. 15% would result in an unusable
sequence.

[0]: [https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html](https://emea.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html)
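
For context, Phred quality scores like Q30 map to error probabilities on a log scale; a quick sketch of the conversion:

```python
import math

# Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10).
def phred_to_error(q):
    return 10 ** (-q / 10)

def error_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error(30))              # 0.001 -> the 1/1000 rate above
print(round(error_to_phred(0.15), 1))  # a 15% error rate is only ~Q8.2
```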

~~~
rcthompson
The sequencer may report a quality score of 30, but that doesn't guarantee
that the error rate when you align to the genome will actually be 1/1000.
Still, you're right that good quality Illumina data can do significantly
better than 1% error rate. You can't always get "good quality" data, but I
imagine that the researchers on this project probably could, given the well-
controlled experimental setup.

And yes, a 15% error rate does result in a sequence that is unusable for the
purposes of actually knowing the sequence. But a bunch of really long reads
with 15% error can still be used to resolve the large-scale structure of a
sequence, and then the lower-error-rate Illumina reads can be aligned onto
this large-scale scaffold in order to resolve the actual sequence. At least,
this is my understanding of how these technologies are typically used
together, and given the mention of PacBio, nanopore, _and_ Illumina, that
seems to be what was done in this case.

~~~
vikramkr
Yes, the higher-error reads are used for alignment - and even then, too high an
error rate in a very repetitive region (especially depending on the error type
- misreads vs. skipped bases, etc.) makes it too challenging to build a
scaffold to align your Illumina reads to.

As of 2018, the error rate for alignments with nanopore was around 3-6 percent:

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053456/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053456/)

------
andreygrehov
In simple English, could someone explain why this is good and what it all
means?

~~~
mr_overalls
From the article: "Repetitive DNA sequences are common throughout the genome
and have always posed a challenge for sequencing because most technologies
produce relatively short "reads" of the sequence, which then have to be pieced
together like a jigsaw puzzle to assemble the genome. Repetitive sequences
yield lots of short reads that look almost identical, like a large expanse of
blue sky in a puzzle, with no clues to how the pieces fit together or how many
repeats there are. . . . Filling in the remaining gaps in the human genome
sequence opens up new regions of the genome where researchers can search for
associations between sequence variations and disease and for other clues to
important questions about human biology and evolution."

Or, to make this more simple: Finding the complete DNA sequence of chromosomes
is difficult. That's because some parts of the sequence are highly repetitive.
Using a new type of lab machine, the scientists were able to sequence the
repetitive parts of the X chromosome. This gives a more complete picture of
the X chromosome. And that can help scientists fight diseases and understand
human biology better.
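
The jigsaw analogy can be made concrete with a toy example (made-up sequences): two chromosomes that differ only in how many times a repeat occurs yield exactly the same set of short reads, so short reads alone can't tell them apart.

```python
def kmers(seq, k):
    """All distinct overlapping length-k 'reads' of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

two_repeats   = "TT" + "ACGT" * 2 + "GG"
three_repeats = "TT" + "ACGT" * 3 + "GG"

# Short reads can't distinguish the two: identical read sets.
print(kmers(two_repeats, 4) == kmers(three_repeats, 4))    # True
# A read long enough to span the whole repeat region resolves it.
print(kmers(two_repeats, 12) == kmers(three_repeats, 12))  # False
```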

~~~
cco
So 20 years ago when we "sequenced the human genome", we actually didn't? If
you'd asked me whether or not we had a complete sequence of an X chromosome
before I saw this I would have said, "Of course we have one, for over 20
years".

~~~
dekhn
So much about the original announcements was overhyped PR. The original
assembly was super-crappy and super-gappy. The folks running the two projects
were exhausted and declared victory, then moved on.

Of all the fields that I've worked in, genomics has been one of the most
overhyped (virtual drug discovery is the other) and it takes a ton of training
just to understand how messed up the field is.

~~~
cco
Wow, this is news to me. What a farce haha. Thanks for adding clarity here and
breaking a bad assumption I had!

------
staycoolboy
I thought the human genome was mapped in 2003, when the Human Genome Project
wrapped up:
[https://en.wikipedia.org/wiki/Human_Genome_Project](https://en.wikipedia.org/wiki/Human_Genome_Project)

What's different about this?

~~~
arm85
[https://news.ycombinator.com/item?id=23852177](https://news.ycombinator.com/item?id=23852177)
explains it, with a quote from the article. They weren't able to map the whole
thing, because of repeating patterns. That's now starting to change.

------
JadeNB
There is a space missing in the title ("X chromosome").

~~~
notRobot
OP probably ran into the HN character limit.

Edit: Upon testing, that appears to not be the case. Probably a typo.

------
mrfusion
So how do you actually isolate one chromosome to sequence it?

~~~
murphyslab
Step 1: the researchers use a diploid cell line where the entire diploid X
chromosome is homozygous: both copies of the X chromosome in this cell line
are identical. This is part of why the researchers chose to look at the X
chromosome.

> To circumvent the complexity of assembling both haplotypes of a diploid
> genome, we selected the effectively haploid CHM13hTERT cell line for
> sequencing (abbr. CHM13)

Incidentally, they do capture the other chromosomes in this process:

> Several chromosomes were captured in two contigs, broken only at the
> centromere (Fig 1a).

> [https://www.biorxiv.org/content/10.1101/735928v3.full.pdf](https://www.biorxiv.org/content/10.1101/735928v3.full.pdf)

Step 2: Follow a procedure for DNA prep that results in long stretches of DNA
(though not an entire chromosome-length) and amplify (make multiple copies of)
the mixture, per this reference:

> [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/)

Step 3: Run the mixture through a nanopore sequencer (essentially a hole a few
nanometres across), reading the change in current in response to the different
bases, including methylated bases.

Step 4: Repeat this many many times to get multiple reads of each region of
the data:

> In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50×
> coverage, 1.6 Gb/flow cell, SNote 2). Half of all sequenced bases were
> contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the
> longest validated read was 1.04 Mb.
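
The point of that 50× coverage is that per-read errors mostly cancel out in a consensus; a toy majority-vote sketch (my own illustration, not the paper's polishing algorithm, which is far more sophisticated):

```python
import random
from collections import Counter

random.seed(1)
truth = "ACGTTGCAACGT"

def noisy_read(seq, error_rate=0.15):
    """Simulate a read where each base may be miscalled."""
    return "".join(
        random.choice("ACGT") if random.random() < error_rate else b
        for b in seq
    )

reads = [noisy_read(truth) for _ in range(50)]  # "50x coverage"

# Majority vote at each position recovers the true base.
consensus = "".join(
    Counter(col).most_common(1)[0][0] for col in zip(*reads)
)
print(consensus == truth)  # very likely True at 50x coverage
```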

Step 5: Overlay the data from the long measurements:

> Once we had collected sufficient sequencing coverage for de novo assembly,
> we combined 39× of the ultra-long reads with 70× coverage of previously
> generated PacBio data 18 and assembled the CHM13 genome using Canu 19. This
> initial assembly totaled 2.90 Gbp with half of the genome contained in
> contiguous sequences (contigs) of length 75 Mbp or greater (NG50), which
> exceeds the continuity of the reference genome GRCh38 (75 vs. 56 Mbp NG50).

> The read was placed in the location of the assembly having the most unique
> markers in common with the read. Alignments were further filtered to exclude
> short and low identity alignments. This process was repeated after each
> polishing round, with new unique markers and alignments recomputed after
> each round.

Step 6: Check the data against the reference genome:

> The corrected contigs were then ordered and oriented relative to one another
> using the optical map and assigned to chromosomes using the human reference
> genome.

> The final assembly consists of 2.94 Gbp in 590 contigs with a contig NG50 of
> 72 Mbp. We estimate the median consensus accuracy of this assembly to be
> >99.99%.
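
For reference, the NG50 quoted above is the contig length at which the contigs of that length or longer add up to at least half the estimated genome size; a quick sketch with made-up numbers:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of sorted
    contig lengths first reaches half the genome size."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return None  # contigs don't cover half the genome

# Made-up contigs for a toy "genome" of 100 units.
print(ng50([40, 30, 20, 5, 5], 100))  # 30 (40 + 30 >= 50)
```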

Essentially, this work closes up difficult-to-read gaps in the reference
genome (
[https://en.wikipedia.org/wiki/Reference_genome#Human_referen...](https://en.wikipedia.org/wiki/Reference_genome#Human_reference_genome)
)

~~~
mrfusion
Great write up! Thanks!

Regarding step 1 how can any human have an entire homozygous X chromosome?

Also/rather why not just use a male with one X chromosome?

~~~
bmsran
There are long regions on the Y chromosome that are very similar to the X
chromosome, which would make the analysis difficult:

[https://en.wikipedia.org/wiki/Pseudoautosomal_region](https://en.wikipedia.org/wiki/Pseudoautosomal_region)

------
pulse7
When I read the title my first thought was "Scientists found complete assembly
language of human X chromosome".

------
pyedpiper
go banana slugs!

------
maxerickson
The press release is better than the link:

[https://www.genome.gov/news/news-release/NHGRI-researchers-generate-complete-human-x-chromosome-sequence](https://www.genome.gov/news/news-release/NHGRI-researchers-generate-complete-human-x-chromosome-sequence)

~~~
dang
Ok, changed from [https://phys.org/news/2020-07-scientists-human-chromosome.html](https://phys.org/news/2020-07-scientists-human-chromosome.html). Thanks!

