
No human genome has ever been completely sequenced - shpx
https://www.statnews.com/2017/06/20/human-genome-not-fully-sequenced/
======
jessriedel
As usual, the journalist spends many paragraphs painting a picture of human
conflict before actually getting on to the interesting claim.

> The reason for these gaps is that DNA sequencing machines don’t read genomes
> like humans read books, from the first word to the last. Instead, they first
> randomly chop up copies of the 23 pairs of chromosomes, which total some 3
> billion “letters,” so the machines aren’t overwhelmed. The resulting chunks
> contain from 1,000 letters (during the Human Genome Project) to a few
> hundred (in today’s more advanced sequencing machines). The chunks overlap.
> Computers match up the overlaps, assembling the chunks into the correct
> sequence.

> That’s between difficult and impossible to do if the chunks contain lots of
> repetitive segments, such as TTAATATTAATATTAATA, or TTAATA three times. “The
> problem is, when you have the same exact words, it’s hard to assemble,” said
> Lander, just as if jigsaw puzzle pieces show the same exact blue sky.

> In 2004, the genome project reported that there were 341 gaps in the
> sequence. Most of the gaps — 250 — are in the main part of each chromosome,
> where genes make the proteins that life runs on. These gaps are tiny. Only a
> few gaps — 33 at last count — lie in or near each chromosome’s centromere
> (where the two parts of a chromosome connect) and telomeres (the caps at the
> end of chromosomes), but these 33 are 10 times as long in total as the 250
> gaps.
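
The chop-and-overlap process the quote describes can be sketched in a few lines of Python. This is a toy greedy assembler on a tiny repeat-free genome, purely illustrative (real assemblers work at vastly larger scale and use overlap or de Bruijn graphs):

```python
# Toy shotgun assembly: chop a sequence into overlapping reads, then
# repeatedly merge the pair of fragments with the largest suffix/prefix
# overlap. Works here only because the toy genome has no repeats.

def chop(seq, read_len, step):
    """Sample overlapping fixed-length reads from a sequence."""
    return [seq[i:i + read_len] for i in range(0, len(seq) - read_len + 1, step)]

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(dict.fromkeys(reads))  # drop duplicate reads
    while len(reads) > 1:
        # find the pair with the largest overlap and merge it
        n, a, b = max((overlap(a, b), a, b) for a in reads for b in reads if a != b)
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])
    return reads[0]

genome = "ATGCGTACGTTAGCATCGGATCC"          # made-up, repeat-free
reads = chop(genome, read_len=8, step=3)    # overlapping "chunks"
print(greedy_assemble(sorted(reads)) == genome)  # True
```

The `sorted()` call simulates the machine not knowing read order; the overlaps alone are enough to recover this genome, which is exactly what fails once repeats appear.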

I recommend reading HN user eggie's comments on this from last week.

[https://news.ycombinator.com/item?id=15482439](https://news.ycombinator.com/item?id=15482439)
[https://news.ycombinator.com/item?id=15483462](https://news.ycombinator.com/item?id=15483462)

~~~
shpx
[https://youtu.be/fCd6B5HRaZ8](https://youtu.be/fCd6B5HRaZ8) is the best
visualization of how the most popular type of DNA sequencer works (that I've
found).

Imagine you have a string of length 3 billion made by randomly choosing from 4
characters. Like this

    
    
        import random
        dna = ''.join(random.choices('atgc', weights=[30.9, 29.4, 19.9, 19.8], k=3_234_830_000))
    

you get to randomly sample 1 billion[3, page 7] overlapping substrings of
length 200[3, page 7], with 0.1% of the characters randomly changed[3, page
8]. Recovering the original string from these samples is an underdetermined
problem: if there's a sequence 400 characters long that repeats multiple
times, how could you know whether it repeats 5 times or 50 times? (This would
be unlikely to happen with random.choices(), but DNA isn't random.) This is
called sequence assembly and it's one of the hard problems in
bioinformatics[4].
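
A scaled-down sketch of that sampling step (sizes shrunk from 3.2 billion / 1 billion so it runs instantly; the error model is a crude stand-in for the real machine's behavior):

```python
import random

random.seed(0)
BASES = 'atgc'

# Shrunk-down version of the thought experiment: a short random "genome",
# many overlapping reads, and a 0.1% per-base substitution error rate.
genome = ''.join(random.choices(BASES, weights=[30.9, 29.4, 19.9, 19.8], k=10_000))

def sample_read(seq, read_len=200, error_rate=0.001):
    """Take one read from a random position and corrupt it with point errors."""
    start = random.randrange(len(seq) - read_len + 1)
    read = list(seq[start:start + read_len])
    for i in range(len(read)):
        if random.random() < error_rate:
            read[i] = random.choice(BASES.replace(read[i], ''))  # swap in a different base
    return ''.join(read)

reads = [sample_read(genome) for _ in range(1_000)]
```

Assembly then means reconstructing `genome` from `reads` alone, without knowing any of the `start` positions.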

[0]
[https://docs.python.org/3/library/random.html](https://docs.python.org/3/library/random.html)
random.choices() was added in Python 3.6

[1] [http://www.biology-pages.info/B/BasePairing.html](http://www.biology-pages.info/B/BasePairing.html)
source for `weights`

[2]
[https://en.wikipedia.org/wiki/Human_genome](https://en.wikipedia.org/wiki/Human_genome)
source for 3_234_830_000 (python ignores underscores in numbers)

[3] [http://sci-hub.io/10.1111/j.1755-0998.2011.03024.x](http://sci-hub.io/10.1111/j.1755-0998.2011.03024.x)
Illumina is the most popular producer of genome sequencers

[4]
[https://news.ycombinator.com/item?id=5123377](https://news.ycombinator.com/item?id=5123377)

~~~
nether
Side note: your comment is one of the few times I've seen the code tag used
for actual code on HN, rather than quotations or just indentation.

~~~
Bromskloss
Yeah, why does Hacker News not have a real way to indent things? It wouldn't
make loading the page any less light-weight.

~~~
labster
Seriously. I can't go a week in HN without reading someone's complaint about
not being able to read a quote, probably on mobile. If users keep complaining
regularly, it's a problem with the software interface, not the users.

~~~
nether
I emailed dang about it a while (years) ago. He said he's added it to the fix
list, after initially saying that having a quote tag would mess up the site's
"character." You may want to email him too.

------
mxwsn
> “The problem is, when you have the same exact words, it’s hard to assemble,”
> said Lander, just as if jigsaw puzzle pieces show the same exact blue sky.

This is inaccurate, though the fault is with the author, not Lander. When you
have genomic repeats, the appropriate analogy is if you had multiple puzzle
pieces that had the exact same edges on all sides.

Puzzle pieces that show the same blue sky but have different edges can still
be uniquely assembled. But puzzle pieces with identical edges create multiple
equally satisfactory assembly solutions, and that is exactly the situation in
genome assembly.
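
The ambiguity is easy to demonstrate directly: two genomes that differ only in how the unique segments are ordered between three copies of a repeat yield exactly the same collection of short reads, so no assembler could tell them apart. A toy construction (the segment sequences are made up; the TTAATA motif is from the article):

```python
# Two different "genomes" sharing three copies of a repeat that is longer
# than the read length. Swapping the unique segments between repeat copies
# changes the genome but not the multiset of short reads it produces.
REPEAT = "TTAATA" * 4          # 24 bp, longer than the 8 bp reads below

genome1 = "ACGTGGCA" + REPEAT + "CCGATTAG" + REPEAT + "GGCATACT" + REPEAT + "TTGCACGG"
genome2 = "ACGTGGCA" + REPEAT + "GGCATACT" + REPEAT + "CCGATTAG" + REPEAT + "TTGCACGG"

def all_reads(seq, k=8):
    """Every length-k substring, sorted so multisets can be compared."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

print(genome1 != genome2)                        # True: genuinely different genomes
print(all_reads(genome1) == all_reads(genome2))  # True: identical reads
```

Longer reads that span an entire repeat copy would break the tie, which is why long-read sequencing (discussed below) helps.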

------
iskander
It's strange that this article ends as an advertisement for PacBio sequencing
(which can produce ~50k-60k base reads) but makes no mention of Oxford Nanopore (which
has gotten megabase reads and keeps improving). Single molecule nanopore
sequencing is on track to sequence across the centromeres of human chromosomes
in the next few years.

~~~
cycrutchfield
How many sequencers has Oxford Nanopore sold? It seems like it has perennially
been an “in the next few years” technology.

~~~
coolnow
I'm not sure how many they've actually sold, but I do know that they have a
loan/test program where you can pay a relatively small amount (around $2000 or
less) to get set up with a MinION kit. The main recurring cost is the flow
cell. I'm currently trying to figure out how to join together short
multiplexed samples (150bp - 200bp) and get it sequenced on the MinION.
Current short-read sequencers from Illumina or ThermoFisher cost orders of
magnitude more than Oxford Nanopore's offering. Hopefully someone else is
trying to use the platform for short reads.

~~~
car
Have a look at SAGE
([https://en.wikipedia.org/wiki/Serial_analysis_of_gene_expres...](https://en.wikipedia.org/wiki/Serial_analysis_of_gene_expression)),
might give you some ideas for joining short reads. I suppose you'd want to
look for a restriction enzyme that creates average fragment size in your
desired range.

------
AaronFriel
A question and a story for any intrepid biologists and geneticists. I'm seeing
in other comments that this is fundamentally a problem with how genome
sequencing works today.

When I was a movie theatre projectionist, I had a similar problem. Movies have
to be assembled onto platters, from a handful of reels which contain 10-20
minute lengths of film. There's about a mile of film per hour, and to assemble
it quickly you have a motorized platter and a table with its own motor with
dials to control the speed of one or the other. How it should work is you take
a center ring - a circle with a gap to wind film through and metal spokes that
sit in the platter - and you spin the film on. You put the reels one by one
onto this, stopping with each one to cut the footer of one and the header of
the next to splice them together with tape.

Well, that's how it's supposed to work. Sometimes, the film tightens up, when
moving the film from one platter to another. The center ring spokes might not
fit, or might create additional stress.

Well, I went full speed ahead anyway, and when a tiny bit of slack resulted in
a sudden jerk on the platter, the center ring popped off, and about an hour of
film flew over my hand and into the wall behind me. What happened next is
almost impossible to describe. A circular disk of film and metal hitting a
solid wall briefly became a vertical column as the tension-less film was
forced to go any direction except forward, and the result then splayed itself
out on the floor in front of me in a tangled mess a mile long.

Perhaps like an early sequencing machine, my first attempt at recovering this
was to look at a piece of the film and determine what part of the movie it was
from, and to begin throwing away the parts I thought were from trailers.

They weren't. It was the wrong movie.

I despaired, the film was a mile long and tangled into unbelievable knots. I
had to cut it to untangle it. But how?

And it hit me, I knew where a handful of colored markers were, and I began
randomly pulling segments of film out from the mess, placing masking tape
across a frame, and making 3 lines in 3 different colors. I made dozens of
these little loops, and that was how I put it together. The process of
cutting, labeling, and placing the segments onto new reels made my shift from
that night run into the next day's matinee.

This process was painstaking, and that's where my question leads: can we label
the ends when we cut up the DNA, like I did with my film? Are we stuck
watching chromosomes splash against the wall and shatter into indecipherable
pieces?

~~~
bsmrea
This is a nice analogy to genome assembly and there are some technologies that
try to label DNA before sequencing to aid reconstruction. The one that comes
to mind when reading your post is the 10X Genomics system, which acts as a
preprocessor before Illumina sequencing. The idea is that you can introduce
nucleotide tags into the DNA molecule(s) before sequencing, then use the tags
to figure out what short reads came from the same DNA fragment later on. There
is a nice video on 10X's website explaining how this works and how it helps:

[https://www.10xgenomics.com/technology/](https://www.10xgenomics.com/technology/)
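
The core of the tagging idea can be sketched as follows. This is an illustrative toy, not 10X's actual data format (the barcodes and sequences here are made up): reads that share a barcode are grouped as having come from the same long fragment, which can then be assembled locally.

```python
from collections import defaultdict

# Toy linked-read model: each read is (barcode, sequence), where every read
# derived from the same long DNA fragment carries the same barcode.
reads = [
    ("BC01", "ACGTACGT"), ("BC02", "TTGACCAA"),
    ("BC01", "GTACGTTA"), ("BC02", "GACCAATG"),
    ("BC01", "CGTTAGGC"),
]

# Group reads by barcode: each group is one fragment's worth of reads,
# like the projectionist's color-coded loops of film.
by_fragment = defaultdict(list)
for barcode, seq in reads:
    by_fragment[barcode].append(seq)

print(sorted(by_fragment))        # ['BC01', 'BC02']
print(len(by_fragment["BC01"]))   # 3
```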

------
joering2
Slightly off-topic, but this semi-animation of the DNA replication machinery
that is working right now in your body (in trillions of copies) blew my mind a
few months ago.

from TED:
[https://youtu.be/WFCvkkDSfIU?t=3m39s](https://youtu.be/WFCvkkDSfIU?t=3m39s)

------
Bartweiss
> A gene called ARHGAP11B, which was created by one such duplication, causes
> the cortex to develop the myriad folds that support complex thought;
> SRGAP2C, also a duplication, triggers brain development.

A question I've never seen addressed: is one justification for junk DNA to
create space for beneficial mutations?

Obviously some mutations are actively harmful, but the reason most mutations
are destructive is that they break something already-useful into something
inert. That's not a risk with non-coding (or irrelevantly-coding) DNA, which
means that in return for spending energy on copying junk DNA, we get space
where even barely-useful changes will be net positive.

Is this obvious to experts? Completely stupid for reasons I don't know? Even a
possibility worth discussing?

~~~
Obi_Juan_Kenobi
Not stupid at all.

If you're interested in that, I'd look up "Whole Genome Duplications". For
instance, corn underwent a WGD about 7 million years ago (IIRC, the number
might be off), and we can identify an 'a' and 'b' genome. After these events,
most of the duplicated genes will be lost, either through purifying selection or
drift. However, some will stick around and take on new subset functions (see
nkrumm reply).

If you really want to get into the weeds on this, there are many metrics that
could, in theory, affect the 'evolvability' of a genome. Higher transposon (or
any sort of repetitive element) content can lead to large-scale shuffling of
genome structure, such as inversions, duplications, etc. The ecological
success of the grass family, for instance, may have its roots in genome
structure.

Now, there's not a whole lot we can say about this because it's not something
we have a really good grasp of. It's mostly some hypotheses that are quite
difficult to test, but this is some of the most interesting genomic research
going on, in my opinion.

All of this is quite difficult to understand, however, as it really depends on
epigenetic mechanisms like chromatin state, dicers, RISC, and all that. Figure
that out, and you'll realize there's no 'junk' DNA, just various non-coding
sequence that's part of this larger-scale genome evolution.

~~~
quotemstr
Why is it that plants are so much better than animals at tolerating
chromosomal shenanigans? Plants can be highly polyploid, get chromosomes swapped
out, randomly get duplicates, etc., and not seem to suffer massive problems.
In a human, you duplicate one chromosome, and you get a very buggy phenotype.

~~~
a_bonobo
This has, as usual, several reasons :)

One reason is the absence of sex chromosomes in plants. If humans get a
duplicated sex chromosome, things go completely haywire. Another reason is
that, due to so many WGDs, plants have ample gene copies, so the copies can
freely mutate or even break; there's always a backup.

Not all of these shenanigans work out; in plants this has the wonderful name
'genomic shock'. In wheat there's a candidate gene (Ph-1) which seems to
somehow stabilise these instabilities between chromosome copies. In other
species, people have been trying for a few years to make polyploids but have
been failing because there is no stability mechanism - for example, there are
only a few papers where people managed to make a Brassica hexaploid, and these
hexaploids are usually very unstable and have problems generating offspring
(see for example
[http://www.cropj.com/malek_7_9_2013_1375_1382.pdf](http://www.cropj.com/malek_7_9_2013_1375_1382.pdf)
)

------
pishpash
Haven't paid attention to this area in a while, but back in the day I wondered
about the garbage-in garbage-out problem in genome databases. A lot
of subsequent sequence assemblies were made on the basis of approximate
matching to certain results in the database that have no quality information,
and conclusions made on the basis of further approximate matching across
datasets. Has anyone seriously worked out how reliable some of these
conclusions are? At least physicists spend a lot of time worrying about
p-values.

~~~
et2o
I don't think anyone knows.

------
kusmi
“I’d be the last one to give you a quote saying that we don’t need to bother
with these [unsequenced] regions.”

I wonder if he said this with a straight face.

------
grabcocque
One of the things the HGP revealed is that we have way fewer genes than anyone
expected. Given that the amount of data encoded in genes appears to be far too
small to account for the sophistication and variability in humans, it strongly
implied that large areas of DNA that were previously considered marker/filler
DNA were of great epigenetic importance.

~~~
toufka
There are also a lot more steps between 'genomic dna' -> 'protein' than were
previously appreciated or thought to be of importance. There are a lot of ways
to regulate and produce new proteins and variation that is not itself a
genomic change.

Alternative splicing, readthrough regulation, multiple kinds of ribosomes,
post-translational regulation, etc. all produce significant variation outside
of the dna itself.

------
hprotagonist
what do you mean, "the" human genome?

~~~
slavik81
The parent was flagkilled. I vouched for it because I had the same question.
What does it mean to have sequenced the human genome? If different individuals
have different genomes, what is the human genome?

~~~
nonbel
Not only do different individuals have different genomes, each cell from the
same person will have a slightly different genome. The reference is more like
an "average" human genome, one that no actual cell has ever contained.

~~~
viewtransform
"each cell from the same person will have different genomes"

Did you mean each cell will have different gene expression ?

I thought all the cells have the same nucleotide sequences barring maybe
occasional mutations (like in cancer)

~~~
nonbel
Mutation rates are high enough that we should expect at least a few every
division. There are supposedly n~6e9 base pairs[1] and mutation rate is said
to be p~1e-8 per bp per division[2].

Assuming each mutation is independent, etc., we get a back-of-the-napkin
estimate via the binomial distribution: mean = n*p ~ 60 mutations per division.
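
In code (n and p as assumed above; the Poisson approximation to the binomial also gives the chance of a division with no mutations at all):

```python
from math import exp

n = 6e9    # base pairs per diploid genome, per [1]
p = 1e-8   # assumed mutation rate per bp per division, per [2]

expected = n * p         # binomial mean: ~60 mutations per division
p_zero = exp(-expected)  # Poisson zero class: chance of a mutation-free division

print(expected)   # 60.0
print(p_zero)     # ~8.8e-27: a mutation-free division is essentially impossible
```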

[1]
[https://en.wikipedia.org/wiki/Human_genome](https://en.wikipedia.org/wiki/Human_genome)

[2]
[https://en.wikipedia.org/wiki/Mutation_rate](https://en.wikipedia.org/wiki/Mutation_rate)

~~~
chrisamiller
Most estimates I'm familiar with are substantially lower than that - between
0.1 and 3 errors per cell division.

~~~
nonbel
Thanks, actually the wikipedia link was "per generation" and they assume ~ 100
divisions/mitoses per human generation. So it should be 1e-10
mutations/bp/division.

However, that is for germline; somatic mutation rates are apparently 10-100x
more common:
[https://www.nature.com/articles/ncomms15183](https://www.nature.com/articles/ncomms15183)

These rates no doubt vary by cell type, environment, etc. I was actually
thinking something like 1e-6 to 1e-10 mutations/bp/division.

------
coldtea
In other words they don't know what they're doing and they're making it up as
they go along -- "yeah, that part there is probably unused", "oh, wait, it's
important after all".

~~~
phyzome
As the article says, we know there's important stuff in those repetitive
regions, it's just really hard to read.

~~~
coldtea
Well: "“A lot of people in the 1980s and 1990s [when the Human Genome Project
was getting started] thought of these regions as nonfunctional,” said Karen
Miga, a molecular biologist at the University of California, Santa Cruz. “But
that’s no longer the case.” "

And: "“I’m between agnostic and a little skeptical that these bits will be
important for disease, but maybe I’m saying that because we can’t read them,”
Lander said.".

~~~
throwanem
The "no such thing as junk DNA" claim comes from a paper which is, to say the
least, controversial in the field. I would hesitate to take it as prima facie
reliable, and in general would recommend looking with extensive skepticism
upon popular reporting in this realm of scientific endeavor. After all, this
very thread originates in a piece of popular reporting that only seems novel
because all of the previous popular reporting has been wrong...

