
Harvard cracks DNA storage, crams 700 terabytes of data into a single gram - evo_9
http://www.extremetech.com/extreme/134672-harvard-cracks-dna-storage-crams-700-terabytes-of-data-into-a-single-gram
======
skosuri
I'm an author of the paper. The title of this article is misleading: first, we
encoded 650kB and made 70 billion copies... second, those 70 billion copies
weigh 1 milligram... third, it's really only meant for archival purposes, as
it's immutable and not random access... fourth, it's expensive right now
(though this, at least, might be a solvable problem).

~~~
brfox
Nice work, and thanks for replying to all the questions here. I didn't read
the paper (maybe this is addressed in there), but how much did it cost to
synthesize that much DNA? Also, it could be random access if you PCR-amplified
the fragments you need based on the barcodes - you could even make a FAT (file
allocation TUBE) which holds all the file names and their barcodes.

~~~
skosuri
approximately a couple of thousand dollars.

while that's true (re: scalability), it's not what we did, and would be hard
to scale.

------
ChuckMcM
I really liked their paper. It's a bit less over the top than the ExtremeTech
guys', but hey, that's the difference between pop journalism and science.

Clearly with some form of fountain or LDPC code you'd be able to get the data
back, but what struck me is that I'd always thought of DNA as relatively
unstable, in the sense that cells decay and die; yet DNA that isn't busy
expressing proteins under the influence of other cellular machinery just sits
there. That was new for me.

When I showed it to my wife, she pointed out that the sourdough starter she
has been using since we were married came from her grandmother. I joked that
the next Megaupload-type raid would have to sequence all the DNA they found in
a place to figure out whether Shrek 3 was encoded in it somewhere. That would
be painfully funny, I think.

~~~
vph
do you have a link to their paper?

~~~
skosuri
I posted our paper on dropbox (until i get into too much trouble over it):
paper <http://db.tt/ZDoDJZeD> supplement <http://db.tt/elIqsy72>

~~~
brown9-2
thank you!

------
colanderman
If you store data onto 50 DNA strands, can you always read back all the data
from all 50 strands, or does one need to store multiple copies of each in case
the sequencer can't "find" a particular strand?

If one does need multiple copies, it would seem that this method suffers from
the coupon collector's problem [1] (i.e. collecting all 50 strands requires
drawing 225 random strands on average), and that the retrieval rate could be
improved by using a fountain code [2], which lets each strand simultaneously
encode data at multiple addresses and would cut the number of strands you need
to sample to only slightly more than the number of strands' worth of data
requested.

[1] <http://en.wikipedia.org/wiki/Coupon_collectors_problem>

[2] <http://blog.notdot.net/2012/01/Damn-Cool-Algorithms-Fountain-Codes>
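
To make the ~225 figure concrete, here's a quick Python sketch (my own
illustration, not anything from the paper) computing the coupon-collector
expectation n*H_n and checking it by simulation:

    import random

    # Expected number of uniform random draws needed to see all n strands
    # at least once: n * H_n (the coupon collector's expectation).
    def expected_draws(n):
        return n * sum(1.0 / k for k in range(1, n + 1))

    # Monte Carlo check: draw strands uniformly until all n have been seen.
    def simulate(n, trials=10000):
        total = 0
        for _ in range(trials):
            seen, draws = set(), 0
            while len(seen) < n:
                seen.add(random.randrange(n))
                draws += 1
            total += draws
        return total / trials

    print(expected_draws(50))  # ~224.96, i.e. the ~225 quoted above
    print(simulate(50))        # should land close to 225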

~~~
shinratdr
HN formatting breaks the first Wikipedia link by removing the apostrophe. I
ran it through a link shortener as a workaround.

<http://cl.ly/1r153b103k2P>

~~~
i_cannot_hack
One could also replace the apostrophe with %27

<http://en.wikipedia.org/wiki/Coupon_collector%27s_problem>

------
jwr
Does anybody know how to escape their horrible "mobile" version that they
force onto iPad users? It can't even be zoomed :-(

More and more often I find myself not reading articles because someone thought
it would be a great idea to create a non-scrolling, non-obvious, paginated
"iPad format" with misleading, unintuitive buttons that look like native ones
but do something different.

Sorry for the rant.

EDIT: so you might as well access the original at
<http://hms.harvard.edu/content/writing-book-dna> instead of the ad-ridden
regurgitation.

~~~
jessriedel
I'm considering buying an iPad, and I have a question: is there no browser on
the iPad that lets you choose whether you want the mobile version or the
regular version? If not, that's almost a deal breaker.

~~~
aerique
Yes, those browsers are available. Chrome has a "request desktop site" option,
and my personal favourite, iCab Mobile, has an extensive list of user-agent
strings that can be set (custom ones can be added as well).

~~~
jessriedel
Phew, thanks.

------
Jun8
And, of course, this brings us to the question: do we already have messages in
our DNA? Here's a post (from 2007) on this:
<http://blog.sciencefictionbiology.com/2007/07/messages-in-our-dna.html>.
Actually, if it's from the aliens who seeded life on Earth, it would probably
be in prokaryotic DNA, perhaps?

~~~
mbreese
Probably not... If we were seeded by aliens, then the message would have had
to be in a very primitive form, so it would necessarily be seen in all living
things, including prokaryotes. Unfortunately, bacterial genomes are much, much
smaller than human genomes, so there isn't much room to waste on hidden alien
messages. Additionally, these messages presumably wouldn't be functional, so
they wouldn't be under any sort of evolutionary pressure to be preserved. This
means that they would likely be mutated away or lost.

That's not to say that we weren't seeded, just that if we were, any message
would likely have been lost. Unless... maybe they seeded mitochondria intact,
which are in all (ok, most) eukaryotes. That might work... :)

Unfortunately, chrM is pretty small too... so no hidden messages there either
:(

------
wbizzle
This article is incredibly misleading. First of all, there is an inconsistency:
the headline says they stored 700 terabytes (4.4 petabytes), but it later says
that they actually stored 700 kilobytes (their book) and made 70 billion
copies (44 petabytes?). The main thing is that storing 700 kilobytes and then
making 70 billion copies is considerably less useful than storing 700
terabytes of unique data outright. Aside from that, though, this is awesome,
and a huge step forward into promising and uncharted territory.

~~~
gibybo
It says they stored 44 petabytes total (70 billion * 700 kilobytes), and 700
terabytes in a single gram (which is 5.5 petabits, not 4.4 petabytes).

~~~
skosuri
in the paper, we say 1.5mg per petabyte at large scales, and we only encoded
650kB or so. this seems a little sensationalistic. we are far away from being
able to do a petabyte of arbitrary information.

~~~
gibybo
Wait, 1.5mg per petabyte at large scales? Wouldn't that mean a gram could hold
(1000/1.5) 667 petabytes, and presumably that's scalable to many grams
(eventually)? I understand it's only 650kB right now, but the density is
obviously still incredible.

~~~
skosuri
yes. density is astounding. that's why it's in science i presume. mostly
because people forget how dense dna information really is.

~~~
apathy
> i presume

you are either the most humble senior author ever or jerking all our chains.
possibly both. congratulations either way.

~~~
skosuri
this is my first senior author paper; i imagine i'll be more haughty after I
start teaching a few classes.

------
nemo1618
First thing that came to mind was Stross' "Memory Diamond":
<http://www.antipope.org/charlie/blog-static/2007/05/shaping_the_future.html>

------
pronoiac
Wow. For scale, the Internet Archive had 5.8 petabytes of data in December
2010 [1] - so, about 9 grams' worth. How much did this cost?

[1] <http://archive.org/web/petabox.php>

~~~
skosuri
i think you have the numbers wrong. at least in the supplement of our paper,
we say a petabyte would weigh ~1.5 milligrams. It would be far too expensive
to do that though; about a 6-8 order-of-magnitude increase in scale is
necessary from current technologies. that said, we've seen that kind of drop
over the last decade or so; here's to keeping it going.

------
conanite
They're using T and G for a 1, and A and C for a 0; why not double the density
and get two bits from each letter?

    
    
      T = 00
      G = 01
      A = 10
      C = 11
    

for example.

~~~
skosuri
we didn't because we wanted to avoid particular sequence features that are
difficult to synthesize and sequence. we probably could have gotten away with
something like 1.8 bits per base, but we were already doing fine on density,
so we thought a 2x hit wouldn't be that bad.
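
For the curious, here's a toy sketch in Python of that kind of constrained
encoding (my own illustration of the idea, not the paper's actual scheme):
each 0 becomes A or C and each 1 becomes T or G, picking whichever base
differs from the previous one, since homopolymer runs are among the sequence
features that are hard to synthesize and sequence.

    ZERO, ONE = "AC", "TG"  # two candidate bases per bit value

    def encode(bits):
        seq = []
        for b in bits:
            options = ONE if b == "1" else ZERO
            # choose the candidate that differs from the previous base,
            # so the output never contains two identical bases in a row
            base = options[1] if seq and seq[-1] == options[0] else options[0]
            seq.append(base)
        return "".join(seq)

    def decode(seq):
        return "".join("1" if base in ONE else "0" for base in seq)

    s = encode("0011010")         # -> "ACTGATA", no repeated bases
    assert decode(s) == "0011010"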

~~~
skosuri
i should clarify: because of extra sequence and the address barcode, we are
technically only at 0.6 bits/base.

------
tarice
I notice that the article fails to mention how long it would take to extract
all 700 terabytes of data...

Assuming 5.5 petabits stored with 1 base pair representing 1 bit, we can
extrapolate the time required to extract the data based on the time taken to
sequence the human genome (3 billion base pairs).

5.5 petabits / 3 billion bits ~= 2 million, so theoretically it should take 2
million times longer to sequence the original.

Three years ago, there was an Ars Technica article about how it then took only
one month to sequence a human genome[1]; the article above now claims that
microfluidic chips can perform the same task in hours.

Assuming 2 hours (low end) to sequence the human genome:

2 hours * 2 million = 4 million hours = 456 years, give or take a few years.

So, maybe not so great for storing enormous amounts of data. But if you want
to store 1 GB, it would only take ~6 hours. Not too bad.
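
The same estimate in runnable form (nothing beyond this comment's own
assumptions):

    genome_bits   = 3e9     # human genome: ~3 billion base pairs, ~1 bit each
    total_bits    = 5.5e15  # ~5.5 petabits, the 700 TB headline figure
    hours_per_run = 2       # optimistic time to sequence one genome

    factor = total_bits / genome_bits   # ~1.8 million genome-equivalents
    print(factor * hours_per_run / (24 * 365))   # ~420 years (rounding the
                                                 # factor up to 2 million
                                                 # gives the ~456 above)
    print(8e9 / genome_bits * hours_per_run)     # 1 GB: ~5.3 hours ("~6")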

[1] <http://arstechnica.com/science/2009/08/human-genome-completed-using-one-machine-for-four-weeks/>

~~~
scarmig
Hmm. DNA is fairly easy to duplicate though, right? Wouldn't that allow an
exponential speedup?

~~~
tarice
I assumed that the microfluidic chip speed listed would include parallel
processing.

Even if it doesn't, you'd still need something like 5,000 experiments in
parallel for it to take less than a month...

~~~
JunkDNA
Yes, the microfluidics used today take advantage of this by reading large
numbers of small segments of DNA in parallel. The current "gold standard" for
DNA sequencing (manufactured by Illumina) uses millions of tiny fragments of
DNA which are read optically as each sequence is extended.

------
X4
They're teaching us this in computer science, and I wonder whether it's total
crap or not. Can you please shed some light on this?

"In humans, the deoxyribonucleic acid (DNA, Germ. DNS) is the carrier of
genetic information, and the main constituent of the chromosomes.

DNA is a chain-like polymer of nucleotides, which differ in their nitrogen
bases (thymine/cytosine and adenine/guanine, respectively). The alphabet of
the code is therefore {thymine, cytosine, adenine, guanine}, or { T, C, A, G }.
Three consecutive bases form a word, so there are 4^3 = 64 combinations per
word and the word length is ld(64) bits = 6 bits. A gene contains about 200
words. A chromosome contains about 10^4 to 10^5 genes. The number of
chromosomes per cell nucleus is 46 in humans. The stored data per nucleus
therefore have a volume of 6 bit * 200 * 10^5 * 46 = 5.52 * 10^9 bit ≈ 0.7 *
10^9 byte ≈ 1 GByte."
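
(For what it's worth, the quoted arithmetic does check out on its own terms;
here it is in runnable Python. Whether the biological assumptions hold is
addressed in the reply below.)

    # The lecture-note estimate, as quoted above.
    bits_per_word   = 6      # ld(4^3) = 6 bits per three-base word
    words_per_gene  = 200
    genes_per_chrom = 1e5    # upper end of the quoted 10^4..10^5 range
    chromosomes     = 46

    bits = bits_per_word * words_per_gene * genes_per_chrom * chromosomes
    print(bits)              # 5.52e9 bits
    print(bits / 8 / 1e9)    # ~0.69 GByte, i.e. "about 1 GByte"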

~~~
apinstein
You aren't far off, but there are some errors. Excuse in advance any errors
below; it's been a while since I did this stuff daily.

The human genome is 23 chromosomes, 2 copies of each (46). They are not mirror
copies like RAID 1, but rather carry alleles (different versions of the same
gene), so you might have a different version of a gene on each copy of a
chromosome. This is the basis of the advantage of sexual reproduction and why
sexual organisms can evolve (especially at the population level) so much
faster than asexual ones. The two alleles can also be the same; that's called
homozygous (versus heterozygous). This is basic Mendelian genetics. Most genes
work this way, though many are more complex.

ATGC is correct, so each "bit" is base 4.

Genes encode proteins; every 3 base pairs is a codon, which specifies which
amino acid to use. While there could be 4^3 = 64, in practice only 20 amino
acids are used in nature to make functional proteins. Genes vary greatly in
length; I'm not sure where you got ~200 codons/gene, but that's misleading:
maybe close to the average, but the range is large. In any case, for data
storage, anything relating to codons and proteins would be irrelevant.

Also, in practice not all of a chromosome encodes proteins. There is often a
lot of buffer region between genes, not to mention a lot of control sequences
that help regulate the expression of genes. Beyond that, the ends of
chromosomes don't have many genes; they mostly contain pseudo-random
information and are still being explored (junk DNA, telomeres, etc.).

~~~
X4
>> And how much data did they fit into one DNA pair?

Info: according to Quora, the average weight of the DNA in a human is 60 g,
which means a human could carry 700 TB/g * 60 g = 42,000 TB. (Note: 700 TB =
7.69658139 × 10^14 bytes.)

------
schiffern
> To store the same kind of data on hard drives — the densest storage medium in
use today — you’d need 233 3TB drives, weighing a total of 151 kilos.

But hard drives aren't the densest storage medium in use today. A microSD card
can hold up to 64 gigabytes and weighs 0.5 grams; 700 terabytes' worth would
be only 5.6 kilograms.
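
(Worked out, using binary terabytes: 700 TB = 716,800 GB; 716,800 / 64 =
11,200 cards; 11,200 * 0.5 g = 5.6 kg.)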

------
AaronBBrown
What's the latency/throughput on reading the data back?

------
gersh
Can we encode all of human knowledge into the DNA of some organism? How can
organisms access data stored in their DNA? Imagine being born with knowledge
of every Wikipedia article, or even every website. What would that be like?

~~~
milesokeefe
Education would no longer be necessary or valued.

Experience and creativity would be what everyone garners to get a job.

~~~
nnash
I think we'd be past "jobs" if it were possible to store and read memory in
DNA with the human brain.

~~~
nico
Then who would be in charge of creating new knowledge, and who would put it
into our brains or DNA? Knowing everything humans have discovered so far would
only change the type of jobs we would have, we would still need to have jobs
though.

~~~
nnash
Hence the quotation marks; we'd be past the traditional 9-to-5 and in pursuit
of loftier ambitions.

------
tocomment
I don't understand how this density could be so much better than something
like flash drives. Aren't those also on the nanometer scale?

~~~
colanderman
DNA is 3D. Flash drives are 2D.

~~~
brodney
Flash drives also carry all of the equipment necessary for them to be read. I
wonder how much data per gram the actual storage part of a flash drive holds.
I doubt it's better than DNA, but I think it might be disingenuous for the
article to offer the analogy to hard drives without taking the extra
read/write hardware into account.

~~~
skosuri
agreed, it should be compared to other archival media like tape drives; but
it's still pretty similar. in our paper we compare against a hard drive
platter rather than the drive itself; even then we are approximately a
million-fold more dense.

~~~
brodney
Thanks for the response. That is a truly impressive figure. Great
accomplishment!

------
DanBC
The paper is exciting, in the calm, measured way that scientists are. I look
forward to seeing huge data storage on DNA in the future.

I'm gently concerned about what'll happen to information if it's not available
to future people. Is anyone taking the most important documents of our
civilisation and encoding them onto clay tablets, or some such?

------
dsirijus
Why binary, if DNA naturally has 4 letters?

~~~
whatgoodisaroad
Bases can only pair with one other base. Adenine can only pair with thymine.
Guanine can only pair with cytosine. Knowing one base implies the other, so
there are only two possible pairs.

~~~
sadga
But there are two strands, and each half-pair is fixed on one side.

------
subrat_rout
The next big hurdle is how to develop a household DNA sequence reader for
under $50 that can read your storage. I mean, if I want to store my data on a
DNA strand, then one day I'll need to read that data back at home with the
help of a sequence reader, right?

------
tocomment
To read the data out, are they basically doing de novo assembly on the
sequenced reads? How are they handling all of the errors in sequencing? How
about assembly errors? Long repeats?

~~~
CreRecombinase
If your strand length is less than or equal to the read length of your
sequencer, and you have the address blocks at the start of every sequence, you
don't really need to worry about assembly. Read depth and/or a checksum of
some kind will take care of errors in sequencing, and with short strands or
compression of some kind, long repeats aren't much of a problem either.
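
As a rough sketch of that decode path (illustrative Python only; the address
length and fixed-size read format here are assumptions, not the paper's actual
layout):

    from collections import defaultdict

    ADDR_LEN = 16  # hypothetical address-block length

    # Each read = fixed-length address block + payload. Group reads by
    # address, then majority-vote per position so that read depth absorbs
    # isolated sequencing errors.
    def decode(reads):
        by_addr = defaultdict(list)
        for r in reads:
            by_addr[r[:ADDR_LEN]].append(r[ADDR_LEN:])
        blocks = {}
        for addr, payloads in by_addr.items():
            blocks[addr] = "".join(max(set(col), key=col.count)
                                   for col in zip(*payloads))
        return [blocks[a] for a in sorted(blocks)]  # payloads in address order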

~~~
tocomment
Thanks, that makes sense. But how do you get the address block at the start of
each sequence?

~~~
skosuri
it's at the start of the first read.

------
tripzilch
I always have to wonder with these ExtremeTech links:

How much of this news is true, and how much is the usual ExtremeTech
editorializing?

For instance, does DNA _really_ last forever?

------
samholmes
Is this storage method only good for read-only data, or can it survive
multiple re-writes?

------
kanchax
Great link. Makes one anxious to see what's next.

------
kschua
Why do I get a feeling I am living in a Matrix as a data storage device?

------
mariusz331
This. Is. Awesome.

------
Evbn
The article says they made 70 billion copies of ~650KB, which is quite
different. Can they encode 700TB of unique data in this system?

------
revelation
It is incredibly stable? We'd better not tell evolution.

~~~
ajross
Error correction redundancy to any level of reliability you want takes log(N)
extra storage. My question would be what the access speeds are. If you have to
read it by running it through a trillion PCR test tubes, this isn't exactly
practical.
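
One way to see the log(N) claim (my own illustration, not from the paper): a
Hamming code protects a block of N data bits against any single-bit error
using only ~log2(N) parity bits.

    # Encode: data bits fill the non-power-of-two positions; the parity bit
    # at position 2^p covers every position whose 1-based index has bit p set.
    def hamming_encode(data):
        r = 0
        while (1 << r) < len(data) + r + 1:
            r += 1                      # r ~ log2(len(data)) parity bits
        code = [0] * (len(data) + r)
        it = iter(data)
        for i in range(len(code)):
            if (i + 1) & i:             # i+1 not a power of two: data slot
                code[i] = next(it)
        for p in range(r):
            mask = 1 << p
            code[mask - 1] = sum(code[i] for i in range(len(code))
                                 if (i + 1) & mask) % 2
        return code

    # Decode: the syndrome is the 1-based index of the flipped bit
    # (0 if the block is clean), so a single flip repairs it.
    def hamming_correct(code):
        syndrome = 0
        for p in range(len(code).bit_length()):
            mask = 1 << p
            if sum(code[i] for i in range(len(code)) if (i + 1) & mask) % 2:
                syndrome |= mask
        if syndrome:
            code[syndrome - 1] ^= 1
        return code

    c = hamming_encode([1, 0, 1, 1])    # 4 data bits -> 7-bit codeword
    c[4] ^= 1                           # corrupt one bit
    assert hamming_correct(c) == hamming_encode([1, 0, 1, 1])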

~~~
drucken
Microfluidics or lab-on-a-chip devices are the latest advance:
<http://www.extremetech.com/computing/84340-new-silicon-chip-sequences-complete-genome-in-three-hours>

The progress in this field has been far better than Moore's Law [1], so it
could be very practical.

[1] Excellent TED talk on it too by Richard Resnick:
<http://www.youtube.com/watch?v=u8bsCiq6hvM>

------
jorgeleo
Gel Packs! Cool! <http://en.memory-alpha.org/wiki/Bio-neural_gel_pack>

(How has nobody referenced this yet?)

