

500 Exabytes per Raindrop - smanek
http://math.ucr.edu/home/baez/information.html

======
sh1mmer
This is easily the most interesting thing I've read all week.

I was talking to some of our systems engineers about the size of the Hadoop
clusters at Yahoo. I was pretty impressed with having access to thousands of
machines with tens of petabytes of storage (think kid in a candy shop).

It's nice to be put gently back in my place by a simple raindrop.

------
biohacker42
Having worked for a bioinformatics company, I can describe my work as handling
huge volumes of data.

Modern instruments are ever more computerized and spit out gigs and gigs of 1s
and 0s with every use.

And then you have to take all that data and try to turn it into something
humans can understand.

And every time our instruments get better the amount of data we collect goes
up.

Systems biology is basically a data management problem.

------
vitaminj
_To see a world in a grain of sand,

And a heaven in a wild flower,

Hold infinity in the palm of your hand,

And eternity in an hour_

William Blake, Auguries of Innocence

~~~
glymor
I'm pretty sure you got this from the Tomb Raider movie.

------
bradgessler
This guy needs to read GEB. One could describe a raindrop with even more bits
(subatomic particles, anybody?) or with far fewer (just the size, volume, and
shape of the raindrop).

~~~
unalone
Plus, of course, he's talking about that raindrop in isolation. To describe it
in context with the real world, you also need to factor in the light
refracting through it, the position of the observer, and the pace at which it
moves both due to gravity and to the earth's rotation. So to describe it
accurately from a single position requires knowledge of everything around that
raindrop.

In fact, it's possible to say then that you could start with something as
simple as, say, a piece of fairy cake, and deduce accurately the nature of
everything else in the universe.

Reading this, the thing that stuck out to me was the relatively small size of
text (5MB for Shakespeare versus 20GB for Beethoven). It struck me that poetry
- particularly haiku, which deals with nature - is sort of a primitive way of
applying lenses to the world to create a filter for viewing things by
appealing to collective experiences. Just a thought.

~~~
celoyd
Primitive? Poetry is probably the most developed form of lossy compression
there is.

~~~
unalone
Primitive meaning the concept behind it - capture a moment in time with words
- is a very old and simple one.

Don't get me wrong: poetry has become incredibly sophisticated. And it's still
as enjoyable to write as it was long, long ago.

------
yters
What is the theory of useful information? According to algorithmic information
theory, the less compressible something is, the more information it has. But,
the sort of information I'm interested in implies that there is a fairly large
ratio of bits to compression. I.e. this text is much more compressible than a
random string of letters, and conveys much more information to me than a
random string. Is there a precise characterization of this sort of
information? It isn't simply the compression ratio, since
AAAAAAAAAAAAAAAAAAAAAAAAAAA is very compressible but conveys very little
information.
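
One way to see the tension concretely (a quick sketch using Python's zlib;
exact byte counts vary by compressor and level): English text compresses well,
random bytes don't, and a run of As compresses best of all, yet the run of As
is the least informative of the three.

```python
import os
import zlib

# Three strings of comparable length: English prose, random bytes,
# and a single repeated character.
english = (
    b"Modern instruments are ever more computerized and spit out "
    b"gigabytes of data with every use, and then you have to take "
    b"all that data and try to turn it into something humans can "
    b"understand. Every time the instruments get better, the amount "
    b"of data collected goes up."
)
random_bytes = os.urandom(len(english))
repeated = b"A" * len(english)

for name, data in [("english", english), ("random", random_bytes),
                   ("repeated", repeated)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{name:9s} compressed to {ratio:.0%} of original size")
```

The compression ratio orders them repeated < English < random, which matches
algorithmic information theory's ranking of their complexity, but plainly not
the everyday sense in which the English sentence is the most informative of
the three.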

~~~
thwarted
Go the other way. Assume two strings of length k: string a composed of English
words, and string b generated by reading /dev/random. String a will be more
compressible, length(compress(a)) < k. You can now add MORE "information" to a
and compress it until length(compress(a)) = k. Since you can't compress b any
further, it already contains the maximum amount of information.

I put "information" in quotes above because the actual "information per bit"
of English is pretty low, and that's where the ability to both compress it and
comprehend it, not as single bits but as groups of bits, comes from. Compressed
data is still comprehensible once you uncompress it, so this is really a
measure of surface comprehensibility, or of information density: compressed
data has a high information-to-space ratio, whereas random, incompressible
data has a low information-to-space ratio (in this everyday sense of the word).

You can experiment with this yourself:

    
    
    cat > /tmp/string.a
    # paste in some English text copied from a web page, then press Ctrl-D
    dd if=/dev/urandom of=/tmp/string.b bs=1 count=`wc -c < /tmp/string.a`
    bzip2 /tmp/string.*
    ls -l /tmp/string.*

Experiment with the above for corpora of different lengths. You'll see that
longer English text compresses better (as a ratio) than shorter English text,
and that English text compresses far better than random data (which, in this
everyday sense, carries little information) of the same length.

A string composed solely of 27 As would compress down to perhaps 2 bytes or
less (not including the size of the decompressor). You are right: there is not
much information in it. Less than 16 bits of information in 27 As.
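
With a real compressor the fixed container overhead keeps the 27 As from
reaching two bytes, but the underlying point shows up clearly if you grow the
run: the compressed size barely moves while the input grows a thousandfold. A
sketch with Python's zlib (bzip2 behaves similarly on long runs):

```python
import zlib

# A run of identical bytes is almost pure redundancy: the compressed
# size is dominated by fixed overhead, not by the input length.
for n in (27, 27_000):
    compressed = len(zlib.compress(b"A" * n))
    print(f"{n:6d} As -> {compressed} compressed bytes")
```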

~~~
yters
I'm looking for a definition of information that matches up with how we use
the word in normal parlance. This definition has to strike some kind of
balance between incompressible and completely compressible. I could make
something up,
but I was wondering what the official version is.

~~~
thwarted
I'm not sure I follow. The definition of "information" I use every day matches
the "official" version -- I'm not sure how it could be different. What is
"completely compressible"? Something cannot be compressed beyond the shortest
string that can represent it without losing information content.

http://en.wikipedia.org/wiki/Information#Measuring_information_entropy

~~~
yters
Shannon's information is a measure of the size of a set that an element is
selected from, is that right? So a letter conveys log2(26) bits of
information. That alone doesn't allow me to discriminate between the
information content of a random string and an English sentence, since both
supposedly contain the same amount of information by this bare metric.

If I look at the occurrence of subsets of the string, then that would be a
better discriminator: the random string's subsets should follow a roughly
uniform distribution while the English string's subsets will be highly skewed.

However, that doesn't work when I try to discriminate between an English
string and a string generated by a simple algorithm, since the latter's subset
distribution will also be highly skewed. What kind of metric discriminates the
English sentence from either case?
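
A rough way to make the first part concrete (a sketch, not a full answer;
this is first-order empirical entropy, which as noted above still can't
separate English from an algorithmic string with the same skewed letter
statistics): estimate the bits per letter from letter frequencies. Random
letters come out near log2(26) ≈ 4.7 bits; English comes out noticeably lower.

```python
import math
import random
import string
from collections import Counter

def empirical_entropy(text: str) -> float:
    """First-order estimate: -sum p(c) * log2 p(c) over letter frequencies."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random.seed(0)
random_letters = "".join(
    random.choice(string.ascii_lowercase) for _ in range(10_000))

english = (
    "Shannon information measures the size of the set an element is "
    "selected from, so a single letter drawn uniformly conveys about "
    "four point seven bits, but letters in English prose are far from "
    "uniform and the per letter entropy is correspondingly lower"
)

print(f"log2(26)       = {math.log2(26):.2f} bits per letter")
print(f"random letters = {empirical_entropy(random_letters):.2f} bits per letter")
print(f"English sample = {empirical_entropy(english):.2f} bits per letter")
```

Separating the English string from an algorithmically generated one with the
same statistics needs a stronger notion, essentially Kolmogorov complexity,
which is uncomputable in general; that is roughly why there is no single
official metric of "useful" information.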

Am I making sense here? I haven't had any formal training in information
theory, and my brain is kind of fried right now.

------
markessien
A raindrop does not have that much information in it. A raindrop only has some
simple function that completely describes it, and this function is likely only
a few bytes long. You can't take any information out of a raindrop, because
the particles within it are random.

A raindrop is made up of a certain number of atoms, but these atoms are all
the same, and cannot store any information. To store information, you need
items that are dissimilar to each other. So the comparisons he makes are not
correct.

~~~
celoyd
As you say, the particles are random(ish), and this is exactly what makes them
hard to describe completely with a simple function. "Random" means "the
simplest description of it is the literal one", and the literal description of
the position, orientation and motion of that many water molecules is a lot of
information.

If raindrops were perfect crystals at absolute zero (and there were only one
isotope each of hydrogen and oxygen), there would be a lot less information in
them.

~~~
markessien
You are looking at things backwards. If you try to describe a raindrop by
mapping the position of every single atom or molecule, you are creating
information. But the raindrop itself does not contain any information, because
it is completely random. It has no memory, and so cannot store information.

There is a very small amount of information in a raindrop, no matter how
complex its molecular structure is.

The entire argument is flawed. It's based on a completely wrong premise -
information is not the same thing as structure.

I can't explain this any better, you will need a leap of intuition to
understand what I mean, but when you get it, it will be obvious.

------
DaniFong
There's even more information if the quantum information is deemed important;
in excess of 2^(5 x 10^20).

Ultimately, one should come away with the understanding that obviously not all
that information _is_ important.

~~~
cgranade
Well, that's if you want to record the whole state for simulation by a
classical computer. If you just want to have the state, it seems like it
should be measured in qubits.

------
anewaccountname
If we don't know whether the universe is continuous or discrete, how can we
know how much _digital_ information is in a raindrop? If the universe is
continuous, the distance between any two particles can be measured to an
infinitely more fine-grained level, yielding more information.

------
tel
_90 petabytes: the "Deep Web" in 2002 - includes private databases,
controlled-access sites and so on._

6 years ago, oh my.

