
How much space would it take to store every word ever said? - jonluca
https://blog.jldc.me/posts/word-storage?ref=hn2
======
perl4ever
Assuming 10:1 compression, you have 50 exabytes, and it appears that would be
about 500 of the trucks Amazon uses to load large amounts of data. I can't
find information on how many they actually have, or whether the capacity has
increased from the 100 PB figure mentioned in a lot of places.
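That arithmetic is easy to sanity-check (a rough sketch, assuming the 50 EB compressed figure above and the commonly cited 100 PB per Snowmobile truck):

```python
# Back-of-envelope check of the truck count above.
# Assumptions: 50 EB of compressed data, 100 PB per truck,
# decimal (SI) units as storage vendors use.
EB = 10**18
PB = 10**15

compressed_bytes = 50 * EB
truck_capacity = 100 * PB

print(compressed_bytes / truck_capacity)  # 500.0
```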

Amazon's FAQ is funny:

"Q: Can I export data from AWS with Snowmobile?

Snowmobile does not support data export. It is designed to let you quickly,
easily, and more securely migrate exabytes of data to AWS."

...you can check out any time you like, but you can never leave.

~~~
otabdeveloper2
It's not funny, it's sad.

Also, AWS sucks. It's the 21st century version of IBM mainframe, except
without the reliability.

------
franciscop
> We could also use UTF8, but since we assumed the language is German, we’ll
> stick to ASCII

German cannot be expressed in ASCII[1]. For that matter, neither can Chinese
nor Spanish, the two most spoken languages besides English. Also, UTF-8
doesn't even encode all the languages ever spoken. So IMHO this is at least an
order of magnitude wrong.

[1]
[https://news.ycombinator.com/item?id=9222071](https://news.ycombinator.com/item?id=9222071)

~~~
dspig
German _can_ be expressed in ASCII just fine - ae, oe, ue, and ss are commonly
used and understood when the non-ASCII characters are unavailable or just hard
to find. So for this purpose it would be fine, except for possibly introducing
ambiguity in some words or names that already use the ASCII version.
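Both sides of this can be checked in a couple of lines of Python (a quick sketch, using "Größe" as an example word):

```python
# Raw German text with umlauts/ß is not ASCII-encodable...
try:
    "Größe".encode("ascii")
except UnicodeEncodeError:
    print("not ASCII-encodable")

# ...but the common transliteration is.
print("Groesse".encode("ascii"))  # b'Groesse'
```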

------
DoofusOfDeath
Sometimes I hear someone utter a sentence which I _guess_ has never before
been uttered by anybody. I really wish I had a way to verify that, just for
fun.

~~~
farnsworth
You could check whether it's already been posted to /r/BrandNewSentence/

~~~
oarabbus_
That must be an absurdly small subset of the universe of sentences; for
example, this sentence itself has likely never been spoken or typed until now.

~~~
clSTophEjUdRanu
There are 52! (about 8.07e67) possible orderings of a deck of playing cards.
Odds are that every time someone shuffles a deck of cards it's a unique
shuffling.
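The count is just the factorial of 52, which is easy to verify:

```python
import math

# Number of distinct orderings of a standard 52-card deck.
orderings = math.factorial(52)
print(f"{orderings:.3e}")  # 8.066e+67
```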

------
lilyball
> _10 billion words, times an average word length of 11.66, gets us ~4.8
> billion individual characters spoken per person per lifetime._

Am I missing something or is this math very wrong?

~~~
isomorphic
I noticed this, too.

If you take 16,000 words/day multiplied by 26,280 days (72 years' worth of
days), you get the 420.48 million in the text.

If you take _that_ number and multiply by 24, you get 10,091,520,000--close
enough to 10 billion that I think the author made a days versus hours mistake
somewhere.

The remainder of the article seems to actually use the 420.48 million number
and not 10 billion, as ringshall points out.
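That reconstruction can be reproduced directly (assuming the article's 16,000 words/day and 72 years of 365 days):

```python
# 16,000 words/day over 72 years of 365 days.
words_per_day = 16_000
days = 72 * 365  # 26,280 days

lifetime_words = words_per_day * days
print(lifetime_words)       # 420480000  (the article's 420.48 million)
print(lifetime_words * 24)  # 10091520000 (~10 billion: a days-vs-hours slip)
```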

------
thansz
This reminds me of a very fun and interesting read called "A Short Stay in
Hell" by Steven Peck, which provides an entertaining perspective on infinity
and very, very large finite time periods. It's about a Mormon who goes to hell
(because Zoroastrianism happens to be the One True Religion). Hell does not
last forever though. For the main character, it's a library that contains
every possible communication that could exist. Once he finds the book that
contains the story of his life, he gets out. Very fun read that addresses
large but finite values, although it focuses more on time rather than space.

------
sudosushi
Interestingly, no one has mentioned the Library of Babel[0]

One could assert that if you were to transliterate Chinese/Russian/other
non-Latin text into a common character set, you'd be covering every word ever
possibly said.

[0] [https://libraryofbabel.info/](https://libraryofbabel.info/)

------
jmull
Hm... just the words loses so much -- the tone, the emphasis, the pauses. I
think we'd have to do at least audio. Though of course expressions, hand
movements and bearing count too, so I'm thinking we need a number for video as
well.

~~~
javajosh
That's only a concern if you want to store some measure of identity and/or
meaning; the author just wants to store words. In any case, you could improve
things by putting the words into a screenplay-like format, which adds only a
relatively small amount of additional text. (I would estimate a typical
screenplay or play script is 90% dialogue.)

------
slewis
If we can assume determinism, there’s a much better compression algorithm.

------
QuadrupleA
This would be an interesting dataset to explore! A biographer's dream. Insider
information on every corporate & governmental decision in history. Intimate
daily-life details from early hominids.

------
KiDD
But what if it was encoded in a novel storage format such as DNA?

~~~
willis936
How can DNA be a novel storage format when its use for information storage
predates computers by at least 3 billion years?

~~~
duxup
Well novel for us to use it in this way....

Hummm, better patent "do everyday things...in DNA"

------
purplezooey
You can leave out all of Twitter

------
Avshalom
It's not taking UTF-8 into account though, so maybe double or triple that.

~~~
willis936
Encoding schemes like UTF-8 don’t affect compressed size much. What matters is
the quantity of information.

