How much space would it take to store every word ever said? (jldc.me)
49 points by jonluca on April 2, 2020 | hide | past | favorite | 32 comments



Assuming 10:1 compression, you have 50 exabytes, and it appears that would be about 500 of the trucks Amazon uses to load large amounts of data. I can't find information on how many they actually have, or whether the capacity has increased from the 100 PB figure mentioned in a lot of places.
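A quick back-of-the-envelope check of that truck count, using the figures in the comment above (500 EB raw implied by "10:1 compression gives 50 exabytes", and the widely cited 100 PB per Snowmobile):

```python
# Sketch under the comment's assumptions: 500 EB raw, 10:1 compression,
# 100 PB of capacity per Snowmobile truck.
EB = 10**18
PB = 10**15

raw = 500 * EB                     # uncompressed estimate
compressed = raw // 10             # 10:1 compression -> 50 EB
trucks = compressed // (100 * PB)
print(trucks)  # 500
```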

Amazon's FAQ is funny:

"Q: Can I export data from AWS with Snowmobile?

Snowmobile does not support data export. It is designed to let you quickly, easily, and more securely migrate exabytes of data to AWS."

...you can check out any time you like, but you can never leave.


It's not funny, it's sad.

Also, AWS sucks. It's the 21st-century version of the IBM mainframe, except without the reliability.


Such a lovely place


> We could also use UTF8, but since we assumed the language is German, we’ll stick to ASCII

German cannot be expressed in ASCII[1]. For that matter, neither can Chinese nor Spanish, the two most spoken languages besides English. And UTF-8 doesn't even encode all the languages ever spoken. So IMHO this is off by at least an order of magnitude.

[1] https://news.ycombinator.com/item?id=9222071


German can be expressed in ASCII just fine: ae, oe, ue, and ss are commonly used and understood when the non-ASCII characters are unavailable or hard to type. So for this purpose ASCII would be fine, except for possibly introducing ambiguity in some words or names that already use the ASCII spelling.
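The transliteration convention described above can be sketched in a few lines (a minimal mapping; it ignores the ambiguity the comment mentions, e.g. names that legitimately contain "ae"):

```python
# ASCII transliteration for German umlauts and eszett, as commonly used
# when the non-ASCII characters are unavailable.
DE_ASCII = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def to_ascii(text: str) -> str:
    return "".join(DE_ASCII.get(ch, ch) for ch in text)

print(to_ascii("Die Straße von Herrn Müller"))  # Die Strasse von Herrn Mueller
```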


For what it’s worth, there are English words which cannot be encoded in ASCII too.

https://en.m.wikipedia.org/wiki/English_terms_with_diacritic...


They probably meant ISO/IEC 8859-15 8-bit encoding

https://en.wikipedia.org/wiki/ISO/IEC_8859-15


Each character in e.g. Chinese represents more information, so there are fewer of them, which sorta cancels out. I thought German was a good conservative choice here.


Sometimes I hear someone utter a sentence which I guess has never before been uttered by anybody. I really wish I had a way to verify that, just for fun.


You could check whether it's already been posted to /r/BrandNewSentence/


That must be an absurdly small subset of the universe of sentences; for example, this sentence itself has likely never been spoken or typed until now.


There are 52! ≈ 8.07e67 possible orderings of a deck of playing cards. Odds are that every time someone shuffles a deck of cards it's a unique shuffling.
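For reference, that count is just 52 factorial, which Python can compute exactly:

```python
import math

orderings = math.factorial(52)  # orderings of a standard 52-card deck
print(f"{orderings:.2e}")       # 8.07e+67
```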


I found your exact comment in the library of babel, page 82 of a book.

https://libraryofbabel.info/bookmark.cgi?m_hnrvffbgsqubi.82


I came up with "Unique first utterer of this performative impredicative statement". I used it as a profile title[1], and so far no one has dared to contradict me. :P

Note that I won't waste my time on a preterition giving credit to Russell for the concept of impredicativity and to Austin for the concept of the performative sentence.

[1] https://www.linkedin.com/in/mathieu-lovato-stumpf-guntz-%E2%...


There is (used to be?) an XKCD IRC channel with a bot that only lets you send messages that have never been uttered in the channel before.

https://blog.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-a...
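The core of such a bot can be sketched in a few lines (hypothetical; the real ROBOT9000 also mutes repeat offenders with escalating timeouts): keep a set of every normalized message seen so far and only let novel ones through.

```python
# Minimal ROBOT9000-style novelty filter (sketch, not the actual bot).
seen = set()

def allow(message: str) -> bool:
    """Let a message through only if it has never been said before."""
    key = "".join(message.lower().split())  # crude normalization
    if key in seen:
        return False
    seen.add(key)
    return True

print(allow("hello world"))    # True  - first utterance
print(allow("Hello   WORLD"))  # False - same words, already said
```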


> 10 billion words, times an average word length of 11.66, gets us ~4.8 billion individual characters spoken per person per lifetime.

Am I missing something or is this math very wrong?


I noticed this, too.

If you take 16,000 words/day multiplied by 26,280 days (72 years' worth of days), you get the 420.48 million in the text.

If you take that number and multiply by 24, you get 10,091,520,000--close enough to 10 billion that I think the author made a days versus hours mistake somewhere.

The remainder of the article seems to actually use the 420.48 million number and not 10 billion, as ringshall points out.


Also,

"roughly 10 billion words said, per person (16,000 per day * 26,280 days = ~420,480,000)"

Not sure if I completely missed something but it looks like 420m words per person.

Edit: combine these two errors and it makes sense. The 10b number is in the text but not the arithmetic.

16000 words per day * 26000 days * 11 chars per word = 4.5E9 chars per person.
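Reproducing the thread's arithmetic (with the article's 16,000 words/day, a 72-year lifespan, and the 11.66-character average word length) makes the days-versus-hours slip visible:

```python
# Redoing the article's arithmetic to locate the 10-billion error.
words_per_day = 16_000
days = 72 * 365                  # 26,280 days in 72 years

words = words_per_day * days
print(words)                     # 420480000 (~420 million, the correct figure)
print(words * 24)                # 10091520000 - the stray "10 billion"
print(round(words * 11.66))      # 4902796800 (~4.9 billion characters)
```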


That's my bad, accidentally left in some earlier wording. All the numbers are correct except for references to 10 billion - those should all be 420 million. It should be updated in the text soon.


This reminds me of a very fun and interesting read called "A Short Stay in Hell" by Steven Peck, which provides an entertaining perspective on infinity and very, very large finite time periods. It's about a Mormon who goes to hell (because Zoroastrianism happens to be the One True Religion). Hell does not last forever though. For the main character, it's a library that contains every possible communication that could exist. Once he finds the book that contains the story of his life, he gets out. Very fun read that addresses large but finite values, although it focuses more on time rather than space.


Interestingly, no one has mentioned the Library of Babel[0]

One could assert that if you were to translate Chinese/Russian/non-UTF characters into UTF, you'd be covering every word ever possibly said.

[0] https://libraryofbabel.info/


Hm... just the words loses so much -- the tone, the emphasis, the pauses. I think we'd have to do at least audio. Though of course expressions, hand movements and bearing count too, so I'm thinking we need a number for video as well.


That's only a concern if you want to store some measure of identity and/or meaning; the author just wants to store words. In any case, you could improve things by putting the words into a screenplay-like format, which adds only a relatively small amount of text. (I would estimate a typical screenplay or play script is 90% dialogue.)


If we can assume determinism, there’s a much better compression algorithm.


This would be an interesting dataset to explore! A biographer's dream. Insider information on every corporate & governmental decision in history. Intimate daily-life details from early hominids.


But what if it was encoded in a novel storage format such as DNA?


How can DNA be a novel storage format when its use for information storage predates computers by at least 3 billion years?


Well, novel for us to use it in this way....

Hmmm, better patent "do everyday things... in DNA"


You can leave out all of Twitter


It's not taking UTF-8 into account, though, so maybe double or triple that.


Encoding schemes like UTF-8 don’t affect compressed size much. What matters is the quantity of information.
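A quick demonstration of that point (a sketch; any general-purpose compressor would do): the UTF-8 encoding of German text is longer than its Latin-1 encoding, but after compression the difference nearly vanishes, because the underlying information content is the same.

```python
import zlib

# Same German text, two encodings: UTF-8 uses 2 bytes per umlaut,
# Latin-1 uses 1, yet the compressed sizes end up nearly identical.
text = "Die Größe hängt von der Kodierung ab, nicht vom Inhalt. " * 200
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

print(len(utf8), len(latin1))                                 # UTF-8 is larger raw
print(len(zlib.compress(utf8)), len(zlib.compress(latin1)))  # nearly equal
```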


You could probably do this with minimal overhead by organizing the words by language assuming a codepage for each set.



