Eric Schmidt’s "5 Exabytes" Quote is a Load of Crap (rjmetrics.com)
112 points by robertjmoore on Feb 7, 2011 | 43 comments

The figure I've heard is that the data generated doubles every year (here, "data" can mean web pages, logs, transactions, etc.). Therefore, it follows that every year we create about as much data as in all the previous years combined (sum_{i=0}^{n-1} 2^i = 2^n - 1).

If we created X amount of data in 2003, then 7 years later we're creating 128X as much data, which works out to roughly X every 3 days.
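The doubling arithmetic in this comment can be checked with a short sketch. It assumes exact yearly doubling starting from X = 1 unit in 2003; the function names are made up for illustration.

```python
# Assumption: yearly data output doubles every year, with X = 1 unit in 2003.
def yearly_output(years_after_start: int) -> int:
    """Units of data created in the year that is `years_after_start` after 2003."""
    return 2 ** years_after_start

def cumulative_before(years_after_start: int) -> int:
    """Total created in all earlier years: sum of 2^i for i < n, which is 2^n - 1."""
    return sum(yearly_output(i) for i in range(years_after_start))

growth = yearly_output(7)        # 2003 -> 2010
days_per_x = 365 / growth        # days needed to create one X at the 2010 rate

print(growth)                    # 128
print(round(days_per_x, 2))      # 2.85
print(cumulative_before(7))      # 127, one unit short of the 128 created in year 7
```

So "X every 3 days" is a fair rounding of about 2.85 days, and each year's output exceeds the running total by exactly one unit under this model.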

It's still BS. A consumer-level analog camera captures around 1GB of data with each picture. Film sales peaked in 1999, when 800 million rolls of film were sold and 25 billion images were captured and printed, which works out to around 25 EB of data just from analog cameras in 1999.
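For anyone who wants to redo that multiplication, here is a minimal sketch. The 1 GB per frame figure is the commenter's own estimate, not a measured value, and decimal units (1 EB = 10^18 bytes) are assumed.

```python
GB = 10**9
EB = 10**18

images_1999 = 25 * 10**9        # images captured and printed, per the comment
rolls_1999 = 800 * 10**6        # rolls of film sold, per the comment
bytes_per_image = 1 * GB        # commenter's estimate for one analog frame

total_eb = images_1999 * bytes_per_image / EB
images_per_roll = images_1999 / rolls_1999

print(total_eb)          # 25.0
print(images_per_roll)   # 31.25
```

The 31.25 images per roll is only a sanity check on the two sales figures; the 25 EB total depends entirely on the 1 GB assumption.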

Your example also neatly illustrates that not all data is equally meaningful. You lose a lot of information when you do JPEG or MP3 compression, but it doesn't feel like you're losing much of value.

Until you try to enlarge that JPEG later.

"Doubles every year" is just a simple way of summarizing geometric growth to the layman. I'm sure info generation doubles every t, but the chance of t being very close to one year is quite low.

Other, less dramatic kinds of growth can appear exponential in their early stages. I'm always amazed at Internet growth data that are only fit to an exponential -- "at this rate, our startup is going to take over the universe in five years!"

Has everyone forgotten about logistic growth? There is probably a ceiling to most growth, so wouldn't it make more sense to ask where it lies? http://en.wikipedia.org/wiki/Logistic_function

I agree with you in general, but you picked a funny example to make this case - generation of information doesn't seem like something which will level off after a while.

The rate at which it is being generated might.

Edit: if you think of the logistic function as modeling the rate we create information, this is one possible "story": initially there's slow growth due to technology. Then the technology picks up, and there's the exponential phase we're seeing right now. That settles into a nice linear trend as the tech matures. Finally, we hit either natural (e.g., satiation) or technological limitations and that slows the rate of growth. At the very limit, we're still creating information, but at a constant rate. The rate of growth might be near 0, but the rate of production is still bloody high.
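The "story" in this edit can be sketched numerically. This is only an illustration of a generic logistic curve; the ceiling, steepness, and midpoint values below are arbitrary assumptions, not estimates of real data production.

```python
import math

def production_rate(t, ceiling=1.0, k=1.0, midpoint=0.0):
    """Logistic curve: information produced per unit time at time t."""
    return ceiling / (1.0 + math.exp(-k * (t - midpoint)))

def growth_of_rate(t, ceiling=1.0, k=1.0, midpoint=0.0):
    """Derivative of the logistic curve: how fast production is rising."""
    p = production_rate(t, ceiling, k, midpoint)
    return k * p * (1.0 - p / ceiling)

# Early on, each step multiplies production by nearly e^k: it looks exponential.
early_ratio = production_rate(-9) / production_rate(-10)

# Late on, production sits near the ceiling while its growth is near zero.
late_production = production_rate(10)
late_growth = growth_of_rate(10)

print(early_ratio, late_production, late_growth)
```

With these parameters the early ratio is close to e, while at t = 10 production is within a fraction of a percent of the ceiling and its growth is nearly zero, which is the "rate of growth near 0, rate of production still high" end state described above.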

Sorry, that is what I (obviously?) meant. It would have been trivial to state that information will continue to be generated.

But the doubling every year claim really only makes sense over a short period of history too. Starting from the article's 27 exabytes every 7 days in 2010, tracing back in time (halving each year), you get down to about 1 byte of information being produced in 1940.
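A quick way to check the back-extrapolation is to count how many halvings it takes to get from the 2010 figure down to a single byte. This sketch assumes decimal exabytes and exact yearly halving, as in the comment.

```python
import math

weekly_bytes_2010 = 27 * 10**18          # 27 EB per 7 days, the figure quoted above

# Each year back halves the amount, so the number of years needed to
# reach 1 byte is the base-2 logarithm of the starting amount.
halvings = math.log2(weekly_bytes_2010)
year_of_one_byte = 2010 - halvings

print(round(halvings, 1))        # about 64.5
print(round(year_of_one_byte))   # 1945
```

Under these assumptions the one-byte point lands in the mid-1940s, within a few years of the comment's rough 1940, so the conclusion that the doubling rule cannot hold far into the past stands.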

Forecasting these kinds of relationships back to the dawn of time seems a bit tricky to me - I would guess we're really seeing something like an S curve.

But it is also worth pointing out that data != information. The human eye captures more data than a fly eye, but both we and the flies manage to get the same information out (i.e. avoid predators, find food, shelter, mates, etc.) and both survive.

Based on the primary sources I’ve been able to piece together, the more accurate (but far less sensational) quote would be:

"23 Exabytes of information was recorded and replicated in 2002. We now record and transfer that much information every 7 days."

Call me crazy, but that sounds every bit as sensational to me. Seems like all this article is doing is getting overly picky with some throwaway, oft-repeated trivia stat. Who cares what the exact numbers are? The purpose of the statement remains the same.

I understand what you're saying, but I think you're marginalizing what was a pretty solid article. The author heard a questionable figure, sought the source for it, and reconciled the quote against reality. As a lesson in thinking for yourself as opposed to just believing what you hear -- something people are not often accustomed to doing -- I thought it was good.

That said, the title could have been much improved.

I think it's a really great piece. It gets you to think about the reliability of quotes like that, it makes you consider realistic statistics and research, and it points out that the numbers are still pretty sensational.

It's not so much nitpicky IMO as lightheartedly correcting the facts.

I think that counts as intellectually stimulating :)

Oh really? Schmidt's claim overstates reality by 5000 times and you don't care? Link is to an (awesome) online calculator showing how I arrived at this.


Yeah, that's right.

These are numbers incomprehensibly big. Nobody can even begin to mentally picture these in a reasonable fashion. The only part of the statement that is really important is the meaning, not the details of the numbers. That meaning is "we are creating shittons of data really really fast, faster than ever before". If you're getting hung up on the accuracy of the numbers used to express this, then you're missing the point, and I wonder why you don't have something better to do than get worked up about it...

EDIT: Furthermore, according to the "more accurate" statement from the article, we're creating 23 Exabytes in 7 days, not 5. Read it again: "_23 Exabytes_ of information was recorded and replicated in 2002. We now record and transfer _that much_ information every _7 days_."

To me, the difference between "23 exabytes every 7 days" and "5 exabytes every 3 days" is pretty irrelevant.

I think a better quote would be "By 2003, mankind had generated a shitload of information. Now we generate a shitload every day."

Exactly, the entire point of the quote was to get across the absurd magnitude of growth that we're seeing in a way that anybody can understand.

If you care enough about numbers and can figure out how to say it more accurately... more power to you I guess? Doesn't change the effectiveness of the quote though.

Using shitload as a unit of data: sarcasm or serious?

It's not a unit, it's a magnitude or count. Used in similar situations as "a lot". It's neither 'sarcastic' nor 'serious', but rather 'casual'.

Given a much smaller world population during most of history, and much less information transfer (other than f2f), I think your estimate is off by at least a factor of 100 (rough guess) more than Schmidt's.

Sorry to be nitpicking... :)

For those who don't want to bother with clicking that link: real growth is 50X, Schmidt's claim was 260,000X.

Just now bothered to look at your calculator thing, since I maintain the actual numbers involved are pointless details. However...

There was a 50x growth between now and the single year of 2002. Schmidt's quote stated 260,000X growth between now and the average of all data created in the past 5000 years.

In order to create an actual comparison of Schmidt Rate vs Real Rate, you need a 'real' number for the amount of data from recorded history until 2002, something that nobody has volunteered. Nobody is debating that Schmidt's numbers were inaccurate, and the number he provides for this is almost certainly wrong, but we don't know by how much. Extrapolating from the data for the single year of 2002 is flawed; it is foolish to think that the growth rates of data in 2002 CE and 2002 BCE are the same without data to suggest so.

If we accept that there has been a constant rate of growth of data creation, then your conclusion is valid.
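The calculator itself is not reproduced in the thread, so the following is only one plausible reconstruction of the 50X, 260,000X and 5000-times figures. It assumes both ratios compare "the same amount in 7 days" against either one year (2002) or 5000 years of history.

```python
# Assumption: both claims say "that much information is now created every 7 days".
PERIOD_DAYS = 7
DAYS_PER_YEAR = 365

# "Real" claim: one year's worth (2002) now takes 7 days.
real_ratio = DAYS_PER_YEAR / PERIOD_DAYS

# Schmidt-style claim: 5000 years' worth now takes the same 7 days.
HISTORY_YEARS = 5000
schmidt_ratio = HISTORY_YEARS * DAYS_PER_YEAR / PERIOD_DAYS

overstatement = schmidt_ratio / real_ratio

print(round(real_ratio))      # 52, quoted in the thread as roughly 50X
print(round(schmidt_ratio))   # 260714, quoted as roughly 260,000X
print(round(overstatement))   # 5000
```

Note that the 5000-times overstatement is just the assumed length of recorded history, which is exactly the objection raised above: the comparison is between a single year and an average over all of history, not between two measured totals.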

Interesting statistic: It has been said that 78% of all statistics are made up.

And people are 86.3% more likely to believe a statistic if it has a decimal point in it.

Richard Feynman was giving a lecture and, as usual, had gone off-script. He mentioned some historical event, but got the last digit of the date wrong (like 1951 instead of 1957 or something). He said, "Hey! 3 significant digits is pretty good for a theoretical physicist!" :-)

Edit: I can't link to the middle of a Silverlight presentation, but if you visit http://research.microsoft.com/apps/tools/tuva/ , click the middle (with the picture of Feynman), click Lecture 5, and skip to 17:00 in, you can see the incident. But you really should watch all of them, especially the one on Symmetry.

Dang Silverlight!

I wonder if anyone would be able to calculate the amount of data created in the last two millennia... and if so, how.

I would urge you to watch this video on how these kinds of claims can be made: http://www.youtube.com/watch?v=F-QA2rkpBSY

I guess the first issue would be to define what "data" is. Speech is data, weather is data...

"Data (plural of "datum") are typically the results of measurements"

Weather isn't data, a measurement of temperature would be data...


[Not that I am trying to imply that other things aren't data - but I don't think weather itself is data although you can, of course, have weather data!]

The quote said "information", not data.

Edit: "There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days."

As arethuza said.

Obviously there would be plenty of questions, ranging from the most basic (a CD released in 1990 contains 'x' MB of data on it... but there's also the sleeve design, printed track listing etc...) to more complex (how do you measure the data of a painting in computer terms?), but I'm sure if someone were to set about trying to work this out, they could think up (albeit debatable) definitions.

It's actually not all that difficult to measure the data of a painting in computer terms: get a high resolution scan and store it as a JPEG. Of course, this is open to some reasonable disagreements about what constitutes a "high resolution", but the basic principle applies. I'd suspect that it won't make a huge difference, since audio recordings take up roughly an order of magnitude* more space than still images, and video roughly an order of magnitude more information than audio.

* I have no substantiation for this, but it seems about right: a JPEG is a megabyte or 2, while a single movement of a symphony is 10 or 20, and a short movie at reasonably high quality is easily 100 MB.

That depends on what you call a high res img.

Worst case: http://articles.cnn.com/2007-10-17/us/monalisa.mystery_1_leo...

150,000 dots per inch, 13 light spectrums, including ultraviolet and infrared, at say 12 bits per spectrum * largest painting in the world (12 384 000 square inches) http://www.msnbc.msn.com/id/14309591/ns/world_news-weird_new...

150 000 x 150 000 x 13 x 12 x 12 384 000 = about 4.35 x 10^19 bits = 4.71279266 exabytes (counting an exabyte as 2^60 bytes)
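The worst-case estimate can be reproduced with a few lines. This sketch uses 12 bits per spectrum, the value that appears in the multiplication, and shows that the 4.71 figure corresponds to binary exabytes (2^60 bytes); in decimal exabytes the same total is a bit larger.

```python
DOTS_PER_INCH = 150_000
SPECTRA = 13
BITS_PER_SPECTRUM = 12            # value used in the multiplication above
AREA_SQ_INCHES = 12_384_000       # quoted size of the largest painting

# Dots per square inch is the linear resolution squared.
total_bits = DOTS_PER_INCH**2 * SPECTRA * BITS_PER_SPECTRUM * AREA_SQ_INCHES
total_bytes = total_bits // 8

decimal_eb = total_bytes / 10**18   # 1 EB  = 10^18 bytes
binary_eib = total_bytes / 2**60    # 1 EiB = 2^60 bytes

print(round(decimal_eb, 2))   # 5.43
print(round(binary_eib, 2))   # 4.71
```

Either way, a single painting scanned at this resolution would be on the order of the 5 exabytes in the original quote.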

Information is not all equal. Recording from /dev/random is not valuable information even though it fills up disk space. The value of information depends very much on the context.

A lot might have happened since 2002. People with digital cameras take a lot of pictures, for example. YouTube is booming. Lots of devices generate automatic data feeds, for example location tracking from mobile phones and clickstreams on the internet.

The number might still have been made up, but let's not forget that Schmidt might have some sources of information not available to the public, for example the server stats from Google and YouTube.

How timely! I was actually at a Google recruiting event/tech talk today at my university, where a Google engineer repeated this quote to us. Fittingly, he also misquoted it and said that 5 exabytes of data are created every day, instead of every two days as in the original quote. I looked at him askance for a moment due to the absurdity of the number--thanks for clearing it up!

Perhaps the figures he was given were based entirely on computer data - and he quoted them to sound like all data?

"We now create and replicate as much data in one week, as we did in one year, just a decade ago."

True, not as catchy as the dawn of time, but still mighty impressive. And in fairness to their outgoing CEO, Google didn't cache much data at the dawn of time (or even in the '80s), so it can't have been that important.

My tummy rumbled and I burped at 9:22AM EST this morning. Now that I have posted this: is that a piece of information?

My point is that a lot of this "information" is ephemeral and not really all that important in the long run.

Now it is.

Reply to the edit: If you stretch the timeline long enough, nothing is of any importance.

Yes, we are creating more data now than in the history of mankind. However, the ratio of (quality stored data / total data) has gone down with the ease of storage. Most of the "data" is for entertainment.

Lies, damn lies, and clichés.
