
10,000,000,000,000,000 bytes archived - cleverjake
http://blog.archive.org/2012/10/26/10000000000000000-bytes-archived/
======
ChuckMcM
The Internet Archive is a wonderful thing. I recovered much of my web site
when my server burned in a fire; it was cheaper than the $2400 it would have
cost to try to pull it off the melted hard drives. It has also provided fodder
for a ton of lawsuits, of the patent/IP/he-said-vs-she-said varieties.

Given the latter use, and the subsequent 'retro-takedowns' that have occurred
on the archive, I wonder if there is a market for 'a copy of the archive right
now' which would be hard to retroactively modify? And I wonder what the legal
theory would be around having a tape archive of something that was 'clear' at
the time you took it, but was later 'redacted'. Could you use your copy of the
unredacted information?

~~~
JoshTriplett
I wonder if archive.org actually _deletes_ things when taken down, or just
makes them inaccessible? Likewise for when archive.org takes content down due
to a new robots.txt file that didn't exist on the original site, as often
happens with domain squatters.

~~~
wtallis
They only disable access, not delete. There was once a court case where the
defendant got the court to compel the plaintiff to alter their robots.txt in
order to allow the defendant to gather evidence from the archive. (Apparently,
the Archive managed to convince the court that manually producing the info
hidden in the archive would be too onerous.)

------
Zenst
I do wonder what the best form of compression would be and, given that these
are web pages, whether some form of custom compression that was optimised for
HTML would be useful.
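
To make that concrete, here is a minimal sketch of the kind of thing I have in
mind (Python's zlib with a preset dictionary; the dictionary contents below are
made-up boilerplate for illustration, not anything the Archive actually uses):

    import zlib

    # Hypothetical shared dictionary of markup that shows up in most pages.
    HTML_DICT = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
                 b'<title></title></head><body><div><p></p></div></body></html>')

    def compress_page(html: bytes) -> bytes:
        # Priming zlib with a preset dictionary lets common tags compress
        # to short back-references instead of being stored literally.
        c = zlib.compressobj(level=9, zdict=HTML_DICT)
        return c.compress(html) + c.flush()

    def decompress_page(blob: bytes) -> bytes:
        # The decompressor must be primed with the same dictionary.
        d = zlib.decompressobj(zdict=HTML_DICT)
        return d.decompress(blob)

    page = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
    assert decompress_page(compress_page(page)) == page

The win would be biggest on small pages, where the fixed boilerplate dominates.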

Then again, with that volume of data, what would CERN do for storage/access
for a data pool that size while keeping it usable?

Reason being, if you wanted to back that lot up and ship copies for research
purposes, then with today's technology the humble memory stick, even the
biggest, would fail to even hold the file index at this scale. Scary amount of
data. But certainly a data set many would like to play with and try things out
on, being the geeks we are.

~~~
juiceandjuice
CERN has a variety of techniques to access, store, and back up their data. I
know from experience that some experiments, or at least some parts of
experiments, use layers and layers of abstraction, like Scalla/xrootd, which
operates on clusters of servers directly or over NFS.

In addition to this, there are levels of processed data. For example, raw data
is usually level 0, basic processed data is level 1, and processed and
calibrated data is usually level 2, etc., but experiments often have different
definitions for each level. Reprocessing of any level can happen at any time,
although level 1 reprocessing is usually an extremely intensive operation
because it operates on the largest amount of data.

Level 0 data is usually heavily compressed when it's left on disk, because
it's typically the largest amount of data, but also least touched.

Most scientists will use level 2 or level 1 data. This data will be on
low-latency clusters.

So, while CERN has petabytes of data, typically only a fraction of that will
be readily accessible.

In the past, level 0 data was often left on tape. While the raw data is still
backed up to tape (I know this is the case for ATLAS), many experiments with
large amounts of data might leave it on lower-cost HDDs in simple RAID arrays
for redundancy and not worry about performance so much. The BaBar experiment
has done this for their long-term data analysis.

In addition to all of this, it's still occasionally easier to transfer large
amounts of data via tape instead of the internet.
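
As a rough sketch of how that tiering tends to look (the level numbers and
storage choices below are purely illustrative, not any experiment's actual
configuration):

    from dataclasses import dataclass

    @dataclass
    class DataLevel:
        level: int
        description: str
        storage: str        # where it usually lives
        compressed: bool    # heavily compressed at rest?

    # Illustrative tiering, loosely following the description above.
    TIERS = [
        DataLevel(0, "raw detector output",         "tape / cold HDD",     True),
        DataLevel(1, "basic processed data",        "disk cluster",        False),
        DataLevel(2, "processed + calibrated data", "low-latency cluster", False),
    ]

    def levels_to_reprocess(from_level: int) -> list[int]:
        # Redoing a level also invalidates everything derived from it,
        # which is part of why level 1 reprocessing is so heavy.
        return [t.level for t in TIERS if t.level >= from_level]

    print(levels_to_reprocess(1))   # [1, 2]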

------
taf2
Google does something interesting when you use google.com to convert that
number to, say, gigabytes by doing a search such as: 10,000,000,000,000,000
bytes to gigabytes

the result is 9.31323e6

Notice the 'e'... because it's in the same font, your eye might miss it like
mine did, and then you'd say to yourself... 10 gigabytes is so small, who
cares... but if you do the same search again, this time to petabytes, you'll
realize there was an 'e' in the gigabyte number...

So Google says 8.88178 petabytes. That's a lot.
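
For anyone who wants to double-check the arithmetic, a quick sketch reproducing
those numbers (they line up with 1024-based units, even though Google labels
them gigabytes/petabytes):

    n = 10_000_000_000_000_000      # 10^16 bytes

    print(f"{n / 1024**3:.5e}")     # 9.31323e+06  (gibibytes)
    print(f"{n / 1024**5:.5f}")     # 8.88178      (pebibytes)
    print(f"{n / 1000**5:.0f}")     # 10           (SI petabytes)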

~~~
rm999
It's exactly 10 petabytes. Giga/peta are SI prefixes and hence are in base 10.

BTW, take Google's conversions with a grain of salt. As I understand it,
Google's conversion was just a minor project and not something that is
maintained to a high quality. For these things Wolfram Alpha is almost always
better, because it's the kind of thing Alpha was designed to do.

~~~
cynwoody
Google Calc is a bit behind the curve. The proper prefixes for 2^30 and 2^50
are gibi and pebi, not, as you point out, giga and peta.

<http://en.wikipedia.org/wiki/Timeline_of_binary_prefixes>

~~~
devcpp
Exactly what he was pointing out.

------
paulsutter
I love the Internet Archive, but is there a typo? At Quantcast we collect more
than 10PB of logfiles each year, and we process 30PB of data in a day.

Something seems off about the number.

------
pjscott
That number has a lot of zeros at the end, but really, it's just barely
getting started. This may be the greatest electronic civilization archival
project in human history, but it is also the smallest and most impoverished.
It has a lot of growing left to do.

------
charonn0
You know it's a great party if the music is performed live by the guy who
wrote the book in your field.

------
tripzilch
> On Thursday, 25 October, hundreds of Internet Archive supporters,
> volunteers, and staff celebrated addition of the 10,000,000,000,000,000th
> byte to the Archive’s massive collections.

So, I bet everyone is dying to know, what was the 10,000,000,000,000,000th
byte then?

~~~
beernutz
If you look at the picture, they SHOW the actual byte in binary!

Looks like it is 236, or 11101100.

~~~
tankbot
I want to know what this byte was a piece of. Probably Philosoraptor or some
cat video.

------
tarice
Interesting note at the bottom for those who may have missed it (Donald Knuth
on the organ is very distracting):

 _> The only thing missing was electricity; the building lost all power just
as the presentation was to begin. Thanks to the creativity of the Archive’s
engineers and a couple of ridiculously long extension cords that reached a
nearby house, the show went on._

------
3rd3
That is already about five times the identifiable storage capacity of the
human brain!

------
desbest
The Internet Archive uses 10TB more storage every month. I don't know how they
do it.

~~~
cinch
I'm interested in their server setup: DIY or vendor?

------
barbs
Did anyone else read the title as "10 bajillion bytes archived" or was that
just me?

------
arasmussen
"Ten Petabytes (10,000,000,000,000,000 bytes) of cultural material saved!"

Not quite 10 petabytes: (10 * 1024^5) > 10^16

edit: this is wrong, silly me.

~~~
anonymouz
I think by now we have more or less standardized on 10^15 as a Petabyte, as
this is consistent with the SI system. For the powers-of-2 based approach, the
-bi- prefixes are now well established. So:

1 Petabyte = 10^15 Bytes

1 Pebibyte = 1024^5 Bytes = 2^50 Bytes.

See <http://en.wikipedia.org/wiki/Petabyte> .

This way we no longer need to abuse/redefine the SI prefixes!

edit: definition of Petabyte corrected from 10^16

~~~
arasmussen
Oh interesting, thanks for that tidbit of knowledge! I always thought they all
(mega, tera, giga, etc) went off the powers-of-2 system.

Also, you have a typo, 1 Petabyte = 10^15 bytes I believe.

~~~
anonymouz
> Also, you have a typo, 1 Petabyte = 10^15 bytes I believe.

Thanks!

The power-of-2 system was quite popular for a long time when applied to
computers, but always clashed with the much older (and well-thought-out) SI
system, where everything is defined in powers of 10 and which is used
everywhere else (think physics, ...). I think hard drive companies were the
first to switch their definitions to the decimal-based system (for the obvious
reason that their disks would then seem bigger to people accustomed to the
common power-of-2 system...). A decade or so ago the kibi, mebi, ... prefixes
were introduced, and at some point people more or less switched to those.
There seems to be an even more detailed article about this at
<https://en.wikipedia.org/wiki/Binary_prefix> .

------
z92
10,000,000 GB archived sounds cooler.

