

Full-history English Wikipedia dump produced: 5.6TB uncompressed, 32GB 7z'd - chl
http://infodisiac.com/blog/2010/04/full-history-dump-for-english-wikipedia-is-back/

======
philwelch
Doesn't include deleted articles, so no hope if you want to recover one of
them. This is a pity since Wikipedia deletes too many articles.

~~~
_delirium
That's talked about on and off, but one issue is that they'd have to filter
deleted articles by deletion reason, at least broadly into "deleted for legal
reasons" and "deleted for non-legal reasons" bins. There'd be no problem
distributing a dump of articles deleted due to non-notability, but a dump of
articles axed for copyright violation, libel, or other legal issues would be a
problem.

For specific articles deleted for non-legal reasons (most commonly
notability), you can get a copy from a WP admin. Some have volunteered
themselves as willing to answer requests:
[http://en.wikipedia.org/wiki/Category:Wikipedia_administrato...](http://en.wikipedia.org/wiki/Category:Wikipedia_administrators_who_will_provide_copies_of_deleted_articles)

~~~
pbhjpbhj
Surely articles have the reason for deletion posted on them, or on their
talk pages, to allow for responses? Also, there should be some way for
people to find out that an article has been deleted, so that they don't
recreate it and repeat the error. Indeed, rather than deleting, couldn't a
placeholder be implemented?

Wrong forum for these suggestions but I've never had the time or inclination
to attempt to reach the Wikipedia inner sanctum.

~~~
_delirium
Yeah, the deleted articles could probably be automatically filtered, at least
on a fail-safe basis by only dumping the ones that have a known "not a legal
issue" deletion reason, like "notability" (there are even semi-formalized
deletion codes, which is itself a mild absurdity). There are probably
non-technical / non-legal reasons people don't want to dump them, but
there's also just a "not a priority" aspect. The dump reported in this
story is actually
the first successful full-history dump in quite some time, because the dump
scripts were perennially broken / bogging down due to the size of the data /
crashing due to MySQL weirdness. So most of the dump effort has been on just
getting the official stuff out. Next up on the priority list will probably be
some way of doing image dumps.

You do get a bit of a warning if you recreate a deleted page. When you go to
the editing screen at the title of an article that was previously deleted,
it'll show you the summary from the deletion log at the top, and ask you if
you're sure you want to recreate it. There's also a "nothing can go here"
protected placeholder used for articles that are persistently being recreated,
which'll make it impossible to edit at that location.

Yeah, I can sympathize on the Wikipedia-inner-sanctum thing. I was actually
pretty deeply into it (I've been an admin since '04, was formerly on the
Arbitration Committee, formerly active on the mailing lists, etc.), but as the
Policy And Process kept accumulating, I lost interest in navigating it, so am
more on the periphery these days. It's probably inevitable that things would
go that direction, because in the early days there were probably <100
Wikipedians active enough to form the Wikipedia Cabal, all of whom at least
recognized each other's names, so stuff could be pretty informal. But it's
hard to scale that up to a site with 1700 admins and 15k+ editors. A lot of
how things are organized these days is kind of lame, but
honestly I have no idea how I'd do it better; despite its flaws it's often
still amazing to me that Wikipedia works at all.

------
MikeCapone
> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z
> compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected
result, or is there something special about Wikipedia that plays to the
strengths of 7z?

~~~
_delirium
I'd guess it has to do mainly with 7z being able to use a much larger
window (LZMA's dictionary can be tens of megabytes), while bzip2's block
size tops out at 900kB; and possibly with being able to do something better
with large runs of repeated text. There are large articles with hundreds of
revisions in a row that leave most of the content unchanged; [[George W.
Bush]], for example, is around 180kB per revision, and is edited a _lot_,
mostly with minor changes. So bzip2's block size means it can only squeeze
about 5 revisions into a block, and in the degenerate case where 100 edits
in a row changed only one character each, bz2 would end up storing 20 or so
basically identical copies of the article.
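
As a toy demonstration of that effect (made-up data, not the dump itself;
just a sketch using Python's bz2 and lzma modules), compare the two on 100
nearly identical ~180kB "revisions":

    import bz2
    import lzma
    import os

    # random bytes are incompressible on their own, so only the
    # cross-revision redundancy matters here
    revision = bytearray(os.urandom(180_000))   # stand-in for one big article
    history = bytearray()
    for i in range(100):                        # 100 one-byte edits in a row
        revision[i] ^= 0xFF
        history.extend(revision)

    print("raw :", len(history))                        # 18 MB
    print("bz2 :", len(bz2.compress(bytes(history))))   # ~20 copies' worth
    print("lzma:", len(lzma.compress(bytes(history))))  # ~1 copy's worth

bz2 has to start over every ~900kB, so each block keeps a full copy of the
base text; lzma's window reaches back across the whole run, so everything
after the first revision compresses to almost nothing.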

IIRC from some tests a year or so ago, Wikipedia hasn't found any significant
improvement from 7z over bz2 on the current-revisions-only dump, which looks
more like just normal English text; that's why it doesn't bother to provide a
separate 7z version of that. It seems to be only this pattern of [200kB
article][almost the same 200kB article][almost the same 200kB article again]
that 7z kills bz2 on.

~~~
JoachimSchipper
Does Wikipedia _really_ store every single revision of every single file? As
in, not deltas? Why is it done that way?

~~~
_delirium
In the dump, I think for robustness and ease of extracting subsets.

Robustness: Having to essentially play back a log to recover any particular
revision increases the chance of something eventually getting corrupted, so
it's somewhat safer to avoid that in something intended to be archival.

Ease of extracting subsets: For researchers, having the revisions be
independent allows you to filter the XML dump through a SAX parser (or
similar) to grab only revisions meeting particular criteria. If deltas were
stored, you'd have to reconstruct those revisions from the deltas, which would
make it _really_ expensive to do things like, "I want to look at every article
as it appeared at noon on April 1, 2007".

In the live DB, I think just because it's cheaper to get a ton of storage,
esp. for rarely-retrieved old revisions, than to add the overhead of computing
deltas and applying them to reconstruct revisions. In particular, you'd have
to compute a diff for _every_ edit in that situation, whereas currently
MediaWiki only computes diffs when a user requests to view one from the
"history" tab, which is a tiny proportion of all edits.

~~~
JoachimSchipper
I understand it's simpler to store everything, and simplicity _is_ a virtue;
but one could store the current revision plus deltas (and perhaps a few
intermediate revisions for oft-edited articles), and obtain performance at
least as good as in the current case. It would also save _lots_ of space.
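
For illustration, a toy version of that scheme in Python (nothing MediaWiki
actually does; K and the opcode format here are made up): keep a full
"keyframe" copy every K revisions and deltas in between, so reading any
revision replays at most K-1 deltas.

    from difflib import SequenceMatcher

    K = 10  # hypothetical keyframe interval

    def make_delta(old, new):
        """Encode `new` as copy-ranges from `old` plus literal new text."""
        matcher = SequenceMatcher(None, old, new)
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))      # reuse old[i1:i2]
            else:
                ops.append(("data", new[j1:j2]))  # empty for pure deletions
        return ops

    def apply_delta(old, ops):
        return "".join(old[op[1]:op[2]] if op[0] == "copy" else op[1]
                       for op in ops)

    class History:
        def __init__(self):
            self.store = []  # ("full", text) or ("delta", ops)

        def append(self, text):
            if len(self.store) % K == 0:
                self.store.append(("full", text))  # keyframe
            else:
                prev = self.get(len(self.store) - 1)
                self.store.append(("delta", make_delta(prev, text)))

        def get(self, n):
            base = n - n % K                  # nearest keyframe at/below n
            text = self.store[base][1]
            for i in range(base + 1, n + 1):  # replay at most K-1 deltas
                text = apply_delta(text, self.store[i][1])
            return text

Storage drops to roughly the changed text plus one full copy per K
revisions, at the cost of a bounded number of delta replays per read.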

------
jonknee
The article says Tb not TB, but in reality it appears to be TB. That's quite a
difference. Still seems heavy for text, but I assume the full text of every
revision is in it, not just diffs.

~~~
_delirium
Yeah, every revision is standalone, which is why it compresses so well
(obviously there are a lot of edits that make relatively small changes). One
reason is to make it easier for researchers to grab specific revisions, e.g.
run the dump through a filter returning only revisions as of June 1, 2006,
without having to apply a ton of diffs to reconstruct those revisions.

The dump schema is something like:

    
    
      <mediawiki blah blah ...>
        <siteinfo>
          some metadata
        </siteinfo>
        <page>
          <title>Article Title</title>
          <id>15580374</id>
          <revision>
            <id>139992</id>
            <timestamp>2002-01-26T15:28:12Z</timestamp>
            <contributor>
              <username>_delirium</username>
              <id>82</id>
            </contributor>
            <comment>vandalized this page</comment>
            <text xml:space="preserve">Complete text of this revision of the article goes here.
            </text>
          </revision>
          <revision>
            ...next revision of this page...
          </revision>
        </page>
        <page>
          ...revisions of the next page...
        </page>
      </mediawiki>
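
Given that schema, the filtering I mentioned is a few dozen lines with a
streaming parser. A rough sketch in Python (the dump filename is made up,
and this simpler variant just lists revisions made on a given day rather
than reconstructing a point-in-time snapshot):

    import xml.sax

    class RevisionLister(xml.sax.ContentHandler):
        def __init__(self, day):   # day like "2006-06-01"
            self.day = day
            self.chars = None      # buffer, set only inside elements we want
            self.title = ""

        def startElement(self, name, attrs):
            if name in ("title", "timestamp"):
                self.chars = []

        def characters(self, content):
            if self.chars is not None:
                self.chars.append(content)

        def endElement(self, name):
            if name == "title":
                self.title = "".join(self.chars)
            elif name == "timestamp":
                stamp = "".join(self.chars)
                if stamp.startswith(self.day):
                    print(self.title, stamp)
            self.chars = None

    # hypothetical filename; point it at the dump (or a decompressing pipe)
    xml.sax.parse("enwiki-pages-meta-history.xml",
                  RevisionLister("2006-06-01"))

Since revision text is never buffered, this streams through the whole dump
in constant memory.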

------
anigbrowl
Impressive... I wonder how big a content snapshot is, i.e. no article histories
and no meta-material like talk pages or WP:xxx pages, just the user-facing
content.

I was also sort of hoping to see from the stats what proportion of content
was public-facing vs. devoted to arguments between wikipedians. If you look
at the stats for 'most edited articles' (accessible from the top link), it's
interesting that of the top 50 most edited articles, only one, 'George W.
Bush', is user-facing - and I suspect that only made it in because of
persistent vandalism.

Still, with history and all included, there is some fabulous data-mining
potential here, with room to do some really innovative work. I'd hazard a
guess that the size of Wikipedia already exceeds that of existing language
corpora like the US Code...

 _/retreats into corner muttering about semantic engines and link free
concepts of total hypertext as necessary AI boot conditions_

~~~
_delirium
> I wonder how big a content snapshot is, i.e. no article histories and no
> meta-material like talk pages or WP:xxx pages, just the user-facing content

I don't know how big it is uncompressed, but they do have a dump of just that
part:

    
    
      2010-03-16 08:44:40 done Articles, templates, image descriptions, and primary meta-pages.
      2010-03-16 08:44:40: enwiki 9654328 pages (255.402/sec), 9654328 revs (255.402/sec), 82.9% prefetched, ETA 2010-03-17 03:08:26 [max 26568677]
      This contains current versions of article content, and is the archive most mirror sites will probably want.
      pages-articles.xml.bz2 5.7 GB

~~~
anigbrowl
Well spotted. This has great possibilities for education in the 3rd world.

~~~
Groxx
Perhaps this is a good time to point to this?

<http://thewikireader.com/index.html>

------
bprater
One wonders if this will be the first file fed into something approximating
machine consciousness. I'm not sure where else you can easily get such a
high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?

~~~
ajb
Bots are used for hugely common editing operations, such as various kinds of
cleanup.

------
baddox
I like the quick fix the site designer used to switch from a static layout to
a fluid one.

------
kez
Interesting, but somehow I doubt that many people have the setup to handle
this amount of data.

~~~
blhack
That's because, despite what the 14-year-olds on digg and reddit think, this
isn't for you to download on your computer at your house. This is for
_archival_ or data-mining purposes.

I apologize for the minor insult at digg/reddit, I just remember a few years
ago a link to the archive was posted on digg and everyone started downloading
it...unnecessarily wasting wikipedia's limited and donated resources.

~~~
eru
If that's a problem, they could have put it up with bittorrent and throttled
the bandwidth.

~~~
rmc
Yeah, this is _exactly_ the sort of problem bittorrent was designed to handle.

------
helwr
40 + 15 days to compress? How long would it take to decompress this thing?

~~~
unwind
7zip, like many other compression schemes, is optimized and designed so that
decompression is typically (much) faster than compression.

The web page (<http://www.7-zip.org/7z.html>) states that the default "native"
LZMA format decompresses at between 10 and 20 times the speed that it
compresses.

So, 15 days / 15 is about one day to decompress, then.

~~~
leif
LZMA is well-known for its decompression speed. This is one of the reasons
it's a popular choice for filesystem compression. It's quite easy for LZMA
to keep a fair pace with a disk, so you get a noticeable performance boost
by adding LZMA at the filesystem layer, especially for read-
heavy workloads.

Gzip usually gets a slightly better compression ratio, but at the expense of
decompression speed, particularly on less compressible data (LZMA somehow
seems to know better when to give up trying to compress). Bzip2 has the best
compression ratio of the three, but is far too slow to compress and
decompress, so you end up losing more time decompressing than you gained by
doing less actual I/O.

EDIT: source, for those curious folk out there:
<http://portal.acm.org/citation.cfm?id=1534536> (caveat: the experiments were
run by taking a large file on disk and compressing it to another large file on
disk, so seek thrashing may have been an issue; I'm not inclined to take
the numbers entirely at face value)
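
If you want to sanity-check the tradeoffs yourself, here's a throwaway
Python sketch comparing ratio and wall-clock time on whatever file you hand
it; it runs in memory, which also sidesteps the seek-thrashing caveat:

    import bz2
    import gzip
    import lzma
    import sys
    import time

    data = open(sys.argv[1], "rb").read()
    for name, mod in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
        t0 = time.perf_counter()
        packed = mod.compress(data)       # default compression level
        t1 = time.perf_counter()
        mod.decompress(packed)
        t2 = time.perf_counter()
        print("%-4s ratio %5.2f  compress %7.2fs  decompress %7.2fs"
              % (name, len(data) / len(packed), t1 - t0, t2 - t1))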

~~~
ars
I'm pretty sure LZMA (7z) compresses better than bz2.

At least it does when I test it. But it's slower - sometimes much slower
(depends on settings).

~~~
leif
That might be true, though I don't know whether that depends on the entropy in
the file. It is likely the case that one of them compresses text better but
takes a larger performance hit with binary data or somesuch.

Turns out the paper I was recalling and referencing dealt with LZO, not LZMA,
so maybe I have less to say about LZMA than I thought. Shows how much you
jerks read before upvoting. ;-)

