

How to Mirror Wikipedia - jvoorhis
http://www.igeek.co.za/2009/10/16/how-to-mirror-wikipedia/

======
GavinAnderegg
I was going to make a joke about just using

    wget --mirror http://en.wikipedia.org/wiki/Main_Page

so I decided to try it first. After downloading the robots.txt, it stopped. I
looked and saw the following:

    
    
      #
      # Sorry, wget in its recursive mode is a frequent problem.
      # Please read the man page and use it properly; there is a
      # --wait option you can use to set the delay between hits,
      # for instance.
      #
      User-agent: wget
      Disallow: /
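
To see what those rules mean for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses only the excerpt quoted above (the real robots.txt is longer), and the helper name `allowed` and the user-agent strings are illustrative assumptions, not anything Wikipedia publishes.

```python
from urllib.robotparser import RobotFileParser

# Only the excerpt quoted above, not the full robots.txt.
ROBOTS_EXCERPT = """\
#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /
"""

MAIN_PAGE = "http://en.wikipedia.org/wiki/Main_Page"


def allowed(user_agent: str, url: str = MAIN_PAGE) -> bool:
    """Hypothetical helper: may `user_agent` fetch `url` under the excerpt?"""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_EXCERPT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example user-agent strings, chosen for illustration.
    for agent in ("Wget/1.12", "Mozilla/5.0"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

With only this excerpt loaded, any agent whose name contains "wget" is refused for every path, while agents with no matching rule fall through to the parser's default of allowed.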

------
mukyu
The English Wikipedia uses a massive number[1] of MediaWiki extensions. At the
very least you are going to need ParserFunctions (used in basically every
template) and Cite to properly display basically every article. Math,
WikiHiero, SyntaxHighlight, Poem and who knows how many others are needed for
non-general pages. You'll also be missing images, and getting them would
require significant effort, since there has not been a batch download in years
and you would have to get images from both en and Commons.

[1] <http://en.wikipedia.org/wiki/Special:Version>

------
charliesome
Don't download the archive directly from Wikipedia - that wastes their
bandwidth. Grab the torrent instead.

------
redthrowaway
Does anyone know if this includes media like pictures and audio files? 20 gigs
uncompressed seems way too small for all of the media on Wikipedia. Granted,
it's >2 years old, but that still seems a bit shy.

~~~
Hrundi
It doesn't, due to licensing issues:
[http://en.wikipedia.org/wiki/Wikipedia:Database_download#Whe...](http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_are_images_and_uploaded_files)

~~~
redthrowaway
That makes sense, thanks.

------
wladimir
Nice initiative. Even though I trust the Wikimedia Foundation, it _is_ a
single point of failure. It'd be a waste for all the information to disappear
if they somehow stopped or became unreachable, either voluntarily or
involuntarily.

So, mirror what you can :-)

BTW: Is it possible to mirror incrementally? Downloading the 5.2GB file every
time is a big waste of bandwidth.

