
Open-source app releases first complete copy of English Wikipedia with images - gnosygnu
http://xowa.sourceforge.net/
======
nowarninglabel
Some people wonder what this is good for. For me, it would have been great
when I worked on a training ship as the IT support. Students were not allowed
access to the internet (nor could our meager satellite uplink have supported
it), but for my second stint on the ship, I wanted to see if I could provide
them with Wikipedia. So I grabbed the dump, set up MediaWiki, and imported
it... and waited... for days and days, and eventually the thing loaded, and it
was great. But it would have been super nice to have an easy installer to
handle all that. So, yes, there are use cases out there.

~~~
gnosygnu
That's pretty impressive. I never had the patience to sit through a full
MediaWiki import for en.wikipedia.org.

Just to be clear, XOWA isn't an installer for MediaWiki, but its own app.
This allows it to avoid the dependency on the entire MediaWiki tool-chain
(Apache, PHP, MySQL, MediaWiki). Unfortunately, this means that XOWA has to
reproduce the same logic, which is quite a challenge...

~~~
fauigerzigerk
It is indeed a challenge. The MediaWiki syntax is the weirdest mess I have
ever had to parse. There is no spec, real-world usage deviates significantly
from the help docs, and it's a Turing-complete language with heaps of
backwards-compatibility hacks. So if you have something reasonably complete
and correct, then kudos to you!

~~~
tommorris
The visual editor uses a new parser, Parsoid, which has been implemented
separately in node.js (iirc). That may be the answer...

~~~
gnosygnu
Yup. It also has its own DOM, rather than continuously adding to one string
and repeatedly running regexes on it (which is what MediaWiki does today).

I was already pretty far along with my own parser before Parsoid was usable,
though (and my parser has its own DOM / hooks).
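
To make the distinction concrete, here is a toy sketch (not XOWA's or Parsoid's
actual code, and covering only a tiny hypothetical subset of wikitext: bold and
links). The legacy approach keeps rewriting one HTML string with regexes, while
a DOM-style parser first tokenizes into nodes that hooks can inspect or rewrite
before anything is rendered:

    import re

    # Legacy-style approach: one string, repeated regex substitutions over it.
    def render_with_regex(wikitext: str) -> str:
        html = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r'<a href="\1">\2</a>', wikitext)
        html = re.sub(r"\[\[([^\]]+)\]\]", r'<a href="\1">\1</a>', html)
        html = re.sub(r"'''(.+?)'''", r"<b>\1</b>", html)
        return html

    # DOM-style approach: tokenize into a node list first; hooks can walk or
    # rewrite the nodes before rendering.
    def parse_to_nodes(wikitext: str):
        nodes, pos = [], 0
        token = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]|'''(.+?)'''")
        for m in token.finditer(wikitext):
            if m.start() > pos:
                nodes.append(("text", wikitext[pos:m.start()]))
            if m.group(1) is not None:
                nodes.append(("link", m.group(1), m.group(2) or m.group(1)))
            else:
                nodes.append(("bold", m.group(3)))
            pos = m.end()
        if pos < len(wikitext):
            nodes.append(("text", wikitext[pos:]))
        return nodes

    sample = "'''XOWA''' reads [[Wikipedia|wiki]] dumps."
    print(render_with_regex(sample))  # <b>XOWA</b> reads <a href="Wikipedia">wiki</a> dumps.
    print(parse_to_nodes(sample))     # [('bold', 'XOWA'), ('text', ' reads '), ('link', ...), ...]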

~~~
tommorris
MediaWiki is such an astoundingly fugly piece of software.

------
voltagex_
> Space required during initial import: multiply the dump file by 8. For
> example, for English Wikipedia, the dump size is 10 GB. You should at least
> have 80 GB space free for the import process

> Space required when completed: multiply the dump file by 2.5. For example,
> for English Wikipedia, the dump size is 10 GB. When done, your English
> Wikipedia will be 25 GB.

Ouch, looks like I won't be trying this here any time soon.

~~~
jmillikin
It's only ~3 GiB once you leave out the articles about individual Pokemon.

~~~
runn1ng
Wikipedia doesn't have that. For most Pokémon, they have summary pages like
this

[http://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon_(252%E2%80...](http://en.wikipedia.org/wiki/List_of_Pok%C3%A9mon_\(252%E2%80%93319\))

------
sirsar
See also: Kiwix

For one thing, it has an Android app, and it's easy to put the whole thing on
your external SD card. It also provides an index for full-text search.

[http://www.kiwix.org/wiki/Main_Page](http://www.kiwix.org/wiki/Main_Page)

~~~
gnosygnu
Kiwix's Android app and the full-text search are both great features.

However, I'll point out that Kiwix has not updated English Wikipedia since
January 2012. Also, XOWA works directly with the Wikimedia dumps
([http://dumps.wikimedia.org/backup-index.html](http://dumps.wikimedia.org/backup-index.html)),
so it's (a) always up to date and (b) able to work on any wiki (Kiwix needs to
release the ZIM file first).

Also, XOWA can run from an external SD card (including FAT32-formatted ones).

~~~
ddeck
Another Android option is Fastwiki [1]. No images, but it provides a tool to
convert native Wikimedia dumps. It also works with older Android versions.

[1]
[http://fastwiki.qwrite.info/en/index.html](http://fastwiki.qwrite.info/en/index.html)

------
laxatives
Since Wikipedia doesn't have ads (and I'm guessing gets no real benefit from
the number of clicks), maybe this could be a nice way of lightening their
server load and reducing some of their costs.

~~~
_nb
Probably not. In your ordinary browsing, you're not likely to download
anywhere near the entirety of Wikipedia. Plus, spreading your requests out
over time rather than downloading it all at once is gentler on the servers.

~~~
gnosygnu
Good observation, but I just want to point out that the full dumps are
downloaded from different servers. They are even mirrored by other
institutions. See
[https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_...](https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps)

~~~
cdash
They could set up a torrent as well, and almost assuredly people would seed it
full-time on servers.

~~~
rakoo
There is a torrent distribution channel: see
[https://meta.wikimedia.org/wiki/Data_dump_torrents](https://meta.wikimedia.org/wiki/Data_dump_torrents)

However, it suffers from the number one torrent issue: torrents do not
tolerate change. This means that:

\- When an article changes, you need to generate a new torrent.

\- When a new torrent describes the archive, it needs to be downloaded _from
scratch_ by _all_ peers, so that the maximum number of peers is available for
a newcomer.

I hope you'll understand that this is not the official way to distribute
archives...

~~~
gnosygnu
Yeah, but they're working on incremental updates, so this should make the
"from scratch" part much easier. You can look at
[https://www.mediawiki.org/wiki/Incremental_dumps](https://www.mediawiki.org/wiki/Incremental_dumps)
and
[http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-Aug...](http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-August/000915.html)
if you're interested.

------
EvanDotPro
I wouldn't mind having something like this for long flights without wifi.

Edit: Kiwix looks nice too, thanks!

------
espresso77
As a thought experiment, assume you were downloading and printing on low-acid
paper (or otherwise storing for the long term) the entire contents of
Wikipedia as insurance against some near-extinction event. Is there a way to
allow future generations to correctly interpret the arbitrary text on a page
into meaningful information using only "the page" as a medium?

(I tried googling the topic, but I'm very lost as to even where I should
start. The main question that I'm interested in could be summarized as, "How
do we ensure our knowledge as a species/society is not lost in an
unintelligible format?")

~~~
Houshalter
Well, practically speaking, you could just use images and then have the first
book be a sort of pictionary, with words and pictures that demonstrate them.
Once they figure out enough words, they can start to figure out other words
that can't easily be explained by a single picture, and grammar can be worked
out from there. Sort of like the Rosetta Stone.

But this is useless to the post-apocalyptic hunter-gatherer; civilization
would have to be re-established by then. And hopefully they at least
understand the idea of writing words on paper and don't just worship it as a
religious relic.

There has actually been work on doing that: attempting to mark radioactive
waste dumps in a way that even someone from a completely different culture in
the distant future could understand. It's really interesting and there is a
PDF on it here: [http://prod.sandia.gov/techlib/access-control.cgi/1992/92138...](http://prod.sandia.gov/techlib/access-control.cgi/1992/921382.pdf)

If you are going to communicate with a _completely_ alien civilization, that
is, one that can't even understand images, expressing information is even more
difficult. My best guess is that you send a message with a really obvious
pattern to it, then use that pattern as a basis for sending more information.

For example, send a ton of examples of simple code in a simple programming
language, along with their output. They can figure out what it means. Then
encode your messages in the programming language somehow. Send a simulation of
3-dimensional space with little objects in it interacting, for example.
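
A minimal sketch of that first step (entirely hypothetical, just to make the
idea concrete): emit pairs of tiny programs and their outputs, so a recipient
could infer the notation from the pattern alone.

    # Hypothetical sketch: emit (program, output) pairs with an obvious pattern,
    # so a recipient could infer the notation purely from the examples.
    programs = ["1 + 1", "2 + 3", "2 * 3", "10 - 4", "7 * 7"]

    for src in programs:
        result = eval(src)           # the "output" half of each example pair
        print(f"{src} => {result}")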

Somewhat related:
[http://lesswrong.com/lw/qk/that_alien_message/](http://lesswrong.com/lw/qk/that_alien_message/)

~~~
sejje
A movie might work.

It could possibly be made future-proof by using a flipbook format, assuming we
can find a suitably long-lived material to print on.

~~~
Houshalter
An animation of what? A flip book would be much more difficult to make and
take far more space, and it wouldn't contain much more information than a few
pictures, let alone a book full of pictures.

~~~
sejje
I was thinking of an actual film, which wouldn't be difficult.

I kept thinking about the "how to tell future people about radioactivity"
example, and I think it's hard to convey the actual effects, while it may be
easy enough to convey that it's dangerous.

A movie would be much better at explaining what was there and what the
consequences of irradiation are than a few stills.

------
eksith
It would be interesting if it kept a version history between dump files. WP
often deletes articles that it finds not notable (these are excluded from the
dumps), and it would be nice to still retain the last version.

In an odd way, we're going back to the days when encyclopedias came on CDs.
There was Encarta on a single disc; now we can have a lot more in around 80 GB
([http://xowa.sourceforge.net/requirements.html](http://xowa.sourceforge.net/requirements.html)).

~~~
gnosygnu
This is an interesting idea. When Wikimedia finalizes an incremental backup
solution, it may be possible. They'll release a dump with incremental
additions / updates / deletions. You would then have XOWA accept the additions
/ updates, but ignore all the deletions.
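
A minimal sketch of that idea, assuming a hypothetical record format (the real
incremental-dump format was still being worked out at the time; see the
mediawiki.org link below): additions and updates are applied to the local copy,
while deletions are simply skipped so the last known revision survives.

    # Hypothetical sketch: apply an incremental changeset to a local page store,
    # keeping deleted articles around instead of dropping them.
    def apply_incremental(local_pages, changes, keep_deleted=True):
        for action, title, text in changes:
            if action in ("add", "update"):
                local_pages[title] = text
            elif action == "delete" and not keep_deleted:
                local_pages.pop(title, None)
            # with keep_deleted=True, deletions are ignored, so the last
            # known revision of the article stays in the local copy
        return local_pages

    pages = {"Foo": "old text", "Bar": "bar text"}
    changes = [("update", "Foo", "new text"),
               ("delete", "Bar", None),
               ("add", "Baz", "baz text")]
    print(apply_incremental(pages, changes))  # "Bar" survives the deletion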

It would place more responsibility on the user to maintain their copy of the
dump though.

~~~
eksith
That's true. The burden then shifts to the user. But in a way, that's also
good because then you can choose which snapshot to follow.

It's a bit like maintaining your copy of an OS. You can stick to the "stable"
branch or, if you're feeling adventurous, you can switch to "release". If
you're really into the bleeding edge, you can go with the "nightly" build.

All-in-all, I really like this.

One concern I have is the possible increased bandwidth load for WP. Maybe you
could include a small icon or notification encouraging donations to support
it. It couldn't hurt to have one there for yourself as well.

~~~
gnosygnu
Source control is an interesting analogy. In the same vein, when a user syncs
their version with the main branch, there will be hundreds of thousands of
changes to review. It'll be pretty harrowing for anyone to figure out what to
keep / reject. Just something to consider.

Anyway, thanks for the food for thought as well as your suggestion. I added
donation links for archive.org and wikipedia tonight.

------
splatzone
What's the main target market for this kind of thing?

~~~
gnosygnu
I personally use it for traveling. There are a few other applications as well:

\- low-bandwidth availability, particularly in less-developed regions of the
world

\- censorship evasion

\- security concerns: some users want to access Wikipedia without exposing
their machine to the internet

There are probably a few others I'm missing....

~~~
greenyoda
" _censorship evasion..._ "

Unfortunately, having your own copy of Wikipedia could also be used to
_enable_ censorship. For example, a fundamentalist school could have their own
version of Wikipedia from which they've purged all articles about evolution,
etc. Then they could configure their firewall to block the real Wikipedia.

~~~
gnosygnu
Agreed. However, I think it would be less work for them to block access
through a firewall policy than to remove the articles from XOWA.

By and large, for most private individuals, an offline app would allow them to
evade censorship. I'd hope that this benefit outweighs the risk of abuse by
others.

~~~
Someone
A firewall would not hide the fact that censoring is taking place. You would
have to rewrite content to do that. That might be easier to do in batch,
especially if you are going to use NLP to make the cut-up sentences
grammatical.

~~~
gnosygnu
Ahh.... That's pretty devious. I was thinking of blocking the entire article,
not rewriting content. Still not worth the work IMHO, but who knows what
censorship servants would do.

------
csmuk
This makes me wonder why we couldn't have a DVCS with encyclopaedia content in
it. That would be easy to "pull" offline and update regularly and "push" back
up with changes. It'd be easier to distribute content and version it as well.
Oh and patch queues could be used to review and edit content.

A local HTTP service or desktop app, DVCS and indexer would do a fine job of
this.

~~~
arthulia
I think it's just not feasible or useful for most people. The latest copy of
Wikipedia is 42 GB, and that's not including images or earlier revisions.

~~~
csmuk
To be honest, a lot of Wikipedia is crap.

An abridged version would perhaps be a couple of gigabytes, which isn't beyond
the realm of possibility. That'd fit nicely on a smartphone or tablet and
could be taken somewhere with less-than-adequate data connections (read: most
places on this planet).

~~~
adrianN
Good luck with getting the editors to agree which parts are crap.

~~~
qznc
You can get the page view logs and filter out most of the garbage that way.

[http://dumps.wikimedia.org/other/pagecounts-raw/](http://dumps.wikimedia.org/other/pagecounts-raw/)

~~~
tommorris
Justin Bieber: 594,757 views in the last month.

Origin of birds: 8,506 in the last month.

Ogden L. Mills (secretary of the US Treasury under Herbert Hoover): 399 views
in the last month.

What is in the public interest is not the same as what the public show an
interest in. Page views won't necessarily help you filter Wikipedia...

~~~
qznc
By filtering, I was thinking of discarding everything with fewer than 10
views. I consider all of your examples relevant. Getting rid of pages like
"Wikipeida_suxxxs" is the first step.

~~~
tommorris
If those pages exist, it would be awesome if you could notify the admins and
we'll delete them.

[https://en.wikipedia.org/wiki/Wikipedia:CSD](https://en.wikipedia.org/wiki/Wikipedia:CSD)

Just edit the page and add {{db-nonsense}}, {{db-test}}, or {{db-vandalism}}
as appropriate. :)

------
wrongc0ntinent
Text-only versions, even straight from dumps, have had some bad formatting and
truncation problems over the years; I can't wait to try this. And here's a use
case:
[https://news.ycombinator.com/item?id=6676661](https://news.ycombinator.com/item?id=6676661)

------
enterx
We demand an e-ink version + "DON'T PANIC" in large, friendly letters on the
cover.

Thanks. :)

~~~
keenerd
[http://kmkeen.com/tmp/wikireader1.jpg](http://kmkeen.com/tmp/wikireader1.jpg)

Though it is a reflective LCD and text-only.

