
Download Entire Wikipedia for Offline Use With an HTML5 App - antimatter15
http://offline-wiki.googlecode.com/git/app.html
======
aw3c2
"nearly all of the textual content of the English Wikipedia" = "1GB"

I find that hard to believe. Other wiki readers' dumps are a multiple of that.
Eg aarddict for en is ~8GB.

~~~
riffraff
from the blog: >>> First of all, it compresses not the entirety, but rather
the most popular subset of the English Wikipedia. Two dumps are distributed at
the time of writing, the top 1000 articles and the top 300,000, requiring
approximately 10MB and 1GB, respectively.

~~~
kristianp
Actually the top 1337 and 314159 articles, respectively :).

------
erichocean
Seriously mis-titled, since it's nowhere even close to the "Entire Wikipedia"
– it's a tiny subset of the English-language Wikipedia from what I can tell.

~~~
antimatter15
You can switch to a larger subset in the settings, but it is still somewhat
mistitled.

------
obsessive1
Nice job, this looks really useful - would certainly help for the times when
I'm stuck with no internet access and need to look something up.

One minor niggle: when I changed the file I wanted to use in settings, there
was no confirmation or notification to let me know it was downloading the new
file. I ended up stopping the download, erasing the data and starting again,
just to be sure. It might be worth adding a confirmation to let users know the
change was applied and the new file is being downloaded.

------
darylteo
Can we use this technology for the API documentation of various
languages/frameworks?

I could definitely use a bit of a productivity boost (by turning off web
access)

~~~
icebraining
You can just use a website mirroring tool like wget; they've been around for
ages. I've done just that with plenty of reference websites.

Don't forget to set an acceptable delay to ensure you don't overload the
servers, though. Mine usually run all night.
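
A throttled mirror of a reference site might look something like this (the
URL and delay are placeholders; adjust to taste):

```shell
# Mirror a site for offline reading, throttled so the server isn't hammered.
# --mirror implies recursion and timestamping; --convert-links rewrites
# links for local browsing; --wait adds a delay between requests.
wget --mirror \
     --convert-links \
     --page-requisites \
     --wait=2 \
     --random-wait \
     --no-parent \
     https://example.com/docs/
```

With --wait=2 and --random-wait, a large site can easily take all night,
which matches the parent's experience.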

~~~
devs1010
Selenium's WebDriver is great for this. It has different implementations, so
you can use it like wget (but more sophisticated), without running an actual
browser, or you can have it drive a real browser like Chrome or Firefox (good
for debugging).

------
kanzure
Also, there was a Wikipedia/git project a while back (offline editing). All of
the revision history was dumped into git.

<http://scytale.name/blog/2009/11/announcing-levitation>

[http://www.readwriteweb.com/hack/2011/08/gitjs-a-git-
impleme...](http://www.readwriteweb.com/hack/2011/08/gitjs-a-git-
implementation-in.php)

Why does mediawiki have its own version control system, anyway?

~~~
lsb
Because Wikipedia started in January 2001, well before even Subversion was
around.

Levitation looks really cool, I wonder if using gitwiki will be easier and
more sustainable than mediawiki.

And hey, Dan Lucraft who wrote git.js is still at Songkick, YCS07!

~~~
jmilloy
For the record, (Wikipedia says) the initial release of Subversion was Oct 20,
2000. It's still understandable that Wikipedia rolled its own.

~~~
lsb
Yup, but it became stable enough to host itself in August 2001, whereas I
think it was January or February 2001 that the first few articles started
trickling in.

------
bajsejohannes
This is cool, but the one thing I miss from all Wikipedia dumps so far is
images. They're essential for a lot of articles. Last time I checked, images
were excluded from dumps because of license issues, "fair use" in particular.
How about a dump of just the images with suitable licenses? Does anyone here
know why this is not available?

~~~
DanBC
People don't understand licences. There are many images with incorrect
licences. (There are bots that trawl the images to ask people to correct the
licences; there have been megabyte long flamewars about the operators of those
bots and how unpopular image tagging is.)

There are just too many infringing images, even those supposedly with the
correct licence, for Wiki* to distribute and stay safe.

------
blinkingled
Random article results in a 404 maybe once in four times. Here is a
suggestion for an improvement: a link on 404 pages for making them available
offline. So if I go looking for a specific page that isn't offline, I can
make it available and read it later.

------
kal00ma
An offline Wikitravel would be incredibly useful for travelers. I hadn't
found one yet, so I built an offline Wikitravel app for Android:
[https://market.android.com/details?id=com.heliod.eutravelgui...](https://market.android.com/details?id=com.heliod.eutravelguide&hl=en)

~~~
Wicher
I use the Wikireader (from OpenMoko) when traveling:
<http://en.wikipedia.org/wiki/Wikireader> I find it very useful, especially on
the longer wall-socketless cycling trips.

You can stick both wikipedia and wiktionary on it. Quite possibly also
Wikitravel, if they provide dumps.

~~~
timdoug
Wikitravel is available -- wget <http://wrmlbeta.s3.amazonaws.com> and you'll
see all the dumps they have available, e.g.:

<http://wrmlbeta.s3.amazonaws.com/entrav-20111105.7z.001>

In the age of nearly unlimited connectivity, I still find my Wikireader an
invaluable device when traveling.

------
vasco
Unfortunately I can't see any formulas correctly and the tables are quirky.
Example: [http://offline-
wiki.googlecode.com/git/app.html?Permeability...](http://offline-
wiki.googlecode.com/git/app.html?Permeability_\(electromagnetism\))

~~~
antimatter15
I thought of using jsMath or MathJax, but they're too big.

------
khuey
It says it was tested in Firefox 10, which is a little surprising since it
doesn't work at all in Firefox 10. The IndexedDB spec changed and Firefox
changed to align with the spec between 9 and 10, but the page uses the old
API.

~~~
antimatter15
I tested it on an infrequently updated installation of Firefox Nightly, and
the about page said Firefox 10. I didn't know the API changed; I'll look into
it. How exactly did it change?

~~~
khuey
Instead of doing

    var request = mozIndexedDB.open("databasename");
    request.onsuccess = function(event) {
      request = event.target.result.setVersion(N);
      request.onsuccess = function(event) {
        // set up your database
      };
    };

it looks like

    var request = mozIndexedDB.open("databasename", N);
    request.onupgradeneeded = function(event) {
      // set up your database
    };
    request.onsuccess = function(event) {
      // do stuff with your database
    };

Feel free to email me at <my hacker news username>@mozilla.com if you need a
more detailed explanation.

------
naner
Does this app grab the files from Wikipedia directly? It doesn't seem very
nice to create an app that pulls down gigabytes of data from a web service
you neither own nor have permission to use.

EDIT: It appears my concern was unwarranted.

~~~
blantonl
can someone explain the reason why this concern is unwarranted?

~~~
teraflop
The files are hosted by Google Code: <http://code.google.com/p/offline-
wiki/downloads/list>

------
samstave
Sorry for being obtuse, but I DLd the 1GB repository - where is it stored and
how do I access it?

I see I can go to the index, from this page - is this index served up from the
1GB DL I just did?

How can I transfer this to [device]?

------
huetsch
Thanks so much for this. This will be incredibly useful for me (behind the
GFW, which gets moody about Wikipedia pretty often). Could this easily
periodically update itself to grab fresh versions of articles? I think that
would be a great feature, especially if you could do it without having to pull
down the whole database each time you wanted to update, instead just updating
on an article-by-article basis.

------
js4all
Absolutely amazing. This technology can be used for many other offline
databases. He provides the tools for indexing, compressing and everything
needed for the reader. Make sure to read his corresponding blog post:
<http://antimatter15.com/wp/2011/12/offline-wiki-redux/>

------
Kakitus
Amazing project, just what I was searching for. A few recommendations:

Could you expand the available download options to include an option to
download all of wikipedia, not just a subset of the most popular articles?

Right now, mathematical and other kinds of formulae aren't rendered
correctly. Is there any way you could fix that?

An option to include pictures (maybe compressed or low-res versions) would be
neat.

Thanks!

------
SquareWheel
I remember when you announced this some months back; wasn't it a paid
application?
Great work either way.

------
pax
Can this also be synced, or does one need to delete/re-download the whole
dump?

------
chrisatlee
doesn't work for me with Firefox 10 or 11a2. It would be awesome if it could
be made to work there!

------
heifetz
doesn't seem to work on the ipad

~~~
antimatter15
It only sort of works on iOS 5: the downloads stop whenever an "Increase
Storage" prompt pops up, and you have to reload whenever that happens. But it
does work with the small dump, albeit slowly.

~~~
mmahemoff
Cool, I initially didn't think this much storage was possible on mobile yet.
Are you saying you can get the whole thing down if you keep agreeing to the
prompts?

It's a pity mobile browsers haven't got better support for this kind of thing
yet.

~~~
antimatter15
No, I think it stops issuing prompts after 50GB. Also, on iOS 5, it only
supports WebSQL, which (AFAIK) doesn't store objects like typed arrays, so I
have to convert everything to a base64-encoded string and back, which makes
it use even more space.
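
The round-trip he describes might look roughly like this (a sketch using
Node's Buffer for illustration; in a browser you'd build a binary string and
use btoa/atob instead):

```javascript
// WebSQL can only store strings, so binary data (e.g. a Uint8Array of
// compressed article text) has to be base64-encoded before storage and
// decoded again after retrieval. Base64 inflates size by about 33%,
// which is the extra space being complained about above.
function encodeForWebSQL(bytes) {
  return Buffer.from(bytes).toString('base64');
}

function decodeFromWebSQL(str) {
  return new Uint8Array(Buffer.from(str, 'base64'));
}

const chunk = new Uint8Array([0x1f, 0x8b, 0x08, 0x00, 0xff]);
const stored = encodeForWebSQL(chunk);    // 5 bytes -> 8 base64 chars
const restored = decodeFromWebSQL(stored);
console.log(stored, restored.length);
```

The function names here are hypothetical; the actual app's storage code is
in the linked repository.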

------
leeight
very cool.

