How to download all of Wikipedia onto a USB flash drive (planetofthepaul.com)
448 points by bubblehack3r on Oct 6, 2022 | 195 comments



Circa 2009 or so, my absolute favorite app for the iPod Touch was Patrick Collison's Offline Wikipedia (yes, that Patrick Collison: https://web.archive.org/web/20100419194443/http://collison.i...). You could download various wikis that had been pre-processed to fit in a very small space - as I recall, the entire English Wikipedia was a mere 2 GB in size. It was simply magical that I could have access to all of Wikipedia anytime, anywhere offline - especially since the iPod Touch could only connect to the Internet via WiFi. It was particularly useful while travelling, since I could load up articles and just read them on the plane.

As I recall, there were several clever things that the app did to reduce the size of the dump; many stub/redirect articles were removed, the formatting was pared down to the bare minimum, and it was all compressed quite efficiently to fit in such a small space. Patrick gives more technical detail on an earlier version of the app's homepage: https://web.archive.org/web/20080523222440/http://collison.i...


Yeah that was awesome, so useful/fun being able to go deep on a wikipedia reading session when there wasn't good mobile coverage and data was expensive.

In retrospect I do kinda miss _not_ having cell reception on vacations, as it was easier to disconnect from stuff.


> In retrospect I do kinda miss _not_ having cell reception on vacations, as it was easier to disconnect from stuff.

In retrospect I do miss _not_ having the internet or cell service at all. It's part of why I like to watch shows and movies from the 80s or 90s. It's funny because I love technology as much as anyone else on HN, but at the same time my idea of the perfect retired life is one that is almost entirely offline.


I don't know if it's still the case, but about ten years ago I took VIA Rail Canada from Vancouver to Toronto. It was a four-day train ride (with a brief stopover in Banff), and the vast majority of it was remote Canada crossing the Rockies, with no mobile network coverage.

I asked the train staff about WiFi onboard, and they said that they didn't have it and they preferred it that way. People take their train, not to get from A to B, but to disconnect and meet other people, read books, or watch the beautiful landscape going by. If people had Internet access, they'd be glued to their devices and wouldn't meet fellow travelers, and that's the magic of this line.

(They even said that their corporate management wanted WiFi in the trains so that the staff could digitize a lot of their paperwork, but the crew was resisting it because they believed it was exactly the lack of connectivity that kept people taking the train for vacation.)


The tea ceremony of: sit at the computer desk, boot up the computer, dial up the internet connection, etc. was pretty fun. Also the relative novelty.

(As I check HN from the bathroom, which would’ve been a childhood dream)


Yeah, "ceremony" is a nice way of terming it. Playing a vinyl LP was a more intricate tea ceremony.


I usually ‘lie’ to everyone that where I am going I will have no connection. Even if I go to London City or some urban hub :)


> In retrospect I do kinda miss _not_ having cell reception on vacations, as it was easier to disconnect from stuff.

What's preventing you from not taking a phone with you on vacation?


The expectations of other people, society, and, more and more, infrastructure (taking public transport, ...)


I was in Park City, Utah, not during ski season, just looking around, and we decided to stop and walk around. Parking meters theoretically took coins or cards, but all were nonfunctional. The only way to pay for parking was by installing their app. Promptly uninstalled it when I left, but still… with no smartphone, you literally could not legally park on the street.


I have an app called Kiwix that does this now. I also like Organic Maps for offline maps.


I worked on a slide deck where we were proposing SD cards and thumb drives preloaded with Kiwix and an almost full, almost fully localized version of Khan Academy (KA Lite at the time), which could be mass distributed in active conflict zones where schools and education were some of the first casualties. I don't think it ever went too far, which was a shame; compared to how other monies were spent at the time, these could have really made a difference.


Yeah, Kiwix is great! I have that on my phone to browse when I'm on airplanes.


HERE WeGo also has offline maps (although you have to download them prior to being offline).

Useful for when I have been traveling and need GPS nav and there isn't cell service.


I got a bunch of super cheap WikiReader devices and those were great for offline Wikipedia. I think I paid less than $5 per device.


Check those prices now. And did you know about the test utilities included with them? Incredible devices.


Bunch of neat Forth stuff in the firmware? Sadly I gave them away; hopefully they're not sitting in a drawer somewhere.


Not directly related, but way back in the day I had a Handspring Visor and would use something (I can't remember now what it was called) to download websites to it when I synced it, so they would be available on it.

At the time I was super into fan stories about Ultima Online, like PK Ghost and things, and would download those, and we would all pass the Handspring around to read them.

Dang, this actually brought up some good memories for me, thinking about it.



Was it AvantGo?


PocketPC with AvantGo, I remember that era.

In fact I invested in AvantGo right before the crash. I like to think it was the $500 or so I invested that truly precipitated the original year-2000 dot-com bubble crash.


I fondly remember AvantGo! Surprisingly many sites supported it, it was really handy.


that's awesome! I remember totally wanting a Handspring Visor (and soo badly wanting to play UO)! Jealous, hahah :)


Had a similar thing on my iRiver H340, running Rockbox, device had no network at all. Those were the days.


FYI, the Internet Archive hosts a ZIM archive that has dumps of Wikipedia and many other works: https://archive.org/details/zimarchive

I wish it was a little more obvious how to search it, or what all the variations mean, but it looks like a valuable resource.

It is worth noting that Kiwix works on multiple OSes and on phones, and has a wifi hotspot version (that you might run on a Raspberry Pi, for example). Internet-in-a-Box similarly works as a wifi hotspot for ZIM archives.

Lastly, it is worth mentioning that there are tools for creating your own ZIM files; it looks like the most straightforward way is to take a static website and use a utility to convert it into one self-contained file.
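For example, the openZIM project's zimwriterfs tool does roughly that. A minimal sketch, wrapping the CLI from Python; the flag names here are from memory, so treat them as assumptions and check zimwriterfs --help before relying on them:

    import subprocess

    # Turn a directory of static HTML into one self-contained .zim file.
    # Flags below are from memory and may differ between zimwriterfs versions.
    subprocess.run([
        "zimwriterfs",
        "--welcome=index.html",       # entry page inside the HTML directory
        "--favicon=favicon.png",
        "--language=eng",
        "--title=My offline site",
        "--description=Static site packed into a ZIM",
        "--creator=me",
        "--publisher=me",
        "./my-static-site",           # input: plain HTML/CSS/images
        "my-static-site.zim",         # output: a single self-contained file
    ], check=True)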


Thanks for sharing. Can you explain a bit more about creating our own ZIM files, or about the archived websites on the Internet Archive?

I'm looking for a way to archive all the websites from my browser bookmarks and then download them for offline use.


Not related to the OP topic or zim but I was looking into archiving my bookmarks and other content like documentation sites and wikis. I'll list some of the things I ended up using.

ArchiveBox[1]: Pretty much a self-hosted wayback machine. It can save websites as plain html, screenshot, text, and some other formats. I have my bookmarks archived in it and have a bookmarklet to easily add new websites to it. If you use the docker-compose you can enable a full-text search backend for an easy search setup.

WebRecorder[2]: A browser extension that creates WACZ archives directly in the browser capturing exactly what content you load. I use it on sites with annoying dynamic content that sites like wayback and ArchiveBox wouldn't be able to copy.

ReplayWeb[3]: An interface to browse archive types like WARC, WACZ, and HAR. The interface is just like browsing through your browser. It can be self-hosted as well for the full offline experience.

browsertrix-crawler[4]: A CLI tool to scrape websites and output to WACZ. It's super easy to run with Docker, and I use it to scrape entire blogs and docs for offline use. It uses Chrome to load webpages and has some extra features like custom browser profiles, interactive login, and autoscroll/autoplay. I use the `--generateWACZ` parameter so I can use ReplayWeb to easily browse through the final output (a rough invocation sketch is below, after the links).

For bookmark and misc webpage archiving, ArchiveBox should be more than enough. Check out this repo for an amazing list of tools and resources: https://github.com/iipc/awesome-web-archiving

[1] https://github.com/ArchiveBox/ArchiveBox [2] https://webrecorder.net [3] https://replayweb.page [4] https://github.com/webrecorder/browsertrix-crawler
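For what it's worth, here is a rough sketch of the browsertrix-crawler invocation mentioned above, wrapped in Python just to keep one language in the thread; the image name and flags are as I remember them, so verify against the project's README:

    import os
    import subprocess

    # Crawl a site with browsertrix-crawler in Docker and emit a WACZ
    # that ReplayWeb can open. Flags are from memory; check the README.
    subprocess.run([
        "docker", "run",
        "-v", f"{os.getcwd()}/crawls:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", "https://example.com/docs/",
        "--generateWACZ",             # package the crawl as a .wacz
        "--collection", "example-docs",
    ], check=True)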


Excellent! Thank you for the detailed answer.

I'm going to explore all the solutions and start building my setup soon.


Kiwix is great - I have a collection of various things from their library https://library.kiwix.org/?lang=eng downloaded for when I'm on a plane or the internet is otherwise unavailable.

That and the TeXlive PDF manuals can get me through anything.


I second Kiwix. I found out about it not too long ago on the topic of portable Wikipedia readers. It really stands out as the best software part of such a solution.


I used Kiwix Wikipedia for a Polish friend in the UK who couldn’t afford reliable internet access and was using public library computers. I found the English edition with images was too large for him, but the Polish edition was fine. Ideally I’d have liked a simple update system (Git-like?) which he could have run at the library occasionally.


I third Kiwix. Immensely useful when I was deployed without internet.


That sounds interesting, what was the context?


Hi, sorry for the delayed reply. Yes, I was military, and in the early days of Iraq/Afghanistan we didn't have much access to the internet, so I brought it with me.


Library or school in a remote village. There are computers (usually old computers), there might even be a LAN of some sort, but no internet (or very slow internet).

In those cases having local access to Wikipedia (and not necessarily just en; Kiwix has archives for all the languages) can be a great learning resource and reference.


Do you know 23B1? If not, you probably posted a reply to the wrong comment.


My guess is military.


Yep, you can download StackOverflow for offline use too


Does it actually work? I installed the app and tried to download wikipedia two or three times, each time it just failed. Eventually I gave up.


Personally I downloaded the larger files (>2GB) from a torrent file using my torrent manager. Much more reliable than over HTTP. You have checksums, it's resumeable, etc.
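And if you do grab the file over plain HTTP, it's easy to check it afterwards. A minimal sketch, assuming you have the published SHA-256 for the dump (IIRC download.kiwix.org publishes checksum files next to the dumps; the filename below is just an example):

    import hashlib

    def sha256sum(path, chunk=1 << 20):
        """Stream the file so a ~95 GB dump doesn't need to fit in RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    expected = "..."  # paste the published checksum here
    actual = sha256sum("wikipedia_en_all_maxi.zim")  # example filename
    print("OK" if actual == expected else f"MISMATCH: {actual}")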


Given the sheer size of Wikipedia dumps - it's around 90 GB with images! - I would recommend downloading them outside of the app.

https://download.kiwix.org/zim/


Yes it does.

I've downloaded the entirety of wikivoyage for example.


I downloaded the files directly from the library if I recall correctly.


I wish one could create new articles in Kiwix’s zim files. Right now, Kiwix is basically a Wikipedia reader. Editing features would be very nice for local wikis to develop, and later on — maybe — to have such local article editions merged into the main Wikipedia, perhaps similar to how git works.


The .zim file format is heavily optimized for compactness and ease of serving. For starters, it doesn't even store the original MediaWiki markup, but rather pre-rendered basic HTML. Images only have the thumbnail version (the one that's shown inline when reading the article); there's no full-size version to zoom in on. And, of course, no edit history. Multiple articles then get bundled into clusters of ~1 MB each, and each cluster is compressed using ZSTD.

https://wiki.openzim.org/wiki/ZIM_file_format

This all lets you squeeze English Wikipedia into 90 GB. But it also makes it much more difficult to edit in-place, and, of course, no MediaWiki means that it cannot possibly work like git pull requests.
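A toy illustration of why the clustering matters (plain Python stdlib and LZMA rather than the real ZIM layout, purely to show the effect): compressing many small, similar documents together lets the compressor share context across them, which compressing each article on its own cannot do.

    import lzma

    # Fake "articles": lots of small, similar HTML documents.
    articles = [
        f"<html><body><h1>Article {i}</h1><p>Lorem ipsum dolor sit amet, "
        f"consectetur adipiscing elit. Entry number {i}.</p></body></html>"
        for i in range(1000)
    ]

    one_by_one = sum(len(lzma.compress(a.encode())) for a in articles)
    clustered = len(lzma.compress("".join(articles).encode()))

    print("compressed one by one:", one_by_one, "bytes")
    print("compressed as a single cluster:", clustered, "bytes")  # much smaller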


I totally understand the reason why it is made for read-only consumption. However, we live in a moment where storage is significantly cheaper, and so is processing. There could have been a compromise, though I do not see any indication of one. SQLite could very well be used here.


Some time ago I dreamt that I was in an alien spaceship for some reason. Still carrying my phone and laptop bag. They were a friendly lot and asked whether or not I would like to charge my laptop. Do you have 220V sockets, I asked. They didn't know what that was. So I needed measurements and definitions. An approximate meter, an approximate second. Coulomb was difficult. I woke up and downloaded Wikipedia the next day. Deleted it again later for lack of hard disk space...

But next time this happens I will have a USB stick with all the necessary knowledge. The definitions for voltage, current, and frequency should, however, be printed out in case my laptop battery charge is insufficient for accessing the USB stick.


Usually alien abduction thoughts revolve around an intrusive test to see if you have a chess cheat device inserted, or should that be upserted?


Somewhere around the original ipad era, I believe there was a curated subset of wikipedia articles that may have been called something like Educator’s Edition.

It worked offline and had images and I traveled to Peru with it and learned so much. Does anyone remember this sort of thing?

I’ve tried Kiwix-formatted copies and they do work, but the experience on an offline iPad was simply better. Thanks in advance.


Yes, I remember - I had a copy on an SD card on my OLPC.

I believe it morphed into "Wikipedia for Schools" ^0 - possibly this ^1 is a comment about it?

0: https://en.m.wikipedia.org/wiki/Wikipedia:Wikipedia_for_Scho...

1: https://www.speedofcreativity.org/2008/11/11/wikipedia-to-go...


Tangent - I’ve noticed a lot more comments like this using the “^0” syntax for citations vs the traditional “[0]” one I’ve become accustomed to seeing on HN. Is there a real shift happening here and, if so, why?


Sorry about that, thanks for pointing it out. I'll learn.


It's a bit non-standard, and if it's trying to follow the wikipedia citation style then it's the wrong way round.



Checking to see if supported on HN [^1]

Edit: nope :)

[^1]: https://github.blog/changelog/2021-09-30-footnotes-now-suppo...


others use HN "viewers"

all the links always appear plaintext for me


Thank you very much. That page brings me back; it even has Technorati tags.

By the way, do you still have an OLPC? I never got to use one but remember seeing them. My one weird piece of similar-era tech is a CR-48, the early Chromebook Google gave away. I remember the form for requesting them asked what you would do with it. I responded “install Linux on it” and they gave me one.


Yes, I still have mine from the Give-One-Get-One program. It's still my favorite screen for sunny day use. It still works, I've been using a power supply from an X30 ThinkPad as I have no idea where the original went.

My neighbor years ago used to always chuckle at me using it with a Happy Hacking Pro keyboard because of the price difference between the two.


I said “develop/add olpc support to various bootloaders to help spur development, adoption, and utility” and they didn’t give me one.


Kiwix is an amazing project.

I used a similar approach for https://wikiscroll.blankenship.io

1. kiwix dump

2. unpack to HTML

3. process with cheerio to create json files

4. Create git repo and push to github pages

Works well for infinitely scrolling content, it's just Math.random on top of static files.

https://github.com/retrohacker/wikiscroll
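The project itself is Node + cheerio, but the core idea is small enough to sketch in a few lines of Python (hypothetical file layout, just to show the "Math.random on top of static files" part):

    import json
    import pathlib
    import random

    # Pre-processed articles, one small JSON file each (hypothetical layout).
    articles = list(pathlib.Path("articles").glob("*.json"))

    def random_article():
        picked = random.choice(articles)        # the "Math.random" part
        return json.loads(picked.read_text())   # e.g. {"title": ..., "html": ...}

    print(random_article()["title"])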


What a cool project, thanks! Sounds like something I'd love to waste time on lol.


This is really cool. Thanks for posting that


Oh wow, I thought this was gonna be a REALLY large file, but it's only 95 GB, not bad; some worthless videogames are larger, haha.


Circa 2003 I carried around a pared down copy on a Pocket PC. Dropping a few chosen categories (who needs Sports?) allowed it to barely fit on a 1-GB SD card.


People going back in time need sports. An almanac of some kind.


While handy, it would be a bit too conspicuous. At least one could claim that an almanac is a novelty print.


I was curious how they achieve this. It looks like the underlying file format uses LZMA, or optionally Zstd, compression. Both achieve pretty high compression ratios against plain text and markup.

> Its file compression uses LZMA2, as implemented by the xz-utils library, and, more recently, Zstandard. The openZIM project is sponsored by Wikimedia CH, and supported by the Wikimedia Foundation.

https://en.wikipedia.org/wiki/ZIM_(file_format)


The more important thing is that they aggressively downsize the images and omit the history and talk pages. Even if they were using LZW it would probably only triple the filesize.


BTW: what's the difference between the 95.2 GB file and the 45 GB one? There is no info on the download page.


95.2 is the "maxi" file. 49.48 is the "nopic" file. 13.39 is the "mini".

From https://www.kiwix.org/en/documentation/

File size is always an issue when downloading such big content, so we always produce each Wikipedia file in three flavours:

Mini: only the introduction of each article, plus the infobox. Saves about 95% of space vs. the full version.

nopic: full articles, but no images. About 75% smaller than the full version.

Maxi: the default full version.


I remember the era of stupidly large games.


you mean today!


How can someone use so many words to say "use Kiwix"?


I recall doing such an offline dump with Wikitaxi (https://www.yunqa.de/delphi/apps/wikitaxi/index) back when WP was getting banned in China a decade or so ago.

IIRC the articles were rather easy to download and convert even on my early-2000s netbook. The media (pictures, video, audio), though, was painful to deal with, and it didn't take long to find out that Wikipedia without diagrams and figures was not a great experience.


Kiwix's maxi-all Wikipedia zimfiles have pretty much all the pictures that are used in articles, but not the video and audio. And the pictures are too small; often you can't read the text in them.


So can it remove things like movies and tv shows and other noise?

I remember there was some work done to categorize articles like with the Dewey system, but so far, you can't really reduce the size of those exports.

Of course it would require a lot of work. Maybe it's already possible to categorize articles if they belong to a "portal".

But yeah, it doesn't seem the Wikipedia foundation really cares about those kinds of problems. To be fair, they lack money.


Uno card: Reverse!

Is TV Tropes available as a single file ZIM download?


The article says to format to exFAT because NTFS has a 4 GB limit; I don't think that is true.


It's not -- FAT32 is the one with the 4GB limit. NTFS has much less native support on Macs than exFAT, though.


FAT32 can even be used with larger sizes if you just format with a larger cluster size. Since each bundle/shard is 1MB minimum that is not a problem here.


File size is still limited to 0xffffffff (the dir entry only has 32 bits to store it). Some broken implementations even treat it as signed, and files over 2GB become problematic
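Concretely:

    # The 32-bit size field in a FAT directory entry, and what a buggy
    # signed interpretation of it allows.
    unsigned_max = 0xFFFFFFFF
    signed_max = 0x7FFFFFFF
    print(unsigned_max / 2**30)   # ~4.0 GiB (minus one byte)
    print(signed_max / 2**30)     # ~2.0 GiB (minus one byte)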


Ah sorry, you're right. That allows FAT32 to be used for larger partition sizes, but the file size limit remains in place.


protip: you need to download wikipedia in other languages as well

they are not translations, they are completely different articles under the name brand and platform of Wikipedia

an entry that may be just a blurb in English may be one of the most comprehensive and fully fleshed out and researched entries on the site in German, for example


Can anyone recommend a hardy device for viewing the content? As nutty as it sounds, in some post-apocalyptic world it would sure be nice to have. I'd keep it under the bed just in case..


If you follow the logic that anything is at about half its life, that would probably be an older ThinkPad laptop, like an X61 or X200. If you are willing to spend the money on something newer, perhaps a Toughbook. I have a modded Kobo ebook reader (I upgraded mine to 256 GB storage and have Project Gutenberg, Wikipedia, and a few other things on it) with a good solar power bank.


> If you follow the logic that anything is at about half its life

I don't think that makes any sense. By that logic any currently working device should be assumed to last another $currentlifetime. My 20 year old car is not gonna last another 20 years. My 10 year old laptop won't last another 10. If my car somehow did last another 20 years, it would not then make sense to assume it would still be running in another 40.

Makes more sense to look at all objects of the same class. If 75% of laptops are dead in 10 years and 95% are dead in 15, and your laptop is 10 years old, you can infer that 5 out of 25 surviving laptops will make it another 5 years, or 20%. (These numbers are completely made up, just an example.)


I think the idea of "everything is about half its life" is to account for survivorship bias in longevity. The only units that make it to the 95th percentile lifetimes clearly got luckier with parts and can reasonably be expected to last longer.


Reliability of most complicated devices (cars, electronics) is usually thought to follow a “bathtub curve.” Some early mortality due to defective parts or manufacturing defects, a long trough of reliability from say, 1-10 years, then a rapid rise in failures due to aging. “Everything at half life” is a pretty bad approximation of this.


Not just electronics; go check the print quality of some of your paper receipts from three years ago and see if you can make heads or tails of where you purchased the item. Ever see photos from photo albums long ago?


Right, correcting for survivorship bias is very important. If an object lasts one year, its expected life isn't now $average_use_life - 1; that's too low an estimate.

The problem with the "half life" rule is that it corrects for this in the dumbest possible way, not only providing an inaccurate estimate for most of the object's life, but even getting the first derivative wrong for most objects. Usually, lasting longer does not make the expected remaining years of service go up, but the rule implies it does!

Take people for example. At birth, a woman in the United States has a life expectancy of 81. If she makes it to 60, she can now expect to make it to ... 85. Not a big change! Every year she lived (even her first), her remaining life expectancy went down, not up. See this chart I made comparing the life expectancy of people versus a theoretical "half-lifer": https://0x0.st/otZ_.png


What kind of mods did you make, aside from either inserting a 256 GB card, or swapping out the built-in storage? Which model?


Swapping the built in storage, changing the init scripts to start the kiwix webserver, and installing some homebrew apps. It's a Kobo Clara HD.


It doesn't sound all that nutty given the world politics today. And pretty much any ruggedized Android device will do, so long as it has enough storage - best get something with an SD slot.

You might want a device like that to have offline maps as well, especially as those are more likely to be immediately useful. The easiest way to get there is the OsmAnd app - like Kiwix, it does a number of tricks to compress things, so it's quite feasible to have a complete offline road and topographic map of US in your pocket.

(Note that Google Play Store availability on the device is immaterial, since Kiwix and OsmAnd are also available as downloadable .apks, and are also listed in the F-Droid store.)


Honestly a generic PC would probably be best, because it may be a bit harder to find power, etc, but you will have infinite amounts of replacement parts.


Have you looked at e-Ink readers?


There used to be one, maybe you can find one somewhere.

https://en.wikipedia.org/wiki/WikiReader



Print it out on paper, small but legible font.


Someone did actually print out and bind Wikipedia in 2015:

https://en.wikipedia.org/wiki/Print_Wikipedia


*a small portion


Is there a portable version of Kiwix? Would be cool if you could plug the USB into any computer and start reading Wikipedia without having to install anything.


Yes. You download a zip archive. Unpack from 121MB to 263MB, and start the exe. (assuming you're using Windows)


I recently discovered wiktextract (https://github.com/tatuylonen/wiktextract) for Wiktionary, and kaikki (https://kaikki.org/) has the extracts available as links, but it's only the English Wiktionary for now.


>"After reading this article, you’ll be able to save all ~6 million pages of Wikipedia so you can access the sum of human knowledge regardless of internet connection!"

[...]

>"The current Wikipedia file dump in English is around 95 GB in size. This means you’ll need something like a 128 GB flash drive to accommodate the large file size."

Great article!

Also, on a related note, there's an interesting philosophical question related to this:

Given the task of preserving the most important human knowledge from the Internet and given a certain limited amount of computer storage -- what specific content (which could include text, pictures, web pages, PDFs, videos, technical drawings, etc.) from what sources do you select, and why?

?

So first with 100GB (All of Wikipedia is a great choice, btw!) -- but then with only 10GB, then 1GB, then 100MB, then 10MB, then 1MB, etc. -- all the way down to 64K! (about what an early microcomputer could hold on a floppy disk...)

What information do you select for each storage amount, and why?

?

(Perhaps I should make this a future interview question at my future company!)

Anyway, great article!


Wow, this is so cool! 95 GB and I can browse the entire Wikipedia offline!? Thanks so much!

https://library.kiwix.org/?lang=eng

I was looking at what other sites are available, and seems there are quite a few. Are there any specific ones apart from Wikipedia that HN readers would recommend?


If you end up programming offline, I remember they had a dump of stackoverflow


There is a ZIM file that contains all of stack overflow. Super useful if you have to program without access to the internet.


But but if all of Wikipedia fits on a USB drive, what do they need the millions and millions of Dollars for? /s


How does this scale with the need to update data over time, corrections, etc.? Having to download everything again doesn't seem that elegant. I think this would benefit a lot from some form of incremental backup support, that is, downloading only what was changed since last time. A possible implementation of that could be a BitTorrent-distributed git-like mirror, so that everyone could maintain their own local synced copy and be able to create a snapshot on removable media on the fly.


Given that the ZIM format is highly compressed, I'd assume that any "diff" approach would be computationally quite intensive [1] – on both sides, unless you require clients to apply all patches, which would allow hosting static patch files on the server side.

Bandwidth is getting cheaper and cheaper, and arguably if you can afford to get that initial 100 GB Wikipedia dump, you can afford downloading it more than once (and vice versa, if you can download multi-gigabyte differential updates periodically, you can afford the occasional full re-download).

One application where I could see it making sense is a related project [2] which streams the Wikipedia over satellite: Initial downloads at this point probably take several days of uninterrupted reception.

[1] Google once implemented a custom binary diff optimized for Chrome updates, but I'm not sure if it still exists. [2] https://en.wikipedia.org/wiki/Othernet


Do you think Apple would approve an app that just offlines Wikipedia?


Yes, being a "content library app" (dictionary, for example) seems perfectly fine. You just need to be more than a frame for a website... but accessing device-local reference material is fine.


I have Minipedia installed on my iPhone and it does just that.


I think there are better ways to open ZIM files. I've had massive trouble with Kiwix. The old version seems broken beyond repair and the new version is too heavy.

ZIMply on branch `version2` has worked pretty well for me [1]. The search works a lot better and it's really nicely formatted.

[1] https://github.com/kimbauters/ZIMply/tree/version2


There is also CDPedia, a project from Python Argentina originally intended for making Wikipedia available in rural schools without Internet connection. https://github.com/PyAr/CDPedia http://cdpedia.python.org.ar/index.en.es.html


Apropos of nothing, I stumbled upon Encyclopaedia Britannica the other day; anyone know what's up with that and whether there are any pros to it vs Wikipedia?


I used Britannica while in prison due to the obvious "No Internet". It works well enough: the articles are OK and from authoritative authors, unlike many Wikipedia pages, but I found them a bit lacking in full detail; the main problem is that the range of topics is much, much smaller, to the point where it was far less useful for detailed research. For prison use, as a basic reference, it was probably perfectly OK, but for more demanding research it's not adequate.


The Library page has three identical-looking entries (100 GB, 50 GB, and 15 GB) without any explanation of what is or isn't included in each.


Can anyone explain to me how the kiwix library site works? There’s 3 Wikipedia listings that all have the same name, description, language, and author, but seem to have different content. This pattern repeats for the “Wikipedia 0.8” and “Wikipedia 100” sets. One of the latter says that the top 100 pages on Wikipedia require 889 MB? What’s going on here?


Note that it's possible to make wikipedia substantially smaller if you're happy to use more aggressive compression algorithms.

Kiwix divides the data into chunks and adds various indexes and stuff to allow searching data and fast access, even on slow CPU devices. But if you can live with slow loading, you can probably halve the storage space required, or maybe more.


What compression algorithms would help? It's already using lzma for the text (in the form of .xz).


The hutter prize is a competition for compressing Wikipedia:

http://prize.hutter1.net/

So the best algorithm to use from there is starlit, with a compression factor of 8.67, compared to lzma in 2MB chunks which can only achieve about 4:1 compression.


Oh, and if you are happy to wait days or weeks for your compressed data, Fabrice Bellard's nncp manages even higher ratios (but isn't eligible for the prize because it's too slow).


Submissions for the Hutter prize also include the size of the compressor in the "total size". So I assume that's hard to beat if you use huge neural networks on the compression side, even if decompression is fast enough.


nncp uses neural networks, but 'trains itself' as it goes, so there is no big binary blob involved in the compressor.

The only reason it isn't eligible are compute constraints (and I don't think the hutter prize allows a GPU, which nncp needs for any reasonable performance).


Ah, OK, fair enough.


They embed full-text indices into the .zim file these days, but they used to be separate originally. IIRC at that time the index for the English wiki took up around 12 GB, with the actual data in the ballpark of 65 GB.


And if you're only interested in preserving just some Wiki pages, this browser extension with some automation on top will do the perfect job: https://github.com/gildas-lormeau/SingleFile No affiliation, just a happy user :)


I wonder if there is an offline backup of Wikipedia on ISS? There should be. And on every manned space mission.


Already been done [0]. Unfortunately, on the first attempt the probe crashed, but given the physical durability of the media, it is expected to still be readable.

[0]: https://www.archmission.org/spaceil


Why should there be?


The next Apollo 13 will probably be a software problem; it doesn't hurt if they can read up about it.


You're proposing that if something goes wrong on the ISS, the crew will need wikipedia to solve it? Not... talking to Houston or just taking the Soyuz back?


I wrote a crap SDR basically just from wikipedia, maybe their radio broke.

Presumably you have seen a science fiction film before, use your imagination.


> The next Apollo 13 will probably be a software problem , doesn't hurt if they can read up about it

What good would an "offline backup of Wikipedia" do in that situation?

Wikipedia is good for one thing, and one thing only: getting some cursory knowledge on a topic you're unfamiliar with. It's the tourist map to the "sum of all human knowledge." If you expect to use it for anything else, you're asking too much of it.


I have found a lot of the math articles to be quite good.


So, stackoverflow, not wikipedia, then?


So its contents aren't lost if Earth's surface gets depopulated.


Putting it on ISS wouldn't help with that, although I'm sure this comes as no surprise to you, given that its orbit is a decaying one.

I like the idea of periodic Wikipedia moonshots, although the storage format is kind of an open question; I've wondered for a while if a DVD made from e.g. quartz, platinum, and titanium might be up to the job.

A full backup would fit on 12 double-layer, single-sided disks; I'm being conservative and not using Blu-Ray numbers, since density and longevity are always somewhat in tension. Probably more expensive to put them safely on the moon than to manufacture in the first place.


Agreed. I think even bare nickel or iron would probably be fine. Holographic glass laser damage can in theory handle higher recording densities and, like your DVD, isn't vulnerable to surface damage.

In space you probably don't have to worry as much about minor surface scratches and oxidation, though. You just have to worry about outgassing and meteoroid impacts. Some of them you can stop, and some you can't. On the bright side, they're very rare.

I think common media formats like DVDs are designed with a lot of emphasis on speed, both of reading and of duplication. This compromises both density and longevity. If you, instead, allow yourself the luxury of FIB milling to write and an electron microscope to read, you can manufacture in any vacuum-stable material at all, and you can engrave your archival message with, say, 50-nanometer resolution. At one bit per 50 nanometers square, you get five gigabytes per square centimeter.

I think that with e-beam photoresist cross-linking followed by etching you get about 500 kilobits per second, and I think FIB milling is a little slower, so it might take a few weeks to make the copy — obviously unacceptable for a consumer CD burner but fine for periodic Wikipedia moonshots.
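A quick back-of-the-envelope check using the figures above (the Wikipedia size comes from the article, the density and write speed from this comment):

    # Rough arithmetic only; all inputs are the estimates quoted above.
    dump_bytes = 95e9            # ~95 GB English Wikipedia dump with images
    density = 5e9                # ~5 GB per square centimetre at 50 nm per bit
    write_rate = 500e3 / 8       # ~500 kilobits/s e-beam writing, in bytes/s

    print(dump_bytes / density, "cm^2 of engraved area")      # ~19 cm^2
    print(dump_bytes / write_rate / 86400, "days to write")   # ~17.6 days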


Relevant material about CD-R and DVD-R archival: https://news.ycombinator.com/item?id=33117813


And what's the point of it in space? Knowledge doesn't disappear when it's not on wikipedia. If humans are still around they will continue contributing to knowledge. Just because it's not printed or recorded doesn't mean that information or knowledge doesn't exist.


> if Earth's surface gets depopulated

There are 14 people on the ISS. If they were the only ones left, they would certainly not have the breadth of knowledge of a Wikipedia dump.


And how would these 14 survive if they are the only ones left? Do you know that there's a whole support team to support them in space? It's not just 14 people. There are hundreds on the ground supporting them.


Yeah, ISS itself is not that realistic.


Why not just every space mission, period?


How much would the science capabilities of a telescope like JWST be reduced if 1/3 of its SSD was repurposed for storing the latest wikipedia dump (that 1/3 number is assuming it's only English, compressed, and without images)? To me that seems like an easy cost/benefit analysis.


How much would the science capabilities of a telescope like JWST be reduced if we left its SSD alone and just taped a USB drive to the side of it somewhere that contained wikipedia?


Would duct tape pass the pre-launch vibe check? You'd have to do some engineering work to make sure it's sturdy, doesn't have any impact wrt oscillations, won't create FOD (debris) etc.

Once you've done all that work, I'm not sure what you've actually accomplished. By the time any sentient being gets around to visiting JWST, I wouldn't be surprised if an unshielded commercial drive would be rendered totally unusable by radiation.


Well the robots don't read too well..


I wonder if they snapshot Wikipedia for this, or if they stagger it per article to avoid very recent unreviewed edits getting into such a download (edits that would, say, disappear off the site if they were bad edits or vandalism).


They have snapshots; there are also official Wikipedia torrent links for the dumps.


Do not store 96 GB of anything on exFAT; use ext4 or APFS or ZFS or some journaled file system. Does NTFS really have a 4 GB file size limit? Its structures should match exFAT's, so that part seems suspect to me.


>> Does NTFS really have a 4GB file size limit?

No, but FAT32 does. exFAT, on the other hand, has a file size limit of 16 exbibytes. That, combined with exFAT's cross-platform mounting (NTFS has a lot of limitations in this regard), makes it a superior format for flash-based offline file transfer.

On a network? Use zfs+ or something.


This is the kind of thing that you download once and then never write anything on that media until you decide to refresh the content. In fact, you might as well mount it read-only. A journaling FS wouldn't do anything useful here.


AFAIK NTFS max file size is 256 TB.


In the old days :tm: I remember doing this as well with a 1 GB drive (and room to spare for some mobile apps).

Would be interesting to see a graph of easily available USB drive sizes vs. Wikipedia dump size over time.


That would be very interesting. Thanks, I now have another entry on my To Do list.


Love it! Imagine if USB Flash drive manufacturers just loaded up new drives with content like this. I mean, why not right? I think the physics means it would even be lighter ;)


I for one would not be happy.

When I buy a storage device I usually have an intended purpose for that storage and would not like to have to delete all of the files that some manufacturer thought would be useful information but I would have to delete to make room for what I want.


Sorry I wasn't clear: I 100% agree with you; I definitely would generally want a blank one, but it would be fun to have some options.


Especially if you didn't know which one you were going to get. Plug it in for a big surprise! (From a verifiable manufacturer who has their customers' happiness and enjoyment at heart)


Now I'm curious: if, hypothetically, wikipedia was just backed by a single git repo and every edit was a commit, how big would it be and how long would it take to clone?


Can someone explain what the role of kiwix in all this, please?


It provides access to the content of the zimfile and an interface for downloading zimfiles.


Thanks. I had not understood that the download is the actual "raw" wiki files, not the pages as they would be delivered to a browser.


It's actually a compressed archive, but I think the contents are in fact HTML and other browser-accessible media types.


But why? If civilization collapses I'm not going to think "oh, let me consult Wikipedia" I'm going to think "man, this sucks".


This has to be one of the most poorly structured pieces of writing I've seen in a while. It's way too verbose, and on the one hand there are separate sections like:

* Getting a flash drive

* Formatting a flash drive (which includes a subsection on not formatting it but buying one that's already formatted instead, while there was a separate section just before this one on buying a drive)

* Waiting for a file to download

At the same time, downloading Wikipedia and downloading Kiwix are both covered in the same section. Then, installing Kiwix is in a section called "You're done", which isn't next to the section on downloading Kiwix.


I want to like Kiwix -- I downloaded Wikipedia AND StackOverflow -- but it keeps crashing every time I try to search for anything on this M1 macbook.


Is there a way to keep a mirror that stays in sync?


It looks like Kiwix uses the ZIM file format, which appears to have diffing support [0] (see zimdiff and zimpatch). That said, it doesn't look like Kiwix actually publishes those diffs.

[0] https://github.com/openzim/zim-tools/tree/master/src


Does it include the images or it’s just the text?


Yes, with images, but only English.

All possible dumps: https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/


This would be useful to drop into North Korea?


That's exactly what I was thinking of as well. I remember listening to an episode on Darknet Diaries with this theme where dvds and usb drives are a common way to smuggle things into North Korea.


I think I'd rather have stack overflow offline, before I'd want wikipedia offline, though.


Anyone doing a ZIM of news.ycombinator.com? A once-a-week package would be fine. How do you make one?


Could I use something like this to train my own GPT that's obsessed with Wikipedia? :)


Is there something like this that downloads the full edit history as well?


What about the images?


Would be cool if Kiwix came with an auto-update feature, but given the database size, I believe it's difficult to implement.


95 GB? I remember when it was like 2 GB haha


Is there something similar for Stack Overflow?


Kiwix can do that also. You need to specify the ZIM file and it works:

https://wiki.kiwix.org/wiki/Content_in_all_languages

How do I know that? I wanted to travel as a system administrator to some Antarctica base with a whole copy of Stack Overflow with me.


Don't forget the Arch and Gentoo wikis!


Nice, how was Antarctica?



Kiwix has Stack Overflow (and various StackExchange subsites), Project Gutenberg, TED talks, and plenty more. You can also request something.


Yes, there are Kiwix .zim files of Stack Overflow. I think Zeal also may have an SO docset.


Still using a WikiReader?


and now donate to Wikipedia, because you just caused them to pay for 95 GB of (useless) traffic


That really isn't how it works.


Cool. I don't have a USB Flash Drive though.


$15 can help




