Circa 2009 or so, my absolute favorite app for the iPod Touch was Patrick Collison's Offline Wikipedia (yes, that Patrick Collison: https://web.archive.org/web/20100419194443/http://collison.i...). You could download various wikis that had been pre-processed to fit in a very small space - as I recall, the entire English Wikipedia was a mere 2 GB in size. It was simply magical that I could have access to all of Wikipedia anytime, anywhere offline - especially since the iPod Touch could only connect to the Internet via WiFi. It was particularly useful while travelling, since I could load up articles and just read them on the plane.
As I recall, there were several clever things that the app did to reduce the size of the dump; many stub/redirect articles were removed, the formatting was pared down to the bare minimum, and it was all compressed quite efficiently to fit in such a small space. Patrick gives more technical detail on an earlier version of the app's homepage: https://web.archive.org/web/20080523222440/http://collison.i...
Yeah that was awesome, so useful/fun being able to go deep on a wikipedia reading session when there wasn't good mobile coverage and data was expensive.
In retrospect I do kinda miss _not_ having cell reception on vacations, as it was easier to disconnect from stuff.
> In retrospect I do kinda miss _not_ having cell reception on vacations, as it was easier to disconnect from stuff.
In retrospect I do miss _not_ having the internet or cell service at all. It's part of why I like to watch shows and movies from the 80s or 90s. It's funny because I love technology as much as anyone else on HN, but at the same time my idea of the perfect retired life is one that is almost entirely offline.
I don't know if it's still the case, but about ten years ago I took VIA Rail Canada from Vancouver to Toronto. It was a four-day train ride (with a brief stopover in Banff) and the vast majority of it was remote Canada crossing the Rockies, with no mobile network coverage.
I asked the train staff about WiFi onboard, and they said they didn't have it and that they preferred it that way. People take that train not to get from A to B, but to disconnect and meet other people, read books, or watch the beautiful landscape going by. If people had Internet access, they'd be glued to their devices and wouldn't meet fellow travelers, and that's the magic of this line.
(They even said that their corporate management wanted WiFi on the trains so that the staff could digitize a lot of their paperwork, but the crew was resisting it because they believed it was exactly the lack of connectivity that kept people taking the train for vacation.)
I was in Park City, Utah, not during ski season, just looking around, and we decided to stop and walk around. Parking meters theoretically took coins or cards, but all were nonfunctional. The only way to pay for parking was by installing their app. Promptly uninstalled it when I left, but still… with no smartphone, you literally could not legally park on the street.
I worked on a slide deck proposing SD cards and thumb drives preloaded with Kiwix and an almost-complete, mostly localized version of Khan Academy (KA Lite at the time), which could be mass-distributed in active conflict zones where schools and education were some of the first casualties. I don't think it ever went too far, which was a shame; compared to how other monies were spent at the time, these could have really made a difference.
Not directly related, but way back in the day I had a Handspring Visor and would use something (I can't remember now what it was called) to download websites to it when I synced, so they'd be available offline.
At the time I was super into Ultima Online fan stories like PK Ghost, and I'd download those and we'd all pass the Handspring around to read them.
Dang, this actually brought back some good memories to think about.
In fact, I invested in AvantGo right before the crash. I like to think it was the $500 or so I put in that truly precipitated the original year-2000 dot-com crash.
I wish it was a little more obvious how to search it, or what all the variations mean, but it looks like a valuable resource.
It is worth noting that Kiwix works on multiple OSes and on phones, and has a wifi hotspot version (that you might run on a Raspberry Pi, for example). Internet-in-a-Box similarly works as a wifi hotspot for ZIM archives.
Lastly, it is worth mentioning that there are tools for creating your own ZIM files; it looks like the most straightforward way is to take a static website and use a utility to convert it into one self-contained file.
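The openZIM project also ships Python bindings (python-libzim), so you can script this; here's a rough sketch of a tiny writer. I'm going from memory of the bindings' documented API, so treat the class and method names as assumptions and check the current docs before relying on it:

    # Sketch only: build a minimal ZIM file from a couple of HTML pages using
    # python-libzim. Class/method names are assumptions from the openZIM docs
    # and may differ between versions.
    from libzim.writer import Creator, Item, StringProvider, Hint

    class HtmlItem(Item):
        def __init__(self, path, title, html):
            super().__init__()
            self._path, self._title, self._html = path, title, html

        def get_path(self):            return self._path
        def get_title(self):           return self._title
        def get_mimetype(self):        return "text/html"
        def get_contentprovider(self): return StringProvider(self._html)
        def get_hints(self):           return {Hint.FRONT_ARTICLE: True}

    with Creator("mysite.zim").config_indexing(True, "eng") as creator:
        creator.set_mainpath("index")
        creator.add_item(HtmlItem("index", "Home", "<h1>Hello, offline world</h1>"))
        creator.add_item(HtmlItem("about", "About", "<p>A static site, ZIM-ified.</p>"))
        creator.add_metadata("Title", "My offline site")
        creator.add_metadata("Language", "eng")

For a whole directory of static HTML, the zimwriterfs tool from openZIM's zim-tools does roughly the same thing from the command line.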
Not related to the OP topic or zim but I was looking into archiving my bookmarks and other content like documentation sites and wikis. I'll list some of the things I ended up using.
ArchiveBox[1]: Pretty much a self-hosted wayback machine. It can save websites as plain html, screenshot, text, and some other formats. I have my bookmarks archived in it and have a bookmarklet to easily add new websites to it. If you use the docker-compose you can enable a full-text search backend for an easy search setup.
WebRecorder[2]: A browser extension that creates WACZ archives directly in the browser capturing exactly what content you load. I use it on sites with annoying dynamic content that sites like wayback and ArchiveBox wouldn't be able to copy.
ReplayWeb[3]: An interface to browse archive types like WARC, WACZ, and HAR. The interface is just like browsing through your browser. It can be self-hosted as well for the full offline experience.
browsertrix-crawler[4]: A CLI tool to scrape websites and output to WACZ. It's super easy to run with Docker, and I use it to scrape entire blogs and docs for offline use. It uses Chrome to load webpages and has some extra features like custom browser profiles, interactive login, and autoscroll/autoplay. I use the `--generateWACZ` parameter so I can use ReplayWeb to easily browse through the final output.
For bookmark and misc webpage archiving, ArchiveBox should be more than enough. Check out this repo for an amazing list of tools and resources: https://github.com/iipc/awesome-web-archiving
Kiwix is great - I have a collection of various things from their library https://library.kiwix.org/?lang=eng downloaded for when I'm on a plane or the internet is otherwise unavailable.
That and the TeXlive PDF manuals can get me through anything.
I second Kiwix. I found out about it not too long ago on the topic of portable Wikipedia readers. It really stands out as the best software part of such a solution.
I used Kiwix Wikipedia for a Polish friend in the UK who couldn't afford reliable internet access and was using public library computers. I found the English edition with images was too large for him, but the Polish edition was fine. Ideally I'd have liked a simple update system (Git-like?) which he could have run at the library occasionally.
Hi sorry for the delayed reply, yes I was military and in the early days of Iraq/Afghanistan we didn't have much access to the internet so I brought it with me.
Library or school in a remote village. There are computers (usually old computers), there might even be a LAN of some sort, but no internet (or very slow internet).
In those cases having local access to Wikipedia (and not necessarily just en; Kiwix has archives for all the languages) can be a great learning resource and reference.
Personally I downloaded the larger files (>2GB) from a torrent file using my torrent manager. Much more reliable than over HTTP: you have checksums, it's resumable, etc.
I wish one could create new articles in Kiwix’s zim files. Right now, Kiwix is basically a Wikipedia reader. Editing features would be very nice for local wikis to develop, and later on — maybe — to have such local article editions merged into the main Wikipedia, perhaps similar to how git works.
The .zim file format is heavily optimized for compactness and ease of serving. For starters, it doesn't even store the original MediaWiki markup, but rather pre-rendered basic HTML. Images only have the thumbnail version (the one that's shown inline when reading the article); there's no full-size to zoom in. And, of course, no edit history. Multiple articles then get bundled into clusters of ~1 MB each, and each cluster is compressed using Zstandard.
This all lets you squeeze English Wikipedia into 90 GB. But it also makes it much more difficult to edit in place, and, of course, no MediaWiki means that it cannot possibly work like git pull requests.
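You can get a feel for why the clustering matters with the zstandard Python package; this is just a toy approximation of the idea, not the actual ZIM pipeline:

    # Toy comparison: compressing many small articles individually vs. bundled
    # into one cluster (roughly what ZIM does, at ~1 MB per cluster).
    import zstandard as zstd

    articles = [
        (f"Article {i}: " + "Some markup-free encyclopedia prose. " * 300).encode()
        for i in range(100)
    ]

    cctx = zstd.ZstdCompressor(level=19)
    per_article = sum(len(cctx.compress(a)) for a in articles)
    clustered = len(cctx.compress(b"".join(articles)))

    print(f"one stream per article: {per_article} bytes")
    print(f"one ~1 MB cluster:      {clustered} bytes")

Redundancy shared across articles only gets exploited in the clustered case, which is part of how the full dump stays this small; the flip side is that any in-place edit means recompressing at least a whole cluster.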
I totally understand the reason why it is made for read-only consumption. However, we live in a moment where storage is significantly cheaper, and so is processing. There could be a compromise, though I do not see any indication of one. SQLite could very well be used here.
Some time ago I dreamt that I was on an alien spaceship for some reason, still carrying my phone and laptop bag. They were a friendly lot and asked whether or not I would like to charge my laptop. Do you have 220 V sockets, I asked. They didn't know what that was. So I needed measurements and definitions: an approximate meter, an approximate second. Coulomb was difficult. I woke up and downloaded Wikipedia the next day. Deleted it again later for lack of hard disk space...
But next time this happens I will have a USB stick with all the necessary knowledge. The definitions for voltage, current and frequency should, however, be printed out in case my laptop battery charge is insufficient for accessing the USB stick.
Somewhere around the original ipad era, I believe there was a curated subset of wikipedia articles that may have been called something like Educator’s Edition.
It worked offline and had images and I traveled to Peru with it and learned so much. Does anyone remember this sort of thing?
I've tried Kiwix-formatted copies and they do work, but the experience on an offline iPad was simply better. Thanks in advance.
Tangent - I’ve noticed a lot more comments like this using the “^0” syntax for citations vs the traditional “[0]” one I’ve become accustomed to seeing on HN. Is there a real shift happening here and, if so, why?
Thank you very much. That page brings me back. It even has Technorati tags.
By the way, do you still have an OLPC? I never got to use one but remember seeing them. My one weird piece of similar-era tech is a Cr-48, the early Chromebook Google gave away. I remember the request form asked what you would do with it; I responded “install Linux on it” and they gave me one.
Yes, I still have mine from the Give-One-Get-One program. It's still my favorite screen for sunny day use. It still works, I've been using a power supply from an X30 ThinkPad as I have no idea where the original went.
My neighbor years ago used to always chuckle at me using it with a Happy Hacking Pro keyboard because of the price difference between the two.
Circa 2003 I carried around a pared down copy on a Pocket PC. Dropping a few chosen categories (who needs Sports?) allowed it to barely fit on a 1-GB SD card.
I was curious how they achieve this. It looks like the underlying file format uses LZMA, or optionally Zstd, compression. Both achieve pretty high compression ratios against plain text and markup.
> Its file compression uses LZMA2, as implemented by the xz-utils library, and, more recently, Zstandard. The openZIM project is sponsored by Wikimedia CH, and supported by the Wikimedia Foundation.
The more important thing is that they aggressively downsize the images and omit the history and talk pages. Even if they were using LZW it would probably only triple the filesize.
File size is always an issue when downloading such big content, so we always produce each Wikipedia file in three flavours:
Mini: only the introduction of each article, plus the infobox. Saves about 95% of space vs. the full version.
nopic: full articles, but no images. About 75% smaller than the full version.
Maxi: the default full version.
IIRC the articles were rather easy to download and convert even on my early-2000s netbook. The media (pictures, video, audio) though was painful to deal with, and it didn't take long to find out that Wikipedia without diagrams and figures was not a great experience.
Kiwix's maxi-all Wikipedia zimfiles have pretty much all the pictures that are used in articles, but not the video and audio. And the pictures are too small; often you can't read the text in them.
FAT32 can even be used with larger sizes if you just format with a larger cluster size. Since each bundle/shard is 1MB minimum that is not a problem here.
File size is still limited to 0xffffffff bytes (the directory entry only has 32 bits to store it). Some broken implementations even treat it as signed, making files over 2 GB problematic.
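If you're scripting the copy onto a possibly-FAT32 stick, it's cheap to check for that limit up front; a small stdlib-only sketch (the filename and mount point are just placeholders):

    # Refuse to copy files that exceed FAT32's 2^32 - 1 byte limit instead of
    # failing partway through. Standard library only; paths are placeholders.
    import os
    import shutil
    import sys

    FAT32_MAX = 0xFFFFFFFF  # just under 4 GiB

    def copy_to_fat32(src, dst_dir):
        size = os.path.getsize(src)
        if size > FAT32_MAX:
            sys.exit(f"{src} is {size} bytes, too big for FAT32; "
                     "reformat the drive as exFAT/ext4 or split the file.")
        shutil.copy(src, dst_dir)

    copy_to_fat32("wikipedia_en_all_maxi.zim", "/media/usb")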
Protip: you need to download Wikipedia in other languages as well.
They are not translations; they are completely different articles under the name, brand and platform of Wikipedia.
An entry that may be just a blurb in English may be one of the most comprehensive, fully fleshed-out and researched entries on the site in German, for example.
Can anyone recommend a hardy device for viewing the content? As nutty as it sounds, in some post-apocalyptic world it would sure be nice to have. I'd keep it under the bed just in case..
If you follow the logic that anything is at about half its life, that would probably be an older ThinkPad laptop, like an X61 or X200. If you are willing to spend the money on something newer, perhaps a Toughbook. I have a modded Kobo ebook reader (I upgraded mine to 256GB storage and have Project Gutenberg, Wikipedia and a few other things on it) with a good solar powerbank.
> If you follow the logic that anything is at about half its life
I don't think that makes any sense. By that logic any currently working device should be assumed to last another $currentlifetime. My 20 year old car is not gonna last another 20 years. My 10 year old laptop won't last another 10. If my car somehow did last another 20 years, it would not then make sense to assume it would still be running in another 40.
Makes more sense to look at all objects of the same class. If 75% of laptops are dead in 10 years and 95% are dead in 15, and your laptop is 10 years old, you can infer that 5 out of 25 surviving laptops will make it another 5 years, or 20%. (These numbers completely made up, just an example.)
I think the idea of "everything is about half its life" is to account for survivorship bias in longevity. The only units that make it to the 95th percentile lifetimes clearly got luckier with parts and can reasonably be expected to last longer.
Reliability of most complicated devices (cars, electronics) is usually thought to follow a “bathtub curve.” Some early mortality due to defective parts or manufacturing defects, a long trough of reliability from say, 1-10 years, then a rapid rise in failures due to aging. “Everything at half life” is a pretty bad approximation of this.
Not just electronics: go look at the print quality of some of your paper receipts from three years ago and see if you can make heads or tails of where you purchased the item. Ever see photos from photo albums from long ago?
Right, correcting for survivorship bias is very important. If an object lasts one year, its expected life isn't now $average_use_life - 1; that's too low an estimate.
The problem with the "half life" rule is that it corrects for this in the dumbest possible way, not only providing an inaccurate estimate for most of the object's life, but even getting the first derivative wrong for most objects. Usually, lasting longer does not make the expected remaining years of service go up, but the rule implies it does!
Take people for example. At birth, a woman in the United States has a life expectancy of 81. If she makes it to 60, she can now expect to make it to ... 85. Not a big change! Every year she lived (even her first), her remaining life expectancy went down, not up. See this chart I made comparing the life expectancy of people versus a theoretical "half-lifer": https://0x0.st/otZ_.png
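The difference is easy to see with a few lines of arithmetic: take a (made-up) survival table, compute the expected remaining lifetime conditional on surviving to a given age, and compare it with the "remaining life = current age" rule:

    # Expected remaining lifetime from a made-up survival table vs. the
    # "half-life" rule (which predicts remaining life == current age).
    # deaths[t] = fraction of the original cohort that dies at age t.
    deaths = {5: 0.05, 10: 0.20, 15: 0.45, 20: 0.25, 25: 0.05}

    def expected_remaining(age):
        survivors = {t: p for t, p in deaths.items() if t > age}
        total = sum(survivors.values())
        return sum((t - age) * p for t, p in survivors.items()) / total

    for age in (0, 5, 10, 15, 20):
        print(f"age {age:2d}: table says ~{expected_remaining(age):4.1f} more years, "
              f"half-life rule says {age}")

With these numbers the table's estimate falls from about 15 years at birth to about 5 years at age 20, while the half-life rule climbs from 0 to 20, exactly the wrong direction for most manufactured objects.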
It doesn't sound all that nutty given the world politics today. And pretty much any ruggedized Android device will do, so long as it has enough storage - best get something with an SD slot.
You might want a device like that to have offline maps as well, especially as those are more likely to be immediately useful. The easiest way to get there is the OsmAnd app - like Kiwix, it does a number of tricks to compress things, so it's quite feasible to have a complete offline road and topographic map of the US in your pocket.
(Note that Google Play Store availability on the device is immaterial, since Kiwix and OsmAnd are also available as downloadable .apk files, and are also listed in the F-Droid store.)
Honestly a generic PC would probably be best. It may be a bit harder to find power, etc., but you will have infinite amounts of replacement parts.
Is there a portable version of Kiwix? Would be cool if you could plug the USB into any computer and start reading Wikipedia without having to install anything.
>"After reading this article, you’ll be able to save all ~6 million pages of Wikipedia so you can access the sum of human knowledge regardless of internet connection!"
[...]
>"The current Wikipedia file dump in English is around 95 GB in size. This means you’ll need something like a 128 GB flash drive to accommodate the large file size."
Great article!
Also, on a related note, there's an interesting philosophical question here:
Given the task of preserving the most important human knowledge from the Internet and given a certain limited amount of computer storage -- what specific content (which could include text, pictures, web pages, PDFs, videos, technical drawings, etc.) from what sources do you select, and why?
So first with 100GB (All of Wikipedia is a great choice, btw!) -- but then with only 10GB, then 1GB, then 100MB, then 10MB, then 1MB, etc. -- all the way down to 64K! (about what an early microcomputer could hold on a floppy disk...)
What information do you select for each storage amount, and why?
(Perhaps I should make this a future interview question at my future company!)
I was looking at what other sites are available, and seems there are quite a few. Are there any specific ones apart from Wikipedia that HN readers would recommend?
How does this scale with the need to update data over time, corrections, etc.? Having to download everything again doesn't seem that elegant. I think this would benefit a lot from some form of incremental backup support, that is, downloading only what has changed since last time. A possible implementation could be a BitTorrent-distributed, git-like mirror, so that everyone could maintain their own local synced copy and be able to create a snapshot of it on removable media on the fly.
Given that the ZIM format is highly compressed, I'd assume that any "diff" approach would be computationally quite intensive [1] – on both sides, unless you require clients to apply all patches, which would allow hosting static patch files on the server side.
Bandwidth is getting cheaper and cheaper, and arguably if you can afford to get that initial 100 GB Wikipedia dump, you can afford downloading it more than once (and vice versa, if you can download multi-gigabyte differential updates periodically, you can afford the occasional full re-download).
One application where I could see it making sense is a related project [2] which streams the Wikipedia over satellite: Initial downloads at this point probably take several days of uninterrupted reception.
[1] Google once implemented a custom binary diff optimized for Chrome updates, but I'm not sure if it still exists.
[2] https://en.wikipedia.org/wiki/Othernet
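For a back-of-the-envelope sense of how well (or badly) a naive fixed-block differential update would do between two dumps, you can hash 1 MiB blocks of the old and new files and count how many differ. This is a toy illustration, not how zimdiff works, and the filenames are placeholders:

    # Toy differential-update estimate: hash fixed-size blocks of two archive
    # versions and count changed blocks. Compressed archives tend to differ in
    # most blocks, which is why "just send a diff" rarely pays off here.
    import hashlib

    BLOCK = 1 << 20  # 1 MiB, roughly the size of a ZIM cluster

    def block_hashes(path):
        hashes = []
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                hashes.append(hashlib.sha256(chunk).digest())
        return hashes

    old = block_hashes("wikipedia_old.zim")
    new = block_hashes("wikipedia_new.zim")
    changed = sum(a != b for a, b in zip(old, new)) + abs(len(new) - len(old))
    print(f"{changed} of {len(new)} blocks changed "
          f"(~{changed * BLOCK / 2**30:.1f} GiB to re-download)")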
Yes, being a "content library app" (dictionary, for example) seems perfectly fine. You just need to be more than a frame for a website... but accessing device-local reference material is fine.
I think there are better ways to open ZIM files. I've had massive trouble with Kiwix: the old version seems broken beyond repair and the new version is too heavy.
ZIMply on branch `version2` has worked pretty well for me [1]. The search works a lot better and it's really nicely formatted.
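If you mostly want to pull articles out of a .zim and skip the GUI entirely, the python-libzim reader is another lightweight option. A sketch from memory of the openZIM bindings, so the exact method names and the entry path layout are assumptions worth double-checking:

    # Minimal ZIM reading with python-libzim: open an archive, fetch one entry,
    # run a full-text search. Filename and entry path are placeholders, and the
    # path layout varies between dumps.
    from libzim.reader import Archive
    from libzim.search import Query, Searcher

    zim = Archive("wikipedia_en_all_maxi.zim")
    print(zim.entry_count, "entries; main page:", zim.main_entry.title)

    entry = zim.get_entry_by_path("A/Albert_Einstein")
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])

    search = Searcher(zim).search(Query().set_query("general relativity"))
    for path in search.getResults(0, 10):
        print(path)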
I used Britannica while in prison, due to the obvious "no Internet". It works well enough: the articles are OK and from authoritative authors, unlike many Wikipedia pages, but I found them a bit lacking in full detail. The main problem is that the range of topics is much, much smaller, to the point where it was far less useful for detailed research. For prison use, as a basic reference, it was probably perfectly OK, but for more demanding research it's not adequate.
Can anyone explain to me how the Kiwix library site works? There are three Wikipedia listings that all have the same name, description, language, and author, but seem to have different content. This pattern repeats for the “Wikipedia 0.8” and “Wikipedia 100” sets. One of the latter says that the top 100 pages on Wikipedia require 889 MB? What’s going on here?
Note that it's possible to make wikipedia substantially smaller if you're happy to use more aggressive compression algorithms.
Kiwix divides the data into chunks and adds various indexes and stuff to allow searching data and fast access, even on slow CPU devices. But if you can live with slow loading, you can probably halve the storage space required, or maybe more.
So the best algorithm to use from there is starlit, with a compression factor of 8.67, compared to LZMA in 2 MB chunks, which can only achieve about 4:1 compression.
Oh, and if you are happy to wait days or weeks for your compressed data, Fabrice Bellard's nncp manages even higher ratios (but isn't eligible for the prize because it's too slow).
Submissions for the Hutter Prize also include the size of the compressor in the "total size". So I assume that's hard to beat if you use huge neural networks on the compression side, even if decompression is fast enough.
nncp uses neural networks, but 'trains itself' as it goes, so there is no big binary blob involved in the compressor.
The only reason it isn't eligible is compute constraints (and I don't think the Hutter Prize allows a GPU, which nncp needs for any reasonable performance).
They embed full-text indices into the .zim file these days, but they used to be separate originally. IIRC at that time the index for the English wiki took up around 12 GB, with the actual data in the ballpark of 65 GB.
And if you're only interested in preserving just some Wiki pages, this browser extension with some automation on top will do the perfect job: https://github.com/gildas-lormeau/SingleFile
No affiliation, just a happy user :)
Already been done [0]. Unfortunately, on the first attempt the probe crashed, but given the physical durability of the media, it is expected to still be readable.
You're proposing that if something goes wrong on the ISS, the crew will need wikipedia to solve it? Not... talking to Houston or just taking the Soyuz back?
> The next Apollo 13 will probably be a software problem
Doesn't hurt if they can read up about it.
What good would an "offline backup of Wikipedia" do in that situation?
Wikipedia is good for one thing, and one thing only: getting some cursory knowledge on a topic you're unfamiliar with. It's the tourist map to the "sum of all human knowledge." If you expect to use it for anything else, you're asking too much of it.
Putting it on ISS wouldn't help with that, although I'm sure this comes as no surprise to you, given that its orbit is a decaying one.
I like the idea of periodic Wikipedia moonshots, although the storage format is kind of an open question. I've wondered for a while if a DVD made from e.g. quartz, platinum, and titanium might be up to the job.
A full backup would fit on 12 double-layer, single-sided disks; I'm being conservative and not using Blu-Ray numbers, since density and longevity are always somewhat in tension. Probably more expensive to put them safely on the moon than to manufacture in the first place.
Agreed. I think even bare nickel or iron would probably be fine. Holographic glass laser damage can in theory handle higher recording densities and, like your DVD, isn't vulnerable to surface damage.
In space you probably don't have to worry as much about minor surface scratches and oxidation, though. You just have to worry about outgassing and meteoroid impacts. Some of them you can stop, and some you can't. On the bright side, they're very rare.
I think common media formats like DVDs are designed with a lot of emphasis on speed, both of reading and of duplication. This compromises both density and longevity. If you, instead, allow yourself the luxury of FIB milling to write and an electron microscope to read, you can manufacture in any vacuum-stable material at all, and you can engrave your archival message with, say, 50-nanometer resolution. At one bit per 50 nanometers square, you get five gigabytes per square centimeter.
I think that with e-beam photoresist cross-linking followed by etching you get about 500 kilobits per second, and I think FIB milling is a little slower, so it might take a few weeks to make the copy — obviously unacceptable for a consumer CD burner but fine for periodic Wikipedia moonshots.
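The numbers work out, roughly; here's the back-of-the-envelope version, with the 50 nm pitch and 500 kbit/s write rate taken as the assumptions they are:

    # Back-of-the-envelope for the e-beam/FIB archival idea above.
    pitch_nm = 50          # one bit per 50 nm x 50 nm cell (assumed)
    write_bps = 500e3      # ~500 kilobits per second (assumed)
    archive_gb = 100       # roughly a full English Wikipedia ZIM

    bits_per_cm2 = (1e7 / pitch_nm) ** 2   # 1 cm = 1e7 nm
    bytes_per_cm2 = bits_per_cm2 / 8
    area_cm2 = archive_gb * 1e9 / bytes_per_cm2
    days = archive_gb * 8e9 / write_bps / 86400

    print(f"{bytes_per_cm2 / 1e9:.0f} GB per square centimeter")        # ~5 GB/cm^2
    print(f"{area_cm2:.0f} cm^2 of media, ~{days:.0f} days to write")   # ~20 cm^2, ~19 days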
And what's the point of it in space? Knowledge doesn't disappear when it's not on wikipedia. If humans are still around they will continue contributing to knowledge. Just because it's not printed or recorded doesn't mean that information or knowledge doesn't exist.
And how would these 14 survive if they are the only ones left? Do you know that there's a whole support team to support them in space? It's not just 14 people. There are hundreds on the ground supporting them.
How much would the science capabilities of a telescope like JWST be reduced if 1/3 of its SSD was repurposed for storing the latest wikipedia dump (that 1/3 number is assuming it's only English, compressed, and without images)? To me that seems like an easy cost/benefit analysis.
How much would the science capabilities of a telescope like JWST be reduced if we left its SSD alone and just taped a USB drive to the side of it somewhere that contained Wikipedia?
Would duct tape pass the pre-launch vibe check? You'd have to do some engineering work to make sure it's sturdy, doesn't have any impact wrt oscillations, won't create FOD (debris) etc.
Once you've done all that work, I'm not sure what you've actually accomplished. By the time any sentient being gets around to visiting JWST, I wouldn't be surprised if an unshielded commercial drive would be rendered totally unusable by radiation.
I wonder if they snapshot Wikipedia for this, or if they stagger it per article to avoid very recent unreviewed edits getting into such a download (edits that would, say, disappear off the site if they turned out to be vandalism or otherwise bad).
Do not store 96 GB of anything on exFAT; use ext4, APFS, ZFS, or some other journaled file system. Does NTFS really have a 4 GB file size limit? Its structures should match exFAT, so that part seems suspect to me.
No, but FAT32 does. exFAT, on the other hand, has a file size limit of 16 exbibytes. That, combined with exFAT's cross-platform mounting (NTFS has a lot of limitations in this regard), makes it a superior file system for flash-based offline file transfer.
This is the kind of thing that you download once and then never write anything on that media until you decide to refresh the content. In fact, you might as well mount it read-only. A journaling FS wouldn't do anything useful here.
Love it! Imagine if USB Flash drive manufacturers just loaded up new drives with content like this. I mean, why not right? I think the physics means it would even be lighter ;)
When I buy a storage device I usually have an intended purpose for that storage, and I would not like to have to delete all of the files some manufacturer thought would be useful just to make room for what I want.
Especially if you didn't know which one you were going to get. Plug it in for a big surprise! (From a verifiable manufacturer who has their customers' happiness and enjoyment at heart)
Now I'm curious: if, hypothetically, wikipedia was just backed by a single git repo and every edit was a commit, how big would it be and how long would it take to clone?
This has to be one of the most poorly structured pieces of writing I've seen in a while. It's way too verbose, and on the one hand there are separate sections like:
* Getting a flash drive
* Formatting a flash drive (which includes a subsection on not formatting it but buying one that's already formatted instead, while there was a separate section just before this one on buying a drive)
* Waiting for a file to download
At the same time downloading both Wikipedia and Kiwix are in the same section. Then, installing Kiwix is in a section called "You're done" which isn't next to the section on downloading Kiwix.
It looks like Kiwix uses the ZIM file format, which appears to have diffing support [0] (see zimdiff and zimpatch). That said, it doesn't look like Kiwix actually publishes those diffs.
That's exactly what I was thinking of as well. I remember listening to an episode on Darknet Diaries with this theme where dvds and usb drives are a common way to smuggle things into North Korea.