Show HN: Static.wiki – read-only Wikipedia using a 43GB SQLite file

phiresky · on July 30, 2021

Author of the referenced blog (and library) here. This is great!

The full text search engine in SQLite is sadly not really good for this - one reason is that it uses standard B-Trees, another is that it forces storing all token positions if you want BM25 sorting, which is a huge overhead for articles as long as Wikipedia's.

But that doesn't mean full text search isn't possible in a very efficient manner with statically hosted data! I wrote a proof of concept of making the Rust-based tantivy library work in the same way, which has a lot of internal things that can make the index much smaller and more efficient than SQLite's. It's also >10x faster in creating the search index.

Here's the demo also for Wikipedia: https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.ht... I'm not sure if it's more efficient than the SQlite version in this form, but it definitely has more upward potential and is more fun to work with.

And the corresponding draft PR: https://github.com/tantivy-search/tantivy/pull/1067

I sadly haven't gotten around to working on it more and writing an article about it.

Other people are also working on using this stuff to make Sci-hub and LibGen more available by using this in combination with IPFS for distributed and "uncensorable" hosting which is pretty awesome.

Edit: Just realized that the OP demo only searches in article titles, while mine searches in full article contents by default. You can search in only the titles in my demo by querying `title:harry title:potter`

segfall · on July 30, 2021

Thank you, Phiresky. My little side project only exists because of your work.

ignoramous · on July 31, 2021

Congrats on your side project! This wonderful discussion exists because of your work.

walrus01 · on July 31, 2021

How does it do for performance if you throw the whole 43GB into RAM? Plenty of very affordable workstation systems out there today gently used with 128GB in them.

shp0ngle · on July 31, 2021

what? the point of this is that it seeks just parts of the files through http ranges… not all 43 GB… just try the page? or am I missing something?

unixhero · on July 31, 2021

You're missing his pragmatic and creative solution to leapfrog the problem altogether.

bawolff · on July 31, 2021

Which problem?

Yes putting it in ram will be faster than downloading it over the network. I don't think anyone ever doubted that.

I imagine people who wanted to download it would just use kiwix. The potential to download isn't the part that makes this project interesting.

IshKebab · on July 31, 2021

A pragmatic solution that requires users to have 43GB of RAM? Right.

jggonz · on July 31, 2021

I'm going to upvote this comment because not everyone in the world can afford $300 for 64GB+ of RAM. The solution to not dump everything into RAM is a great one.

Fortunately, I've had a fancy Silicon Valley job for 15+ years which now lets me afford many GBs of RAM if I wanted to but it was a different story in my 20s. I relied on free or almost free hardware to learn. And that free hardware usually lagged in specs.

walrus01 · on July 31, 2021

Yeah I agree with you not everyone can afford it, but lots of young people build $1800 gaming PCs. If you want a powerful Linux workstation desktop PC for fun projects like this you can get a Dell t5610 with two 8 core xeons and 128GB of ram on eBay for $650, then add your own $100 SSD.

I would not recommend running out and buying a new current gen Ryzen board+cpu or core i7-whatever and 128GB of new ram at ddr4-4200 prices.

AussieWog93 · on July 31, 2021

To put it in perspective, 43GB of RAM to browse the entirety of Wikipedia doesn't sound that ridiculous when popular Electron-based IM programs use 1GB+.

exdsq · on July 31, 2021

It's around £300 for 64GB of Corsair RAM - you need a motherboard that supports it but you can go cheap and you don't need a great processor. You could probably put this together for £600? It's not exactly crazy if this was for an office or something.

azeirah · on July 31, 2021

I mean sure it's for enthusiasts, but my 3 year old pc already has 32GB ram. It's definitely not out of the question

unixhero · on July 31, 2021

2021 called, memory is cheap.

andi999 · on July 31, 2021

Notebook vendors didn't get this memo.

Eremotherium · on July 31, 2021

Don't buy a garbage laptop with soldered RAM and just upgrade it yourself?

andi999 · on July 31, 2021

Good solution. Which one that allows for 128GB do you recommend?

unixhero · on July 31, 2021

A workstation. Any professional one.

jbverschoor · on July 31, 2021

20 applications x 50GB = 1tb of memory.

Quekid5 · on July 31, 2021

o... kay?

You're not wrong, but who's talking about 20 applications?

jbverschoor · on July 31, 2021

Look at the amount of applications and web-application you have open at a given point.

If this is such a good idea, let's just all use iscsi and mount the internet.

linkpuff · on July 31, 2021

Is it though? We are in the middle of a chip crisis where prices are raising to record levels...

unixhero · on July 31, 2021

Yes it is. In the grand scheme of things, looking at how far we've come DDR4 modules are cheap dollar per gigabyte.

IshKebab · on July 31, 2021

And how many users have more than 16GB of RAM?

IgorPartola · on July 31, 2021

On phones too?

gsich · on July 31, 2021

Phones are not first class devices.

IgorPartola · on July 31, 2021

Says who?

gsich · on Aug 1, 2021

I do. Many others too. Or do you see books written on phones, programing being done on phones? Being able to install any program you like? No? Not first class then.

There are many things that can't be done on a phone that are possible on a PC. The reverse case is much harder.

IgorPartola · on Aug 2, 2021

I see Wikipedia being read on phones, which is the problem at hand.

gsich · on Aug 2, 2021

The fact that you can't even add memory (be it RAM or storage, the later one is of course doable on some devices) to your phone makes them not first class devices.

IgorPartola · on Aug 2, 2021

Same with MacBooks? Sorry this sounds like utter nonsense.

gsich · on Aug 2, 2021

shp0ngle · on July 31, 2021

ahhh I misread his comments.

Sure, RAM is always faster.

sonograph · on July 31, 2021

> Sure, RAM is always faster.

Well, why not shove it all into L2 cache and leapfrog RAM?

/first joke I ever made on HN

trollied · on July 31, 2021

Jokes are funny...

londons_explore · on July 31, 2021

You still have to download it all... Which is a barrier for most...

unknownOrigin · on July 31, 2021

That's only a size of regular modern video game... the downloads of which are pretty mainstream these days. Saying that a 50 gig download is a barrier "for most" is definitely not true.

tjoff · on July 31, 2021

For English it states 90 GB uncompressed, doesn't say compressed size but that doesn't sound much larger than a large game. In the context I don't see it as a barrier.

tw04 · on July 31, 2021

As someone who was very recently on about as bad of a dsl connection you could get... What? Outside of metered cellular connections this is available to basically everyone. Even the remotest parts of Africa have at least a handful of unmetered connections within a days drive and a USB drive.

exdsq · on July 31, 2021

Gears of War 5 on the Xbox is 133GB with all its updates. I think we're at the point where 100GB is a high but reasonable request for people nowadays, at least in cities in the West.

peterhunt · on July 31, 2021

You can also get efficient FTS with this method if you implement indexing in user space and avoid BM25. The Lucene practical scoring function works well with this method in my experience: https://www.elastic.co/guide/en/elasticsearch/guide/current/...

piyh · on July 31, 2021

Is there a dumbed down version of this indexing conversation for someone who understands b-trees, but BM25 or user space indexing?

peterhunt · on July 31, 2021

Well, this post is in the context of an e2e encrypted DB, but it's subject to the same constraints: https://medium.com/@ZeroDB_/scalable-full-text-search-over-e...

If you understand btrees you understand the hardest part already :)

Basically, you need to design a search index that examines the fewest DB pages in order to find the result. The Lucene scoring method stores a mapping of term -> document[] sorted in relevance order. The main idea is that you can examine only the first n documents for each term in the search query in order to find the most relevant search results. Picking n is sort of tricky, but if your index is stored in this way it's possible to fulfill a large % of queries efficiently without downloading the whole index.

Here's a little Python implementation of what I mean by a "user space implementation". Note that it's a toy but it performs pretty well on some demo sklearn data sets: https://gist.github.com/petehunt/724eeb77189332db101ad7b0db8...

dheera · on July 31, 2021

Yeah I was just thinking that. Why not just a static filesystem? Just mount a ext4 image full of .html files and browse away. It would be remarkably efficient, likely much more efficient than SQLite.

tored · on July 31, 2021

I have been thinking of using FTS5 with SQLite to search emails. Not close of 43 GB ofc, so I would probably not have any major performance problems, but still is FTS5 any good? Or should I look into other solutions?

psanford · on July 31, 2021

The nice thing about FTS5 is you can have full text search up and running on a dataset in a couple of minutes. So give it try and see if it is sufficient for your use case. If it is, great! If not, you've not wasted much time.

OJFord · on July 31, 2021

Have you seen 'notmuch'? I was honestly shocked how good/fast it is. It uses a C (++?) lib called 'xapian' to provide the actual index/search capability.

phiresky · on July 31, 2021

If you're talking about a normal local SQlite DB and your dataset is less than maybe 100GB of plaintext then SQLite FTS will work fine regarding performance.

neocodesoftware · on July 31, 2021

how are the xml dumps generated?

arianon · on July 30, 2021

Shocked to find out that this is very much a static website, it's "merely" downloading a small portion of the 43GB SQLite file in question with HTTP Range requests, and then it uses a WASM-compiled copy of SQLite to query the requested data. Very impressive.

simonw · on July 30, 2021

It's the same trick that was described here - it's absolutely brilliant: https://phiresky.github.io/blog/2021/hosting-sqlite-database...

scrame · on July 30, 2021

Wow! The end where they use sql to manipulate the DOM I think just blew a fuse in my brain. "Brilliant" feels like an understatement.

MuffinFlavored · on July 30, 2021

On insert from Netlify:

> [error: RuntimeError: abort(Error: server uses gzip or doesn't have length). Build with -s ASSERTIONS=1 for more info.]

On insert from Github Pages:

> [error: Error: SQLite: ReferenceError: SharedArrayBuffer is not defined] Your browser might either be too old to support SharedArrayBuffer, or too new and have some Spectre protections enabled that don't work on GitHub Pages since they don't allow setting the necessary isolation headers. Try going to the Netlify mirror of this blog for the DOM demos.

JoeyBananas · on July 30, 2021

wouldn't this be vulnerable to DOS attacks? I can make the database run arbitrarily long and complicated queries

simonw · on July 30, 2021

Since the work all happens in your browser, the only victim of a long complicated query would be you own browse and the S3 bandwidth bill of the person hosting the database (since you'd end up sucking down a lot of data).

But if you want to jack up someone's S3 bandwidth bill there are already easier ways to do it!

floatingatoll · on July 30, 2021

It's a static file, CDN it.

simlevesque · on July 30, 2021

Many CDNs cannot cache 43GB files.

Cloudflare's limit is 10GB, 20GB for Cloudfront, 150GB for Akamai and Azure.

phiresky · on July 30, 2021

You can chunk the file into e.g. 2MB chunks. The CDN can then cache all or the most commonly used ones. That's what I did in the original blog post to be able to host it on GitHub Pages.

dmos62 · on July 31, 2021

I would look into caching range requests. Simpler than pre-chunking or caching the whole database.

sltkr · on July 30, 2021

It should be trivial to split it up into 1GB chunks or whatever. In fact, if you only request single pages, you could split the database up into pagesize-sized files. This is a lot of files, but at least it avoids the need to do range requests at all. You probably want to increase the pagesize a bit though (e.g. maybe 256 KiB instead of the default 4 KiB?)

floatingatoll · on July 30, 2021

Range requests are the star, so I would vote keep them? I mean, technically, the CDN could represent each 4kb block of the file as an individual URL, so that you do range requests by filename instead of by .. range request .. but at some point I definitely think RRs are more sane.

rzzzt · on July 31, 2021

Probably both, compute file name and relative offset from the original offset. (A minor complication is requests crossing file boundaries, in which case you have to do two requests instead of one.)

milankragujevic · on July 30, 2021

Isn't Cloudflare's limit 512 MB? (except for enterprise plan which has no price defined where it's 5GB)

https://support.cloudflare.com/hc/en-us/articles/200172516-U...

floatingatoll · on July 30, 2021

TIL, for whatever it's worth :) Thanks!

jlg23 · on July 30, 2021

If I get the essence of what you are saying... "the only victims would be both ends of the communication"? ;)

quickthrower2 · on July 30, 2021

What is said is this site isn’t any more susceptible to DOS than any other site. Of all the ways of architecting a site this is probably the least vulnerable.

chrisseaton · on July 30, 2021

> I can make the database run arbitrarily long and complicated queries

Do you understand that the database runs on your computer? You can only DOS yourself.

lillecarl · on July 31, 2021

In a smart webserver this could mean not a whole lot more than a memcpy, and between you and the CDN it probably won't be too high latency.

One would have to version the SQLite DB by version to avoid corruption by cache invalidation.

rkeene2 · on July 31, 2021

My static file HTTP server called "filed" [0] will satisfy a request in as few as 1 system call (no memcpy involved -- the kernel reads the file and sends the buffer to the NIC), by using sendfile(2). Most other webservers do a bit more work, like open the file for every request.

[0] http://filed.rkeene.org/

CodesInChaos · on July 30, 2021

The database runs on the client. It can't do anything you couldn't do through any other http client.

k__ · on July 30, 2021

Didn't know you could pump 43GB of data into a browser.

wmf · on July 31, 2021

The trick is that the browser is not downloading all 43 GB, just the parts it needs.

dnamlin · on July 31, 2021

I plowed the 40 GiB database file into sqlite_zstd_vfs [1] which reduced it 75% to 10 GiB. This plug-in also supports HTTP access [2] in the spirit of phiresky's, however, I don't have a WebAssembly build of mine yet, so it's a library for desktop/command-line apps for now. You can try it out on Linux or macOS x86-64:

  pip3 install genomicsqlite
  genomicsqlite https://f000.backblazeb2.com/file/mlin-public/static.wiki/en.zstd.db "select text from wiki_articles where title = 'SQLite'"

("genomicsqlite" is the CLI for my Genomics Extension [3], which is built around these Zstandard compression & web layers.)

[1] https://github.com/mlin/sqlite_zstd_vfs

[2] https://github.com/mlin/sqlite_web_vfs

[3] https://mlin.github.io/GenomicSQLite

EDITS: I expanded on this comment in this gist https://gist.github.com/mlin/ee20d7c5156baf9b12518961f36590c...

If you want to download the whole en.zstd.db, then please kindly get it from zenodo (which doesn't support HTTP range requests, but is free): https://zenodo.org/record/5149677

jart · on July 31, 2021

Great work but why not compress with deflate if you are serving http requests since then you could directly copy the database content to the wire as gzip encoded responses.

dnamlin · on July 31, 2021

With sqlite_zstd_vfs the data are compressed beforehand and stored that way "at rest", so web responses are directly copied to the wire controlled by HTTP range headers, similarly to the OP. They need to be decompressed by a client library sitting between SQLite and the wire.

segfall · on July 30, 2021

OP here. Happy to take questions, with the caveat that my work here is mostly glue.

Some background: https://github.com/segfall/static-wiki

The datasets I generated: https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...

The original author of this approach: https://news.ycombinator.com/item?id=28013514

segfall · on July 30, 2021

For added fun, you can try switching between different databases:

http://static.wiki/en/United_States

http://static.wiki/zh/美国

http://static.wiki/fr/United_States

http://static.wiki/enwikiquote/United_States

http://static.wiki/simple/United_States

ddtaylor · on July 31, 2021

Is there a way to get this simple theme / stylesheet to work for the regular Wikipedia? I really like it.

segfall · on Aug 1, 2021

Thank you. Unfortunately, you'd have to write that stylesheet from scratch. For this demo, I just wrote them in the Svelte components.

johnwheeler · on July 31, 2021

Is the db in memory?

segfall · on July 31, 2021

sql.js is in memory, but the database file is only partially loaded from the static file host.

exikyut · on July 31, 2021

No, as other comments have noted it's stored in S3.

simonw · on July 30, 2021

I'm nervous on your behalf for your S3 bill, have you done napkin maths for how much this will cost to host, especially given the relative inefficiency of the autocomplete queries?

A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against. That might give you queries that use smaller ranges of the file.

ignoramous · on July 30, 2021

> A trick that might help: create a separate SQLite table containing just the article titles to autocomplete against.

Another one: build a directed-acyclic-word-graph [0][1] of just the page-titles and download it to browser's localStorage for prefix-based searches.

In our usage, ~5M entries take ~18MB. Wikipedia has ~6.3M articles [2].

[0] https://github.com/smhanov/dawg

[1] http://stevehanov.ca/blog/?id=115

[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

mkl · on July 31, 2021

A version of [1] with the pictures working: https://web.archive.org/web/20120211141906/http://stevehanov...

Bayart · on July 31, 2021

Thanks for making me aware of a new data structure !

sitkack · on July 30, 2021

I built a wikipedia reader on the ipad years ago, I used two sqlite databases, one for the article titles using sqlite FTS then I compressed the articles themselves and merged the compressed chunks into a set of files < 2G, the title database had the offset into the compressed blobs. Only including the most popular pages, it was just under 10G.

simonw · on July 30, 2021

That's really clever. Were you storing the article text as compressed data in BLOB columns in a SQLite table?

sitkack · on July 31, 2021

I realize I wasn't clear. I used 2 sqlite databases just to store the article titles and offsets into binary files (<2GB) that had chunks of compressed data. Each chunk contained 10s-100s of stories, so that the decompressed around was around 20megs I believe. I had planned to store everything in sqlite, but there were too many issues. The FTS extension to sqlite is excellent, it made whole classes of problems literally melt away.

Recalling now, the article text itself, I stored in groups of articles concatenated together so that that data compression could take advantage of similar texts. This data was just Lua data structures, with the article title and article body in an array. So I knew which chunk it was in, but I had to decompress a whole chunk at a time and then iterate through the list of decompressed articles to find the specific one. I was trying to tune search time, decompression time and compression ratio. I had a handful of parameters in the job that would take the xml dumps and create the sqlite and compressed blobs. Memory consumption would spike as a chunk was decompressed, but after a call to the Lua garbage collector after pulling the article out of the decompression buffer, everything was fine again.

This was late 2010 or so. The job that processed the XML into database+bin data took over an hour to run on my laptop.

If I did it again, I'd use a compression library that was tuned for decompression and used an external dictionary. For the index, I'd use fst [1]

The project stalled out in parsing the wikitext, there are now libraries that can handle the parsing. I have considered resurrecting it, with jpeg2000 or other image codecs, one could include the images as well. The system did not have FTS over the article text itself. I had planned to use some sort tree of bloom filters, something similar to [2].

I was really trying to keep the total footprint of all the data under 10G or so it could fit on the majority of devices. Now with larger flash sizes, better codecs and better compression libraries, the functionality and performance could be greatly enhanced.

I learned a ton from this project, I really recommend getting deep into something, but also working within your limits and having fallback plans when something looks insurmountable. There were plenty of times where I thought I was a dead end, only to say, "what if ..." the next day, run a quick experiment and be on to the next insurmountable roadblock.

[1] https://github.com/BurntSushi/fst

[2] https://arxiv.org/abs/2002.03057

segfall · on July 31, 2021

Great catch, simonw (and everyone else in this thread).

I've reviewed some other static hosts and settled on wasabi for now. When my S3 bill rolls in, I'll reply to this comment with the damage. A scary number will help others avoid my oversight.

segfall · on Aug 2, 2021

AWS Billing reported that I was charged $28.43 over the course of 7 hours, for a total transfer of 306.916 GB. I believe this capture the ~25k page views the site received before I switched over to another static file host.

Lesson learned for me. Thanks to everyone in this thread for saving my wallet.

apitman · on Aug 3, 2021

Thanks for reporting back!

jinseokim · on July 31, 2021

It seems to be 403 happens.

Quick test: https://s3.us-east-2.wasabisys.com/static.wiki/en.db

capableweb · on July 30, 2021

Seems to be hosted at Surge,sh which is hosted at Digital Ocean, so seems this is not OPs problem.

    $ nslookup static.wiki
    
    Server:  8.8.8.8
    Address: 8.8.8.8#53
    
    Non-authoritative answer:
    static.wiki canonical name = na-west1.surge.sh.
    Name: na-west1.surge.sh
    Address: 188.166.132.94
    
    $ whois 188.166.132.94
    
    % This is the RIPE Database query service.
    % The objects are in RPSL format.
    %
    % The RIPE Database is subject to Terms and Conditions.
    % See http://www.ripe.net/db/support/db-terms-conditions.pdf
    
    % Note: this output has been filtered.
    %       To receive output for a database update, use the "-B" flag.
    
    % Information related to '188.166.132.0 - 188.166.135.255'
    
    % Abuse contact for '188.166.132.0 - 188.166.135.255' is 'abuse@digitalocean.com'
    
    inetnum:        188.166.132.0 - 188.166.135.255
    netname:        DIGITALOCEAN
    country:        NL
    admin-c:        PT7353-RIPE
    tech-c:         PT7353-RIPE
    status:         ASSIGNED PA
    mnt-by:         digitalocean
    created:        2019-04-17T14:02:52Z
    last-modified:  2019-04-17T14:02:52Z
    source:         RIPE
    
    person:         Network Operations
    address:        101 Ave of the Americas, 10th Floor
    address:        New York, NY, 10013
    address:        United States of America
    phone:          +13478756044
    nic-hdl:        PT7353-RIPE
    mnt-by:         digitalocean
    created:        2015-03-11T16:37:07Z
    last-modified:  2019-04-17T14:37:51Z
    source:         RIPE # Filtered
    org:            ORG-DOI2-RIPE
    
    % This query was served by the RIPE Database Query Service version 1.101 (HEREFORD)

But yes, it is wise to run storage/transfer heavy websites on something that doesn't charge for traffic like AWS and most other infrastructure providers do.

simonw · on July 30, 2021

That's just the site itself (the HTML and JavaScript). The 43GB SQLite database is hosted on S3.

dom96 · on July 30, 2021

Why do people even use S3 if that’s a risk? I really don’t get this, do most just not worry about accidentally wasting money or do they just not understand this risk?

slunk · on July 30, 2021

It's one thing if you have a source of revenue so justify your S3 costs. My interpretation of the parent commentator's concern is that this person has opened up a multi gigabyte S3 file publicly and sent lots of traffic its way (via hacker news) for what appears to be a passion project.

simonw · on July 30, 2021

This is a risk any time you host anything on the internet on any system that charges you for storage, bandwidth or CPU usage. It's fine to take that risk on as long as you're aware of it and keep an eye on your spend.

In this case I'm nervous because 43GB is a lot larger than most projects!

remram · on July 31, 2021

I think the point is that not every host charges for bandwidth or CPU usage, making S3 particularly risky.

RhodesianHunter · on July 31, 2021

There's no CPU charge, only per-request charges.

Bandwidth is free into S3 or from S3 to another AWS service so it's particularly handy for that use case.

remram · on July 31, 2021

For S3, yes.

RhodesianHunter · on July 31, 2021

Every system has risks. You just need to be aware of them going in.

S3 is excellent at many things, serving a ton of public traffic isn't one of them. Usually people put a CDN in front of it for that use case.

mayli · on July 30, 2021

Correct, downloading this 40GB file could easily leads to huge s3 bills due to traffic. That's 1$ for a full download.

banana_giraffe · on July 31, 2021

$2, if my math is right.

As pointed out elsewhere, there's some basic work being done to prevent everyone from grabbing the .db file, but it's easy enough to bypass.

Also, they're probably seeing non-zero API call charges. S3 only charges $0.0004 per 1000 calls, but when you start to make lots of calls at scale, it can really add up. Still, for this site, it's probably not that bad. If my simple tests are a fair judge, it works out to something like $20 a month for 100k lookups.

I've mentioned elsewhere on HN I used a similar technique with Python to query SQLite databases in S3. It works well, but there was an edge case on that project that resulted in millions of requests to S3 per invocation. It didn't break the bank, but it did push the project into the unprofitable range till we fixed that case.

Asdrubalini · on July 30, 2021

I tried downloading the file directly from curl [0] but it seems like it blocks non-partial GET requests, so downloading it is not really straightaway.

[0] https://s3.amazonaws.com/static.wiki/db/en.db

llacb47 · on July 30, 2021

     curl -v "https://s3.amazonaws.com/static.wiki/db/en.db" -H "Referer: http://static.wiki/" -H "Range: bytes=0-"

Above works but for the sake of OP's wallet I would suggest that you do not download the entire database from his S3 bucket.

infogulch · on July 30, 2021

Instead download it from kaggle which is offering to host it for free: https://www.kaggle.com/segfall/markdownlike-wikipedia-dumps-...

Weryj · on July 30, 2021

Get a FusionIO drive off ebay and host the DB yourself, if you need, stick the computer behind a static ip VPN.

max_ · on July 30, 2021

He should seed a torrent file

wmf · on July 30, 2021

Torrents aren't great for random access which this app is doing.

magila · on July 30, 2021

Bittorrent can do random access just fine, especially with v2 which enables content verification in 16KB chunks. Clients tend to be throughput oriented which leads to sub-optimal latency, but that's not a limitation of the protocol.

wmf · on July 31, 2021

I guess just fine is a matter of opinion. Connection setup and piece bitmap tracking seems like more overhead than HTTP range requests.

magila · on July 31, 2021

Connection establishment in bittorrent might be marginally higher overhead than HTTP, but not by much and the cost is highly amortized if you're doing many range requests in the same torrent. There is indeed more overhead from HAVE messages and such, but that's just because bittorrent is doing a lot more than HTTP to enable P2P. It's still not that bad as a percentage of the payload traffic.

dTal · on July 30, 2021

But at least we could get a copy without incurring a massive s3 bill.

rudian · on July 31, 2021

Can we really? For torrent to work you still need someone to host it in full at all times. Who will host a 40GB file that will be out of date in a week? The website provider will, so we’re back to square one.

dTal · on July 31, 2021

No, you don't. You just need N people to host 1/N of the file some of the time. The only person who needs to upload it in full is the original creator.

thrower123 · on July 31, 2021

I would like a copy of that sqlite database to play with.

Having that all preprocessed would save a bunch of work compared to getting it from Wikipedia itself

Weryj · on July 30, 2021

IPFS?

warpech · on July 31, 2021

Second that. I have no hands down experience with it, but I've seen seemingly more complex apps deployed on IPFS https://docs.ipfs.io/concepts/usage-ideas-examples/

NelsonMinar · on July 31, 2021

I've been using static copies of Wikipedia for years; they're great to have when traveling. I mostly use the Kiwix app. Their ZIM format database for all English is 83 GB; 43 GB for one without pictures. Compares nicely to the 43 GB here.

I once spent an hour in the Louvre learning about Rubens astonishing paintings of Marie de' Medici thanks to Kiwix. Without it I was just like "huh, nice paintings, I wonder what they are?" They're incredible and I owe that moment of kismet to Wikipedia, Kiwix, and my mobile phone. https://en.wikipedia.org/wiki/Marie_de%27_Medici_cycle

elcritch · on July 31, 2021

That’s brilliant! Art museums are fun but there’s so much context missing for most paintings.

mohiit · on July 31, 2021

I remember having it on my ipod touch way back in 2012 when it was only around 7gb.

voldemort1968 · on July 30, 2021

There's also https://www.kiwix.org/en/ which is great to take on long offline flights, browsing through wikipedia.

rycomb · on July 31, 2021

And someone keeps a relatively up-to-date .slob file with the entire wikipedia (that can be read with Aard2), which I find more useful than kiwix!

(A link to it can be found in the GitHub page with Aard dictionaries)

ce4 · on July 31, 2021

Aard2 Android: https://github.com/itkach/aard2-android

Aard2 web: https://github.com/itkach/aard2-web

Dictionaries: https://github.com/itkach/slob/wiki/Dictionaries

exikyut · on July 31, 2021

  files:  enwiki-20210616*
  size:   18.9 GB

How'd they manage that...

zelienople · on July 31, 2021

Seriously? They couldn't make the content something standard like a .zip file? Or a zip file with a SQL database in it? Bloody monkeys always have to go and create a new standard for everything. They will argue it's for performance or something, but it's really an exercise in ego. "Oh, mine's better!" No, it's not. What is better is what works with everything else in the world.

And that index! We couldn't capitalize the proper name of Wikipedia? And do you not understand there are visually impaired people in the world who can't read light red text on a white background? What is wrong with people?

And the terminology! What in the fuckety is Aard2? What is a .slob file? Why is it that everyone uses obscure acronyms or whatever for everything and never explains anything? Ego again, probably. Let's create a language that no one else understands to make ourselves feel superior! No, you are not superior, you are just turning language into thick vegetable soup.

Didn't you people go to university, or did you not pay attention when they taught you that the first reference is always completely spelled out and explained and you can use the acronyms (or whatever they are) after that. It's only common sense that you don't spout gibberish and nonsense words without explaining then in the first instance, unless your goal is to confuse everyone for the purpose of making yourself feel superior.

And the Google sheet and the flat index, what a hodgepodge of nonsense. What the heck does it mean? What is the difference between the different versions? Who knows?

I have to go now. My eyes are bleeding from trying to read light red text on a bright white background. I'll check back again once I get a Braille laptop screen.

lmm · on July 31, 2021

Does it actually work? I tried and it would always give up partway through the downloads (and most of the times when I ask in these threads it turns out the person recommending it doesn't actually use it, they just think it's a cool idea...)

41b696ef1113 · on Aug 1, 2021

I too have had rather disappointing results with Kiwix. Love the idea, but the implementation needs work. At this point, given the complexities of parsing Wikitext, I wish Wikipedia themselves would offer SQLite dumps of the latest text only content.

punnerud · on July 30, 2021

Related post 89 days ago: https://news.ycombinator.com/item?id=27016630 (Hosting SQLite databases on GitHub Pages or any static file hoster)

Scaevolus · on July 31, 2021

You should compress the individual article texts with zlib. Right now, looking up an article text for a decent size article ("Virus") takes 10 round-trips-- 5 appear to be btree traversals, but another 5 (!) are to read in the page content, as chunks of 8K 8K 16K 32K 64K. Compressing the article might cut one round trip off that

Increasing the sqlite page size to a much larger size like 64KB or 128KB may improve average latency too, since fewer btree pages would need to be fetched to find the offset of the desired row, and most page contents could be read with one fetch. This matters more when fetches over the network take ~100ms.

mongol · on July 30, 2021

Perfect if stuck on a remote deserted island! Must be the closest existing thing to the Hitchhiker's guide?

yreg · on July 30, 2021

I installed a reader with a Wikipedia dump (without multimedia) on the original iPhone. At the time (at least in my region) WiFi was quite rare and it really did feel like the Hitchhiker's guide and incredibly powerful.

I only kept it for a few weeks though since it consumed almost all of the space. (And I imagine it was probably my language mutation, the english Wikipedia must have been too big already.)

pronoiac · on July 30, 2021

You might look at Kiwix, and the data sources they've indexed: https://www.kiwix.org/en/

(Oddly, it's crashing at launch on my iPad right now)

phil294 · on July 30, 2021

I'm not trying to kill the fun here, but since this thing relies on a working internet connection, it would probably be pretty useless on a remote island

phkahler · on July 31, 2021

When SpaceX sends people to Mars they will want a local Wikipedia among many other things.

anigbrowl · on July 31, 2021

You can put a 43gb database on a $30 SD card and still have room for a bunch of movies.

dTal · on July 31, 2021

You could host it locally easily enough!

jtsiskin · on July 31, 2021

Yes, but you could also host a normal Postgres db easily enough

dTal · on July 31, 2021

This is entirely static, so I think it's easier.

ric2b · on Aug 8, 2021

Not entirely, the images are still loading from wikipedia.

wyldfire · on July 30, 2021

Towel not included.

exikyut · on July 31, 2021

Sigh obligatory XKCD https://xkcd.com/548/

r00fus · on July 30, 2021

Hitchhikers Guide to Wikipedia? HHGTWP?

nynx · on July 30, 2021

This is fantastic. I knew that we’d see things like this once that hosting sqlite databases project was posted.

mwidell · on July 31, 2021

I remember having the iPhone 2G back in 2007, and that someone created a way to store a 5GB wikipedia dump (the complete wikipedia at that time, but without photos) on it, so you could have the whole Wikipedia in your pocket. At the time it felt pretty amazing.

spiritplumber · on July 31, 2021

Don't panic!

harlanji · on July 31, 2021

Bonito. I want and yet have problems downloading even a 350mb file on my carrier, tethered via 2 different devices. Being able to DL something like this should be a requirement to be called an Internet carrier. Non-mobile is not an option for some, eg. low income unemployed and homeless. Maybe they’ve been a software engineer for 20 years and can reach unreachable markets. (No doubt it’s more profitable to spy and steal from the vulnerable, if God allows).

What if a customer needs their records and can not afford to get them and they lose their life or suffer significantly? Who pays the balance?

They didn’t need use the Internet? Okay, well they did use it and they do during pandemic times.

yukinon · on July 31, 2021

There's a bug when the target article is titled the same as the disambiguation page. Search for "Earthbound", and try clicking on the article for the video game. It'll bring you right back to the disambiguation page.

twotwotwo · on July 30, 2021

That's super cool.

The autocomplete trickiness is fun. Looks (from links in GitHub) like it's using full-text search and currently needs to scan a lot to rank results. Doing just prefix search might be a cheap half-solution to autocomplete, and maybe layering on hacks around stopwords etc. can get you a bit further.

For another angle, enwiki's full compressed list of titles (enwiki-20210720-all-titles-in-ns0.gz) is 86MB. Grabbing the whole dang thing (in the background?) and doing something with it client-side is a different sort of approach from what this page is going for but not unthinkable.

suprbeast · on July 30, 2021

We put this on an e-reader once. Onboard wikipedia, it was cool. It was 2011.

spiritplumber · on July 31, 2021

cool! was it able to update itself or no?

jart · on July 31, 2021

[Hit #1 on HN hosting 43gb Wikipedia SQLite DB on AWS] https://justine.lol/hugemistake.gif

Each time I post a link to Hacker News that becomes popular it ends up being at least a $100 in bandwidth bills from Google Cloud, and I've got a simple tiny unbloated website. This poor guy is going to go bankrupt. The Internet dark age is like Ma Bell all over again when we'd get those $500 long distance bills for dialing into BBSs.

dankwizard · on Aug 1, 2021

Hi, I am just commenting to mention a bug. When I look at page for The Smith's self titled album ( http://static.wiki/en/The_Smiths_(album) ) , the track listing section doesn't populate. Other than that, I love the simplicity and cleanness of the design

cafxx · on July 31, 2021

[nit] you should really run analyze before vacuum, not the other way around (https://github.com/segfall/static-wiki/blob/6d455de149c24e2a...)

Soullessmate · on Aug 6, 2021

This is great stuff, looks like I can serve a db straight off my hard drive to https://keybase.pub!

https://book.keybase.io/docs/files#storage

graderjs · on July 31, 2021

That's awesome man. I didn't know you could host Wikipedia on your own client / own back-end. Seems like this means if someone could built a better interface and performance for Wikipedia they could begin to steal some of their traffic. Is it not?

max1cc · on July 30, 2021

The search seems to be broken right now, but it looks like you can access whatever page you want via URL (e.g. http://static.wiki/en/Computer_Science)

dleslie · on July 30, 2021

Two immediate issues:

1. Doesn't work if Privacy Badger is active.

2. Nim (Programming Language) is not indexed.

JadeNB · on July 30, 2021

> 2. Nim (Programming Language) is not indexed.

You and dom96 (https://news.ycombinator.com/item?id=28013376) both noticed this issue, just two minutes apart. I guess there's some sort of inclination among HN readers testing out a static copy of Wikipedia to search for Nim? :-)

johnisgood · on July 31, 2021

dom96 is a core developer of the language, and author of Nim in Action, so it makes sense for him to look it up. I don't know the OP.

Retr0id · on July 30, 2021

More generally, it appears that any link containing brackets is broken/missing.

qwertox · on July 31, 2021

This appears to be the issue.

qwertox · on July 31, 2021

http://static.wiki/en/Gardena does not contain a link to http://static.wiki/en/Gardena_(company) yet https://en.wikipedia.org/wiki/Gardena does.

chadcatlett · on July 30, 2021

I noticed if you go directly to the page for Nim, it works.

http://static.wiki/en/Nim_(programming_language)

lanevorockz · on July 31, 2021

Maybe that would be a nice way to start Wikipedia alternatives, sijce it became a form of political activism the trustworthiness of Wikipedia fell down significantly.

Camillo · on July 30, 2021

Read-only Wikipedia is just "Wikipedia" for 99.99% of users. I can't remember the last time I've even considered clicking the edit button.

quenix · on July 31, 2021

Give it a try! It's strangely fulfilling.

foxpurple · on July 31, 2021

Editing an existing page is usually pretty easy and conflict free. Creating a new page is close to impossible and strictly guarded.

It’s almost not worth trying for a beginner because you will almost certainly be rejected. The moderators require multiple approved sources of information for a page to be created. These pages must be reputable sources (IE not blog posts, forum posts, or other wikis), and must not have any kind of commercial link to the subject. So if you create a page about a product and you state that it has feature X, you can not use the companies own page that talks about having X as a reference.

So basically you can’t make a page about something until a major news website writes about it or you can reference it in some formal study.

I tried to create a few pages. One of them being about a file format and found that I could not use the companies website with the PDF describing the spec as a reference for the format because there is a commercial link between the file format and the source linked.

quenix · on July 31, 2021

This is so painfully true. I've tried to submit two articles so far to AfC (articles for creation) and both were rejected. It's puzzling to me because I often see extremely low quality articles (with just one sentence, for example), published onto mainspace.

foxpurple · on Aug 1, 2021

I have just come to accept this is how it is and unless you are a wiki power user who can just directly create a page, don’t bother and let someone else create it because you are wasting your time. I find it funny that articles like the one for GitLab had to fight off the wiki mods who wanted to remove it for not being notable.

The worst part is how Wikipedia does not have a logical user introduction flow. They have an insane amount of rules and process but they will very quickly drop a brand new user in to a text box and make it look like you simply type in some stuff and the page will be created.

They need some kind of reputation system to ramp up users to greater and greater feature sets like stack overflow has rather than pretending a brand new or unauthenticated user can just create a page.

quenix · on Aug 4, 2021

I just discovered that we both may have been wrong.

Looks like you can skip the AfC review process entirely and this is given the go-ahead by Wikipedia (it's not frowned upon).

Basically, you have to create a Draft-namespace article (for example, Draft:Rust).

Then, once you have auto confirmed privileges (10 edits and 4 days account age), you can move the page namespace from the Draft namespace to the (Article) namespace...

... and that should create the article. Hmm.

thealig · on July 31, 2021

Oh interesting. I've started editing a bit on Wikipedia recently, and seems like there are a lot of potential articles to improve upon. The deletionism issue reminds me of this article https://www.gwern.net/In-Defense-Of-Inclusionism

andymoe · on July 30, 2021

Reminds me that at one point each page of the Apple Wiki Server that shipped with OS X Server back in the day used one SQLite file per page. Worked ok-ish.

andrepd · on July 31, 2021

How does this compare to https://www.kiwix.org/en/

dom96 · on July 30, 2021

Really cool! I think I see a bug: searching “Nim” shows “Nim (programming language)” but it’s not clickable.

segfall · on July 30, 2021

Yep, you're right. The wikitext -> markdown conversion strips out some links, noticeably on the disambiguation pages, for an unclear reason.

ctenb · on July 31, 2021

If wikipedia used this approach they could stop asking for money for server upkeep.

brokensegue · on July 31, 2021

wikipedia keeps getting updated. this approach is static. it's really easy to serve a static wikipedia lots of different ways.

hutrdvnj · on July 31, 2021

Since it doesn't require any backend, a IPFS mirror would be great.

8note · on July 31, 2021

I find that most wikipedia pages I'd want to look at are utterly unintelligible with communication conventions youd need to look up separately to understand the article.

I can't use an offline Wikipedia without that coursework

xchip · on Aug 1, 2021

Could article pages be stored as a torrent?

sgt · on July 30, 2021

thgttg.sqlite3

schwartzworld · on July 31, 2021

hhgttg

andyxor · on July 31, 2021

this is awesome, seems to be a good use case for using IPFS decentralized storage, either for copies of the big sqllite db dump file, or (as an alternative) for the individual wiki articles "snapshots"

i think it would be easier to keep it in sync with original wiki using the second option, where each article in static.wiki is synced with original wikipedia pages and updated individually, vs. updating the huge sqllite dump on every change in the originals

then the sqllite will be only holding metadata and links to static (and may be version controlled) versions of individual static pages/article snapshots, so the size will be much smaller and easier to distribute on ipfs

Edit: looks like there is similar effort in the works, to put wiki on ipfs https://blog.ipfs.io/24-uncensorable-wikipedia/

chovybizzass · on July 31, 2021

Searched for "porn" -- got this: http://static.wiki/en/Progressive_outer_retinal_necrosis

geenat · on July 31, 2021

Cool example of minimalism.

But hate to be that guy- if you're querying a database, it is not static.

dmd · on July 31, 2021

Nope, it is - unless you think "loading html pages from a remote web server" is also not static.

The point is there's no computation going on on the server, behind the scenes; your browser is doing all the work of knowing what to fetch, NOT the remote end, with the magic of range requests.

peterkelly · on July 31, 2021

That's what's so brilliant about this. The database query happens on the client, not on the server. The server side is entirely static hosting.