
I built a Wikipedia reader on the iPad years ago. I used two SQLite databases, one for the article titles using SQLite FTS; then I compressed the articles themselves and merged the compressed chunks into a set of files < 2GB, and the title database held the offset into the compressed blobs. Only including the most popular pages, it was just under 10GB.
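In today's terms, the read path looks roughly like this (a minimal Python sketch; the original was Lua, and the table and column names here are made up for illustration):

  import sqlite3, zlib

  # Hypothetical schema: titles(title, chunk_file, chunk_offset, chunk_length),
  # plus an FTS index titles_fts over the title text.
  def fetch_chunk_for(db_path, query):
      con = sqlite3.connect(db_path)
      row = con.execute(
          "SELECT chunk_file, chunk_offset, chunk_length FROM titles "
          "WHERE rowid IN (SELECT rowid FROM titles_fts WHERE titles_fts MATCH ?) "
          "LIMIT 1",
          (query,),
      ).fetchone()
      con.close()
      if row is None:
          return None
      chunk_file, offset, length = row
      with open(chunk_file, "rb") as f:   # one of the < 2GB blob files
          f.seek(offset)                  # jump straight to the compressed chunk
          blob = f.read(length)
      return zlib.decompress(blob)        # a chunk of concatenated articles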



That's really clever. Were you storing the article text as compressed data in BLOB columns in a SQLite table?


I realize I wasn't clear. I used the two SQLite databases just to store the article titles and offsets into binary files (< 2GB) that held chunks of compressed data. Each chunk contained tens to hundreds of articles, so that the decompressed size was around 20 MB, I believe. I had planned to store everything in SQLite, but there were too many issues. The FTS extension to SQLite is excellent; it made whole classes of problems literally melt away.
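For reference, setting up a title index like that is only a few lines (a sketch using FTS5, which is what you'd use today; back then it would have been FTS3, and the schema is the same hypothetical one as above):

  import sqlite3

  con = sqlite3.connect("titles.db")
  # Plain table holding the chunk coordinates, plus an FTS index over the titles.
  con.execute("CREATE TABLE titles (title TEXT, chunk_file TEXT, "
              "chunk_offset INTEGER, chunk_length INTEGER)")
  con.execute("CREATE VIRTUAL TABLE titles_fts USING fts5("
              "title, content='titles', content_rowid='rowid')")
  con.execute("INSERT INTO titles VALUES (?, ?, ?, ?)",
              ("Alan Turing", "chunks-00.bin", 0, 1048576))
  con.execute("INSERT INTO titles_fts(rowid, title) SELECT rowid, title FROM titles")
  con.commit()
  # Word search over titles:
  print(con.execute(
      "SELECT title FROM titles_fts WHERE titles_fts MATCH 'turing'").fetchall())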

Recalling now: the article text itself I stored in groups of articles concatenated together, so that the compression could take advantage of the similarity between texts. This data was just Lua data structures, with the article title and article body in an array. So I knew which chunk an article was in, but I had to decompress a whole chunk at a time and then iterate through the list of decompressed articles to find the specific one. I was trying to balance search time, decompression time and compression ratio. I had a handful of parameters in the job that would take the XML dumps and create the SQLite databases and compressed blobs. Memory consumption would spike as a chunk was decompressed, but after pulling the article out of the decompression buffer and calling the Lua garbage collector, everything was fine again.
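The chunk-building step is roughly this (a sketch; JSON stands in for the Lua data structures, and the chunk size is one of the tuning parameters mentioned above):

  import json, zlib

  CHUNK_TARGET = 20 * 1024 * 1024   # ~20 MB decompressed per chunk, a tuning knob

  def write_chunks(articles, out_path):
      """articles: iterable of (title, body). Returns (title, offset, length) rows
      pointing each title at its chunk, ready to load into the titles table."""
      offsets = []
      with open(out_path, "wb") as out:
          batch, batch_size, pos = [], 0, 0
          def flush():
              nonlocal batch, batch_size, pos
              if not batch:
                  return
              # Concatenate articles so the compressor can exploit shared text.
              blob = zlib.compress(json.dumps(batch).encode("utf-8"), 9)
              out.write(blob)
              for title, _ in batch:
                  offsets.append((title, pos, len(blob)))
              pos += len(blob)
              batch, batch_size = [], 0
          for title, body in articles:
              batch.append((title, body))
              batch_size += len(body)
              if batch_size >= CHUNK_TARGET:
                  flush()
          flush()
      return offsets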

This was late 2010 or so. The job that processed the XML into database+bin data took over an hour to run on my laptop.

If I did it again, I'd use a compression library tuned for fast decompression that supports an external dictionary. For the index, I'd use fst [1].
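The external-dictionary idea, concretely: train a dictionary of shared boilerplate once, ship it with the app, and then each article compresses well on its own without needing a 20 MB chunk of neighbors. zstd is the natural choice now, but even stock zlib supports preset dictionaries, which is enough to show the idea (a sketch only; the dictionary content here is an illustrative stand-in, not anything the original project used):

  import zlib

  # Preset dictionary: markup and boilerplate that many articles share.
  shared_dict = (b"{{Infobox person | name = | birth_date = }} "
                 b"[[Category:]] == References == == External links ==")

  def compress(text):
      c = zlib.compressobj(level=9, zdict=shared_dict)
      return c.compress(text.encode("utf-8")) + c.flush()

  def decompress(blob):
      d = zlib.decompressobj(zdict=shared_dict)
      return (d.decompress(blob) + d.flush()).decode("utf-8")

  body = "'''Grace Hopper''' was an American computer scientist. [[Category:1906 births]]"
  assert decompress(compress(body)) == body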

The project stalled out on parsing the wikitext; there are now libraries that can handle the parsing. I have considered resurrecting it; with JPEG 2000 or other image codecs, one could include the images as well. The system did not have FTS over the article text itself. I had planned to use some sort of tree of Bloom filters, something similar to [2].
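The Bloom filter idea, in its simplest form: keep one small filter per compressed chunk over the words it contains, so a full-text query only decompresses chunks whose filters match (with some false positives). A toy sketch, not the hierarchical scheme from [2]:

  import hashlib

  class ChunkBloom:
      """One small Bloom filter per chunk: answers 'might this chunk contain the word?'"""
      def __init__(self, bits=1 << 20, hashes=5):
          self.bits, self.hashes = bits, hashes
          self.array = bytearray(bits // 8)

      def _positions(self, word):
          for i in range(self.hashes):
              h = hashlib.blake2b(word.encode("utf-8"), salt=bytes([i]) * 8).digest()
              yield int.from_bytes(h[:8], "little") % self.bits

      def add(self, word):
          for p in self._positions(word):
              self.array[p // 8] |= 1 << (p % 8)

      def might_contain(self, word):
          return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(word))

  # At query time, only decompress chunks whose filter says the term might be present.
  bloom = ChunkBloom()
  for w in "alan turing was an english mathematician".split():
      bloom.add(w)
  assert bloom.might_contain("turing")
  print(bloom.might_contain("zeppelin"))   # almost certainly False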

I was really trying to keep the total footprint of all the data under 10GB or so, so it could fit on the majority of devices. Now, with larger flash sizes, better codecs and better compression libraries, the functionality and performance could be greatly enhanced.

I learned a ton from this project. I really recommend getting deep into something, but also working within your limits and having fallback plans when something looks insurmountable. There were plenty of times when I thought I was at a dead end, only to say "what if ..." the next day, run a quick experiment, and be on to the next insurmountable roadblock.

[1] https://github.com/BurntSushi/fst

[2] https://arxiv.org/abs/2002.03057



