For actually ingesting the archives, dignifiedquire expanded a Rust utility aptly named Zim, which you can find here: https://github.com/dignifiedquire/zim
Both repos contain documentation (and code, of course) on how to extract data from the Zim archives.
from lxml import etree

sInputFileName = "/my/input/wiki_file.xml"

# Stream through the dump one <doc> element at a time
context = etree.iterparse(sInputFileName, events=('end',), tag='doc')
for event, elem in context:
    sPageContents = elem.text or ""
    iThisArticleCharLength = len(sPageContents)
    sPageURL = (elem.get("url") or "")[0:4000]
    sPageTitle = (elem.get("title") or "")[0:4000]
    # ... do what you want with these vars ...
    elem.clear()  # free the element so large dumps don't exhaust memory
It took a while to get accustomed to the format, but after looking at the files and doing some research on the documentation, Python with lxml made it relatively straightforward to do what I was interested in.
I'd recommend doing the same, only because it worked for me: get the XML dump, manually check out some files to understand what is going on, search for documentation on the file format and maybe read a few blog posts, and then convert the XML files to data structures suited for what you're interested in.
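To make that last step concrete, here is a minimal sketch of one possible conversion, assuming the same <doc url="..." title="..."> elements as in the snippet above; the file paths and the JSON Lines output are just placeholders for whatever structure suits your task:

import json
from lxml import etree

sInputFileName = "/my/input/wiki_file.xml"      # placeholder path
sOutputFileName = "/my/output/articles.jsonl"   # placeholder path

with open(sOutputFileName, "w", encoding="utf-8") as fOut:
    for event, elem in etree.iterparse(sInputFileName, events=('end',), tag='doc'):
        # Build a plain dict per article and write it out as one JSON line
        record = {
            "url": elem.get("url"),
            "title": elem.get("title"),
            "text": elem.text or "",
        }
        fOut.write(json.dumps(record, ensure_ascii=False) + "\n")
        elem.clear()  # drop the processed element to keep memory bounded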
And I've changed it a little to extract only the first n characters; this might be of some use since Wikipedia dumps are supposed to be pretty large: https://github.com/mooss/ruskea/blob/master/make_wiki_corpus...
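For illustration only (this is not the linked script), the first-n-characters idea amounts to slicing the article text before storing it; a minimal sketch with a hypothetical 500-character limit:

from lxml import etree

sInputFileName = "/my/input/wiki_file.xml"  # placeholder path
N_CHARS = 500  # hypothetical limit on how much of each article to keep

for event, elem in etree.iterparse(sInputFileName, events=('end',), tag='doc'):
    sShortText = (elem.text or "")[0:N_CHARS]  # keep only the first N characters
    # ... store sShortText instead of the full article text ...
    elem.clear()  # free memory between articles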