
Ask HN: Has anyone had luck using the Wikipedia data dumps? - perfect_loop
Many docs and tutorials are from 10+ years ago. Have you had any luck loading the data dumps (not the API) locally in order to play around with them? If yes, I'd very much appreciate it if you could point me in the right direction.
======
thomas536
I also didn't find much information about how long it would take to import
into a db, so I used the xml dumps directly [1]. I only needed the wiki
content (not the history), so the article xml files worked well for me. And
then I used mwparserfromhell [2] to parse and extract from the wiki markup.

[1]
[https://dumps.wikimedia.org/enwiki/20190301/](https://dumps.wikimedia.org/enwiki/20190301/)

[2]
[https://mwparserfromhell.readthedocs.io/en/latest/](https://mwparserfromhell.readthedocs.io/en/latest/)
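For reference, the streaming part of that approach needs only the standard library. This is a minimal sketch (the function name and the tiny sample file are mine, not from the dump docs); real dumps carry an XML namespace that varies between export versions, so it gets stripped here:

```python
import xml.etree.ElementTree as ET

def iter_articles(fileobj):
    """Stream (title, wikitext) pairs from a pages-articles XML dump
    without loading the whole file into memory."""
    title, text = None, None
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # keep memory flat while streaming

# Typical use on a compressed dump (filename is illustrative):
#   import bz2
#   with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
#       for title, wikitext in iter_articles(f):
#           ...  # hand wikitext to mwparserfromhell here
```

Each `wikitext` string can then go straight into mwparserfromhell for the markup parsing step.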

~~~
paulmolloy
I've been working on some research for a recommender system using xml wiki
article dumps the last few months. I've been using mwparserfromhell as well to
get plain text and some other metadata I needed from articles to create a
dataset. It seems to work pretty well for that use case anyway.
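As a rough illustration of what "get plain text" involves: mwparserfromhell's strip_code() handles far more cases (nested templates, refs, tags), but a dependency-free sketch of the idea looks something like this (the function and regexes are mine, deliberately crude):

```python
import re

def crude_strip(wikitext):
    """Very rough plain-text extraction from wiki markup.
    Only a sketch: mwparserfromhell does this properly."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # drop simple {{templates}}
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # keep [[link|labels]]
    text = re.sub(r"'{2,}", "", text)                              # bold/italic quote runs
    return text
```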

------
diggan
While building the Wikipedia mirror on IPFS (with search), we tried using the
dumps from Wikipedia themselves but ended up using Zim archives from kiwix.org
instead. The end result is here: [https://github.com/ipfs/distributed-
wikipedia-mirror](https://github.com/ipfs/distributed-wikipedia-mirror)

For actually ingesting the archives, dignifiedquire expanded a Rust utility
aptly named Zim, which you can find here
[https://github.com/dignifiedquire/zim](https://github.com/dignifiedquire/zim)

Both repos contain information (and code, of course) on how to extract data
from the Zim archives.

------
yk66
I use Kiwix to do that. Much simpler. Plus they provide other dumps too, so it
lets you play with, say, Wikipedia and Stack Overflow simultaneously.

~~~
pklee
That is really cool, I hadn't seen this before. Though I do think the
intention there is API access to the data, not browsing it.

------
arsenide
I have toyed around with the Wikipedia dump -- in XML, downloaded through the
provided torrent file on Wikipedia.

It took a bit to get accustomed to the format, but after looking at the files
and doing a bit of research on the documentation, using Python with lxml made
it relatively straightforward to do what I was interested in.

I'd recommend doing the same, only because it worked for me: get the XML dump,
manually check out some files to understand what is going on, search for
documentation on the file format and maybe read a few blog posts, and then
convert the XML files to data structures suited for what you're interested in.
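The "manually check out some files" step doesn't require decompressing the whole archive first: bz2 files can be read as a stream, so you can peek at the head of a multi-gigabyte dump cheaply. A small sketch (the function name and filename are illustrative):

```python
import bz2

def head(path, n=40):
    """Return the first n lines of a bz2-compressed XML dump,
    read as a stream so the file is never fully decompressed."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return [f.readline() for _ in range(n)]

# e.g. print("".join(head("enwiki-latest-pages-articles.xml.bz2")))
```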

------
aboutruby
You could also use Special:Export depending on your use case:
[https://en.wikipedia.org/wiki/Special:Export](https://en.wikipedia.org/wiki/Special:Export)
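Special:Export returns the current wikitext of a page as XML, so for small jobs it can stand in for the full dump. A sketch of fetching one page with the standard library (the function names and User-Agent string are mine):

```python
import urllib.parse
import urllib.request

def export_url(title):
    """Special:Export URL for one page's current revision as XML."""
    return ("https://en.wikipedia.org/wiki/Special:Export/"
            + urllib.parse.quote(title.replace(" ", "_")))

def fetch_page_xml(title):
    """Download the export XML for a single page title."""
    req = urllib.request.Request(
        export_url(title),
        headers={"User-Agent": "export-example/0.1 (hypothetical)"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# e.g. xml_bytes = fetch_page_xml("Ada Lovelace")
```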

------
thetermsheet
This may not be the most helpful reply, but I remember having to use some
"importing tool". Wikipedia provides standard SQL dumps, yet simply importing
them into the DB is not going to cut it. The community has created import
scripts which simplify the process to a degree.

------
zepearl
I used Python to load the contents of the articles into a DB (potentially
wrong extract of veeery old code - I have something like 20 different versions
lying around therefore I'm not 100% sure that this did work well):

===

    
    
      # iterparse streams the file instead of loading it all at once
      from lxml import etree

      sInputFileName = "/my/input/wiki_file.xml"
      context = etree.iterparse(sInputFileName, events=('end',), tag='doc')
    
      for event, elem in context:
        iThisArticleCharLength = len(elem.text or "")
        sPageURL = (elem.get("url") or "")[0:4000]
        sPageTitle = (elem.get("title") or "")[0:4000]
        sPageContents = elem.text
    
        # <do what you want with these vars...>
    
        elem.clear()  # release the element's memory as we stream
    

===

------
StrangeDoctor
I built tools to parse the compressed XML dumps. My computer was pretty
underpowered at the time (a MacBook Air), so I had to be very careful to make
everything a streaming algorithm. Looking back, I basically recreated a shitty
map reduce in Python.
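On the streaming point: the "multistream" dump variants make memory-light access even easier. They are many independent bz2 streams concatenated together, with a companion index file of `offset:page_id:title` lines, so you can seek straight to the block containing a page. A sketch of reading one block (the function is mine):

```python
import bz2

def read_stream_at(path, offset):
    """Decompress the single bz2 stream starting at byte `offset` of a
    multistream dump (offsets come from the companion index file)."""
    with open(path, "rb") as f:
        f.seek(offset)
        dec = bz2.BZ2Decompressor()
        out = []
        while not dec.eof:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out.append(dec.decompress(chunk))  # stops at end of this stream
        return b"".join(out)
```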

------
mooss
I've had some success using this tutorial:
[https://www.kdnuggets.com/2017/11/building-wikipedia-text-
co...](https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-
nlp.html).

And I've changed it a little bit to extract only the first n characters, which
might be of some use since the Wikipedia dumps are pretty large:
[https://github.com/mooss/ruskea/blob/master/make_wiki_corpus...](https://github.com/mooss/ruskea/blob/master/make_wiki_corpus.py).

------
kldavis4
I wrote a simple parser in node to import the article dump into an
Elasticsearch instance as a part of a hands on tutorial:
[https://github.com/kldavis4/kuali-
days-2017-elasticsearch/bl...](https://github.com/kldavis4/kuali-
days-2017-elasticsearch/blob/master/wikipedia/index.js). At the time, on the
full dump, it took quite a while to ingest (days as I recall).

------
usgroup
Depending on what you’re doing, consider using Wikidata instead. It has a
SPARQL interface that’s easy to query.
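The public endpoint is query.wikidata.org/sparql and answers plain GET requests. A standard-library sketch (the helper names and User-Agent are mine; the endpoint and `format=json` parameter are real):

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def build_sparql_url(query):
    """Encode a SPARQL query into a GET URL returning JSON."""
    return SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})

def run_sparql(query, user_agent="sparql-example/0.1 (hypothetical)"):
    """Run a query against the public endpoint; returns parsed JSON."""
    req = urllib.request.Request(build_sparql_url(query),
                                 headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (instances of house cat, Q146), run with:
#   data = run_sparql("""
#       SELECT ?item ?itemLabel WHERE {
#         ?item wdt:P31 wd:Q146 .
#         SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
#       } LIMIT 5""")
#   for b in data["results"]["bindings"]:
#       print(b["itemLabel"]["value"])
```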

