

Wiki2text - chx
https://github.com/rspeer/wiki2text

======
jug
Too bad about the complex template formatting getting in the way of parsing.
It sounds like a hard problem to solve from the raw XML too, forcing one to
take steps like rendering the documents and extracting the rendered output.
That solves the template issue, since everything is now rendered, but it
trades the logical structure for a display-oriented one.

I also think this is counterintuitive for a project that is about opening the
world's information to the masses. You'd think a problem with heavy use of
templates (which, ironically, may make the most popular and most requested
articles the hardest to parse) would have been solved by offering a cleaner
version in the dump alongside this XML-with-a-Turing-complete-template-language.
It sounds like it makes Wikipedia data interchange really tricky, raising the
question "who are they dumping for?"

