

Ask HN: Interested in a Wikipedia parsing tool for data mining? - ajeet

I recently developed a C library that parses Wikipedia, with the goal of being _fast_. For example, I was able to parse and extract text from the entire Wikipedia dump (~35 GB uncompressed) in under an hour (on a five-year-old iMac).

It is a work in progress. If there is sufficient interest, I will clean up the code and put up some documentation.
======
tarr11
Have you looked at dbpedia.org? Perhaps your tools would fit in with their
project.

~~~
ajeet
Yes I have, and that project would be a great fit. They already have a Scala
regex-based library, though. Mine is likely a faster and lighter alternative.

------
sareiodata
Isn't a complete database of the information in Wikipedia already available?

~~~
nowarninglabel
Yes, but it does take quite some time to set up. I remember it took around
two days on my server to get the data imported into MySQL. That said, once
it's imported, searching is a relatively solved problem, so I'd question the
value of a C library. I suppose it would be useful where you couldn't load
the data into a database, or where you needed to parse new dumps all the time
and didn't want to wait.

~~~
ajeet
I am curious about your use case. Was it full-text search? Did you get the
database in wiki markup and transform it to plain text yourself?

The goal of my library is to enable quick data mining on Wikipedia; search is
just one use case. As an example, you might want to build a content
classifier that automatically categorizes web pages into Wikipedia categories
(like politics, sports, etc.). To do this, you would need to parse wiki pages
and extract features (like n-grams) for each category. The C library
transforms raw wiki text into a parsed object from which you can extract
whatever information you want. Its main advantage is that it does this
incredibly fast.

~~~
nowarninglabel
I was just setting up a Wikipedia clone (using the dumps and MediaWiki) and
adding some features to it for my four months at sea with 250 students. With
no internet in the middle of the ocean, it was helpful for them to have a
local server so they could still use Wikipedia. Wikipedia already publishes
category dumps and the like, so I just used those. But yes, for your use
case, if you are extracting something they don't already provide, then I can
see the potential gains.

