Ask HN: Interested in a Wikipedia parsing tool for data mining?
5 points by ajeet on Sept 27, 2012 | 6 comments
I recently developed a C library that parses Wikipedia markup, with the goal of being _fast_. For example, I was able to parse and extract text from the entire Wikipedia dump (~35 GB uncompressed) in under an hour, on a five-year-old iMac.

It is a work in progress. If there is sufficient interest, then I will clean up the code and put up some documentation.




Have you looked at dbpedia.org? Perhaps your tools would fit in with their project.


Yes I have, and that project would be a great fit. They already have a Scala regex-based lib, though. My library is likely a faster and lighter alternative.


Isn't there a complete database with the information in Wikipedia available?


Yes, but it takes quite some time to set up. I remember it took around two days on my server to get the data imported into MySQL. That said, once it's in, searching is a relatively solved problem, so I'd question the value of a C library. I suppose it'd be useful where you didn't (or couldn't) put the data into a database, or where you needed to parse new dumps all the time and didn't want to wait.


I am curious about your use case. Was it full text search? Did you get the database in wiki format, which you transformed to text?

The goal of my library is to enable quick data mining on Wikipedia. Search is just one use case. As an example, you might want to build a content classifier that automatically categorizes web pages into Wikipedia categories (like politics, sports, etc). To do this, you would need to parse wiki pages and extract features (like n-grams) for a particular category. The C library transforms raw wiki text into a parsed object that you can use to extract whatever information you want. The only advantage is that it does this incredibly fast.


I was just setting up a Wikipedia clone (using the dumps and MediaWiki), plus some added features, for my four months at sea with 250 students. There's no internet in the middle of the ocean, so a local server let them keep using Wikipedia. Wikipedia already publishes category dumps and such, so I just used those. But yes, for your use case, if you are extracting something they don't already provide, I can see the potential gains.



