Ask HN: Interested in a Wikipedia parsing tool for data mining?
5 points by ajeet on Sept 27, 2012 | 6 comments
I recently developed a C library that parses Wikipedia markup, with the goal of being _fast_. For example, I was able to parse and extract text from the entire Wikipedia dump (~35 GB uncompressed) in under an hour, on a five-year-old iMac.

It is a work in progress. If there is sufficient interest, then I will clean up the code and put up some documentation.




Have you looked at dbpedia.org? Perhaps your tools would fit in with their project.


Yes I have, and that project would be a great fit. They already have a Scala regex-based lib, though. My library is likely a faster and lighter alternative.


Isn't there a complete database with the information in Wikipedia available?


Yes, but it takes quite some time to set up. I remember it took around two days on my server to get the data imported into MySQL. That said, once it's in, searching is a relatively solved problem, so I'd question the value of a C library. I suppose it'd be useful where you didn't (or couldn't) put the data into a database, or where you needed to parse new dumps all the time and didn't want to wait.


I am curious about your use case. Was it full text search? Did you get the database in wiki format, which you transformed to text?

The goal of my library is to enable quick data mining on Wikipedia. Search is just one use case. As an example, you might want to build a content classifier that automatically categorizes web pages into Wikipedia categories (like politics, sports, etc). To do this, you would need to parse wiki pages and extract features (like n-grams) for a particular category. The C library transforms raw wiki text into a parsed object that you can use to extract whatever information you want. The only advantage is that it does this incredibly fast.


I was just setting up a Wikipedia clone (using the dumps and MediaWiki), plus some added features, for my four months at sea with 250 students. There's no internet in the middle of the ocean, so a local server let them keep using Wikipedia. Wikipedia already publishes category dumps and such, so I just used those. But yes, for your use case, if you are extracting something they don't already provide, I can see the potential gains.



