My current project is a search engine for the RC hobby (http://comparerc.com).
I have over 150,000 parts indexed and searchable...unfortunately, none of them are categorized. Search is nice but a lot of people like to browse. I need to find a way to roughly categorize these parts. Any ideas?
I have a few options, none of which look particularly compelling. I can outsource it to MTurk or a few data entry VAs, but this will be prohibitively expensive: 150,000 MTurk HITs at 3 cents each will cost $4,500. I don't have that kind of capital to spend since this is a side project. It also ignores the accuracy problem: a lot of these components are highly technical, and the average person might categorize them incorrectly.
Another option is some kind of hierarchical clustering algorithm, which would attempt to sort the parts into categories for me. This is appealing because it is automated. The downside is that many items have only sparse descriptions, which may make it difficult for the algorithm to cluster well. The same quality-control issue is also present.
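To make the clustering idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: the sample descriptions are made up, the Jaccard word-overlap similarity and the 0.2 threshold are arbitrary choices, and this naive version is far too slow for 150,000 items.

```python
import re
from itertools import combinations

def tokens(text):
    """Lowercase word/number tokens of a part description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(descriptions, threshold=0.2):
    """Naive agglomerative clustering with average linkage.

    Repeatedly merges the two most similar clusters until no pair
    reaches `threshold` (an arbitrary cutoff chosen for this sketch).
    Returns clusters as lists of indices into `descriptions`.
    """
    toks = [tokens(d) for d in descriptions]
    clusters = [[i] for i in range(len(descriptions))]

    def linkage(c1, c2):
        # Average pairwise similarity between members of two clusters.
        sims = [jaccard(toks[i], toks[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x, y in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters

# Made-up sample descriptions, not real data from the index.
parts = [
    "brushless motor 3500kv",
    "brushless motor 4000kv sensored",
    "lipo battery 2s 5000mah",
    "lipo battery 3s 2200mah",
    "servo",
]
result = cluster(parts)
print(result)
```

On this sample the motors and batteries pair up, while the one-word "servo" item stays on its own, which is exactly the sparse-description problem I am worried about.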
Am I missing something obvious?
1) You should do an algorithmic prescreen via hierarchical clustering, with a confidence rating on each item, to lower the number of items that need human assessment. Any item with a sufficiently low confidence should be pushed to us (MobileWorks) for human review. This further cuts down the number of tasks that need human judgment.
2) If you can produce a ruleset for a human to follow in making categorization decisions, you can do this with MobileWorks for much less than the cost of MTurk -- probably a penny each. Since workers are incentivized properly, the cost per item is lower. You can contact me directly (anand@mobileworks).
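As a rough sketch of how the two points above fit together: split items by confidence, then price only the low-confidence remainder. Everything here is illustrative. The confidence scores, category names, and the 0.6 threshold are made up, and the 30% review rate is a hypothetical figure, not a measurement. Only the item count and the per-item prices (3 cents, about 1 cent) come from this thread.

```python
def triage(scored_items, min_confidence=0.6):
    """Split (description, category, confidence) triples into
    auto-accepted items and items needing human review.

    `min_confidence` is a hypothetical threshold; tune it on a sample.
    """
    accepted = [s for s in scored_items if s[2] >= min_confidence]
    review = [s for s in scored_items if s[2] < min_confidence]
    return accepted, review

def review_cost_dollars(n_items, cents_per_item):
    """Cost in dollars of sending n_items to human review."""
    return n_items * cents_per_item / 100

# Made-up prescreen output: (description, guessed category, confidence).
scored = [
    ("brushless motor 3500kv", "motors", 0.91),
    ("lipo battery 2s 5000mah", "batteries", 0.84),
    ("servo", "servos", 0.35),
    ("m3 screw set", "hardware", 0.48),
]
accepted, review = triage(scored)
print([s[0] for s in review])

total_items = 150_000
print(review_cost_dollars(total_items, 3))  # every item at 3 cents each
print(review_cost_dollars(total_items, 1))  # every item at 1 cent each

# Hypothetical: suppose 30% of items fall below the threshold.
needs_review = total_items * 30 // 100
print(review_cost_dollars(needs_review, 1))
```

The real savings depend entirely on how many items the prescreen can label confidently, which you would only know after checking a hand-labeled sample.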