Ask HN: I need to categorize 150,000 products
8 points by polyfractal on Aug 18, 2012 | 19 comments
My current project is a search engine for the RC hobby (http://comparerc.com).

I have over 150,000 parts indexed and searchable...unfortunately, none of them are categorized. Search is nice but a lot of people like to browse. I need to find a way to roughly categorize these parts. Any ideas?

I have a few options, none of which look particularly compelling. I can outsource it to MTurk or a few data entry VAs, but that would be prohibitively expensive: 150,000 MTurk HITs at 3 cents each would cost $4,500. I don't have that kind of capital to spend since this is a side project. It also ignores the accuracy problem: a lot of these components are highly technical and the average person might categorize them incorrectly.

Another option is some kind of hierarchical clustering algorithm, which attempts to sort the items into categories for me. This is appealing because it is automated. The downside is that many items have sparse descriptions, which may make it difficult for the algorithm to cluster well. The same quality-control issue is also present.

Am I missing something obvious?



You should follow a two-pronged approach, using the algorithm to do a first pass and humans for cleanup.

1) You should do an algorithmic prescreen via hierarchical clustering to lower the number of items that need human assessment, with a confidence rating on each entity. Any elements with sufficiently low confidence should be pushed to us. This will cut down the number of tasks that need human judgment.

2) If you can produce a ruleset for a human to follow in making categorization decisions, you can do this with MobileWorks for much less than the cost of MTurk -- probably a penny each. Since workers are incentivized properly, the cost per item is less. You can contact me directly (anand@mobileworks).
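
As a rough illustration of the prescreen in (1), here is a minimal sketch of routing items by confidence. The `Prediction` type, the 0.8 threshold, and the sample scores are all assumptions made for the example; the actual confidence score would come from whatever clustering step is used.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    item_id: str
    category: str
    confidence: float  # hypothetical score in [0.0, 1.0] from the clustering step


def split_by_confidence(predictions, threshold=0.8):
    """Return (auto_accepted, needs_human_review) based on an assumed threshold."""
    accepted, review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


# Made-up predictions, only to show the routing.
preds = [
    Prediction("a1", "Batteries", 0.95),
    Prediction("a2", "Motors", 0.40),
    Prediction("a3", "Servos", 0.85),
]
accepted, review = split_by_confidence(preds)
```

Only the `review` list would be sent out as human tasks, so the per-item cost applies to a fraction of the 150,000 items rather than all of them.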


Most likely 'search', as you already mentioned, would prove to be the most useful option here.

But to make sure, you can think about this approach. Get answers to these questions - (1) understand who the end users are (2) what problem are you trying to solve for them? (3) What kind of knowledge would they have before coming to this website? (4) What kind of information would they be searching on the site?

Each of these questions will have multiple answers. You will have to talk to the stakeholders (business owner, marketing dept etc.) to get this information. At the end of this exercise, you should be able to make groups of your end users. Could be 3 to 7 depending upon various demographic factors (1) age (2) location (3) domain knowledge (4) expectation etc.

Once you have these user groups decided, make a profile for each group. A profile describes that particular section of your user audience based on characteristics like age, gender, education, language, computer experience, domain expertise and their expectations of the website.

Then the most challenging part - talk to at least 10-15 'real' users from each of your user groups and validate the 'expectations' part that you assumed above. In addition to that you can also try 'card sorting' and 'reverse card sorting' approaches to see how your users categorize the information you have.

People from the Usability Analysis field will understand what I'm talking about here, but @polyfractal, you can trust this as a proven scientific method for the Information Architecture of your website content.


> Search is nice but a lot of people like to browse.

Could you categorise an item under the search term that people used to look for it? For example, let's say they enter a search term which brings up a range of items and they click on one. You can then add that item to a category whose name is the search term. Clicking on the category would bring up all the items people searched for successfully in the past.
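
A minimal sketch of that idea, assuming the search log can be reduced to (query, clicked item id) pairs. The log format, the function name, and the optional `min_clicks` noise filter are assumptions for illustration, not something the site is known to have.

```python
from collections import defaultdict


def categories_from_clicks(click_log, min_clicks=1):
    """Map each normalized search term to the items clicked for it.

    click_log: iterable of (query, item_id) pairs (assumed log format).
    min_clicks: hypothetical filter; an item needs this many clicks to count.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for query, item_id in click_log:
        counts[query.strip().lower()][item_id] += 1

    categories = {}
    for query, items in counts.items():
        kept = sorted(i for i, n in items.items() if n >= min_clicks)
        if kept:
            categories[query] = kept
    return categories


# Made-up log entries, only to show the shape of the result.
click_log = [
    ("LiPo battery", "p1"),
    ("lipo battery", "p2"),
    ("servo", "p3"),
    ("lipo battery ", "p1"),
]
by_term = categories_from_clicks(click_log)
```

Raising `min_clicks` is one way to keep a single stray click from creating a category.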


This was my first thought as well. If you are not confident in everyone's ability to categorize the items, you could develop some user profiles as described previously, and allow your more knowledgeable users to categorize items. This would be similar to the way HN allows users with a certain karma to downvote, and how StackOverflow has tiers of privileges.


Clever...I think I may have that data logged right now. I'll see about retroactively using that data to model categories and see if they are relevant. Thanks!


You seem to use a simple search algorithm. I searched for "battery", and only got results that had "battery" in the title. So basically, you could just skip the whole process and make each word in the title a tag...

That's unless you're using something more sophisticated and I just didn't realize.


I did a site:comparerc.com in Google, and most of the pages that show up are blank such as: http://comparerc.com/items/xk2845-b-3700kv-brushless-inrunne... . I'm not sure if your site is down, or if Google is indexing expired pages? Might want to look into that...

As for tagging, if I were you, I'd start out with a very simple categorization: by first letter.

Also, how are you going to fill these pages with more high-quality content? Google doesn't like thin pages and might penalize you in the future for having 100,000+ thin content pages.

And how do you plan to get links? Without links, you have little hope of ranking even for long-tail queries.


Not a bad idea tagging by first letter, although I'm not sure how useful that would be. If I visit the site and want to browse for batteries, I may not know all the brand names. Might be a good first start though.

Regarding blank pages: those pages are blank for two reasons. First, I just launched an updated version of the site. I haven't added any redirects from old to new yet.

The second is that I accidentally let Google index those pages in the first place. As you mentioned, Google doesn't like thin content. I received good long-tail traffic until Google (rightly) booted my pages down to the third or fourth page. I now have a robots.txt in place to stop it, and don't particularly want Google to index my product pages at the moment.

Moving forward, my plans to make product pages better involve various parametrics (which I'm working on): being able to play with interactive graphs of parametrics such as mAH/weight or discharge/cost ratio, then generic sorting based on price, availability, etc.

I'll selectively unblock certain product pages and allow Google to crawl them when they are more useful to the user.

Basically, Octopart is my role model.

Backlinking strategy right now is contacting bloggers and podcasters to get my site into the hands of a few hobbyists, so I can start to generate feedback. I also got a mention on Hackaday a few days ago which is helping. I have a few features planned which will incorporate more sharing.


Often the last word is a descriptive noun. Take the last word of each product name and make it a category. Then, go through this list (much smaller than 150k) and mark the ones that are and aren't actually categories. Now, change your categorization script to choose the closest word to the end that hasn't been marked as not a category. Go through your list again, this time filtering out anything you've positively marked as a category. After a few iterations with these scripts you should have decent categories.

Bonus: have your script set multiple categories if the title has multiple words you have marked positively as a category.
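
A minimal sketch of that loop, assuming titles are plain whitespace-separated strings. The function names and the sample titles are made up for illustration.

```python
from collections import Counter


def candidate_category(title, rejected):
    """Return the word closest to the end of the title that is not rejected."""
    for word in reversed(title.lower().split()):
        if word not in rejected:
            return word
    return None


def review_list(titles, rejected):
    """Count candidate words so the (much shorter) list can be marked by hand."""
    candidates = (candidate_category(t, rejected) for t in titles)
    return Counter(c for c in candidates if c is not None)


def confirmed_categories(title, confirmed):
    """Bonus rule: every positively marked category word found in the title."""
    return [w for w in title.lower().split() if w in confirmed]


# Hypothetical product titles.
titles = [
    "Acme 2075 waterproof servo",
    "Acme 6-channel transmitter black",
    "Bolt micro servo",
]
first_pass = review_list(titles, rejected=set())
second_pass = review_list(titles, rejected={"black"})
```

Each pass, words marked "not a category" go into `rejected` and the script is rerun, so the candidate for a title moves one word closer to the front.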


This is also clever. I'm a big fan of the 80/20 rule and I think this will work great as a first pass. There are definitely cases where it won't pick up anything, like this product:

OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003

That's a LiPo battery, and should be classified under "Batteries", but nothing in the title explicitly states that fact. You can tell it's a battery because of the parametrics (11.1 volts, 35C discharge, 3S, etc) but none of that would show up as a categorical classification.

Then again, I probably want sub-categories in batteries for those parametrics, so perhaps I just need to allow categories to be nested under other categories and manually assign tiers later.

Thanks for the suggestion, I'm going to play around with this!


You're more familiar with your data, but the last word doesn't need to be a hard and fast rule. The last word there looks like a product code. However, the first word looks like a brand.

If this is more of an exception, I'm sure you can generate a pretty small list of what falls into that. If you regex for some number of volts with some mAH, it's probably a battery or a motor. That at least reduces the number of things you have to manually go through later to clean up the data.


Glad it's helpful. In cases like that one I suppose you look for clues and have another script to assign categories. For example, if most batteries have voltage in the title you can regex for \d+V\w (or whatever) and assign battery to matches. Could become complex quickly, though.
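
A minimal sketch of such a clue-based script, using the title quoted above. The patterns here are an assumption and differ from `\d+V\w`, which needs a word character right after the V and so would not match "11.1V " followed by a space. Requiring both a voltage and a capacity is also just one possible rule.

```python
import re

# Assumed patterns: a voltage such as "11.1V" and a capacity such as "800mAH".
VOLTS = re.compile(r"\b\d+(?:\.\d+)?V\b", re.IGNORECASE)
MAH = re.compile(r"\b\d+mAh\b", re.IGNORECASE)


def looks_like_battery(title):
    """Guess 'battery' when the title has both a voltage and a mAh rating."""
    return bool(VOLTS.search(title) and MAH.search(title))


title = "OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003"
is_battery = looks_like_battery(title)
```

A title with a voltage but no capacity (a motor, say) is left unclassified by this rule, so it would still need another clue or a manual pass.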


I'd start by making your list of categories, then give each one associated linked words that appear in titles; where linked words are matched, the product is auto-assigned that category. So Batteries (category) would be auto-added to any product with battery, batteries, volt & discharge.

Just setting up some rules of how you would manually understand what a product is & automating it. Plus once you have good linking words to fit each category it should be perfect for new products that need categorizing. With MTurk, you'd have to keep paying every time new products were added.
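
A minimal sketch of those rules as a lookup table. The "Batteries" words come from the example above; the "Motors" entry is hypothetical, and treating any single linked word as enough for a match is an assumption.

```python
import re

# Category -> linked words. Only "Batteries" comes from the example above;
# "Motors" and its words are hypothetical.
RULES = {
    "Batteries": {"battery", "batteries", "volt", "discharge"},
    "Motors": {"motor", "motors", "brushless"},
}


def assign_categories(title, rules=RULES):
    """Return every category with at least one linked word in the title."""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return sorted(cat for cat, linked in rules.items() if words & linked)


cats = assign_categories("Acme 3S LiPo Battery pack")
```

New products can be run through the same table as they are added, and titles that come back with an empty list are the ones left for manual review.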


Could you ask your users to tag items on the fly? While they browse your data, they could categorize it with predefined tags and also suggest new tags that you can review.


I like this idea. I already have a "Flag this" button in place: sometimes my algorithms have grouped products together (thinking they are the same) when they are really two separate products.

Enlisting the users to help categorize would be similar. Good suggestion :)


Could you possibly automate the items that do have enough information and collect the ones which don't for later manual review?


I think this will be the solution I go with, but I'm a bit unsure how it would work in practice. I don't know how I would evaluate the accuracy of automated clustering?

Another solution might be a sort of automated-manual hybrid: e.g. identify common words/phrases in a particular category manually, write a script to find all items that have those, add to category.


Well to write the automatic bit you'll have to manually figure out the rules :)

But yeah, something like that would be a good start. I don't know anything about this domain so I'm of limited help here. It might be too hard to categorize based solely on words if they're not distinct enough.


You could use an algorithm to identify keywords in your dataset and then manually classify the most common ones.
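
One simple way to do this, sketched here as an assumption rather than a recommendation of a specific algorithm, is to count how many titles each word appears in and review the top of the list by hand. The sample titles are made up.

```python
import re
from collections import Counter


def common_title_words(titles, top_n=10, min_len=3):
    """Return the words that appear in the most titles, with their counts."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z]+", title.lower())
        # Use a set so a word is counted once per title.
        counts.update({w for w in words if len(w) >= min_len})
    return counts.most_common(top_n)


# Hypothetical product titles.
titles = [
    "Acme LiPo battery 3S",
    "Bolt LiPo battery 2S",
    "Acme micro servo",
]
top_words = common_title_words(titles, top_n=3)
```

Brand names will show up alongside category words, so the manual step is mostly sorting the frequent words into "category", "brand", and "ignore".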



