

Ask HN: I need to categorize 150,000 products - polyfractal

My current project is a search engine for the RC hobby (http://comparerc.com).

I have over 150,000 parts indexed and searchable. Unfortunately, none of them are categorized. Search is nice, but a lot of people like to browse. I need to find a way to roughly categorize these parts. Any ideas?

I have a few options, none of which look particularly compelling. I can outsource it to MTurk or a few data entry VAs, but this would be prohibitively expensive: 150,000 MTurk HITs at 3 cents each will cost $4,500. I don't have that kind of capital to spend since this is a side project. It also ignores the accuracy problem: a lot of these components are highly technical and the average person might categorize them incorrectly.

Another option is some kind of hierarchical clustering algorithm, which attempts to sort the parts into categories for me. This is appealing because it is automated. The downside is that many items have only sparse descriptions, which may make it difficult for the algorithm to cluster well. The same quality-control issues are also present.

Am I missing something obvious?
======
anandkulkarni
You should follow a two-pronged approach, using the algorithm to do a first
pass and humans for cleanup.

1) You should do an algorithmic prescreen via hierarchical clustering, with a
confidence rating on each item, to lower the number of items that need human
assessment. Any items with a sufficiently low confidence you should push to
us. This will cut down the number of tasks that humans need to judge.

2) If you can produce a ruleset for a human to follow in making categorization
decisions, you can do this with MobileWorks for much less than the cost of
MTurk -- probably a penny each. Since workers are incentivized properly, the
cost per item is less. You can contact me directly (anand@mobileworks).
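
A minimal sketch of the routing step in (1), assuming the clustering or
classification stage already emits a confidence score between 0 and 1 for
each item. The `Prediction` type, the `triage` function and the 0.8 threshold
are all hypothetical names and values chosen for illustration, not part of
any particular library.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    product_id: str
    category: str
    confidence: float  # assumed to be in [0.0, 1.0]; how it is computed is up to the classifier


def triage(predictions, threshold=0.8):
    """Split predictions into auto-accepted items and items needing human review."""
    accepted, review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


# Made-up example data
preds = [
    Prediction("a1", "Batteries", 0.95),
    Prediction("b2", "Motors", 0.55),
    Prediction("c3", "Servos", 0.80),
]
accepted, review = triage(preds)
```

Only the `review` list would go to human workers, so the threshold directly
trades cost against the risk of accepting a wrong automatic label.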

------
usablebytes
Most likely 'search' as you already mentioned, would prove to be the most
useful option here.

But to make sure, you can think about this approach. Get answers to these
questions - (1) understand who the end users are (2) what problem are you
trying to solve for them? (3) What kind of knowledge would they have before
coming to this website? (4) What kind of information would they be searching
on the site?

Each of these questions will have multiple answers. You will have to talk to
the stakeholders (business owner, marketing dept etc.) to get this
information. At the end of this exercise, you should be able to make groups of
your end users. Could be 3 to 7 depending upon various demographic factors (1)
age (2) location (3) domain knowledge (4) expectation etc.

Once you have these user groups decided, make a profile for each group. A
profile describes that particular section of your audience based on
characteristics like age, gender, education, language, computer experience,
domain expertise and their expectations of the website.

Then the most challenging part - talk to at least 10-15 'real' users from each
of your user groups and validate the 'expectations' part that you assumed
above. In addition to that you can also try 'card sorting' and 'reverse card
sorting' approaches to see how your users categorize the information you have.

People who are from Usability Analysis field, will understand what I'm talking
about here, but @polyfractal, you can trust this as a proven scientific method
for Information Architecture of your website content.

------
mrlyc
> Search is nice but a lot of people like to browse.

Could you categorise an item under the search term that people used to look
for it? For example, let's say they enter a search term which brings up a
range of items and they click on one. You can then add that item to a category
whose name is the search term. Clicking on the category would bring up all the
items people searched for successfully in the past.
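
A rough sketch of how that could look, assuming the click log is available as
(search query, clicked item id) pairs. The log format, the function name and
the `min_clicks` cutoff are assumptions for illustration; the cutoff is there
so a single stray click does not put an item in a category.

```python
from collections import defaultdict


def categories_from_clicks(click_log, min_clicks=2):
    """Group clicked items under the normalized search term that led to them."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, item_id in click_log:
        # Normalize case and whitespace so "LiPo  battery" and "lipo battery" merge
        term = " ".join(query.lower().split())
        counts[term][item_id] += 1

    categories = {}
    for term, items in counts.items():
        kept = sorted(i for i, n in items.items() if n >= min_clicks)
        if kept:
            categories[term] = kept
    return categories


# Made-up log entries
log = [
    ("LiPo battery", "p1"),
    ("lipo  battery", "p1"),
    ("lipo battery", "p2"),
    ("brushless motor", "p3"),
    ("brushless motor", "p3"),
]
cats = categories_from_clicks(log)
```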

~~~
polyfractal
Clever...I think I may have that data logged right now. I'll see about
retroactively using that data to model categories and see if they are
relevant. Thanks!

~~~
yossilac
You seem to use a simple search algorithm. I searched for "battery", and only
got results that had "battery" in the title. So basically, you could just skip
the whole process and make each word in the title a tag...

That's unless you're using something more sophisticated and I just didn't
realize.

------
AznHisoka
I did a site:comparerc.com in Google, and most of the pages that show up are
blank, such as:
http://comparerc.com/items/xk2845-b-3700kv-brushless-inrunner-31fcc.html .
I'm not sure if your site is down, or if Google is indexing expired pages?
Might want to look into that...

As for tagging, if I were you, I'd start out with a very simple
categorization: by first letter.

Also, how are you going to make these pages filled with more high quality
content? Google doesn't like thin pages and might penalize you in the future
for having 100000+ thin content pages.

And how do you plan to get links? Without links, you have little hope of
ranking even for long-tail queries.

~~~
polyfractal
Not a bad idea tagging by first letter, although I'm not sure how useful that
would be. If I visit the site and want to browse for batteries, I may not know
all the brand names. Might be a good first start though.

Regarding blank pages: those pages are blank for two reasons. First, I just
launched an updated version of the site. I haven't added any redirects from
old to new yet.

The second is that I accidentally let Google index those pages in the first
place. As you mentioned, Google doesn't like thin content. I received good
long-tail traffic until Google (rightly) booted my pages down to the third or
fourth page. I now have a robots.txt in place to stop it, and don't
particularly want Google to index my product pages at the moment.

Moving forward, my plan to make product pages better involves various
parametrics (which I'm working on): being able to play with interactive graphs
of parametrics such as mAh/weight or discharge/cost ratio, then generic
sorting based on price, availability, etc.

I'll selectively unblock certain product pages and allow Google to crawl them
when they are more useful to the user.

Basically, Octopart is my role model.

Backlinking strategy right now is contacting bloggers and podcasters to get my
site into the hands of a few hobbyists, so I can start to generate feedback. I
also got a mention on Hackaday a few days ago which is helping. I have a few
features planned which will incorporate more sharing.

------
1123581321
Often the last word is a descriptive noun. Take the last word of each product
name and make it a category. Then, go through this list (much smaller than
150k) and mark the ones that are and aren't actually categories. Now, change
your categorization script to choose the closest word to the end that hasn't
been marked as not a category. Go through your list again, this time filtering
out anything you've positively marked as a category. After a few iterations
with these scripts you should have decent categories.

Bonus: have your script set multiple categories if the title has multiple
words you have marked positively as a category.
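
One way the iteration could be sketched, assuming titles are plain strings and
the "not a category" marks are kept in a set. The function names and the
sample titles are made up for illustration.

```python
from collections import Counter


def candidate_category(title, rejected):
    """Return the word closest to the end of the title that is not marked as a non-category."""
    for word in reversed(title.lower().split()):
        if word not in rejected:
            return word
    return None


def tally_candidates(titles, rejected):
    """Count candidate categories so the (much shorter) list can be reviewed by hand."""
    counts = Counter()
    for title in titles:
        candidate = candidate_category(title, rejected)
        if candidate is not None:
            counts[candidate] += 1
    return counts


# Made-up titles
titles = [
    "Acme 2200mAh LiPo Battery",
    "Acme 9g Micro Servo",
    "Acme 5000mAh Battery Pack",
]

first_pass = tally_candidates(titles, rejected=set())
# After review, "pack" is marked as not a category and the script is re-run
second_pass = tally_candidates(titles, rejected={"pack"})
```

Each pass shrinks the list that needs a human look, because every word marked
as not a category pushes the script one word further from the end of the
title.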

~~~
polyfractal
This is also clever. I'm a big fan of the 80/20 rule and I think this will
work great as a first pass. There are definitely cases where it won't pick up
anything, like this product:

 _OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003_

That's a LiPo battery, and should be classified under "Batteries", but nothing
in the title explicitly states that fact. You can tell it's a battery because
of the parametrics (11.1 volts, 35C discharge, 3S, etc) but none of that would
show up as a categorical classification.

Then again, I probably want sub-categories in batteries for those parametrics,
so perhaps I just need to allow categories to be nested under other categories
and manually assign tiers later.

Thanks for the suggestion, I'm going to play around with this!

~~~
caw
You're more familiar with your data, but the last word doesn't need to be a
hard and fast rule. The last word there looks like a product code. However,
the first word looks like a brand.

If this is more of an exception, I'm sure you can generate a pretty small list
of what falls into that. If you regex for some number of volts with some mAH,
it's probably a battery or a motor. That at least reduces the number of things
you have to manually go through later to clean up the data.
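
A small sketch of that regex check, using the title quoted above. The two
patterns are assumptions about how voltage and capacity are usually written
in titles and would need tuning against the real data.

```python
import re

# A number (optionally with a decimal part) followed by "V", e.g. "11.1V"
VOLTS = re.compile(r"\b\d+(?:\.\d+)?\s*V\b", re.IGNORECASE)
# A whole number followed by "mAh" in any capitalization, e.g. "800mAH"
CAPACITY = re.compile(r"\b\d+\s*mAh\b", re.IGNORECASE)


def looks_like_battery_or_motor(title):
    """True when the title mentions both a voltage and a mAh capacity."""
    return bool(VOLTS.search(title) and CAPACITY.search(title))


flagged = looks_like_battery_or_motor(
    "OUTRAGE 5C NRG35 3S1P 11.1V 800mAH 35C NRG355C-8003"
)
not_flagged = looks_like_battery_or_motor("Acme 9g Micro Servo")  # made-up title
```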

------
helen842000
I'd start by making your list of categories, each with associated linked words
that appear in titles. Where a linked word is matched, the product is
automatically assigned that category. So Batteries (category) would be auto
added to any product with battery, batteries, volt & discharge.

Just setting up some rules of how you would manually understand what a product
is & automating it. Plus once you have good linking words to fit each category
it should be perfect for new products that need categorizing. With MTurk,
you'd have to keep paying every time new products were added.
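
As a rough illustration, the linked words could live in a plain dictionary.
The Batteries entry uses the words listed above; the Motors entry and the
sample titles are hypothetical, added only to show a second rule.

```python
import re

RULES = {
    "Batteries": {"battery", "batteries", "volt", "discharge"},
    "Motors": {"motor", "motors", "brushless"},  # hypothetical second category
}


def assign_categories(title, rules=RULES):
    """Return every category whose linked words appear in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(cat for cat, keywords in rules.items() if words & keywords)


battery_cats = assign_categories("11.1 volt LiPo Battery")
motor_cats = assign_categories("Brushless Inrunner Motor")
unmatched = assign_categories("Carbon fiber propeller")
```

Products that come back with an empty list are the ones left over for manual
review, and new products can be run through the same rules as they arrive.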

------
koopajah
Could you ask your users to tag items on the fly? While they browse your data,
they could categorize items with predefined tags and also suggest new tags
that you can review.

~~~
polyfractal
I like this idea. I already have a "Flag this" button in place: sometimes my
algorithms have grouped products together (thinking they are the same) when
they are really two separate products.

Enlisting the users to help categorize would be similar. Good suggestion :)

------
DevAccount
Could you possibly automate the items that do have enough information and
collect the ones which don't for later manual review?

~~~
polyfractal
I think this will be the solution I go with, but I'm a bit unsure how it would
work in practice. I don't know how I would evaluate the accuracy of automated
clustering?

Another solution might be a sort of automated-manual hybrid: e.g. identify
common words/phrases in a particular category manually, write a script to find
all items that have those, add to category.
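
One possible way to get an accuracy number, not something anyone here has
proposed: hand-label a small random sample and compare it against what the
automated pass produced. This is only a sketch; the function names, the
sample size and the data are all made up.

```python
import random


def sample_for_review(item_ids, k, seed=0):
    """Pick k items at random (fixed seed for repeatability) to label by hand."""
    rng = random.Random(seed)
    ids = list(item_ids)
    return rng.sample(ids, min(k, len(ids)))


def accuracy(predicted, hand_labels):
    """Fraction of hand-labelled items whose predicted category matches the label."""
    checked = [i for i in hand_labels if i in predicted]
    if not checked:
        return 0.0
    correct = sum(predicted[i] == hand_labels[i] for i in checked)
    return correct / len(checked)


# Made-up automated output and hand labels
predicted = {"p1": "Batteries", "p2": "Motors", "p3": "Servos", "p4": "Batteries"}
hand_labels = {"p1": "Batteries", "p2": "Batteries", "p3": "Servos", "p4": "Batteries"}

to_label = sample_for_review(predicted, k=2)
score = accuracy(predicted, hand_labels)
```

The estimate is only as good as the sample, so a few hundred hand-checked
items would say far more about 150,000 products than a handful.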

~~~
DevAccount
Well to write the automatic bit you'll have to manually figure out the rules
:)

But yeah, something like that would be a good start. I don't know anything
about this domain so I'm of limited help here. It might be too hard to
categorize based solely on words if they're not distinct enough.

