
What's a reasonable number of categories? 10?



That's defined by the phenomenon you're investigating.

In the case of six-sided dice, there are precisely six categories, ideally with even odds of occurrence. With the lottery jackpot given, there are eight categories, with highly asymmetric probabilities and values.

In real-world cases, you might be distinguishing between two cases (treatment and control in a medical experiment), between multiple particles or isotopes (say, in physics or chemistry), amongst different political divisions (countries, states or provinces, counties, cities, or other), between political parties or candidates (which raises interesting questions over which and how many to include, in turn dependent on voting procedures, overall popularity, and the impact of non-winning candidates or parties on others), amongst multiple products, or amongst different behavioural characteristics in some domain (e.g., highly active, occasionally active, and lurking participants in online fora).

There are times when categories are well and unambiguously defined. At other times, where you choose to draw divisions (say, in generational groups, or wealth or income brackets) is highly arbitrary. Even where there are a large number of potential categories, choosing some limited number for specific analysis (2, 3, 5, 10, etc.) and lumping the remainder into "other" may provide clearer insights and fewer distractions than choosing a large number of divisions.[1] In other cases, a very small number of individuals may account for an overwhelming majority of activity or outcome. I'd argue that in such cases an analysis centred on individuals is poorly focused, and that activities and outcomes rather than individuals are of greater interest.[2]
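As a rough illustration of the "top few plus other" approach, here is a minimal Python sketch. The function name top_n_with_other and the party labels and counts are made up for the example; it also assumes no real category is already called "other".

```python
from collections import Counter

def top_n_with_other(labels, n):
    """Keep the n most frequent categories; lump the rest into 'other'."""
    counts = Counter(labels)
    top = counts.most_common(n)  # list of (label, count), most frequent first
    remainder = sum(counts.values()) - sum(c for _, c in top)
    result = dict(top)
    if remainder:
        result["other"] = remainder
    return result

# Hypothetical sample: six parties, three of them minor
votes = ["A"] * 40 + ["B"] * 35 + ["C"] * 12 + ["D"] * 6 + ["E"] * 4 + ["F"] * 3
summary = top_n_with_other(votes, 3)
print(summary)  # {'A': 40, 'B': 35, 'C': 12, 'other': 13}
```

Whether n should be 2, 3, or 10 is still a judgment about the phenomenon; the code only makes the lumping step explicit.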

What's key is to match your sampling and sample sizes to the phenomenon being studied.

________________________________

Notes:

1. Power-law (Zipf) distributions often mean that a very small number of participants have a highly disproportionate impact or significance.

2. This is often the flip side of power law distributions. If we look at all book titles, there are a huge number of individual items to consider; there are roughly 300k annual English-language "traditional" publications, and over 1 million "nontraditional" (self-published, or publish-on-demand) titles. But if your focus is instead titles by percentage of revenue or number of sales, a top-n analysis (5, 10, 20, etc.) often captures much of the activity, frequently well over half. This is typical of any informational good: music, cinema, blogs, social media posts, etc.
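To get a feel for how much a top-n analysis captures, here is a small Python sketch. The sales figures are synthetic: 1,000 hypothetical titles with sales proportional to 1/rank^s, where the exponent s is an assumption and not a measured value for any real market. The share you get depends heavily on that exponent.

```python
def top_n_share(values, n):
    """Fraction of the total accounted for by the n largest values."""
    ranked = sorted(values, reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0

def zipf_sales(num_titles, exponent):
    """Synthetic sales figures proportional to 1 / rank**exponent."""
    return [1.0 / rank ** exponent for rank in range(1, num_titles + 1)]

# Hypothetical market: 1,000 titles, assumed exponent of 1.5
sales = zipf_sales(1000, 1.5)
for n in (5, 10, 20):
    print(f"top {n}: {top_n_share(sales, n):.1%} of total sales")
```

Swapping in real per-title revenue for the synthetic list would show whether a given market is actually this concentrated.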



