I don't understand the author's reasoning for preferring mean shift over k-means due to k-means needing the number of clusters as an input. It seems that in mean shift, choosing the kernel bandwidth parameter is just as arbitrary in that it requires domain knowledge. In k-means there are tests[0] for choosing an appropriate k. Is there a similar strategy in mean shift for choosing an appropriate kernel bandwidth parameter?
While choosing a kernel bandwidth may be just as arbitrary in some situations, the parameter does have some nice properties. The bandwidth informally defines how "close" points need to be to be considered similar, which works well in problem domains where that scale can be easily determined.
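To make that concrete, here is a minimal sketch (scikit-learn and synthetic blobs are my choices for illustration, not anything from the original article) of how the bandwidth acts as that closeness scale: shrinking it splits the data into more clusters, growing it merges nearby density modes.

    import numpy as np
    from sklearn.cluster import MeanShift
    from sklearn.datasets import make_blobs

    # Synthetic data: four well-separated Gaussian blobs.
    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

    # Smaller bandwidths treat only very nearby points as similar,
    # producing more clusters; larger bandwidths merge density modes.
    for bandwidth in (0.5, 2.0, 8.0):
        labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
        print(f"bandwidth={bandwidth}: {len(np.unique(labels))} clusters")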
As with most clustering problems, if you can't choose reasonable parameter values from some domain-specific information, it is largely trial and error. Of course, there are several metrics that can be used to score certain clusterings (i.e., certain parameter values) over others, as described in your Wikipedia link.
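As a rough sketch of that trial-and-error loop (the silhouette metric and the candidate grid are my own picks here, not something the article prescribes), you can sweep bandwidths around scikit-learn's built-in estimate_bandwidth heuristic and keep the best-scoring one:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

    # estimate_bandwidth is scikit-learn's nearest-neighbor-distance
    # heuristic; use it to center the candidate grid.
    base = estimate_bandwidth(X, quantile=0.2)

    best = None
    for bw in base * np.linspace(0.5, 2.0, 7):
        labels = MeanShift(bandwidth=bw).fit(X).labels_
        n = len(np.unique(labels))
        if n < 2 or n >= len(X):   # silhouette needs 2 <= n < n_samples
            continue
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, bw, n)

    print(f"best silhouette={best[0]:.3f} at bandwidth={best[1]:.2f} "
          f"({best[2]} clusters)")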
The other nice advantage that mean shift has over k-means is that it does not make any assumptions about the cluster shape. K-means assumes spherical clusters. Mean shift allows for clusters of any shape, since it is driven by density.
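If it helps, here is a hedged toy comparison (the anisotropic data and all library calls are my own illustration): k-means assigns points to the nearest centroid, which biases it toward compact, roughly spherical groups, while mean shift groups points by the density mode they climb to.

    import numpy as np
    from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    # Two blobs stretched by a linear transform so the clusters are
    # elongated rather than spherical.
    X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=1)
    X = X @ np.array([[3.0, 0.0], [0.0, 0.3]])

    # k-means partitions by distance to centroids, so it can cut an
    # elongated cluster across its long axis; mean shift groups points
    # by the density mode they converge to, with no shape assumption.
    km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
    ms = MeanShift(bandwidth=estimate_bandwidth(X, quantile=0.3)).fit(X)
    print("k-means cluster sizes:   ", np.bincount(km.labels_))
    print("mean shift cluster sizes:", np.bincount(ms.labels_))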
You can tune any free parameters by defining a loss function. For example, your loss function can penalize two close points being separated into different clusters, and vice versa (two distant points ending up in the same cluster). Once you define the loss, it's just a matter of finding the optimal parameter value, although at extra computational cost.
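Something like the following sketch, assuming an illustrative pairwise loss and distance thresholds of my own choosing (nothing here is a standard recipe):

    import numpy as np
    from sklearn.cluster import MeanShift
    from sklearn.datasets import make_blobs
    from sklearn.metrics import pairwise_distances

    X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=2)
    D = pairwise_distances(X)
    near, far = 1.0, 5.0  # illustrative distance thresholds

    def pairwise_loss(labels):
        same = labels[:, None] == labels[None, :]
        split_close = np.sum((D < near) & ~same)  # close points, split apart
        merge_far = np.sum((D > far) & same)      # distant points, merged
        return split_close + merge_far

    losses = {bw: pairwise_loss(MeanShift(bandwidth=bw).fit(X).labels_)
              for bw in (0.5, 1.0, 2.0, 4.0)}
    best_bw = min(losses, key=losses.get)
    print(f"losses={losses}, best bandwidth={best_bw}")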
[0] http://en.wikipedia.org/wiki/Determining_the_number_of_clust...