-- A naive linear scan for a lookup will not scale as the size of your database grows larger. You should be looking into space-partitioning trees, or approximate methods like locality-sensitive hashing.
-- Euclidean distance is a terrible metric for kNN on non-metric spaces, which is what your movie example is. It will also be beaten to a pulp by the Curse of Dimensionality: http://en.wikipedia.org/wiki/Curse_of_dimensionality#Distanc...
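For reference, the LSH idea can be sketched in a few lines of Python. This is random-hyperplane hashing (SimHash) for cosine similarity; the class name and parameters are illustrative, not from any particular library:

```python
import random
from collections import defaultdict

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class CosineLSH:
    """Random-hyperplane LSH: vectors on the same side of every
    random hyperplane land in the same bucket, so near-identical
    vectors tend to collide."""

    def __init__(self, dim, n_planes=16, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, v):
        # One bit per hyperplane: which side of it v falls on.
        return tuple(dot(p, v) >= 0 for p in self.planes)

    def add(self, v):
        self.buckets[self._signature(v)].append(v)

    def candidates(self, v):
        # Linear-scan only this (small) bucket instead of everything.
        return self.buckets[self._signature(v)]
```

A query then does an exact scan over `candidates(v)` only, trading a small chance of missing the true neighbor for sublinear lookup.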
You're right, it won't scale as well as it could for large datasets, or cope with high dimensionality. It's meant to be a lightweight solution for the more common use cases.
For larger cases it would indeed be better to resort to those methods. I wanted it to be stateless and functional, whereas with space-partitioning trees, don't you need to maintain the tree (rather than generate it on the fly) for it to be scalable?
As for your second point, could you elaborate on why the movie example is non-metric?
Yes, but it's essential to maintain some sort of summary or index data structure to make the method scalable. A linear scan may be stateless, but that's of little benefit when the method won't scale beyond a few hundred examples.
> As for your second point, could you elaborate on why the movie example is non-metric?
No triangle inequality.
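To illustrate (assuming the movie "distance" is derived from some similarity score, as is typical): take d = 1 − cosine similarity. It looks like a distance but can violate the triangle inequality:

```python
from math import sqrt

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def d(u, v):
    # Similarity-derived "distance" -- not a true metric.
    return 1.0 - cos_sim(u, v)

a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
print(d(a, c))            # 1.0
print(d(a, b) + d(b, c))  # ≈ 0.586, so d(a,c) > d(a,b) + d(b,c)
```

Without the triangle inequality, the pruning bounds that space-partitioning structures rely on are no longer valid, which is why the metric assumption matters.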