Once you have used scikit-learn to work out what you want to do on a reasonably sized data set, you can use Mahout (http://mahout.apache.org/) to port the algorithm to Hadoop pretty easily.

You should use Mahout directly: its recommender-system (recsys) part is quite complete, high-level, and application-oriented, whereas scikit-learn does not provide high-level recsys concepts.
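To give an idea of what "high-level recsys concepts" means here, the sketch below walks through the usual pieces of a user-based recommender: a preference data set, a user similarity, a neighborhood of similar users, and a recommender that scores unseen items. It is plain Python, not Mahout's API; the data, function names, and the choice of cosine similarity are all made up for illustration. Mahout's recommender package ships these pieces as ready-made abstractions (DataModel, UserSimilarity, UserNeighborhood, Recommender), which is roughly what you would otherwise have to build yourself on top of scikit-learn.

```python
from math import sqrt

# Toy preference data (hypothetical): user -> {item: rating}
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob": {"a": 4.0, "b": 3.0, "c": 5.0, "d": 5.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def nearest_users(ratings, user, k):
    """Return the k (similarity, name) pairs most similar to `user`."""
    others = [
        (cosine_similarity(ratings[user], prefs), name)
        for name, prefs in ratings.items()
        if name != user
    ]
    return sorted(others, reverse=True)[:k]


def recommend(ratings, user, k=2, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    seen = ratings[user]
    totals, weights = {}, {}
    for sim, name in nearest_users(ratings, user, k):
        if sim <= 0:
            continue  # ignore neighbors with no positive similarity
        for item, rating in ratings[name].items():
            if item in seen:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scored = [(totals[item] / weights[item], item) for item in totals]
    return [item for _, item in sorted(scored, reverse=True)[:top_n]]


print(recommend(ratings, "alice"))
```

Something this small is fine for exploring an idea, but it is a toy: a real library also handles large sparse data, evaluation, and other similarity measures.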

The best documentation I found is the Mahout in Action book (http://manning.com/owen/), read alongside the source code.

Also, you probably don't need to run this on a Hadoop cluster unless your data is too big to fit on a single machine.
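A quick back-of-the-envelope check can help decide whether the data fits on one machine. The sketch below is only a rough estimate under assumptions I am making up for illustration: each rating is stored as an 8-byte user id, an 8-byte item id, and a 4-byte value (20 bytes), and a 2x overhead factor covers indexes and bookkeeping. Real memory use depends on the data structures your library uses, so treat the result as an order of magnitude.

```python
def estimated_size_gib(num_ratings, bytes_per_rating=20, overhead=2.0):
    """Rough in-memory size of a ratings data set, in GiB.

    bytes_per_rating and overhead are assumed values, not measurements.
    """
    return num_ratings * bytes_per_rating * overhead / 2**30


def fits_on_one_machine(num_ratings, ram_gib, **kwargs):
    """True if the estimated size is within the given amount of RAM."""
    return estimated_size_gib(num_ratings, **kwargs) <= ram_gib


# Example: 100 million ratings on a machine with 16 GiB of RAM
print(round(estimated_size_gib(100_000_000), 2))
print(fits_on_one_machine(100_000_000, ram_gib=16))
```

Under these assumptions, 100 million ratings come to a few GiB, which is well within reach of a single machine.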

