In particular, I think it'd be interesting to track students over time. Do some clusters have more difficulty picking up later concepts? Does their submission, while correct, reveal some systematic error in their mental model of the language or topic?
It'd be very cool to give qualitative feedback in addition to the quantitative unit tests based on these clusters. E.g., "Your code, while correct, is demonstrating characteristics that may be less maintainable than other submissions. In addition, we recommend a review of [some topic]; using those concepts would simplify your code."
While you could probably catch a lot of cheaters this way, there is a real possibility of a high false positive rate. If so, I would especially advise against deploying this type of software at a traditional university, since academic dishonesty policies can cause significant and undue harm to an innocent student.
As a university-level programming instructor, I like to think that I have enough sense to know that, particularly for "trivial" assignments, some similarity is expected. However, in my experience, a great deal of similarity across multiple assignments (and exams) between two students of the same nationality who sit together in class provides additional evidence of plagiarism.
So, yes, I agree a single data point of similarity is insufficient, but a history of similarity, particularly in complex projects, becomes more damning.
Basically, what I'm saying is that the idea that some similarity is expected for "trivial" assignments isn't always widely understood, to the detriment of students.
I think these sorts of systems become most valuable when used to check work against submissions from previous years, to bust frat-house collections of answers, but varying questions from year to year probably helps even more in that regard. Similarity between complex projects in the class sizes that were typical at my university (in classes advanced enough to have complex answers) was pretty easy to spot manually. Maybe edit-distance software is useful there to put some weight behind accusations?
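For what it's worth, the "history of similarity" idea combined with edit distance is easy to sketch. What follows is only a rough illustration, not how any real plagiarism checker works: the function names, the 0.9 threshold, and the 3-assignment minimum are all made up for the example, and raw character-level Levenshtein distance is a crude stand-in (a serious tool would at least normalize whitespace and identifiers first). The point is just that a pair of students is flagged only when they are near-identical on several assignments, never on a single one.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def flag_repeated_pairs(assignments, threshold=0.9, min_hits=3):
    """Return pairs of students who were near-identical on at least
    `min_hits` assignments.

    `assignments` is a list of dicts mapping student name -> source text.
    Both `threshold` and `min_hits` are arbitrary values for illustration.
    """
    hits = Counter()
    for submissions in assignments:
        for (s1, code1), (s2, code2) in combinations(sorted(submissions.items()), 2):
            if similarity(code1, code2) >= threshold:
                hits[(s1, s2)] += 1
    return {pair: count for pair, count in hits.items() if count >= min_hits}
```

Even then, the output is at most a list of pairs worth a human look; given the false positive concern raised above, it shouldn't be treated as proof of anything on its own.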
Has anyone managed to compute the value of the Hausdorff-Besicovitch dimension?
I'm thinking of real phenomena exhibiting a 'partial' fractal nature. I think you are thinking in the pure maths realm.
BTW I still don't like his paintings generally. Manet and Holbein are more my thing, or Morandi on certain occasions.
Hadn't heard of Morandi, and I'm not sure his stuff means much to me; however, I like this photographic interpretation of his work - http://static.dezeen.com/uploads/2009/06/dc03_ins.jpg (to sell a dinner service). Thanks for the pointer.