

Code Webs – Visualizing 40,000 student code submissions - ohjeez
http://www.stanford.edu/~jhuang11/research/pubs/moocshop13/codeweb.html#

======
aidos
This was posted a couple of weeks ago.

[https://news.ycombinator.com/item?id=6513062](https://news.ycombinator.com/item?id=6513062)

~~~
privong
Yep. Seems like the software should recognize that a lone "#" appended to the
URL (and similar variations) still points to the same page, and flag it as a
duplicate?

------
3JPLW
Very fascinating. I'm excited to see where they go with these data. The final
paragraph is where the money is.

In particular, I think it'd be interesting to track students over time. Do
some clusters have more difficulty picking up later concepts? Is their
submission, while correct, showing some systematic error in their mental model
of the language or topic?

It'd be very cool to give qualitative feedback in addition to the quantitative
unit tests based on these clusters. E.g., "Your code, while correct, is
demonstrating characteristics that may be less maintainable than other
submissions. In addition, we recommend a review of [some topic]; using those
concepts would simplify your code."

------
jmount
A fun paper on this sort of topic:
[http://www.cs.tufts.edu/~nr/cs257/archive/don-
knuth/empirica...](http://www.cs.tufts.edu/~nr/cs257/archive/don-
knuth/empirical-fortran.pdf)

------
howeman
Is this also what is used to enforce academic standards (i.e. each student
doing individual work)? I've heard CS lodges the most official complaints of
any department.

~~~
bun-neh
From what I read, I don't believe it is. In my opinion, it should never be
used for this purpose either.

While you could probably catch a lot of cheaters this way, there is also the
potential for a high false-positive rate. If so, I would especially advise
against deploying this type of software at a traditional university, since
academic dishonesty policies can inflict significant and undue harm on an
innocent student.

~~~
alttag
Good comment, but I wouldn't say never.

As an instructor of programming at the university level, I like to think that
I have enough sense to know that, particularly for "trivial" assignments, some
similarity is expected. However, in my experience, a great deal of similarity
over multiple assignments (and exams) between two students of the same
nationality who sit together in class provides additional evidence of
plagiarism.

So, yes, I agree a single data point of similarity is insufficient, but a
history of similarity, particularly in complex projects, becomes more damning.

~~~
jlgreco
I got flagged as a freshman for "55% similarity" (whatever that meant) to
another student's submission in a "learn how to write shit in C++" type
assignment. As far as I could tell, the only thing that triggered the software
was the fact that both I and the other kid used do-while loops, while nobody
else in the course did. The rest of the programs were semi-similar, just a few
lines of cout/cin/<</>>/... to ask your name and echo it back.

So basically what I'm saying here is that I think _"for 'trivial'
assignments, some similarity is expected"_ isn't always widely understood, to
the detriment of students.

I think these sorts of systems are most valuable when used to check work
against submissions from previous years, to bust frat-house collections of
answers, though varying the questions from year to year probably helps even
more in that regard. Similarity between complex projects, at the class sizes
that were typical at my university (in classes advanced enough to have complex
answers), was pretty easy to spot manually. Maybe edit-distance software is
useful there to put some weight behind accusations?
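
A minimal sketch (my own, not any real checker's implementation) of the
token-level Levenshtein scoring such a tool might use; all names here are
illustrative:

```python
# Illustrative sketch only: a token-level Levenshtein distance, the simplest
# form of edit-distance similarity scoring. Function names and the example
# programs are assumptions, not any real plagiarism checker's API.

def levenshtein(a, b):
    """Dynamic-programming edit distance between two token sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def similarity(a, b):
    """Edit distance normalized to a 0..1 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

# Two near-identical "echo your name" programs differ only in one identifier,
# so they score very high -- exactly the false-positive trap described above.
prog_a = 'cout << "name?"; cin >> name; cout << name;'.split()
prog_b = 'cout << "name?"; cin >> user; cout << user;'.split()
print(round(similarity(prog_a, prog_b), 2))  # → 0.78
```

On short boilerplate exercises almost every honest pair of submissions lands
in that range, which is why a single similarity score is weak evidence.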

------
NAFV_P
I'd prefer that to a Jackson Pollock.

Has anyone managed to compute the value of the Hausdorff-Besicovitch
dimension?

~~~
pbhjpbhj
Now it's a very long time since I did fractal geometry, but isn't the
Hausdorff dimension of a simple countable set 0 [zero]?

~~~
NAFV_P
"Now it's a very long time since I did fractal geometry..." Yep, same here I'm
afraid. But A Pollock at the fundamental level is just atoms, which I would
think is just another countable set. Still, research has been done into the
fractal nature of his paintings. Apparently as he matured the HD dimension
increased.
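
For the curious: in practice those studies estimate the dimension by box
counting, not by computing the Hausdorff-Besicovitch dimension directly. A
sketch (all names mine) on a set whose dimension is known exactly, the
middle-thirds Cantor set:

```python
from fractions import Fraction
import math

# Sketch of box counting, the practical estimator used in fractal-painting
# studies. Demo target: the middle-thirds Cantor set, whose dimension is
# exactly log 2 / log 3 ~= 0.631. Fractions keep the grid arithmetic exact.

def cantor_points(depth):
    """Left endpoints of the 2**depth intervals of the Cantor set."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(depth):
        nxt = []
        for lo, hi in intervals:
            third = (hi - lo) / 3
            nxt += [(lo, lo + third), (hi - third, hi)]  # drop middle third
        intervals = nxt
    return [lo for lo, _ in intervals]

def box_count(points, eps):
    """How many grid boxes of width eps are occupied by the points."""
    return len({p // eps for p in points})

pts = cantor_points(8)
e1, e2 = Fraction(1, 27), Fraction(1, 729)  # two grid scales: 3**-3, 3**-6
# Dimension ~= slope of log N(eps) against log(1/eps) between the two scales.
dim = math.log(box_count(pts, e2) / box_count(pts, e1)) / math.log(e1 / e2)
print(round(dim, 3))  # → 0.631
```

For a painting you'd count occupied pixels over a range of box sizes and fit
the slope by regression rather than use just two scales.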

I'm thinking of real phenomena exhibiting a 'partial' fractal nature. I think
you are thinking in the pure maths realm.

BTW I still don't like his paintings generally. Manet and Holbein are more my
thing, or Morandi on certain occasions.

~~~
pbhjpbhj
I've been working in a public facing creative arts role for a while now and my
appreciation of the likes of Pollock has become somewhat more favourable over
that time.

Hadn't heard of Morandi, not sure his stuff means much to me, however I like
this photographic interpretation of his work -
[http://static.dezeen.com/uploads/2009/06/dc03_ins.jpg](http://static.dezeen.com/uploads/2009/06/dc03_ins.jpg)
(to sell a dinner service). Thanks for the pointer.

------
yread
Isn't comparing code by edit distance a bit too simplistic? If I pull some
functionality out into a separate function, the code tree already becomes
quite different.

~~~
bglusman
Read the article: it's not textual edit distance, it's AST edit distance. If
the extracted function were the same as the inline code, I think this would
have little effect, though that might depend on the AST parsing.
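
A toy illustration (my own sketch, not the paper's pipeline) of why the AST
view is robust: two snippets that differ in formatting, comments, and every
identifier produce identical trees once names are normalized away:

```python
import ast

# Hypothetical demo: normalize identifiers so only program structure remains,
# then compare the dumped ASTs. Real systems compute a tree edit distance
# rather than exact equality; this just shows what the AST abstracts over.

src_a = """
def total(xs):
    s = 0
    for x in xs:  # accumulate
        s += x
    return s
"""

src_b = """
def sum_list(items):
    acc = 0
    for item in items:
        acc += item
    return acc
"""

class Normalize(ast.NodeTransformer):
    """Replace every identifier with a placeholder, keeping the structure."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "_"
        return node
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_"
        return node

def shape(src):
    """Structural fingerprint of a snippet: its name-normalized AST dump."""
    return ast.dump(Normalize().visit(ast.parse(src)))

print(shape(src_a) == shape(src_b))  # → True
```

Whitespace and comments vanish at parse time, and renaming is erased by the
transformer, so only a genuinely different structure (say, replacing the loop
with `sum(xs)`) changes the fingerprint.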

------
tonyplee
Love to see something like this to visualize the evolution of a complex
software project such as Linux git.

------
ChuckMcM
That is pretty amazing.

