Automatic Grading of Code Submissions Isn't Perfect Either (gregorulm.com)
39 points by gu | 27 comments



I really think the concerns about bad code are overblown. I had a friend in college who, for a CS class's final project, wrote an entire game in Java within a single, enormous function body. I still don't know how he even managed to do it, but it basically worked and he passed the class. He sort of understood how functions worked, but he found them confusing, so he didn't use them.

This wasn't at some community college or anything. This was at Georgia Tech.

That's an extreme case, and I certainly am not saying that just because it happened in a respected engineering school, that makes it acceptable. But my point is that in entry level courses (like the ones where you'd be implementing a clip function), even professors grade on getting the job done. Code quality just doesn't enter the picture at that level.

The thing is, trying to teach good code directly is pointless. Your less bright students will accept the dogma and never actually understand how to apply it usefully. Your brightest students will see it as a bunch of useless bullshit that's holding them back.

If you want to teach good code, here's how you do it: Make a student write and maintain a large project. Make them keep it running for two years, while you make them add more and more features. Keep checking it against an automated test suite which they do not have access to, and grade them on its correctness. Give them the resources to learn about best practices, but never tell them they have to use them.

Then, at the end of two years, let them rewrite it from scratch. Then you will see a student who has learned the value of good coding practices.


The author completely misses the mark here - probably due to limited exposure to third-party code in real-life production systems.

Code auto-grading, at least at Coursera, is usually done by running comprehensive unit tests, which extensively test border cases as well. These test suites are often 5-10 times larger than the actual submitted code, and it is difficult to imagine anybody outside of this type of environment spending so much extra time designing (and testing!) test suites with 100% coverage.
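To make that concrete, here is a rough sketch of the kind of border cases such a suite covers (a hypothetical clip(x, lower, upper) exercise with plain assertions, not the actual Coursera test code):

    // Hypothetical border-case checks for a clip(x, lower, upper) exercise.
    // Plain assertions for brevity; the real graders use full test frameworks.
    object ClipSpec {
      def clip(x: Int, lower: Int, upper: Int): Int =
        math.max(lower, math.min(x, upper))

      def main(args: Array[String]): Unit = {
        assert(clip(5, 0, 10) == 5)             // value inside the range
        assert(clip(-3, 0, 10) == 0)            // below the lower bound
        assert(clip(42, 0, 10) == 10)           // above the upper bound
        assert(clip(0, 0, 10) == 0)             // exactly on the lower bound
        assert(clip(10, 0, 10) == 10)           // exactly on the upper bound
        assert(clip(7, 7, 7) == 7)              // degenerate range
        assert(clip(Int.MinValue, 0, 10) == 0)  // extreme input
        println("all border cases pass")
      }
    }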

Moreover, code submissions have to comply with (or implement, in the case of Java) predefined interfaces. And some courses take style checker output into account (in the Scala course, 20% of the grade is decided by the style checker).

In summary, well-thought-out test suites and interface specifications demand well-designed code submissions; in real life, poor comments or sloppy expressions are a very minor nuisance compared to poorly designed interfaces and forgotten border cases.


I wasn't talking about insufficient test cases. Your remark about extensive unit tests is therefore quite irrelevant in this context as I don't question at all that the unit tests take border cases into account.

I am mostly concerned with "soft" aspects. Just consider the case where a student has to define variables, but picks variable names in a language other than English, or where control flow in a submission is more convoluted than it would have to be. Those are the cases I discuss in the article.
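To make that concrete, an auto-grader is equally happy with both of the following (a hypothetical Scala example, not an actual submission):

    // Passes the grader: German variable names, needlessly nested control flow.
    def clip1(wert: Int, untergrenze: Int, obergrenze: Int): Int = {
      var ergebnis = wert
      if (wert < untergrenze) {
        ergebnis = untergrenze
      } else {
        if (wert > obergrenze) {
          ergebnis = obergrenze
        }
      }
      ergebnis
    }

    // Also passes: the version a human grader would nudge the student toward.
    def clip2(value: Int, lower: Int, upper: Int): Int =
      math.max(lower, math.min(value, upper))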

Moments ago, someone left a very fitting comment on my blog:

"I am taking the edX CS169.1 course and I find that I will consistently have a "less than elegant" solution that the auto grader accepts but that I feel is sub-par. The irony is this class has a large BDD/TDD aspect and is teaching RED-GREEN-REFACTOR, but with an auto grader once its green there is little reason to go back and refactor."


Let me try to make my point by a "soft" analogy, in your terminology.

If a fiction book has a great, gripping plot and interesting, relatable, wonderfully done characters, then weird spelling and heavy sentences are not a big deal and can easily be fixed by a competent editor. But nothing can save a grammatical, cleanly written text that is flat, boring, or makes no sense at all. Ask any publisher which kind of book they prefer.

Similarly in software, getting the big picture right is much much more important than "elegance" in each individual line.


You are setting up a false dichotomy here. Speaking in your analogy, the choice is not between a grammatical but boring novel and an ungrammatical but exciting one; it is between a novel that is gripping and grammatical and one that may be just as gripping but wasn't properly edited.


That comment on your blog is exactly what professional programmers do in the real world: pass the test suite and move on. After all, the goal of software engineering isn't to write elegant code; it's to deliver software that solves the customer's needs. And the customer's needs are tracked via the spec, not the style guide.


Apart from the fact that we're talking about an introductory CS course, the greater problem remains that you'll only pile up "technical debt", which may well lead to problems later on.

Michael O. Church briefly talks about this in his article on startup culture: http://michaelochurch.wordpress.com/2012/07/08/dont-waste-yo...

Also, see the recent HN post, "Ask HN: I just inherited 700K+ lines of bad PHP. Advice?": http://news.ycombinator.com/item?id=4557919

Lastly, in my article I highlight OpenOffice, which still has comments in German in its source code.


This is a short-term attitude that is incompatible with building a system that grows more powerful over a decade. That may be OK, or maybe not, depending on your horizon and sunset plans.


IME it's not that binary.

In our corporate environment we do enough to pass the tests, with one extra 'test' being a peer review, which takes into account a list of criteria that aren't easy to check automatically: house code style, test coverage, future maintainability, g11n/i18n readiness, etc.

We often only go as far as 'just good enough', but the standard against which that is assessed is pretty high.


"These test suites are often 5-10 times larger than the actual submitted code, and it is difficult to imagine anybody outside of this type of environment spending so much extra time designing (and testing!) test suites with 100% coverage."

Try the (extensive!) programming challenges at http://uva.onlinejudge.org/

In almost all of the challenges the example input data is sized such that even the most naive algorithm will run within a second or so. When submitted, the code is judged against much larger and more complex sets of input that will catch out inappropriate algorithm choices, unhandled edge cases, etc.
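A made-up illustration of the kind of thing that trips people up: on the tiny sample input both versions below look fine, but on the hidden judge input only the second finishes in time (values assumed distinct, for brevity):

    // Count pairs (i, j), i < j, with a(i) + a(j) == target.

    // Naive O(n^2): fine on the sample input, times out on judge-sized input.
    def countPairsNaive(a: Array[Int], target: Int): Int =
      (for (i <- a.indices; j <- i + 1 until a.length
            if a(i) + a(j) == target) yield 1).sum

    // O(n log n): sort once, then walk two pointers inward.
    def countPairsFast(a: Array[Int], target: Int): Int = {
      val s = a.sorted
      var lo = 0
      var hi = s.length - 1
      var count = 0
      while (lo < hi) {
        val sum = s(lo) + s(hi)
        if (sum == target) { count += 1; lo += 1; hi -= 1 }
        else if (sum < target) lo += 1
        else hi -= 1
      }
      count
    }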


That's pretty neat. I wish it had a little more language support, though. As a recreational activity, it'd be nice to be able to use something a little more modern than Java.


I have been pondering the possibility of building a challenge problem site that accepts solutions written in languages other than C/C++/Java/C#. InterviewStreet has done this for some languages, from Haskell to Scala to PHP. I will probably also include JavaScript, because JavaScript is going to be influential for HTML5 apps.

If anyone has a similar idea and would like to team up, please contact me.


That would be really cool. Another thing about doing it in JS: you could then potentially use something like emscripten to automatically support other languages.


spoj.pl is very similar to acm.uva.es, but also has a much wider array of supported languages.


The automatic grading has a huge advantage: It is nearly real-time, and improving the solution and re-submitting improves your score.

Having been a teaching assistant who corrected programming assignments (and also a student), I always wondered how many of the students would read my comments, go back to their solution, and actually improve it. Probably none. If I (as a student) received a comment about a solution I had submitted two weeks earlier, I often didn't instantly know what the corrector was talking about. I had to go back and look at my code. I'm not sure I always did that when I was busy. And even if I acknowledged the comment, I probably wouldn't actually go ahead and fix my solution.

I'm taking the Scala course right now, and when I submit a solution and something is flagged, my thoughts are still right there in the code - I still have all the files open in vim, sbt running... so I can instantly go and fix it. And there is a real incentive to do that, because my score will improve.


I agree wholeheartedly with this. I find that I learn more and engage more with my online courses than in a traditional lecture because I am able to watch a few minutes of instruction, then work an example and have it validated. It breaks the lecture up into small incremental building blocks, whereas in a classroom you normally get the entire lecture, then work the homework several days later. By having real time feedback, you learn better and more quickly. Also, since most programmers learn by example, if the instructor uses well structured code, the students likely will, too.


I agree that the real-time advantage is much more important than the "comprehensive" feedback of a human grader.

For style problems that are really a detriment, students will drift toward better style to save themselves time when correcting and resubmitting.


This is nothing compared to the "peer review" in the humanities courses.

I know that there is no easy answer for running a MOOC (massive open online course) in the humanities, but, judging by what I read on the web, Coursera's solution is not working very well and, what is more striking to me, Coursera doesn't seem to respond.

But again, I have no easy solution for grading essays in MOOCs.

More information here:

http://courserafantasy.blogspot.cz/2012/09/done-more-or-less...

http://www.insidehighered.com/blogs/hack-higher-education/pr...

http://gregorulm.com/a-critical-view-on-courseras-peer-revie...


I'm making http://codehs.com to teach beginners how to code. We're focusing on high schoolers and promoting good style and good practices.

We have a mixture of an autograder for functionality and human grading for style.

It's really important to get both. Our class uses a mastery model rather than grades, so you shouldn't move on until you've mastered an exercise, and mastery does not just stop at functionality. Style is included.

Making your code readable to other people is really important, and it can and should be taught and stressed even on small exercises.

At Stanford, code quality is half your grade in the first two intro classes because it's just as important that someone else understand your code as it is to just make it work.


I disagree with the article in general because I think the secret sauce for these online classes is involving students with non-graded questions during the lectures, graded tests, and homework.

I think the comprehensive grading of programs submitted for homework is good, but even if it is not perfect, in the 5 classes I have taken, the assignments helped me dig into the material.

I also like the model of letting students take graded quizzes more than once. I find that the time spent between the first and second time taking a quiz is very productive for improving my understanding of the material.

These classes are fundamentally superior to just reading through a good textbook.


But aren't those entirely different issues?

Don't get me wrong, I do agree with you that MOOCs are a boon. A well-structured course may be able to provide a better experience than working through a textbook on your own. Still, this doesn't mean that those courses live up to the hype.


What the article is saying isn't specific to MOOCs; think of continuous integration vs. code review - they are not contradictory.

MOOCs are not going to replace formal education, and I think the "limitations" mentioned are perfectly acceptable given the costs and incentives involved. For example, Coursera's Scala course receives more than 10K weekly assignment submissions, so you need a scalable assessment method. (The grader is not bad, in fact: it knows about cyclomatic complexity, warns if you use mutable collections, etc.)


I'm taking 6.00x and Udacity CS101 currently, and I'd have to disagree with the OP.

The code checkers give you immediate feedback with test suites that are more comprehensive than what students would (or could, in most cases) design themselves.

Sure, there's no professorial feedback on your code, but 90% of the time the comments you receive back on your printed-out code will go unread. Not to mention that the lead time from submission to feedback, often as long as two weeks, frequently makes the comments worthless.

As for style, my university Intro to CS courses didn't check my style either. I find 6.00x and CS101 to be vastly superior in almost every respect.

Finally, 6.00x and CS101 actually provide you with the "correct" answers after you've passed their tests with an adequate solution. A few times I've found myself hitting my head and thinking, "Why didn't I think of that? That's more elegant than my solution," and going back and attempting to implement their approach. Try finding that in anything other than an online course.


The Scala Coursera course does a style check which will catch some style issues. I think it uses this: http://www.scalastyle.org/rules-0.1.0.html but it wouldn't have caught the clip problem discussed on the blog.


I found the style checker to be pretty good and helpful.

Because for some problems they give a hint, e.g. "this is solvable with a one-liner", the student can figure out for himself whether he is on the right track or not.

Line length could also be checked by something similar to a style checker, which could also verify whether the methods that are supposed to be used (e.g. min/max) actually are being used.
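A toy version of such a check could look roughly like this (nothing like the real scalastyle implementation, just the idea; file path and method names are placeholders):

    import scala.io.Source

    // Toy check: flag overly long lines and warn if the submission never
    // calls the methods the exercise intends to exercise (here: min/max).
    object ToyStyleCheck {
      val MaxLineLength = 80

      def main(args: Array[String]): Unit = {
        val lines = Source.fromFile(args(0)).getLines().toVector
        for ((line, i) <- lines.zipWithIndex if line.length > MaxLineLength)
          println(s"line ${i + 1}: longer than $MaxLineLength characters")
        if (!lines.exists(l => l.contains("min(") || l.contains("max(")))
          println("hint: the intended solution uses min/max")
      }
    }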

I guess the author is correct that automatic grading is not perfect and is never going to be as good as talking with someone more experienced... but it can go pretty far. Having corrected programming assignments as a teaching assistant myself, I have to say that it can be a really tough job, and an automatic grading system may give you more help. When I corrected an assignment and there were a lot of issues with it, I would point out the most important ones. An exhaustive list is really tough because there is time pressure. Also, I think it could be too demotivating for the student to get a list of 30+ issues from the corrector; I'd rather have him acknowledge the 3 most important ones.


As a part of the build system at my work, we run various passes over the code aside from just "does this compile". I'm sure these MOOCs could find software to:

1. Check style of a language

2. Run a comprehensive suite of unit tests

3. Static analysis of the code

These tools together can catch most problems of bad formatting, fragile code (cannot handle edge cases, errors, etc.), and structural errors. Additionally, you could take some measure of performance into account - does the code solve the problem in a reasonable amount of time?

By using standard industry tools, one could build a good grading system that is entirely automated.
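For instance, the whole thing could be wired together in a few lines (tool invocations, class names, and weights here are placeholders, not what any particular MOOC actually runs):

    import scala.sys.process._

    // Hypothetical grading pipeline: each stage is an off-the-shelf tool,
    // and the grade is a weighted sum over which stages pass.
    object AutoGrade {
      def ok(cmd: String): Boolean = cmd.! == 0  // true if the command exits with 0

      def main(args: Array[String]): Unit = {
        val compiles   = ok("scalac -d out Submission.scala")                    // does it compile?
        val styleOk    = compiles && ok("scalastyle Submission.scala")           // 1. style check
        val testsPass  = compiles && ok("scala -cp out GraderSuite")             // 2. unit tests
        val analysisOk = compiles && ok("scalac -Xlint -d out Submission.scala") // 3. static analysis

        val score = Seq(compiles -> 20, styleOk -> 20, testsPass -> 40, analysisOk -> 20)
          .collect { case (true, points) => points }.sum
        println(s"score: $score / 100")
      }
    }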


Most of the CS courses I've taken from Coursera already do all those things. The Algorithms course from Prof. Sedgewick in particular had an excellent test suite for its programming assignments.

You were graded on complying with an interface, code style, correctness, performance, and memory usage, with strict requirements on the latter two. You couldn't get away with, say, implementing a brute-force solution and calling it a day; you had to solve the problem optimally.



