37M Compilations: Investigating Novice Programming Mistakes [pdf] (kent.ac.uk)
100 points by gwern on July 4, 2016 | 30 comments

Most of these errors are not really interesting. Almost all of them are simple typos or syntax errors.

I get that these are the errors that are simple to analyze. Reference/value confusion, for instance, would be more interesting, but I guess that's harder to detect automatically.

Isn't that very interesting? It means the IDE / environment they are using allows them to make typos and syntax errors. Why weren't those corrected and/or highlighted as they typed?

The first programming language I used had magic keys which produced whole keywords (eg. pressing "P" produced "PRINT"). Typos were impossible! If you still managed to type a line with a syntax error, the cursor changed to a flashing $ over the error and you literally couldn't submit the line of code until you'd fixed it. And that was on a machine with 1 kilobyte of RAM and a < 4 MHz processor!
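A minimal sketch of that "can't submit a broken line" behaviour, in Python rather than the BASIC described above. The names `first_syntax_error` and `submit` are made up for illustration, and the reported column is whatever Python's parser reports, which can vary between versions.

```python
import ast
from typing import List, Optional


def first_syntax_error(line: str) -> Optional[int]:
    """Return the 1-based column of the first syntax error, or None if the line parses."""
    try:
        ast.parse(line)
    except SyntaxError as err:
        # err.offset can be None in rare cases; fall back to column 1.
        return err.offset or 1
    return None


def submit(program: List[str], line: str) -> bool:
    """Append `line` to `program` only if it parses; otherwise refuse it."""
    if first_syntax_error(line) is not None:
        return False
    program.append(line)
    return True


program: List[str] = []
accepted = submit(program, "print('hi')")   # parses, so it is stored
rejected = submit(program, "print('hi'")    # unbalanced paren, so it is refused
```

An editor built on this idea would put the cursor at the returned column instead of just refusing the line.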

Yeah, I am pretty excited about the idea of IDEs just presenting ASTs instead of text (of course, you can always save as text). There's a structured Haskell mode that tries to do a similar thing.

It might seem silly, but the biggest impediment I can think of is copy-pasting code from elsewhere. If your variable names are different, the IDE should reject the code (since you shouldn't be able to commit variables that don't exist), but then what do you do? Open it in Notepad and "fix" it?

EDIT: structured haskell mode, complete with GIFs showing how it works https://github.com/chrisdone/structured-haskell-mode#feature...
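To make the "AST first, text second" idea and the copy-paste worry concrete, here is a rough sketch using Python's standard `ast` module (it needs Python 3.9+ for `ast.unparse`). It is not how structured-haskell-mode works; the snippet and variable names are invented for illustration.

```python
import ast

# A pasted snippet with sloppy spacing and names the host file may not define.
source = "total=price*( 1+tax )"

# The tree is the real representation; text is just one way to render it.
tree = ast.parse(source)
text = ast.unparse(tree)

# Names that are read versus names that are assigned in the snippet.
loaded = {n.id for n in ast.walk(tree)
          if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
stored = {n.id for n in ast.walk(tree)
          if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}

# What an AST editor would have to ask the user about before accepting the paste.
unresolved = loaded - stored
```

Here `unresolved` holds `price` and `tax`, so a structured editor could prompt for a binding or a rename at paste time rather than rejecting the snippet outright.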

Brings to mind Smalltalk's concept of images (which comes from lisp, I believe) and Forth.

Of course in Smalltalk, you do work with code-as-text, but then it becomes code-as-program (technically byte-code, with among other things a text representation). Does seem more reasonable than the anachronistic insistence on text->parse-to-AST->(whatever magic, and however many transforms the actual compile-part is).

Then again, we all know what a rich editor is. It's MS Word. And MS Word eats documents and deprecates file formats.

On the other hand, I think Emacs Org-mode, Gimp images and the literate editor in Python, Leo[l] are examples of more pleasant "rich format" editors. It is a bit odd that what we take for granted for image files (edit history, binary format++) we fear in our IDEs.

[l] http://leoeditor.com/

Core images in Lisp have been used since the early 60s.

See for example the Lisp 1.5 manual. Appendix E describes the handling of images:


> Overlord is the monitor of the LISP System. It controls the handling of tapes, the reading and writing of entire core images, the historical memory of the system, and the taking of dumps.

Are you familiar with Lamdu? http://www.lamdu.org/

> Why weren't those corrected and/or highlighted as they typed?

That's such a horrible expectation. Do we expect children to spell correctly right off the bat? Or do we teach them, slowly & patiently, instead? Or are you saying we should just let autocorrect be their guide?

No? Then why would programming languages be different? It's part of the learning process. And it's by far one of the easier parts.

Aren't they? The fact that confusion between = and == is the 4th most frequent error is quite interesting, for example. It's possible to disallow assignments in places where comparisons are most likely (expressions that are known to coerce to a boolean). Python does this, presumably with significant productivity benefits, since less time is spent dealing with that common mistake.
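A quick way to see the Python behaviour mentioned above is to ask the compiler directly. This is only a sketch; `compiles` is a made-up helper name.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


comparison_ok = compiles("if x == 1:\n    pass")  # comparison in a condition
assignment_ok = compiles("if x = 1:\n    pass")   # plain assignment in a condition
```

The comparison compiles and the assignment is rejected before the program ever runs. Python 3.8 and later do allow assignment inside an expression, but only with the visually distinct `:=` operator, so the =/== typo still fails to compile.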

D can arguably be mitigated by differentiating bitwise operators more clearly from their logical counterparts (for example, by spelling the logical operators in plain English).

Many more are better handled at the editor/IDE level, but this seems like a really interesting read for anyone involved in PL design.

The Alto thread posted a ref to BCPL, and it was interesting that the original symbols for the comparison operators were text: eq, ne, gt, lt.

C broke BCPL syntax that was clear, memorable, and consistent and replaced it with a math-like syntax that must have wasted millions of person hours of debugging time in the decades since - for no good reason.

Similarly &(pointer address) and &&(logical and) are too close and too easy to typo.

Language design really should have considered human factors much more than it did.

> Similarly &(pointer address) and &&(logical and) are too close and too easy to typo.

Not to mention &(bitwise and).

Interestingly, PowerShell uses eq, ne, gt, lt, etc. for its comparison operators.

The historical reason behind that is the > and | symbols were used as redirection operators to follow shell conventions so they used -gt and -bor. All of the other operators followed for consistency. There was also a big kerfuffle about statements like "$a = 1" or "$a++" not being able to return values like in C because if they were used stand-alone the return value would be printed out by the host.

Yes, this always gets me. I don't like perl-style at all. Compare the cognitive load of distinguishing -lt and -le vs. < and <=.

I loved doing this in Ruby:

    if (x = y)
      x += 10
    end
(assigns y to x, then operates on x if x is truthy)

The optional parens remind me that this is a deliberate assignment. But as someone who now teaches Python, I thank the Python gods who disallowed such syntactic sugar. I've found it impossible to overestimate how difficult it is for beginners to grok "a = b"...and can you blame them, after years of math instruction telling them that the equals sign means something else? (Never mind the clusterfuck that occurs when trying to teach SQL, which also uses a single equals sign for equality, in tandem with a scripting language. The cognitive difficulty is so high that I've considered switching to teaching R, which at least has the optional arrow operator, "a <- b".)

That would not pass review with me. Code should be as obvious and simple to read as possible. It doesn't matter whether the reader is a beginner or not; what matters is that it reads like it could be a typo. So there is more cognitive overhead than strictly necessary, and quite possibly there are bugs hidden in those 'clever' bits that you love so much.

Haskell has a neat solution. There is no assignment operator. (Let bindings are a different thing)

I agree the errors are uninteresting to experienced engineers. I think giving this data to newcomers can help them debug faster, build confidence, etc.

I did courses at two universities (I transferred) and in either case one would need to compile the assignment source code in order to actually do the homework. I think that a compiler, especially javac, would catch almost all of these and warn about some of the others.

Moreover, good syntax highlighting or an IDE with some static analysis in it would help a lot too. I think that might be a useful thing to put in intro programming classes. Eclipse is free right? I use IDEA-based editors for most of my work, but even the syntax highlighting in, say, emacs without installing packages (at least on the OS/distros I'm familiar with) would go some fair distance to this goal. I assume the same would be true of vi.

After a short look at the compilation, one thing seems very clear to me, IMHO: C syntax is very error-prone.

To sad, that so many new programming languages chose to use exactly this syntax that is so error prone.

C was a very good programming language and the short syntax might have some appeal -- but for learning programming, this syntax is not the best option, as long as you don't use it as a kind of intellectual test to find the best computer people ...

English is also error-prone, as you have demonstrated—perhaps unwittingly—but we still use it around the world. I would argue that this fact makes C syntax a particularly good choice for one learning to program: Computers don't think like humans, unless we program them to think like humans.

Though it is in this case worth noting that there is enough redundancy in English that you can both a) tell that "to bad" is wrong and b) know they meant "too bad", which isn't always the case for =/==

Of course the desired phrase was "So sad". Or was it "Too sad"?

I was thinking much the same. 90% of these errors don't even exist in some other languages.

Error D would be particularly confusing, since it would still produce the result the programmer intended (at least with my setup where implicit int -> bool).
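A rough Python analogue of why error D hides so well (the paper is about Java, so this is an assumption-laden sketch, and `check` is a made-up helper): on booleans, `&` and `and` give the same answer, and the only visible difference is whether the right-hand side runs at all.

```python
calls = []


def check(name, value):
    """Hypothetical helper: record that it ran, then return `value`."""
    calls.append(name)
    return value


# Short-circuit operator: the right operand is skipped once the left is False.
lazy_result = check("left", False) and check("right", True)
lazy_calls = list(calls)

calls.clear()

# Non-short-circuit operator: both operands are always evaluated.
eager_result = check("left", False) & check("right", True)
eager_calls = list(calls)
```

Both results are `False`, but only the `&` version calls the right-hand check. The bug stays invisible until that second operand has a side effect or can raise.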

"The right thing for the wrong reasons".

This shouldn't work. I sometimes spot those errors in my code, and I find them fun to fix.

A different perspective: The paper shows: (1) Humans quickly learn to avoid simple syntax mistakes after they compile code and get an error message. These messages often pinpoint the error location and suggest the fix, so this result is hardly surprising (e.g., Invalid token '}', did you forget ';').

(2) The authors assume every type error is unintentional. This may not be true: Consider transitioning from using a String to represent a number (eg., a command line argument), to a numeric type. This transition may be to check for errors upfront and to avoid parsing the number in multiple locations. All these locations will be pointed to by type errors, after the programmer changes the type.

Is it even interesting to worry about the mistakes novice programmers make? Novice programmers are learning programmers.

What matters more is the mistakes that people continue to make even after they are not novices.

From the paper:

1. Introduction

Knowledge about students’ mistakes and the time taken to fix errors is useful for many reasons. For example, Sadler et al [10] suggest that understanding student misconceptions is important to educator efficacy. Knowing which mistakes novices are likely to make or finding challenging informs the writing of instructional materials, such as textbooks, and can help improve the design and impact of beginner’s IDEs or other educational programming tools.

In other words, yes, understanding what types of errors novice programmers make can be very interesting and useful.


I had my language-designer goggles on, but you and the paper are right that educator goggles matter too.

A lot of these would be solved by the following language:

