
Size is the best predictor of code quality - gandalfgeek
http://blog.vivekhaldar.com/post/10669678292/size-is-the-best-predictor-of-code-quality
======
jacques_chester
Size is the best predictor of many things about software projects.

Note that 'size' is a dimensionless quality; we can only approximate it with
certain proxy metrics (KSLOC, Function Points, Budget allocation).

Edit: and gzipped size, and token counts, and logical lines, and Halstead
metrics, and cyclomatic complexity, and object points, and ... and ... and ...
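
Most of these proxies are cheap to compute, for instance (a throwaway Python
sketch of my own, not from any of the cited sources):

    
    
      # Rough size proxies for one source file: raw bytes, gzipped bytes,
      # and a naive non-blank line count.
      import gzip, pathlib
      
      def size_proxies(path):
          data = pathlib.Path(path).read_bytes()
          return {"bytes": len(data),
                  "gzipped": len(gzip.compress(data)),
                  "non-blank lines": sum(1 for l in data.splitlines()
                                         if l.strip())}
    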

For example, project size is the best predictor of whether a project will meet
its initial budget/time/feature/quality goals (Boehm, Standish). It totally
swamps staff quality, programming language, programming process, tools,
libraries, _everything_ in this respect (Boehm).

Per (Standish), a project with a budget > US$10 million at launch has a 98%
probability of not meeting its goals and (from memory) a < 50% probability of
avoiding cancellation.

In fact I have a totally untested hypothesis that agile "works" because it's
mostly applied by small teams to small projects.

(Boehm): Barry Boehm, Software Cost Estimation with COCOMO II

(Standish): The Standish Group CHAOS Report.

~~~
fleitz
The issue is that budget size requires more code (e.g. if you have a budget,
you need to spend it), and spending it requires hiring programmers. You can't
take on a $10 million project and tell people you're hiring 4 programmers and
that the project will be done in a few months. You need to hire 100
programmers and tell people it will take a year.

Basically, it's a symptom of the idea that work expands to fill the time
available. Agile works, IMHO, because it avoids spending time that doesn't
need to be spent.

~~~
jacques_chester
This is one of those chicken-and-egg, correlation-is-not-causation problems.

Does big 'size' cause a big budget, or do big budgets cause blooming 'size'? A
bit of both I'd wager.

------
cpeterso
I just pulled Steve McConnell's (must read!) _CODE COMPLETE: A Practical
Handbook of Software Construction_ off my bookshelf. The section _How Long Can
a Routine Be?_ references some surprising (but perhaps dated) studies that
suggest the evidence in favor of short routines is "very thin" and the
evidence in favor of longer routines is "compelling". These studies are
probably biased toward desktop and corporate software written in C during the
1980s.

The consensus is that routines should have fewer than 200 LOC, but that
routines shorter than ~30 LOC are not correlated with lower cost, fault rate,
or programmer comprehension. btw, the longest function I've seen in commercial
software I've worked on was 12,000 LOC! I will not name names. :)

* A study by Basili and Perricone found that routine size was _inversely_ correlated with errors; as the size of routines increased (up to 200 LOC), the number of errors per LOC _decreased_ (1984).

* Another study found that routine size was not correlated with errors, even though structural complexity and amount of data were correlated with errors (Shen et al. 1985)

* A 1986 study found that small routines (32 LOC or fewer) were not correlated with lower cost or fault rate (Card, Church, and Agresti 1986; Card and Glass 1990). The evidence suggested that larger routines (65 LOC or more) were cheaper to develop per LOC.

* An empirical study of 450 routines found that small routines (those with fewer than 143 source statements, including comments) had 23% more errors per LOC than larger routines (Selby and Basili 1991).

* A study of upper-level computer-science students found that students' comprehension of a program that was super-modularized into routines about 10 lines long was no better than their comprehension of a program that had no routines at all (Conte, Dunsmore, and Shen 1986). When the program was broken into routines of moderate length (about 25 lines), however, students scored 65% better on a test of comprehension.

* A recent [sic!] study found that code needed to be changed least when routines averaged 100 to 150 LOC (Lind and Vairavan 1989).

* In a study of the code for IBM's OS/360 operating system and other systems, the most error-prone routines were those that were larger than 500 LOC. Beyond 500 lines, the error rate tended to be proportional to the size of the routine (Jones 1986a).

* An empirical study of a 148 KLOC program found that routines with fewer than 143 source statements were 2.4 times less expensive to fix than larger routines (Selby and Basili 1991).

~~~
fhars
Judging from my experience, several of these studies may be missing a major
confounding factor: the complexity of the problem the code solves. All my
longest methods are rather stupid output-formatting stuff or if/else cascades
that handle tedious but mostly trivial distinctions. But for code that solves
hard problems, I often write many small functions for independent steps of the
solution.

So the studies that measure the method length/bug count correlation within a
single code base, or in code written within a single organization, might only
be measuring the fact that code that requires no thinking contains fewer bugs
than code that does. Paging Captain Obvious. Some of the other studies address
that (e.g. Shen 1985 and the code comprehension studies), but as is so often
the case in quantitative studies of programmer productivity, we lack repeated
measurements where the only variable is the independent factor whose influence
is studied.

~~~
gruseom
That's an interesting point. There's certainly a cap on the complexity of code
that can be put into a single long function. (Unless, I suppose, it has inner
functions that call one another, like how people do OO in JavaScript; such
things can be just as complex as whole programs.) Usually it's implementing
some conceptually unified thing. Even if that thing is a complex algorithm,
it's still cohesive enough to be able to say what it is. And implementing even
a very complex algorithm is not particularly complex at a system level.

------
icefox
For anyone interested in more discussion of this, I suggest grabbing a copy of
"Making Software", in particular chapter eleven on Conway's Corollary, which
centers on this paper from 2008:
<http://research.microsoft.com/pubs/70535/tr-2008-11.pdf>

The meat:

    
    
      Table 4: Overall model accuracy using different software measures
    
      Precision  Recall  Model
      86.2%      84.0%   Organizational Structure
      78.6%      79.9%   Code Churn
      79.3%      66.0%   Code Complexity
      74.4%      69.9%   Dependencies
      83.8%      54.4%   Code Coverage
      73.8%      62.9%   Pre-Release Bugs
    

Or in plain terms: if people mess with code they don't normally mess with, you
can bet real money (at better odds than any other metric offers) that bugs
were introduced.

Edit: I have been meaning to make a git tool that analyzes the history of a
project to predict which bits of code are the most buggy, using this model,
but I just haven't done it yet. It would be cool to integrate it with GitHub's
issues API to see how accurate it is. If someone does make it, let me know!
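
As a starting point, here is a minimal sketch of the churn-counting half in
Python (the risk score is a naive stand-in of mine, not the model from the
paper):

    
    
      # Rank files by churn, weighted by how many people touched them.
      import subprocess
      from collections import defaultdict
      
      def risky_files(repo="."):
          log = subprocess.run(
              ["git", "-C", repo, "log", "--numstat", "--pretty=format:@%an"],
              capture_output=True, text=True, check=True).stdout
          churn, authors, author = defaultdict(int), defaultdict(set), None
          for line in log.splitlines():
              if line.startswith("@"):
                  author = line[1:]
              else:
                  parts = line.split("\t")
                  # numstat lines look like "added<TAB>deleted<TAB>path";
                  # binary files show "-" counts and are skipped.
                  if len(parts) == 3 and parts[0].isdigit():
                      added, deleted, path = parts
                      churn[path] += int(added) + int(deleted)
                      authors[path].add(author)
          return sorted(((churn[p] * len(authors[p]), p) for p in churn),
                        reverse=True)
      
      for score, path in risky_files()[:10]:
          print(score, path)
    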

------
dustingetz
> _However, I still haven’t found any studies which show what this
> relationship is like. Does the number of bugs grow linearly with code size?
> Sub-linearly? Super-linearly? My gut feeling still says “sub-linear”._

Interesting; my gut says exponential, which is why cost and likelihood of
project cancellation shoot up in the largest projects.

edit: I have been corrected below; I concur with quadratic.

~~~
baddox
Are you sure you mean exponential? I think quadratic is a more reasonable
guess.

~~~
dustingetz
Without straining my brain: to me it's reasonable to measure complexity by the
number of interacting components, which is a combination, which is geometric,
which is a discrete exponential, right?

~~~
sage_joch
The number of edges in a complete graph is n*(n - 1)/2, so by that metric it
would be quadratic.

~~~
lurker19
Yes, but the number of nodes is n, and an "interaction" among m nodes might
not be decomposable into a chain of pairwise interactions, which raises the
ceiling back to exponential (actually factorial, which is worse).
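
To put rough numbers on those growth rates (a quick Python sketch with
hypothetical counts over n components):

    
    
      # Potential interaction sites among n components: pairwise edges grow
      # quadratically, arbitrary groups exponentially, orderings factorially.
      from math import comb, factorial
      
      for n in (5, 10, 20):
          pairs = comb(n, 2)        # n*(n-1)/2
          groups = 2**n - n - 1     # every subset of 2 or more components
          orderings = factorial(n)
          print(n, pairs, groups, orderings)
    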

~~~
scarmig
What corresponds to interactions that aren't decomposable as pairwise
interactions? Race conditions? Resource utilization? Real bugs (and the
nastier ones), but probably a minority. So factorial with a relatively small
constant in front of it.

But obviously the real answer is to write a program that will correctly verify
all other programs.

------
codex
Isn't it the case that small code bases contain fewer features? So, really,
isn't this result merely saying that the bug rate per feature is constant?

~~~
chc
That is not necessarily the case, no. For example, implementing the
functionality of printf yourself is rather involved, but calling printf from
the standard library is literally a one-liner. In both cases you get the
functionality of printf.

Similarly, there are brief ways to write things and verbose ways. Sometimes
what is commonly written as large factory classes and interfaces in one
language works out to a simple higher-order function or macro in another
language. See the old "evolution of a programmer" joke for some extreme
examples.
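
A toy Python illustration of that last point (hypothetical, just to make it
concrete): a closure can stand in for what some languages spell as a factory
class.

    
    
      # A higher-order function doing a factory's job: no class, no interface.
      def make_formatter(prefix):
          def fmt(value):
              return prefix + ": " + str(value)
          return fmt
      
      error_line = make_formatter("ERROR")
      print(error_line(42))  # ERROR: 42
    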

------
gruseom
I have two questions for y'all on this.

I've heard it said for years that studies show the number of bugs grows
roughly linearly with code size and that this holds true across any
programming language. (It's repeated, for example, at
<http://c2.com/cgi/wiki?LinesOfCode>) I think I've even seen references to
such studies, but I don't remember where. So, HN: what are these studies?
Anybody know? (I mean besides the one referenced by the OP on class size. I
believe this meme goes further back than that.)

My second question is about how to measure code size. PG said a few years ago:
why not just count tokens? I've thought about this ever since and I don't see
what's wrong with it. Raw character count is obviously a lame metric, and LOC
isn't much better. But token count seems like an apples-to-apples comparison
that is easy to measure objectively and leaves out noise such as name length
and whitespace. So: what's wrong with token count as a measurement of code
size?
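
For what it's worth, counting tokens is only a few lines with Python's stdlib
tokenizer (a sketch; which token types to skip is my own judgment call):

    
    
      # Count tokens in Python source, ignoring comments, newlines, and
      # indentation so that name length and whitespace don't matter.
      import io, tokenize
      
      SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
              tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
      
      def token_count(source):
          toks = tokenize.generate_tokens(io.StringIO(source).readline)
          return sum(1 for t in toks if t.type not in SKIP)
      
      # Renaming changes byte count but not token count:
      assert token_count("x = 1 + 2") == token_count("total_widgets = 1 + 2")
    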

~~~
anon_d
I actually think raw byte count is a pretty good metric. Documentation size
and variable name length are also symptoms of complexity. Specifically, I
think _wc -c $(find . -type f)_ (how many total bytes) and _find . | wc -l_
(how many files and directories) make good metrics. Obviously you need to
filter out data files and such.

This has some problems (e.g. spaces vs. tabs, UTF-8, etc.), but all of these
size metrics will be pretty loose.

~~~
gruseom
But this penalizes programmers who like to use long readable names. I'm not
one of them (though I used to be), but they have a strong case here.

Take any program. Replace all the names with the smallest possible character
sequences. Have you made the program simpler? Or smaller in any meaningful
way? Surely not. I'd say what you've done is left its logical structure
precisely intact (another way of saying that token count is a good metric)
while reducing its readability.

~~~
anon_d
This metric relies on the assumption that people are trying to produce
readable code. IMHO long variable names are much more helpful in complex codes
than simple ones.

~~~
gruseom
Ok, but now I'm wondering if we have opposite views of code size. In my view,
code size is bad bad bad. More code means more complexity. Any time you add
code, you're subtracting value; it's just that (if it's good code) you're
adding more value than you're subtracting. So a higher score in a code size
metric is a bad thing to aspire to, and we should greatly favor approaches to
writing software that -- all other things being equal -- lead to smaller
programs. I don't think that programmers who use long names for readability
should have their programs discounted as longer (and thus more complex). Just
because their names are longer doesn't mean their programs are.

~~~
anon_d
No no no. My logic is this: take tight, readable code with short names and
replace them with long names, and you'll have worse code. The converse isn't
true, because complex (bad) codes are more readable with long variable names.

Complexity -> Code Size
Code Size -> Long Variable Names (a win for big codes)
Complexity is bad

Therefore long variable names are a symptom of a problem, but not the problem
themselves. Long variable names aren't bad, but they are still a good
predictor of badness. Since size metrics are meant to predict badness, long
identifiers should increase size metrics.

~~~
gruseom
Oh, I see. You sound like an APLer. We have similar tastes, but many good
programmers disagree, so I doubt that long variable names are a predictor of
program badness. Not every long name is FactoryManagerFactoryManagerFactory.

Consider a language like K, in which variables usually have one-letter names.
The real code-size win for K is not that. It's that the language is so
powerful that complex things can be expressed in remarkably compact strings of
operators and operands. (Short variable names, I'd argue, are an
epiphenomenon. It's because the programs are so small that you don't need
anything longer, and longer names would drown out the logical structure of the
program and make it harder to read.) Token count is a good metric here. Both
line count and byte count come out artificially low, but token count can't.

------
GotToStartup
Writing as little code as possible to fully accomplish a goal has recently
become a fundamental principle that I live by.

If I can write a piece of code in fewer lines, I'll do it. That's pretty
obvious; we all would. But I try to take it a step further and consciously
seek solutions that lead to fewer lines of code. Chopping down a large block
of code is an incredibly gratifying feeling for me.

I find that writing less code while maintaining expressiveness _usually_ leads
to simpler solutions and, IMO, it is simplicity that reduces the bug count.

~~~
iandanforth
I find that in a team environment I would prefer my co-workers write more
lines of code, and longer lines at that.

While, given a moment or two, I can unpack a dense list comprehension (a
one-liner), I would rather read several statements that add up to the same
thing.

Of course there are many times when you could do the same thing with fewer
lines in a more elegant, straightforward way. However, I have a hard time
believing that just having fewer lines is a sufficient goal for flexible,
maintainable code.

Then again I'm pretty new at this :)

------
codeslush
In my experience, it's the size of individual methods/functions that
determines the number of bugs. Exceeding 50 or 75 lines of code per routine
greatly reduces its maintainability and increases the number of bugs (often
bugs that are difficult to track down).

~~~
mrb
At my school (Epita, France) our C coding style standard mandated that _all_
functions be <= 25 lines.

Even though some lines didn't count (like those containing a single curly
brace), it was very tough, but always possible. This applied to all projects,
small and large. They made us write mostly Unix apps, like an FTP server, a
command-line NNTP client, and a POSIX shell (I still remember how meticulous
you had to be when reading all the man pages to implement process control and
terminal control correctly!). Plus, the code had to be portable across all
three Unix OSes running at the school: NetBSD, Solaris, and Digital Unix. This
was in 2000-2001.

For example, I just checked the FTP server I wrote for one of the assignments
(I still have a copy): 3123 lines, and all the functions are <= 25 lines of
code. Such rigor definitely shaped the quality of the code I now write
professionally, 10 years later...

~~~
codeslush
That's awesome! I'm guessing it fits within what I defined: you look at the
starting line number and the ending line number of a function, and the
difference should be < 50 to 75 LOC. That includes inline comments (though
notably not function definition comments). Code clarity should also be
prevalent - meaning, nothing fancy! ;-) Don't cheat the system with
single-line if statements (for example). It's a really, really simple rule
that works! People have argued with me, saying they needed more LOC for a
routine, but not once has that proven to be true - at least not in the code I
reviewed. And I'm, by far, not the sharpest tool in the shed. If I can do it,
anyone can!

~~~
mrb
Our coding standard was very strict. It was not possible to cheat and save
lines by writing, e.g.:

    
    
      if (func()) a = 1;
    

You had to write:

    
    
      if (func())
        a = 1;
    

Writing very complex C programs with functions <= 25 lines is definitely
possible. All Epita students were routinely doing it!

~~~
codeslush
I was thinking more along the lines of:

    
    
      if (func()) { a = 1; }
    

(Curly braces didn't count in your allocation of LOC, but they would in mine.)

------
boyakasha
If you read the abstract of the paper that the post refers to, it actually
says that the size of a _class_ affects the number of bugs _in that class_.
That's something very different from the size/bugs correlation for a whole
application.

~~~
lurker19
This is a hugely important overlooked point. When the measurements have a
systematic bias, a product optimized to those measurements will have systemic
problems. In this case: large, easy-to-understand classes that are excessive
in number and completely fail to interoperate correctly.

------
mynegation
Size by itself is a very elusive metric and varies hugely depending on the
language used. At least for a single method/procedure, I have found cyclomatic
complexity (<http://en.wikipedia.org/wiki/Cyclomatic_complexity>) to be the
best predictor of maintainability, if not quality.
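
A crude sketch of what the metric counts, for Python source (my
simplification of McCabe's definition, not a full implementation):

    
    
      # Cyclomatic complexity, roughly: 1 + the number of branch points.
      import ast
      
      BRANCHES = (ast.If, ast.For, ast.While, ast.IfExp,
                  ast.ExceptHandler, ast.And, ast.Or)
      
      def cyclomatic(source):
          return 1 + sum(isinstance(n, BRANCHES)
                         for n in ast.walk(ast.parse(source)))
      
      # Two paths through this function, so complexity 2:
      print(cyclomatic("def f(x):\n    if x > 0:\n        return x\n    return -x"))
    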

~~~
anthonyb
The metrics involved here are specifically OO metrics - inheritance depth,
number of children and so on, but there's at least one, Weighted Method Count,
which uses cyclomatic complexity as a weighting factor. It's described on page
7 of the paper.

It seems counter-intuitive to me too (that the complexity of a method doesn't
matter), but perhaps if you _have_ to have that complexity in your
application, it's best to have it in one method rather than trying to spread
it around with inheritance or some sort of pattern?

------
sleight42
This theory emphasizes the importance of service-oriented architectures. Note:
I specifically am not endorsing WS-*.

Decoupling a larger application into smaller applications with well-defined
interfaces should reduce cognitive load and, per this theory, perhaps defects
as well.

------
blackhole
The fact that this is apparently not painfully obvious is almost as pathetic
as the fact that people actually try to use it to justify programming
languages that reduce the lines of code instead of learning how to write code
that isn't terrible. A giant, bloated codebase written in Erlang is still a
giant, bloated codebase.

That isn't to say Erlang or any other language isn't worth learning (they
are); it's just that no language can save you from bad programming.

~~~
tikhonj
I don't think the argument is that some language will completely "save you
from bad programming", but rather that it will encourage bad programming less
than some other language.

Some languages, like Java, require more lines of code and provide fewer
mechanisms for dealing with complexity than other languages (say, Erlang),
which means that you're more likely to write bad code in Java than in Erlang.

All languages are equal, but some are more equal than others.

------
TelmoMenezes
I always had a bit of a problem with this sort of study because you can never
know for sure that you uncovered all the bugs in a program (short of doing a
formal proof of correctness). This in turn introduces biases. Case in point,
it could be that a program with a higher number of LOC is actually more likely
to have a real user base which then leads to a larger number of bugs being
discovered.

------
ff0066mote
"Bigger is just something you have to live with in Java. Growth is a fact of
life. Java is like a variant of the game of Tetris in which none of the pieces
can fill gaps created by the other pieces, so all you can do is pile them up
endlessly."

I am reminded of <http://qntm.org/files/hatetris/hatetris.html>

------
majmun
I had a theory that code maintainability ultimately depends on the number of
things a programmer can keep in their head simultaneously, a number usually
between 4 and 12 (call it N below). So if a function has more than N parts, it
gets divided into two functions; if a class has more than N methods, it gets
divided; and so on. Anything with more than N parts gets divided. And we must
consider that N is not the same for everybody but varies: what is maintainable
for one programmer may not be for another.

How do you measure your own N? It could be done like in that movie Rain Man:
you throw toothpicks and must quickly count them; start with a large number,
and in each subsequent round remove some, until you consistently count the
number of toothpicks correctly.

As an extension of this, there is a limit to the number of parts that one man
can control: this number is N^N.

------
jonmc12
The unfortunately named CRAP method of measuring code quality uses a metric
based on cyclomatic complexity
(<http://en.wikipedia.org/wiki/Cyclomatic_complexity>) and code coverage to
estimate the change and maintenance risk of code (this page has an equation:
<http://www.artima.com/weblogs/viewpost.jsp?thread=215899>). The paper cited
in Vivek's article emphasises that shorter code decreases cognitive
complexity. I would bet that cyclomatic complexity also correlates with bugs
and maintainability on the same basis.
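
From memory, the equation on that artima page combines the two roughly like
this (comp = cyclomatic complexity, cov = test coverage percentage):

    
    
      # CRAP score as I recall it from the artima post: high complexity
      # is forgiven only if the method is well covered by tests.
      def crap(comp, cov):
          return comp ** 2 * (1 - cov / 100.0) ** 3 + comp
      
      print(crap(comp=15, cov=0))    # 240.0 -- complex and untested
      print(crap(comp=15, cov=100))  # 15.0  -- complex but fully covered
    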

------
tholschuh
More insights on defect prediction can be found in Thomas Zimmermann's
publications:

<http://thomas-zimmermann.com/publications/list/Short>

A lot of his papers are freely available as PDFs.

------
pnathan
After rummaging around code metrics, I've come to the conclusion that kLOC is
the best estimator for the 'hardness' of a codebase. There are a few ways to
slice it, of course (no-comment source only? statements only? semicolons
only?), but, fundamentally, I do not see any _pragmatic_ use for code quality
metrics besides "how many pages is this". Everything boils down to the number
of moving components in the system.
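
Even the simplest slice involves judgment calls; e.g. a sketch counting
non-blank, non-comment lines of Python (my own, with all the caveats above):

    
    
      # One way to slice it: non-blank, non-comment lines. Every choice here
      # (what about docstrings? multi-line statements?) changes the number.
      def sloc(path):
          with open(path) as f:
              return sum(1 for line in f
                         if line.strip() and not line.strip().startswith("#"))
    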

~~~
technomancy
If you want to simplify it even further, there is this:
<https://github.com/technomancy/bludgeon>

    
    
        Bludgeon is a tool which will tell you if a given
        library is so large that you could bludgeon
        someone to death with a printout of it.

------
sunir
No code has no bugs.

(A koan.)

~~~
daemin
Conversely: No code has no features.

~~~
techiferous
Inversely: This feature has no code: <http://instantzendo.com> (view source to
see the code)

~~~
daemin
Yeah, but there's code there to return the no code page.

~~~
techiferous
NOP!

( <http://homepage.ntlworld.com/richard.leedham-green/> )

------
colomon
So, does this mean that adding tests actually adds bugs to your overall
program (main code + tests)? One might hope that well-designed tests at least
push the bugs from the main code into the tests... I wonder if any research
has been done on this particular question.

------
DannoHung
Does this just mean character count, or statements/expressions? I've worked
with APL-like languages, which pack enormous numbers of statements/expressions
into very few characters, but they never seemed particularly easier to debug
than, say, Python.

------
bluekeybox
I wonder if real-life organizations/bureaucracies obey the same law.

------
oacgnol
It'd be interesting to look at relative code quality vs. size on a per-
language basis. Some languages take a whole lot less boilerplate to accomplish
the same thing.

------
regularfry
I'd _love_ to know how this is affected when you include whitespace, and if
code quality is measurably affected by how much can actually fit on the
screen.

------
johnrob
There is probably a proof out there for why fewer lines of code are better,
but everyone who believes it is too busy pumping out features to bother
articulating it.

~~~
Birejji
I would say fewer AND human-readable lines are better. The more both can be
achieved, the better.

------
shithead
_[...] my hypothesis that the number of bugs can primarily be predicted only
by the total lines of code [...] I still haven’t found any studies which show
what this relationship is like. Does the number of bugs grow linearly with
code size? Sub-linearly? Super-linearly? My gut feeling still says “sub-
linear”._

That's an interesting question. I'd bet the other way - for example, a 500
kLOC program having more total bugs than the sum of ten 50 kLOC programs.

But even if the sub-linear hypothesis were true, there's the yield problem
that semiconductor manufacturers know well. Suppose you have, statistically,
one fatal defect per 500 kLOC. That means it's hard to get a functional 500
kLOC program done. But you could get eight or nine out of ten 50 kLOC programs
right ...
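
To make the yield arithmetic concrete, assume defects land independently (a
Poisson model; my assumption, not the semiconductor industry's exact one):

    
    
      # With one fatal defect per 500 kLOC on average, the chance a program
      # of a given size has zero fatal defects is exp(-size / 500 kLOC).
      from math import exp
      
      for kloc in (50, 500):
          print(kloc, "kLOC:", round(exp(-kloc / 500.0), 3))
      # 50 kLOC: 0.905  -> about nine in ten come out clean
      # 500 kLOC: 0.368 -> roughly one in three
    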

~~~
fragsworth
The 500 kLOC program will have more bugs: suppose that, at best, it can be
divided into ten 50 kLOC components, each with the same average number of bugs
as a standalone 50 kLOC program. The fact that the components must interact
correctly will introduce more bugs. How many more is pretty unclear, but I am
certain it is more.

------
dramaticus3
I wonder if they correlated it with programming language, because this would
suggest that the more terse a language is, the less error-prone it is.

------
xpda
I just cut my Visual Studio font size down to 6, but my code runs the same.

~~~
majmun
Effects are not instant. You must keep your code at font size 6 forever.

