What we actually know about software development, and why we believe it’s true (slideshare.net)
133 points by ZeroGravitas 2975 days ago | 54 comments

It looks like it was a nice talk, but unfortunately it failed to address the reason why there is so little reliable science when it comes to software development.

The reason is, of course, that we have yet to come up with a reliable, meaningful metric for programmer productivity, one that would apply to the lone Ruby on Rails coder as well as to the enterprise Java cog-in-the-great-machine or the ThoughtWorks elite agile programmer in a 6-person iterative team.

Until there is a meaningful way to compare the productivity of different software methodologies, it is pointless to berate the lack of scientific studies comparing the productivity of these methodologies.

As a final nail in that coffin, I don't think anyone can claim that Google is lacking in the "let's measure everything" department, with their reputation for intensive A/B testing of every single pixel change. And yet Google has yet to resolve this problem... which would tend to indicate that this is a real, non-trivial problem, not just a case of people not trying hard enough.

Good point.

I attended this talk, and it was easily one of the best talks at Stackoverflow (Devdays) Toronto, and got a very rousing response from the crowd.

To answer your concern, the question of "why" was actually one of the overall points of his discussion. Wilson would enthusiastically agree with you that there aren't very good metrics out there not only for programmer productivity, but a number of other things that you'd want to measure in software. At the beginning of the talk he contrasted the software sciences against other, older disciplines just to show how incredibly nascent it is.

His most damning point about the current state of affairs was that there is a lot of metrics/methodology pseudoscience being spouted by people and businesses in our profession, and that we often eat it up unchallenged. His optimistic counter-point was that things are turning around and measurement, experimentation and skepticism are spreading, and as you point out, for some companies like Google, it's yielding a lot of gains, even if they haven't cracked any universal code of productivity (does anyone actually believe we ever will? This will always be a moving target).

The most frequent phrase used by Wilson in his talk was "wouldn't that be a useful thing to know?" (with respect to the result of some experiment that would show a verifiable result). Essentially, his talk was a rallying cry to attempt to stir the audience into becoming more critical about their own methods and to boost their skepticism of the things they're told about methodology.

I also attended this conference in Toronto. I didn't like the last two thirds of the talk one bit. There is plenty of evidence that explains why fewer women are in software development (and, in general, science). There is also plenty of science that disputes anthropogenic global warming (and also plenty of science that supports it). The speaker made highly emotionally charged statements that were in the opposite spirit of his whole thesis: DATA and ANALYSIS should drive opinions and decisions, not emotions. All of this is beside the point. What does global warming or Rush Limbaugh have to do with the science behind software development organization?

While he is absolutely correct that software development lacks rigorous testing of effective methods (agile vs waterfall), he openly allows volunteers (!) to be part of his test group ("I'll post my email") after already telling them what he is testing. This makes his inquiry into the most effective method even worse than nothing at all. By taking in biased volunteers he makes his numbers look statistically valid while failing to take a true cross-section of software developers.
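To illustrate the self-selection concern, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the model (each developer has a hidden "benefit" from a method, and developers who benefit more are more likely to volunteer), the threshold, and the names `simulate`, `benefit` and `enthusiasm_noise`. It is not a model of the speaker's actual study.

```python
import random
import statistics

def simulate(n_developers: int = 10_000, threshold: float = 1.0, seed: int = 0):
    """Compare the mean benefit of self-selected volunteers with the whole population.

    Hypothetical model: benefit ~ N(0, 1); a developer volunteers when
    benefit plus unrelated noise exceeds `threshold`.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = []
    volunteers = []
    for _ in range(n_developers):
        benefit = rng.gauss(0.0, 1.0)
        enthusiasm_noise = rng.gauss(0.0, 1.0)
        population.append(benefit)
        if benefit + enthusiasm_noise > threshold:
            volunteers.append(benefit)
    return statistics.mean(population), statistics.mean(volunteers)

population_mean, volunteer_mean = simulate()
print(f"population mean benefit: {population_mean:+.2f}")
print(f"volunteer mean benefit:  {volunteer_mean:+.2f}")
```

Under these assumptions the volunteer group shows a clearly positive average even though the population average is about zero, which is the sense in which a volunteer sample can make a method look better than it is.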

My favorite talk of the night was the Ruby guy. Clean, unemotional, understandable, interesting, informative. I almost expected him to end with "live long and prosper."

Not only is there no metric, but when we force artificial metrics onto programming teams, we are forcing them to pursue a goal other than the quality of the software itself. Any benefits derived from measurability will be eclipsed by the bad practices brought about by the metrics.

Take two of the most common pseudo-metrics: lines-of-code and bugs-closed-per-month.

Lines-of-code encourages verbosity and complexity while discouraging code reuse. It also ignores essential code maintenance tasks that usually remove lines-of-code rather than adding them.
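A tiny sketch of how a lines-of-code score can rank a cleanup below a copy-paste job. The scoring function `loc_delta_score` and the two change sizes are hypothetical numbers chosen for illustration, not data from any real team.

```python
def loc_delta_score(lines_added: int, lines_removed: int) -> int:
    """Naive 'productivity' score: net lines of code added."""
    return lines_added - lines_removed

# Hypothetical change 1: copy-paste a module four times instead of reusing it.
copy_paste_score = loc_delta_score(lines_added=400, lines_removed=0)

# Hypothetical change 2: replace the copies with one shared helper.
refactor_score = loc_delta_score(lines_added=60, lines_removed=340)

print(copy_paste_score)  # 400
print(refactor_score)    # -280
```

Under this metric the maintenance work that shrinks the code base shows up as negative output, which is exactly the incentive problem described above.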

Bugs-closed-per-month encourages the creation of bugs in the first place. Not that anyone intentionally makes mistakes, but it lets the programmer get away with getting it less-right the first time. Even worse, it assures continued dividends from an error-prone implementation.

It would be useful to seek macroscopic measures of programmer productivity in our discipline as a whole. Let's just keep in mind the potential adverse effects of applying these metrics on the ground. Realistically, any effective macro metric will be abused by PHBs all over the world in managing their 10-man teams.

Somehow even your use of the word "force" here still manages to understate the severity of the problem.

The "Performance Management" craze is sweeping corporate management, along with the pernicious use of the "Balanced Scorecard" it advocates. Unable to identify meaningful metrics to build these scorecards with, management delegates to programmers the task of inventing metrics. Programmers, who know very well how any metric can be gamed, know that they're obligated to hand management a loaded gun, and hope they can fill the scorecards with a gun of the least-damaging calibre.

This will continue unabated until a better science appears, and there's a very real possibility that one never will (if programming is more akin to an art and/or craft).

I would love to have seen this presentation.

> This will continue unabated until a better science appears

. . . or until a better theory of management appears and gains widespread acceptance.

I would consider "bugs opened per month" to be a good metric, assuming someone uses the software.

I'm of the opinion that the reason there is no quantitative metric for programmers is the same reason there is no quantitative metric for writers, artists or musicians.

Yet there are quantitative metrics for engineers. Is your Ruby on Rails application really more complex than the software flying an airliner?

There are metrics for problems which are studied in-depth before engineers are let loose to solve it. That's not true of most software projects.

Who do you think does the studying? Engineers...

You're agreeing with the parent. There are metrics to measure solving well-known problems, but no metrics for researching problems to make them well-known.

But the first person or team who realized that you could put a computer in an aircraft weren't solving a known problem - however they were using engineering methods, which is why it's very rare for avionics to (literally!) crash.

Here's my theory: we do know how to build very high quality software in a repeatable way. However very few organizations are willing to pay what it costs to do that, and that is why most software is so bad. Admittedly a lot of people writing code these days are "cowboys" (they don't care and neither do their clients, they just want speed) but that doesn't mean that the body of knowledge doesn't exist.

I don't know why this comment was voted down, but I think he could be right -- given the correct resources (time mainly), and assuming the right talent level, I think we could consistently build rock-solid software. The constraint seems to be a (logical and necessary) trade off between total cost and total utility.

As we all know, something that works 95% of the time at 5% the price is very often better than a "perfect" solution. With critical systems like plane navigation, the tradeoff doesn't make sense, so you pay full price for near-perfection... your niche e-commerce site? 95% is sufficient.

Plane navigation is in some ways easier than e-commerce sites as well. The environment that the plane is in isn't actively malicious. You can collect accurate statistics on weather, whereas the attacks and usage (with bots/scraping etc) constantly change.

Not at all man. The physical environment can be much more malicious than you seem to think.

I've done some software for buses. It's a toy in comparison (with planes), but you get issues with noise, power supply, vibration, temperature, light (UV killing a touch screen) and probably lots more I've never encountered. The physical world can be every bit as unpredictable as the virtual one.

The major difference is that environmental agents are generally deterministic or stochastic, i.e. obey the law of physics. On the other hand, realms with intelligent agents, such as humans, are inherently unpredictable due to their free will (which is unpredictable by definition).

You also have the free will of humans messing with the system (in my case, fraud prevention). Also consider animals interacting with the machinery while no one is watching. I remember hearing a story about a huntsman spider regularly tripping a sensor for a program that measured the weight of lorries.

The only consistent thing you can state is that all these "real world" issues are to do with hardware whereas yours are more softwarey. However the consideration of such issues is necessary across the board.

I don't know about you but if I was writing a program to fly a plane I'd want to make damn sure I took into account an intelligent malicious agent trying to crash my plane.

E-commerce has been around for >15 years now, and there are how many tens or hundreds of thousands of e-commerce sites? It's a very well known problem by now.

regarding problems and engineering, let's see how Brooks puts it:

"Essential complexity is caused by the problem to be solved, and nothing can remove it"

so the kind of complexity we have to tackle comes from the area of application we are in, not the programming task itself. it's not about writing pretty, bug-free java code. let's assume your java is 100% perfect. there still remains the essential complexity of the problem.

now if the problem is in any way related to humans (a lot of software has to be 'a solution' for humans) then there is no way to prove a solution to be The Right One. hence A/B testing. this is at some stages the way to go.

show something & get feedback from the user, who knows more about the problem (himself) than you. this of course doesn't mean it's all blind trial&error - well in a way it is, but humans get better at that over time.

disclaimer: if you program the fuel-control for a rocket that is 30 years old, you can of course calculate the correct solution and resolve all essential complexity to mathematical formulas.

interestingly this was not the case for the first rocket-programmers, since the rockets kept changing every few months :)

Look at slide 20 again: every 25% increased problem complexity results in 100% increased solution complexity.
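To put a number on that claim, here is a small worked sketch. It assumes (this is my assumption, not something stated on the slide) that solution complexity grows as a power of problem complexity, and solves for the exponent. The helper name `implied_exponent` is hypothetical.

```python
import math

def implied_exponent(problem_ratio: float, solution_ratio: float) -> float:
    """Exponent k such that solution_ratio == problem_ratio ** k.

    Assumes a power-law relation between problem and solution complexity.
    """
    return math.log(solution_ratio) / math.log(problem_ratio)

# Reading 1: 25% more problem complexity doubles solution complexity.
k_growth = implied_exponent(1.25, 2.0)   # roughly 3.1

# Reading 2: dropping 25% of the problem halves solution complexity.
k_shrink = implied_exponent(0.75, 0.5)   # roughly 2.4

print(f"{k_growth:.2f} {k_shrink:.2f}")
```

The two readings are not the same statement: under a power law they imply different exponents, so which way the 25% is applied matters when repeating the figure.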

It's more open-ended because you can mostly make your own rules, unlike in the physical world where you have lots of predefined constraints. The resulting solution space being much bigger, you have to use your taste to narrow down to the better solutions, so of course it's more subjective.

What's your point? Does airline software have a good metric?

The main difference is that other engineering disciplines don't have to deal with the halting problem.

I don't think it's a matter of complexity so much as a matter of difficulty defining the problem.

I agree. But what if we looked at a job where productivity should be much easier to define, like a roofer?

An obvious performance metric would be "Time it takes to install a roof that will pass a quality inspection". That metric is pretty useless alone though. Can he fix a nail gun that breaks? How safe is he? Can he work as fast in cold weather? Those are just a few factors.

Say you could get metrics for ALL productivity factors, you still haven't taken into account personality issues that cannot be quantified.

And this is for a single person working on a measurable task with only a few tools. What chance is there for metrics to be accurate about a task that could require any number of tools, multiple correct answers, and collaboration?

"Does it work" isn't a quantitative metric, it's a qualitative one.

For sure. Before 'hit parades', music could sit around in a desk drawer for decades (Schubert) before being heard, because innovation drove away audiences wanting to hear the familiar. The people who made sure that Schubert's stuff lived were other composers ... not audiences.

Software's the same ... the people who cheer great code (art, science) are people who understand it well enough to recognize great code. The opinions of peers are an important part of achievement metrics. And that artful kind of code pretty much counts for nothing by any standard 'industry' measures. But it's influential.

I agree this is a key point. Software bug rates are not comparable to scurvy incidence rates, and developer productivity is not similar to the number of bricks a laborer can move in an hour.

Still, there's a lot of room between "no evidence" and "double-blind clinical trials", and reason to believe that statements about software development can profitably be moved towards the latter.

I would take Code Complete as a good example. There are sections on methodology that run along the lines of "Company A tried this method and got a B% reduction in defects compared to the first version of the product. Company X also used it and only got a Y% improvement compared to the previous year, but that's still worth using." Studies don't have to be totally conclusive in order to provide meaningful back-up to anecdotal reports and subjective experience.

In other words, the perfect is the enemy of the good, and good evidentiary metrics may be achievable even if perfect ones aren't.

One solution to this is that you can measure similar projects. "The Mythical Man-Month" is based on this idea. If you compare large projects to large projects (or a large project to itself with different methods, different number of people, etc), you get a decently reliable, if imperfect, metric which can give good results.

Similarly, compare elite six-man teams to other elite six man teams (not very reliable, but interesting if results are extreme) or to themselves on other projects, possibly with slightly different methodologies.

Yes, it's difficult. It would be nice if we had some simple means of measuring productivity, but much of programming is very artlike -- we're not solving identical problems, and we may solve them in very different ways, and a lot of the resulting quality depends on the taste of the viewer. We're not going to get a really excellent, predictable way to measure quality.

Waiting until we do means no research, which is awfully fatalistic.

I freely admit that I have no idea how to measure productivity. And until I do, I really can't tell you that what I choose to do is anything other than consistent with my superstitions about software development.

In other words, I choose some superstitions and I can build up a consistent kind of world based on those superstitions and I can test various ideas to see if the ideas are consistent with my superstitions or the logical consequences of those superstitions.

But really, I have no idea whether I'm directing development traffic or whether I have a pair of coconuts clamped to my ears and I'm waving palm fronds under an empty sky.


Worse than that. We can't even measure success.

In the extremes of ultimate financial success or failure in the long term we can determine success, but that gives us something like 1 bit of data per 3-5 years per major software project. Is Twitter successful? MySpace? Facebook? Was Netscape successful? Geocities? (The stock used to buy Geocities is currently worth about $2 billion.) For each of those there are even more examples that are more difficult to determine. Modern software comes in a series of releases: how much can you attribute the success of a particular release to the existing code base, and how much can you attribute it to just the particular diffs for that release? It's a tricky problem with no easy answers.

What I find odd is that there used to be quite a bit of research in this area. Read older software development books: they reference research papers more often. Compare Code Complete with agile programming books, for example.

In principle it should be practical to compare the productivity of different software methodologies by counting Function Points (or something similar) as well as defects. http://www.ifpug.org/ But in practice that doesn't work very well. It's awkward to count Function Points for maintenance and refactoring work in a meaningful way. And for most organizations, the cost of just gathering the data and doing the counting is so high as to outweigh any possible benefit.

Most professions choose their best practices through a form of natural selection and evolution. Rookies try all the different ways and 30 years later they are Veterans. The Veteran then passes on the best ways he learned to the Rookies, who copy it, and then over the course of their career slowly improve on it as well.

Even the most "simple" task like scaffolding (for instance) has heaps of "craft" in it. What research team decided that scaffolders should thread scaffold clamps between their fingers when they carry them? There is a subtle technique to this that has been passed down through generations.

This is how all professions learn the tricks of their trade. People try stuff, figure out what works best for them personally, and then pass that knowledge down to the rookies. Certainly with programming this happens at university - I've had profs who give practical advice like, "Make sure your program is always capable of compiling throughout the entire dev process." This is folk knowledge yes, but it's practical folk knowledge that HAS been tested over the course of careers by many people.

It is not economical to have many dedicated research scientists trying to figure out these things. They are figured out by people as they do their job.

The most important thing is mentoring. Paul Graham widely shares the things he has learned from experience and those things HAVE been legitimately tested because he has tried ways that failed and tried ways that succeeded. Just because it is not double blind does not mean that it hasn't been tested.

> It is not economical to have many dedicated research scientists trying to figure out these things. They are figured out by people as they do their job.

... it is not economical to have many scientists trying to figure out the cures for illnesses. They are figured out by people as they treat other people.

Is that what you really want to say? Basic science does not have any immediate economic pay-off, no. But it often pays in spades later on.

> because he has tried ways that failed and tried ways that succeeded

These are ways that worked for PG - and we have no idea if luck is involved. See his latest article - luck plays a large role in all startups.

I'm not saying mentoring is bad. In fact, it's pretty much the only thing that produces any results right now. But creating a solid foundation for the engineering side of computer science doesn't strike me as a bad idea...

he makes some excellent points, or should I say hypotheses?

drop 25% of the features to reduce complexity by half. I suppose this is why templates and re-usable code appeal to managers more so than to software developers. one sees the clock and the other sees something that would be much more beautiful after a redo.

my son learned in mechanical engineering school that if you want the best design and the best device, don't ask the builder to design the device. the builder will make decisions based on the complications of the build process and the desire to build. the designer would focus on the utility of the device and defer thinking to the builder on how to get the thing built.

People with different Myers-Briggs scores do better in pair programming. Raise your hand if that makes your brain hurt.

Mythical Man Month is the classic book on this, based on the OS/360 project at IBM.

Complexity drives time to complete toward infinity. Says the moth to the flame, "you're so beautiful". Says the teacher to the class, "let's not get ahead of ourselves".

A fellow named Francis Frank used to teach project management as a position that sits between the hydrant and the dog. The scrum master takes that role now, looking for a favorable outcome from Myers-Briggs score diversity.

I especially like the point that defects are not a factor of geographic distance, but of distance on the org chart.

If you look at research papers in software engineering, you will notice a significant thing has happened in the last several years. With the rise of, first the WWW, and then open-source repositories, there is now a large body of code available that was simply absent before. It is now usual/common for research to investigate, test, experiment against that. This must be a positive change, that we should be hopeful about yielding more substantial results.

Must have been a great talk, because even the slides are good (and normally the slides are almost useless alone).

He definitely got the most thunderous applause of the day at DevDays simply because so much of the day's talks thus far had been horrible. I was following the Twitter feed of the conference and after his talk many people remarked that he had basically saved DevDays Toronto from being a failure.

I was hoping Joel would send around a URL for feedback. I didn't stick around to see if that ended up happening. The problem wasn't that the talks were horrible in and of themselves. But it appears that he totally underestimated some of the audience's abilities. I couldn't have been the only person there that's used ASP.NET MVC, Python, jQuery, and Ruby. For us, the talks weren't insightful because they were all introductory.

That said, I chatted with some attendees who thought the talks were great because they only program in ASP.NET (old style) and have never been exposed to these sort of languages. I guess they're the developers that ask the questions on StackOverflow and people like us are the ones answering the questions? :-) Or maybe I was just unlucky enough to know about the languages they decided to focus on. Had there been an introductory talk on Haskell or Clojure, I'd have learned something new.

It was a brilliant talk, easily worth the cost of driving out from Michigan for DevDays.

Selling Software != Selling Consulting

Hence it depends on type of the company.

Programming is still a craft. If you want to know how it should be done, just look at the masters. Plenty of successful products out there that can serve as an example.

Until we develop actual computer science (rather than calling publicly funded engineering science) that's all we have.

Ummm... you do realize that the slides point out scientific studies of programming with interesting, practical results, yes? And that the author of the slides is a professor of computer science doing more such studies? One of his points in the talk was to call for the professional programmers in the audience to participate in these sorts of studies.

A couple of points buried in the talk: defects increase as the code size increases. The increase is greater than linear. So my question for folks is why would you program in Java rather than Python, where you are guaranteed to have more bugs, even with static type checking?

The issue of programmer productivity has been around at least since The Mythical Man Month, but the generic responses have been better estimates and better methodologies. There is even the Software Engineering Institute to make life worse for all of us. I think that except for academics and programmers themselves, no one cares. Large companies used Cobol and now Java because it is safe. Everyone else is doing the same. I once asked an executive why he picked Java over Python for a greenfield reimplementation of his site. He said it was because there were more Java programmers than any other kind.

"A couple of points buried in the talk: defects increase as the code size increases. The increase is greater than linear."

I went looking for these points in the slides, only because the slides got me in the mood to question my assumptions and try to follow citations back to source material - and it wasn't obvious that the presence of something on a slide meant that it was a known truth. (Example: the slides about Martin Fowler's claims about DSLs and productivity are in the slides as a skull on a pikestaff, not as a verified truth.)

After looking, I think you got the first point from slide 15, but I wasn't able to find a slide that mentioned non-linearity.

I like the question that you're asking, with regard to Python versus Java. I wonder, though, if that conclusion really rises to the standards of rigor that this presentation appears to advocate.

Even if you manage to dig up the right Lutz Prechelt citation, has the effect really been shown to be something that is completely unmitigated by static-provability?

Slide 25: "Most Metrics' values increase with code size". The point about non-linear increase in defects is my (unwarranted) extrapolation. An earlier slide says that a 25% increase in complexity gives a 100% increase in size.

No, it really doesn't rise to his standards of rigor. Everything that I have read supporting that conclusion is probably anecdotal. The formal studies I have read have usually involved college students on short projects, studies that are probably worth less than useless. Others probably suffer from the Hawthorne effect.

Do you know if he meant larger as in 'wc -c' larger, or as in number of symbols larger? I'm curious because there are languages (Cobol, Obj-C) that are extremely verbose in terms of characters but not in symbols.
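The three notions of "larger" can be compared directly. This is a rough sketch for Python source only, using the standard `tokenize` module as a stand-in for "number of symbols". The function `size_measures` and the two sample snippets are made up for illustration, and `chars` matches `wc -c` only for ASCII text.

```python
import io
import tokenize

# Token types that describe layout rather than program symbols.
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}

def size_measures(source: str) -> dict:
    """Return character, line and symbol (token) counts for Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    symbols = sum(1 for tok in tokens if tok.type not in _LAYOUT)
    return {
        "chars": len(source),               # like `wc -c` for ASCII input
        "lines": len(source.splitlines()),  # like `wc -l`
        "symbols": symbols,
    }

terse = "def f(a, b):\n    return a + b\n"
verbose = ("def add_two_numbers(first_number, second_number):\n"
           "    return first_number + second_number\n")

terse_size = size_measures(terse)
verbose_size = size_measures(verbose)
print(terse_size)
print(verbose_size)
```

The two snippets have the same number of lines and symbols but very different character counts, so a character-based measure would rate the verbose one as much "larger" while a line-based or symbol-based measure would not.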

The classic study cited in Mythical Man Month (which I don't unfortunately have handy) said, if I'm remembering it correctly, that it was proportional to lines of code (thus 'wc -l' larger) and independent of the language used.

What I don't know is whether IDE-generated code gets you out of that problem or not.

He was quoting Fernando Corbato, who was comparing lines of assembly to lines of PL/I in large projects. Corbato later recalls this in an article in Byte. Corbato's Law is mentioned in http://en.wikipedia.org/wiki/Fernando_J._Corbat%C3%B3

A study (http://page.mi.fu-berlin.de/prechelt/Biblio//tcheck_tse98.pd...) by the Lutz Prechelt mentioned on the slides suggests that, all else being equal, static type checking reduces defects and increases productivity (the study was for ANSI C vs. K&R C). So a concise but statically typed language (Scala?) might be a better choice than either Java or Python.

> Large companies used Cobol and now Java because it is safe.

"Socially" safe, that is. Damned dangerous in other, more neglected, senses.
