The reason is, of course, that we have yet to come up with a reliable, meaningful metric for programmer productivity, one that would apply to the lone Ruby on Rails coder as well as to the enterprise Java cog-in-the-great-machine or the ThoughtWorks elite agile programmer on a six-person iterative team.
Until there is a meaningful way to compare the productivity of different software methodologies, it is pointless to berate the lack of scientific studies comparing the productivity of these methodologies.
As a final nail in that coffin, I don't think anyone can claim that Google is lacking in the "let's measure everything" department, with their reputation for intensive A/B testing of every single pixel change. And yet Google has yet to resolve this problem... which would tend to indicate that this is a real, non-trivial problem, not just a case of people not trying hard enough.
I attended this talk, and it was easily one of the best talks at StackOverflow DevDays Toronto; it got a very rousing response from the crowd.
To answer your concern, the question of "why" was actually one of the overall points of his discussion. Wilson would enthusiastically agree with you that there aren't very good metrics out there not only for programmer productivity, but a number of other things that you'd want to measure in software. At the beginning of the talk he contrasted the software sciences against other, older disciplines just to show how incredibly nascent it is.
His most damning point of the current state of affairs was that there is a lot of metrics/methodology pseudoscience being spouted by people and businesses in our profession, and that we often eat it up unchallenged. His optimistic counter-point was that things are turning around and measurement, experimentation and skepticism are spreading, and as you point out, for some companies like Google, it's yielding a lot of gains, even if they haven't cracked any universal code of productivity (does anyone actually believe we ever will? This will always be a moving target).
The most frequent phrase used by Wilson in his talk was "wouldn't that be a useful thing to know?" (with respect to the result of some experiment that would show a verifiable result). Essentially, his talk was a rallying cry to attempt to stir the audience into becoming more critical about their own methods and to boost their skepticism of the things they're told about methodology.
While he is absolutely correct that software development lacks rigorous testing of effective methods (agile vs. waterfall), he openly solicits volunteers (!) to be part of his test group ("I'll post my email") after already telling them what he is testing. This makes his investigation of the most effective method worse than nothing at all: by taking in biased volunteers he makes his numbers look statistically valid while failing to take a true cross-section of software developers.
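The volunteer problem is easy to demonstrate with a toy simulation. A minimal sketch (all numbers invented for illustration): suppose a methodology has no real effect, but developers who already rate themselves as productive are more likely to volunteer once they hear what's being tested. The volunteer sample then "shows" a gain out of thin air.

```python
import random

# Toy simulation of volunteer selection bias. All parameters are
# invented for illustration; nothing here models a real study.
random.seed(42)

# True productivity in the whole population: mean 100, no effect at all.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Volunteers: developers who score above average are far more likely
# to sign up after being told what the experiment is testing.
volunteers = [p for p in population if p > 100 and random.random() < 0.8]

true_mean = sum(population) / len(population)
volunteer_mean = sum(volunteers) / len(volunteers)

print(f"population mean: {true_mean:.1f}")
print(f"volunteer mean:  {volunteer_mean:.1f}")
# The volunteer sample overstates productivity even though the
# methodology changed nothing.
```

A proper study would need random assignment from a representative sample, which is exactly what self-selected volunteers cannot give you.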
My favorite talk of the night was the Ruby guy. Clean, unemotional, understandable, interesting, informative. I almost expected him to end with "live long and prosper."
Take two of the most common pseudo-metrics: lines-of-code and bugs-closed-per-month.
Lines-of-code encourages verbosity and complexity while discouraging code reuse. It also ignores essential code maintenance tasks that usually remove lines-of-code rather than adding them.
Bugs-closed-per-month encourages the creation of bugs in the first place. Not that anyone intentionally makes mistakes, but it lets the programmer get away with getting it less-right the first time. Even worse, it assures continued dividends from an error-prone implementation.
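The lines-of-code problem is easy to make concrete. A minimal sketch (function names and line counts are my own, purely illustrative): two functions with identical behavior, where the "less productive" one by LOC is the better piece of code.

```python
# Two equivalent functions: a lines-of-code metric rewards the verbose one.
# Names and counts are made up for this example.

def sum_of_squares_terse(values):
    return sum(v * v for v in values)

def sum_of_squares_verbose(values):
    # Same behavior, several times the "productivity" by lines of code.
    total = 0
    for v in values:
        square = v * v
        total = total + square
    return total

loc_terse = 2      # def line + return line
loc_verbose = 6    # counted the same way, comment excluded

# Both give the same answer; only the metric disagrees about their value.
assert sum_of_squares_terse([1, 2, 3]) == sum_of_squares_verbose([1, 2, 3])
```

Worse, refactoring the verbose version down to the terse one would register as negative productivity, which is exactly the maintenance-penalizing effect described above.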
It would be useful to seek macroscopic measures of programmer productivity in our discipline as a whole. Let's just keep in mind the potential adverse effects of applying these metrics on the ground. Realistically, any effective macro metric will be abused by PHBs all over the world in managing their ten-man teams.
The "Performance Management" craze is sweeping corporate management, along with the pernicious use of the "Balanced Scorecard" it advocates. Unable to identify meaningful metrics to build these scorecards with, management delegates to programmers the task of inventing them. Programmers, who know very well how any metric can be gamed, know that they're obligated to hand management a loaded gun, and can only hope to fill the scorecards with one of the least-damaging calibre.
This will continue unabated until a better science appears, and there's a very real possibility that one never will (if programming is more akin to an art and/or craft).
I would love to have seen this presentation.
. . . or until a better theory of management appears and gains widespread acceptance.
Here's my theory: we do know how to build very high quality software in a repeatable way. However very few organizations are willing to pay what it costs to do that, and that is why most software is so bad. Admittedly a lot of people writing code these days are "cowboys" (they don't care and neither do their clients, they just want speed) but that doesn't mean that the body of knowledge doesn't exist.
As we all know, something that works 95% of the time at 5% the price is very often better than a "perfect" solution. With critical systems like plane navigation, the tradeoff doesn't make sense, so you pay full price for near-perfection... your niche e-commerce site? 95% is sufficient.
I've done some software for buses; it's a toy in comparison with planes, but you get issues with noise, power supply, vibration, temperature, light (UV killing a touch screen), and probably lots more I've never encountered. The physical world can be every bit as unpredictable as the virtual one.
The only consistent thing you can state is that all these "real world" issues have to do with hardware, whereas yours are more software-related. Considering such issues, however, is necessary across the board.
"Essential complexity is caused by the problem to be solved, and nothing can remove it"
So the kind of complexity we have to tackle comes from the area of application we are in, not the programming task itself. It's not about writing pretty, bug-free Java code. Let's assume your Java is 100% perfect; there still remains the essential complexity of the problem.
Now if the problem is in any way related to humans (a lot of software has to be 'a solution' for humans), then there is no way to prove a solution to be The Right One. Hence A/B testing. At some stages this is the way to go.
Show something and get feedback from the user, who knows more about the problem (himself) than you. This of course doesn't mean it's all blind trial and error; well, in a way it is, but humans get better at it over time.
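The "show something and get feedback" loop can at least be quantified. A minimal sketch (the conversion counts are invented): a standard two-proportion z-test comparing two variants, the usual core of an A/B test.

```python
import math

# Minimal A/B-test sketch: two-proportion z-test.
# The counts below are invented for illustration.

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """How surprising is the conversion difference between variants A and B?"""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion under the null hypothesis that A and B are identical.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 120 conversions out of 1000 users; variant B: 160 out of 1000.
z = ab_z_score(conv_a=120, n_a=1000, conv_b=160, n_b=1000)
print(f"z = {z:.2f}")  # |z| > 1.96 is conventionally "significant" at the 5% level
```

This is the easy part, of course; the hard part, as the comments above note, is that user-facing problems keep moving while you measure them.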
Disclaimer: if you program the fuel control for a rocket design that is 30 years old, you can of course calculate the correct solution and reduce all essential complexity to mathematical formulas.
Interestingly, this was not the case for the first rocket programmers, since the rockets kept changing every few months. :)
The main difference is that other engineering disciplines don't have to deal with the halting problem.
An obvious performance metric would be "time it takes to install a roof that will pass a quality inspection". That metric is pretty useless alone, though. Can he fix a nail gun that breaks? How safe is he? Can he work as fast in cold weather? Those are just a few of the factors.
Say you could get metrics for ALL productivity factors, you still haven't taken into account personality issues that cannot be quantified.
And this is for a single person working on a measurable task with only a few tools. What chance is there for metrics to be accurate about a task that could require any number of tools, multiple correct answers, and collaboration?
Software's the same: the people who cheer great code (art, science) are people who understand it well enough to recognize great code. The opinions of peers are an important part of achievement metrics. And that artful kind of code counts for pretty much nothing by any standard 'industry' measures. But it's influential.
I would take Code Complete as a good example. There are sections on methodology that run along the lines of "Company A tried this method and got an X% reduction in defects compared to the first version of the product. Company B also used it and only got a Y% improvement compared to the previous year, but that's still worth using." Studies don't have to be totally conclusive in order to provide meaningful back-up to anecdotal reports and subjective experience.
In other words, the perfect is the enemy of the good, and good evidentiary metrics may be achievable even if perfect ones aren't.
Similarly, compare elite six-man teams to other elite six-man teams (not very reliable, but interesting if results are extreme), or to themselves on other projects, possibly with slightly different methodologies.
Yes, it's difficult. It would be nice if we had some simple means of measuring productivity, but much of programming is very artlike -- we're not solving identical problems, and we may solve them in very different ways, and a lot of the resulting quality depends on the taste of the viewer. We're not going to get a really excellent, predictable way to measure quality.
Waiting until we do means no research, which is awfully fatalistic.
In other words, I choose some superstitions and I can build up a consistent kind of world based on those superstitions and I can test various ideas to see if the ideas are consistent with my superstitions or the logical consequences of those superstitions.
But really, I have no idea whether I'm directing development traffic or whether I have a pair of coconuts clamped to my ears and I'm waving palm fronds under an empty sky.
In the extremes of ultimate financial success or failure in the long term we can determine success, but that gives us something like 1 bit of data per 3-5 years per major software project. Is Twitter successful? MySpace? Facebook? Was Netscape successful? Geocities? (The stock used to buy Geocities is currently worth about $2 billion.) For each of those there are even more examples that are harder to judge. Modern software comes in a series of releases: how much of the success of a particular release can you attribute to the existing code base, and how much to the particular diffs for that release? It's a tricky problem with no easy answers.
Even the most "simple" task like scaffolding (for instance) has heaps of "craft" in it. What research team decided that scaffolders should thread scaffold clamps between their fingers when they carry them? There is a subtle technique to this that has been passed down through generations.
This is how all professions learn the tricks of their trade. People try stuff, figure out what works best for them personally, and then pass that knowledge down to the rookies. Certainly with programming this happens at university - I've had profs who give practical advice like, "Make sure your program is always capable of compiling throughout the entire dev process." This is folk knowledge yes, but it's practical folk knowledge that HAS been tested over the course of careers by many people.
It is not economical to have many dedicated research scientists trying to figure out these things. They are figured out by people as they do their job.
The most important thing is mentoring. Paul Graham widely shares the things he has learned from experience and those things HAVE been legitimately tested because he has tried ways that failed and tried ways that succeeded. Just because it is not double blind does not mean that it hasn't been tested.
... it is not economical to have many scientists trying to figure out the cures for illnesses. They are figured out by people as they treat other people.
Is that what you really want to say? Basic science does not have any immediate economic pay-off, no. But it often pays in spades later on.
because he has tried ways that failed and tried ways that succeeded
These are ways that worked for PG - and we have no idea if luck is involved. See his latest article - luck plays a large role in all startups.
I'm not saying mentoring is bad. In fact, it's pretty much the only thing that produces any results right now. But creating a solid foundation for the engineering side of computer science doesn't strike me as a bad idea...
Drop 25% of the features to reduce complexity by half. I suppose this is why templates and re-usable code appeal to managers more than to software developers: one sees the clock and the other sees something that would be much more beautiful after a redo.
My son learned in mechanical engineering school that if you want the best design and the best device, don't ask the builder to design the device. The builder will make decisions based on the complications of the build process and the desire to build. The designer will focus on the utility of the device and defer to the builder on how to get the thing built.
People with different Myers-Briggs scores do better in pair programming. Raise your hand if that makes your brain hurt.
The Mythical Man-Month is the classic book on this, based on the OS/360 project at IBM.
Complexity drives time to complete toward infinity. Says the moth to the flame, "you're so beautiful". Says the teacher to the class, "let's not get ahead of ourselves".
A fellow named Francis Frank used to teach project management as a position that sits between the hydrant and the dog. The scrum master takes that role now, looking for a favorable outcome from Myers-Briggs score diversity.
I especially like the point that defects are not a factor of geographic distance, but of distance on the org chart.
That said, I chatted with some attendees who thought the talks were great because they only program in ASP.NET (old style) and have never been exposed to these sort of languages. I guess they're the developers that ask the questions on StackOverflow and people like us are the ones answering the questions? :-) Or maybe I was just unlucky enough to know about the languages they decided to focus on. Had there been an introductory talk on Haskell or Clojure, I'd have learned something new.
Hence it depends on the type of company.
Until we develop actual computer science (rather than calling publicly funded engineering "science"), that's all we have.
The issue of programmer productivity has been around at least since The Mythical Man-Month, but the generic responses have been better estimates and better methodologies. There is even the Software Engineering Institute to make life worse for all of us. I think that except for academics and programmers themselves, no one cares. Large companies used Cobol and now Java because it is safe. Everyone else is doing the same. I once asked an executive why he picked Java over Python for a greenfield reimplementation of his site. He said it was because there were more Java programmers than any other kind.
I went looking for these points in the slides, only because the slides got me in the mood to question my assumptions and try to follow citations back to source material - and it wasn't obvious that the presence of something on a slide meant that it was a known truth. (Example: the slides about Martin Fowler's claims about DSLs and productivity are in the slides as a skull on a pikestaff, not as a verified truth.)
After looking, I think you got the first point from slide 15, but I wasn't able to find a slide that mentioned non-linearity.
I like the question that you're asking, with regard to Python versus Java. I wonder, though, if that conclusion really rises to the standards of rigor that this presentation appears to advocate.
Even if you manage to dig up the right Lutz Prechelt citation, has the effect really been shown to be something that is completely unmitigated by static-provability?
No, it really doesn't rise to his standards of rigor. Everything that I have read supporting that conclusion is probably anecdotal. The formal studies I have read have usually involved college students on short projects, studies that are probably worth less than useless. Others probably suffer from the Hawthorne effect.
What I don't know is whether IDE-generated code gets you out of that problem or not.
"Socially" safe, that is. Damned dangerous in other, more neglected, senses.