Wednesday, December 21, 2011

What do we actually know about software development?

I recently saw this video on Vimeo; to say it blew my mind would be an understatement:
[Embedded video: Greg Wilson's talk on what we actually know about software development]
In it, Greg Wilson makes a compelling argument that we have a responsibility to seriously question the efficacy of the software engineering practices we employ. For example, do we know that code reviews really lead to better code, or are we just trusting our gut? Is pair programming effective? Does refactoring really matter at all? Is design paradigm "blah" effective, or is it an untested, unproven method with no empirical basis whatsoever? How much of what we do is dogma, and how much has been scientifically shown to deliver results?

Painful admission: I'd never seriously considered any of this. Many of these fads--unit testing, code reviews, agile, INSERT FAD HERE--are so beaten into developers that we never stop to question their basic foundations. So as a personal exercise, I decided to seriously dig into one of them and see what I learned: test-driven development. I picked it as a more or less arbitrary starting point, based on some reddit comments.

Test-driven development is "...a software development process that relies on the repetition of a very short development cycle: first the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards." In other words, I start developing by writing tests that fail, then produce just enough code to make those tests pass. The idea sounds like a reasonable hypothesis, so I found two heavily cited scholarly articles that support TDD and scrutinized their findings.
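To make the cycle concrete, here is a minimal sketch of one "red, green, refactor" iteration in Python; the slugify function and its tests are hypothetical, invented purely to illustrate the order of operations rather than taken from either paper:

    import re
    import unittest

    # Step 1 (red): write a failing test first. It fails initially because
    # slugify doesn't exist yet.
    class TestSlugify(unittest.TestCase):
        def test_lowercases_and_hyphenates(self):
            self.assertEqual(slugify("Hello World"), "hello-world")

        def test_strips_punctuation(self):
            self.assertEqual(slugify("TDD: yes or no?"), "tdd-yes-or-no")

    # Step 2 (green): write just enough code to make the tests pass.
    def slugify(text):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return "-".join(words)

    # Step 3 (refactor): clean up the implementation while the tests stay green,
    # then repeat the cycle for the next small piece of functionality.

    if __name__ == "__main__":
        unittest.main()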

The first, Realizing quality improvements through test driven development: results and experiences of four industrial teams, is an empirical study cited by 40 authors according to Google Scholar. The study tracked four teams at two different companies and compared their "defect density" with that of a "comparable team in the organization not using TDD." Obviously "comparable team" is ambiguous, so in their "Threats to Validity" section, they mention:
The projects developed using TDD might have been easier to develop, as there can never be an accurate equal comparison between two projects except in a controlled case study. In our case studies, we alleviated this concern to some degree by the fact that these systems were compared within the same organization (with the same higher-level manager and sub-culture). Therefore, the complexity of the TDD and non-TDD projects are comparable.
First, I do not buy that "the complexity of the TDD and non-TDD projects are comparable" just because "these systems were compared within the same organization (with the same higher-level manager and sub-culture)." I know from my own organization that complexity can vary wildly under the same manager: I work on what I'd describe as a very complicated product (computer vision, lots of networking, embedded development, lots of compression, low-latency streaming, etc.), while coworkers in the same organization (with the same higher-level manager) literally write software with a small subset of our group's requirements; the difference in complexity is gigantic. By not providing specifics about the comparison projects, the authors make it hard to take the findings seriously. It's also clear from the total lines of code and team sizes that these are unlikely to be "comparable projects." They also don't give the experience breakdown of the comparison teams' members.

Second, none of these results are statistically significant, and the authors acknowledge as much. This is an empirical study--not a controlled experiment--so at best it can infer a correlation between TDD and "defect density."

Third, they include figures on the "Increase in time taken to code the feature due to TDD"--and the data source is management estimates. Seriously? Given what we know about the accuracy of software estimates, is that really a valid method of measurement?

Lastly, how do they conclude that TDD was solely responsible for the improvement? Did the teams have other differences, such as doing code reviews? Were the programmers co-located or geographically dispersed? How did management affect the various projects? What other factors could have influenced the results? They allude to some of this obliquely in their "threats" section, but none of it stops them from recommending best practices for TDD in their discussion section.

The second paper, Assessing Test-Driven Development at IBM, published in 2008, has ~134 citations in Google Scholar. I found it to be a more interesting comparison: two software teams implementing the same specification, one using a "Legacy" development method and the other--admittedly a later effort--using TDD. It's still not an experiment, since nothing was controlled, but the final deliverable is more comparable: the same "point of sale" specification from IBM.

Their findings are summarized in two graphs; the top is legacy, and the bottom is the newer project using TDD:

[Graphs: defect rate per KLOC, actual vs. forecast, for the legacy project (top) and the TDD project (bottom)]
At first glance, these look promising: the TDD project had a much lower "defect rate per thousand lines of code (KLOC)" than the legacy project, and it also forecast its defect rates more accurately.

But on closer inspection, I have to seriously question these results. In terms of absolute line count, the legacy solution appears to be about ~11.5 KLOC (80 total defects / 7.0 defects per KLOC) versus ~67 KLOC for the TDD version (247 total defects / 3.7 defects per KLOC). In other words, in absolute terms the legacy system had roughly one-third as many defects and one-sixth the total line count. So I'm struggling to understand how the TDD team ended up with such a massive pile of code, what that cost them in schedule, productivity, maintainability and performance, and how "six times as much code and three times the defect count compared to legacy, which purportedly does the same thing!" gets presented as a positive result. I'm open to the possibility that I'm misinterpreting these graphs, but if I'm not, the authors deserve a scientific keel-hauling.
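For transparency, here is the back-of-the-envelope arithmetic behind those numbers; the defect counts and densities are my reading of the paper's graphs, so treat them as approximations:

    # Defect density = total defects / KLOC, so KLOC = total defects / defect density.
    # All inputs below are approximate values read off the paper's two graphs.
    legacy_defects = 80      # total defects
    legacy_density = 7.0     # defects per KLOC
    tdd_defects = 247
    tdd_density = 3.7

    legacy_kloc = legacy_defects / legacy_density    # ~11.4 KLOC
    tdd_kloc = tdd_defects / tdd_density             # ~66.8 KLOC

    code_size_ratio = tdd_kloc / legacy_kloc           # ~5.8x more code in the TDD version
    defect_count_ratio = tdd_defects / legacy_defects  # ~3.1x more total defects

    print("TDD vs. legacy code size:    %.1fx" % code_size_ratio)
    print("TDD vs. legacy defect count: %.1fx" % defect_count_ratio)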

There's no comparison of development time. No mention of how successful either product actually was in production. No comparison of productivity between the two teams, only a reference to the productivity of the TDD team. No acknowledgement that factors other than TDD might be responsible for the findings, since, again, this was not a controlled experiment. And none of this prevents the authors from presenting best practices for TDD in their results section, as though two misleading graphs were a good proxy for a well-executed experiment and statistical significance. Frankly, this source was enormously disappointing to me. The discrepancies in KLOC and absolute defect counts are significant enough that I'm struggling to understand how this paper was A) published and B) cited by 134 other authors.

In my opinion, neither of these papers establishes that TDD is an effective practice. Of course, neither of them precludes it from being effective either, so the jury's still out. And it's entirely possible I've made an error interpreting these findings; I graciously welcome your peer review, dear reader. In any event, I think Greg Wilson's point stands strong: what are our standards of truth? Why do we do what we think works instead of what we've empirically shown to work?


2 comments:

Anonymous said...

Nice post, Jeremy. A few comments:

First, as a bit of context, it looks like the second paper you link to was published in an "Experience Report" track, which has less stringent requirements from a scientific perspective.

Second, even if it weren't, over time you'll see that the paper that covers all its bases, scientifically speaking, and produces a clear actionable result is a very rare find. That's simply because this is an incredibly messy domain, and studying it in a way that accounts for every confounding variable is quite costly, and in many cases unfeasible. So as you read research in this area, you'll still need to exercise some judgment ("OK, they're not doing this or that, but the effect they report on is quite strong... how much can I trust this finding?"). The situation might be different in a few years, but unfortunately that's the state we're in now.

Incidentally, if the authors don't report on something that you think is important to understand the trustworthiness of a finding, you should feel free to contact them! They'll usually get back to you with sensible answers to your questions, if you ask nicely.

Having said all that, you're right in that using a defect density metric to compare two projects implementing the same requirements is a strange choice, especially with such a wide difference in resulting KLOCs. A total defect count would be better, and apparently, far less impressive from a TDD perspective.

kidjan said...

catenary,

Thanks for the comments, particularly the clarification on where the second article was published.

I guess what I find disturbing about both these papers is that they put the cart before the horse--the authors clearly believe in TDD, and from my vantage point they're looking for information to support the conclusion that TDD is a good practice rather than testing whether it actually is effective. See the "Texas sharpshooter fallacy" on Wikipedia for more on this error in reasoning.

The first paper I don't find compelling. But the second paper... I'm struggling to see how it doesn't amount to academic dishonesty. I suppose I should send some email asking pointed questions, as you suggest, but I find it hard to believe they unknowingly presented the data the way they did.