Ruminations on Computing: September 2008

Friday, September 26, 2008

22,500 Paper Airplanes

Rainn Wilson (Dwight K. Schrute from The Office) was the MC at this year's Microsoft company meeting. Ever since they tore down the Kingdome, the meeting has taken place at Safeco Field. Sitting up in the 300-level seats, it is tempting to throw paper airplanes. Each year some people do. This causes some strange moments when a speaker mistakes the applause for a particularly long airplane flight for applause for whatever they just said. At this year's company meeting they decided that everyone should throw paper airplanes. All at the same time. Rainn orchestrated it dressed in a flight jacket. It was a pretty impressive sight.

Video at 11.

Test Suite Granularity Matters

I just read a very interesting research paper entitled "The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing" by Gregg Rothermel et al. In it the authors examine the impact of test suite granularity on several metrics. The two most interesting are the impact on running time and the impact on bug finding. In both cases, they found that larger test suites were better than small ones.

When writing tests, a decision must be made how to organize the tests. The paper makes a distinction between test suites and test cases. A test case is a sequence of inputs and the expected result. A test suite is a set of test cases. How much should each test suite accomplish? There is a continuum but the endpoints are creating each point of failure as a standalone suite or writing many points of failure into a single suite.

The argument for very granular test suites (1 case or point of failure per suite) is that they can be better tracked and analyzed. The paper examines the efficacy of different techniques for restricting the number of suites run in a given regression pass. They found that more granular cases were more effectively reduced. However, the time savings even from aggressive reductions in test suites did not offset the time taken to run all test cases in larger suites. Grouping test cases into larger suites makes them run faster. Without reduction the granular cases in the study ran almost 15 times slower. With reduction, this improved to running only 6 times slower.

Why is this? Mostly it is because of overhead. Depending on how the test system launches tests, there is a cost to each test suite being launched. In a local test harness like NUnit, this cost is small but can add up over a lot of cases. In a network-based system, the cost is large. There is also the cost of setup. Consider an example from my days of DVD testing. To test a particular function of a DVD decoder requires spinning up a DVD and navigating to the right title and chapter. If this can be done once and many test cases executed, the overhead is amortized across all of the cases in the suite. On the other hand, if each case is its own suite, the overhead is paid again for every case.
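To make the amortization argument concrete, here is a rough cost model in Python. It is only a sketch: the function name and every number in it (launch cost, setup cost, per-case time) are made up for illustration and do not come from the paper.

```python
def total_runtime(num_cases, cases_per_suite, launch_cost, setup_cost, case_cost):
    """Estimate total run time when num_cases are grouped into suites.

    Each suite pays the launch and setup cost once; each case pays its
    own execution cost. All costs are in seconds and are hypothetical.
    """
    num_suites = -(-num_cases // cases_per_suite)  # ceiling division
    overhead = num_suites * (launch_cost + setup_cost)
    return overhead + num_cases * case_cost


# Hypothetical numbers: 1000 cases, 2 s to launch a suite,
# 30 s of setup (think: spin up the disc), 1 s per case.
granular = total_runtime(1000, 1, 2.0, 30.0, 1.0)   # one case per suite
grouped = total_runtime(1000, 50, 2.0, 30.0, 1.0)   # fifty cases per suite

print(granular, grouped, granular / grouped)
```

With these invented numbers the granular run spends almost all of its time on overhead, while the grouped run is dominated by the cases themselves. The exact ratio depends entirely on the costs you plug in.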

Perhaps more interesting, however, the study found that very granular test suites actually missed bugs. Sometimes as many as 50% of the bugs. Why? Because less granular suites cover more state than more granular ones and are thus more likely to find bugs.
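A toy illustration of the state argument, in Python. The `Counter` class and its bug are invented for this example and have nothing to do with the programs in the study. The same three cases are run twice: once with a fresh object per case (granular) and once against a single shared object (grouped).

```python
class Counter:
    """Hypothetical component whose bug only shows up after a reset."""

    def __init__(self):
        self.value = 0
        self._step = 1

    def increment(self):
        self.value += self._step
        return self.value

    def reset(self):
        self.value = 0
        self._step = 2  # the planted bug: reset should not touch the step


def case_increment(counter):
    before = counter.value
    return counter.increment() == before + 1


def case_reset(counter):
    counter.increment()
    counter.reset()
    return counter.value == 0


cases = [case_increment, case_reset, case_increment]

# Granular: every case starts from a brand new object.
granular_results = [case(Counter()) for case in cases]

# Grouped: all cases run against the same object, so state carries over.
shared = Counter()
grouped_results = [case(shared) for case in cases]

print(granular_results)
print(grouped_results)
```

Run in isolation, every case passes. Run back to back, the second increment case fails because it executes after the reset has corrupted the step.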

It is important to note that there are diminishing returns on both fronts. It is not wise to write all of your test cases in one giant suite. Result tracking does become a problem. It can be hard to differentiate bugs which happen in the same suite. After a certain size, the overhead costs are sufficiently amortized and enough states traversed that the benefits of a bigger suite become negligible.

I have had first-hand experience writing tests of both sorts. I can confirm that we have found bugs in large test suites that were caused by an interaction between cases. These would have been missed by granular execution. I have also seen the immense waste of time that accompanies granular cases. Not mentioned in the study is also the fact that granular cases tend to require a lot more maintenance time.

My rule of thumb is to create differentiated test cases for most instances but then to utilize a test harness that allows them to be all run in one instance of that harness. This gets the benefits of a large test suite without many of the side effects of putting too much into one case. It amortizes program startup, device enumeration, etc. but still allows for more precise tracking and easier reproduction of bugs. If there is a lot of overhead, such as the DVD case mentioned above, test cases should be merged or otherwise structured so as not to pay the high overhead each time.
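One way to sketch this rule of thumb is with Python's standard `unittest` module, where `setUpClass` runs once per test class while each test method is still reported separately. The `FakeDisc` class is a hypothetical stand-in for an expensive fixture like the DVD above, not real decoder code.

```python
import unittest


class FakeDisc:
    """Hypothetical stand-in for an expensive fixture (e.g. a spun-up DVD)."""

    spin_ups = 0  # counts how many times the expensive setup was paid

    def __init__(self):
        FakeDisc.spin_ups += 1
        self.position = None

    def seek(self, title, chapter):
        self.position = (title, chapter)
        return self.position


class DecoderTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Paid once for the whole class, not once per test case.
        cls.disc = FakeDisc()

    def test_seek_first_chapter(self):
        self.assertEqual(self.disc.seek(1, 1), (1, 1))

    def test_seek_later_chapter(self):
        self.assertEqual(self.disc.seek(1, 5), (1, 5))

    def test_seek_second_title(self):
        self.assertEqual(self.disc.seek(2, 1), (2, 1))


def run_all():
    """Run every case in a single harness instance and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DecoderTests)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_all()
```

Each case still gets its own pass or fail line, but the disc is only "spun up" once. Note that the shared fixture means state carries over from one case to the next, which is part of the point but also something to keep in mind when reproducing a failure.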

Thursday, September 25, 2008

This Blog Makes Top 100 Software Development Management Blogs

Just barely, but I made the list. It is determined by Technorati authority, Alexa rank, Google page rank, etc.

Check out the list as there are a lot of interesting blogs on it. The top 10 are:

Nr Site

1 Joel on Software

2 Coding Horror

3 Seth's Blog

4 Paul Graham: Essays

5 blog.pmarca.com

6 Rough Type

7 Scott Hanselman's Computer Zen

8 Martin Fowler's Bliki

9 Rands in Repose

10 Stevey's Blog Rants

That's some esteemed company to be even mentioned with.

Wednesday, September 17, 2008

Metallica's Death Magnetic Poorly Mastered

Metallica's new CD, Death Magnetic, is pretty good. Certainly it is better than St. Anger although I suppose that's not setting the bar very high. This CD is closer to And Justice For All than anything Metallica has done in the last decade. Unfortunately, the CD is mastered very hot. Basically, the CD is authored so loud that there isn't much range between loud and quiet and the loud becomes distorted. Ryan explains this phenomenon quite well and in a lot more detail.

The fact that Kotaku is running an article about this means it has gotten the attention of people beyond the audiophile community. Their claim is that the Guitar Hero versions of the same songs are much better. I've heard Amazon MP3s are a lot less hot than CDs generally. I wonder if they fix this particular case.

We can only hope that the negative reaction this CD is getting in some sectors shows a trend and that CD mastering will go back to where it once was. Having to turn up the CD to make it loud isn't a bad thing.

Friday, September 5, 2008

Not Everyone Has the Same Definition of "Done"

Years ago I had an employee, let's call him Vanya (not the real name). He was struggling a bit so I was watching his work closely. Every week we discussed what he needed to get done the next week and what he had done the previous week. I kept a list of the work items he needed to complete and checked them off when he was done with each. One particular item on the list was writing tests for a DirectShow filter. One week he worked on and completed this work. A few months later we became aware of an issue that would fundamentally cause the filter to not work. In fact, it had probably never worked. Why, I wondered, didn't Vanya's tests catch it? I went to speak with him. It turns out, he had written the tests. They had compiled. He had not, however, ever actually run them. They were "done" in his mind, but not in mine. Oh, and what he had written didn't actually work. Shocking, I know.

I tell you this story to introduce a problem I've run into many times on many different scales. This story is probably the most egregious, but it is certainly not isolated. The problem stems from the fact that we rarely define the word "done." It is assumed that everyone shares a definition but it is rarely true. Is a feature done when it compiles? When it is checked in? When it can run successfully? When it shows up in a particular build? All of these are possible interpretations of the same word.

It is important to have a shared idea of where the finishing line is. Without that, some will claim victory and others defeat even when talking about the same events. It is not enough to have a shared vision of the product, it is also necessary to agree on the specifics of completion. To establish a shared definition of done, it is necessary to talk about it. Flush the latent assumptions out into the open. Before starting on a project, it is imperative to have a conversation about what it means to be done. Define in strict terms what completion looks like so that everyone will have a shared vision.

For large projects, this shared vision of done can be exit criteria. "We will fix all priority 1 and 2 bugs, survive this many hours of stress, etc." For small projects or individual features in a large project, less extensive criteria are needed, but it is still important to agree on what state the work will be in on what dates.

While not strictly necessary, it is also wise to define objective tests for done-ness. For instance, when working on new features, I define "done" as working in the primary scenarios. Bugs in corner cases are acceptable, but if a feature can't be exercised in the main way it was intended to be, it can't be tested and isn't complete. To ensure that this criterion is met, I often insist on seeing the feature demonstrated. This is a bright line. Either the feature can be seen working, or it cannot. If it can't, it isn't done and more work is needed before moving on to the next feature.

Tuesday, September 2, 2008

My PAX08 Experience

Penny Arcade Expo (PAX) is an annual gaming convention in Seattle put on by the guys at Penny Arcade. It covers all aspects of gaming from consoles to PC games to roleplaying and tabletop games. It is big. Very big. 40,000 tickets pre-sold kind of big. This was my second year attending. Last year I dipped my toe in the water. This year I came back and dove in fully. I was there all 3 days for most of the time the doors were open. I had a great time and intend to attend again next year. Here are some of the highlights:

Felicia Day and The Guild. The Guild is an online video series which pokes fun at MMORPG players. I had only seen 2 episodes before coming to PAX. We watched the whole thing, and it was followed by a Q&A. The video is hilarious. The Q&A was great. Felicia Day is a fun speaker and her cohorts were quick-witted, especially Sandeep who plays the "love interest" in the show.

Wil Wheaton. Last year's keynote speaker was back. He was available for book signings and was on 2 panels that I'm aware of. I saw him on the "Is casual gaming killing hardcore?" panel and on his panel of one. For his own panel he read a selection from his latest book and one of his ST:TNG reviews. He then took Q&A from the audience. If you ever get a chance to see him in person, jump at the opportunity. I'm still kicking myself for missing his keynote last year.

The Expo Hall. Filled with the latest in gaming goodness. Gears of War 2? Check. StarCraft 2? Check. Fallout 3? Check. Chrono Trigger DS? Check. Even some board gaming companies like Wizards of the Coast and Fantasy Flight Games were there.

Ken Levine's keynote. The lead designer of BioShock gave the keynote. I didn't really know much about him beforehand and didn't expect a whole lot. He exceeded my expectations and gave a great speech. Mostly it was a humorous approach to his biography. He told of when he first encountered D&D, first got hired at Looking Glass and found people like him, etc. If you can find the audio of this online, give it a listen.

Omegathon. The Omegathon is a 6-round competition through all aspects of gaming. This year involved Peggle, Pictionary, Geometry Wars, Rock Band 2, Jenga, and Vs. Excitebike (on a 1986 Nintendo Home Computer no less). Watching 3,000 people cheer for Vs. Excitebike is an experience like none other.

Overall, it was great fun. I'll definitely be going again next year.