Monday, June 23, 2014

Perceived vs. Objective Quality

I recently heard this story, but I can't recall who told it to me. I don’t have proof of its veracity so it might be apocryphal. Nevertheless, it illustrates an important point that I believe to be true independent of the truth of this story.

As the story goes, in the late 1990s, several Microsoft researchers set about trying to understand the quality of various operating system codebases. Of concern were Linux, Solaris, and Windows NT. The perception among the IT crowd was that Solaris and Linux were of high quality and Windows NT was not. These researchers wanted to test that objectively and understand why NT would be considered worse.

They used many objective measures of code quality to assess the three operating systems. These included cyclomatic complexity, depth of inheritance, the output of static analysis tools such as lint, and measurements of coupling. Without debating the exact value of this sort of approach, there are reasons to believe these sorts of measurements are at least loosely correlated with defect density and code quality.

What the researchers found was that Solaris came out on top. It was the highest quality. This matched the common perception. Next came Windows NT, close behind Solaris. The surprise was Linux. It was far behind the other two. Why then the sense that it was high quality? The perceived quality of both NT and Linux did not match their objective measures of quality.

The speculation on the part of the researchers was that while Linux had a lot of rough edges, the most used paths were well polished. The primary scenarios were close to 100% whereas the others were only at, say, 60%. NT, on the other hand, was at 80 or 90% everywhere. This made for high objective quality, but not high experienced quality.

Think of it. If everything you do is 90% right, you will run into small problems all the time. On the other hand, if you stay within the expected lanes on something like Linux, you will rarely experience issues.
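The arithmetic behind this can be sketched in a few lines. Everything here is an illustrative assumption: the story gives only rough percentages, so the usage shares, the number of paths, and the reliability figures are made up to show the shape of the effect, not to measure any real operating system.

```python
# Illustrative only: usage shares and reliability figures are assumptions,
# loosely based on the rough percentages quoted in the story.

def objective_quality(paths):
    """Unweighted average reliability across all code paths."""
    return sum(reliability for _, reliability in paths) / len(paths)

def experienced_quality(paths):
    """Reliability weighted by how often users actually take each path."""
    return sum(share * reliability for share, reliability in paths)

# Each entry is (share of user activity, probability the path works).
# Polished core: two primary paths get 95% of the use and almost always work;
# eight rarely used paths share the remaining 5% and work 60% of the time.
polished_core = [(0.475, 0.99)] * 2 + [(0.00625, 0.60)] * 8
# Uniform: same usage split, but every path works 85% of the time.
uniform = [(0.475, 0.85)] * 2 + [(0.00625, 0.85)] * 8

for name, paths in [("polished core", polished_core), ("uniform", uniform)]:
    print(f"{name}: objective={objective_quality(paths):.2f}, "
          f"experienced={experienced_quality(paths):.2f}")
```

With these made-up numbers, the polished-core system scores lower on the unweighted average (about 0.68 versus 0.85) but higher once each path is weighted by how often it is used (about 0.97 versus 0.85).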

This coincides well with the definition of quality being about fitness for a function. For the functions it was being used for, Linux was very fit. NT supported a wider variety of functions, but was less fit for each of them and thus perceived as being of lower quality.

The moral of the tale: Quality is not the absence of defects. Quality is the absence of the right kinds of defects. The way to achieve higher quality is not to scour the code for every possible defect. That may even have a negative effect on quality due to randomization. Instead, it is better to understand the user patterns and ensure that those are free of bugs. Data Driven Quality gives the team a chance to understand both these use patterns and what bugs are impeding them.

Monday, June 16, 2014

Data Driven Quality

My last three posts have explained how test lost its way. It evolved from advocates of the user to a highly efficient machine for producing test results, verifying correctness as determined by a specification. Today, test can find itself drowning in a sea of results which aren't correlated with any discernible user activity. If only there were a way to put the user back at the center, scale testing, and be able to handle the deluge of results. It turns out, there is. The path to a solution has been blazed by our web services brethren. That solution is data. Data driven quality is the 4th wave of testing.

There is a lot to be said for manual testing, but it doesn't scale. It takes too many people, too often. They are too expensive and too easily bored doing the same thing over and over. There is also the problem of representativeness. A tester is not like most of the population. We would need testers from all walks of life to truly represent the audience. Is it possible to hire a tester that represents how my grandmother uses a computer? It turns out, it is. For free. Services do this all the time.

If software can be released to customers early, they will use it. In using it, they will inevitably stumble across all of the important issues. If there were a way to gather and analyze their experiences, much of what test does today could be done away with. This might be called the crowdsourcing of testing. The difficulty is in the collection and analysis.

Big Data and Data Science are the hot buzzwords of the moment. Despite the hype, there is a lot of value to be had in the increased use of data analysis. What were once gut feels or anecdotal decisions can be made using real information. Instead of understanding a few of our customers one at a time, we can understand them by the thousands.

A big web service like Bing ships changes to its software out to a small subset of users and then watches them use the product. If the users stop using the product, or a part of the product, this can indicate a problem. This problem can then be investigated and fixed.

The advantage of this approach is that it represents real users. Each data point is a real person, doing what really matters to them. Because they are using the product for real, they don't get bored. They don't miss bugs. If the product is broken, their behavior will change. That is, if they experience the issue. If they don't, is it really a bug? (more on this in another post). This approach scales. It can cover all types of users. It doesn't cost more as the coverage increases.

Using data aggregated across many users, it should be possible to spot trends and anomalies. It can be as simple as looking at what features are most used, but it can quickly grow from there. Where are users failing to finish a task? What parts of the system don't work in certain geographies? What kinds of changes most improve usage?
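As a rough sketch of the "where are users failing to finish a task?" question, the snippet below counts how many distinct users reach each step of a task and reports the step with the largest drop-off. The task, the step names, and the event list are all hypothetical; real telemetry would come from an instrumentation pipeline rather than a hard-coded list.

```python
# Hypothetical telemetry: each event is (user_id, step_reached) for a
# made-up three-step "share a photo" task.
FUNNEL = ["open_editor", "pick_recipient", "send"]

events = [
    ("u1", "open_editor"), ("u1", "pick_recipient"), ("u1", "send"),
    ("u2", "open_editor"), ("u2", "pick_recipient"),
    ("u3", "open_editor"),
    ("u4", "open_editor"), ("u4", "pick_recipient"), ("u4", "send"),
]

def funnel_counts(events, funnel):
    """Count distinct users who reached each step of the funnel."""
    users_by_step = {step: set() for step in funnel}
    for user, step in events:
        if step in users_by_step:
            users_by_step[step].add(user)
    return [len(users_by_step[step]) for step in funnel]

def biggest_drop(events, funnel):
    """Return (from_step, to_step, fraction_lost) for the largest loss."""
    counts = funnel_counts(events, funnel)
    drops = []
    for i in range(len(funnel) - 1):
        if counts[i] == 0:
            continue  # nobody reached this step, so there is nothing to lose
        lost = 1 - counts[i + 1] / counts[i]
        drops.append((funnel[i], funnel[i + 1], lost))
    return max(drops, key=lambda d: d[2])

print(funnel_counts(events, FUNNEL))   # [4, 3, 2]
print(biggest_drop(events, FUNNEL))    # largest loss is between pick_recipient and send
```

A drop like this does not say what is wrong, only where to look first.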

If quality is the fitness of a feature for a particular function, then watching whether customers use a feature, for how long, and in what ways can give us a good sense of quality. By watching users use the product, quality can begin to be driven by data instead of pass/fail rates.

Moving toward data driven quality is not simple. It operates very differently than traditional testing. It will feel uncomfortable at first. It requires new organizational and technical capabilities. But the payoff in the end is high. Software quality will, by definition, improve. If users are driving the testing and the team is fixing issues to increase user engagement, the fitness for the function users demand of software must go up.

Over the next few posts, I will explore some of the changes necessary to start driving quality with data.

Tuesday, June 10, 2014

Halt and Catch Fire

I just finished watching the first episode of the new AMC show called Halt and Catch Fire.  The name comes from an old computer instruction which would stop the machine immediately.  The show follows a small Texas company trying to build IBM PC Clones.  The company and the people are fictitious, but it seems to parallel a lot of what Compaq went through in the early 80s.

I’ve always been a sucker for computing history.  I enjoy movies like Pirates of Silicon Valley and The Social Network.  I like Triumph of the Nerds.  I am happy to say that I really enjoyed the pilot episode.  It does a good job with the technical aspects of the show.  There is a scene where they are reverse engineering the ROM chip and it appears quite authentic to the way this work would be done.  They do a good job explaining things without getting dull.  They went out of their way to be accurate.  This article in Wired points out the lengths they went to in order to be period authentic.  It shows. 

If you have any interest in computing history or just like techy tv shows, give Halt and Catch Fire a try.

Monday, June 9, 2014

Test Has Lost Its Way

In a blog post, Brent Jensen relays a conversation he had with an executive mentor. In this conversation, his mentor told him that, "Test doesn't understand the customer." When I read this, my initial reaction was the same as Brent's: "No way!" If test is focused on one thing, it is our customer. Then as I thought about it more, my second reaction was also the same as Brent's: "I agree. Test no longer cares about the customer." We care about correctness over quality.

Let me give an example. This example goes way back (probably to XP), but similar things happen even today. We do a lot of self-hosting at Microsoft. That means we use an early version of the software in our day-to-day work. Too often, I am involved in a conversation like the following.

Me: I can't use the arrow keys to navigate a DVD menu in Windows Media Player.

Them: Yes you can. You just need to tab until the menu is selected.

Me: That works, but it takes (literally) 20 presses of the tab key and even then only a faint grey box indicates that I am in the right place.

Them: That's what the spec says. It's by design.

Me: It's by bad design.

What happened here? I was trying out the feature from the standpoint of a user. It wasn't very fit for my purposes because if I were disabled or my mouse were broken, I couldn't navigate the menu on a DVD. The person I was interacting with was focused too much on the correctness of the feature and not enough on the context in which it was going to be used. Without paying attention to context, quality is impossible to determine.  Put another way, if I don’t understand how the behavior feels to the user, I can’t claim it is high quality.

How did we get to this point? The long version is in my post on the history of testing. Here is the condensed version. In the beginning, developers made software and then just threw it over the wall to testers. We had no specifications. We didn't know how things were supposed to work. We thus had to just try to use the software as we imagined an end user would. This worked great for finding issues, but it didn't scale. When we turned to test automation to scale the testing, this required us to better understand the expected cases. This increased the need for detailed specifications from which to drive our tests.

Instead of a process like this:


We ended up with a process like this:


As you can see in the second, the perspective of the user is quickly lost. Rather than verifying that the software meets the needs of the customer, test begins to verify that the software meets the specifications. If quality is indeed the fitness for a function then the needs of the user are necessary for any determination of quality. If the user's needs could be completely captured in the specification, this might not be a problem, but in practice it is.

First, I have yet to see the specification that totally captures the perspective of the user. Rather, that perspective is used only as an input to the specification. As changes happen, ambiguities are resolved, or the feature is scoped (reduced), the needs of the user are forgotten.

Second, it is impossible to test everything. A completely thorough specification would be nearly impossible to test in a reasonable amount of time. There would be too many combinations. Some would have to be prioritized, but without the user perspective, which ones?

Reality isn't quite this bleak. The user is not totally forgotten. They come up often in the conversation, but they are not systematically present. They are easily forgotten for periods of time. The matching of fitness to function may come up for one question or even one feature, but not for another.

Test has lost its way. We started as the advocate for the user and have become the advocate for the specification. We too often blindly represent the specification over the needs of real people. Going back to our roots doesn't solve the problem.

Even in the best case scenario, where the specification accounts for the needs of the user and the testers always keep the user at the forefront of their minds, the system breaks down. Who is this mythical user? It isn't the tester. We colloquially call this the "98052" problem. 98052 is the zip code for Redmond, where Microsoft is located. The people who live there aren't representative of most of the other zip codes in the country or the world.

Sometimes we create aggregated users called personas to help us think about other users. This works up to a point, but "Maggie" is not a real user. She doesn't have real needs. I can't ask her anything and I can't put her in a user study. Instead Maggie represents tens of thousands of users, all with slightly different needs.

Going back to our roots with manual testing also brings back the scale problem. We really can't go home. Home doesn't exist any more. Is there a way forward that puts the users, real users, back at the center, but scales to the needs of a modern software process? I will tackle one possible solution in my next post.

Wednesday, June 4, 2014

A Brief History of Test

In the exploration of quality, it is important to understand where software testing came from, where it is today, and where it is heading. We can then compare this trajectory to the goal of ensuring quality and see whether a correction is necessary or if we're going the right direction.

I have been involved in software testing for the past 16 and a half years. To give you some perspective, when I started at Microsoft, we were just shipping Windows 98. This gives me a long view of the history of software testing. What I give below is my take on that history. Others may have experienced it differently.

There are three major waves of software testing and we're beginning to approach a 4th. The first wave was manual testing. The second wave was automated testing. The third wave was that of tooling. It is important to note that each wave does not fully supplant the previous wave. There is still a need for manual testing even in the tooling phase. There is a need for automated testing even in the coming 4th wave.

The first wave was manual testing. Sometimes this is also called exploratory testing. It is often carried out by people carrying the Quality Assurance (QA) or Software Test Engineer (STE) title. Manual testing is just what it sounds like. It is people sitting in front of a keyboard or mouse and using the product. In its best form it is freeform and exploratory in nature: a tester trying to understand the user and carrying out operations trying to break the software. This is where the lore of testing comes from. The uber-tester who can find the bug no one else can imagine. In its worst form, this is the rote repetition of the same steps, the same levels, over and over again. This is also the source of legends, but not good ones. At its best, this form of testing is highly connected to quality. It is all about the user and his (or her) experience with the product.

Manual testing can produce great user experiences. As I understand it from friends who have gone there, this is the primary method of testing at Apple. The problem is, manual testing doesn't scale. It can also be mind-numbing. In the era of continuous integration and daily builds, the same tests have to be carried out each day. It becomes less about exploring and more about repeating. Manual testing is great for finding the bugs initially, but it is a terribly inefficient regression testing model.

It gets even worse when it comes time for software maintenance. At Microsoft, we support our software for a long time. Sometimes a really long time. Windows XP shipped in 2001 and is just now becoming unsupported. Consider for a moment how many testers it would take to test XP. I'll just make up a number and say it was 500. It wasn't, but that's good enough for a thought exercise. Every time you release a fix for XP, you need 500 people to run through all the tests to make sure nothing was broken. But the 500 original testers are probably working on Vista so you need an additional 500 people for sustained engineering. Add Windows 7, Windows 8, and Windows 8.1 and you now need 2,500 people testing the OS. Most are running the same regression tests every time which is not exciting so you end up losing all your good people. It just doesn't work.

This leads us to the second wave. The first wave involved hundreds of people pressing the same buttons every day. It turns out that computers are really good at repetitive tasks. They don’t get bored and quit. They don't get distracted and miss things. Thus was born test automation. Test automation in its most basic form is writing programs to do all of the things that manual testers do. They can even do things testers really can't. Manual testing is great for a user interface. It's hard to manually test an SDK. It turns out that it is easy to write software that can exercise APIs. This magic elixir allowed teams to cover much more of the product every day. We fully drank the Kool-Aid.

We set off to automate everything. There was nothing automation couldn't do and so all STEs were let go. Everyone became a Software Design Engineer in Test (SDET). This is a developer who, rather than writing the operating system, writes tests for the automation. Some of this work is mundane: calling the Foo API with a 1, a 25, and a MAX_INT. Other parts can be quite challenging. Consider how you would test the audio playback APIs. It is not enough to merely call the APIs and look at the return code. How do you know the right sound was played, at the right volume, and without crackling? Hint: it's time to break out the FFT.
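To make the FFT hint concrete, here is a minimal sketch of what such a check might look like. It is not how any real audio test was written: the "captured" signal is synthesized in place of audio recorded from a device, the function names are hypothetical, and a naive DFT stands in for a real FFT library to keep the example dependency-free.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns normalized magnitudes for bins 0..n/2-1."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)
    return mags

def analyze(samples, sample_rate):
    """Return (dominant frequency in Hz, its amplitude, energy in all other bins)."""
    mags = dft_magnitudes(samples)
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])  # skip DC
    freq = peak_bin * sample_rate / len(samples)
    # A sine of amplitude A that lands exactly on a bin has magnitude A/2 there.
    amplitude = 2 * mags[peak_bin]
    off_peak = sum(m for k, m in enumerate(mags) if k != peak_bin)
    return freq, amplitude, off_peak

def tone(freq_hz, sample_rate, n, amplitude):
    """Synthesize n samples of a sine wave (stand-in for captured audio)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]

SAMPLE_RATE = 8000
N = 400  # bin width is 20 Hz, so a 440 Hz tone falls exactly on bin 22

captured = tone(440, SAMPLE_RATE, N, amplitude=0.5)
crackly = list(captured)
crackly[100] += 1.0  # a single-sample click spreads energy across every bin

for name, signal in [("clean", captured), ("crackly", crackly)]:
    freq, amplitude, off_peak = analyze(signal, SAMPLE_RATE)
    print(f"{name}: freq={freq:.0f} Hz, amplitude={amplitude:.2f}, "
          f"off-peak energy={off_peak:.3f}")
```

The three questions map onto three numbers: the dominant frequency says whether the right sound played, the peak amplitude approximates the volume, and energy outside the expected bin hints at crackling. A real test would also need tolerances for tones that fall between bins.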

Not everything is kittens and roses in the world of automation. Machines are great at doing what they are told to do. They don't take breaks or demand higher pay. However, they do only what they are told to do. They will only report bugs they are told to look for. One of my favorite bugs to talk about involved media player. In one internal build, every time you clicked next track on a CD (remember those things?) the volume would jump to maximum. While a test could be concocted to look for this, it never would be. Test automation happily reported a pass because indeed the next track started playing. It turns out that once you have run a test application for the first time, it is done finding new bugs. It can find regressions, but it can't notice the bug it missed yesterday.
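The shape of that failure is easy to reproduce in miniature. The class and checks below are hypothetical stand-ins, not the actual Media Player tests: the first check verifies only what the specification asked for, while the second also verifies that unrelated state was left alone, which is the kind of check that tends to get written only after someone has already noticed the bug.

```python
class FakePlayer:
    """Hypothetical stand-in for a media player with the volume bug."""

    def __init__(self):
        self.track = 1
        self.volume = 30

    def next_track(self):
        self.track += 1
        self.volume = 100  # the bug: volume jumps to maximum


def check_next_track_spec_only(player):
    """Checks only what the spec asked for: the next track plays."""
    before = player.track
    player.next_track()
    return player.track == before + 1


def check_next_track_with_invariants(player):
    """Also checks that the volume was left untouched."""
    before_track, before_volume = player.track, player.volume
    player.next_track()
    return player.track == before_track + 1 and player.volume == before_volume


print(check_next_track_spec_only(FakePlayer()))        # True: reports a pass
print(check_next_track_with_invariants(FakePlayer()))  # False: catches the bug
```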

This points toward the second problem with test automation. It requires very complete specifications. Because the tests can't find any bugs they weren't programmed to find, they need to be programmed to find everything. This requires a lot more up-front planning so the tests can cover the full gamut of the system under test. This heavy reliance upon specifications begins to distance testing from the needs of the user and thus we move away from testing quality and toward testing adherence to a spec.

The third problem with test automation is that it can generate too much data. Machines are happy to churn out results every day and every build. I knew teams whose tests would generate millions of results for each build. Staying on top of this becomes a full time job for many people. Is this failure a bug in the test? Is it a bug in the product? Was it an environmental issue (network down, bad installation, server unavailable)?

One other problem that can happen is that the work can grow faster than test developers can keep up. It is easy for a developer to write a little code which creates a massive amount of new surface area. Consider the humble decorator pattern. If I have 4 UI objects in the system, the tester needs to write 4 sets of tests. Now if the developer creates a decorator which can apply to each of the objects, he only has to write one unit of code to make this work. This is the advantage of the pattern. However, the tester has to write 4 more sets of tests, one for each decorated object. The test surface is growing geometrically compared to the code dev is writing. This is unsustainable for very long.

This brings us to the third wave of testing. This wave involves writing software that writes tests. I call this the tooling phase. Rather than directly writing a test case, it is possible to write a tool that, given some kind of specification, can emit the relevant test cases automatically.  Model Based Testing is one form of this tooling. The advantage of this sort of tooling is that it can adapt to changes. Dev added one decorator to the system? Add one new definition in your model and tests just happen.
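A minimal sketch of the idea, assuming a toy state-machine model of a media player (the states, actions, and function names are hypothetical): the tool walks the model and emits every valid action sequence up to a given length, together with the end state the model predicts.

```python
# Hypothetical model of a tiny media player: state -> {action: next_state}.
MODEL = {
    "stopped": {"play": "playing"},
    "playing": {"pause": "paused", "stop": "stopped"},
    "paused": {"play": "playing", "stop": "stopped"},
}

def generate_cases(model, start, max_len):
    """Enumerate every valid action sequence up to max_len steps,
    paired with the end state the model predicts for it."""
    cases = []
    frontier = [((), start)]
    for _ in range(max_len):
        next_frontier = []
        for actions, state in frontier:
            for action, next_state in model[state].items():
                case = (actions + (action,), next_state)
                cases.append(case)
                next_frontier.append(case)
        frontier = next_frontier
    return cases

cases = generate_cases(MODEL, "stopped", max_len=3)
print(len(cases))  # 6

# Dev adds one feature; the tester adds one definition to the model.
extended = {state: dict(transitions) for state, transitions in MODEL.items()}
extended["playing"]["next_track"] = "playing"
print(len(generate_cases(extended, "stopped", max_len=3)))  # 10
```

One added line in the model grows the generated suite from 6 cases to 10 without anyone writing a test by hand. Note that the predicted end state only works as a pass/fail check because this toy model is simple enough to predict it.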

There are some downsides to the tooling approach. In fact, there are enough downsides that I've never experienced a team that adopted it for all or even most of their testing. They probably exist, but they aren't common. At most, this tooling approach was used to supplement other testing. The first downside is the oracle problem. It is easy enough to create a model of the system under test and generate hundreds or thousands of test cases. It is another thing entirely to understand which of these test cases pass and which fail. There are some problem domains where this is a tractable problem. Each combination or end state has an easily discernible outcome. In others, it can be exceptionally difficult without re-creating all of the logic of the system under test. The second is that the failures can be very hard to reason about in terms of the user. When the Bar API gets this and that parameter while in this state, it produces this erroneous result. Okay. But when would that ever happen in the real world?

Tooling approaches can solve the static nature of testing mentioned above. Because it is mathematically impractical to do a complete search of the state space of any non-trivial application or API, we are always limited to a subset of all possible states for testing. In traditional automation, this subset is fixed. In the tooling approach, the subset can be modified each time with random seeds, longer exploration times, or varying weights. This means each run can expose new bugs. This can be used to good effect. Given some metadata about an API and rules on how to call it, a tool can be created to automatically explore the API surface. We did this to good effect in Windows 8 when testing the Windows Runtime surface.
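The seeded exploration described above might look something like the sketch below. It is not the Windows Runtime tool; the class under test, its planted bug, and the invariant are all invented for illustration. Each seed selects a different slice of the state space, and recording the seed makes any failure reproducible.

```python
import random

class Counter:
    """Hypothetical system under test, with a planted bug for illustration."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

    def decrement(self):
        self.value = max(0, self.value - 1)

    def double(self):
        self.value *= 2
        if self.value > 6:   # planted bug: large results wrap to a negative value
            self.value = -1

ACTIONS = ["increment", "decrement", "double"]

def replay(actions):
    """Run a sequence of actions; return the index of the first step that
    breaks the invariant (value is never negative), or None if none does."""
    counter = Counter()
    for i, name in enumerate(actions):
        getattr(counter, name)()
        if counter.value < 0:
            return i
    return None

def explore(seed, steps=30):
    """Random walk over the API. Returns the failing prefix, or None."""
    rng = random.Random(seed)
    actions = [rng.choice(ACTIONS) for _ in range(steps)]
    failed_at = replay(actions)
    return None if failed_at is None else actions[: failed_at + 1]

results = {seed: explore(seed) for seed in range(20)}
found = {seed: seq for seed, seq in results.items() if seq is not None}
print(f"{len(found)} of {len(results)} seeds exposed the bug")
```

Rerunning with a fresh range of seeds, or a larger `steps` value, explores paths that earlier runs never touched, which is what lets this style of tool find a bug on Tuesday that it missed on Monday.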

Sometimes it can have unexpected and even comical outcomes. I recall a story told to me by a friend. He wrote a tool to explore the .NET APIs and left it to run overnight. The next morning he came in to reams of paper on his desk. It turns out that his tool had discovered the print APIs and managed to drain every sheet of paper from every printer in the building. At Microsoft every print job has a cover sheet with the alias of the person doing the printing so his complicity was readily apparent. Some kind soul had gathered all of his print jobs and placed them in his office.

The tooling approach to testing exacerbates two of the problems of automation. It creates even more test results which then have to be understood by a human. It also moves the testing even further away from our definition of quality. Where is the fitness for a function taken into account in the tooling approach?

There is a problem developing in the trajectory of testing. We, as a discipline, have moved steadily further from the premise of quality. We'll examine this in more detail in the next post and start considering a solution in the one after that.