
Wednesday, July 30, 2014

The Data Driven Quality Mindset

"Success is not delivering a feature; success is learning how to solve the customer's problem." - Mark Cook, VP of Products at Kodak

I've talked recently about the 4th wave of testing called Data Driven Quality (DDQ). I also elucidated what I believe are the technical prerequisites to achieving DDQ. Getting a fast delivery/rollback system and a telemetry system is not sufficient to achieve the data driven lifestyle. It requires a fundamentally different way of thinking. This is what I call the Data Driven Quality Mindset.

Data driven quality turns on its head much of the value system which is effective in the previous waves of software quality. The data driven quality mindset is about matching form to function. It requires the acceptance of a different risk curve. It requires a new set of metrics. It is about listening, not asserting. Data driven quality is based on embracing failure instead of fearing it. And finally, it is about impact, not shipping.

Quality is the matching of form to function. It is about jobs to be done and the suitability of an object to accomplish those jobs. Traditional testing operates from a view that quality is equivalent to correctness. Verifying correctness is a huge job. It is a combinatorial explosion of potential test cases, all of which must be run to be sure of quality. Data driven quality throws out this notion. It says that correctness is not an aspect of quality. The only thing that matters is whether the software accomplishes the task at hand in an efficient manner. This reduces the test matrix considerably. Instead of testing each possible path through the software, it becomes necessary to test only those paths a user will take. Data tells us which paths these are. The test matrix then drops from something like O(2^n) to closer to O(m) where n is the number of branches in the code and m is the number of action sequences a user will take. Data driven testers must give up the futile task of comprehensive testing in favor of focusing on the golden paths a user will take through the software. If a tree falls in the forest and no one is there to hear it, does it make a noise? Does it matter? Likewise with a bug down a path no user will follow.

Success in a data driven quality world demands a different risk curve than the old world. Big up front testing assumes that the cost to fix an issue rises exponentially the further along the process we get. Everyone has seen a chart like the following:

[Chart: the cost to fix a defect rising exponentially the later in the development process it is found]

In the world of boxed software, this is true. Most decisions are made early in the process. Changing these decisions late is expensive. Because testing is cumulative and exhaustive, a late bug fix requires re-running a lot of tests, which is also expensive. Fixing an issue after release is even more expensive. The massive regression suites have to be run, and even then there is little self-hosting, so the risks are magnified.

Data driven quality changes the dynamics and thus changes the cost curve. This in turn changes the amount of risk appropriate to take at any given time. When a late fix is very expensive, it is imperative to find the issues early, but finding issues early is expensive. When making a fix is quick and cheap, the value of finding an issue early is not high. It is better to lazy-eval the issues: wait until they manifest in the real world before making a fix. In this way, many latent issues will never need to be fixed. The cost of finding issues late may even be lower, because broad user testing is much cheaper than paid test engineers. It is also more comprehensive and representative of the real world.

Traditional testers refuse to ship anything without exhaustive testing up front. It is the only way to be reasonably sure the product will not have expensive issues later. Data driven quality encourages shipping with minimum viable quality and then fixing issues as they arise. This means foregoing most of the up front testing. It means giving up the security blanket of a comprehensive test pass.

Big up front testing is metrics-driven. It just uses different metrics than data driven quality. The metrics for success in traditional testing are things like pass rates, bug counts, and code coverage. None of these are important in a data driven quality world. Pass rates do not indicate quality. This is potentially a whole post by itself, but for now it suffices to say that pass rates are arbitrary. Not all test cases are of equal importance. Additionally, test cases can be factored at many levels. A large number of failing unimportant cases can cause a pass rate to drop precipitously without lowering product quality. Likewise, a large number of passing unimportant cases can overwhelm a single failing important one.

Perhaps bug counts are a better metric. In fact, they are, but they are not sufficiently better. If quality is the fit of form and function, bugs that do not indicate this fit obscure the view of true quality. Latent issues can come to dominate the counts and render invisible those bugs that truly indicate user happiness. Every failing test case may cause a bug to be filed, whether it is an important indicator of the user experience or not. These in turn take up large amounts of investigation and triage time, not to mention time to fix them. In the end, fixing latent issues does not appreciably improve the experience of the end user. It is merely an onanistic exercise.

Code coverage, likewise, says little about code quality. The testing process in Windows Vista stressed high code coverage and yet the quality experienced by users suffered greatly. Code coverage can be useful to find areas that have not been probed, but coverage of an area says nothing about the quality of the code or the experience. Rather than code coverage, user path coverage is a better metric. What are the paths a user will take through the software? Do they work appropriately?

Metrics in data driven quality must reflect what users do with the software and how well they are able to accomplish those tasks. They can be as simple as a few key performance indicators (KPIs). A search engine might measure only repeat use. A storefront might measure only sales numbers. They could be finer grained. What percentage of users are using this feature? Are they getting to the end? If so, how quickly are they doing so? How many resources (memory, cpu, battery, etc.) are they using in doing so? These kind of metrics can be optimized for. Improving them appreciably improves the experience of the user and thus their engagement with the software.
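
To make this concrete, here is a minimal sketch, in Python, of how a couple of task-level KPIs could be computed from raw telemetry events. The event names and schema are hypothetical; a real product would define its own.

```python
# Minimal sketch: computing task-level KPIs from raw telemetry events.
# The event schema (user, event, ts) is a hypothetical example.
from statistics import median

events = [
    {"user": "u1", "event": "checkout_started",   "ts": 1000},
    {"user": "u1", "event": "checkout_completed", "ts": 4000},
    {"user": "u2", "event": "checkout_started",   "ts": 2000},   # never finishes
    {"user": "u3", "event": "checkout_started",   "ts": 3000},
    {"user": "u3", "event": "checkout_completed", "ts": 9000},
]

starts = {e["user"]: e["ts"] for e in events if e["event"] == "checkout_started"}
ends   = {e["user"]: e["ts"] for e in events if e["event"] == "checkout_completed"}

completion_rate = len(ends) / len(starts)            # KPI 1: did users finish the task?
durations = [ends[u] - starts[u] for u in ends]      # KPI 2: how quickly did they finish?

print(f"completion rate: {completion_rate:.0%}")
print(f"median time to complete: {median(durations)} ms")
```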

There is a term called HiPPO (highest paid person's opinion) that describes how decisions are too often made on software projects. Someone asserts that users want to have a particular feature. Someone else may disagree. Assertions are bandied about. In the end the tie is usually broken by the highest ranking person present. This applies to bug fixes as well as features. Test finds a bug and argues that it should be fixed. Dev may disagree. Assertions are exchanged. Whether the bug is ultimately fixed or not comes down to the opinion of the relevant manager. Very rarely is the correctness of the decision ever verified. Decisions are made by gut, not data.

In data driven quality, quality decisions must be made with data. Opinions and assertions do not matter. If an issue is in doubt, run an experiment. If adding a feature or fixing a bug improves the KPI, it should be accepted. If it does not, it should be rejected. If the data is not available, sufficient instrumentation should be added and an experiment designed to tease out the data. If the KPIs are correct, there can be no arguing with the results. It is no longer about the HiPPO. Even managers must concede to data.
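
As an illustration of deciding with data rather than assertion, the sketch below runs a toy A/B comparison of a completion-rate KPI between users with and without a change, using a two-proportion z-test. The counts are made up; in practice they would come from the telemetry pipeline.

```python
# Sketch: settling a feature question with an experiment instead of a HiPPO.
# Counts are invented for illustration.
from math import sqrt, erf

control_users,   control_success   = 50_000, 31_000   # old experience
treatment_users, treatment_success = 50_000, 31_900   # experience with the change

p1 = control_success / control_users
p2 = treatment_success / treatment_users
p  = (control_success + treatment_success) / (control_users + treatment_users)

se = sqrt(p * (1 - p) * (1 / control_users + 1 / treatment_users))
z  = (p2 - p1) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided

print(f"KPI lift: {p2 - p1:+.2%}, z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05 and p2 > p1:
    print("Accept the change: it measurably improves the KPI.")
else:
    print("Reject it or keep experimenting: no convincing KPI improvement.")
```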

It is important to note that the data is often counter-intuitive. Many times things that would seem obvious turn out not to work and things that seem irrelevant are important. Always run experiments and always listen to them.

Data driven quality requires taking risks. I covered this in my post on Try.Fail.Learn.Improve. Data driven quality is about being agile. About responding to events as they happen. In theory, reality and theory are the same. In reality, they are different. Because of this, it is important to take an empiricist view. Try things. See what works. Follow the bread crumbs wherever they lead. Data driven quality provides tools for experimentation. Use them. Embrace them.

Management must support this effort. If people are punished for failure, they will become risk averse. If they are risk averse, they will not try new things. Without trying new things, progress will grind to a halt. Embrace failure. Managers should encourage their teams to fail fast and fail early. This means supporting those who fail and rewarding attempts, not success.

Finally, data driven quality requires a change in the very nature of what is rewarded. Traditional software processes reward shipping. This is bad. Shipping something users do not want is of no value. In fact, it is arguably of negative value because it complicates the user experience and it adds to the maintenance burden of the software. Instead of rewarding shipping, managers in a data driven quality model must reward impact. Reward the team (not individuals) for improving the KPIs and other metrics. These are, after all, what people use the software for and thus what the company is paid for.

Team is the important denominator here. Individuals will be taking risks which may or may not pay off. One individual may not be able to conduct sufficient experiments to stumble across success. A team should be able to. Rewards at the individual level will distort behavior and reward luck more than proper behavior.

The data driven quality culture is radically different from the big up front testing culture. As Clayton Christensen points out in his books, the values of the organization can impede adoption of a new system. It is important to explicitly adopt not just new processes, but new values. Changing values is never a fast process. The transition may take a while. Don't give up. Instead, learn from failure and improve.

Tuesday, July 1, 2014

Prerequisites to Data Driven Quality

A previous post introduced the concept of data driven quality. Moving from traditional, up-front testing to data driven quality is not easy. It is not possible to take just any product and start utilizing this method. In addition to cultural changes, there are several technical requirements on the product. These are: early deployment, friction-free deployment, partial deployment, high speed to fix, limited damage, and access to telemetry about what the users are doing.

Early deployment means shipping before the software is 100% ready. In the past, companies shipped beta software to limited audiences. This was a good start, but betas happened once or twice per product cycle. For instance, Windows 7 had two betas. Windows 8 had one. These were both 3 year product cycles. In order to use data to really understand the product's quality, shipping needs to happen much more often. That means shipping with lower quality. The exact level of stability can be determined per product, but need not be very high if the rest of the prerequisites are met. Ken Johnston has a stimulating post about the concept of Minimum Viable Quality.

Friction-free deployment means a simple mechanism to get the bits in front of users. Seamless installation. The user shouldn't have to know that they have a new version unless it looks different. Google's Chrome browser really pioneered here. It just updates in the background. You don't have to do anything to get the latest and greatest version and you don't have to care.

Because we may be pushing software that isn't 100% ready, deployment needs to be limited in scope. Software that is not yet fully trusted should not be given to everyone all at once. What if something goes wrong? Services do this with rings of deployment. First the changes are shown to only a small number of users, perhaps hundreds or low thousands. If that appears correct, it is shown to more, maybe tens of thousands. As the software proves itself stable with each group, it is deemed worthy of pushing outward to a bigger ring.
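
A minimal sketch of ring-based exposure, assuming a simple hash-based assignment of users to a rollout fraction. Real services use dedicated flighting or feature-flag systems, but the underlying idea is the same.

```python
# Sketch: assigning users to deployment rings with a stable hash.
# Ring names and sizes are illustrative, not any particular service's policy.
import hashlib

RINGS = [
    ("ring0_canary",   0.001),   # ~0.1% of users
    ("ring1_early",    0.01),    # ~1%
    ("ring2_broad",    0.10),    # ~10%
    ("ring3_everyone", 1.00),    # full population
]

def in_rollout(user_id: str, rollout_fraction: float) -> bool:
    """True if this user should see the new build at the current rollout fraction."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rollout_fraction * 10_000

# Example: the rollout is currently at ring1 (1% exposure).
current_fraction = dict(RINGS)["ring1_early"]
for user in ["alice", "bob", "carol", "dave"]:
    print(user, "gets new build:", in_rollout(user, current_fraction))
```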

If something goes wrong, it is important to fix it quickly. This can be a fix for the issue at hand (roll forward) or reversion to the last working version (roll back). The important thing is not to leave users in a broken state for very long. The software must be built with this functionality in mind. It is not okay to leave users facing a big regression for days. In the best case, they should get back to a working system as soon as we detect the problem. With proper data models, this could happen automatically.
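
With KPIs arriving in near real time, the automatic version of this can be a simple watchdog: if a key metric regresses past a threshold for the newly exposed population, halt the rollout and roll back. The sketch below uses hypothetical get_kpi and rollback functions as stand-ins for a real telemetry query and a real deployment API.

```python
# Sketch: automatic rollback when a KPI regresses after a push.
# get_kpi() and rollback() are hypothetical stand-ins.

def get_kpi(version: str) -> float:
    """Pretend telemetry query: fraction of sessions completing the key task."""
    return {"v41_baseline": 0.62, "v42_candidate": 0.55}[version]

def rollback(version: str) -> None:
    print(f"Rolling back {version} and halting the rollout.")

MAX_REGRESSION = 0.02   # tolerate at most a 2-point drop while data accumulates

def watchdog(baseline: str, candidate: str) -> None:
    drop = get_kpi(baseline) - get_kpi(candidate)
    if drop > MAX_REGRESSION:
        rollback(candidate)   # restore the last working version automatically
    else:
        print(f"{candidate} holding steady (KPI drop {drop:+.3f}); continue rollout.")

watchdog("v41_baseline", "v42_candidate")
```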

Deployment of lower quality software means that users will experience bugs. Total experienced damage is a function of both duration and magnitude. Given the previous two prerequisites, the damage will be limited in duration, but it is also important to limit the damage in magnitude. A catastrophic bug which wipes out your file system or prevents a machine from booting does its damage even if it doesn't last long. Rolling back doesn't repair the damage; your dissertation is already corrupted. Pieces of the system which can have catastrophic (generally data loss) repercussions need to be tested differently and held to a different quality bar before being released.

Finally, it must be easy to gather telemetry about what the user is doing. The product must be capable of generating telemetry, but the system must also be capable of consuming it. The product must be modified to make generating telemetry simple. This is usually in the form of a logging library. This library must be lightweight. It is easy to overwhelm the performance of a system with too slow a library and too many log events. The library must also be capable of throttling. There is no sense causing a denial of service attack on your own datacenter if too many users use the software.
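
The sketch below shows one common shape for such a library: a token bucket that caps how many events per second a client may emit. The class name, limits, and upload behavior are all illustrative assumptions, not a real library.

```python
# Sketch: a lightweight telemetry logger with client-side throttling.
# A token bucket caps events per second so a hot code path cannot
# flood the datacenter.
import time

class TelemetryLogger:
    def __init__(self, max_events_per_sec: float = 10.0, burst: int = 50):
        self.rate = max_events_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def log(self, event: str, **fields) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:          # over budget: drop (or sample) the event
            return False
        self.tokens -= 1
        self._send({"event": event, "ts": now, **fields})
        return True

    def _send(self, payload: dict) -> None:
        # A real implementation would batch and upload asynchronously.
        print("queued:", payload)

logger = TelemetryLogger()
for i in range(3):
    logger.log("menu_navigated", input="arrow_keys", step=i)
```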

The datacenter must be capable of handling the logs. The scale of success can make it difficult. The more users, the more data will need to be processed. This can overwhelm network connections and analysis pipelines. The more data involved, the more computing power is necessary. Network pipes must be big. Storage requirements go up. Processing terabytes or even petabytes of data is not trivial. The more data, the more automated the analysis must become to keep up.

With these pieces in place, a team can begin to live the data driven quality lifestyle. There is much more than just the technology to think about, though. The very mindset of the team must change if the fourth wave of testing is to take root. I will cover these cultural changes next time.

Monday, June 23, 2014

Perceived vs. Objective Quality

I recently heard this story, but I can't recall who told it to me. I don’t have proof of its veracity so it might be apocryphal. Nevertheless, it illustrates an important point that I believe to be true independent of the truth of this story.

As the story goes, in the late 1990s, several Microsoft researchers set about trying to understand the quality of various operating system codebases. Of concern were Linux, Solaris, and Windows NT. The perception among the IT crowd was that Solaris and Linux were of high quality and Windows NT was not. These researchers wanted to test that objectively and understand why NT would be considered worse.

They used many objective measures of code quality to assess the 3 operating systems. These were things like cyclomatic complexity, depth of inheritance, static analysis tools such as lint, and measurements of coupling. Without debating the exact value of this sort of approach, there are reasons to believe these sorts of measurements are at least loosely correlated with defect density and code quality.

What the researchers found was that Solaris came out on top. It was the highest quality. This matched the common perception. Next was Windows NT, closely behind Solaris. The surprise was Linux. It was far behind both of the other two. Why then the sense that it was high quality? The perceived quality of both NT and Linux did not match their objective measures of quality.

The speculation on the part of the researchers was that while Linux had a lot of rough edges, the most used paths were well polished. The primary scenarios were close to 100% whereas the others were only at, say, 60%. NT, on the other hand, was at 80 or 90% everywhere. This made for high objective quality, but not high experienced quality.

Think of it. If everything you do is 90% right, you will run into small problems all the time. On the other hand, if you stay within the expected lanes on something like Linux, you will rarely experience issues.

This coincides well with the definition of quality being about fitness for a function. For the functions it was being used for, Linux was very fit. NT supported a wider variety of functions, but was less fit for each of them and thus perceived as being of lower quality.

The moral of the tale: Quality is not the absence of defects. Quality is the absence of the right kinds of defects. The way to achieve higher quality is not to scour the code for every possible defect. That may even have a negative effect on quality due to randomization. Instead, it is better to understand the user patterns and ensure that those are free of bugs. Data Driven Quality gives the team a chance to understand both these use patterns and what bugs are impeding them.

Monday, June 16, 2014

Data Driven Quality

My last three posts have explained how test lost its way. It evolved from advocates of the user to a highly efficient machine for producing test results, verifying correctness as determined by a specification. Today, test can find itself drowning in a sea of results which aren't correlated with any discernible user activity. If only there were a way to put the user back at the center, scale testing, and be able to handle the deluge of results. It turns out, there is. The path to a solution has been blazed by our web services brethren. That solution is data. Data driven quality is the 4th wave of testing.

There is a lot to be said for manual testing, but it doesn't scale. It takes too many people, too often. They are too expensive and too easily bored doing the same thing over and over. There is also the problem of representativeness. A tester is not like most of the population. We would need testers from all walks of life to truly represent the audience. Is it possible to hire a tester that represents how my grandmother uses a computer? It turns out, it is. For free. Services do this all the time.

If software can be released to customers early, they will use it. In using it, they will inevitably stumble across all of the important issues. If there were a way to gather and analyze their experiences, much of what test does today could be done away with. This might be called the crowdsourcing of testing. The difficulty is in the collection and analysis.

Big Data and Data Science are the hot buzzwords of the moment. Despite the hype, there is a lot of value to be had in the increased use of data analysis. What were once gut feels or anecdotal decisions can be made using real information. Instead of understanding a few of our customers one at a time, we can understand them by the thousands.

A big web service like Bing ships changes to its software out to a small subset of users and then watches them use the product. If the users stop using the product, or a part of the product, this can indicate a problem. This problem can then be investigated and fixed.

The advantage of this approach is that it represents real users. Each data point is a real person, doing what really matters to them. Because they are using the product for real, they don't get bored. They don't miss bugs. If the product is broken, their behavior will change. That is, if they experience the issue. If they don't, is it really a bug? (more on this in another post). This approach scales. It can cover all types of users. It doesn't cost more as the coverage increases.

Using data aggregated across many users, it should be possible to spot trends and anomalies. It can be as simple as looking at what features are most used, but it can quickly grow from there. Where are users failing to finish a task? What parts of the system don't work in certain geographies? What kinds of changes most improve usage?
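
As a toy example of spotting where users fail to finish a task, here is a funnel computed from hypothetical aggregated step counts:

```python
# Sketch: a simple funnel showing where users drop out of a task.
# Step names and counts are invented; real numbers come from telemetry.
funnel = [
    ("opened_storefront",  100_000),
    ("added_to_cart",       34_000),
    ("started_checkout",    21_000),
    ("completed_purchase",  15_500),
]

prev = None
for step, users in funnel:
    note = "" if prev is None else f"  ({users / prev:.0%} of previous step)"
    print(f"{step:<22}{users:>8,}{note}")
    prev = users
```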

If quality is the fitness of a feature for a particular function, then watching whether customers use a feature, for how long, and in what ways can give us a good sense of quality. By watching users use the product, quality can begin to be driven by data instead of pass/fail rates.

Moving toward data driven quality is not simple. It operates very differently than traditional testing. It will feel uncomfortable at first. It requires new organizational and technical capabilities. But the payoff in the end is high. Software quality will, by definition, improve. If users are driving the testing and the team is fixing issues to increase user engagement, the fitness for the function users demand of software must go up.

Over the next few posts, I will explore some of the changes necessary to start driving quality with data.

Monday, June 9, 2014

Test Has Lost Its Way

In a blog post, Brent Jensen relays a conversation he had with an executive mentor. In this conversation, his mentor told him that, "Test doesn't understand the customer." When I read this, my initial reaction was the same as Brent's: "No way!" If test is focused on one thing, it is our customer. Then as I thought about it more, my second reaction was also the same as Brent's: "I agree. Test no longer cares about the customer." We care about correctness over quality.

Let me give an example. This example goes way back (probably to XP), but similar things happen even today. We do a lot of self-hosting at Microsoft. That means we use an early version of the software in our day-to-day work. Too often, I am involved in a conversation like the following.

Me: I can't use the arrow keys to navigate a DVD menu in Windows Media Player.

Them: Yes you can. You just need to tab until the menu is selected.

Me: That works, but it takes (literally) 20 presses of the tab key and even then only a faint grey box indicates that I am in the right place.

Them: That's what the spec says. It's by design.

Me: It's by bad design.

What happened here? I was trying out the feature from the standpoint of a user. It wasn't very fit for my purposes because if I were disabled or my mouse was broken, I couldn't navigate the menu on a DVD. The person I was interacting with was focused too much on the correctness of the feature and not enough about the context in which it was going to be used. Without paying attention to context, quality is impossible to determine.  Put another way, if I don’t understand how the behavior feels to the user, I can’t claim it is high quality.

How did we get to this point? The long version is in my post on the history of testing. Here is the condensed version. In the beginning, developers made software and then just threw it over the wall to testers. We had no specifications. We didn't know how things were supposed to work. We thus had to try to use the software as we imagined an end user would. This worked great for finding issues, but it didn't scale. When we turned to test automation to scale the testing, this required us to better understand the expected cases. This increased the need for detailed specifications from which to drive our tests.

Instead of a process like this:

[Diagram: the user's needs driving testing directly]

We ended up with a process like this:

[Diagram: the user's needs captured in a specification, which then drives testing]

As you can see in the second, the perspective of the user is quickly lost. Rather than verifying that the software meets the needs of the customer, test begins to verify that the software meets the specifications. If quality is indeed fitness for a function, then knowledge of the user's needs is necessary for any determination of quality. If the user's needs could be completely captured in the specification, this might not be a problem, but in practice it is.

First, I have yet to see the specification that totally captures the perspective of the user. Rather, the user's perspective serves only as an input to the specification. As changes happen, ambiguities are resolved, or the feature is scoped (reduced), the needs of the user are forgotten.

Second, it is impossible to test everything. A completely thorough specification would be nearly impossible to test in a reasonable amount of time. There would be too many combinations. Some would have to be prioritized, but without the user perspective, which ones?

Reality isn't quite this bleak. The user is not totally forgotten. They come up often in the conversation, but they are not systematically present. They are easily forgotten for periods of time. The matching of fitness to function may come up for one question or even one feature, but not for another.

Test has lost its way. We started as the advocate for the user and have become the advocate for the specification. We too often blindly represent the specification over the needs of real people. Going back to our roots doesn't solve the problem.

Even in the best case scenario, where the specification accounts for the needs of the user and the testers always keep the user forefront in their mind, the system breaks down. Who is this mythical user? It isn't the tester. We colloquially call this the "98052" problem. 98052 is the zip code for Redmond, where Microsoft is located. The people that live there aren't representative of most of the other zip codes in the country or the world.

Sometimes we create aggregated users called personas to help us think about other users. This works up to a point, but "Maggie" is not a real user. She doesn't have real needs. I can't ask her anything and I can't put her in a user study. Instead Maggie represents tens of thousands of users, all with slightly different needs.

Going back to our roots with manual testing also brings back the scale problem. We really can't go home. Home doesn't exist any more. Is there a way forward that puts the users, real users, back at the center, but scales to the needs of a modern software process? I will tackle one possible solution in my next post.

Wednesday, June 4, 2014

A Brief History of Test

In the exploration of quality, it is important to understand where software testing came from, where it is today, and where it is heading. We can then compare this trajectory to the goal of ensuring quality and see whether a correction is necessary or if we're going the right direction.


I have been involved in software testing for the past 16 and a half years. To give you some perspective, when I started at Microsoft, we were just shipping Windows 98. This gives me a long view of the history of software testing. What I give below is my take on that history. Others may have experienced it differently.


There are three major waves of software testing and we're beginning to approach a 4th. The first wave was manual testing. The second wave was automated testing. The third wave was that of tooling. It is important to note that each wave does not fully supplant the previous wave. There is still a need for manual testing even in the tooling phase. There is a need for automated testing even in the coming 4th wave.


The first wave was manual testing. Sometimes this is also called exploratory testing. It is often carried out by people carrying the Quality Assurance (QA) or Software Test Engineer (STE) title. Manual testing is just what it sounds like. It is people sitting in front of a keyboard or mouse and using the product. In its best form it is freeform and exploratory in nature: a tester trying to understand the user and carrying out operations trying to break the software. This is where the lore of testing comes from. The uber-tester who can find the bug no one else can imagine. In its worst form, this is the rote repetition of the same steps, the same levels, over and over again. This is also the source of legends, but not good ones. At its best, this form of testing is highly connected to quality. It is all about the user and his (or her) experience with the product.


Manual testing can produce great user experiences. As I understand it from friends who have gone there, this is the primary method of testing at Apple. The problem is, manual testing doesn't scale. It can also be mind-numbing. In the era of continuous integration and daily builds, the same tests have to be carried out each day. It becomes less about exploring and more about repeating. Manual testing is great for finding the bugs initially, but it is a terribly inefficient regression testing model.


It gets even worse when it comes time for software maintenance. At Microsoft, we support our software for a long time. Sometimes a really long time. Windows XP shipped in 2001 and is just now becoming unsupported. Consider for a moment how many testers it would take to test XP. I'll just make up a number and say it was 500. It wasn't, but that's good enough for a thought exercise. Every time you release a fix for XP, you need 500 people to run through all the tests to make sure nothing was broken. But the 500 original testers are probably working on Vista so you need an additional 500 people for sustained engineering. Add Windows 7, Windows 8, and Windows 8.1 and you now need 2,500 people testing the OS. Most are running the same regression tests every time which is not exciting so you end up losing all your good people. It just doesn't work.


This leads us to the second wave. The first wave involved hundreds of people pressing the same buttons every day. It turns out that computers are really good at repetitive tasks. They don’t get bored and quit. They don't get distracted and miss things. Thus was born test automation. Test automation in its most basic form is writing programs to do all of the things that manual testers do. They can even do things testers really can't. Manual testing is great for a user interface. It's hard to manually test an SDK. It turns out that it is easy to write software that can exercise APIs. This magic elixir allowed teams to cover much more of the product every day. We fully drank the Kool-Aid.


We set off to automate everything. There was nothing automation couldn't do and so all STEs were let go. Everyone became a Software Design Engineer in Test (SDET). This is a developer who, rather than writing the operating system, writes the test automation. Some of this work is mundane: calling the Foo API with a 1, a 25, and a MAX_INT. Other parts can be quite challenging. Consider how you would test the audio playback APIs. It is not enough to merely call the APIs and look at the return code. How do you know the right sound was played, at the right volume, and without crackling? Hint: it's time to break out the FFT.
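
To make the audio example concrete, here is a rough sketch of the FFT idea using NumPy: take the spectrum of the captured samples, check that the dominant frequency is the expected tone, and check that the energy outside that tone (crackle, distortion) stays small. A 440 Hz tone is synthesized so the example is self-contained; a real harness would capture from a loopback device, and the tolerances here are made up.

```python
# Sketch: verifying audio playback with an FFT instead of just a return code.
import numpy as np

SAMPLE_RATE = 48_000
EXPECTED_HZ = 440.0

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE              # one second of audio
samples = 0.5 * np.sin(2 * np.pi * EXPECTED_HZ * t)   # stand-in for captured output

spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=1 / SAMPLE_RATE)

peak_hz = freqs[np.argmax(spectrum)]
signal_power = spectrum.max() ** 2
noise_power = (spectrum ** 2).sum() - signal_power

assert abs(peak_hz - EXPECTED_HZ) < 1.0, "wrong tone played"
assert noise_power < 0.01 * signal_power, "audible distortion or crackle"
print(f"dominant frequency {peak_hz:.1f} Hz, clean playback")
```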


Not everything is kittens and roses in the world of automation. Machines are great at doing what they are told to do. They don't take breaks or demand higher pay. However, they do only what they are told to do. They will only report bugs they are told to look for. One of my favorite bugs to talk about involved media player. In one internal build, every time you clicked next track on a CD (remember those things?) the volume would jump to maximum. While a test could be concocted to look for this, it never would be. Test automation happily reported a pass because indeed the next track started playing. It turns out that once you have run a test application for the first time, it is done finding new bugs. It can find regressions, but it can't notice the bug it missed yesterday.


This points toward the second problem with test automation. It requires very complete specifications. Because the tests can't find any bugs they weren't programmed to find, they need to be programmed to find everything. This requires a lot more up-front planning so the tests can cover the full gamut of the system under test. This heavy reliance upon specifications begins to distance testing from the needs of the user and thus we move away from testing quality and toward testing adherence to a spec.


The third problem with test automation is that it can generate too much data. Machines are happy to churn out results every day and every build. I knew teams whose tests would generate millions of results for each build. Staying on top of this becomes a full time job for many people. Is this failure a bug in the test? Is it a bug in the product? Was it an environmental issue (network down, bad installation, server unavailable)?


One other problem that can happen is that the work can grow faster than test developers can keep up. It is easy for a developer to write a little code which creates a massive amount of new surface area. Consider the humble decorator pattern. If I have 4 UI objects in the system, the tester needs to write 4 sets of tests. Now if the developer creates a decorator which can apply to each of the objects, he only has to write one unit of code to make this work. This is the advantage of the pattern. However, the tester now has to write 4 more sets of tests, one for each decorated object, and each additional decorator multiplies the combinations again. The test surface is growing geometrically compared to the code dev is writing. This is unsustainable for very long.
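
The arithmetic is easy to see in a sketch. In the hypothetical example below, the developer adds one small decorator class, and the tester immediately owns one new test per (decorator, object) combination:

```python
# Sketch: one small piece of dev code multiplies the test surface.
class Widget:
    def render(self) -> str: ...

class Button(Widget):
    def render(self): return "button"

class Slider(Widget):
    def render(self): return "slider"

class Checkbox(Widget):
    def render(self): return "checkbox"

class Label(Widget):
    def render(self): return "label"

class Bordered(Widget):                     # the one new unit of dev code
    def __init__(self, inner: Widget): self.inner = inner
    def render(self): return f"[{self.inner.render()}]"

widgets = [Button(), Slider(), Checkbox(), Label()]

# Test surface before the decorator: 4 render tests.
for w in widgets:
    assert w.render()

# Test surface after: 4 more combinations the tester now has to cover.
for w in widgets:
    assert Bordered(w).render().startswith("[")

print("4 base tests + 4 decorator combinations, from one small dev change")
```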


This brings us to the third wave of testing. This wave involves writing software that writes tests. I call this the tooling phase. Rather than directly writing a test case, it is possible to write a tool that, given some kind of specification, can emit the relevant test cases automatically.  Model Based Testing is one form of this tooling. The advantage of this sort of tooling is that it can adapt to changes. Dev added one decorator to the system? Add one new definition in your model and tests just happen.
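
Here is a minimal sketch of the idea using a toy model of a media player: the model is a table of states and transitions, and a generator walks it to emit test sequences. The model itself is hypothetical; the point is that adding one transition to the table is all it takes for new cases to appear.

```python
# Sketch: model-based test generation from a small state-transition table.
MODEL = {
    "stopped": {"play": "playing"},
    "playing": {"pause": "paused", "stop": "stopped", "next_track": "playing"},
    "paused":  {"play": "playing", "stop": "stopped"},
}

def generate_cases(start: str, length: int):
    """Yield every action sequence of the given length that the model allows."""
    def walk(state, path):
        if len(path) == length:
            yield path
            return
        for action, nxt in MODEL[state].items():
            yield from walk(nxt, path + [action])
    yield from walk(start, [])

cases = list(generate_cases("stopped", 3))
print(f"{len(cases)} generated test sequences, e.g. {cases[0]}")
# Each sequence would be replayed against the real player, with an oracle
# checking the observed state after every step.
```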


There are some downsides to the tooling approach. In fact, there are enough downsides that I've never experienced a team that adopted it for all or even most of their testing. They probably exist, but they aren't common. At most, this tooling approach was used to supplement other testing. The first downside is the oracle problem. It is easy enough to create a model of the system under test and generate hundreds or thousands of test cases. It is another thing entirely to understand which of these test cases pass and which fail. There are some problem domains where this is a tractable problem. Each combination or end state has an easily discernible outcome. In others, it can be exceptionally difficult without re-creating all of the logic of the system under test. The second is that the failures can be very hard to reason about in terms of the user. When the Bar API gets this and that parameter while in this state, it produces this erroneous result. Okay. But when would that ever happen in the real world?


Tooling approaches can solve the static nature of testing mentioned above. Because it is mathematically impractical to do a complete search of the state space of any non-trivial application or API, we are always limited to a subset of all possible states for testing. In traditional automation, this subset is fixed. In the tooling approach, the subset can be modified each time with random seeds, longer exploration times, or varying weights. This means each run can expose new bugs. This can be used to good effect. Given some metadata about an API and rules on how to call it, a tool can be created to automatically explore the API surface. We did this to good effect in Windows 8 when testing the Windows Runtime surface.


Sometimes it can have unexpected and even comical outcomes. I recall a story told to me by a friend. He wrote a tool to explore the .Net APIs and left it to run overnight. The next morning he came in to reams of paper on his desk. It turns out that his tool had discovered the print APIs and managed to drain every sheet of paper from every printer in the building. At Microsoft every print job has a cover sheet with the alias of the person doing the printing so his complicity was readily apparent. Some kind soul had gathered all of his print jobs and placed them in his office.


The tooling approach to testing exacerbates two of the problems of automation. It creates even more test results which then have to be understood by a human. It also moves the testing even further away from our definition of quality. Where is the fitness for a function taken into account in the tooling approach?


There is a problem developing in the trajectory of testing. We, as a discipline, have moved steadily further from the premise of quality. We'll examine this in more detail in the next post and start considering a solution in the one after that.

Thursday, May 29, 2014

What is Quality?




Most of my career so far has focused on software testing in one form or another.  What is testing if not verifying the quality of the object under test?  But what does the word quality really mean?  It is hard to define quality, but I will argue that a good operating definition is fitness for a function.  In the world of software then, the question test should be answering is whether the software is fit for the function at hand.


The book Zen and the Art of Motorcycle Maintenance tackles the question of what quality is head on.  Unfortunately, it doesn't give a clear answer.  The basic conclusion seems to be that quality is out there and it drives our behavior.  It's a little like Plato's theory of Forms.  This is interesting philosophically, but not practically.  There are some parts which are more pragmatic.  One passage sticks out to me.  As might be suspected from the title, there is some discussion of motorcycle maintenance in the book (but not much).  At one point the Narrator character is on a trip with his friend John Sutherland.  The Narrator has an older bike while John has a fancy new one.  The Narrator understands his bike.  John chooses not to learn about his and needs a mechanic to do anything to it.  Quality then is that relationship between the operator and the bike.  The more they understand it and can fix it, the higher quality the relationship.  In other words, the more the person can get out of it without needing to go to someone else, the higher the quality.


Christopher Alexander wrote about architecture, yet he is quite famous in the world of software design.  His books talk about patterns in buildings and spaces and how to apply them to get specific outcomes.  The "Gang of Four" translated his ideas from physical space to the virtual world in their book, Design Patterns.  Alexander is interesting not just in his discussion of patterns, but also of quality.  What makes a design pattern good, in his mind, is its fitness for a purpose.  He says, "The form is the part of the world over which we have control, and which we decide to shape while leaving the rest of the world as it is. The context is that part of the world which puts demands on this form; anything in the world that makes demands of the form is context. Fitness is the relation of mutual acceptability between these two." (Notes on the Synthesis of Form)


Both authors are making an argument that quality then is not something that can be determined in a vacuum.  One cannot merely look at a device or a piece of software and make an assessment of quality.  In the motorcycle case, the fancy bike would probably look to be of higher quality.  It was the relationship with the owner that made the chopper of higher quality.  With software, it is similar.  One must understand the users and the use model before a determination of quality can be made.


Let's look at a few examples.  Think of the iPhone.  It has a simple interface.  While it has gained complexity over time (it's nearly on version 8!), it is still quite limited compared to a traditional computer.  It has constrained input options, preferring only touch.  The buttons are big and the screens not dense.  Because of this, the apps tend to be simple and single-purpose.  There is no multitasking to be found.  It didn't even have cut and paste when it appeared on the scene.  Yet the iPhone is considered to be high quality.  Its audience doesn't expect to be doing word processing on it.  They want to check e-mail, "do" Facebook, and play games.  It is exquisitely suited for this purpose.


At the other extreme, consider a workstation running Autocad.  Autocad has thousands of functions, many windows open at once, and requires extreme amounts of processing power and memory.  Its user interface is quite cluttered compared to that of most iPhone apps and it is not easily approachable by mere mortals.  Yet it too is considered high quality.  Its users expect power over everything else.  They need the ability to render in 3D and model physics.  It serves this purpose better than anything else in the market.  The simplicity and prettiness of the iPhone interface limits utility and is unwanted in this domain.  The domain of CAD is one of capability over beauty and efficiency over discoverability.


Too often in the world of software we ignore this synthesis of user and device.  Instead we focus on correctness.  The quality of the software is judged based on how correctly it implements a spec.  This is an easier definition to interpret.  It is more precise.  There is a right and a wrong answer.  Either the software matches the spec or it does not.  With a fitness definition, things are more murky.  Who is the arbiter of good fit?  How bad does it need to be before it is a bug?  It can be alluring to follow the sirens of precision, but that comes at a cost.  I will talk about that cost next time.


 




Wednesday, September 28, 2011

Pruning the Decision Tree in Test

Yesterday I wrote about the need to reduce the number of things a project attempted to do in order to deliver a great product.  Too many seemingly good ideas can make a product late or fragmented or both.  The same is true of testing a product.  Great testing is more about deciding what not to test than deciding what to test.

There is never enough time to test everything about a product.  This isn’t just the fault of marketing which has a go-to-market date in mind.  It is a physical reality.  To thoroughly test a product requires traversing the entire state tree in each possible combination.  This is analogous to the traveling salesman problem and is thus NP-Complete.  In layman’s terms, this means that there is not enough time to test everything for any non-trivial program.

When someone first starts testing, thinking up test cases is hard.  We often ask potential hires to test something like a telephone or a pop machine.  We are looking for creativity.  Can they think up a lot of interesting test cases?  After some time in the field, however, most people are able to think up a lot more tests than they have time to carry out.  The question then becomes no longer one of inclusion, but one of exclusion.

In Netflix’s case the exclusion was for focus.  This is not the right exclusion criterion for testing.  It is improper to not test the UI so that you can test the backing database.  Instead, the criteria by which tests should be excluded are more complex.  There is no single criterion or set of criteria that works for every project.  Here are some to consider which have wide applicability:

  • Breadth of coverage – Often times it is best to try everything a little rather than some things very deep and others not at all.  Don’t get caught up testing just one part.
  • Scenario coverage – Look for test cases which will intersect the primary use patterns of the users.  If no one is likely to try to put a square inside a circle inside a square, finding a bug in it is not highly valuable.
  • Risk analysis – What areas of the product would be most problematic if they went wrong?  Losing user data is almost always really bad.  Drawing a line one pixel off often is not.  If you have to choose, prefer focusing more on the data than the drawing.  Another important area for many projects is legal or regulatory requirements.  If you have these, make sure to test for them.  It doesn’t matter how well your product works if the customer is not allowed to buy it.
  • Cost of servicing – If forced to choose, spend more time on the portions that will be more difficult or costly to service if a bug shows up in the field.  For instance, in a client-server architecture, it is usually easier to service the server because it is in one spot, under your control, rather than trying to go to hundreds or thousands of computers to update the client software.
  • Testing cost – While not a good criteria to use by itself, if a test is too expensive to carry out or to automate, perhaps it should be skipped in favor of writing many more tests that are much cheaper. 
  • Incremental gains – How much does this test case add to existing coverage?  It is better to try something wholly new than another slight variation on an existing case.  Thinking in terms of code coverage may help here; a greedy, coverage-based selection is sketched below.  It is usually better to write a case which tests 10 new blocks than one which tests 15 already covered blocks and 2 new ones.  It is very possible that two test cases are both great, but the combination is not.  Choose one.
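
Here is a minimal sketch of that incremental-gains idea: greedily pick the test case that adds the most not-yet-covered blocks, and stop when nothing new is added.  The block IDs and case names are made up; real coverage data would come from an instrumentation tool.

```python
# Sketch: greedy selection of test cases by incremental coverage.
candidate_tests = {
    "smoke_ui":       {1, 2, 3, 4, 5},
    "deep_db_path":   {10, 11, 12, 13, 14, 15, 16, 17, 18, 19},
    "ui_variation":   {1, 2, 3, 4, 5, 6},       # mostly overlaps smoke_ui
    "error_handling": {20, 21, 22},
}

selected, covered = [], set()
while candidate_tests:
    # Pick the case that adds the most blocks not yet covered.
    name, blocks = max(candidate_tests.items(), key=lambda kv: len(kv[1] - covered))
    gain = len(blocks - covered)
    if gain == 0:
        break                                    # remaining cases add nothing new
    selected.append((name, gain))
    covered |= blocks
    del candidate_tests[name]

for name, gain in selected:
    print(f"keep {name:<15} (+{gain} new blocks)")
```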

There are many more criteria that could be used.  The important point is to have criteria and to make intentional decisions.  A test planning approach that merely says, “What are the ways we can test this product?” is insufficient.  It will generate too many test cases, some of which will never be carried out due to time or cost.  It is important to prune the decision tree up front so that the most important cases are done and the least important ones are left behind.  Do this up front, in the test spec, not on the fly as resources dwindle.

Tuesday, March 16, 2010

Pass Rates Don’t Matter

It seems obvious that test pass rates are important.  The higher the pass rate, the better quality the product.  The lower the pass rate, the more known issues there are and the worse the quality of the product.  It then follows that teams should drive their pass rates to be high.  I’ve shipped many products where the exit criteria included some specified pass rate—usually 95% passing or higher.  For most of my career I agreed with that logic.  I was wrong.  I have come to understand that pass rates are irrelevant.  Pass rates don’t tell you the true state of the product.  It is important which bugs remain in the product, but pass rates don’t actually show this.

The typical argument for pass rates is that they represent the quality of the product.  This makes the assumption that the tests represent the ideal product.  If they all passed, the product would be error-free (or free enough).  Each case is then an important aspect of this ideal state and any deviation from 100% pass is a failure to achieve the ideal.  This isn’t true though.  How many times have you shipped a product with 100% passing tests?  Why?  You probably rationalized that certain failures were not important.  You were probably right.  Not every case represents this ideal state.  Consider a test that calls a COM API and checks the return result.  Suppose you pass in a bad argument and the return result is E_FAIL.  Is that a pass?  Perhaps.  A lot of testers would fail this because it wasn’t E_INVALIDARG.  Fair point.  It should be.  Would you stop the product from shipping because of this though?  Perhaps not.  The reality is that not all cases are important.  Not all cases represent whether the product is ready to ship or not.

Another argument is that 100% passing is a bright line that is easy to see.  Anything less is hard to see.  Did we have 871 or 872 passing tests yesterday?  If it was 871 and today is 871, are they the same 129 failures?  Determining this can be hard and it’s a good way to miss a bug.  It is easy to remember that everything passed yesterday and no bugs are hiding in the 0 failures.  I’ve made this argument.  It is true as far as it goes, but it only matters if we use humans to interpret the results.  Today we can use programs to analyze the failures automatically and to compare the results from today to those from yesterday.
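
A sketch of the kind of automatic comparison meant here, using nothing more than set differences over failing test names (the names are invented):

```python
# Sketch: comparing today's failures to yesterday's automatically, so nobody
# has to remember whether 871 passing and 871 passing are the same 129 failures.
yesterday_failures = {"play_dvd_menu", "seek_past_end", "volume_after_next_track"}
today_failures     = {"play_dvd_menu", "volume_after_next_track", "mute_during_seek"}

new_failures   = today_failures - yesterday_failures
fixed_or_flaky = yesterday_failures - today_failures
still_failing  = today_failures & yesterday_failures

print("new failures to investigate:", sorted(new_failures))
print("no longer failing (fixed or flaky):", sorted(fixed_or_flaky))
print("known failures carried over:", sorted(still_failing))
```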

As soon as the line is not 100% passing, rates do not matter.  There is no inherent difference in the quality of a product with 99% passing tests and the quality of a product with 80% passing tests.  “Really?” you say.  “Isn’t there a difference of 19 points?  That’s a lot of test cases.”  Yes, that is true.  But how valuable are those cases?  Imagine a test suite with 100 test cases, only one of which touches on some core functionality.  If that case fails, you have a 99% passing rate.  You also don’t have a product that should ship.  On the other hand, imagine a test suite for the same software with 1000 cases.  Imagine that the testers were much more zealous and coded 200 cases that intersected that one bug.  Perhaps it was in some activation code.  These two pass rates then represent the exact same situation.  The pass rate does not correlate with quality.  Likewise one could imagine a test suite of 1000 cases where 200 of the failures were bugs in the tests.  That is an 80% pass rate and a shippable product.

The critical takeaway is that bugs matter, not tests.  Failing tests represent bugs, but not equally.  There is no way to determine, from a pass rate, how important the failures are.  Are they the “wrong return result” sort or the “your api won’t activate” sort?  You would hold the product for the 2nd, but not the first.  Tests pass/fail rates do not provide the critical context about what is failing and without the context, it cannot be known whether the product should ship or not.  Test cases are a means to an end.  They are not the end in themselves.  Test cases are merely a way to reveal the defects in a product.  After they do so, their utility is gone.  The defects (bugs) become the critical information.  Rather than worrying about pass rates, it is better to worry about how many critical bugs are left.  When all of the critical bugs are fixed, it is time to ship the product whether the pass rate is high or low.

All that being said, there is some utility in driving up pass rates.  Failing cases can mask real failures.  Much like code coverage, the absolute pass rate does not matter, but the act of driving the pass rate up can yield benefits.

Wednesday, May 27, 2009

Five Books To Read If You Want My Job

This came out of a conversation I had today with a few other test leads.  The question was, “What are the top 5 books you should read if you want my job?”  My job in this case being that of a test development lead.  At Microsoft that means I lead a team (or teams) of people whose job it is to write software which automatically tests the product. 

  • Behind Closed Doors by Johanna Rothman – One of the best books on practical management that I’ve run across.  1:1’s, managing by walking around, etc.
  • The Practice of Programming by Kernighan and Pike – Similar to Code Complete but a lot more succinct.  How to be a good developer.  Even if you don’t develop, you have to help your team do so.
  • Design Patterns by Gamma et al – Understand how to construct well factored software.
  • How to Break Software by James Whittaker – The best practical guide to software testing.  No egg headed notions here.  Only ideas that work.  I’ve heard that How We Test Software at Microsoft is a good alternative but I haven’t read it yet.
  • Smart, and Gets Things Done by Joel Spolsky – How great developers think and how to recruit them.  Get and retain a great team.

 

This is not an exhaustive list.  There is a lot more to learn than what is represented in these books, but these will touch on the essentials.  If you have additional suggestions, please leave them in the comments.

Friday, February 13, 2009

Why We Conduct Bug Bashes

My team recently finished what we call a “bug bash.”  That is, a period of time where we tell all of the test developers to put down their compilers and simply play with the product.  Usually a bug bash lasts a few days.  This particular one was 2 days long.  We often make a competition out of it and track bug opened numbers across the team with bragging rights or even prizes for those who come out on the top of the list.

Bug bashes are a time when everyone on the team is asked to spend all of their time conducting exploratory testing.  Sometimes managers will influence the direction by assigning people end-user scenarios or features to look at.  Other times the team is just let go and told to explore wherever they desire.  Experience has shown me that some direction can be good.  Assigning people to explore an area they don’t usually work on gets new eyes on the product and with new eyes come new use patterns and new bugs.  Recently I’ve also discovered that it can be helpful to track where people have spent their time.  During our last bug bash we created a list of areas that should be explored and had people sign off when they had investigated them.  This gives us a much better sense of just what the coverage looked like and allows us to ensure all areas received attention.

Conducting a bug bash can be expensive.  There is a lot of work to get done and putting everything else aside for 2 days adds up to a lot of other work getting pushed off.  Why do we do this?  What is the return on the investment?  There are three primary reasons that come to mind:

We have found empirically that a bug bash flushes out a lot of bugs in a short period of time.  Our most recent bug bash saw the number of bugs opened jump to 400% of the daily average.  This is important because we frontload the finding of the bugs.  The earlier we know about bugs, the more likely we are to be able to fix them.  Knowing about more bugs also helps us make more informed triage decisions.

The second reason we conduct bug bashes is because they are likely to find bugs on the seams.  Test automation can only find certain kinds of bugs.  Exploratory testing is a much better way to find issues on the seams—where functional units join up.  Sometimes these bugs are the most critical.  Imagine if we could have found the Win7 MP3 bug or the interaction between playing audio and network throughput before shipping the respective products.  These are the sort of issues highly unlikely to be found in test automation but which can be found through exploratory testing.  We obviously don’t find all such issues through bug bashes, but we do find a lot.

The final reason we run bug bashes is to get a sense of the product.  Most of the time we spend our days focused on one small part of the operating system or another.  It’s hard to get a sense for the state of the forest while staring at individual trees.  After spending several days conducting exploratory tests on the product, we can get a much better sense whether the overall product is doing well or if there are serious issues.

Friday, January 9, 2009

James Whittaker Netcast

James Whittaker is the author of books like How To Break Software.  He ran one of the few university-level testing programs at Florida Tech.  He's now at Microsoft and helping Visual Studio become better at testing.  The guys at .Net Rocks caught up with him for an interview.  James explains what he thinks the future of testing is and what's right and wrong with testing at Microsoft.  Put this on your Zune/iPod.  It's worth the hour.

Friday, October 17, 2008

The Five Why's and Testing Software

Toyota was able to eclipse the makers of American cars in part due to its production and development systems.  The system has been popularized under the rubric of "Lean" techniques.  Among the tenets of the Lean advocates is asking the "Five Why's."  These are not the W's of journalism:  Who, What, Why, Where, and When?  They are not specific questions even.  Asking five why's means asking why 5 times.  Why was the production of cars down?  Because there were missing screws.  Why were there missing screws?  Because the production robots were bumping them.  Why were the robots bumping?  Because the programming was faulty.  Why was the programming faulty?  Because the programmer didn't take into account a metric->English conversion.  Why didn't the programmer consider conversions?  Because... 

There is no magic in the number 5.  It could be 4 or 6.  The important thing is to keep asking why until the root cause is understood and fixed.  Fixing anything else is just alleviating the symptoms of a deeper problem.  Not solving the root problem means it will likely cause other problems later and more time will be wasted.

How does this apply to testing?  It goes to the core of the role of test in a product team.  Think about what happens when your team finds a bug in your software.  What do you do?  Hopefully someone on the test team files a bug report, and either the tester or the developer root-causes the problem and fixes it.  This usually means determining the line of source code causing the issue and changing it.  Problem solved.  Or is it?  Why was that line of source code incorrect in the first place?  We rarely--if ever--ask.

What if we began to view our role as testers as eliminating bugs from the system instead of just from the source code?  In that case we would ask what coding techniques or early testing systems the team could employ to stop the bug from entering the source code at all, or at least to detect it while the code was still under development (better unit testing might be a solution in this category).
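
To make that concrete, here is a minimal sketch of the kind of unit test that could have stopped the conversion bug in the Toyota story before it ever reached a robot.  The converter class and its values are hypothetical, purely for illustration; the test is written NUnit-style in C#.

    using NUnit.Framework;

    // Hypothetical conversion helper standing in for the robot-programming code.
    public static class UnitConverter
    {
        public static double MillimetersToInches(double millimeters)
        {
            return millimeters / 25.4;  // the constant an early unit test pins down
        }
    }

    [TestFixture]
    public class UnitConverterTests
    {
        [Test]
        public void MillimetersToInches_ConvertsKnownValue()
        {
            // 254 mm is exactly 10 inches.  A missing or wrong conversion fails
            // here, during development, instead of on the production line.
            Assert.AreEqual(10.0, UnitConverter.MillimetersToInches(254.0), 0.0001);
        }
    }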

Thursday, October 2, 2008

James Whittaker on Why MS Software "Sucks" Despite Our Testing

A friend turned me on to this post by James Whittaker.  I didn't know he had a blog so now I'm excited to read it.  He has a lot of really interesting things to say on testing so I encourage you to read his blog (now linked on the left) if you are intrigued by testing.

Microsoft prides itself on the advanced state of its testing operations.  This leads to the inevitable question, "If Microsoft is so good at testing, why does your software suck?"  James Whittaker was once a person who asked this question and now that he's at Microsoft he is in a good position to try to answer it.

James gives basically three reasons:

  • Microsoft's software is really complex.  Windows, Exchange, Office, etc. are really, really big projects.
  • Microsoft's software is used by a whole lot of people.  Eric Raymond once made the comment that all bugs are shallow if you have enough eyes.  This applies to closed source software just as much as open source.  Within the first few days of a release of Microsoft software, millions of people are using it.  Windows has an install base in the hundreds of millions.  With so many people looking, every bug is likely to be hit by someone.
  • Microsoft testers are not involved early enough in the process.  This varies throughout the company, but there is still a lot of room for improvement.

Friday, September 26, 2008

Test Suite Granularity Matters

I just read a very interesting research paper entitled, "The Impact of Test Suite Granularity on the Cost-Effectiveness of Regression Testing" by Gregg Rothermel et al.  In it the authors examine the impact of test suite granularity on several metrics.  The two most interesting are the impacts on running time and the impact on bug finding.  In both cases, they found that larger test suites were better than small ones.

When writing tests, a decision must be made about how to organize them.  The paper makes a distinction between test suites and test cases.  A test case is a sequence of inputs and the expected result.  A test suite is a set of test cases.  How much should each test suite accomplish?  There is a continuum, but the endpoints are making each point of failure a standalone suite or combining many points of failure into a single suite.

The argument for very granular test suites (one case or point of failure per suite) is that they can be better tracked and analyzed.  The paper examines the efficacy of different techniques for restricting the number of suites run in a given regression pass and finds that granular suites can be reduced more effectively.  However, even the time saved by aggressive reduction did not offset the extra time granular suites take to run; grouping test cases into larger suites simply makes them faster.  Without reduction, the granular cases in the study ran almost 15 times slower than the larger suites.  With reduction, this improved to running only 6 times slower.

Why is this?  Mostly it is because of overhead.  Depending on how the test system launches tests, there is a cost to each test suite being launched.  In a local test harness like nUnit, this cost is small but can add up over a lot of cases.  In a network-based system, the cost is large.  There is also the cost of setup.  Consider an example from my days of DVD testing.  Testing a particular function of a DVD decoder requires spinning up a DVD and navigating to the right title and chapter.  If this can be done once and many test cases executed, the overhead is amortized across all of the cases in the suite.  On the other hand, if each case is its own suite, the overhead is paid again for every case.
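
To make the overhead argument concrete, here is a minimal NUnit-style sketch of the DVD scenario.  The DvdPlayer helper and its methods are hypothetical; the point is that the expensive spin-up happens once per suite, so every case added to the suite shares it.  If each case were its own suite, that setup cost would be paid over and over.

    using NUnit.Framework;

    [TestFixture]
    public class DvdChapterDecodingTests
    {
        private DvdPlayer _player;  // hypothetical test helper

        [TestFixtureSetUp]  // runs once for the whole suite
        public void SpinUpDisc()
        {
            // Pay the expensive overhead a single time: spin up the disc and
            // navigate to the title and chapter under test.
            _player = new DvdPlayer();
            _player.LoadDisc(@"D:\");
            _player.NavigateTo(1, 3);  // title 1, chapter 3
        }

        [Test]
        public void DecodesNextFrame()
        {
            Assert.IsTrue(_player.DecodeNextFrame());
        }

        [Test]
        public void ReportsCurrentChapter()
        {
            Assert.AreEqual(3, _player.CurrentChapter);
        }

        [Test]
        public void StepFrameAdvancesPosition()
        {
            long before = _player.Position;
            _player.StepFrame();
            Assert.Greater(_player.Position, before);
        }
    }

Each [Test] is still tracked and reported individually, which is what keeps this kind of grouping manageable in practice.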

Perhaps more interesting, however, is that the study found very granular test suites actually missed bugs--sometimes as many as 50% of them.  Why?  Because less granular suites traverse more program state than highly granular ones and are thus more likely to find bugs.

It is important to note that there are diminishing returns on both fronts.  It is not wise to write all of your test cases in one giant suite.  Result tracking does become a problem.  It can be hard to differentiate bugs which happen in the same suite.  After a certain size, the overhead costs are sufficiently amortized and enough states traversed that the benefits of a bigger suite become negligible.

I have had first-hand experience writing tests of both sorts.  I can confirm that we have found bugs in large test suites that were caused by an interaction between cases.  These would have been missed by granular execution.  I have also seen the immense waste of time that accompanies granular cases.  Not mentioned in the study is also the fact that granular cases tend to require a lot more maintenance time. 

My rule of thumb is to create differentiated test cases for most instances but then to utilize a test harness that allows them to be all run in one instance of that harness.  This gets the benefits of a large test suite without many of the side effects of putting too much into one case.  It amortizes program startup, device enumeration, etc. but still allows for more precise tracking and easier reproduction of bugs.  If there is a lot of overhead, such as the DVD case mentioned above, test cases should be merged or otherwise structured so as not to pay the high overhead each time.

Thursday, August 7, 2008

Test Code Must Be As Solid As Dev Code

All good development projects follow certain basic practices to ensure code quality.  They use source control, get code reviewed, build daily, etc.  Unfortunately, sometimes even when the shipping product follows these practices, the test team doesn't.  This is true even here at Microsoft.  It shouldn't be the case, however.  Test code needs to be just as good as dev code.

First, flaky test code can make determining the source of an error difficult.  Tests that cannot be trusted make it hard to convince developers to fix issues.  No one wants to believe there is a bug in their code, so a flaky test becomes an easy scapegoat.

Second, spurious failures take time to triage.  Test cases that fall over because they are unstable will take a lot of time to maintain.  This is time that cannot be spent writing new test automation or testing new corners of the product.

Finally, poorly written test code will hide bugs in the product.  If test code crashes, any bugs in the product past that point will be missed.  Similarly, poor test code may not execute as expected.  I've seen test code that returns a pass too early and never executes much of the intended test case.
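
As an illustration of that last failure mode, here is the kind of pattern I mean, sketched in C# against a made-up product API.  If the early return fires, the interesting half of the case never runs and no failure is ever reported:

    using NUnit.Framework;

    [TestFixture]
    public class ImageRotationTests
    {
        [Test]
        public void RotateImage_Rotates90Degrees()
        {
            var image = ImageLibrary.Load("sample.jpg");  // hypothetical product API
            if (image == null)
            {
                return;  // BUG: the test silently "passes" without testing anything
            }

            image.Rotate(90);
            Assert.AreEqual(90, image.Orientation);
        }
    }

The fix is to fail loudly instead--Assert.IsNotNull(image) or Assert.Fail--so an environmental problem shows up as a failure someone triages rather than a phantom pass.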

To ensure that test code is high quality, it is important to follow similar procedures to what development follows (or should be following) when checking in their code.  This includes getting all non-trivial changes code reviewed, putting all changes in source control, making test code part of the daily build (you do have a daily build for your product don't you?), and using static verification tools like PCLint, high warning levels, or the code analysis built into Visual Studio.

Wednesday, June 4, 2008

Test For Failure, Not Success

We recently went through a round of test spec reviews on my team.  Having read a good number of test specs in a short period of time, I came to a realization.  It is imperative to know the failure condition in order to write a good test case.  This is at least as important if not more important than understanding what success looks like.

Too often I saw a test case described by calling out what it would do, but not listing or even implying what the failure would look like.  If a case cannot fail, passing has no meaning.  I might see a case such as (simplified): "call API to sort 1000 pictures by date."  Great.  How is the test going to determine whether the sort took place correctly?
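
To sketch the difference with a hypothetical picture-sorting API: the first case below only exercises the call, while the second can actually fail if the sort is wrong.

    using NUnit.Framework;

    [TestFixture]
    public class SortByDateTests
    {
        // Exercises the API but can never fail on a bad sort:
        [Test]
        public void SortByDate_Runs()
        {
            PictureLibrary.LoadTestSet(1000).SortByDate();  // hypothetical API
        }

        // Has a real failure condition:
        [Test]
        public void SortByDate_OrdersPicturesChronologically()
        {
            var sorted = PictureLibrary.LoadTestSet(1000).SortByDate();
            for (int i = 1; i < sorted.Count; i++)
            {
                // Failure condition: any picture dated earlier than its predecessor.
                Assert.LessOrEqual(sorted[i - 1].DateTaken, sorted[i].DateTaken,
                    "Pictures out of order at index " + i);
            }
        }
    }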

The problem is even more acute in stress or performance cases.  A case such as "push buttons on this UI for 3 days" isn't likely to fail.  Sure, the UI could fault, but what if it doesn't?  What sort of failure is the author intending to find?  Slow reaction time?  Resource leaks?  Drawing issues?  Without calling these out, the test case could be implemented in a manner where failure will never occur.  It won't be paying attention to the right state.  The UI could run slow and the automation not notice.  How slow is too slow anyway?  The tester would feel comfortable that she had covered the stress scenario but in reality, the test adds no new knowledge about the quality of the product. 

Another example:  "Measure the CPU usage when doing X."  This isn't a test case.  There is no failure condition.  Unless there is a threshold over which a failure is recorded, it is merely collecting data.  Data without context is of little value.
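
The same case becomes a real test the moment a threshold is attached.  A minimal sketch; the measurement helper and the 40% budget are made-up placeholders:

    using NUnit.Framework;

    [TestFixture]
    public class PlaybackPerformanceTests
    {
        [Test]
        public void PlaybackCpuUsage_StaysWithinBudget()
        {
            // Hypothetical helper that runs the scenario and samples CPU usage.
            double averageCpuPercent = PerfHarness.MeasureAverageCpu(Scenario.Playback);

            // Without this assertion the case is just data collection.  With it,
            // there is a failure condition and the result means something at triage.
            Assert.LessOrEqual(averageCpuPercent, 40.0,
                "Playback CPU exceeded the 40% budget");
        }
    }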

When coming up with test cases, whether writing them down in a test spec or inventing them on the fly while writing or executing tests, consider the failure condition.  Knowing what success looks like is insufficient.  It must also be possible to enumerate what failure looks like.  Only when testing for the failure condition and not finding it does a passing result gain value.

Friday, May 30, 2008

We Need A Better Way To Test

Testing started simply.  Developers would run their code after they wrote it to make sure it worked.  When teams became larger and code more complex, it became apparent that developers could spend more time coding if they left much of the testing to someone else.  People could specialize on developing or testing.  Most testers in the early stages of the profession were manual testers.  They played with the user interface and made sure the right things happened.

This works fine for the first release but after several releases, it becomes very expensive.  Each release has to test not only the new features in the product but also all of the features put into every other version of the product.  What took 5 testers for version 1 takes 20 testers for version 4.  The situation just gets worse as the product ages.  The solution is test automation.  Take the work people do over and over again and let the computer do that work.  There is a limit to the utility of this, but I've spoken of that elsewhere and it doesn't need to be repeated here.  With sufficiently skilled testers, most products can be tested in an automated fashion.  Once a test is automated, the cost of running it every week or even every day becomes negligible. 

As computer programs became more complex over time, the old testing paradigm didn't scale and a new paradigm--automated testing--had to be found.  There is, I think, a new paradigm shift coming.  Most test automation today is the work of skilled artisans.  Programmers examine the interfaces of the product they are testing and craft code to exercise it in interesting and meaningful ways.  Depending on the type of code being worked on, a workforce of 1:1 testers to devs can usually keep up.  This was true at one point.  Today it is only somewhat true and tomorrow it will be even less true.  Some day, it will be false.  What has changed?  Developers are leveraging better programming models such as object-oriented code, larger code libraries, greater code re-use, and more efficient languages to get more done with less code.  Unfortunately, this merely increases the surface area for testers to have to cover.  Imagine, if you will, a circle.  When a developer is able to create 1 unit of code (r=1), the area a tester must cover is only 3.14.  When the developer uses tools to increase his output and the radius stretches to 2, the tester must now cover an area of 12.56.  The surface needing to be tested grows with the square of the developer's productivity.  Using the same programming models as the developers will not allow test to keep up.  In the circle example, a 2x boost in tester performance would still cover only half of the circle.

Is test doomed?  Is there any way to keep up, or are we destined to be outpaced by development and to need larger and larger teams of test developers just to keep pace?  The solution to this problem has the same roots as the solution to the manual testing problem: leverage the computer to do more work on behalf of the tester.  It will soon be too expensive to hand-craft test cases for each function call and the set of parameters it entails.  Writing code one test case at a time just doesn't scale--even with newer tools.  In the near future, it will be important to leverage the computer to write test cases by itself.  Can this be done?  Work is already beginning, but it is just in its infancy.  The tools and practices that will make this a widespread reality likely do not exist today--certainly not in a readily consumed form.
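
One crude way to get a taste of this today is to let the computer generate the inputs and have the test check an invariant instead of hand-writing each case.  Model-based and property-based tools go much further, but even a random-input sketch like the one below (written NUnit-style against a hypothetical ConfigParser API) exercises combinations no one would write by hand:

    using System;
    using NUnit.Framework;

    [TestFixture]
    public class GeneratedInputTests
    {
        [Test]
        public void TryParse_NeverThrowsOnRandomInput()
        {
            var random = new Random(12345);  // fixed seed so any failure reproduces
            for (int i = 0; i < 10000; i++)
            {
                string input = RandomString(random, 64);

                // The invariant under test: malformed input must be rejected
                // gracefully, never by crashing.  The loop writes the "cases".
                Assert.DoesNotThrow(() => ConfigParser.TryParse(input));
            }
        }

        private static string RandomString(Random random, int maxLength)
        {
            var chars = new char[random.Next(maxLength + 1)];
            for (int i = 0; i < chars.Length; i++)
            {
                chars[i] = (char)random.Next(32, 127);  // printable ASCII
            }
            return new string(chars);
        }
    }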

This coming paradigm shift makes testing a very interesting place to be working today.  On the one hand, it can be easy for testers to become overwhelmed with the amount of work asked of them.  On the other hand, the solutions to the problem of how to leverage the computer to test itself are just now being created.  Being in on the ground floor of a new paradigm means the ability to have a tremendous impact on how things will be done for years to come.

Update:  There are a lot of people responding to this post who are unfamiliar with my other writing.  Without some context, it may seem that I'm saying that test automation is the solution to all testing problems and that if we're smart we can automate all of the test generation.  That's not what I'm saying.  What I advocate in this post is only a powerful tool to be used along with all of the others in our toolbox.  If you want some context for my views, check out:

Monday, April 21, 2008

Know That Which You Test

Someone recently related to me his experience using the new Microsoft Robotics Studio.  He loaded it up and proceeded through one of the tutorials.  To make sure he understood, he typed everything in instead of cutting and pasting the sample code.  After doing so, he compiled and ran the results.  It worked!  It did exactly what it was supposed to.  The only problem--he didn't understand anything he had typed.  He went through the process of typing in the lines of code, but didn't understand what they really meant.  Sometimes testers do the same thing.  It is easy to "test" something without actually understanding it.  Doing so is dangerous.  It lulls us into a false sense of security.  We think we've done a good job testing the product when in reality we've only scratched the surface.

Being a good tester requires understanding not just the language we're writing the tests in, but also what is going on under the covers.  Black-box testing can be useful, but without a sense of what is happening inside, testing can only be very naive.  Without breaking the surface, it is nearly impossible to understand what the equivalency classes are.  It is hard to find the corner cases or the places where errors are most likely to happen.  It's also very easy to miss a critical path because it wasn't apparent from the API.

There are three practices which help to remedy this.  First, program in the same language as whatever is being tested.  A person writing tests in C# against a COM interface will have a hard time understanding the infrastructure beneath that interface.  It can also be difficult to understand the frailties of a language different from the one you are coding in.  Each language has different weaknesses.  Thinking about the weaknesses of C++ will blind a person to the weaknesses of Perl.  Second, use code coverage data to help guide testing.  Examining code coverage reports can help uncover places that have been missed.  If possible, measure coverage against each test case and validate that each new case adds to the coverage.  If it doesn't, the case is probably covering the same equivalency class as another test.  Third, and perhaps most importantly, become familiar with the code being tested.  Read the code.  Read the specs.  Talk to the developers.
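
The mechanics of the second practice depend on your coverage tool, but the idea is simple enough to sketch.  Assuming you can export the set of blocks (or lines) hit by each test run as plain data, checking whether a candidate case adds anything is just a set difference:

    using System.Collections.Generic;
    using System.Linq;

    public static class CoverageCheck
    {
        // existingCoverage: union of blocks hit by the suite so far.
        // newCaseCoverage:  blocks hit when running only the candidate case.
        // Returns the blocks the candidate adds; an empty result suggests it is
        // exercising the same equivalency class as an existing test.
        public static ISet<string> BlocksAddedBy(
            ISet<string> existingCoverage, IEnumerable<string> newCaseCoverage)
        {
            return new HashSet<string>(newCaseCoverage.Except(existingCoverage));
        }
    }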

Wednesday, March 5, 2008

What Tests Belong in the BVTs?

BVTs or Build Verification Tests are standard Microsoft parlance for the tests we run every day to ensure that we didn't break anything important with our checkins the day before.  I've previously written about the importance of keeping them clean.  Within the range of tests that consistently pass, which ones should be in the BVT?  BVT test failures should be something you're willing to act on immediately.  In other words, the failures must be important.  Based on that, here are some criteria:

  • Test major scenarios not minor ones.  If major features are failing, they will be fixed right away.  If a minor feature is failing, it should be noted, but may have to wait until later to be fixed.
  • Test majority use cases, not corner cases.  Tests for the interaction of 3 parts shouldn't be in the BVTs.  Tests outside most user scenarios shouldn't be in the BVTs.  While every book on testing says to test the boundary conditions, the BVTs may not be the place to do that.  Instead, pick the most likely to be used values and scenarios.
  • Run "positive" not "negative" tests.  By that I mean, don't send out-of-bounds conditions or invalid values.  These are valid tests and should definitely be run, but not in the BVTs.  An API faulting when sent a null pointer should be fixed, but the fix can wait until next week.

BVTs should be a carefully guarded set of tests.  They need to run quickly, consistently, and their results should matter.  If these rules are followed, the BVTs will be effective because failures will be respected.  Restricting the BVTs to the most important scenarios will ensure that the results are given the appropriate respect.
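
One workable way to keep that guarded set separate from the rest of the automation is to tag tests and filter at run time.  A sketch using NUnit-style categories; the media-player scenario and API are hypothetical:

    using NUnit.Framework;

    [TestFixture]
    public class PlaybackBvt
    {
        [Test, Category("BVT")]
        public void PlaysStandardMediaFile()
        {
            // Majority use case, positive inputs only: the scenario most users hit.
            var player = MediaPlayer.Open("sample.wmv");  // hypothetical product API
            Assert.IsTrue(player.Play());
        }

        [Test]  // deliberately left out of the BVT category
        public void Open_RejectsNullPath()
        {
            // A valid negative test, but a failure here can wait; run it in the
            // full nightly pass rather than the build verification run.
            Assert.Throws<System.ArgumentNullException>(() => MediaPlayer.Open(null));
        }
    }

The BVT run then includes only the BVT category, so it stays fast and every failure it reports is one the team is willing to act on immediately.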