Monday, October 20, 2014

Scenarios and Underpants Gnomes

Data driven quality requires a certain kind of thinking. It took me a while to understand the right thought process. I kept getting caught up in the details. What were we building and what would use look like? These are valid questions, but there are more important ones to be asking. The question is not whether the product is being successfully used, but rather how it is affecting user behavior. If we know what a happy user looks like and we see that behavior, we have a successful product.

As I wrote in What Is Quality?, true quality is the fitness of a particular form (program) for a function (what the user wants to accomplish). True data driven quality should measure this fitness function. A truly successful product will maximize this function. The key to doing this is to understand what job the user needs the product to accomplish and then measure whether that job is being done in an optimal way. It is important to understand the pain the customer is experiencing and then visualize what the world would look like if that pain were relieved. If we can measure that alleviation, we know we have a successful product.

There is a key part of the process I did not mention. What does the product do that alleviates the pain the customer is experiencing? This is unimportant. In fact, it is best not to know. Wait, you might think. Clearly it is important. If the product does nothing, the situation will not change and the customer will remain in pain. That is true, but that is also putting the cart before the horse. Knowing how the product intends to operate can cloud our judgment about how to measure it. We will be tempted to utilize confirmatory metrics instead of experimental ones. We will measure what the product does and not what it accomplishes. Just as test driven development requires that the tests be written before the code, data driven quality demands that the metrics be designed before the features.

One way to accomplish this is through what can be called a scenario. This term is used for many things so let me be specific about my use. A scenario takes a particular form. It asks what problem the user is having and what alleviation of that pain looks like. It treats the solution as a black box.

  1. Customer Pain

  2. Magic Happens

  3. World Without Pain

I say "Magic Happens" because at this stage, it doesn't matter how things get better, only that they do. This reminds me of an old South Park episode featuring the Underpants Gnomes. In it a group of gnomes has a brilliant business plan. They will gather underwear, do something with it, and then profit!


Their pain is a lack of money and an overabundance of underwear. Their success is more money (and fewer underpants?). To measure the success of their venture, it is not necessary to understand how they will generate profits from the underpants. It will suffice to measure their profits. Unfortunately for the gnomes, there may be no magic which can turn underwear into profit.

Let's walk through a real-world example.

  1. Customer Pain: When I start my news app, the news is outdated. I must wait for updated news to be retrieved. Sometimes I close the app immediately because the data is stale.

  2. Magic Happens

  3. World Without Pain: When I start the app, the news is current. I do not need to wait for data to be retrieved. Today's news is waiting for me.

What metrics might we use to measure this? We likely cannot measure the user's satisfaction with the content directly, but we can measure the freshness of the news. We could measure the time it takes to get updated content on the screen. Does this go down? We could tag the news content with timestamps and measure the median age of news when the app starts. Does the median age drop? We could measure how often a user closes the app within the first 15 seconds of it starting up. Are fewer users rage quitting the app? We might even be able to monitor overall use of the app. Is median user activity going up?
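As a rough illustration, the outcome metrics above could be computed from per-launch telemetry along these lines. This is only a sketch: the Session record and its field names are hypothetical, not a real schema, and the 15 second threshold is the one mentioned in the text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Session:
    """One app launch. Every field here is hypothetical telemetry."""
    news_age_at_start_s: float      # age of the newest article shown at launch
    time_to_fresh_content_s: float  # seconds until updated news is on screen
    duration_s: float               # how long the user stayed in the app


def scenario_metrics(sessions, quick_quit_threshold_s=15.0):
    """Summarize outcome metrics for the stale-news scenario."""
    quick_quits = sum(1 for s in sessions if s.duration_s < quick_quit_threshold_s)
    return {
        "median_news_age_s": median(s.news_age_at_start_s for s in sessions),
        "median_time_to_fresh_s": median(s.time_to_fresh_content_s for s in sessions),
        "quick_quit_rate": quick_quits / len(sessions),
    }


sessions = [
    Session(3600, 8, 10),
    Session(1800, 5, 120),
    Session(7200, 12, 5),
    Session(600, 3, 300),
]
print(scenario_metrics(sessions))
```

Note that nothing in the sketch knows whether the fix was caching, prefetching, or faster servers. The same numbers are collected before and after any change, which is what keeps the solution a black box.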

Whether the solution involves improving server response times, caching content, utilizing OS features to prefetch the content while the app is not active, or other solutions is not necessary to understand. These are all part of the "magic happens" stage. We can and should experiment with several ideas to see which improve the situation the most. The key here is to measure how these ideas affect user behavior and user perception, not how often the prefetch APIs are called or whether server speeds are increased.

Thursday, October 16, 2014

Confirmatory and Experimental Metrics

As we experiment more using data to understand the quality of our product, the proper use of telemetry becomes more clear. While initially we were enamored with using telemetry to understand whether the product was working as expected, recently it has become clear that there is another, more powerful use for data. Data can tell us not just what is working, but whether we are building the right thing.

There are two major types of metrics.  Both have their place in the data driven quality toolkit.  Confirmatory metrics are used to confirm that a feature or scenario is working correctly.  Experimental metrics are used to determine the effect of a change on desired outcomes.  While most teams will start using primarily the first, over time, they will shift to more of the second.

Confirmatory metrics can also be called Quality of Service (QoS) metrics.  They are monitors: metrics designed to track the health of the system.  Did users complete the scenario?  Did the feature crash?  These metrics can be gathered from real customers using the system or from synthetic workloads. Confirmatory metrics alert the team when something is broken, but say nothing about how it affects behavior.  They provide very similar information to test cases.  As such, the primary action they can induce is to file and fix a bug.

Experimental metrics can also be called Quality of Experience (QoE) metrics.  Each scenario being monitored introduces a problem that users have and an outcome if the problem is resolved.  Experimental metrics measure that outcome.  The implementation of the solution should not matter.  What matters is how the change affected behavior.

An example may help.  There may be a scenario to improve the time taken to debug asynchronous call errors.  The problem is that debugging takes too long.  The outcome is that debugging takes less time.  Metrics can be added to measure the median time a debugging session takes (or a host of other measures).  This might be called a KPI (Key Performance Indicator).  Given the KPI, it is possible to run experiments.  The team might develop a feature to store the asynchronous call chain and expose it to developers when the app crashes.  The team can flight this change and measure how debug times are affected.  If the median time goes down, the experiment was a success.  If it stays flat or regresses, the experiment is a failure and the feature needs to be reworked or even scrapped.
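A minimal sketch of judging such a flight might look like the following. The function name, the sample numbers, and the 5% minimum improvement are all assumptions made for illustration. A real experiment would also need enough sessions and a significance test before trusting the result.

```python
from statistics import median


def evaluate_flight(control_minutes, treatment_minutes, min_improvement=0.05):
    """Compare median debug-session time between control and treatment.

    The verdict is "success" only when the treatment median is lower than
    the control median by at least min_improvement (relative).
    """
    control = median(control_minutes)
    treatment = median(treatment_minutes)
    relative_change = (treatment - control) / control
    verdict = "success" if relative_change <= -min_improvement else "rework"
    return control, treatment, relative_change, verdict


# Made-up session lengths in minutes, without and with the call-chain feature.
control = [30, 42, 55, 38, 61]
treatment = [25, 33, 40, 29, 47]
print(evaluate_flight(control, treatment))
```

The KPI is defined before the feature exists, so the same function can judge any other idea aimed at the same scenario.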

Experimental metrics are a proxy for user satisfaction with the product.  The goal is to maximize (or minimize in the case of debug times) the KPI and to experiment until the team finds ways of doing so. This is the real power behind data driven quality.  It connects the team once again with the needs of the customers.

There is a 3rd kind of metric which is not desirable.  Those are called vanity metrics.  Vanity metrics are ones that make us feel good but do not drive actions.  Number of users is one such metric.  It is nice to see a feature or product being used, but what does that mean?  How does that change the team's behavior?  What action did they take to create the change?  If they don't know the answer to these questions, the metric merely makes them feel good.  You can read more about vanity metrics here.

Wednesday, July 30, 2014

The Data Driven Quality Mindset

"Success is not delivering a feature; success is learning how to solve the customer's problem." - Mark Cook, VP of Products at Kodak

I've talked recently about the 4th wave of testing called Data Driven Quality (DDQ). I also elucidated what I believe are the technical prerequisites to achieving DDQ. Getting a fast delivery/rollback system and a telemetry system is not sufficient to achieve the data driven lifestyle. It requires a fundamentally different way of thinking. This is what I call the Data Driven Quality Mindset.

Data driven quality turns on its head much of the value system which is effective in the previous waves of software quality. The data driven quality mindset is about matching form to function. It requires the acceptance of a different risk curve. It requires a new set of metrics. It is about listening, not asserting. Data driven quality is based on embracing failure instead of fearing it. And finally, it is about impact, not shipping.

Quality is the matching of form to function. It is about jobs to be done and the suitability of an object to accomplish those jobs. Traditional testing operates from a view that quality is equivalent to correctness. Verifying correctness is a huge job. It is a combinatorial explosion of potential test cases, all of which must be run to be sure of quality. Data driven quality throws out this notion. It says that correctness is not an aspect of quality. The only thing that matters is whether the software accomplishes the task at hand in an efficient manner. This reduces the test matrix considerably. Instead of testing each possible path through the software, it becomes necessary to test only those paths a user will take. Data tells us which paths these are. The test matrix then drops from something like O(2^n) to closer to O(m) where n is the number of branches in the code and m is the number of action sequences a user will take. Data driven testers must give up the futile task of comprehensive testing in favor of focusing on the golden paths a user will take through the software. If a tree falls in the forest and no one is there to hear it, does it make a noise? Does it matter? Likewise with a bug down a path no user will follow.
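To make the gap between the two matrices concrete, here is a small sketch. It assumes n independent two-way branches, which gives 2^n as an upper bound on paths, and a hypothetical telemetry log of user action sequences. The action names are invented.

```python
from collections import Counter


def worst_case_paths(num_branches):
    # Upper bound: each independent two-way branch doubles the number of paths.
    return 2 ** num_branches


def observed_sequences(sessions):
    # Count each distinct action sequence actually seen in telemetry.
    return Counter(tuple(actions) for actions in sessions)


sessions = [
    ["open", "search", "view"],
    ["open", "search", "view"],
    ["open", "browse", "view", "share"],
    ["open", "search", "view"],
    ["open", "settings"],
]

sequences = observed_sequences(sessions)
print(worst_case_paths(20))      # 1048576 potential paths for just 20 branches
print(len(sequences))            # 3 sequences users actually took
print(sequences.most_common(1))  # the most travelled path and its count
```

Even with only 20 branches the exhaustive matrix runs to over a million paths, while the observed set stays small and ranks itself by how often each path is travelled.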

Success in a data driven quality world demands a different risk curve than the old world. Big up front testing assumes that the cost to fix an issue rises exponentially the further along the process we get. Everyone has seen a chart like the following:


In the world of boxed software, this is true. Most decisions are made early in the process. Changing these decisions late is expensive. Because testing is cumulative and exhaustive, a bug fix late requires re-running a lot of tests which is also expensive. Fixing an issue after release is even more expensive. The massive regression suites have to be run and even then there is little self hosting so the risks are magnified.

Data driven quality changes the dynamics and thus changes the cost curve. This in turn changes the amount of risk appropriate to take at any given time. When a late fix is very expensive, it is imperative to find the issues early, but finding issues early is expensive. When making a fix is quick and cheap, the value in finding a fix early is not high. It is better to lazy-eval the issues. Wait until they become manifested in the real world before a fix is made. In this way, many latent issues will never need to be fixed. The cost of finding issues late may be lower because broad user testing is much cheaper than paid test engineers. It is also more comprehensive and representative of the real world.

Traditional testers refuse to ship anything without exhaustive testing up front. It is the only way to be reasonably sure the product will not have expensive issues later. Data driven quality encourages shipping with minimum viable quality and then fixing issues as they arise. This means forgoing most of the up front testing. It means giving up the security blanket of a comprehensive test pass.

Big up front testing is metrics-driven. It just uses different metrics than data driven quality. The metrics for success in traditional testing are things like pass rates, bug counts, and code coverage. None of these are important in a data driven quality world. Pass rates do not indicate quality. This is potentially a whole post by itself, but for now it suffices to say that pass rates are arbitrary. Not all test cases are of equal importance. Additionally, test cases can be factored at many levels. A large number of failing unimportant cases can cause a pass rate to drop precipitously without lowering product quality. Likewise, a large number of passing unimportant cases can overwhelm a single failing important one.
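A tiny worked example shows how the pass rate can rank two builds backwards. The test names and counts are invented for illustration.

```python
def pass_rate(results):
    """Fraction of passing cases; every case counts the same."""
    return sum(results.values()) / len(results)


# Build A: 999 trivial cases pass, the one critical case fails.
build_a = {f"trivial_{i}": True for i in range(999)}
build_a["critical_save_file"] = False

# Build B: the critical case passes, 200 of the 999 trivial cases fail.
build_b = {f"trivial_{i}": i >= 200 for i in range(999)}
build_b["critical_save_file"] = True

print(pass_rate(build_a))  # 0.999
print(pass_rate(build_b))  # 0.8
```

Build A looks nearly perfect at 99.9% even though it cannot save a file, while Build B looks alarming at 80% even though the one case users depend on works.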

Perhaps bug counts are a better metric. In fact, they are, but they are not sufficiently better. If quality is the fit of form and function, bugs that do not indicate this fit obscure the view of true quality. Latent issues can come to dominate the counts and render invisible those bugs that truly indicate user happiness. Every failing test case may cause a bug to be filed, whether it is an important indicator of the user experience or not. These in turn take up large amounts of investigation and triage time, not to mention time to fix them. In the end, fixing latent issues does not appreciably improve the experience of the end user. It is merely an onanistic exercise.

Code coverage, likewise, says little about code quality. The testing process in Windows Vista stressed high code coverage and yet the quality experienced by users suffered greatly. Code coverage can be useful to find areas that have not been probed, but coverage of an area says nothing about the quality of the code or the experience. Rather than code coverage, user path coverage is a better metric. What are the paths a user will take through the software? Do they work appropriately?

Metrics in data driven quality must reflect what users do with the software and how well they are able to accomplish those tasks. They can be as simple as a few key performance indicators (KPIs). A search engine might measure only repeat use. A storefront might measure only sales numbers. They could be finer grained. What percentage of users are using this feature? Are they getting to the end? If so, how quickly are they doing so? How many resources (memory, cpu, battery, etc.) are they using in doing so? These kinds of metrics can be optimized for. Improving them appreciably improves the experience of the user and thus their engagement with the software.
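The finer grained questions above translate fairly directly into code. The sketch below assumes a hypothetical event format of (user id, start time, finish time or None if abandoned). None of these names come from a real telemetry system.

```python
from statistics import median


def feature_kpis(attempts, total_users):
    """attempts: list of (user_id, started_at_s, finished_at_s or None)."""
    users = {user_id for user_id, _, _ in attempts}
    completed = [(start, end) for _, start, end in attempts if end is not None]
    return {
        # What percentage of users are using this feature?
        "adoption_rate": len(users) / total_users,
        # Are they getting to the end?
        "completion_rate": len(completed) / len(attempts) if attempts else 0.0,
        # If so, how quickly?
        "median_seconds_to_complete": (
            median(end - start for start, end in completed) if completed else None
        ),
    }


attempts = [
    ("a", 0, 30),
    ("b", 10, None),  # abandoned partway through
    ("c", 5, 25),
    ("a", 100, 140),
]
print(feature_kpis(attempts, total_users=10))
```

Each value is something a team can try to move with an experiment, which is what separates it from a count that merely looks good.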

There is a term called HiPPO (highest paid person's opinion) that describes how decisions are too often made on software projects. Someone asserts that users want to have a particular feature. Someone else may disagree. Assertions are bandied about. In the end the tie is usually broken by the highest ranking person present. This applies to bug fixes as well as features. Test finds a bug and argues that it should be fixed. Dev may disagree. Assertions are exchanged. Whether the bug is ultimately fixed or not comes down to the opinion of the relevant manager. Very rarely is the correctness of the decision ever verified. Decisions are made by gut, not data.

In data driven quality, quality decisions must be made with data. Opinions and assertions do not matter. If an issue is in doubt, run an experiment. If adding a feature or fixing a bug improves the KPI, it should be accepted. If it does not, it should be rejected. If the data is not available, sufficient instrumentation should be added and an experiment designed to tease out the data. If the KPIs are correct, there can be no arguing with the results. It is no longer about the HiPPO. Even managers must concede to data.

It is important to note that the data is often counter-intuitive. Many times things that would seem obvious turn out not to work and things that seem irrelevant are important. Always run experiments and always listen to them.

Data driven quality requires taking risks. I covered this in my post on Try.Fail.Learn.Improve. Data driven quality is about being agile. About responding to events as they happen. In theory, reality and theory are the same. In reality, they are different. Because of this, it is important to take an empiricist view. Try things. See what works. Follow the bread crumbs wherever they lead. Data driven quality provides tools for experimentation. Use them. Embrace them.

Management must support this effort. If people are punished for failure, they will become risk averse. If they are risk averse, they will not try new things. Without trying new things, progress will grind to a halt. Embrace failure. Managers should encourage their teams to fail fast and fail early. This means supporting those who fail and rewarding attempts, not success.

Finally, data driven quality requires a change in the very nature of what is rewarded. Traditional software processes reward shipping. This is bad. Shipping something users do not want is of no value. In fact, it is arguably of negative value because it complicates the user experience and it adds to the maintenance burden of the software. Instead of rewarding shipping, managers in a data driven quality model must reward impact. Reward the team (not individuals) for improving the KPIs and other metrics. These are, after all, what people use the software for and thus what the company is paid for.

Team is the important denominator here. Individuals will be taking risks which may or may not pay off. One individual may not be able to conduct sufficient experiments to stumble across success. A team should be able to. Rewards at the individual level will distort behavior and reward luck more than proper behavior.

The data driven quality culture is radically different from the big up front testing culture. As Clayton Christensen points out in his books, the values of the organization can impede adoption of a new system. It is important to explicitly adopt not just new processes, but new values. Changing values is never a fast process. The transition may take a while. Don't give up. Instead, learn from failure and improve.

Monday, July 14, 2014

10 Years of Blogging

I'm a little late, but it's time to celebrate anyway. Ten years ago, in March of 2004, I began blogging here. Hello World was, of course, my first post. You have embraced and sometimes challenged my words. Thank you for continuing to read and to engage with me. I learn a lot by writing. It is my hope you learn a fraction of that by reading.

After a long break where I published rarely and two years with no posts, you may have noticed that I am back. I waited to write this until I was sure I had some momentum. Last time I stated my return to blogging, I slacked off shortly after. Well, I'm back again and I have a lot I want to talk about. Those who continue to read after the hiatus, thank you for sticking around. I'll try to make it worth your while.

After four years as the Test Development Manager responsible for the Windows Runtime API, I am moving on to new things. I love what we did with the new API, but it is time for change. I will remain in the Operating Systems Group and in the Test Development discipline, but will be working to enable greater use of data in our testing processes.

For more than a year, I have been thinking about how to utilize data in our testing process. I have been inspired by the work of Seth Eliot, Brent Jensen, Ken Johnston, Alan Page, and others. They paint the picture of Data Driven Quality where we determine our success by observing users rather than by test pass rates. As you can see from my recent posts, I have joined their ranks.

No anniversary is complete without some stats. Over the past decade, I have written 418 posts. You have left 1194 comments. The most popular post was about how much memory Vista really needed. It garnered over 119,000 views.

A few have managed to miss my Twitter handle in the about me section. I can be found at @steverowe on Twitter if you want to engage with me there.

Wednesday, July 9, 2014

Try.Fail.Learn.Improve


Try.Fail.Learn.Improve. That has been the signature on my e-mail for the past few months. It is intended to be both enlightening and provocative. It emphasizes that we won't get things right the first time. That it is okay to fail as long as we don't fail repeatedly in the same way. Try.Fail.Learn.Improve is a process that needs to be constantly repeated. It is a way of life.

When I first used this phrase, someone responded that it was too strongly worded. Perhaps I should say "Try, Learn, Succeed" instead. But that doesn't convey the true value of the phrase. I specifically chose the word Fail because I wanted to emphasize that we would get things wrong. I avoided the word succeed because I wanted to convey that the process would be a long one.

Try. The essence of getting anything done is to start. In the world of software and especially systems software, we are always doing something unknown. We are not building the nth bridge or even the nth website. As such, the answers are not known up front. How could they be? Thus we can't say, "Do." That implies a known course of action. Try is more accurate. Make a hypothesis about what might work and try it out. Run the experiment.

Fail. Most of the time--not just sometimes--what is tried will fail. It is important to be able to recognize when we fail. Trying something that cannot fail is also doing something from which we cannot learn. Only with the possibility of failure can learning be had. Failure should be expected. "Embrace Failure" is advice I gave early into my new role. Much traditional software has viewed success as the only metric. The downside is that failure was punished. When something is punished, it will diminish. People will shy away from it. Punishing failure will disincentivize people from taking risk. The lack of risk means a lack of failure and a lack of learning. Given that we don't know the correct path to take, this lack of learning ensures a lack of success.

Learn. Einstein is said to have defined insanity as doing the same thing over and over again and expecting a different result. It is possible to fail and not learn. This is usually the result of blame. In a failure-averse culture, admitting you were wrong has severe repercussions. Failure then is not admitted. It is not embraced. Instead, it is blamed on something external. "We would have succeeded if only the marketing team had done their job." The essence of learning is understanding failure. Why did things go differently than predicted? In the scientific method, this helps to set up the next experiment.

Improve. Once we have failed and learned something about why we failed, it is time to try again. Devise the next experiment. If what we tried did not work, what else might work? Revise the hypothesis and begin the cycle again. Try the next thing. If at first you don't succeed--and you won't--try, try again.

Succeed. Eventually, if we improve in this iterative fashion, we will succeed. This will take a while. Do not expect it to happen after the first, second, or even third turn of the crank. Success often comes in stages. It is not all or nothing.

If success is such an elusive item and will take many cycles to achieve, is it possible to get to success faster? Yes. Designing better experiments can help, if we understand the problem well enough to design them. Blind experimentation will take too long. But many times we don’t know enough to design great experiments. What then? This is the most likely situation. In it, the best solution is to learn to run more experiments. Reduce the time each turn of the crank takes. Tighten the loop. Design smaller experiments that can be implemented quickly, run quickly, and understood quickly.

One last word on success. It is impossible to succeed if you don't know what success looks like. Be sure to understand where you are trying to go. If you don't, how can you know if an experiment got you closer? You can't learn if you can't fail and you can't fail if you don't know what you are measuring for. As Yogi Berra said, "If you don't know where you are going, you'll end up somewhere else."

So get out there and fail. It's the only way to learn.

Tuesday, July 1, 2014

Prerequisites to Data Driven Quality

A previous post introduced the concept of data driven quality. Moving from traditional, up-front testing to data driven quality is not easy. It is not possible to take just any product and start utilizing this method. In addition to cultural changes, there are several technical requirements on the product. These are: early deployment, friction-free deployment, partial deployment, high speed to fix, limited damage, and access to telemetry about what the users are doing.

Early deployment means shipping before the software is 100% ready. In the past, companies shipped beta software to limited audiences. This was a good start, but betas happened once or twice per product cycle. For instance, Windows 7 had two betas. Windows 8 had one. These were both 3 year product cycles. In order to use data to really understand the product's quality, shipping needs to happen much more often. That means shipping with lower quality. The exact level of stability can be determined per product, but need not be very high if the rest of the prerequisites are met. Ken Johnston has a stimulating post about the concept of Minimum Viable Quality.

Friction-free deployment means a simple mechanism to get the bits in front of users. Seamless installation. The user shouldn't have to know that they have a new version unless it looks different. Google's Chrome browser really pioneered here. It just updates in the background. You don't have to do anything to get the latest and greatest version and you don't have to care.

Because we may be pushing software that isn't 100%, deployment needs to be limited in scope. Software that is not yet fully trusted should not be given to everyone all at once. What if something goes wrong? Services do this with rings of deployment. First the changes are shown to only a small number of users, perhaps hundreds or low thousands. If that appears correct, it is shown to more, maybe tens of thousands. As the software proves itself stable with each group, it is deemed worthy of pushing outward to a bigger ring.
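The ring idea can be sketched in a few lines. The ring names and audience sizes below are illustrative assumptions loosely based on the numbers above, and is_healthy stands in for whatever telemetry check a real service would use.

```python
# Hypothetical rings, smallest audience first.
RINGS = [("canary", 500), ("early", 20_000), ("broad", 1_000_000)]


def roll_out(build, is_healthy, rings=RINGS):
    """Expose a build ring by ring, stopping at the first unhealthy ring.

    Returns (rings_reached, fully_deployed).
    """
    reached = []
    for name, audience_size in rings:
        reached.append(name)  # the build is now live for this ring
        if not is_healthy(build, name, audience_size):
            return reached, False  # stop before exposing more users
    return reached, True


# A build whose problems only show up once the "early" ring sees it.
print(roll_out("build-42", lambda build, ring, size: ring != "early"))
```

In this example the bad build reaches at most the first two rings, so the largest audience never sees it.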

If something goes wrong, it is important to fix it quickly. This can be a fix for the issue at hand (roll forward) or reversion to the last working version (roll back). The important thing is not to leave users in a broken state for very long. The software must be built with this functionality in mind. It is not okay to leave users facing a big regression for days. In the best case, they should get back to a working system as soon as we detect the problem. With proper data models, this could happen automatically.
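An automatic roll back decision can be as simple as comparing a live error rate against a baseline. This is a deliberately naive sketch. The function name, the version strings, and the tolerance of 2x are assumptions, and a production system would want a sturdier data model than a single threshold.

```python
def choose_version(current, last_good, error_rate, baseline_error_rate, tolerance=2.0):
    """Pick the version to serve based on the observed error rate.

    Rolls back to last_good when the current version's error rate exceeds
    the baseline by more than the given tolerance factor.
    """
    if error_rate > baseline_error_rate * tolerance:
        return last_good  # roll back: get users onto a working version
    return current        # healthy enough: keep serving the new version


print(choose_version("v2", "v1", error_rate=0.09, baseline_error_rate=0.02))  # v1
print(choose_version("v2", "v1", error_rate=0.03, baseline_error_rate=0.02))  # v2
```

The point of automating the decision is speed. Users spend minutes, not days, on a regressed version.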

Deployment of lower quality software means that users will experience bugs. Total experienced damage is a function of both duration and magnitude. Given the previous two prerequisites, the damage will be limited in duration, but it is also important to limit the damage in magnitude. A catastrophic bug which wipes out your file system or causes a machine not to boot need not last long to do lasting harm. Rolling back doesn't repair the damage. Your dissertation is already corrupted. Pieces of the system which can have catastrophic (generally data loss) repercussions need to be tested differently and have a different quality bar before being released.

Finally, it must be easy to gather telemetry on what the user is doing. The product must be capable of generating telemetry, but the system must also be capable of consuming it. The product must be modified to make generating telemetry simple. This is usually in the form of a logging library. This library must be lightweight. It is easy to overwhelm the performance of a system with too slow a library and too many log events. The library must also be capable of throttling. There is no sense causing a denial of service attack on your own datacenter if too many users use the software.
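Throttling in a logging library can be sketched with a simple fixed-window limit. The class below is hypothetical and not modeled on any particular library. The clock and sink are injectable so the behavior is easy to test.

```python
import time


class ThrottledLogger:
    """Sends at most max_events per window and drops the rest."""

    def __init__(self, max_events, window_s=1.0, clock=time.monotonic, sink=print):
        self.max_events = max_events
        self.window_s = window_s
        self.clock = clock
        self.sink = sink              # where accepted events go (upload, file, ...)
        self.window_start = clock()
        self.sent_in_window = 0
        self.dropped = 0              # count of drops, itself useful telemetry

    def log(self, event):
        now = self.clock()
        if now - self.window_start >= self.window_s:
            # A new window has begun, so reset the budget.
            self.window_start = now
            self.sent_in_window = 0
        if self.sent_in_window < self.max_events:
            self.sent_in_window += 1
            self.sink(event)
            return True
        self.dropped += 1
        return False


logger = ThrottledLogger(max_events=2)
for i in range(5):
    logger.log({"event": "page_view", "n": i})
print("dropped:", logger.dropped)
```

Dropping events on the client keeps the work per call small and bounds what any one client can send to the datacenter. Sampling a fixed percentage of users is another common way to get the same protection.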

The datacenter must be capable of handling the logs. The scale of success can make it difficult. The more users, the more data will need to be processed. This can overwhelm network connections and analysis pipelines. The more data involved, the more computing power is necessary. Network pipes must be big. Storage requirements go up. Processing terabytes or even petabytes of data is not trivial. The more data, the more automated the analysis must become to keep up.

With these pieces in place, a team can begin to live the data driven quality lifestyle. There is much more than just the technology to think about though. The very mindset of the team must change if the fourth wave of testing is to take root. I will cover these cultural changes next time.

Monday, June 23, 2014

Perceived vs. Objective Quality

I recently heard this story, but I can't recall who told it to me. I don’t have proof of its veracity so it might be apocryphal. Nevertheless, it illustrates an important point that I believe to be true independent of the truth of this story.

As the story goes, in the late 1990s, several Microsoft researchers set about trying to understand the quality of various operating system codebases. Of concern were Linux, Solaris, and Windows NT. The perception among the IT crowd was that Solaris and Linux were of high quality and Windows NT was not. These researchers wanted to test that objectively and understand why NT would be considered worse.

They used many objective measures of code quality to assess the 3 operating systems. These would be things like cyclomatic complexity, depth of inheritance, static analysis tools such as lint, and measurements of coupling. Without debating the exact value of this sort of approach, there are reasons to believe these sorts of measurements are at least loosely correlated with defect density and code quality.

What the researchers found was that Solaris came out on top. It was the highest quality. This matched the common perception. Next was Windows NT. It was close behind Solaris. The surprise was Linux. It was far behind the other two. Why then the sense that it was high quality? The perceived quality of both NT and Linux did not match their objective measures of quality.

The speculation on the part of the researchers was that while Linux had a lot of rough edges, the most used paths were well polished. The primary scenarios were close to 100% whereas the others were only at, say, 60%. NT, on the other hand, was at 80 or 90% everywhere. This made for high objective quality, but not high experienced quality.

Think of it. If everything you do is 90% right, you will run into small problems all the time. On the other hand, if you stay within the expected lanes on something like Linux, you will rarely experience issues.
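The arithmetic behind this can be sketched by weighting each path's quality by how much users actually travel it. All numbers below are invented for illustration: two primary paths carrying 90% of use, eight secondary paths sharing the rest, and quality levels taken from the rough percentages in the story.

```python
def objective_quality(path_quality):
    """Unweighted average: every path counts the same."""
    return sum(path_quality.values()) / len(path_quality)


def experienced_quality(path_quality, usage_share):
    """Average weighted by the share of use each path receives."""
    return sum(q * usage_share.get(path, 0.0) for path, q in path_quality.items())


primary = [f"primary_{i}" for i in range(2)]
secondary = [f"secondary_{i}" for i in range(8)]

# 90% of use goes to the two primary paths, 10% is spread over the rest.
usage = {p: 0.45 for p in primary}
usage.update({p: 0.0125 for p in secondary})

polished_core = {p: 1.0 for p in primary}        # the Linux-like profile
polished_core.update({p: 0.6 for p in secondary})
even_everywhere = {p: 0.85 for p in primary + secondary}  # the NT-like profile

for name, system in [("polished core", polished_core), ("even everywhere", even_everywhere)]:
    print(name,
          round(objective_quality(system), 2),
          round(experienced_quality(system, usage), 2))
```

With these made-up numbers the polished-core system scores about 0.68 on the unweighted measure but about 0.96 as experienced, while the even system scores about 0.85 on both. The ranking flips depending on whether usage is taken into account.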

This coincides well with the definition of quality being about fitness for a function. For the functions it was being used for, Linux was very fit. NT supported a wider variety of functions, but was less fit for each of them and thus perceived as being of lower quality.

The moral of the tale: Quality is not the absence of defects. Quality is the absence of the right kinds of defects. The way to achieve higher quality is not to scour the code for every possible defect. That may even have a negative effect on quality due to randomization. Instead, it is better to understand the user patterns and ensure that those are free of bugs. Data Driven Quality gives the team a chance to understand both these use patterns and what bugs are impeding them.