Monday, October 20, 2014

Scenarios and Underpants Gnomes

Data driven quality requires a certain kind of thinking. It took me a while to understand the right thought process. I kept getting caught up in the details. What were we building and what would use look like? These are valid questions, but there are more important ones to be asking. Not asking whether the product is being successfully used, but rather how it is affecting user behavior. If we know what a happy user looks like and we see that behavior, we have a successful product.

As I wrote in What Is Quality?, true quality is the fitness of a particular form (program) for a function (what the user wants to accomplish). True data driven quality should measure this fitness function. A truly successful product will maximize this function. The key to doing this is to understand what job the user needs the product to accomplish and then measure whether that job is being done in an optimal way. It is important to understand the pain the customer is experiencing and then visualize what the world would look like if that pain were relieved. If we can measure that alleviation, we know we have a successful product.

There is a key part of the process I did not mention. What does the product do that alleviates the pain the customer is experiencing? This is unimportant. In fact, it is best not to know. Wait, you might think. Clearly it is important. If the product does nothing, the situation will not change and the customer will remain in pain. That is true, but that is also getting the cart before the horse. Knowing how the product intends to operate can cloud our judgment about how to measure it. We will be tempted to utilize confirmatory metrics instead of experimental ones. We will measure what the product does and not what it accomplishes. Just like test driven development requires the tests be written before the code, data driven quality demands that the metrics be designed before the features.

One way to accomplish this is through what can be called a scenario. This term is used for many things so let me be specific about my use. A scenario takes a particular form. It asks what problem the user is having and what alleviation of that pain looks like. It treats the solution as a black box.

  1. Customer Pain

  2. Magic Happens

  3. World Without Pain

I say "Magic Happens" because at this stage, it doesn't matter how things get better, only that they do. This reminds me of an old South Park sketch called the Underpants Gnomes. In it a group of gnomes has a brilliant business plan. They will gather underwear, do something with it, and then profit!


Their pain is a lack of money and an overabundance of underwear. Their success is more money (and fewer underpants?). To measure the success of their venture, it is not necessary to understand how they will generate profits from the underpants. It will suffice to measure their profits. Unfortunately for the gnomes, there may be no magic which can turn underwear into profit.

Let's walk through a real-world example.

  1. Customer Pain: When I start my news app, the news is outdated. I must wait for updated news to be retrieved. Sometimes I close the app immediately because the data is stale.

  2. Magic Happens

  3. World Without Pain: When I start the app, the news is current. I do not need to wait for data to be retrieved. Today's news is waiting for me.

What metrics might we use to measure this? We likely cannot measure the user's satisfaction with the content directly, but we can measure the saliency of the news. We could measure the time it takes to get updated content on the screen? Does this go down? We could tag the news content with timestamps and measure the median age of news when the app starts. Does the median age reduce? We could measure how often a user closes the app within the first 15 seconds of it starting up. Are fewer users rage quitting the app? We might even be able to monitor overall use of the app. Is median user activity going up?

Whether the solution involves improving server response times, caching content, utilizing OS features to prefetch the content while the app is not active, or other solutions is not necessary to understand. These are all part of the "magic happens" stage. We can and should experiment with several ideas to see which improve the situation the most. The key here is to measure how these ideas affect user behavior and user perception, not how often the prefetch APIs are called or whether server speeds are increased.

Thursday, October 16, 2014

Confirmatory and Experimental Metrics

As we experiment more using data to understand the quality of our product, the proper use of telemetry becomes more clear. While initially we were enamored with using telemetry to understand whether the product was working as expected, recently it has become clear that there is another, more powerful use for data. Data can tell us not just what is working, but whether we are building the right thing.

There are two major types of metrics.  Both have their place in the data driven quality toolkit.  Confirmatory metrics are used to confirm that a feature or scenario is working correctly.  Experimental metrics are used to determine the effect of a change on desired outcomes.  While most teams will start using primarily the first, over time, they will shift to more of the second.

Confirmatory metrics can also be called Quality of Service (QoS) metrics .  They are monitors.  That is, metrics designed to monitor the health of the system.  Did users complete the scenario?  Did the feature crash?  These metrics can be gathered from real customers using the system or from synthetic workloads. Confirmatory metrics alert the team when something is broken, but say nothing about how it affects behavior.  They provide very similar information to test cases.  As such, the primary action they can induce is to file and fix a bug.

Experimental metrics can also called Quality of Experience (QoE) metrics.  Each scenario being monitored introduces a problem that users have and an outcome if the problem is resolved.  Experimental metrics measure that outcome.  The implementation of the solution should not matter.  What matters is how the change affected behavior.

An example may help.  There may be a scenario to improve the time taken to debug asynchronous call errors.  The problem is that debugging takes too long.  The outcome is that debugging takes less time.  Metrics can be added to measure the median time a debugging session takes (or a host of other measures).  This might be called a KPI (Key Performance Indicator).  Given the KPI, it is possible to run experiments.  The team might develop a feature to store the asynchronous call chain and expose it to developers when the app crashes.  The team can flight this change and measure how debug times are affected.  If the median time goes down, the experiment was a success.  If it stays flat or regresses, the experiment is a failure and the feature needs to be reworked or even scrapped.

Experimental metrics are a proxy for user satisfaction with the product.  The goal is to maximize (or minimize in the case of debug times) the KPI and to experiment until the team finds ways of doing so. This is the real power behind data driven quality.  It connects the team once again with the needs of the customers.

There is a 3rd kind of metric which is not desirable.  Those are called vanity metrics.  Vanity metrics are ones that make us feel good but do not drive actions.  Number of users is one such metric.  It is nice to see a feature or product being used, but what does that mean?  How does that change the team's behavior?  What action did they take to create the change?  If they don't know the answer to these questions, the metric merely makes them feel good.  You can read more about vanity metrics here.