tag:blogger.com,1999:blog-7368100006994535062024-03-08T16:17:42.423-08:00Ruminations on ComputingSteve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.comBlogger422125tag:blogger.com,1999:blog-736810000699453506.post-48398412504213517152014-10-20T01:02:00.000-07:002019-01-11T10:55:21.691-08:00Scenarios and Underpants Gnomes<p><a href="http://blogs.msdn.com/b/steverowe/archive/2014/06/16/data-driven-quality.aspx">Data driven quality</a> requires a certain kind of thinking. It took me a while to understand the right thought process. I kept getting caught up in the details. What were we building and what would usage look like? These are valid questions, but there are more important ones to be asking. The right question is not whether the product is being successfully used, but rather how it is affecting user behavior. If we know what a happy user looks like and we see that behavior, we have a successful product.</p><br/><p>As I wrote in <a href="http://blogs.msdn.com/b/steverowe/archive/2014/05/29/what-is-quality.aspx">What Is Quality?</a>, true quality is the fitness of a particular form (program) for a function (what the user wants to accomplish). True data driven quality should measure this fitness function. A truly successful product will maximize this function. The key to doing this is to understand what job the user needs the product to accomplish and then measure whether that job is being done in an optimal way. It is important to understand the pain the customer is experiencing and then visualize what the world would look like if that pain were relieved. If we can measure that alleviation, we know we have a successful product.</p><br/><p>There is a key part of the process I did not mention. What does the product do that alleviates the pain the customer is experiencing? This is unimportant. In fact, it is best not to know. Wait, you might think. Clearly it is important. 
If the product does nothing, the situation will not change and the customer will remain in pain. That is true, but that is also putting the cart before the horse. Knowing how the product intends to operate can cloud our judgment about how to measure it. We will be tempted to utilize <a href="http://blogs.msdn.com/b/steverowe/archive/2014/10/16/confirmatory-and-experimental-metrics.aspx">confirmatory metrics</a> instead of experimental ones. We will measure what the product does and not what it accomplishes. Just like test driven development requires the tests be written before the code, data driven quality demands that the metrics be designed before the features.</p><br/><p>One way to accomplish this is through what can be called a scenario. This term is used for many things so let me be specific about my use. A scenario takes a particular form. It asks what problem the user is having and what alleviation of that pain looks like. It treats the solution as a black box.</p><br/><ol><br/><li>Customer Pain</li><br/><li>Magic Happens</li><br/><li>World Without Pain</li><br/></ol><br/><p>I say "Magic Happens" because at this stage, it doesn't matter how things get better, only that they do. This reminds me of an old South Park sketch called the Underpants Gnomes. In it, a group of gnomes has a brilliant business plan. They will gather underwear, do something with it, and then profit!</p><br/><p>[View:https://www.youtube.com/watch?v=tO5sxLapAts]</p><br/><p>Their pain is a lack of money and an overabundance of underwear. Their success is more money (and fewer underpants?). To measure the success of their venture, it is not necessary to understand how they will generate profits from the underpants. It will suffice to measure their profits. Unfortunately for the gnomes, there may be no magic which can turn underwear into profit.</p><br/><p>Let's walk through a real-world example.</p><br/><ol><br/><li>Customer Pain: When I start my news app, the news is outdated. 
I must wait for updated news to be retrieved. Sometimes I close the app immediately because the data is stale.</li><br/><li>Magic Happens</li><br/><li>World Without Pain: When I start the app, the news is current. I do not need to wait for data to be retrieved. Today's news is waiting for me.</li><br/></ol><br/><p>What metrics might we use to measure this? We likely cannot measure the user's satisfaction with the content directly, but we can measure the freshness of the news. We could measure the time it takes to get updated content on the screen. Does this go down? We could tag the news content with timestamps and measure the median age of news when the app starts. Does the median age decrease? We could measure how often a user closes the app within the first 15 seconds of it starting up. Are fewer users rage quitting the app? We might even be able to monitor overall use of the app. Is median user activity going up?</p><br/><p>It is not necessary to understand whether the solution involves improving server response times, caching content, utilizing OS features to prefetch the content while the app is not active, or something else entirely. These are all part of the "magic happens" stage. We can and should experiment with several ideas to see which improve the situation the most. The key here is to measure how these ideas affect user behavior and user perception, not how often the prefetch APIs are called or whether server speeds are increased.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-54002257473484845672014-10-16T00:59:00.000-07:002019-01-11T10:55:21.404-08:00Confirmatory and Experimental Metrics<p>As we experiment more using data to understand the quality of our product, the proper use of telemetry becomes clearer. 
While initially we were enamored with using telemetry to understand whether the product was working as expected, recently it has become clear that there is another, more powerful use for data. Data can tell us not just what is working, but whether we are building the right thing.</p><br/><p>There are two major types of metrics. Both have their place in the data driven quality toolkit. Confirmatory metrics are used to confirm that a feature or scenario is working correctly. Experimental metrics are used to determine the effect of a change on desired outcomes. While most teams will start using primarily the first, over time, they will shift to more of the second.</p><br/><p>Confirmatory metrics can also be called Quality of Service (QoS) metrics. They are monitors. That is, metrics designed to monitor the health of the system. Did users complete the scenario? Did the feature crash? These metrics can be gathered from real customers using the system or from synthetic workloads. Confirmatory metrics alert the team when something is broken, but say nothing about how it affects behavior. They provide very similar information to test cases. As such, the primary action they can induce is to file and fix a bug.</p><br/><p>Experimental metrics can also be called Quality of Experience (QoE) metrics. Each scenario being monitored identifies a problem that users have and an outcome if the problem is resolved. Experimental metrics measure that outcome. The implementation of the solution should not matter. What matters is how the change affected behavior.</p><br/><p>An example may help. There may be a scenario to improve the time taken to debug asynchronous call errors. The problem is that debugging takes too long. The outcome is that debugging takes less time. Metrics can be added to measure the median time a debugging session takes (or a host of other measures). This might be called a KPI (Key Performance Indicator). Given the KPI, it is possible to run experiments. 
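A median-time KPI comparison like this can be sketched in a few lines. This is an illustrative sketch, not code from the post; the session durations are invented numbers standing in for real telemetry.

```python
import statistics

def median_kpi(session_minutes):
    """Median debugging-session duration (minutes) for one experiment arm."""
    return statistics.median(session_minutes)

# Hypothetical telemetry: debug-session lengths for each arm of the flight.
control = [42, 35, 58, 40, 47, 39]    # sessions without the change
treatment = [21, 30, 18, 25, 27, 22]  # sessions with the flighted change

delta = median_kpi(treatment) - median_kpi(control)
# For a time-based KPI, lower is better.
if delta < 0:
    print(f"KPI improved by {-delta:.1f} min; experiment succeeded")
else:
    print("KPI flat or regressed; rework or scrap the feature")
```

In a real pipeline the two lists would be populated from logged session events and the comparison would include a significance test, but the decision rule is the same: the KPI, not the feature's internals, determines success.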
The team might develop a feature to store the asynchronous call chain and expose it to developers when the app crashes. The team can flight this change and measure how debug times are affected. If the median time goes down, the experiment was a success. If it stays flat or regresses, the experiment is a failure and the feature needs to be reworked or even scrapped.</p><br/><p>Experimental metrics are a proxy for user satisfaction with the product. The goal is to maximize (or minimize in the case of debug times) the KPI and to experiment until the team finds ways of doing so. This is the real power behind data driven quality. It connects the team once again with the needs of the customers.</p><br/><p>There is a third kind of metric which is not desirable: the vanity metric. Vanity metrics are ones that make us feel good but do not drive actions. Number of users is one such metric. It is nice to see a feature or product being used, but what does that mean? How does that change the team's behavior? What action did they take to create the change? If they don't know the answer to these questions, the metric merely makes them feel good. You can read more about vanity metrics <a href="http://fourhourworkweek.com/2009/05/19/vanity-metrics-vs-actionable-metrics/">here</a>.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com0tag:blogger.com,1999:blog-736810000699453506.post-52511920247096049202014-07-30T00:38:00.000-07:002019-01-11T10:55:12.492-08:00The Data Driven Quality Mindset<p>"Success is not delivering a feature; success is learning how to solve the customer's problem." - Mark Cook, VP of Products at Kodak</p> <p>I've talked recently about the 4th wave of testing called <a href="http://blogs.msdn.com/b/steverowe/archive/2014/06/16/data-driven-quality.aspx">Data Driven Quality</a> (DDQ). 
I also elucidated what I believe are the <a href="http://blogs.msdn.com/b/steverowe/archive/2014/07/01/prerequisites-to-data-driven-quality.aspx">technical prerequisites</a> to achieving DDQ. Getting a fast delivery/rollback system and a telemetry system is not sufficient to achieve the data driven lifestyle. It requires a fundamentally different way of thinking. This is what I call the Data Driven Quality Mindset.</p> <p>Data driven quality turns on its head much of the value system which is effective in the previous waves of software quality. The data driven quality mindset is about matching form to function. It requires the acceptance of a different risk curve. It requires a new set of metrics. It is about listening, not asserting. Data driven quality is based on embracing failure instead of fearing it. And finally, it is about impact, not shipping.</p> <p>Quality is the matching of form to function. It is about jobs to be done and the suitability of an object to accomplish those jobs. Traditional testing operates from a view that quality is equivalent to correctness. Verifying correctness is a huge job. It is a combinatorial explosion of potential test cases, all of which must be run to be sure of quality. Data driven quality throws out this notion. It says that correctness is not an aspect of quality. The only thing that matters is whether the software accomplishes the task at hand in an efficient manner. This reduces the test matrix considerably. Instead of testing each possible path through the software, it becomes necessary to test only those paths a user will take. Data tells us which paths these are. The test matrix then drops from something like O(2<sup>n</sup>) to closer to O(m) where n is the number of branches in the code and m is the number of action sequences a user will take. Data driven testers must give up the futile task of comprehensive testing in favor of focusing on the golden paths a user will take through the software. 
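As a rough illustration of that reduction (a hedged sketch; the branch count and the session data below are invented):

```python
# With n independent branches, exhaustive testing faces 2**n combinations.
n_branches = 20
exhaustive_cases = 2 ** n_branches  # 1,048,576 combinations

# Telemetry, by contrast, reveals only the paths users actually take.
# Each record is the sequence of actions one user session performed.
observed_sessions = [
    ("open", "search", "read"),
    ("open", "search", "read"),
    ("open", "settings", "close"),
    ("open", "search", "share"),
    ("open", "search", "read"),
]
golden_paths = set(observed_sessions)  # the m distinct paths worth testing

print(f"exhaustive: {exhaustive_cases} cases, data-driven: {len(golden_paths)} paths")
```

The point of the sketch is the ratio: the set of distinct observed sequences stays tiny even as the theoretical branch space explodes.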
If a tree falls in the forest and no one is there to hear it, does it make a noise? Does it matter? Likewise with a bug down a path no user will follow.</p> <p>Success in a data driven quality world demands a different risk curve than the old world. Big up front testing assumes that the cost to fix an issue rises exponentially the further along the process we get. Everyone has seen a chart like the following:</p> <p><a href="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/5557.clip_image001_7491DCBE.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/5557.clip_5F00_image001_5F00_7491DCBE.png"><img title="clip_image001" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="clip_image001" src="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/5141.clip_image001_thumb_293238FA.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/5141.clip_5F00_image001_5F00_thumb_5F00_293238FA.png" width="486" height="300" /></a></p> <p>In the world of boxed software, this is true. Most decisions are made early in the process. Changing these decisions late is expensive. Because testing is cumulative and exhaustive, a bug fix late requires re-running a lot of tests which is also expensive. Fixing an issue after release is even more expensive. The massive regression suites have to be run and even then there is little self hosting so the risks are magnified.</p> <p>Data driven quality changes the dynamics and thus changes the cost curve. 
This in turn changes the amount of risk appropriate to take at any given time. When a late fix is very expensive, it is imperative to find the issues early, but finding issues early is expensive. When making a fix is quick and cheap, the value in finding a fix early is not high. It is better to lazy-eval the issues. Wait until they become manifested in the real world before a fix is made. In this way, many latent issues will never need to be fixed. The cost of finding issues late may be lower because broad user testing is much cheaper than paid test engineers. It is also more comprehensive and representative of the real world.</p> <p>Traditional testers refuse to ship anything without exhaustive testing up front. It is the only way to be reasonably sure the product will not have expensive issues later. Data driven quality encourages shipping with minimum viable quality and then fixing issues as they arise. This means foregoing most of the up front testing. It means giving up the security blanket of a comprehensive test pass.</p> <p>Big up front testing is metrics-driven. It just uses different metrics than data driven quality. The metrics for success in traditional testing are things like pass rates, bug counts, and code coverage. None of these are important in the data driven quality world. Pass rates do not indicate quality. This is potentially a whole post by itself, but for now it suffices to say that pass rates are arbitrary. Not all test cases are of equal importance. Additionally, test cases can be factored at many levels. A large number of failing unimportant cases can cause a pass rate to drop precipitously without lowering product quality. Likewise, a large number of passing unimportant cases can overwhelm a single failing important one.</p> <p>Perhaps bug counts are a better metric. In fact, they are, but they are not sufficiently better. If quality is the fit of form and function, bugs that do not indicate this fit obscure the view of true quality. 
Latent issues can come to dominate the counts and render invisible those bugs that truly indicate user happiness. Every failing test case may cause a bug to be filed, whether it is an important indicator of the user experience or not. These in turn take up large amounts of investigation and triage time, not to mention time to fix them. In the end, fixing latent issues does not appreciably improve the experience of the end user. It is merely an onanistic exercise.</p> <p>Code coverage, likewise, says little about code quality. The testing process in Windows Vista stressed high code coverage and yet the quality experienced by users suffered greatly. Code coverage can be useful to find areas that have not been probed, but coverage of an area says nothing about the quality of the code or the experience. Rather than code coverage, user path coverage is a better metric. What are the paths a user will take through the software? Do they work appropriately?</p> <p>Metrics in data driven quality must reflect what users do with the software and how well they are able to accomplish those tasks. They can be as simple as a few key performance indicators (KPIs). A search engine might measure only repeat use. A storefront might measure only sales numbers. They could be finer grained. What percentage of users are using this feature? Are they getting to the end? If so, how quickly are they doing so? How many resources (memory, CPU, battery, etc.) are they using in doing so? These kinds of metrics can be optimized for. Improving them appreciably improves the experience of the user and thus their engagement with the software.</p> <p>There is a term called HiPPO (highest paid person's opinion) that describes how decisions are too often made on software projects. Someone asserts that users want to have a particular feature. Someone else may disagree. Assertions are bandied about. In the end the tie is usually broken by the highest ranking person present. 
This applies to bug fixes as well as features. Test finds a bug and argues that it should be fixed. Dev may disagree. Assertions are exchanged. Whether the bug is ultimately fixed or not comes down to the opinion of the relevant manager. Very rarely is the correctness of the decision ever verified. Decisions are made by gut, not data.</p> <p>In data driven quality, quality decisions must be made with data. Opinions and assertions do not matter. If an issue is in doubt, run an experiment. If adding a feature or fixing a bug improves the KPI, it should be accepted. If it does not, it should be rejected. If the data is not available, sufficient instrumentation should be added and an experiment designed to tease out the data. If the KPIs are correct, there can be no arguing with the results. It is no longer about the HiPPO. Even managers must concede to data.</p> <p>It is important to note that the data is often counter-intuitive. Many times things that would seem obvious turn out not to work and things that seem irrelevant are important. Always run experiments and always listen to them.</p> <p>Data driven quality requires taking risks. I covered this in my post on <a href="http://blogs.msdn.com/b/steverowe/archive/2014/07/09/try-fail-learn-improve.aspx">Try.Fail.Learn.Improve.</a> Data driven quality is about being agile. About responding to events as they happen. In theory, reality and theory are the same. In reality, they are different. Because of this, it is important to take an empiricist view. Try things. See what works. Follow the bread crumbs wherever they lead. Data driven quality provides tools for experimentation. Use them. Embrace them.</p> <p>Management must support this effort. If people are punished for failure, they will become risk averse. If they are risk averse, they will not try new things. Without trying new things, progress will grind to a halt. Embrace failure. Managers should encourage their teams to fail fast and fail early. 
This means supporting those who fail and rewarding attempts, not success.</p> <p>Finally, data driven quality requires a change in the very nature of what is rewarded. Traditional software processes reward shipping. This is bad. Shipping something users do not want is of no value. In fact, it is arguably of negative value because it complicates the user experience and it adds to the maintenance burden of the software. Instead of rewarding shipping, managers in a data driven quality model must reward impact. Reward the team (not individuals) for improving the KPIs and other metrics. These are, after all, what people use the software for and thus what the company is paid for.</p> <p>Team is the important denominator here. Individuals will be taking risks which may or may not pay off. One individual may not be able to conduct sufficient experiments to stumble across success. A team should be able to. Rewards at the individual level will distort behavior and reward luck more than proper behavior.</p> <p>The data driven quality culture is radically different from the big up front testing culture. As Clayton Christensen points out in his books, the values of the organization can impede adoption of a new system. It is important to explicitly adopt not just new processes, but new values. Changing values is never a fast process. The transition may take a while. Don't give up. Instead, learn from failure and improve.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-70309797139781694732014-07-14T00:18:00.000-07:002019-01-11T10:55:12.886-08:0010 Years of Blogging<p>I'm a little late, but it's time to celebrate anyway. Ten years ago, in March of 2004, I began blogging here. <a href="http://blogs.msdn.com/b/steverowe/archive/2004/03/10/hello-world.aspx">Hello World</a> was, of course, my first post. You have embraced and sometimes challenged my words. 
Thank you for continuing to read and to engage with me. I learn a lot by writing. It is my hope you learn a fraction of that by reading.</p> <p>After a long break where I published rarely and two years with no posts, you may have noticed that I am back. I waited to write this until I was sure I had some momentum. Last time I announced my return to blogging, I slacked off shortly after. Well, I'm back again and I have a lot I want to talk about. To those who continue to read after the hiatus, thank you for sticking around. I'll try to make it worth your while.</p> <p>After four years as the Test Development Manager responsible for the Windows Runtime API, I am moving on to new things. I love what we did with the new API, but it is time for change. I will remain in the Operating Systems Group and in the Test Development discipline, but will be working to enable greater use of data in our testing processes.</p> <p>For more than a year, I have been thinking about how to utilize data in our testing process. I have been inspired by the work of <a href="http://setheliot.com">Seth Eliot</a>, <a href="http://testastic.wordpress.com/">Brent Jensen</a>, <a href="http://blogs.msdn.com/b/kenj/">Ken Johnston</a>, <a href="http://angryweasel.com/blog">Alan Page</a>, and others. They paint the picture of Data Driven Quality where we determine our success by observing users rather than by test pass rates. As you can see from my recent posts, I have joined their ranks.</p> <p>No anniversary is complete without some stats. Over the past decade, I have written 418 posts. You have left 1194 comments. The most popular <a href="http://blogs.msdn.com/b/steverowe/archive/2007/01/22/how-much-memory-does-vista-need.aspx">post</a> was about how much memory Vista really needed. It garnered over 119,000 views.</p> <p>A few have managed to miss my Twitter handle in the about me section. 
I can be found at <a href="https://twitter.com/steverowe">@steverowe</a> on Twitter if you want to engage with me there.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com3tag:blogger.com,1999:blog-736810000699453506.post-49192313135261302862014-07-09T01:17:00.000-07:002019-01-11T10:55:13.387-08:00Try.Fail.Learn.Improve<p>Try.Fail.Learn.Improve. That has been the signature on my e-mail for the past few months. It is intended to be both enlightening and provocative. It emphasizes that we won't get things right the first time. That it is okay to fail as long as we don't fail repeatedly in the same way. Try.Fail.Learn.Improve is a process that needs to be constantly repeated. It is a way of life.</p> <p>When I first used this phrase, someone responded that it was too strongly worded. Perhaps I should say "Try, Learn, Succeed" instead. But that doesn't convey the true value of the phrase. I specifically chose the word Fail because I wanted to emphasize that we would get things wrong. I avoided the word succeed because I wanted to convey that the process would be a long one.</p> <p>Try. The essence of getting anything done is to start. In the world of software and especially systems software, we are always doing something unknown. We are not building the nth bridge or even the nth website. As such, the answers are not known up front. How could they be? Thus we can't say, "Do." That implies a known course of action. Try is more accurate. Make a hypothesis about what might work and try it out. Run the experiment.</p> <p>Fail. Most of the time--not just sometimes--what is tried will fail. It is important to be able to recognize when we fail. Trying something that cannot fail is also doing something from which we cannot learn. Only with the possibility of failure can learning be had. Failure should be expected. "Embrace Failure" is advice I gave early into my new role. Traditional software development has treated success as the only metric. 
The downside is that failure was punished. When something is punished, it will diminish. People will shy away from it. Punishing failure will disincentivize people from taking risk. The lack of risk means a lack of failure and a lack of learning. Given that we don't know the correct path to take, this lack of learning ensures a lack of success.</p> <p>Learn. Einstein is said to have defined insanity as doing the same thing over and over again and expecting a different result. It is possible to fail and not learn. This is usually the result of blame. In a failure-averse culture, admitting you were wrong has severe repercussions. Failure then is not admitted. It is not embraced. Instead, it is blamed on something external. "We would have succeeded if only the marketing team had done their job." The essence of learning is understanding failure. Why did things go differently than predicted? In the scientific method, this helps to set up the next experiment.</p> <p>Improve. Once we have failed and learned something about why we failed, it is time to try again. Devise the next experiment. If what we tried did not work, what else might work? Revise the hypothesis and begin the cycle again. Try the next thing. If at first you don't succeed--and you won't--try, try again.</p> <p>Succeed. Eventually, if we improve in this iterative fashion, we will succeed. This will take a while. Do not expect it to happen after the first, second, or even third turn of the crank. Success often comes in stages. It is not all or nothing.</p> <p>If success is such an elusive item and will take many cycles to achieve, is it possible to get to success faster? Yes. Designing better experiments can help, if we understand the problem well enough to design them. Blind experimentation will take too long. But many times we don’t know enough to design great experiments. What then? This is the most likely situation. In it, the best solution is to learn to run more experiments. 
Reduce the time each turn of the crank takes. Tighten the loop. Design smaller experiments that can be implemented quickly, run quickly, and understood quickly.</p> <p>One last word on success. It is impossible to succeed if you don't know what success looks like. Be sure to understand where you are trying to go. If you don't, how can you know if an experiment got you closer? You can't learn if you can't fail and you can't fail if you don't know what you are measuring for. As Yogi Berra said, "If you don't know where you are going, you'll end up somewhere else."</p> <p>So get out there and fail. It's the only way to learn.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com0tag:blogger.com,1999:blog-736810000699453506.post-19885268183944718872014-07-01T01:33:00.000-07:002019-01-11T10:55:13.679-08:00Prerequisites to Data Driven Quality<p>A previous post introduced the concept of data driven quality. Moving from traditional, up-front testing to data driven quality is not easy. It is not possible to take just any product and start utilizing this method. In addition to cultural changes, there are several technical requirements on the product. These are: early deployment, friction-free deployment, partial deployment, high speed to fix, limited damage, and access to telemetry about what the users are doing.</p> <p>Early deployment means shipping before the software is 100% ready. In the past, companies shipped beta software to limited audiences. This was a good start, but betas happened once or twice per product cycle. For instance, Windows 7 had two betas. Windows 8 had one. These were both 3 year product cycles. In order to use data to really understand the product's quality, shipping needs to happen much more often. That means shipping with lower quality. The exact level of stability can be determined per product, but need not be very high if the rest of the prerequisites are met. 
Ken Johnston has a stimulating post about the concept of <a href="http://blogs.msdn.com/b/kenj/archive/2014/05/20/the-future-of-quality-is-easy-with-eaasy-and-mvq.aspx">Minimum Viable Quality</a>.</p> <p>Friction-free deployment means a simple mechanism to get the bits in front of users. Seamless installation. The user shouldn't have to know that they have a new version unless it looks different. Google's Chrome browser really pioneered here. It just updates in the background. You don't have to do anything to get the latest and greatest version and you don't have to care.</p> <p>Because we may be pushing software that isn't 100%, deployment needs to be limited in scope. Software that is not yet fully trusted should not be given to everyone all at once. What if something goes wrong? Services do this with rings of deployment. First the changes are shown to only a small number of users, perhaps hundreds or low thousands. If that appears correct, it is shown to more, maybe tens of thousands. As the software proves itself stable with each group, it is deemed worthy of pushing outward to a bigger ring.</p> <p>If something goes wrong, it is important to fix it quickly. This can be a fix for the issue at hand (roll forward) or reversion to the last working version (roll back). The important thing is not to leave users in a broken state for very long. The software must be built with this functionality in mind. It is not okay to leave users facing a big regression for days. In the best case, they should get back to a working system as soon as we detect the problem. With proper data models, this could happen automatically.</p> <p>Deployment of lower quality software means that users will experience bugs. Total experienced damage is a function of both duration and magnitude. Given the previous two prerequisites, the damage will be limited in duration, but it is also important to limit the damage in magnitude. 
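The rings-of-deployment and rollback mechanics described above can be sketched roughly as follows. The ring names, audience sizes, and health bar are hypothetical; a real service would use its own health signals.

```python
# Ring-based rollout with automatic rollback: each ring widens the
# audience only while a health signal (e.g. crash-free session rate)
# stays above the bar; a bad reading stops the rollout.
RINGS = [("canary", 1_000), ("early adopters", 50_000), ("broad", 5_000_000)]
HEALTH_BAR = 0.99  # minimum acceptable crash-free session rate

def roll_out(health_by_ring):
    """Return the rings reached and the final disposition."""
    reached = []
    for (name, _audience), health in zip(RINGS, health_by_ring):
        if health < HEALTH_BAR:
            return reached, f"roll back before '{name}' ring"
        reached.append(name)
    return reached, "fully deployed"

# Healthy in the canary ring, regression detected in the second ring.
print(roll_out([0.995, 0.970, 0.999]))
```

The sketch makes the key property concrete: a regression caught in an early ring never reaches the broad audience, which is what keeps the damage limited in both duration and magnitude.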
A catastrophic bug which wipes out your file system or causes a machine not to boot need not last long to do lasting harm. Rolling back doesn't repair the damage. Your dissertation is already corrupted. Pieces of the system which can have catastrophic (generally data loss) repercussions need to be tested differently and have a different quality bar before being released.</p> <p>Finally, it must be easy to gather telemetry on what the user is doing with the product. The product must be capable of generating telemetry, but the system must also be capable of consuming it. The product must be modified to make generating telemetry simple. This is usually in the form of a logging library. This library must be lightweight. It is easy to overwhelm the performance of a system with too slow a library and too many log events. The library must also be capable of throttling. There is no sense causing a denial of service attack on your own datacenter if too many users use the software.</p> <p>The datacenter must be capable of handling the logs. The scale of success can make it difficult. The more users, the more data will need to be processed. This can overwhelm network connections and analysis pipelines. The more data involved, the more computing power is necessary. Network pipes must be big. Storage requirements go up. Processing terabytes or even petabytes of data is not trivial. The more data, the more automated the analysis must become to keep up.</p> <p>With these pieces in place, a team can begin to live the data driven quality lifestyle. There is much more than just the technology to think about, though. The very mindset of the team must change if the fourth wave of testing is to take root. I will cover these cultural changes next time.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com0tag:blogger.com,1999:blog-736810000699453506.post-61468831273071391462014-06-23T03:57:00.000-07:002019-01-11T10:55:13.972-08:00Perceived vs. 
Objective Quality<p>I recently heard this story, but I can't recall who told it to me. I don’t have proof of its veracity, so it might be apocryphal. Nevertheless, it illustrates an important point that I believe to be true independent of the truth of this story.</p> <p>As the story goes, in the late 1990s, several Microsoft researchers set about trying to understand the quality of various operating system codebases. Of concern were Linux, Solaris, and Windows NT. The perception among the IT crowd was that Solaris and Linux were of high quality and Windows NT was not. These researchers wanted to test that objectively and understand why NT would be considered worse.</p> <p>They used many objective measures of code quality to assess the three operating systems. These included things like <a href="http://en.wikipedia.org/wiki/Cyclomatic_complexity">cyclomatic complexity</a>, <a href="http://blogs.msdn.com/b/zainnab/archive/2011/05/19/code-metrics-depth-of-inheritance-dit.aspx">depth of inheritance</a>, static analysis tools such as <a href="http://en.wikipedia.org/wiki/Lint_(software)">lint</a>, and measurements of <a href="http://en.wikipedia.org/wiki/Coupling_(computer_programming)">coupling</a>. Without debating the exact value of this sort of approach, there are reasons to believe these sorts of measurements are at least loosely correlated with defect density and code quality.</p> <p>What the researchers found was that Solaris came out on top. It was the highest quality. This matched the common perception. Windows NT came next, closely behind Solaris. The surprise was Linux. It was far behind both of the other two. Why then the sense that it was high quality? The perceived quality of both NT and Linux did not match their objective measures of quality.</p> <p>The speculation on the part of the researchers was that while Linux had a lot of rough edges, the most used paths were well polished. 
The primary scenarios were close to 100% whereas the others were only at, say, 60%. NT, on the other hand, was at 80 or 90% everywhere. This made for high objective quality, but not high experienced quality.</p> <p>Think of it. If everything you do is 90% right, you will run into small problems all the time. On the other hand, if you stay within the expected lanes on something like Linux, you will rarely experience issues.</p> <p>This coincides well with the definition of quality being about fitness for a function. For the functions it was being used for, Linux was very fit. NT supported a wider variety of functions, but was less fit for each of them and thus perceived as being of lower quality.</p> <p>The moral of the tale: Quality is not the absence of defects. Quality is the absence of the right kinds of defects. The way to achieve higher quality is not to scour the code for every possible defect. That may even have a negative effect on quality due to randomization. Instead, it is better to understand the user patterns and ensure that those are free of bugs. Data Driven Quality gives the team a chance to understand both these use patterns and what bugs are impeding them.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-37334443515429035992014-06-16T01:06:00.000-07:002019-01-11T10:55:14.368-08:00Data Driven Quality<p>My <a href="http://blogs.msdn.com/b/steverowe/archive/2014/05/29/what-is-quality.aspx">last</a> <a href="http://blogs.msdn.com/b/steverowe/archive/2014/06/04/a-brief-history-of-test.aspx">three</a> <a href="http://blogs.msdn.com/b/steverowe/archive/2014/06/09/test-has-lost-its-way.aspx">posts</a> have explained how test lost its way. It evolved from advocates of the user to a highly efficient machine for producing test results, verifying correctness as determined by a specification. 
Today, test can find itself drowning in a sea of results which aren't correlated with any discernible user activity. If only there were a way to put the user back at the center, scale testing, and be able to handle the deluge of results. It turns out, there is. The path to a solution has been blazed by our web services brethren. That solution is data. Data driven quality is the 4th wave of testing.</p> <p>There is a lot to be said for manual testing, but it doesn't scale. It takes too many people, too often. They are too expensive and too easily bored doing the same thing over and over. There is also the problem of representativeness. A tester is not like most of the population. We would need testers from all walks of life to truly represent the audience. Is it possible to hire a tester that represents how my grandmother uses a computer? It turns out, it is. For free. Services do this all the time.</p> <p>If software can be released to customers early, they will use it. In using it, they will inevitably stumble across all of the important issues. If there were a way to gather and analyze their experiences, much of what test does today could be done away with. This might be called the crowdsourcing of testing. The difficulty is in the collection and analysis.</p> <p>Big Data and Data Science are the hot buzzwords of the moment. Despite the hype, there is a lot of value to be had in the increased use of data analysis. What were once gut feels or anecdotal decisions can be made using real information. Instead of understanding a few of our customers one at a time, we can understand them by the thousands.</p> <p>A big web service like Bing ships changes to its software out to a small subset of users and then watches them use the product. If the users stop using the product, or a part of the product, this can indicate a problem. This problem can then be investigated and fixed.</p> <p>The advantage of this approach is that it represents real users. 
Each data point is a real person, doing what really matters to them. Because they are using the product for real, they don't get bored. They don't miss bugs. If the product is broken, their behavior will change. That is, if they experience the issue. If they don't, is it really a bug? (More on this in another post.) This approach scales. It can cover all types of users. It doesn't cost more as the coverage increases.</p> <p>Using data aggregated across many users, it should be possible to spot trends and anomalies. It can be as simple as looking at what features are most used, but it can quickly grow from there. Where are users failing to finish a task? What parts of the system don't work in certain geographies? What kinds of changes most improve usage?</p> <p>If quality is the fitness of a feature for a particular function, then watching whether customers use a feature, for how long, and in what ways can give us a good sense of quality. By watching users use the product, quality can begin to be driven by data instead of pass/fail rates.</p> <p>Moving toward data driven quality is not simple. It operates very differently from traditional testing. It will feel uncomfortable at first. It requires new organizational and technical capabilities. But the payoff in the end is high. Software quality will, by definition, improve. 
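As a purely hypothetical illustration of the "where are users failing to finish a task" question, the sketch below aggregates per-user events through an invented three-step funnel; a sharp drop between two steps points at the step where real users are getting stuck.

```python
from collections import Counter

# Hypothetical funnel: a user completing the task moves start -> compose -> send.
FUNNEL = ["start", "compose", "send"]

def completion_rates(events_by_user):
    """Given {user: set of event names seen for that user}, return the
    fraction of users who reached each funnel step.  A big drop between
    adjacent steps marks where users fail to finish the task."""
    reached = Counter()
    for events in events_by_user.values():
        for step in FUNNEL:
            if step in events:
                reached[step] += 1
    total = len(events_by_user)
    return {step: reached[step] / total for step in FUNNEL}
```

The event names and funnel here are made up; the point is only that each number summarizes thousands of real users rather than one tester's guess.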
If users are driving the testing and the team is fixing issues to increase user engagement, the fitness for the function users demand of software must go up.</p> <p>Over the next few posts, I will explore some of the changes necessary to start driving quality with data.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-62133625547659446172014-06-10T02:03:00.000-07:002019-01-11T10:55:14.761-08:00Halt and Catch Fire<p>I just finished watching the <a href="http://www.amctv.com/full-episodes/halt-and-catch-fire/3595861397001/i-o">first episode</a> of the new AMC show called Halt and Catch Fire.  The name comes from an old computer instruction which would stop the machine immediately.  The show follows a small Texas company trying to build IBM PC Clones.  The company and the people are fictitious, but it seems to parallel a lot of what Compaq went through in the early 80s.</p> <p>I’ve always been a sucker for computing history.  I enjoy movies like Pirates of Silicon Valley and The Social Network.  I like Triumph of the Nerds.  I am happy to say that I really enjoyed the pilot episode.  It does a good job with the technical aspects of the show.  There is a scene where they are reverse engineering the ROM chip and it appears quite authentic to the way this work would be done.  They do a good job explaining things without getting dull.  They went out of their way to be accurate.  This <a href="http://www.wired.com/2014/05/halt-and-catch-fire/">article</a> in Wired points out the lengths they went to in order to be period authentic.  It shows.  
</p> <p>If you have any interest in computing history or just like techy tv shows, give Halt and Catch Fire a try.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-21795950918292237762014-06-09T02:49:00.000-07:002019-01-11T10:55:15.157-08:00Test Has Lost Its Way<p>In a blog <a href="http://testastic.wordpress.com/2012/03/31/knowurcustomer/">post</a>, Brent Jensen relays a conversation he had with an executive mentor. In this conversation, his mentor told him that, "Test doesn't understand the customer." When I read this, my initial reaction was the same as Brent's: "No way!" If test is focused on one thing, it is our customer. Then as I thought about it more, my second reaction was also the same as Brent's: "I agree. Test no longer cares about the customer." We care about correctness over <a href="http://blogs.msdn.com/b/steverowe/archive/2014/05/29/what-is-quality.aspx">quality</a>.</p> <p>Let me give an example. This example goes way back (probably to XP), but similar things happen even today. We do a lot of self-hosting at Microsoft. That means we use an early version of the software in our day-to-day work. Too often, I am involved in a conversation like the following.</p> <p>Me: I can't use the arrow keys to navigate a DVD menu in Windows Media Player.</p> <p>Them: Yes you can. You just need to tab until the menu is selected.</p> <p>Me: That works, but it takes (literally) 20 presses of the tab key and even then only a faint grey box indicates that I am in the right place.</p> <p>Them: That's what the spec says. It's by design.</p> <p>Me: It's by bad design.</p> <p>What happened here? I was trying out the feature from the standpoint of a user. It wasn't very fit for my purposes because if I were disabled or my mouse was broken, I couldn't navigate the menu on a DVD. 
The person I was interacting with was focused too much on the correctness of the feature and not enough on the context in which it was going to be used. Without paying attention to context, quality is impossible to determine.  Put another way, if I don’t understand how the behavior feels to the user, I can’t claim it is high quality.</p> <p>How did we get to this point? The long version is in my <a href="http://blogs.msdn.com/b/steverowe/archive/2014/06/04/a-brief-history-of-test.aspx">post</a> on the history of testing. Here is the condensed version. In the beginning, developers made software and then just threw it over the wall to testers. We had no specifications. We didn't know how things were supposed to work. We thus just had to try to use the software as we imagined an end user would. This worked great for finding issues, but it didn't scale. When we turned to test automation to scale the testing, this required us to better understand the expected cases. This increased the need for detailed specifications from which to drive our tests.</p> <p>Instead of a process like this:</p> <p><a href="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/1586.clip_image001_651735AD.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/1586.clip_5F00_image001_5F00_651735AD.png"><img title="clip_image001" style="border-top: 0px; border-right: 0px; background-image: none; border-bottom: 0px; padding-top: 0px; padding-left: 0px; margin: 0px; border-left: 0px; display: inline; padding-right: 0px" border="0" alt="clip_image001" src="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/8764.clip_image001_thumb_6E72FAE1.png" 
original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/8764.clip_5F00_image001_5F00_thumb_5F00_6E72FAE1.png" width="244" height="90" /></a></p> <p>We ended up with a process like this:</p> <p><a href="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/5226.clip_image002_6E06C7EC.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/5226.clip_5F00_image002_5F00_6E06C7EC.png"><img title="clip_image002" style="border-top: 0px; border-right: 0px; background-image: none; border-bottom: 0px; padding-top: 0px; padding-left: 0px; border-left: 0px; display: inline; padding-right: 0px" border="0" alt="clip_image002" src="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/metablogapi/8875.clip_image002_thumb_028C176B.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97-metablogapi/8875.clip_5F00_image002_5F00_thumb_5F00_028C176B.png" width="244" height="87" /></a></p> <p>As you can see in the second, the perspective of the user is quickly lost. Rather than verifying that the software meets the needs of the customer, test begins to verify that the software meets the specifications. If quality is indeed fitness for a function, then knowing the needs of the user is necessary for any determination of quality. If the user's needs could be completely captured in the specification, this might not be a problem, but in practice it is.</p> <p>First, I have yet to see the specification that totally captures the perspective of the user. Rather, the user's perspective is only one input to the specification. 
As changes happen, ambiguities are resolved, or the feature is scoped (reduced), the needs of the user are forgotten.</p> <p>Second, it is impossible to test everything. A completely thorough specification would be nearly impossible to test in a reasonable amount of time. There would be too many combinations. Some would have to be prioritized, but without the user perspective, which ones?</p> <p>Reality isn't quite this bleak. The user is not totally forgotten. They come up often in the conversation, but they are not systematically present. They are easily forgotten for periods of time. The matching of fitness to function may come up for one question or even one feature, but not for another.</p> <p>Test has lost its way. We started as the advocate for the user and have become the advocate for the specification. We too often blindly represent the specification over the needs of real people. Going back to our roots doesn't solve the problem.</p> <p>Even in the best case scenario, where the specification accounts for the needs of the user and the testers always keep the user forefront in their mind, the system breaks down. Who is this mythical user? It isn't the tester. We colloquially call this the "98052" problem. 98052 is the zip code for Redmond, where Microsoft is located. The people that live there aren't representative of most of the other zip codes in the country or the world.</p> <p>Sometimes we create aggregated users called personas to help us think about other users. This works up to a point, but "Maggie" is not a real user. She doesn't have real needs. I can't ask her anything and I can't put her in a user study. Instead Maggie represents tens of thousands of users, all with slightly different needs.</p> <p>Going back to our roots with manual testing also brings back the scale problem. We really can't go home. Home doesn't exist any more. 
Is there a way forward that puts the users, real users, back at the center, but scales to the needs of a modern software process? I will tackle one possible solution in my next post.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com8tag:blogger.com,1999:blog-736810000699453506.post-31999076060468257852014-06-04T01:05:00.000-07:002019-01-11T10:55:15.934-08:00A Brief History of Test<p>In the exploration of quality, it is important to understand where software testing came from, where it is today, and where it is heading. We can then compare this trajectory to the goal of ensuring quality and see whether a correction is necessary or if we're headed in the right direction.</p><br/><p>I have been involved in software testing for the past 16 and a half years. To give you some perspective, when I started at Microsoft, we were just shipping Windows 98. This gives me a lot of perspective on the history of software testing. What I give below is my take on that history. Others may have experienced it differently.</p><br/><p>There are three major waves of software testing and we're beginning to approach a fourth. The first wave was manual testing. The second wave was automated testing. The third wave was that of tooling. It is important to note that each wave does not fully supplant the previous wave. There is still a need for manual testing even in the tooling phase. There is a need for automated testing even in the coming fourth wave.</p><br/><p>The first wave was manual testing. Sometimes this is also called exploratory testing. It is often carried out by people with the Quality Assurance (QA) or Software Test Engineer (STE) title. Manual testing is just what it sounds like. It is people sitting in front of a keyboard and mouse and using the product. In its best form it is freeform and exploratory in nature: a tester trying to understand the user and carrying out operations trying to break the software. 
This is where the lore of testing comes from. The uber-tester who can find the bug no one else can imagine. In its worst form, this is the rote repetition of the same steps, the same levels, over and over again. This is also the source of <a href="http://www.trenchescomic.com/">legends</a>, but not good ones. At its best, this form of testing is highly connected to <a href="http://blogs.msdn.com/b/steverowe/archive/2014/05/29/what-is-quality.aspx">quality</a>. It is all about the user and his (or her) experience with the product.</p><br/><p>Manual testing can produce great user experiences. As I understand it from friends who have gone there, this is the primary method of testing at Apple. The problem is, manual testing doesn't scale. It can also be mind-numbing. In the era of continuous integration and daily builds, the same tests have to be carried out each day. It becomes less about exploring and more about repeating. Manual testing is great for finding the bugs initially, but it is a terribly inefficient regression testing model.</p><br/><p>It gets even worse when it comes time for software maintenance. At Microsoft, we support our software for a long time. Sometimes a really long time. Windows XP shipped in 2001 and is just now becoming unsupported. Consider for a moment how many testers it would take to test XP. I'll just make up a number and say it was 500. It wasn't, but that's good enough for a thought exercise. Every time you release a fix for XP, you need 500 people to run through all the tests to make sure nothing was broken. But the 500 original testers are probably working on Vista so you need an additional 500 people for sustained engineering. Add Windows 7, Windows 8, and Windows 8.1 and you now need 2,500 people testing the OS. Most are running the same regression tests every time which is not exciting so you end up losing all your good people. It just doesn't work.</p><br/><p>This leads us to the second wave. 
The first wave involved hundreds of people pressing the same buttons every day. It turns out that computers are really good at repetitive tasks. They don’t get bored and quit. They don't get distracted and miss things. Thus was born test automation. Test automation in its most basic form is writing programs to do all of the things that manual testers do. They can even do things testers really can't. Manual testing is great for a user interface. It's hard to manually test an SDK. It turns out that it is easy to write software that can exercise APIs. This magic elixir allowed teams to cover much more of the product every day. We fully drank the Kool-Aid.</p><br/><p>We set off to automate everything. There was nothing automation couldn't do and so all STEs were let go. Everyone became a Software Design Engineer in Test (SDET). This is a developer who, rather than writing the operating system, writes automated tests. Some of this work is mundane: calling the Foo API with a 1, a 25, and a MAX_INT. Other parts can be quite challenging. Consider how you would test the audio playback APIs. It is not enough to merely call the APIs and look at the return code. How do you know the right sound was played, at the right volume, and without crackling? Hint: it's time to break out the <a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform">FFT</a>.</p><br/><p>Not everything is kittens and roses in the world of automation. Machines are great at doing what they are told to do. They don't take breaks or demand higher pay. However, they do only what they are told to do. They will only report bugs they are told to look for. One of my favorite bugs to talk about involved media player. In one internal build, every time you clicked next track on a CD (remember those things?), the volume would jump to maximum. While a test could be concocted to look for this, it never would be. Test automation happily reported a pass because indeed the next track started playing. 
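The FFT idea above can be sketched with NumPy. Everything here (the sample rate, the tolerances, the decision to also assert on level) is an illustrative assumption rather than any real test harness, but note that a check on volume as well as pitch is exactly the kind of assertion that would have caught the max-volume bug.

```python
import numpy as np

def dominant_frequency(samples, sample_rate):
    """Strongest frequency (Hz) in a mono buffer, found via a real FFT."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def check_playback(samples, sample_rate, expected_hz, expected_rms,
                   hz_tol=5.0, rms_tol=0.1):
    """Pass only if the right tone came out at roughly the right level,
    instead of trusting a success return code from the playback API."""
    ok_pitch = abs(dominant_frequency(samples, sample_rate) - expected_hz) <= hz_tol
    ok_level = abs(np.sqrt(np.mean(samples ** 2)) - expected_rms) <= rms_tol
    return ok_pitch and ok_level

# A clean 440 Hz test tone at half amplitude (RMS about 0.354):
rate = 44_100
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
```

A doubled-amplitude buffer still has the right pitch, so a pitch-only check would pass it; the level assertion is what flags the regression.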
It turns out that once you have run a test application for the first time, it is done finding new bugs. It can find regressions, but it can't notice the bug it missed yesterday.</p><br/><p>This points toward the second problem with test automation. It requires very complete specifications. Because the tests can't find any bugs they weren't programmed to find, they need to be programmed to find everything. This requires a lot more up-front planning so the tests can cover the full gamut of the system under test. This heavy reliance upon specifications begins to distance testing from the needs of the user and thus we move away from testing quality and toward testing adherence to a spec.</p><br/><p>The third problem with test automation is that it can generate too much data. Machines are happy to churn out results every day and every build. I knew teams whose tests would generate millions of results for each build. Staying on top of this becomes a full-time job for many people. Is this failure a bug in the test? Is it a bug in the product? Was it an environmental issue (network down, bad installation, server unavailable)?</p><br/><p>One other problem is that the test work can grow faster than test developers can keep up with it. It is easy for a developer to write a little code which creates a massive amount of new surface area. Consider the humble <a href="http://en.wikipedia.org/wiki/Decorator_pattern">decorator</a> pattern. If I have 4 UI objects in the system, the tester needs to write 4 sets of tests. Now if the developer creates a decorator which can apply to each of the objects, he only has to write one unit of code to make this work. This is the advantage of the pattern. However, the tester has to write 4 new sets of tests. The test surface is growing geometrically compared to the code the developer is writing. This is unsustainable for very long.</p><br/><p>This brings us to the third wave of testing. This wave involves writing software that writes tests. 
I call this the tooling phase. Rather than directly writing a test case, it is possible to write a tool that, given some kind of specification, can emit the relevant test cases automatically. <a href="http://en.wikipedia.org/wiki/Model-based_testing">Model Based Testing</a> is one form of this tooling. The advantage of this sort of tooling is that it can adapt to changes. Dev added one decorator to the system? Add one new definition in your model and tests just happen.</p><br/><p>There are some downsides to the tooling approach. In fact, there are enough downsides that I've never experienced a team that adopted it for all or even most of their testing. Such teams probably exist, but they aren't common. At most, this tooling approach was used to supplement other testing. The first downside is the oracle problem. It is easy enough to create a model of the system under test and generate hundreds or thousands of test cases. It is another thing entirely to understand which of these test cases pass and which fail. There are some problem domains where this is tractable. Each combination or end state has an easily discernible outcome. In others, it can be exceptionally difficult without re-creating all of the logic of the system under test. The second is that the failures can be very hard to reason about in terms of the user. When the Bar API gets this and that parameter while in this state, it produces this erroneous result. Okay. But when would that ever happen in the real world?</p><br/><p>Tooling approaches can solve the static nature of testing mentioned above. Because it is mathematically impractical to do a complete search of the state space of any non-trivial application or API, we are always limited to a subset of all possible states for testing. In traditional automation, this subset is fixed. In the tooling approach, the subset can be modified each time with random seeds, longer exploration times, or varying weights. This means each run can expose new bugs. 
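A toy illustration of that idea: a model of a made-up media player and a generator that takes a seeded random walk through it. The state machine is invented for illustration; the point is that a different seed yields a different walk, so each run can cover transitions the last one missed, while the same seed reproduces a failing walk exactly.

```python
import random

# Hypothetical model of a tiny media player: state -> {action: next state}.
MODEL = {
    "stopped": {"play": "playing"},
    "playing": {"pause": "paused", "stop": "stopped"},
    "paused":  {"play": "playing", "stop": "stopped"},
}

def generate_test(seed, steps=10, start="stopped"):
    """Emit one random walk through the model as a list of (state, action).

    Every emitted action is legal in its state, and a given seed always
    reproduces the same walk, which makes failures easy to replay."""
    rng = random.Random(seed)  # a new seed explores a new subset of the space
    state, walk = start, []
    for _ in range(steps):
        action = rng.choice(sorted(MODEL[state]))
        walk.append((state, action))
        state = MODEL[state][action]
    return walk
```

Executing a walk against the real player and checking the observed state after each action is where the oracle problem shows up: the model knows what state *should* result, but someone still has to decide what "playing" observably means.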
This can be used to good effect. Given some metadata about an API and rules on how to call it, a tool can be created to automatically explore the API surface. We did this in Windows 8 when testing the Windows Runtime surface.</p><br/><p>Sometimes it can have unexpected and even comical outcomes. I recall a story told to me by a friend. He wrote a tool to explore the .NET APIs and left it to run overnight. The next morning he came in to reams of paper on his desk. It turns out that his tool had discovered the print APIs and managed to drain every sheet of paper from every printer in the building. At Microsoft every print job has a cover sheet with the alias of the person doing the printing, so his complicity was readily apparent. Some kind soul had gathered all of his print jobs and placed them in his office.</p><br/><p>The tooling approach to testing exacerbates two of the problems of automation. It creates even more test results which then have to be understood by a human. It also moves the testing even further away from our definition of quality. Where is the fitness for a function taken into account in the tooling approach?</p><br/><p>There is a problem developing in the trajectory of testing. We, as a discipline, have moved steadily further from the premise of quality. We'll examine this in more detail in the next post and start considering a solution in the one after that.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com7tag:blogger.com,1999:blog-736810000699453506.post-76256701690779033452014-05-29T16:57:00.000-07:002019-01-11T10:55:16.649-08:00What is Quality?<div><br/><div><br/><div><br/><p>Most of my career so far has focused on software testing in one form or another. What is testing if not verifying the quality of the object under test? But what does the word quality really mean? It is hard to define quality, but I will argue that a good operating definition is fitness for a function. 
In the world of software then, the question test should be answering is whether the software is fit for the function at hand.</p><br/><p>The book Zen and the Art of Motorcycle Maintenance tackles the question of what quality is head on. Unfortunately, it doesn't give a clear answer. The basic conclusion seems to be that quality is out there and it drives our behavior. It's a little like Plato's theory of Forms. This is interesting philosophically, but not practically. There are some parts which are more pragmatic. One passage sticks out to me. As might be suspected from the title, there is some discussion of motorcycle maintenance in the book (but not much). At one point the Narrator character is on a trip with his friend John Sutherland. The Narrator has an older bike while John has a fancy new one. The Narrator understands his bike. John chooses not to learn about his and needs a mechanic to do anything to it. Quality then is that relationship between the operator and the bike. The more they understand it and can fix it, the higher quality the relationship. In other words, the more the person can get out of it without needing to go to someone else, the higher the quality.</p><br/><p>Christopher Alexander wrote about architecture, yet he is quite famous in the world of software design. His books talk about patterns in buildings and spaces and how to apply them to get specific outcomes. The "Gang of Four" translated his ideas from physical space to the virtual world in their book, Design Patterns. Alexander is interesting not just in his discussion of patterns, but also of quality. What makes a design pattern good, in his mind, is its fitness for a purpose. He says, "The form is the part of the world over which we have control, and which we decide to shape while leaving the rest of the world as it is. The context is that part of the world which puts demands on this form; anything in the world that makes demands of the form is context. 
Fitness is the relation of mutual acceptability between these two." (Notes on the Synthesis of Form)</p><br/><p>Both authors are making an argument that quality then is not something that can be determined in a vacuum. One cannot merely look at a device or a piece of software and make an assessment of quality. In the motorcycle case, the fancy bike would probably look to be of higher quality. It was the relationship with the owner that made the chopper of higher quality. With software, it is similar. One must understand the users and the use model before a determination of quality can be made.</p><br/><p>Let's look at a few examples. Think of the iPhone. It has a simple interface. While it has gained complexity over time (it's nearly on version 8!), it is still quite limited compared to a traditional computer. It has constrained input options, preferring only touch. The buttons are big and the screens not dense. Because of this, the apps tend to be simple and single-purpose. There is no multitasking to be found. It didn't even have cut and paste when it appeared on the scene. Yet the iPhone is considered to be high quality. Its audience doesn't expect to be doing word processing on it. They want to check e-mail, "do" Facebook, and play games. It is exquisitely suited for this purpose.</p><br/><p>At the other extreme, consider a workstation running AutoCAD. AutoCAD has thousands of functions, many windows open at once, and requires extreme amounts of processing power and memory. Its user interface is quite cluttered compared to that of most iPhone apps and it is not easily approachable by mere mortals. Yet it too is considered high quality. Its users expect power over everything else. They need the ability to render in 3D and model physics. It serves this purpose better than anything else in the market. The simplicity and prettiness of the iPhone interface limits utility and is unwanted in this domain. 
The domain of CAD is one of capability over beauty and efficiency over discoverability.</p><br/><p>Too often in the world of software we ignore this synthesis of user and device. Instead we focus on correctness. The quality of the software is judged based on how correctly it implements a spec. This is an easier definition to interpret. It is more precise. There is a right and a wrong answer. Either the software matches the spec or it does not. With a fitness definition, things are murkier. Who is the arbiter of good fit? How bad does it need to be before it is a bug? It can be alluring to follow the sirens of precision, but that comes at a cost. I will talk about that cost next time.</p><br/><p> </p><br/></div><br/></div><br/></div>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com5tag:blogger.com,1999:blog-736810000699453506.post-90744121987176830802012-04-10T01:15:00.000-07:002019-01-11T10:55:17.266-08:00How to answer a programming interview question<p>I spent a few hours on Friday doing mock interviews for CS undergrads.  The idea was to help them experience the interview process without the pressure of having a job on the line.  The session was interactive with lots of stopping for advice.  I found myself explaining the following to most of the candidates.  I hope you find some value in it.</p> <ol> <li><strong>Explain the algorithm</strong> that you will use to solve the problem.  Use visuals where you can.  This will accomplish two things.  First, it will make sure you understand what you are going to do before you start coding.  This in turn will reduce the number of corrections and backtracks you have to make while coding.  Second, it will give the interviewer a chance to help you correct course early.  This means getting onto the right track before you spend a lot of time coding a solution to the wrong problem.</li> <li><strong>Write the code</strong>.  Use a real programming language.  
Tell the interviewer what language you are using and then write in it.  Don’t sweat the syntax.  Focus on the flow of the code.  Using a real language is important because it is too easy to gloss over the important things with pseudo-code.  Try to write the code in one pass.  You already have the algorithm figured out so you shouldn’t need to backtrack and change your code.</li> <li><strong>Walk through the code with real data</strong>.  Before you declare yourself “done”, take the time to walk through the code.  Take a real example and explain how each line of code operates on it.  This will a) give you a chance to find bugs and b) demonstrate to the interviewer that the code works.  Pay careful attention here.  I’ve seen a number of candidates with big bugs in their code explain how they want the code to work instead of how it really works.  If you don’t pay attention, you won’t find the bugs.</li> <li>If you are interviewing for a test developer position, <strong>test your code</strong>.  You’ll be asked to do so anyway, so you might as well get credit for taking the initiative.  Even if you aren’t interviewing as an SDET, it is a good idea to show that you are thinking through the failure cases.</li> </ol>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-8936404750290485402012-04-09T18:54:00.000-07:002019-01-11T10:55:17.658-08:00Jack Tramiel, founder of Commodore, dies at age 83<p>Jack Tramiel was the founder of Commodore International, which produced the Commodore 64 and the Amiga computers. It was also the company that made the once ubiquitous 6502 processor, which powered the Apple // and the Commodore 64. The Commodore 64 was the best-selling computer of all time and the Amiga (which came after his tenure, but from his company) was almost a decade ahead of its time. I learned to program with BASIC on the Commodore 128 and spent a lot of my formative years using the Amiga. 
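The four steps above can be made concrete with a small end-to-end example. This is only a sketch (Python is an arbitrary choice here; the post does not prescribe a language) showing step 2's code, step 3's walkthrough with real data, and step 4's tests:

```python
def reverse(chars):
    """Reverse a list of characters in place using two indices."""
    left, right = 0, len(chars) - 1
    while left < right:
        chars[left], chars[right] = chars[right], chars[left]
        left += 1
        right -= 1
    return chars

# Step 3: walk through with real data, e.g. ['a', 'b', 'c', 'd']:
#   left=0, right=3 -> swap 'a'/'d' -> ['d', 'b', 'c', 'a']
#   left=1, right=2 -> swap 'b'/'c' -> ['d', 'c', 'b', 'a']
#   left=2, right=1 -> loop exits
# Step 4: test the edge cases, not just the happy path.
assert reverse(list("abcd")) == list("dcba")
assert reverse(list("abc")) == list("cba")  # odd length
assert reverse([]) == []                    # empty input
assert reverse(["x"]) == ["x"]              # single element
```

Narrating a trace like the one in the comments is exactly what step 3 asks for: it shows the interviewer that the code works, not just that it was written.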
</p><br/><p>Jack Tramiel was a ruthless business person who drove a very hard bargain. His relentless push for lower prices drove the first major round of the home computer revolution. He <a href="http://www.forbes.com/sites/davidthier/2012/04/09/computer-legend-and-gaming-pioneer-jack-tramiel-dies-at-age-83/">died on Monday at age 83</a>. He had a great legacy and will be missed. A great book on Commodore and Jack Tramiel's role in it is <a href="http://www.amazon.com/Commodore-Company-Edge-Brian-Bagnall/dp/0973864966">Commodore: A Company on the Edge</a>.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com0tag:blogger.com,1999:blog-736810000699453506.post-82050829959913236162012-03-20T02:19:00.000-07:002019-01-11T10:55:17.950-08:00Behind the Scenes of Windows 8<p><a href="http://blogs.msdn.com/b/larryosterman/">Larry Osterman</a> returns with another <a href="http://blogs.msdn.com/b/b8/archive/2012/03/06/going-behind-the-scenes-building-windows-8.aspx">installment</a> of his Behind the Scenes... series. This time with Windows 8. Larry is a developer for the team I work on. If you haven't caught it yet, take the time to read how the team developing Windows Runtime experienced this release.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com0tag:blogger.com,1999:blog-736810000699453506.post-54490062750886726502012-03-13T02:59:00.000-07:002019-01-11T10:55:18.279-08:00Successfully Interviewing for a Developer Job<p>Having recently completed another round of campus interviews, I have several pieces of advice that could be useful to those of you aspiring to get jobs in the software industry as a Developer or Test Developer.</p> <ul> <li><strong>Always describe what you are doing and thinking</strong> – If you are given a programming question and you interact only with the paper or the whiteboard for the next 10 minutes, you did poorly even if you got the answer right.  
You failed if you didn’t get the answer right.  The interviewer wants to know not only that you can get the answer but also how you think.  If you have the right thought processes, you might pass the interview even if you don’t get the answer.  Oftentimes people get extra credit for explaining what their options are and why they are picking one or another.</li> <li><strong>When asked about a project or job, describe what *you* did</strong> – Explaining that the team wrote a location-aware notification system for Android phones doesn’t win you any points if you only did the graphics.  If you wrote the notification database and not the location APIs, talk about the database.  The interviewer will likely probe the details of the project.  Emphasizing the exciting parts you didn’t have a part in only leads to your having to say “I don’t know.  I didn’t work on that.”</li> <li><strong>Be able to speak to the details of every project on your resume</strong> – If you put it there, it is to demonstrate your knowledge of an area.  If you don’t actually have that knowledge anymore, it doesn’t help you.  In fact, it makes you look less knowledgeable.  You don’t have to know every detail of the puzzle game you wrote in that AI class 3 years ago, but you should know enough to talk about the algorithm you used.  Spend the time reviewing and perhaps even practicing talking about each one before showing up.</li> <li><strong>If you don’t know, don’t bluff</strong> – It is often okay to say “I don’t know…” as long as you can also say “…and this is how I would find out.”  It is certainly better than bluffing.  The person interviewing you has probably done a lot of interviews and a lot of programming.  They will likely know that you don’t really know.  Getting caught bluffing almost certainly kills your chances of being hired.</li> <li><strong>Think through your code before you start writing</strong> – Having to backtrack a lot and change your code is not a good thing.  
It is better than being wrong, but it looks worse than the next candidate who answers the same question without having to backtrack.  Having to add a lot of extra variables or loops or having to change your algorithm substantially demonstrates that you didn’t really understand the problem to start with.  It also makes your code really hard to read because, unlike in a text editor, nothing shifts down on a whiteboard when you insert a line.  Take the time to think through your algorithm before you start writing.  Even better, explain it to the interviewer.  That way they can correct you if you are on the wrong track.</li> <li><strong>Time matters</strong> – Sometimes it is not enough to get the solution correct.  Taking a long time to get there, especially if you have to backtrack, shows you do not have sufficient mastery.  Some problems are easy and the interviewer expects you to get them right the first time and without a lot of delay.  Finding a value in a binary search tree or reversing a string falls into this category.</li> <li><strong>Be friendly</strong> – Believe it or not, your interviewer is human.  He or she will be more inclined to give you the benefit of the doubt if they like you.  Being friendly, making eye contact, and being upbeat all help with this.  If you answer questions with short sentences that do not further the conversation, never make eye contact, and have a sullen attitude, you will fail a close interview.</li> <li><strong>Ask clarifying questions</strong> – Don’t just jump in and start programming.  Think about what you are being asked to do.  Many times the question will be intentionally vague.  Ask questions about the constraints, the expected use, the interface, etc.  If the question was intentionally vague, not asking questions will be a negative.  Asking intelligent questions shows that you understand the question enough to notice the edges.  That will earn you points.</li> </ul> <p>Have other advice?  
Please leave it in the comments.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com2tag:blogger.com,1999:blog-736810000699453506.post-35108384959892340042012-03-09T07:52:00.000-08:002019-01-11T10:55:18.724-08:00If you ever wondered why Vim uses hjkl for arrow keys...<p>I tend to use Vim as my editor of choice. Even when using Visual Studio, I do so with the <a title="ViEmu" href="http://www.viemu.com/">ViEmu</a> plugin. I have always wondered why the directional keys were hjkl instead of jkl;. The latter are the home keys for the right hand. The former are not and overload the right index finger. Thanks to Hacker News, I now know the <a title="answer" href="http://www.catonmat.net/blog/why-vim-uses-hjkl-as-arrow-keys/">answer</a>. The terminal on which Bill Joy wrote the original Vi was an ADM-3A, which had arrows drawn on the hjkl keys (see the link for a picture). </p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-60580305362703344192012-03-07T12:53:00.000-08:002019-01-11T10:55:19.117-08:00How to Write Your First Developer Resume<p>I am returning from a recruiting trip to interview students on campus.  Because of this trip, I had a chance to read a good number of resumes.  Some were well done; many, however, were not.  They contained irrelevant information or were missing important items.  It’s very hard writing your resume for your first job.  You don’t have a corpus of work history that tells a potential employer that you will make a good employee.  It’s not immediately obvious what to write down and what to leave out.  What follows is advice.  It is only one person’s opinion, but I have been conducting job interviews at Microsoft for about 14 years so it is an experienced opinion.  As with all opinions, others will have differing ones.  
I encourage others to respond in the comments if they have supplementary or even contrary opinions.</p> <p>It is important to understand the purpose of a resume before writing one.  If you understand what your resume will be used for, it is easier to craft the resume.  If you are applying for a technical position at a software company, your resume will serve two purposes.  First, it will be used to decide whether to interview you.  Second, it will be used to provide conversational hooks during the interview itself.</p> <p>The resume will be read by someone from Human Resources or perhaps by a technical manager.  They will read it to determine if you are worth talking with further.  Each will be looking for slightly different things.  A person from HR will not understand most of the words and acronyms on your resume, but they will match them with the job requirements.  A “skills” section is often a good way to convey this information.  In it you would list the languages you know, the frameworks you understand, and the tools you have used.  While the HR person will likely only scan this list, know that anything you list here is fair game for an interviewer to ask you about later.  Don’t bluff.  </p> <p>What sort of skills should you list?  Certainly list any programming languages you are comfortable with.  I have seen people rate their level of experience and find this to be a useful practice.  This allows you to list Perl, which you used in one class, even though you wouldn’t call yourself an expert in it.  List major frameworks that you have used.  This includes UI toolkits like WinForms, GTK, or Direct3D.  It includes Web frameworks like jQuery or ASP.NET MVC.  One rule of thumb to use here: if there are books written on the subject and you have experience, list it.  </p> <p>Another important thing to list is the tools you have used.  You can list operating systems you are familiar with.  You probably won’t get benefit from listing obsolete systems.  
Saying that you know Windows 98 won’t gain you points.  Mentioning a quirky but respected OS might make you stand out.  If you used the Amiga or BeOS, saying so can be noticed.  The person reading your resume might have been a fan too.  You don’t need to list that you know Windows XP and Vista and 7.  Just say Windows.  List any major, relevant software packages you have used.  Relevance is useful here.  Listing Microsoft Office won’t help unless you programmed for it.  It will be assumed that if you can program, you can use an office suite.  Certainly list difficult-to-master software like MATLAB or Photoshop.  List relevant server technologies like Apache or Node.js.  You might list the IDEs you are comfortable with.  I saw some people list the source control systems they had used.  I’m not convinced this helps.  It will be assumed that you can come to master Git or Subversion.  </p> <p>The technical manager screening your resume will likely look past this list of technologies more quickly than the HR person.  He or she will be looking for your experience.  What have you done?  “Not much” you might be thinking, “I haven’t had a job in the tech industry yet.”  That may be true, but if you think you are skilled enough to land a job as a developer or test developer, you must think you have relevant skills and those skills were gained through experience.  There are three ways to expose these to your screener: classes, interesting projects, and relevant jobs.  It is fortunate that these are also the same things your interviewer will likely key off of in their perusal of your resume.</p> <p>It amazes me how few students list their classes on their resume.  It helps to know if you’ve taken data structures or databases so I know what kind of knowledge you are likely to have.  These may also help you be considered for jobs you otherwise would not.  
If you have taken an AI class, you are more likely to be considered for a job with a search relevancy team.</p> <p>No resume of a college student or recent graduate should be lacking a list of interesting projects.  Yet probably a quarter do not have one.  Especially if this is your first technology job or internship, the projects section is the only place you can talk about what relevant experience you have.  Many if not most CS classes have a final project.  Pick several and talk about them.  List not only the name of the project but also describe a) your role on it, b) what it accomplished, and c) what technologies (languages/frameworks) it used.  Don’t stop at school projects.  Even more impressive is the work that you did on your own because that demonstrates true interest.  Did you contribute to an open source project?  List it.  Did you program a game or a phone app with your buddies over Christmas?  List it.  Do you help maintain a MUD or create objects for Second Life?  List them.  Don’t be shy here.  Just because you weren’t paid or graded on your work doesn’t mean it isn’t interesting to us.</p> <p>Finally, list relevant jobs.  It is important to know why you are listing jobs here.  Unlike when you are applying to be a Target Team Member, you don’t need to prove that you will show up on time and work hard.  Those things are assumed at this stage.  The fact that you have a college degree is proof enough.  That means you don’t need to list every job you’ve ever had.  The most important jobs to list are ones that are somewhat relevant.  This can be direct relevance, like an internship at Cisco testing router firmware, but it can also be tangentially relevant work, like doing tech support for your dad’s small business or even running the light and sound for a school theater.  Working at Best Buy shows that you have an interest in technology.  If you have nothing relevant, feel free to list at least a few jobs even if they were burger flipping.  
This will show that you have some employment history and aren’t a total slacker.</p> <p>When listing a job, it is important to know why you are listing it.  If it is because of relevance, explain what you did at the job.  List the projects you were on, what your role was, and what languages/technologies you used.  Give enough information that we will know what you learned.  If you are listing the job merely to demonstrate employment, just list that you had the job and for how long.  Telling me that you “created a safe pool environment” while working at the YMCA doesn’t help your cause.  The fact that you “Provided excellent customer service” at McDonalds doesn’t make me more likely to interview you over the next guy.  The same is true of extra-curricular activities.  While you may be proud of the multicultural banquet you organized last month, it won’t gain you a lot of points.  Don’t spend much space describing it.  </p> <p>These things will all make you more likely to get through the screening process.  Once you have passed the initial screen, your resume serves a secondary purpose.  Keep this in mind when writing it.  The people interviewing you—either for an initial phone screen or an in person interview—want to know what you know so they can ask you about it.  Toward this end they will ask you technical questions.  They might have you reverse a string or find the 4th element from the end of a linked list.  Perhaps just as important, though, they will want to gauge your passion and understanding.  This will be done by asking about your experience.  They will want to know about the research project you did or the Windows Phone app you wrote last summer.  For this to happen, they need to know what you have done.  Luckily, this is the same information you already provided to the technical manager.  What did you do at your jobs?  What projects have you participated in recently?</p> <p>Sometimes the interviewer will want to get to know you as a person.  
Listing a few activities or interests can help here.  Most of the time this section will be ignored, but if you happen to enjoy something in common with your interviewer, it can provide an opportunity to build a relationship.  Feel free to list that you enjoy sea kayaking or playing chess or that you were part of a Romeo and Juliet production at the Shakespeare festival last summer.</p> <p><soapbox>Please consider whether your objective statement actually helps you.  Most of the time, it does not and you shouldn’t have one.  I know that the helpful person at your career center told you to put one on your resume.  They are wrong.  Most of the time an objective statement boils down to “I want a job.”  Yep.  That’s why I have your resume in my hands right now.  Unless your objective statement can somehow help differentiate you from your competition, leave it off.  Everyone whose resume I have in front of me wants a job.  They even all want a software job.  Telling me that doesn’t help.  It just wastes space and makes it less likely that I’ll read something which does make you stand out.  In my experience, the objective statement provides value only when used for the purpose of *limiting* the scope of what you will consider.  If you really only want a job programming microcontrollers, put that in your objective statement.  We’ll pass you by for the kernel programming job.  If, on the other hand, you—like most people in school—just want a job in the field of computers, leave it off.  </p> <p>Here is an example of the kind of objective statement that is all too common:  “Objective:  Seeking to expand my experience and technical knowledge and gain experience in a corporate environment.”  Thanks.  
That differentiates you from every other resume I’ve seen.</soapbox></p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com4tag:blogger.com,1999:blog-736810000699453506.post-15439158512392684282011-10-06T12:29:00.000-07:002019-01-11T10:55:19.672-08:00Steve Jobs on the Value of Saying No<p>I ran across a great segment of Steve Jobs talking at the WWDC in 1997 just after he returned to Apple. Similar to my post about <a href="http://blogs.msdn.com/b/steverowe/archive/2011/09/27/pruning-the-decision-tree.aspx">pruning the decision tree</a>, he speaks about the power of saying no to the bad ideas. "Focusing is about saying no," he says. His analysis of what was wrong with Apple at that time was that they had terrible engineering management. They were doing too many things--interesting things--but had no direction. When he took over, the decisions that had to be made were not to cut things that were bad, but to cut things that were unfocused. A lack of focus makes the whole less than the sum of the parts. Good focus allows the whole to become greater than the sum of the parts.</p><br/><p> </p><br/><p>Two segments are worth watching. The first is Steve Jobs explaining his philosophy:</p><br/><p><iframe height="315" src="https://www.youtube.com/embed/x7dqG9m9d44" frameborder="0" width="420" allowfullscreen=""></iframe></p><br/><p>The second is him responding to a question about why they cut OpenDoc. The interesting observation is that OpenDoc was probably better than anything else at some things. That, by itself, wasn't enough. 
It had to be part of a larger vision or it had to go.</p><br/><p> <iframe height="315" src="https://www.youtube.com/embed/FF-tKLISfPE" frameborder="0" width="420" allowfullscreen=""></iframe></p><br/><p> He ends with a great observation about how you have to let the vision dictate the technology and not the other way around.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com2tag:blogger.com,1999:blog-736810000699453506.post-71371283790930172442011-10-05T13:55:00.000-07:002019-01-11T10:55:20.119-08:00How much did Steve Jobs Mean To the Tech Industry?<p> </p><br/><p>This picture says it all. The front page of <a href="http://news.ycombinator.com/">Hacker News</a> is completely dominated by the news of Steve's death. Even to someone like me who never owned an Apple product, it is clear he had a huge influence and raised the bar. Not just once, either: at least five major products of his changed the state of the industry. The Apple // was a watershed for personal computers. The Mac for UI. The iPod and especially iTunes for digital music. The iPhone completely changed the phone business. The iPad created a whole new category of devices. 
His influence will be greatly missed.</p><br/><p><a href="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/5287.SteveJobs.png" original-url="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97/5287.SteveJobs.png"><img border="0" alt="" src="https://msdnshared.blob.core.windows.net/media/MSDNBlogsFS/prod.evol.blogs.msdn.com/CommunityServer.Blogs.Components.WeblogFiles/00/00/00/33/97/5287.SteveJobs.png" original-url="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-33-97/5287.SteveJobs.png" /></a></p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-12584664566082912552011-09-28T00:26:00.000-07:002019-01-11T10:55:20.511-08:00Pruning the Decision Tree in Test<p>Yesterday I <a href="http://blogs.msdn.com/b/steverowe/archive/2011/09/27/pruning-the-decision-tree.aspx">wrote</a> about the need to reduce the number of things a project attempted to do in order to deliver a great product.  Too many seemingly good ideas can make a product late or fragmented or both.  The same is true of testing a product.  Great testing is more about deciding what not to test than deciding what to test.</p> <p>There is never enough time to test everything about a product.  This isn't just the fault of marketing, which has a go-to-market date in mind.  It is a physical reality.  To thoroughly test a product requires traversing the entire state tree in each possible combination.  Like the <a href="http://en.wikipedia.org/wiki/NP-complete">traveling salesman problem</a>, the number of paths through the state tree explodes combinatorially.  In layman's terms, this means that there is not enough time to test everything for any non-trivial program.</p> <p>When someone first starts testing, thinking up test cases is hard.  
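A back-of-the-envelope calculation shows how quickly the combinatorics described above get out of hand (the setting counts and the one-test-per-second rate are illustrative assumptions, not figures from the post):

```python
# Each independent on/off setting doubles the configuration space.
def configurations(num_settings: int) -> int:
    return 2 ** num_settings

SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# At one test per second, running non-stop:
for n in (10, 30, 60):
    total = configurations(n)
    years = total / SECONDS_PER_YEAR
    print(f"{n} settings: {total:,} configurations (~{years:,.1f} years at 1 test/sec)")
```

Thirty boolean settings already take about 34 years at that rate, which is why deciding what not to test matters as much as deciding what to test.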
We often ask potential hires to test something like a telephone or a pop machine.  We are looking for creativity.  Can they think up a lot of interesting test cases?  After some time in the field, however, most people are able to think up a lot more tests than they have time to carry out.  The question then becomes no longer one of inclusion, but one of exclusion.</p> <p>In Netflix's case the exclusion was for focus.  This is not the right exclusion criterion for testing.  It is improper to not test the UI so that you can test the backing database.  Instead, the criteria by which tests should be excluded are more complex.  There is no single criterion or set of criteria that works for every project.  Here are some to consider which have wide applicability:</p> <ul> <li>Breadth of coverage – Often it is best to try everything a little rather than some things very deep and others not at all.  Don't get caught up testing just one part.</li> <li>Scenario coverage – Look for test cases which will intersect the primary use patterns of the users.  If no one is likely to try to put a square inside a circle inside a square, finding a bug in it is not highly valuable.</li> <li>Risk analysis – What areas of the product would be most problematic if they went wrong?  Losing user data is almost always really bad.  Drawing a line one pixel off often is not.  If you have to choose, prefer focusing more on the data than the drawing.  Another important area for many projects is legal and regulatory requirements.  If you have these, make sure to test for them.  It doesn't matter how well your product works if the customer is not allowed to buy it.</li> <li>Cost of servicing – If forced to choose, spend more time on the portions that will be more difficult or costly to service if a bug shows up in the field.  
For instance, in a client-server architecture, it is usually easier to service the server because it is in one spot, under your control, rather than trying to go to hundreds or thousands of computers to update the client software.</li> <li>Testing cost – While not a good criterion to use by itself, if a test is too expensive to carry out or to automate, perhaps it should be skipped in favor of writing many more tests that are much cheaper.  </li> <li>Incremental gains – How much does this test case add to existing coverage?  It is better to try something wholly new than another slight variation on an existing case.  Thinking in terms of code coverage may help here.  It is usually better to write a case which tests 10 new blocks than one which tests 15 already covered blocks and 2 new ones.  It is very possible that two test cases are both great, but the combination is not.  Choose one.</li> </ul> <p>There are many more criteria that could be used.  The important point is to have criteria and to <a href="http://blogs.msdn.com/b/steverowe/archive/2009/07/01/be-intentional.aspx">make intentional decisions</a>.  A test planning approach that merely says, "What are the ways we can test this product?" is insufficient.  It will generate too many test cases, some of which will never be carried out due to time or cost.  It is important to prune the decision tree up front so that the most important cases are done and the least important ones are left behind.  Do this up front, in the test spec, not on the fly as resources dwindle.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com3tag:blogger.com,1999:blog-736810000699453506.post-67572441420329233892011-09-27T01:54:00.000-07:002019-01-11T10:55:21.013-08:00Pruning the Decision Tree<p>A great <a href="http://marcrandolph.com/2011/09/26/did-netflix-screw-up-i-dont-think-so/">post</a> by Marc Randolph got me thinking.  He tackles the question of why Netflix made the moves they made recently.  
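The "Incremental gains" criterion from the test-selection list above can even be mechanized. This sketch (the test names and coverage sets are invented for illustration) greedily picks whichever remaining case covers the most not-yet-covered blocks, stopping when nothing new is added:

```python
def prioritize(tests: dict[str, set[int]]) -> list[str]:
    """Order test cases by how many new code blocks each adds.

    `tests` maps a (hypothetical) test name to the set of code
    blocks it covers.
    """
    covered: set[int] = set()
    order: list[str] = []
    remaining = dict(tests)
    while remaining:
        # Pick the case contributing the most blocks not yet covered.
        name = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[name] - covered:
            break  # the rest add no new coverage; candidates to cut
        covered |= remaining.pop(name)
        order.append(name)
    return order

# Invented data: "overlap" covers 17 blocks, but 15 of them are
# also covered by "variation", so it is worth only 2 new blocks.
suite = {
    "new_area":  set(range(20, 30)),            # 10 wholly new blocks
    "variation": set(range(1, 18)),             # 17 blocks
    "overlap":   set(range(1, 16)) | {30, 31},  # 15 repeats + 2 new
}
print(prioritize(suite))
```

Here "overlap" sorts last: by the time it is considered, only two of its 17 blocks are new, the exact 15-covered-plus-2 situation the bullet above describes.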
Specifically, why did they spin off their DVD-by-mail business as a company called Qwikster?  The answer: focus.</p> <p>What separates successful ventures from failures is choosing to do the right things.  What separates great successes from merely good ones is choosing not to do the wrong things.</p> <p>When faced with a question about whether to add a feature, the decision often revolves around whether customers will like it.  This is a good first question, but stopping there leads to sub-optimal results.  It is important to ask the next question:  What can we not do by doing this?  Resources are limited.  For each feature that goes into a product, something else (quality, time, other features) comes out.  If that is forgotten, the product will end up with too many features that are good but not great, and the product as a whole will feel unfocused.</p> <p>Shipping a great product, then, is about the decisions about what features not to implement and what bugs not to fix.  Netflix demonstrates what it looks like to take this very seriously.  They are jettisoning what is a popular part of their company so they can focus on what they think is the future.  Only time will tell if they chose the right strategy, but one has to commend them for making a clear choice.   </p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-42772905463791844242011-09-15T02:52:00.000-07:002019-01-11T10:55:01.611-08:00Follow my adventures at //build/<p>This week I'm attending the <a href="http://www.buildwindows.com///build/">//build/</a> conference where Microsoft is revealing many of the details about Windows 8. 
If you want to see what is going on, follow my Twitter feed at <a href="https://twitter.com/steverowe">https://twitter.com/steverowe</a>. I'm posting some pictures, linking to summaries, etc.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com1tag:blogger.com,1999:blog-736810000699453506.post-39301949557370759652011-09-11T18:00:00.000-07:002019-01-11T10:55:01.997-08:00Listening to the team<p>There is an old saying in software that goes something like this, "<a href="http://37signals.com/svn/archives2/2005/04/getting_real_pi.php">Scope, Timeframe, and Budget: Pick two.</a>" Being a tester, I would rephrase this a little as, "Features, Timeframe, Budget, and Quality: Pick three". It's usually possible to hit all three of the original levers as long as you are willing to sacrifice quality. We've all seen products that do this. They have such potential, but just don't work. Over the past year, there were a lot of planning efforts going on around me and this was a great chance to observe different behaviors. One pattern I have seen a few times is worth discussing. That is managers trying to dictate all 4 aspects of a project. There are two ways this tends to happen: the dictatorial manager and the persuasive manager.</p><br/><p>Before we jump in, it is important to define the four levers. Features represent the complexity of the product—how many things it can accomplish. Timeframe is how soon the product needs to ship. Budget is how much can be spent on the project. For software projects, this corresponds largely with the number and skill of people that can be put on the task. No discussion of budget can be complete without at least mentioning <a href="http://en.wikipedia.org/wiki/Brooks%27_law">Brooks' law</a>. <a href="http://www.amazon.com/Zen-Art-Motorcycle-Maintenance-Inquiry/dp/0553277472">Quality</a> is the ineffable attribute representing beauty, greatness, or merely suitability for a given task. 
In practical terms, it means how many bugs the product exposes to its users. Increasing the expectations for any of these increases the pressure on the project. Within reason, each can be relaxed to achieve the desired levels of the other three. It is important that managers understand that when they ask to increase features, they are trading off time, and that when they ask to reduce time, they are trading off features.</p><br/><p>The dictatorial manager is the more obvious of the two. This is the person who just says, “You will have the following specs done by this date.” Implied is that this will be done with a fixed number of people (never enough) and at high quality. This is the pointy-haired boss from Dilbert. When the team raises objections, this manager is unfazed. They stick to their position. If the team can’t get it done in the demanded timeframe, the manager will find someone else who will. It never occurs to the manager that perhaps what they are asking for is impossible. The only good news is that this type of manager is often self-correcting. They are not enjoyable to work for. The best members of the team have mobility and will leave. This guts the team of its most capable people, and the project slowly fails. At a quality company, this manager will be seen for what they are and removed from such responsibility.</p><br/><p>The persuasive manager is less obvious but nearly as bad. This is the manager who convinces their team that it can do the impossible. In this case they don’t dictate an impossible situation; they merely get the team to agree to it. This manager is usually a really nice person who just can’t say no. Instead, they nicely ask people to sign up for everything marketing and upper management ask for. Their team likes working for them; it just can’t figure out why it is so overworked. The quality of the project suffers as the team goes into an overworked haze. This sort of manager is not as quickly self-correcting. 
The team does not leave except through burnout. They don’t blame the manager because signing up for the work was their idea. Neither the dictatorial manager nor the persuasive manager will succeed. Both will usually ship something close to on time because they cannot admit failure. The quality, however, will be low. People stop caring once they are made to attempt the impossible. The team will be much weaker the next go-round because all of the good members leave through disgust or attrition.</p><br/><p>A better model is to listen to the team. Most times a manager does not have control of all of the levers. Typically, timeframe and budget come down from on high. A manager cannot magically hire more people. Except through brinksmanship, they cannot cause the project to ship later. They do, however, have control over the expected level of quality and the number of features. A team will usually give signals if they think they are being asked to do too much. The wise manager listens to these and adjusts plans accordingly. This is the genius of systems like Scrum and Agile development. A tenet of these systems is listening to the team. Using a system of cost estimates and burndown tracking brings an objective view to the picture. Likewise, if the team hesitates or balks at signing up for work, a good manager will re-evaluate. This doesn’t mean automatically reducing the ask, but it does mean making sure they are convinced it can be done. </p><br/><p>If a manager senses that the workload is too much, it is their responsibility to reduce it. As much of a manager’s job should be spent deciding what not to do as deciding what to do. This may generate heat from upper management, which wants everything in a limited time, with a fixed budget, and at the highest quality. It is the manager’s responsibility to help upper management understand the situation, and to absorb the pain if that understanding never comes. 
If failure is inevitable, it is better to fail in a place of your choosing (the area with the least impact and lowest priority) rather than risk failure on a broader scale at a later date.</p>Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com2tag:blogger.com,1999:blog-736810000699453506.post-4839509101336246682010-05-11T13:55:00.000-07:002019-01-11T10:55:02.437-08:00The Sidebar is Back!I'm apparently in the minority, but I really liked the sidebar in Windows Vista. On a widescreen monitor, there is horizontal space to spare, and it was really convenient to have all of my gadgets showing over on the right-hand side. In Windows 7, the sidebar was removed. Gadgets are still there, but to me they became less useful because they were either hidden below all my windows or obscuring them. There was no way to keep them always visible with windows flowing around them. I've just discovered a new gadget called 7 Sidebar that acts just like the Vista sidebar, but for Windows 7. You can find it <a href="http://nes.bplaced.net/sidebar7.html">here</a> if you too liked the look of the sidebar.Steve Rowehttp://www.blogger.com/profile/17905356014908630180noreply@blogger.com2