How to Test Software, Part 3: Measurement and Metrics

This post is the conclusion of my series of posts about testing software.  In the first post of the series, I established scientific methods as the foundation for how we build our testing methodology.  In the second post, I discussed methods for writing quality tests, and hinted at how to measure their effectiveness.  In this post, I will discuss the issues surrounding accurately measuring quality and cover some of the important measurements and methods that should be employed.

Metrics: Good and Bad

I have often compared metrics to prescription pain-killers: they can be useful tools for assessing quality; however, they are also highly prone to misuse and abuse, and can cause significant harm when abused.  As with many other things relating to testing and quality, this is a problem that science deals with on a continual basis.  One of the primary tasks of a scientific model is to be able to make predictions based on measurements.  Therefore, it is key that we be able to make good measurements and avoid the common pitfalls that occur when designing metrics.

Common pitfalls include the following (we’ll assume that we are attempting to measure quality with these metrics):

  1. Assuming correlation implies causation (ex: “ice cream sales are correlated to crime rates, therefore ice cream causes criminal activity”)
  2. Reversing the direction of causation (ex: “wet streets cause rainfall”)
  3. Metrics with large systematic errors (inaccuracy)
  4. Metrics with large random errors (imprecision)
  5. Metrics that don’t indicate anything at all about quality (irrelevant metrics)
  6. Metrics that don’t necessarily increase as quality increases (inconsistent metrics)
  7. Metrics that may increase even when quality falls (unsound metrics)
  8. Metrics that increase at a different rate than quality after a point (diminishing returns)
  9. Using metrics in conditions that differ dramatically from the assumptions under which they were developed (violating boundary conditions)
  10. Directly comparing metrics that measure different things
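
Pitfalls 3 and 4 are easy to conflate, so it can help to see them computed separately.  The following is a minimal sketch, assuming we have repeated readings of a quantity whose true value is known.  The function names are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Mean of n readings (n must be > 0). */
double mean_of(const double *xs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / (double)n;
}

/* Systematic error (inaccuracy): how far the average reading
 * sits from the known true value. */
double systematic_error(const double *xs, size_t n, double true_value) {
    return mean_of(xs, n) - true_value;
}

/* Random error (imprecision), reported here as the sample variance
 * of the readings around their own mean (n must be > 1). */
double sample_variance(const double *xs, size_t n) {
    double m = mean_of(xs, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        ss += (xs[i] - m) * (xs[i] - m);
    }
    return ss / (double)(n - 1);
}
```

A metric can score well on one and badly on the other: readings that all land on the same wrong value have zero variance but a large systematic error, while readings scattered evenly around the true value have no systematic error but a large variance.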

Inconsistency and unsoundness are extremely common flaws in older software quality (and productivity) metrics.  For example, “lines of code” was a common metric for productivity in the 80s and early 90s in software development (some very misguided firms still use it today).  This metric is flawed because it doesn’t actually correlate to real productivity at all for numerous reasons (chief among them being that low-quality code is often much longer than a well-designed and engineered solution).  Likewise, “number of bugs for a given developer” has been employed by several firms as a quality metric, and consistently has the ultimate result of dramatically reducing quality.

There are many more examples of the dangers of bad metrics, and of relying solely on metrics.  Because of the dangers associated with their use, I recommend the following points when evaluating and using metrics:

  • Consult someone with scientific training on the design and use of all metrics
  • Be watchful for warning signs that a metric is not working as intended
  • Understand the conditions under which a given metric applies, and when those conditions don’t hold
  • Understand the principle of diminishing returns and apply it to the use of metrics
  • Understand that a metric only measures a portion of the world, and watch for phenomena for which it fails to account

Examples of Measurements

The following are examples of various measurements of quality, and the factors governing their effective use.

Quality of Test Design: Case Coverage

The previous post covered various testing strategies in considerable detail, and discussed their relative levels of quality.  This discussion covered various issues affecting test quality; however, the key benefit provided by the more advanced testing methods was better case coverage.  Case coverage is an abstract metric: the percentage of the cases in which a given component or system can operate that are covered by the tests.  In the case of simple, finite (and stateless) components, case coverage can be directly measured.  However, in most cases, it is notoriously difficult to analyze, as the case spaces for most components and systems are infinite.
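
For the simple, finite case, the measurement really is just a ratio.  Here is a minimal sketch, assuming a hypothetical stateless component with three boolean inputs (so eight cases in total).  The names case_index and case_coverage_percent are my own, for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CASES 8 /* three boolean inputs give 2^3 cases */

/* Map one combination of the three inputs to an index in [0, NUM_CASES). */
size_t case_index(bool a, bool b, bool c) {
    return ((size_t)a << 2) | ((size_t)b << 1) | (size_t)c;
}

/* Percentage of the case space hit by at least one test.
 * tested[] holds the case index exercised by each test; duplicates
 * are counted once and out-of-range indices are ignored. */
double case_coverage_percent(const size_t *tested, size_t num_tests) {
    bool seen[NUM_CASES] = { false };
    size_t covered = 0;
    for (size_t i = 0; i < num_tests; i++) {
        if (tested[i] < NUM_CASES && !seen[tested[i]]) {
            seen[tested[i]] = true;
            covered++;
        }
    }
    return 100.0 * (double)covered / (double)NUM_CASES;
}
```

Note that three tests which only hit two distinct cases still yield 25% here: the metric counts cases, not tests.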

With very large or infinite case spaces we need to devote careful thought to what portion of the case space is covered by the test suites.  In infinite spaces, we have some kind of equivalence structure.  We can define a notion of “depth” where equivalent problem instances all lie on a particular trajectory, and “deeper” problems grow more complex.  We would like to build test suites that cover the entire surface, and go down to a uniform depth.  Methods like combinatorial testing are quite powerful in this regard and can achieve this result for many testing problems; however, they are not infallible.  Testing problems with very complex case spaces can require a prohibitively large combinatorial test in order to avoid missing certain parts of the surface.
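
To make the size problem concrete, here is a minimal sketch of exhaustive combinatorial enumeration, assuming each test parameter has a small finite set of values identified by index.  The function names and the odometer-style iteration are my own choices for illustration, not a description of any particular testing framework.

```c
#include <stdbool.h>
#include <stddef.h>

/* Advance idx to the next combination, odometer-style.
 * sizes[i] is the number of values for parameter i (each must be > 0).
 * Returns false once every combination has been visited, leaving
 * idx reset to all zeros. */
bool next_combination(size_t *idx, const size_t *sizes, size_t num_params) {
    for (size_t i = num_params; i-- > 0; ) {
        if (++idx[i] < sizes[i]) {
            return true;
        }
        idx[i] = 0; /* roll over and carry into the next position */
    }
    return false;
}

/* Visit every combination exactly once and return how many there were.
 * idx is caller-provided scratch space of num_params entries.
 * visit may be NULL when only the count is needed. */
size_t for_each_combination(const size_t *sizes, size_t num_params, size_t *idx,
                            void (*visit)(const size_t *idx, size_t num_params)) {
    size_t count = 0;
    for (size_t i = 0; i < num_params; i++) {
        idx[i] = 0;
    }
    do {
        if (visit != NULL) {
            visit(idx, num_params);
        }
        count++;
    } while (next_combination(idx, sizes, num_params));
    return count;
}
```

The count is the product of the per-parameter sizes, so each added parameter multiplies the size of the test.  That multiplication is what makes the test prohibitively large for complex case spaces.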

In the most complex cases, the case space has a recursive structure, a highly complex equivalence class structure, or both.  Examples of this often arise in the context of compilers, programming languages, interpreters, and database systems.  The best example of this kind of case from my own work would be the expanded selection/resolution logic in the VM spec in JDK8.  In this case, exercising every spec rule through combinatorial testing produced a prohibitively large space.  Thus, we had to employ enumeration-based methods to explore all of the many possible branch-points in the case space and avoid generating redundant instances.

The takeaway is that it is critical to consider the nature of the case space.  If we were to visualize the case space as a kind of surface, then problems that can be described (and tested) via combinatorial methods would look like a relatively simple geometric object, and a combinatorial test would look like a cubical volume.  Thus, it is relatively straightforward to capture large portions of the case space.  A problem like selection/resolution would look more like a highly complex fractal-like structure.  Problems such as these require different methods to achieve reasonable case coverage.

Effectiveness of a Test Suite on an Implementation: Code Coverage

Case coverage is a measure of the quality of a test suite’s design, and is derived from the specification of the thing being tested.  Code coverage addresses a different problem: the effectiveness of a test suite on a particular implementation.  An important point is that I do not consider these two metrics to be alternatives to one another.  They measure completely different things, and thus they both must be employed to give a broader view of the test quality picture.

Code coverage is essential because the implementation will likely change more readily than the behavior specification.  Serious gaps in code coverage indicate a problem: either something is wrong with the implementation, or the test suite is missing some portion of the case space.  Coverage gaps can emerge when neither of these is the case, but when that happens, the reason for the gap should be understood.

Moreover, gaps in code coverage cast doubt on the viability of the code.  The worst example of this comes from my first job, where I once found an entire file full of callbacks that looked like this:

getFoo(Thing* thing) {
  if (thing == NULL) {
    return thing->foo;
  } else {
    return NULL;
  }
}

Note that the null-check is inverted.  Clearly this code had never been run, because there is no way that it could possibly work.  Gaps in code coverage allow cases like this to slip through undetected.
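
For comparison, here is a minimal sketch of what a corrected callback and a two-case test might look like.  The Thing struct, the type of its foo field, and the test_getFoo function are assumptions made for illustration, since the original code's types are not shown above.

```c
#include <stddef.h>

/* Hypothetical stand-in: the real Thing type is not shown in the post. */
typedef struct Thing {
    const char *foo;
} Thing;

/* Corrected null check: only dereference thing when it is non-NULL. */
const char *getFoo(const Thing *thing) {
    if (thing != NULL) {
        return thing->foo;
    } else {
        return NULL;
    }
}

/* Exercises both branches; returns the number of failed checks. */
int test_getFoo(void) {
    int failures = 0;
    Thing t = { "bar" };
    if (getFoo(&t) != t.foo) {
        failures++; /* non-NULL branch */
    }
    if (getFoo(NULL) != NULL) {
        failures++; /* NULL branch */
    }
    return failures;
}
```

Either of these two checks, run against the inverted version, would have exposed the problem: the non-NULL case returns NULL instead of foo, and the NULL case dereferences a null pointer.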

Stability Over Time

As previously discussed, stress testing seeks to test guarantees about the stability of the product.  The most important point about stress-testing is that the properties it tests are not discrete properties: they cannot be stated in terms of a single point in time.  Rather, they are continuous: they are expressed in terms of a continuous interval of time.  This is a key point, and is the reason that stress-testing is essential.  Unit and system testing can only establish discrete properties.  In order to get a sense of things like reliability and performance which are inherently continuous properties, it is necessary to do stress-testing.

A very important point is that this notion also applies to incoming bug reports.  In the OpenJDK project, we generally did not write internal stress-testing suites of the kind I advocate here.  We did, however, have a community of early adopters trying out the bleeding edge repos constantly throughout the development cycle, which had the effect of stressing the codebase continually.  Whether one considers failures generated by an automated stress-test or bugs filed by early adopters, there comes a point in the release cycle where the number of outstanding bugs hits zero (this is sometimes known as the zero-bug build or point).  However, this is not an indicator of readiness to release, because it is only a discrete point in time.  The typical pattern one sees is that the number of bugs hits zero, and then immediately goes back up.  The zero-bug point is an indicator that the backlog is cleared out, but not that the product is ready for release.  This is because the zero-bug point is a discrete property.  The property we want for a release is a continuous one: namely that in some interval of time, there were no bugs reported or existing.
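
The difference between the two properties can be sketched in a few lines.  This is a minimal illustration, assuming we have one open-bug count per day (itself a discretization of the interval) and a hypothetical required window length.  Neither the function names nor the daily sampling come from any real release process.

```c
#include <stdbool.h>
#include <stddef.h>

/* Discrete property: did the open-bug count hit zero on at least one day? */
bool reached_zero_bug_point(const int *open_bugs, size_t num_days) {
    for (size_t i = 0; i < num_days; i++) {
        if (open_bugs[i] == 0) {
            return true;
        }
    }
    return false;
}

/* Continuous property (approximated by daily samples): was the count
 * zero on every one of the most recent `window` days? */
bool zero_bugs_over_window(const int *open_bugs, size_t num_days, size_t window) {
    if (window == 0 || window > num_days) {
        return false; /* not enough history to make the claim */
    }
    for (size_t i = num_days - window; i < num_days; i++) {
        if (open_bugs[i] != 0) {
            return false;
        }
    }
    return true;
}
```

A history such as 5, 2, 0, 3 satisfies the first check but fails the second for any window, which is exactly the "hits zero, then goes back up" pattern described above.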


Performance

The issues associated with performance measurement are worthy of a Ph.D. thesis (or five), and thus are well outside the scope of this post.  This section is written more to draw attention to them, and to point out a few of the many ways that performance testing can produce bad results.

Effective performance testing is HARD.  Modern computing hardware is extremely complex, with highly discontinuous, nonlinear performance functions, chaotic behavior, and many unknowns.  The degree to which this can affect performance testing is just starting to come to light, and it has cast doubt on a large number of published results.  For example, it has been shown that altering the link order of a program can affect performance by up to 5%, which is about the size of the performance gain that typically suffices to secure publication in top-level computer architecture conferences.

The following are common problems that affect performance testing:

  • Assuming compositionality: the idea that good performance for isolated components of a system implies that the combined system will perform well.
  • Contrived microbenchmarks (small contrived cases that perform well).  This is a dual of the previous problem, as performing well on isolated parts of a problem instance doesn’t imply you’ll perform well on the combined problem.
  • Cherry-picking
  • Sample sizes that are too small, not enough randomness in selections, bad or predictable random generators
  • Failing to account for the impact of system and environmental factors (environment variables, link order, caches, etc.)
  • Non-uniform preconditions for tests (failing to clear out caches, etc.)
  • Lack of repeatability

The takeaway from this is that performance testing needs to be treated as a scientific activity, and approached with the same level of discipline that one would apply in a lab setting.  Its results need to be viewed with skepticism until they can be reliably repeated many times, in many different environments.  Failure to do this casts serious doubt on any result the tests produce.
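
As one small piece of that discipline, a result can be refused until repeated runs agree with one another.  The following is a crude sketch, assuming run times have already been collected in seconds.  The relative-spread check and its tolerance parameter are my own simplification, not a substitute for proper statistical analysis, and it says nothing about repeating the runs in different environments.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct RunSummary {
    double min;
    double max;
    double mean;
} RunSummary;

/* Summarize n timing samples (n must be > 0). */
RunSummary summarize_runs(const double *seconds, size_t n) {
    RunSummary s = { seconds[0], seconds[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (seconds[i] < s.min) {
            s.min = seconds[i];
        }
        if (seconds[i] > s.max) {
            s.max = seconds[i];
        }
        sum += seconds[i];
    }
    s.mean = sum / (double)n;
    return s;
}

/* Heuristic: treat the runs as repeatable only if the gap between the
 * fastest and slowest run is at most `tolerance` times the mean. */
bool looks_repeatable(const double *seconds, size_t n, double tolerance) {
    if (n < 2) {
        return false; /* a single run says nothing about repeatability */
    }
    RunSummary s = summarize_runs(seconds, n);
    return s.mean > 0.0 && (s.max - s.min) / s.mean <= tolerance;
}
```

With a tolerance of 0.05, for instance, any claimed improvement smaller than 5% would be indistinguishable from the run-to-run spread that this check tolerates.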

Sadly, this fact is often neglected, even in top-level conferences; however, this is not an excuse to continue to neglect it.


Conclusion

In this series, I have described an approach to testing that has its foundations in the scientific method.  I have discussed different views from which tests must be written.  I have described advanced methods for building tests that achieve very high case coverage.  Finally, I have described the principles of how to effectively measure quality, and the many pitfalls that must be avoided.

The single most important takeaway from this series is this:

Effective testing is a difficult, multifaceted problem, deserving of serious intellectual effort by dedicated, high-level professionals.

Testing should not consist of mindlessly grinding out single-case tests.  It should employ sophisticated analysis and implementation methods to examine the case space and explore it to a satisfactory degree, to generate effective workloads for stress testing, and to analyze the performance of programs.  These are very difficult tasks that require the attention of people with advanced skills, and they should be viewed with the respect that solving problems of this difficulty deserves.

Moreover, within each organization, testing and quality should be seen as an essential part of the development process, and something requiring serious attention and effort.  Time and resources must be budgeted, and large undertakings for the purpose of building testing infrastructure, improving existing tests, and building new tests should be encouraged and rewarded.

Lastly, a culture similar to what we had in the langtools team, where we were constantly looking for ways to improve our testing and quality practices, pays off in a big way.  Effort put into developing high-quality tests, testing frameworks, and testing methods saves tremendous amounts of time and effort in the form of detecting and avoiding bugs, preventing regressions, and making refactoring a much easier process.  We should therefore seek to cultivate this kind of attitude in our own organizations.


Author: Eric McCorkle

Eric McCorkle is a computer scientist with a background in programming languages, concurrency, and systems.
