How to Test Software, Part 2: Quality of Tests

In the first post in this series, I discussed an overall approach to testing based on the scientific method.  I also discussed the need for multiple views in our testing methodology as well as three important views that our testing regimen should incorporate.  Unit testing is important, as it tests the kind of guarantees that developers rely upon when using components.  System testing is important because it tests software from the same view as the end users.  Finally, stress and performance testing are important as they answer questions about the continuous operation of the system.

However, I only talked about the general approach to writing tests, and the views from which we write them.  I said nothing about the actual quality of the tests, but rather deferred the topic to a later post: namely this one.

Test Quality: Basics

In scientific investigations, experimental quality is of paramount importance.  Bad experiments lead to bad conclusions; thus it is important to design experiments that are sound, repeatable, and that convey enough information to establish the conclusions we draw.  Similar criteria govern how we should write our tests.  Specifically, tests should establish some set of guarantees to a high degree of certainty.  Thus, we must design our tests as we would an experiment: with considerable thought to how credibly the test establishes those guarantees.

Many test suites fail miserably when it comes to their experimental methodology.  Among the most common reasons are the following:

  • Low coverage
  • Random or unreliable generation of cases
  • Lack of repeatability/intermittent failures

We want to establish a rigorous testing methodology that consistently produces high-quality, credible tests that check their hypotheses to a high degree of certainty.  We can derive the general guidelines for any test suite from the principles of sound experiments.  The following is a list of these principles:

  • Tests should be consistently repeatable, and should not have intermittent failures
  • Tests should give informative and actionable results when they fail
  • Tests should achieve high coverage, and should automatically generate a large set of cases wherever possible
  • Correctness tests should never have any kind of randomness in their operation
  • Stress and performance tests should minimize entropy from out-of-scope factors
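
As a rough illustration of the first four principles, here is a minimal Python sketch.  The clamp function is a hypothetical stand-in for a component under test, and the small range of integers is an arbitrary choice; neither comes from any real test suite.  Cases are generated by a fixed enumeration rather than by a random source, so every run checks exactly the same cases, and each failure message names the inputs that produced it.

```python
def clamp(value, low, high):
    """Hypothetical component under test: restrict value to [low, high]."""
    return max(low, min(value, high))


def generate_cases():
    """Deterministically enumerate (value, low, high) triples with low <= high."""
    points = range(-2, 3)
    for low in points:
        for high in points:
            if low <= high:
                for value in points:
                    yield value, low, high


def check_clamp():
    """Return a list of failure messages; an empty list means every case passed."""
    failures = []
    for value, low, high in generate_cases():
        result = clamp(value, low, high)
        if low <= value <= high:
            expected = value
        elif value < low:
            expected = low
        else:
            expected = high
        if result != expected:
            # Name the exact inputs so the failure can be reproduced directly.
            failures.append(
                f"clamp({value}, {low}, {high}) returned {result}, "
                f"expected {expected}"
            )
    return failures
```

Because nothing here depends on a clock, a random seed, or external state, a failure seen once can be reproduced on every later run.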

With these general principles in mind, we can look at what specifically makes for quality tests in each of the views we discussed in the previous post in this series.

Unit Test Quality

Unit tests examine guarantees about individual components.  One of their chief advantages over other views is the ability to directly exercise cases and codepaths that may be difficult to trigger in whole-system tests.  As such, case coverage is of paramount importance for writing good unit tests.

A less obvious factor in the quality of a unit test is the set of preconditions under which the tests run.  Very few components have a purely-functional specification; most interact with parts of the system in a stateful fashion.  There is often a temptation to write synthetic harnesses which simulate the behavior of the system in a very small number of cases; however, this leads to low-quality tests.  High-quality tests will explore the behavior of the system with a wide variety of preconditions.

In summary, the additional criteria for unit tests are as follows:

  • Explore the case space for components completely
  • Simulate wide varieties of preconditions that affect the behavior of the components
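
To make the second point concrete, here is a minimal Python sketch under some assumptions of my own: BoundedStack is a hypothetical stateful component, and make_stack is a hypothetical helper that puts it into a chosen starting state.  Rather than testing push() only on an empty stack, the test runs it from every fill level of every small capacity.

```python
class BoundedStack:
    """Hypothetical stateful component: holds at most `capacity` items."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, item):
        """Push an item; return False (and change nothing) if the stack is full."""
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True


def make_stack(capacity, prefill):
    """Build a stack in a chosen starting state (the precondition)."""
    stack = BoundedStack(capacity)
    stack.items = list(range(prefill))
    return stack


def run_push_tests(max_capacity=4):
    """Check push() from every fill level of every capacity up to the limit."""
    failures = []
    for capacity in range(max_capacity + 1):
        for prefill in range(capacity + 1):
            stack = make_stack(capacity, prefill)
            accepted = stack.push("x")
            expected_accept = prefill < capacity
            expected_size = prefill + 1 if expected_accept else prefill
            if accepted != expected_accept or len(stack.items) != expected_size:
                failures.append(
                    f"capacity={capacity}, prefill={prefill}: "
                    f"accepted={accepted}, size={len(stack.items)}"
                )
    return failures
```

The zero-capacity and already-full states are exactly the preconditions that a harness built around one convenient starting state tends to leave out.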

System Test Quality

System tests examine guarantees about the system as a whole.  The purpose of system tests is to test the software from the same point of view as the users who will eventually use it.  The key difficulty with this view is repeatability, particularly for complex systems that interact with things like databases or the network.  For the most complex systems, considerable care must be taken in order to engineer repeatable tests.

Additionally, it is necessary to consider system-specific behaviors such as character sets, filesystem conventions, and date and time handling.

The following are common problems that need to be considered in writing system tests:

  • OS-specific factors (encoding, filesystem behaviors, etc.)
  • OS-level preconditions (existing files, environment variables, etc.)
  • Interactions with other services (databases, authentication servers, etc.)
  • Network issues ((in)accessibility, configurations, changing IPs, etc.)
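
One way to control OS-level preconditions is to have the test construct them itself.  The following Python sketch assumes a hypothetical function, resolve_config_path, that looks for a file named app.conf under a directory given by an APP_HOME environment variable; both names are invented for illustration.  Each case runs in a fresh temporary directory with an environment that contains only what the case specifies.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def resolve_config_path():
    """Hypothetical code under test: locate a config file via APP_HOME."""
    home = os.environ.get("APP_HOME")
    if home is None:
        return None
    candidate = Path(home) / "app.conf"
    return candidate if candidate.is_file() else None


def run_isolated(env, files):
    """Run resolve_config_path() with exactly the given environment and files.

    `env` maps variable names to paths relative to a fresh temporary
    directory; `files` lists relative paths to create inside it.
    """
    with tempfile.TemporaryDirectory() as root:
        for relative in files:
            path = Path(root) / relative
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text("key=value\n", encoding="utf-8")
        full_env = {name: str(Path(root) / value) for name, value in env.items()}
        # clear=True hides whatever the developer's shell happens to define.
        with mock.patch.dict(os.environ, full_env, clear=True):
            result = resolve_config_path()
        return None if result is None else result.relative_to(root).as_posix()
```

For example, run_isolated({"APP_HOME": "home"}, ["home/app.conf"]) exercises the case where the file exists, while passing an empty file list or an empty environment exercises the two ways it can be missing.  Services and network conditions need heavier machinery than this, but the idea of building the precondition inside the test is the same.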

Stress/Performance Test Quality

Stress tests examine guarantees about stability under certain kinds of load.  Performance tests likewise examine performance under certain kinds of load.  Both of these differ from other kinds of testing in that the properties they examine are about continuous intervals of time as opposed to discrete points.

Both stress and performance tests tend to involve some degree of entropy (stress tests do so deliberately; performance tests do so out of a need to measure real performance).  This is a key difference from correctness-oriented tests, which should avoid entropy at all costs.  When entropy is unavoidable, the key to quality testing is to limit it to relevant entropy and isolate the test from irrelevant entropy; that is, maximize the signal and minimize the noise.  In stress testing, we want to measure stability under “characteristic” workloads; thus, it is critical that we generate loads that are statistically similar to a characteristic load, or at the very minimum have statistical properties that we understand.  Additionally, it is important that we don’t accidentally neglect certain aspects of the desired workload.

In performance testing, we must also avoid accidental biases in our tests arising from factors like caching.  This may seem simple, but in fact it is much more difficult than a first glance would suggest.  For example, the content of environment variables can significantly affect the cache behavior, as can the link order of the application.  The contents of caches (CPU caches as well as filesystem and page caches) can likewise have a significant effect on performance, and can accidentally bias the tests.  It is important to think carefully about performance tests and all the factors that affect performance in order to avoid these kinds of bias.
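
As a small illustration of one such mitigation, the Python sketch below discards a few warm-up runs before measuring, so that every measured run starts from a similar cache state, and reports the spread alongside the median.  The function names and the default counts are assumptions of mine, and this addresses only cache warm-up; it does nothing about biases such as environment size or link order.

```python
import statistics
import time


def measure(operation, warmup=3, repeats=10):
    """Time `operation` after discarding warm-up runs; return per-run seconds."""
    for _ in range(warmup):
        operation()  # result discarded: brings caches toward a steady state
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report median and spread so that noisy runs are visible, not hidden."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }
```

A wide gap between the minimum and maximum is itself a result: it suggests that some out-of-scope factor is still leaking into the measurement.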

The following are important factors for writing stress and performance tests:

  • Ensure that the statistical properties of the synthetic workload accurately reproduce the desired properties
  • Ensure that the space of generated cases does not exclude any cases that we desire to include
  • Ensure that non-obvious biases are eliminated or minimized in performance tests
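
The first two factors can be checked mechanically.  The following Python sketch assumes a workload made of named operations drawn according to a weighted mix; the operation names, the seed, and the tolerance are all hypothetical.  Using a private, seeded generator is my own choice here: it keeps the entropy in the workload while making the workload itself reproducible.

```python
import random
from collections import Counter


def make_workload(seed, size, mix):
    """Generate `size` operations drawn from `mix`, a dict of operation -> weight.

    A private, seeded generator keeps the workload reproducible and
    independent of any other code that uses the global random state.
    """
    rng = random.Random(seed)
    operations = sorted(mix)  # fixed order, independent of dict ordering
    weights = [mix[op] for op in operations]
    return rng.choices(operations, weights=weights, k=size)


def check_workload(workload, mix, tolerance=0.02):
    """Return a list of problems: missing operations or frequencies off target."""
    counts = Counter(workload)
    total_weight = sum(mix.values())
    problems = []
    for op, weight in mix.items():
        target = weight / total_weight
        actual = counts[op] / len(workload)
        if counts[op] == 0:
            problems.append(f"{op} never generated")
        elif abs(actual - target) > tolerance:
            problems.append(f"{op}: frequency {actual:.3f}, target {target:.3f}")
    return problems
```

Running check_workload before the stress test starts catches a generator that silently drops a rare operation, which is the kind of accidental omission described above.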

Coverage

The problem of coverage is central to the problem of correctness testing.  Older literature on testing describes two dual methodologies: blackbox and whitebox (sometimes called glassbox).  The difference between the two can be stated in terms of case coverage and code coverage.  I prefer not to talk about whitebox and blackbox testing, because both case and code coverage are important.  They also don’t represent concepts that can be directly compared.  Code coverage is a measurable quantity, which can and should be determined using coverage tools.  Case coverage, on the other hand, is a conceptual idea, and does not lend itself to direct measurement except in the simplest of cases.

Put another way, case coverage is useful for evaluating the quality of the design of an individual test or the quality of a given test-writing methodology.  We can clearly talk about what kinds of cases a test generates and tests, how many of them are generated, how varied or redundant they are, and we can reason to some extent about how much they approximate complete testing of the entire case space (which is often infinite).  Thus, case coverage is a measure of the quality of the design of a test.

Code coverage, on the other hand, generally cannot be directly inferred from a given test; rather, it is a measure that is obtained by running the test and collecting and analyzing profiling data after the fact.  It functions as a performance metric, and indicates the adequacy of a test.  Even a very well-designed test suite with good case coverage may leave gaps in the code coverage either because those gaps come from very obscure edge cases, or because for some reason those code paths cannot be exercised by any test case (which can indicate underlying problems in the implementation).  Thus, code coverage is a measure of the adequacy of a test suite.

The remainder of this post will focus on case coverage, and how different test design methodologies achieve different levels of coverage.  I will discuss code coverage in a future post.

Test Design Methodologies

The technical difficulty of designing high-quality tests is often underestimated.  As a consequence, many test suites contain large numbers of low-quality tests.  In one of many discussions about testing during my time working on OpenJDK, I described a system of tiers for testing, organized around the degree of case coverage each tier provides and the sort of problem spaces it is equipped to handle.  This section describes these tiers in detail.

Single-Case Tests

Single-case tests are the most common method for writing tests.  They are also the least effective method, as they achieve extremely low case coverage (and often very low code coverage).  The single-case testing methodology is bad for a number of reasons:

  • It does not scale either in terms of engineering effort or in terms of execution.  Any automated case generation method can achieve coverage levels that would require hundreds of thousands of person-hours with the single-case methodology.
  • There is an inherent bias toward writing simple cases, which tends to result in the tests missing the complex cases.
  • It tends to result in a significant amount of copy-pasting, which leads to errors in the test.
  • It results in an unmaintainable test suite, often with many duplicated tests.

For these and other reasons, single-case tests were strongly discouraged in the langtools group, and would usually fail code review without some justification.

Template Tests

Template tests are a method for quickly generating and testing large numbers of very similar test cases.  With template testing, we create a template which constructs a test case from a parameter value.  This template is then applied to a range of parameter values, generating and testing the various cases.
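
A minimal Python sketch of the idea follows.  The max_nesting function is a hypothetical component under test, and the template is deliberately trivial: from a single parameter n it builds both the input and the expected result.

```python
def max_nesting(text):
    """Hypothetical component under test: deepest level of parenthesis nesting."""
    depth = deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


def template(n):
    """Build one test case from the parameter n: (input text, expected result)."""
    return "(" * n + "x" + ")" * n, n


def run_template_test(parameter_values):
    """Apply the template to each parameter value and collect failures."""
    failures = []
    for n in parameter_values:
        text, expected = template(n)
        actual = max_nesting(text)
        if actual != expected:
            failures.append(
                f"n={n}: max_nesting({text!r}) returned {actual}, "
                f"expected {expected}"
            )
    return failures
```

Calling run_template_test(range(50)) runs fifty cases for roughly the cost of writing one, which is the whole appeal of the method.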

This method was frequently employed in the javac test suite to test relatively simple problems that we encountered.  It is more effective for problems with a relatively “flat” structure, though often combinatorial testing is required for more complex problem spaces.

A common variation on this style of testing was to create a “little language”, which describes a test case in a very concise format.  This was used to test bridge method generation in Lambda for JDK8 (this test suite is now part of the OpenJDK repository).
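
To give a flavor of what such a little language can look like, here is a Python sketch.  The notation is invented purely for this illustration and is not the one used in the OpenJDK bridge method tests: each whitespace-separated item names a class, optionally followed by a superclass and a list of methods it defines.

```python
def parse_hierarchy(spec):
    """Parse a compact description such as "A:f,g B<A:g C<B" into a dict.

    Each whitespace-separated item is NAME, optionally followed by <PARENT
    and then :METHOD,METHOD,...  The notation is invented for this sketch.
    """
    classes = {}
    for item in spec.split():
        head, _, methods = item.partition(":")
        name, _, parent = head.partition("<")
        classes[name] = {
            "parent": parent or None,
            "methods": set(methods.split(",")) if methods else set(),
        }
    return classes


def resolve(classes, name, method):
    """Toy lookup model: return the class that supplies `method` to `name`."""
    while name is not None:
        if method in classes[name]["methods"]:
            return name
        name = classes[name]["parent"]
    return None
```

With this in place, a whole test case fits on one line: the string "A:f,g B<A:g C<B" together with the expectation that looking up g from C finds it in B.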

Combinatorial Tests

Combinatorial tests, or “combotests”, were the most common methodology used by the javac team as we continued to develop our methodology.  Combotests work similarly to template tests, except that they have multiple parameters.  The test has a range of possible inputs for each parameter, and it runs the test on every possible combination of inputs.
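
The following Python sketch shows the shape of a combotest.  The overlaps function is a hypothetical component under test, and the expected result comes from a slow but obviously correct brute-force check, which is an assumption of this example rather than a requirement of the method.  All four parameters happen to share one range of values here; in general each parameter has its own.

```python
import itertools


def overlaps(a_start, a_end, b_start, b_end):
    """Hypothetical component under test.

    Do the half-open integer intervals [a_start, a_end) and
    [b_start, b_end) have at least one point in common?
    """
    return max(a_start, b_start) < min(a_end, b_end)


def brute_force_overlaps(a_start, a_end, b_start, b_end):
    """Slow but obviously correct oracle: intersect the integer points."""
    return bool(set(range(a_start, a_end)) & set(range(b_start, b_end)))


def run_combo_test(values=range(5)):
    """Run every combination of the four parameters; return (cases, failures)."""
    failures = []
    count = 0
    for case in itertools.product(values, repeat=4):
        count += 1
        expected = brute_force_overlaps(*case)
        actual = overlaps(*case)
        if actual != expected:
            failures.append(
                f"overlaps{case} returned {actual}, expected {expected}"
            )
    return count, failures
```

Five values per parameter yield 625 cases, including empty and inverted intervals that a hand-written single-case test would be unlikely to cover.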

Combinatorial tests achieve a very high level of coverage for many problems, and can generate and test tens of thousands of problem instances in an efficient manner.  This methodology is sufficient for many problems.  Only the most complex problems require the more advanced method of enumeration testing.

For nearly all testing problems, combinatorial tests represent the “sweet spot” of the diminishing returns curve.  They achieve high coverage, but are relatively easy to implement.  For this reason, combinatorial testing should be the preferred method of writing tests.

Enumeration Testing

Combinatorial testing is a powerful method, but it is predicated on the idea that a problem can be expressed in terms of a small set of independent dimensions, each combination of which is a unique problem instance and whose expected result can be easily determined.  It breaks down in the presence of certain conditions, including the following:

  • When it is difficult to determine the expected result from a problem instance without re-implementing the thing being tested
  • When the problem has a highly recursive structure to its specification
  • When there is a complex equivalence class structure among the problem instances

When these conditions are in effect, combinatorial testing fails either because it does not explore enough of the problem space, or because it must explore a prohibitively large space in order to achieve reasonable case coverage.

Examples of where these kinds of conditions manifest include type systems, symbol resolution in the presence of inherited and nested scopes, and dispatch logic in the implementation of object-oriented languages.  In all these cases, we see the features I listed above.  As I work on compilers quite a bit, I encounter these kinds of problems frequently; thus I have moved over time to using enumeration testing methods to deal with them.

Enumeration testing is based on the notion of proof trees in logic: each rule in a specification or a type system implies something about the test cases that exercise it.  For example, symbol resolution in Java has a rule which states that if a class does not define the desired symbol, then we recursively search for the symbol in its superclass.  This implies that we have (at least) two test cases for this rule: one in which a class defines a symbol, and one in which it does not.

Enumeration testing creates a builder for test cases, and a set of “templates” which potentially operate on a builder to add data to the test case.  We then use tree-enumeration to explore all possible cases out to a certain depth.  In essence, we turn the testing problem into a branching search problem.
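
The following Python sketch applies this structure to a heavily simplified version of the superclass lookup rule.  Everything in it is hypothetical: the CaseBuilder class, the two templates, and the resolve function standing in for the component under test.  Each template extends the builder by one class and, because it knows which rule it exercises, also updates the expected result.  In this toy version both templates always apply; in a real suite a template may be inapplicable to some builders.

```python
class CaseBuilder:
    """Accumulates a chain of classes plus the expected lookup result."""

    def __init__(self, classes=(), expected=None):
        self.classes = tuple(classes)  # (name, defines_symbol), root to leaf
        self.expected = expected       # class expected to supply the symbol

    def extended(self, defines_symbol):
        """Return a new builder with one more subclass added at the leaf."""
        name = f"C{len(self.classes)}"
        # The rule being exercised gives the expected result directly: a
        # defining class wins; otherwise the superclass's answer is inherited.
        expected = name if defines_symbol else self.expected
        return CaseBuilder(self.classes + ((name, defines_symbol),), expected)


TEMPLATES = [
    lambda b: b.extended(defines_symbol=True),
    lambda b: b.extended(defines_symbol=False),
]


def enumerate_cases(builder, depth):
    """Depth-first enumeration of every case reachable in at most `depth` steps."""
    if builder.classes:
        yield builder
    if depth > 0:
        for template in TEMPLATES:
            yield from enumerate_cases(template(builder), depth - 1)


def resolve(classes):
    """Hypothetical component under test: look up the symbol from the leaf class."""
    for name, defines_symbol in reversed(classes):
        if defines_symbol:
            return name
    return None


def run_enumeration_test(depth):
    """Enumerate all cases to the given depth; return (cases, failing chains)."""
    cases = list(enumerate_cases(CaseBuilder(), depth))
    failures = [c.classes for c in cases if resolve(c.classes) != c.expected]
    return len(cases), failures
```

With two templates and a depth of 4, the search visits 2 + 4 + 8 + 16 = 30 cases.  The number of cases grows exponentially with depth, which is why the depth bound matters.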

In summary, enumeration testing is an advanced method which is difficult to implement.  However, it is the only method able to adequately address the hardest testing problems.

Conclusion

A common misconception about testing is that it is an inherently simple task.  In fact, writing high-quality tests is technically challenging, and achieving very high quality requires knowledge of advanced programming techniques and computer science theory.  In this post, I have discussed the general principles of high-quality testing, the role of different kinds of quality in testing, and a series of increasingly advanced methodologies for writing tests.  The final piece of the testing picture deals with measurements and metrics, which I will discuss in the next post.

Author: Eric McCorkle

Eric McCorkle is a computer scientist with a background in programming languages, concurrency, and systems.