This is the first of a three-part series on testing practices that I originally wrote as a part of an initiative to create and improve quality practices in my current company. It derives heavily from my past experience, particularly in the langtools team in the Java Platform Group at Oracle. In the langtools team we took quality very seriously and strove to constantly improve our practices. We brought the same advanced skillsets and commitment to bear on quality and testing as we did on our core function of programming languages and compilers.
As a result of my time with that group, I regard testing and quality as a challenging technical problem, deserving of the attention of experts and the use of advanced techniques and theory. More to the point, testing and quality should not be seen as a task for low-ranking staff, consisting mostly of mindlessly repetitive tasks (as it is through much of industry).
In this series of posts, I describe a scientific view of software testing and develop characterizations of the various techniques and practices that I’ve worked and am working to put into practice in my current role. In this first post, I focus on a scientific paradigm for testing, and on the importance of testing a system from multiple viewpoints.
Testing: A Scientific Approach
Ideally, software quality controls should provide us with guarantees about how software behaves. We would like to be able to make guarantees like “this software doesn’t fail under heavy loads” or “this software doesn’t allow users to take actions that violate our security policy”.
The only way we can make these guarantees with certainty is by employing formal methods, which are not feasible for industrial-level use at our current level of tool and programming language technology. Humanity has dealt with similar issues in the past. The scientific method arose as an alternative to trying to prove facts about the workings of the world using pure philosophy, which had proven unsuccessful. Rather than attempting to reason from first principles to derive irrefutable facts, the scientific method instead aims to make predictions based on repeatable experiments. One need only look at history to see how successful this method has been.
The scientific method acts as a highly effective guiding principle when applied to software testing. In this approach, we still seek to make guarantees; however, rather than proving those guarantees with formal methods techniques, we treat those guarantees as a hypothesis and design repeatable, sound test suites as experiments. As with science, we do not make absolute claims about the guarantees we are testing. Instead, we claim only to show up to some degree of certainty that the guarantee holds. Likewise, we try to avoid prejudice in our experiments and to document the assumptions we make in a given test suite as completely as possible. Finally, we constantly seek to find and test anything which we have not yet explored, as it represents a loophole in our certainty.
This approach will serve as the guiding principle for this series on testing.
Testing Multiple Views
Production software is built out of many separate components and consists of many levels of organization. The manner in which these components fit together can often conceal potential points of failure in the individual components. For example, a general alias analysis pass in a compiler may fail under certain conditions, but it may also be the case that the way the IR generator works never happens to expose those cases, or does so only very rarely. This can allow bugs to hide for long periods of time, and then be revealed later when changes are made that expose them. Similarly, having a high degree of certainty about the correctness of individual components does not say anything about the manner in which those components are put together. Two perfectly correct components can be combined in a way that contains flaws.
Finally, it is necessary to test guarantees beyond merely “the implementation behaves correctly”, including guarantees such as “the system does not fail under heavy load”. Systems often manifest failures under heavy load that cannot be predicted or modeled solely by testing correctness in single test runs. Similarly, high operating capacity can reveal performance problems that are not obvious from case-based testing.
All of this means that it is necessary to employ multiple testing methodologies, each of which targets a particular view of the codebase. Three such views prove particularly useful: unit testing, end-to-end testing, and stress testing. I will discuss each in turn.
Testing Components: Unit Testing
The first view of software is the one that developers see: the component-level view. Any software component presents an API of some variety, which has some form of specification defining how it is to behave. Unit testing is the practice of writing tests that target the units of functionality in a given component. In unit testing, the goal is to guarantee that the components behave according to their documented specifications. As such, unit testing can be seen as testing hypotheses of the form “this functional unit behaves according to this specification”.
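As a minimal sketch of this idea, consider a hypothetical BoundedCounter component (the class and its specification are invented here for illustration). A unit test encodes the hypothesis "this functional unit behaves according to its specification" as a set of executable checks:

```java
// Hypothetical component: a bounded counter whose specification says that
// increment() never pushes the value past the bound, and reset() returns it to zero.
class BoundedCounter {
    private final int bound;
    private int value = 0;

    BoundedCounter(int bound) { this.bound = bound; }

    void increment() { if (value < bound) value++; }
    void reset()     { value = 0; }
    int  value()     { return value; }
}

public class BoundedCounterTest {
    static void check(boolean condition, String hypothesis) {
        if (!condition) throw new AssertionError("refuted: " + hypothesis);
    }

    public static void main(String[] args) {
        BoundedCounter c = new BoundedCounter(2);

        // Hypothesis: the counter never exceeds its bound.
        c.increment(); c.increment(); c.increment();
        check(c.value() == 2, "value never exceeds the bound");

        // Hypothesis: reset() returns the counter to zero.
        c.reset();
        check(c.value() == 0, "reset() returns the value to zero");

        System.out.println("all unit-level hypotheses held");
    }
}
```

Each check corresponds directly to a clause of the documented specification; a failing check refutes the hypothesis rather than merely "breaking the build".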
The key advantage of unit testing is its direct correspondence to functional units. Because each functional unit has a test suite associated with it, it is straightforward to exercise the case space for that unit. This is the primary utility of unit testing: it tests the same hypotheses that users assume to be true when they write code using the component. Not surprisingly, it is also straightforward to expand coverage of the implementation; in approaches such as system testing, it can become quite difficult to ensure that all cases are covered or to increase coverage. Additionally, unit tests are easy to maintain alongside ongoing development. For this reason, unit tests tend to make good pre-integration tests.
Unit testing is weak, however, when it comes to testing hypotheses about the behavior of the system as a whole. While it can test guarantees about the behavior of components individually, new cases often arise from how multiple components are used in combination. Unit testing is also unable to say anything about what sort of behavior users see from the system. For these reasons, we must employ other views in our testing.
Testing Behavior: System (and end-to-end UI) Testing
The second view of software looks at the system as a whole. In the javac team, our system tests exercised the whole compiler, as opposed to its individual components. System testing, along with end-to-end UI testing, is the discipline that uses this view. Most of the javac test suite consists of tests of this variety: they generate a Java source sample, run it through the compiler, and check the compiler's output or error messages in some way.
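A sketch of this style of test follows, using the standard javax.tools compiler API; the Sample source and the particular assertion are invented for illustration. The test feeds a generated source file through the whole compiler and checks that it is rejected:

```java
import javax.tools.*;
import java.net.URI;
import java.util.List;

public class CompilerSystemTest {
    // An in-memory source file, standing in for a generated Java sample.
    static class StringSource extends SimpleJavaFileObject {
        final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    public static void main(String[] args) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();

        // Hypothesis: the compiler, as a whole system, rejects an ill-typed
        // assignment and reports at least one error diagnostic.
        String badSample = "class Sample { int x = \"not an int\"; }";
        boolean succeeded = javac.getTask(null, null, diagnostics, null, null,
                List.of(new StringSource("Sample", badSample))).call();

        if (succeeded) throw new AssertionError("refuted: expected a type error");
        System.out.println("compiler rejected the sample with "
                + diagnostics.getDiagnostics().size() + " diagnostic(s)");
    }
}
```

Note that the test never touches an individual compiler component: the hypothesis concerns the observable behavior of the whole pipeline, exactly as a user would see it.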
Unlike unit testing, system testing works on the whole system; as such, it is able to test hypotheses about the system’s behavior, as well as how all of the components work in concert. Where unit testing works on the hypotheses that developers assume, system testing works on the hypotheses that users assume. However, system testing is weaker when it comes to testing anything about an individual component’s behavior or exercising all of its possible cases. It is also more difficult to improve code coverage through system testing, as some code paths may be effectively inaccessible or very difficult to exercise.
In a compiler, for example, it is often difficult to completely explore the behavior of a particular component through system testing alone. In a system like a language runtime or a database, where nondeterministic behavior is pervasive, exercising individual components through system tests would likely be more difficult still.
As a final note, in end-to-end testing on systems-type programs, it can become difficult or impossible to avoid nondeterminism, especially in programs that make heavy use of threading. This is precisely why we need to employ unit as well as system testing: unit-style tests can be strung together to reconstruct sequences of operations that led to a failure, where it can be extremely difficult to reproduce these kinds of failures in a system test.
However, neither unit nor system testing is able to test hypotheses about how the system behaves under heavy loads, or about its performance in such conditions. Nor can these methods say anything about metrics such as expected time between failures. For this, we need to employ the final view.
Testing Behavior under Load: Stress Testing
I am a staunch advocate of deterministic, repeatable testing, as it is fundamental to the scientific view of testing. In light of this, some find it odd that I advocate stress testing, which deliberately uses randomness to continuously generate heavy workloads for the system. The key to understanding this position lies in the hypotheses being tested by stress testing.
Stress testing does pay attention to correctness, as incorrect behavior constitutes a failure. However, its primary goal is to test hypotheses about how the system behaves under various kinds of load, about the rate of failures, and about how load affects performance. Thus, the repeatability of stress testing is based on its ability to reliably generate certain kinds of loads, as opposed to exact inputs.
Actually writing stress tests is similar to the process of writing system or unit tests, and if the facilities for writing these tests are designed well, they can be reused for system tests. The challenge of stress testing lies in reliably generating loads that contain a certain mix of operations and don’t miss some important portion of the case space.
Additionally, system and unit tests produce discrete results: they run once and report something about what happened. Stress tests are different in that they provide continuous results; there is no concept of a single "run" of a stress test. Stress tests don't allow us to say things like "the test passed". Rather, stress tests provide information over an interval of time: they allow us to say things like "the stress test ran for 72 hours without failure", or "the average throughput was 50Kops/sec".
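A minimal stress-test driver might look like the following sketch. Here a ConcurrentHashMap stands in for the real system under test, and the load mix, thread count, and duration are invented for illustration; note that the result is interval-based (duration survived, throughput) rather than a discrete pass/fail verdict:

```java
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class StressTestSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in system under test; a real stress test would target the actual system.
        ConcurrentMap<Integer, Integer> systemUnderTest = new ConcurrentHashMap<>();

        long durationMs = 2_000;   // a real run would last hours or days
        int threads = 4;
        long deadline = System.currentTimeMillis() + durationMs;

        LongAdder completedOps = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                Random random = new Random();
                while (System.currentTimeMillis() < deadline) {
                    int key = random.nextInt(1_000);
                    // Load mix: roughly 70% writes, 30% reads.
                    if (random.nextInt(10) < 7) {
                        systemUnderTest.put(key, key);
                    } else {
                        systemUnderTest.get(key);
                    }
                    completedOps.increment();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);

        // Continuous, interval-based result rather than a discrete verdict.
        System.out.printf("ran %d ms without failure; ~%d ops/sec%n",
                durationMs, completedOps.sum() * 1000 / durationMs);
    }
}
```

The randomness here serves repeatable load generation, not repeatable inputs: two runs will execute different operation sequences but the same statistical mix.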
This post has covered three important views for testing: unit testing, which allows us to test a hypothesis about a particular component; system testing, which allows us to test a hypothesis about the behavior of the system as a whole; and stress testing, which allows us to test a hypothesis about how the system behaves under load. No one of these can serve as a good test suite by itself; we need all of them to make quality guarantees with a reasonable level of certainty.
In the next post, I’ll be talking about techniques for writing good tests.