Distributed Package and Trust Management

I presented a lightning talk at last night’s Boston Haskell meetup on an idea I’ve been working on for some time: a distributed package and trust management system.  I had previously written an internal blog post on this matter, which I am now publishing here.

Package Management Background

Anyone who has used or written open-source software in a modern language is familiar with package managers.  Nearly all modern languages provide some kind of package management facility: Haskell has Hackage, Ruby has RubyGems, Rust has Cargo, and so on.  These package managers allow users to quickly and easily install packages from a central repository, and they give developers a way to publish new packages.  While this sort of system is a step up from the manual fetching and installation of libraries required in languages like C and Java, most implementations are limited to the use case of open-source development for applications without high security, trust, and auditing requirements.

These systems were never designed for industrial and high-trust applications, so there are some key shortcomings for those uses:

  • No Organizational Repositories: A central package repository is handy, but it fails to address the use case of an organization wanting to set up its own internal package repository.
  • Lack of Support for Closed-Source Packages: Package systems usually work by distributing source.  If you can’t push your packages up to the world, then you fall back to the manual installation model.
  • Inconsistent Quality: The central repository tends to accumulate a lot of junk: low-quality, half-finished, or abandoned packages, or as my former colleague John Rose once said, “a shanty-town of bikesheds”.
  • No Verifiable Certification/Accountability: Most of these package systems offer very little in the way of accountability or certification.  Some provide a voting or review system, and all of them provide author attribution, but this is insufficient for organizations that want to know about things like certified releases and builds.

Distributed Package Management

There has been some ongoing work in the Haskell community to build a more advanced package management library called Skete (pronounced “skeet”).  The model used for this library is a distributed one that functions more like Git (in fact, it uses Git as a backend).  This allows organizations to create their own internal repositories that receive updates from a central repository and can also host internal-only projects.  Alec Heller, whom I know through the Haskell community, is one of the developers on the project.  He gave a talk about it at the Haskell meetup back in May (note: the library has progressed quite a bit since then), which you can find here.

This work is interesting because it solves a lot of the problems with the current central-repository package systems.  With a little engineering effort, the following can be accomplished:

  • Ability to maintain internal package repositories that receive updates from a master, but also contain internal-only packages
  • Ability to publish binary-only distributions up to the public repositories, but keep the source distributions internal
  • Option to publish packages directly through git push rather than a web interface
  • Ability to create “labels”, which essentially amount to package sets

This is definitely an improvement on existing package management technologies, and can serve as a basis for building an even better system.  With this in hand, we can think about building a system for accountability and certification.

Building in Accountability and Certification

My main side project is a dependently-typed systems language.  In such a language, we are able to prove facts about a program, as its type system includes a logic for doing so.  This provides much stronger guarantees about the quality of a program; however, publishing the source code, proof obligations, and proof scripts may not always be feasible for a number of reasons (most significantly, they likely provide enough information to reverse-compile the program).  The next best thing is to establish a system of accountability and certification that allows various entities to certify that the proof scripts succeed.  This would be built atop a foundation that uses strong crypto to create unforgeable certificates, issued by the entities that check the code.

This same use case also works for the kinds of security audits done by security consulting firms in the modern world.  These firms conduct security audits on applications, applying a number of methods such as penetration testing, code analysis, and threat modeling to identify flaws and recommend fixes.

This brings us at last to the idea that’s been growing in my head: what if we had a distributed package management system (like Skete) that also included a certification system, so that users could check whether a particular entity has granted a particular certification to a particular package?  Specific use cases might look like this:

  • When I create a version of a package, I create a certification that it was authored by me.
  • A third-party entity might conduct an audit of the source code, then certify the binary artifacts of a particular source branch.  This would be pushed upstream to the public package repository along with the binaries, but the source would remain closed.
  • Such an entity could also certify an open-source package.
  • A public CI system could pick up on changes pushed to a package repository (public or private) and run tests/scans, certifying the package if they succeed.
  • A mechanism similar to a block-chain could be used to allow entities to update their certifications of a package (or revoke them).
  • Negative properties (like known vulnerabilities, deprecation, etc) could also be asserted through this mechanism (this would require additional engineering to prevent package owners from deleting certifications about their packages).
  • Users can require that certain certifications exist for all packages they install (or conversely, that certain properties are not true).

This would be fairly straightforward to implement using the Skete library:

  • Every package has a descriptor, which includes information about the package, a UUID, and hashes for all the actual data.
  • The package repositories essentially double as a CA, and manage granting/revocation of keys using the package manager as a distribution system.  Keys are granted to any package author, and any entity which wishes to certify packages.
  • Packages include a set of signed records, which include a description of the properties being assigned to the package along with a hash of the package’s descriptor.  These records can be organized as a block-chain to allow organizations to provide updates at a later date.
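To make this concrete, here is a minimal sketch (in Haskell) of what the certification data model might look like.  All of the type and field names are illustrative only and are not part of Skete’s actual API; the cryptographic material is left abstract as ByteStrings, and the uuid, text, and bytestring packages are assumed.

import Data.ByteString (ByteString)
import Data.Text (Text)
import Data.UUID (UUID)

-- Descriptor identifying a package: metadata plus hashes of the actual data.
data PackageDescriptor = PackageDescriptor
  { pkgId          :: UUID
  , pkgName        :: Text
  , pkgVersion     :: Text
  , artifactHashes :: [ByteString]
  }

-- A property asserted about a package, positive or negative.
data Property
  = AuthoredBy Text           -- authorship claim
  | AuditPassed Text          -- name of the auditing entity
  | CIPassed Text             -- name of the CI system
  | KnownVulnerability Text   -- e.g. a CVE identifier
  | Deprecated

-- A signed certification record.  Each record names the descriptor it
-- certifies and optionally chains to a previous record, giving the
-- block-chain-like structure used for later updates and revocations.
data Certification = Certification
  { certSubject   :: ByteString        -- hash of the package descriptor
  , certProperty  :: Property
  , certIssuer    :: ByteString        -- issuer key, granted by the repository CA
  , certPrevious  :: Maybe ByteString  -- hash of the previous record in the chain
  , certSignature :: ByteString        -- signature over the fields above
  }

A client enforcing the install-time policies described in the earlier list would walk the chain of records attached to a package, verify each signature against keys it trusts, and check that the required properties are present (and that no disqualifying negative properties are).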

Implementation Plans

After I gave my brief talk about this idea, I had a discussion with one of the Skete developers about the possibility of rolling these ideas up into that project.  Based on that discussion, it all seems feasible, and hopefully a system that works this way will be coming to life in the not-too-distant future.

ZFS Support for EFI Now in FreeBSD

Sometime last year, I started working on a patch to add ZFS support to the UEFI boot facilities for FreeBSD.

Backstory: I’ve been a FreeBSD fan and user since my junior year of undergrad (2003), and I run it as my OS whenever I can.  I first started looking into UEFI support as a GSoC project.  Unfortunately, I had to drop the project due to a combination of a sudden job search and my grandfather’s brain cancer diagnosis.

Fast forward a few years, and I circled back to see what remained to be done on the UEFI front.  The boot process was there, but only for UFS.  So over the holidays, I started poking around to see what could be done.

I started out by refactoring boot1 (the program that resides in the EFI partition, pulls loader off the main filesystem, and runs it), putting it into a more modular form to support multiple filesystem modules.  I then started writing the ZFS module.  I hit a tipping point sometime in April, and got it working completely shortly thereafter.

The next task was loader itself.  This proved trickier, but I eventually figured out what needed to be done.  To my delight, the modified loader worked fine with the GRUB2 bootloader as well as FreeBSD’s boot1.

For most of the rest of the year, it was passed around and used by various people, and was picked up by NextBSD and PCBSD.  It entered the formal review process in late autumn, and several people contributed changes that helped out immensely in the integration effort.  In particular, several people addressed stylistic issues (I am not terribly familiar with FreeBSD’s style guide) and integrated libstand support (which I had thought to be a problem due to the need for Win32 ABI binaries in EFI).

I was informed on the way home from the gym that it’s been committed to HEAD, and will hopefully make it into 10.3.  I’m glad to see it now officially in FreeBSD, and I’m grateful to the people who helped out with the integration.

I have future plans in this arena, too.  I deliberately modularized the boot1 program in preparation for some other efforts.  First, I plan to look into adding GELI (the full-disk encryption mechanism for FreeBSD) support.  I would also like to see support for checking cryptographic signatures of loader and kernel at boot-time (I’ve heard others are working on something like that).  In the very long run, I’d like to see a completely trusted custody chain from boot to kernel, but that is something that will take multiple steps to realize.

Boston-Area PL/Type Theory

Last night saw the first meeting of the Boston-Area PL/Type Theory group that I put together on Meetup.com (link).  This was an initial meet-and-greet and organizing meeting, intended to serve as a brainstorming session for what to do next.

I’m pleased with the outcome of this meeting.  We were joined by a number of folks from the Boston Haskell community as well as Adam Chlipala of MIT.  Adam suggested that we use space in the MIT computer science department for our events, which seems to be the most advantageous option for several reasons.

We also had a productive discussion about the mission of the group, in particular how to deal with the fact that we will have rather wide variation in the level of knowledge among members.  The idea emerged that we should have different “tracks” of events geared towards different experience levels and goals.  Three distinct tracks came out of the discussion:

  • Beginners: Featuring events like introductory lectures and group dial-ins to the internet type theory group’s sessions
  • Experienced: Featuring events like a reading group and discussions of and/or lectures on advanced topics
  • “Do Stuff”: Geared towards active work on research and projects, featuring unconference-style events and specific project groups

Some first steps emerged as well.  We decided to have an initial unconference/hackathon (on the “do stuff” track) at some point in February.  We also decided to set up a GitHub group for maintaining the group page, as well as any other projects that happen.  We will surely find other venues for organizing as time goes on.

It looks like we’re off to a good start, and hopefully we’ll see some interesting developments grow out of this!

How to Test Software, Part 3: Measurement and Metrics

This post is the conclusion of my series of posts about testing software.  In the first post of the series, I established scientific methods as the foundation for how we build our testing methodology.  In the second post, I discussed methods for writing quality tests and hinted at how to measure their effectiveness.  In this post, I will discuss the issues surrounding accurately measuring quality and cover some of the important measurements and methods that should be employed.

Metrics: Good and Bad

I have often compared metrics to prescription pain-killers: they can be useful tools for assessing quality; however, they are highly prone to misuse and abuse, and they can cause significant harm when misused.  As with many other things relating to testing and quality, this is a problem that science deals with on a continual basis.  One of the primary tasks of a scientific model is to make predictions based on measurements.  Therefore, it is key that we be able to make good measurements and avoid the common pitfalls that occur when designing metrics.

Common pitfalls include the following (we’ll assume that we are attempting to measure quality with these metrics):

  1. Assuming correlation implies causation (ex: “ice cream sales are correlated to crime rates, therefore ice cream causes criminal activity”)
  2. Reversing the direction of causation (ex: “wet streets cause rainfall”)
  3. Metrics with large systematic errors (inaccuracy)
  4. Metrics with large random errors (imprecision)
  5. Metrics that don’t indicate anything at all about quality (irrelevant metrics)
  6. Metrics that don’t necessarily increase as quality increases (inconsistent metrics)
  7. Metrics that may increase even when quality falls (unsound metrics)
  8. Metrics that increase at a different rate than quality after a point (diminishing returns)
  9. Using metrics in conditions that differ dramatically from the assumptions under which they were developed (violating boundary conditions)
  10. Directly comparing metrics that measure different things

Inconsistency and unsoundness are extremely common flaws in older software quality (and productivity) metrics.  For example, “lines of code” was a common productivity metric in software development in the 80s and early 90s (some very misguided firms still use it today).  This metric is flawed because it doesn’t actually correlate with real productivity at all, for numerous reasons (chief among them being that low-quality code is often much longer than a well-designed and engineered solution).  Likewise, “number of bugs for a given developer” has been employed by several firms as a quality metric, and it consistently has the ultimate result of dramatically reducing quality.

There are many more examples of the dangers of bad metrics, and of relying solely on metrics.  Because of the dangers associated with their use, I recommend the following points when evaluating and using metrics:

  • Consult someone with scientific training on the design and use of all metrics
  • Be watchful for warning signs that a metric is not working as intended
  • Understand the conditions under which a given metric applies, and when those conditions don’t hold
  • Understand the principle of diminishing returns and apply it to the use of metrics
  • Understand that a metric only measures a portion of the world, and watch for phenomena for which it fails to account

Examples of Measurements

The following are examples of various measurements of quality, and the factors governing their effective use.

Quality of Test Design: Case Coverage

The previous post covered various testing strategies in considerable detail and discussed their relative levels of quality.  That discussion covered various issues affecting test quality; however, the key benefit provided by the more advanced testing methods was better case coverage.  Case coverage is an abstract metric that measures how much of the space of cases in which a given component or system can operate is covered by the tests.  For simple, finite (and stateless) components, case coverage can be measured directly.  In most cases, however, it is notoriously difficult to analyze, as the case spaces for most components and systems are infinite.

With very large or infinite case spaces, we need to devote careful thought to what portion of the case space is covered by the test suites.  In infinite spaces, there is usually some kind of equivalence structure.  We can define a notion of “depth” where equivalent problem instances all lie on a particular trajectory, and “deeper” problems grow more complex.  We would like to build test suites that cover the entire surface and go down to a uniform depth.  Methods like combinatorial testing are quite powerful in this regard and can achieve this result for many testing problems; however, they are not infallible.  Testing problems with very complex case spaces can require a prohibitively large combinatorial test in order to avoid missing certain parts of the surface.

In the most complex cases, the case space has a recursive structure, a highly complex equivalence class structure, or both.  These kinds of cases arise frequently in compilers, interpreters, and database systems.  The best example from my own work is the expanded selection/resolution logic in the VM spec in JDK8.  In that case, exercising every spec rule through combinatorial testing produced a prohibitively large space, so we had to employ enumeration-based methods to explore all of the many possible branch-points in the case space while avoiding redundant instances.

The takeaway is that it is critical to consider the nature of the case space.  If we visualize the case space as a kind of surface, then problems that can be described (and tested) via combinatorial methods look like relatively simple geometric objects, and a combinatorial test looks like a cubical volume, so it is relatively straightforward to capture large portions of the case space.  A problem like selection/resolution looks more like a highly complex fractal-like structure.  Problems such as these require different methods to achieve reasonable case coverage.

Effectiveness of a Test Suite on an Implementation: Code Coverage

Case coverage is a measure of the quality of a test suite’s design, and is derived from the specification of the thing being tested.  Code coverage addresses a different problem: the effectiveness of a test suite on a particular implementation.  An important point of this is that I do not believe these two metrics to be alternatives for one another.  They measure completely different things, and thus they both must be employed to give a broader view of the test quality picture.

Code coverage is essential because the implementation will likely change more readily than the behavioral specification.  Serious gaps in code coverage indicate a problem: either something is wrong with the implementation, or the test suite is missing some portion of the case space.  Gaps can emerge when neither of these is the case, but when that happens, the reason should be understood.

Moreover, gaps in code coverage cast doubt on the viability of the code.  The worst example of this comes from my first job, where I once found an entire file full of callbacks that looked like this:

getFoo(Thing* thing) {
  if (thing == NULL) {
    return thing->foo;
  } else {
    return NULL;
  }
}

Note that the null-check is inverted.  Clearly this code had never been run, because there is no way that it could possibly work.  Gaps in code coverage allow cases like this to slip through undetected.

Stability Over Time

As previously discussed, stress testing seeks to test guarantees about the stability of the product.  The most important point about stress-testing is that the properties it tests are not discrete: they cannot be stated in terms of a single point in time.  Rather, they are continuous: they are expressed in terms of a continuous interval of time.  This is a key point, and is the reason that stress-testing is essential.  Unit and system testing can only establish discrete properties.  In order to get a sense of things like reliability and performance, which are inherently continuous properties, it is necessary to do stress-testing.

A very important point is that this notion also applies to incoming bug reports.  In the OpenJDK project, we generally did not write internal stress-testing suites of the kind I advocate here.  We did, however, have a community of early adopters trying out the bleeding-edge repos constantly throughout the development cycle, which had the effect of stressing the codebase continually.  Whether one considers failures generated by an automated stress test or bugs filed by early adopters, there comes a point in the release cycle where the number of outstanding bugs hits zero (this is sometimes known as the zero-bug build or point).  However, this is not an indicator of readiness to release, because it is only a discrete point in time.  The typical pattern is that the number of bugs hits zero and then immediately goes back up.  The zero-bug point indicates that the backlog is cleared out, not that the product is ready for release, because the zero-bug point is a discrete property.  The property we want for a release is a continuous one: namely, that over some interval of time, no bugs were reported and none remained open.

Performance

The issues associated with performance measurement are worthy of a Ph.D. thesis (or five), and thus are well outside the scope of this post.  This section is written more to draw attention to them, and to point out a few of the many ways that performance testing can produce bad results.

Effective performance testing is HARD.  Modern computing hardware is extremely complex, with highly discontinuous, nonlinear performance functions, chaotic behavior, and many unknowns.  The degree to which this can affect performance testing is just starting to come to light, and it has cast doubt on a large number of published results.  For example, it has been shown that merely altering the link order of a program can affect performance by up to 5%, which is on the order of the typical performance gain needed to secure publication in top-level computer architecture conferences.

The following are common problems that affect performance testing:

  • Assuming compositionality: the idea that good performance for isolated components of a system implies that the combined system will perform well.
  • Contrived microbenchmarks (small, artificial cases that perform well).  This is the dual of the previous problem: performing well on isolated parts of a problem instance doesn’t imply performing well on the combined problem.
  • Cherry-picking
  • Insufficient sample sizes, too little randomness in selections, or bad/predictable random generators
  • Failing to account for the impact of system and environmental factors (environment variables, link order, caches, etc)
  • Non-uniform preconditions for tests (failing to clear out caches, etc.)
  • Lack of repeatability

The takeaway from this is that performance testing needs to be treated as a scientific activity and approached with the same level of discipline that one would apply in a lab setting.  Its results need to be viewed with skepticism until they can be reliably repeated many times, in many different environments.  Failure to do this casts serious doubt on any result the tests produce.

Sadly, this fact is often neglected, even in top-level conferences; however, this is not an excuse to continue to neglect it.
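To make the basic discipline slightly more concrete, here is a minimal Haskell sketch that times the same workload repeatedly and reports the spread rather than a single number.  The workload is a placeholder, and the time package is assumed; a real harness would additionally control the environment, vary it deliberately, and repeat the runs on multiple machines.

import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Placeholder workload; parameterized so repeated runs are not cached.
workload :: Int -> Int
workload k = sum [1 .. 2000000 + k]

timeRun :: Int -> IO Double
timeRun k = do
  start <- getCurrentTime
  _     <- evaluate (workload k)
  end   <- getCurrentTime
  return (realToFrac (diffUTCTime end start))

main :: IO ()
main = do
  samples <- mapM timeRun [1 .. 30]
  let n    = fromIntegral (length samples)
      mean = sum samples / n
      var  = sum [(x - mean) ^ (2 :: Int) | x <- samples] / n
  putStrLn ("mean: " ++ show mean ++ " s, stddev: " ++ show (sqrt var) ++ " s")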

Conclusion

In this series, I have described an approach to testing that has its foundations in the scientific method.  I have discussed different views from which tests must be written.  I have described advanced methods for building tests that achieve very high case coverage.  Finally, I have described the principles of how to effectively measure quality, and the many pitfalls that must be avoided.

The single most important takeaway from this series is this:

Effective testing is a difficult multifaceted problem, deserving of serious intellectual effort by dedicated, high-level professionals.

Testing should not consist of mindlessly grinding out single-case tests.  It should employ sophisticated analysis and implementation methods to examine the case space and explore it to a satisfactory degree, to generate effective workloads for stress testing, and to analyze the performance of programs.  These are very difficult tasks that require the attention of people with advanced skills, and they should be viewed with the respect that solving problems of this difficulty deserves.

Moreover, within each organization, testing and quality should be seen as an essential part of the development process, and something requiring serious attention and effort.  Time and resources must be budgeted, and large undertakings for the purpose of building testing infrastructure, improving existing tests, and building new tests should be encouraged and rewarded.

Lastly, a culture similar to what we had in the langtools team, where we were constantly looking for ways to improve our testing and quality practices, pays off in a big way.  Effort put into developing high-quality tests, testing frameworks, and testing methods saves tremendous amounts of time and effort by detecting and avoiding bugs, preventing regressions, and making refactoring a much easier process.  We should therefore seek to cultivate this kind of attitude in our own organizations.

How to Test Software, Part 2: Quality of Tests

In the first post in this series, I discussed an overall approach to testing based on the scientific method.  I also discussed the need for multiple views in our testing methodology as well as three important views that our testing regimen should incorporate.  Unit testing is important, as it tests the kind of guarantees that developers rely upon when using components.  System testing is important because it tests software from the same view as the end users.  Finally, stress and performance testing are important as they answer questions about the continuous operation of the system.

However, I only talked about the general approach to writing tests, and the views from which we write them.  I said nothing about the actual quality of the tests, but rather deferred the topic to a later post: namely this one.

Test Quality: Basics

In scientific investigations, experimental quality is of paramount importance.  Bad experiments lead to bad conclusions; thus it is important to design experiments that are sound, repeatable, and which convey enough information to establish the conclusions we draw.  Similar criteria govern how we should write our tests.  Specifically, tests should establish some set of guarantees to a high degree of certainty.  Thus, we must design our tests as we would an experiment: with considerable thought to the credibility of the test in establishing the guarantees we wish to establish.

Many test suites fail miserably when it comes to their experimental methodology.  Among the most common reasons are the following:

  • Low Coverage
  • Random or unreliable generation of cases
  • Lack of repeatability/intermittent failures

We want to establish a rigorous testing methodology that consistently produces high-quality, credible tests that test their hypotheses to a high degree of certainty.  We can derive the general guidelines for any test suite from the principles of sound experiments.  The following is a list of these principles:

  • Tests should be consistently repeatable, and should not have intermittent failures
  • Tests should give informative and actionable results when they fail
  • Tests should achieve high coverage, and should automatically generate a large set of cases wherever possible
  • Correctness tests should never have any kind of randomness in their operation
  • Stress and performance tests should minimize entropy from out-of-scope factors

With these general principles in mind, we can look at what specifically makes for quality tests in each of the views we discussed in the previous post in this series.

Unit Test Quality

Unit tests examine guarantees about individual components.  One of their chief advantages over other views is the ability to directly exercise cases and codepaths that may be difficult to trigger in whole-system tests.  As such, case coverage is of paramount importance for writing good unit tests.

A less obvious factor in the quality of a unit test is the set of preconditions under which the tests run.  Very few components have a purely-functional specification; most interact with parts of the system in a stateful fashion.  There is often a temptation to write synthetic harnesses which simulate the behavior of the system in a very small number of cases; however, this leads to low-quality tests.  High-quality tests will explore the behavior of the system with a wide variety of preconditions.

In summary, the additional criteria for unit tests are as follows:

  • Explore the case space for components completely
  • Simulate wide varieties of preconditions that affect the behavior of the components

System Test Quality

System tests examine guarantees about the system as a whole.  The purpose of system tests is to test the software from the same point of view as the users who will eventually use it.  The key difficulty with this view is repeatability, particularly for complex systems that interact with things like databases or the network.  For the most complex systems, considerable care must be taken in order to engineer repeatable tests.

Additionally, it is necessary to consider system-specific behaviors like character sets, filesystem conventions, date and time issues, and other such issues.

The following are common problems that need to be considered in writing system tests:

  • OS-specific factors (encoding, filesystem behaviors, etc)
  • OS-level preconditions (existing files, environment variables, etc)
  • Interactions with other services (databases, authentication servers, etc)
  • Network issues ((in)accessibility, configurations, changing IPs, etc.)

Stress/Performance Test Quality

Stress tests examine guarantees about stability under certain kinds of load.  Performance tests likewise examine performance under certain kinds of load.  Both of these differ from other kinds of testing in that the properties they examine are about continuous intervals of time as opposed to discrete points.

Both stress and performance tests tend to involve some degree of entropy (stress tests do so deliberately; performance tests do so more out of a need to measure real performance).  This is a key difference from correctness-oriented tests, which should avoid entropy at all costs.  The key to quality testing when entropy is unavoidable is to keep it limited to relevant entropy and isolate the test from irrelevant entropy; that is, maximize the signal and minimize the noise.  In stress testing, we want to measure stability under “characteristic” workloads; thus, it is critical that we generate loads that are statistically similar to a characteristic load, or at the very minimum have statistical properties that we understand.  Additionally, it is important that we don’t accidentally neglect certain aspects of the desired workload.

In performance testing, we must also avoid accidental biases in our tests arising from factors like caching.  This may seem simple, but in fact it is much more difficult than a first glance would suggest.  For example, the content of environment variables can significantly affect the cache behavior, as can the link order of the application.  The contents of caches, both CPU as well as filesystem and page caches can likewise have a significant effect on performance, and can accidentally bias the tests.  It is important to think carefully about performance tests and all the factors that affect performance in order to avoid these kinds of bias.

The following are important factors for writing stress and performance tests:

  • Ensure that the statistical properties of the synthetic workload accurately reproduce the desired properties
  • Ensure that the space of generated cases does not exclude any cases that we desire to include
  • Ensure that non-obvious biases are eliminated or minimized in performance tests

Coverage

The problem of coverage is central to the problem of correctness testing.  Older literature on testing describes two dual methodologies: blackbox and whitebox (sometimes called glassbox).  The difference between the two can be stated in terms of case coverage and code coverage.  I prefer not to talk about whitebox and blackbox testing, because both case and code coverage are important.  They also don’t represent concepts that can be directly compared.  Code coverage is a measurable quantity, which can and should be determined using coverage tools.  Case coverage, on the other hand, is a conceptual idea, and does not lend itself to direct measurement except in the simplest of cases.

Put another way, case coverage is useful for evaluating the quality of the design of an individual test or the quality of a given test-writing methodology.  We can clearly talk about what kinds of cases a test generates and tests, how many of them are generated, how varied or redundant they are, and we can reason to some extent about how much they approximate complete testing of the entire case space (which is often infinite).  Thus, case coverage is a measure of the quality of the design of a test.

Code coverage, on the other hand, generally cannot be directly inferred from a given test; rather, it is a measure that is obtained by running the test and collecting and analyzing profiling data after the fact.  It functions as a performance metric, and indicates the adequacy of a test.  Even a very well-designed test suite with good case coverage may leave gaps in the code coverage either because those gaps come from very obscure edge cases, or because for some reason those code paths cannot be exercised by any test case (which can indicate underlying problems in the implementation).  Thus, code coverage is a measure of the adequacy of a test suite.

The remainder of this post will focus on case coverage, and how different test design methodologies achieve different levels of coverage.  I will discuss code coverage in a future post.

Test Design Methodologies

The technical difficulty of designing high-quality tests is often underestimated.  As a consequence, many test suites contain large numbers of low-quality tests.  In one of many discussions about testing during my time working on OpenJDK, I described a system of tiers for testing, focused on the degree to which each method provides high case coverage and the sorts of problem spaces it is equipped to handle.  This section describes these tiers in detail.

Single-Case Tests

Single-case tests are the most common method for writing tests.  They are also the least effective method, as they achieve extremely low case coverage (and often very low code coverage).  The single-case testing methodology is bad for a number of reasons:

  • It does not scale either in terms of engineering effort or in terms of execution.  Any automated case generation method can achieve coverage levels that would require hundreds of thousands of person-hours with the single-case methodology.
  • There is an inherent bias toward writing simple cases, which tends to result in the tests missing the complex cases.
  • It tends to result in a significant amount of copy-pasting, which leads to errors in the test.
  • It results in an unmaintainable test suite, often with many duplicated tests.

For these and other reasons, single-case tests were extremely strongly discouraged in the langtools group, and would usually fail code review without some justification.

Template Tests

Template tests are a method for quickly generating and testing large numbers of very similar cases.  With template testing, we create a template which constructs a test case from a parameter value.  This template is then applied to a range of parameter values, which generates and tests the various cases.

This method was frequently employed in the javac test suite to test relatively simple problems that we encountered.  It is more effective for problems with a relatively “flat” structure, though often combinatorial testing is required for more complex problem spaces.

A common variation on this style of testing was to create a “little language”, which describes a test case in a very concise format.  This was used to test bridge method generation in Lambda for JDK8 (this test suite is now part of the OpenJDK repository).
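As a rough illustration of the shape of a template test, here is a minimal, self-contained Haskell sketch.  The systemUnderTest function is a stand-in pure function introduced only for this example; in a real suite the template would generate source text and the check would invoke the compiler.

import Control.Monad (forM_, unless)

-- Stand-in for the real system under test.
systemUnderTest :: Int -> Int
systemUnderTest n = n * n

-- A template test: one template and one oracle, applied over a parameter range.
runTemplateTest :: (Int -> Int)   -- template: builds a case from a parameter
                -> (Int -> Int)   -- oracle: expected result for a case
                -> [Int]          -- parameter range
                -> IO ()
runTemplateTest mkCase oracle params =
  forM_ params $ \p -> do
    let input    = mkCase p
        expected = oracle input
        actual   = systemUnderTest input
    unless (actual == expected) $
      putStrLn ("FAIL for parameter " ++ show p)

main :: IO ()
main = runTemplateTest (\p -> p + 1) (\x -> x * x) [0 .. 1000]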

Combinatorial Tests

Combinatorial tests, or “combotests”, were the most common methodology used by the javac team as our practices matured.  Combotests work similarly to template tests, except that they have multiple parameters.  The test has a range of possible inputs for each parameter, and it runs on every possible combination of inputs.

Combinatorial tests achieve a very high level of coverage for many problems, and can generate and test tens of thousands of problem instances in an efficient manner.  This methodology is sufficient for many problems.  Only the most complex problems require the more advanced method of enumeration testing.

For nearly all testing problems, combinatorial tests represent the “sweet spot” of the diminishing returns curve.  They achieve high coverage, but are relatively easy to implement.  For this reason, combinatorial testing should be the preferred method of writing tests.
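A minimal Haskell sketch of the pattern: each parameter gets a small list of values, and the test runs on the full Cartesian product.  The dimensions and the checkCase oracle here are placeholders for whatever the real problem requires.

import Control.Monad (forM_, unless)

data TestCase = TestCase
  { modifier   :: String
  , arity      :: Int
  , returnKind :: String
  } deriving Show

-- Every combination of the three dimensions: 3 * 4 * 3 = 36 cases here;
-- real combotests routinely generate tens of thousands.
allCases :: [TestCase]
allCases =
  [ TestCase m n r
  | m <- ["static", "final", "abstract"]
  , n <- [0, 1, 2, 7]
  , r <- ["void", "int", "Object"]
  ]

-- Placeholder oracle; a real test would build, run, and check each case.
checkCase :: TestCase -> Bool
checkCase _ = True

main :: IO ()
main = forM_ allCases $ \c ->
  unless (checkCase c) (putStrLn ("FAIL: " ++ show c))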

Enumeration Testing

Combinatorial testing is a powerful method, but it is predicated on the idea that a problem can be expressed in terms of a small set of independent dimensions, each combination of which is a unique problem instance and whose expected result can be easily determined.  It breaks down in the presence of certain conditions, including the following:

  • When it is difficult to determine the expected result from a problem instance without re-implementing the thing being tested
  • When the problem has a highly recursive structure to its specification
  • When there is a complex equivalence class structure among the problem instances

When these conditions are in effect, combinatorial testing fails either because it does not explore enough of the problem space, or because it must explore a prohibitively large space in order to achieve reasonable case coverage.

Examples of where these kinds of conditions manifest include type systems, symbol resolution in the presence of inherited and nested scopes, and dispatch logic in the implementation of object-oriented languages.  In all these cases, we see the features I listed above.  As I work on compilers quite a bit, I encounter these kinds of problems frequently; thus I have moved over time to using enumeration testing methods to deal with them.

Enumeration testing is based on the notion of proof trees in logic, and on the idea that each rule in a specification or a type system implies something about the test cases that exercise it.  For example, symbol resolution in Java has a rule which states that if a class does not define the desired symbol, then we recursively search for the symbol in its superclass.  This implies that we have (at least) two test cases for this rule: one in which a class defines the symbol, and one in which it does not.

Enumeration testing creates a builder for test cases, and a set of “templates” which potentially operate on a builder to add data to the test case.  We then use tree-enumeration to explore all possible cases out to a certain depth.  In essence, we turn the testing problem into a branching search problem.
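The following Haskell sketch shows the skeleton of this approach for a toy version of the symbol-resolution example above: the builder describes a chain of classes, each “template” step extends the chain in one way, and tree enumeration explores every case up to a fixed depth.  Everything here is illustrative; a real enumeration test would carry much richer builders and would prune equivalent cases.

-- A builder for a toy case: a chain of classes, each of which either
-- defines the symbol being resolved (True) or does not (False).
newtype Builder = Builder { classChain :: [Bool] } deriving Show

emptyCase :: Builder
emptyCase = Builder []

-- Each "template" extends the case under construction in one way.
steps :: [Builder -> Builder]
steps =
  [ \b -> b { classChain = True  : classChain b }  -- class defines the symbol
  , \b -> b { classChain = False : classChain b }  -- class does not define it
  ]

-- Enumerate every case reachable within the given depth.
enumerate :: Int -> Builder -> [Builder]
enumerate 0 b = [b]
enumerate d b = b : concatMap (\step -> enumerate (d - 1) (step b)) steps

main :: IO ()
main = mapM_ print (enumerate 3 emptyCase)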

In summary, enumeration testing is an advanced method which is difficult to implement.  However, it is the only method able to adequately address the hardest testing problems.

Conclusion

A common misconception about testing is that it is an inherently simple task.  Writing high-quality tests is a technically challenging task, and achieving very high quality requires a knowledge of advanced programming and computer science theory techniques.  In this post, I have discussed the general principles of high-quality testing, the role of different kinds of quality in testing, and a series of increasingly advanced methodologies for writing tests.  The final piece of the testing picture deals with measurements and metrics, which I will discuss in the next post.

How to Test Software, Part 1: Hypotheses and Views

This is the first of a three-part series on testing practices that I originally wrote as a part of an initiative to create and improve quality practices in my current company.  It derives heavily from my past experience, particularly in the langtools team in the Java Platform Group at Oracle.  In the langtools team we took quality very seriously and strove to constantly improve our practices.  We brought the same advanced skillsets and commitment to bear on quality and testing as we did on our core function of programming languages and compilers.

As a result of my time with that group, I regard testing and quality as a challenging technical problem, deserving of the attention of experts and the use of advanced techniques and theory.  More to the point, testing and quality should not be seen as a task for low-ranking staff, consisting mostly of mindlessly repetitive tasks (as it is through much of industry).

In this series of posts, I describe a scientific view of software testing and develop characterizations of the various techniques and practices that I’ve worked and am working to put into practice in my current role.  In this first post, I focus on a scientific paradigm for testing, and on the importance of testing a system from multiple viewpoints.

Testing: A Scientific Approach

Ideally, software quality controls should provide us with guarantees about how software behaves.  We would like to be able to make guarantees like “this software doesn’t fail under heavy loads” or “this software doesn’t allow users to take actions that violate our security policy”.

The only way we can make these guarantees with certainty is by employing formal methods, which are not feasible for industrial-level use at our current level of tool and programming language technology.  Humanity has dealt with similar issues in the past.  The scientific method arose as an alternative to trying to prove facts about the workings of the world using pure philosophy, which had proven unsuccessful.  Rather than attempting to reason from first principles to derive irrefutable facts, the scientific method instead aims to make predictions based on repeatable experiments.  One need only look at history to see how successful this method has been.

The scientific method acts as a highly effective guiding principle when applied to software testing.  In this approach, we still seek to make guarantees; however, rather than proving those guarantees with formal methods techniques, we treat those guarantees as a hypothesis and design repeatable, sound test suites as experiments.  As with science, we do not make absolute claims about the guarantees we are testing.  Instead, we claim only to show up to some degree of certainty that the guarantee holds.  Likewise, we try to avoid prejudice in our experiments and to document the assumptions we make in a given test suite as completely as possible.  Finally, we constantly seek to find and test anything which we have not yet explored, as it represents a loophole in our certainty.

This approach will serve as the guiding principle for this series on testing.

Testing Multiple Views

Production software is built out of many separate components and consists of many levels of organization.  The manner in which these components fit together can often conceal potential points of failure in the individual components.  For example, a general alias analysis pass in a compiler may fail under certain conditions, but it may also be the case that the way the IR generator works never happens to expose those cases, or does so only very rarely.  This can allow bugs to hide for long periods of time, and then be revealed later when changes are made that expose them.  Similarly, having a high degree of certainty about the correctness of individual components does not say anything about the manner in which those components are put together.  Two perfectly correct components can be combined in a way that contains flaws.

Finally, it is necessary to test guarantees beyond merely “the implementation behaves correctly”, including guarantees such as “the system does not fail under heavy load”.  Systems often manifest failures under heavy load that cannot be predicted or modeled solely by testing correctness in single test runs.  Similarly, high operating capacity can reveal performance problems that are not obvious from case-based testing.

The meaning of all this is that it is necessary to employ multiple testing methodologies, each of which targets a particular view of the codebase.  Three such views prove themselves particularly useful: unit testing, end-to-end testing, and stress testing.  I will discuss each in turn.

Testing Components: Unit Testing

The first view of software is the one that developers see: the component-level view.  Any software component presents an API of some variety, which has some form of specification defining how it is to behave.  Unit testing is the practice of writing tests that target the units of functionality in a given component.  In unit testing, the goal is to guarantee that the components behave according to their documented specifications.  As such, unit testing can be seen as testing hypotheses of the form “this functional unit behaves according to this specification”.

The key advantage of unit testing is its direct correspondence to functional units.  Because each functional unit has a test suite associated with it, it is straightforward to exercise the case space for that unit.  This is the primary utility of unit testing: it tests the same hypotheses that users assume to be true when they write code using the component.  Not surprisingly, it is also straightforward to expand the coverage level for the implementation.  In approaches such as system testing, it can become quite difficult to ensure that the cases are covered or to increase coverage.  Additionally, unit tests can be easily maintained with ongoing development.  For this reason, unit tests tend to make good pre-integration tests.

Unit testing is weak, however, when it comes to making testing hypotheses about the behavior of the system as a whole.  While it can test guarantees about the behavior of components individually, new cases often arise from how multiple components are used in combination.  Unit testing is also unable to say anything about what sort of behavior users see from the system.  For these reasons, we must employ other views in our testing.

Testing Behavior: System (and end-to-end UI) Testing

The second view of software looks at the system as a whole.  In the javac team, our system tests worked on the whole compiler, as opposed to its individual components.  System testing, as well as end-to-end UI testing, is a discipline that uses this view.  Most of the javac test suite consists of tests of this variety: they generate a Java source sample, run it through the compiler, and test the compiler’s output or error messages in some way.

Unlike unit testing, system testing works on the whole system; as such, it is able to test hypotheses about the system’s behavior, as well as how all of the components work in concert.  Where unit testing works on the hypotheses that developers assume, system testing works on the hypotheses that users assume.  However, system testing is weaker when it comes to testing anything about an individual component’s behavior or exercising all of its possible cases.  It is also more difficult to improve code coverage through system testing, as some code paths may be effectively inaccessible or very difficult to exercise.

In a compiler, for example, it is often difficult to completely explore the behavior of a particular component.  In a system like a language runtime, a database system, or something more systems-oriented, testing individual components using system testing would likely be much more difficult as these systems involve nondeterministic behavior.

As a final note, in end-to-end testing on systems-type programs, it can become difficult or impossible to avoid nondeterminism, especially in programs that make heavy use of threading.  This is precisely why we need to employ unit as well as system testing: unit-style tests can be strung together to reconstruct sequences of operations that led to a failure, where it can be extremely difficult to reproduce these kinds of failures in a system test.

However, neither unit nor system testing is able to test hypotheses about how the system behaves under heavy loads, or about its performance in such conditions.  Nor can these methods say anything about metrics such as expected time between failures.  For this, we need to employ the final view.

Testing Behavior under Load: Stress Testing

I am a staunch advocate of deterministic, repeatable testing, as it is fundamental to the scientific view of testing.  In light of this, some find it odd that I advocate stress-testing, which specifically uses randomness to continuously generate heavy workloads for the system.  The key to understanding this position lies in the hypotheses being tested by stress testing.

Stress-testing does pay attention to correctness, as incorrect behavior constitutes a failure.  However, its primary goal is to test hypotheses about how the system behaves under various kinds of load, about the rate of failures, and about how load affects performance.  Thus, the repeatability of stress testing is based on its ability to reliably generate certain kinds of loads, as opposed to exact inputs.

Actually writing stress tests is similar to the process of writing system or unit tests, and if the facilities for writing these tests are designed well, they can be reused for system tests.  The challenge of stress testing lies in reliably generating loads that contain a certain mix of operations and don’t miss some important portion of the case space.
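As a small sketch of what repeatable load generation can mean in practice, the Haskell fragment below derives a workload from a fixed seed and a fixed operation mix, so the same seed always reproduces the same load.  The Op type and the 80/15/5 mix are placeholders, and the random package is assumed.

import System.Random (StdGen, mkStdGen, randomR)

data Op = Read | Write | Delete deriving Show

-- Desired operation mix: 80% reads, 15% writes, 5% deletes.
pickOp :: Double -> Op
pickOp x
  | x < 0.80  = Read
  | x < 0.95  = Write
  | otherwise = Delete

-- A seeded workload: the same seed always yields the same sequence.
workload :: Int -> Int -> [Op]
workload seed n = go (mkStdGen seed) n
  where
    go :: StdGen -> Int -> [Op]
    go _ 0 = []
    go g k = let (x, g') = randomR (0.0, 1.0) g
             in pickOp x : go g' (k - 1)

main :: IO ()
main = mapM_ print (take 10 (workload 42 1000000))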

Additionally, system and unit tests produce discrete results; that is, they run once and report something about what happened.  Stress tests are different in that they provide continuous results.  There is no concept of a “run” of a stress test.  Stress tests don’t allow us to say things like “the test passed”.  Rather, stress tests provide information over an interval of time: they allow us to say things like “the stress test ran for 72 hours without failure”, or “the average throughput was 50Kops/sec”.

Conclusion

This post has covered three important views for testing: unit testing, which allows us to test a hypothesis about a particular component, system testing, which allows us to test a hypothesis about the behavior of the system as a whole, and stress testing, which allows us to test a hypothesis about how the system behaves under load.  No one of these can serve as a good test suite by itself; we need all of them to make quality guarantees with a reasonable level of certainty.

In the next post, I’ll be talking about techniques for writing good tests.