InfoSec and IoT: A Sustainability Analogy

Yesterday saw a major distributed denial-of-service (DDoS) attack against DNS infrastructure that crippled internet access for much of the northeastern US, as well as other areas.  These sorts of attacks are nothing new; in fact, this attack came on the anniversary of a similar attack fourteen years ago.  Yesterday’s attack is nonetheless significant, both in its scope and in the role played by the growing internet of things (IoT).

The attack was facilitated by the Mirai malware suite, which specifically targets insecure IoT devices, applying a brute-force password attack to gain access to the machines and deploy its malware.  Such an attack would almost certainly fail if directed against machines with appropriate security measures in place and on which passwords had been correctly set.  IoT devices, however, often lack such protections, are often left with their default login credentials, and often go unpatched (after all, who among even the most eager adopters of IoT can say that they routinely log in to every lightbulb in their house to change the passwords and download patches?).  Yesterday, we saw the negative consequences of the proliferation of these kinds of devices.

Public Health and Pollution Analogies

Industry regulation, whether self-imposed or imposed by the state, is a widely accepted practice among modern societies.  The case for this practice lies in the reality that some actions are not limited in their effect to oneself and one’s customers, but rather have a tangible effect on the entire world.  Bad practices in these areas lead to systemic risks that threaten even those who have nothing to do with the underlying culprits.  In such a situation, industry faces a choice between two options, one of which will eventually come to pass: self-regulate, or have regulations imposed from without.

Two classic examples of such a situation come in the form of public health concerns and environmental pollution.  Both of these have direct analogs to the situation we now face with insecure IoT devices and software (in)security in the broader context.

IoT and Pollution

After the third attack yesterday, I posted a series of remarks on Twitter that gave rise to this article, beginning with “IoT is the carbon emissions of infosec. Today’s incident is the climate change analog. It won’t be the last”.  I went on to criticize the current trend of gratuitously deploying huge numbers of “smart” devices without concern for the information security implications.

The ultimate point I sought to advance is that releasing huge numbers of insecure, connected devices into the world is effectively a form of pollution, and it has serious negative impacts on information security for the entire internet.  We saw one such result yesterday in the form of one of the largest DDoS attacks and the loss of internet usability for significant portions of the US.  As serious as this attack was, however, it could have been far worse.  Such a botnet could easily be used in far more serious attacks, possibly to the point of causing real damage.  And of course, we’ve already seen cases of “smart” devices equipped with cameras being used to surreptitiously capture videos of unsuspecting people, which are then used for blackmail purposes.

These negative effects, like pollution, affect the world as a whole, not just the subset of people who decide they need smart lightbulbs and smart brooms.  They create a swarm of devices ripe for the plucking by malware, which in turn compromises basic infrastructure and harms everyone.  It is not hard to see the analogy between this and a dirty coal-burning furnace contaminating the air, leading to maladies like acid rain and brown lung.

Platforms, Methodologies, and Public Health

Anyone who follows me on Twitter or interacts with me in person knows I am harshly critical of the current state of software methodologies, Scrum in particular, and of platforms based on untyped languages, NodeJS in particular.  Make no mistake, Scrum is snake oil as far as I’m concerned, and NodeJS is a huge step backward as a programming language and a development platform.  The popularity of both has an obvious enough root cause: the extreme bias towards developing minimally functional prototypes, or minimum viable products (MVPs) in Silicon Valley VC lingo.  Scrum is essentially a process for managing “war-room” emergencies, and languages like JavaScript do allow one to throw together a barely-working prototype faster than a language like Java, Haskell, or Rust.  This expedience has a cost, of course: such languages are far harder to secure, to test, and to maintain.

Of course, few consumers really care what sort of language or development methodology is used, so long as they get their product; or at least, so goes the current conventional wisdom.  When we consider the widespread information security implications, however, the picture begins to look altogether different.  Put another way, Zuckerberg’s adage “move fast and break things” becomes irresponsible and unacceptable when the potential exists to break the entire internet.

Since the early 1900s, the US has had laws governing healthcare-related products as well as food, drugs, and others.  The reasons for this are twofold: first, to protect consumers who lack insight into the manufacturing process, and second, to protect the public from health crises such as epidemics that arise from contaminated products.  In the case of the Pure Food and Drug Act, the call for regulation was driven in large part by the extremely poor quality standards of large-scale industrial food processing, as documented in Upton Sinclair’s work The Jungle.

The root cause of the conditions that led to the regulation of the food industry is, I believe, the same as the root cause of the conditions that have led to the popularization of insecure platforms and unsound development methodologies: the competition-induced drive to lower costs and production times, combined with a pathological lack of accountability for the quality of products and the negative effects of quality defects.  When combined, these factors consistently lead nowhere good.

Better Development Practices and Sustainability

These trends are simply not sustainable.  They exacerbate an already severe information security crisis and, on a long enough timeline, they stand to cause significant economic damage as a result of attacks like yesterday’s, if not more severe attacks that pose a real material risk.

I do not believe government-imposed regulations are a solution to this problem.  In fact, in the current political climate, I suspect such a regulatory effort would end up imposing regulations such as back doors and other measures that would do more damage to the state of information security than they would help.

The answer, I believe, must come from industry itself and must be led by infosec professionals.  The key is realizing that, as with sustainable manufacturing, better development practices are actually more viable and lead to lower eventual costs.  Sloppy practices and bad platforms may cut costs and development times in the short term, but in the long run they end up costing much more.  This sort of paradigm shift is neither implausible nor unprecedented.  Driving it is a matter of educating industry colleagues about these issues and the benefits of sounder platforms and development processes.

Summary

Yesterday’s attack brought to the forefront the potential for the proliferation of insecure devices and software to have a profound negative effect on the entire world.  A key root cause of this is an outdated paradigm in software development that ignores these factors in favor of the short-term view.  It falls to the infosec community to bring about the necessary change toward a more accurate view and more sound and sustainable practices.

FreeBSD EFI boot/loader Refactor

I have just completed (for some value of “complete”) a project to refactor the FreeBSD EFI boot and loader code.  This originally started out as an investigation of a possible avenue in my work on GELI full-disk encryption support for the EFI boot and loader, and grew into a project in its own right.

More generally, this fits into a bunch of work I’m pursuing or planning to pursue in order to increase the overall tamper-resistance of FreeBSD, but that’s another article.

Background

To properly explain all this, I need to briefly introduce both the FreeBSD boot and loader architecture as well as EFI.

FreeBSD Boot Architecture

When an operating system starts, something has to do the work of getting the kernel (and modules, and often other stuff) off the disk and into memory, setting everything up, and then actually starting it.  This is the boot loader.  Boot loaders are often in a somewhat awkward position: they need to do things like read filesystems, detect some devices, load configurations, and do setup, but they don’t have the usual support of the operating system to get it done.  Most notably, they are difficult to work with because if something goes wrong, there is very little in the way of recovery, debugging, or even logging.

Moreover, back in the old days of x86 BIOS, space was a major concern: the BIOS pulled in the first disk sector, meaning the program had to fit into less than 512 bytes.  Even once a larger program was loaded, you were still in 16-bit execution mode.

To deal with this, FreeBSD adopted a multi-stage approach.  The initial boot loader, called “boot”, had the sole purpose of pulling in a more featureful loader program, called “loader”.  In truth, boot consisted of two stages itself: the tiny boot block, and then a slightly more powerful program loaded from a designated part of the BSD disklabel.

The loader program is much more powerful, having a full suite of filesystem drivers, a shell, facilities for loading and unloading the kernel, and other things.  This two-phase architecture overcame the severe limitations of the x86 BIOS environment.  It also allowed the platform-specific boot details to be separated from both the loader program and the kernel.  This sort of niceness is the hallmark of a sound architectural choice.

Inside the loader program, the code uses a set of abstracted interfaces to talk about devices.  Devices are detected, bound to a device switch structure, and then filesystem modules provide a way to access the filesystems those devices contain.  Devices themselves are referred to by strings that identify the device switch managing them.  This abstraction allows loader to support a huge variety of configurations and platforms in a uniform way.
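
To make this concrete, here is a rough sketch of the shape of those abstractions.  The field names and signatures are simplified for illustration and do not exactly match the loader sources; the idea is simply that a device switch entry describes how to probe and perform block IO on a class of devices, while a filesystem module layers file operations on top of it.

/*
 * Simplified sketch of the loader's device-switch abstraction.
 * Field names and signatures are illustrative only; the real
 * structures in the FreeBSD loader carry more members.
 */
#include <stddef.h>

struct devdesc;                         /* an instance of a detected device */

struct devsw {
    const char *dv_name;                /* e.g. "disk", "net"; used in device strings */
    int  (*dv_init)(void);              /* probe/detect devices of this class */
    int  (*dv_open)(struct devdesc *dev);
    int  (*dv_close)(struct devdesc *dev);
    int  (*dv_strategy)(struct devdesc *dev, int rw,
                        unsigned long blk, size_t size,
                        char *buf, size_t *rsize);
};

struct fs_ops {
    const char *fs_name;                /* e.g. "ufs", "zfs" */
    int  (*fo_open)(const char *path, struct devdesc *dev, void **handle);
    int  (*fo_read)(void *handle, void *buf, size_t size, size_t *resid);
    int  (*fo_close)(void *handle);
};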

The Extensible Firmware Interface

In the mid-2000s, the Extensible Firmware Interface started to replace BIOS as the boot environment on x86 platforms.  EFI is far more modern, featureful, abstracted, and easy to work with than the archaic, crufty, and often unstandardized or undocumented BIOS.  I’ve written boot loaders for both; EFI is pretty straightforward, whereas BIOS is a tarpit of nightmares.

One thing EFI does is remove the draconian constraints on the initial boot loader.  The firmware loads a specific file from a filesystem, rather than a single block from a disk.  The EFI spec guarantees support for the FAT32 filesystem and the GUID Partition Table, and individual platforms are free to support others.

Another thing EFI does is provide abstracted interfaces for device IO, filesystems, and many other things.  Devices, both concrete hardware and derived devices such as disk partitions and network filesystems, are represented using “device handles”, which support various operational interfaces through “protocol interfaces”, and are named using “device paths”.  Moreover, vendors and operating system authors alike are able to provide their own drivers through a driver binding interface, which can create new device handles or bind new protocol interfaces to existing ones.
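
As a concrete illustration, the following sketch enumerates the handles that carry a given protocol interface and opens that interface on each one.  It assumes an EDK2-style environment (the gBS boot services pointer and the protocol GUID globals) and trims all error handling; the FreeBSD boot code is organized differently, but it relies on the same underlying boot services.

/*
 * Sketch: enumerate every handle that carries the simple filesystem
 * protocol and open the interface on each one.  Assumes an EDK2-style
 * environment (gBS, protocol GUID globals); error handling is trimmed.
 */
#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Protocol/SimpleFileSystem.h>

VOID
EnumerateFilesystems(EFI_HANDLE ImageHandle)
{
    EFI_HANDLE *Handles;
    UINTN      NumHandles;
    UINTN      i;

    /* Ask the firmware for all handles supporting the protocol. */
    gBS->LocateHandleBuffer(ByProtocol,
                            &gEfiSimpleFileSystemProtocolGuid,
                            NULL, &NumHandles, &Handles);

    for (i = 0; i < NumHandles; i++) {
        EFI_SIMPLE_FILE_SYSTEM_PROTOCOL *Fs;

        /* Bind to the protocol interface on this handle... */
        gBS->OpenProtocol(Handles[i],
                          &gEfiSimpleFileSystemProtocolGuid,
                          (VOID **)&Fs, ImageHandle, NULL,
                          EFI_OPEN_PROTOCOL_GET_PROTOCOL);
        /* ...and Fs->OpenVolume() can now yield a root directory. */
    }

    gBS->FreePool(Handles);
}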

FreeBSD Loader and EFI Similarities

The FreeBSD loader and the EFI framework do many of the same things, and they do them in similar ways most of the time.  Both have an abstracted representation of devices, interfaces for interacting with them, and a way of naming them.  In many ways, the FreeBSD loader framework is prescient in that it did many of the things that EFI ended up doing.

The one shortcoming of the FreeBSD loader is its lack of support for dynamic device detection, also known as “hotplugging”.  When FreeBSD’s boot architecture was created (circa 1994), hotplugging was extremely uncommon: most hardware was expected to be connected permanently and to remain connected for the duration of operation.  Hence, the architecture was designed around a model of one-time static detection of all devices, and the code evolved around that assumption.  Hot-plugging was added to the operating system itself, of course, but there was little need for it in the boot architecture.  When EFI was born (mid-2000s), hot-pluggable devices were common, and so supporting them was an obvious design choice.

EFI does this through its driver binding module, where drivers register a set of callbacks that check whether a device is supported, and then attempt to attach to it.  When a device is disconnected, another callback is invoked to disconnect it.  FreeBSD’s loader, on the other hand, expects to detect all devices in a probing phase during its initialization.  It then sets up additional structure (most notably, its new bcache framework) based on the list of detected devices.  Some phases of detection may rely on earlier ones; for example, the ZFS driver may update some devices that were initially detected as block devices.
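
For reference, the callbacks described above are the ones defined by the UEFI spec’s EFI_DRIVER_BINDING_PROTOCOL.  The sketch below shows their shape; the bodies are placeholders, not anything from the FreeBSD tree.

/*
 * Sketch of the callbacks behind EFI's driver binding model.  The
 * prototypes follow the UEFI spec's EFI_DRIVER_BINDING_PROTOCOL; the
 * bodies are placeholders for an actual driver's logic.
 */
#include <Uefi.h>
#include <Protocol/DriverBinding.h>

STATIC EFI_STATUS EFIAPI
MyDriverSupported(EFI_DRIVER_BINDING_PROTOCOL *This,
                  EFI_HANDLE Controller,
                  EFI_DEVICE_PATH_PROTOCOL *RemainingDevicePath)
{
    /* Inspect Controller's protocols; return EFI_SUCCESS if we can drive it. */
    return EFI_UNSUPPORTED;
}

STATIC EFI_STATUS EFIAPI
MyDriverStart(EFI_DRIVER_BINDING_PROTOCOL *This,
              EFI_HANDLE Controller,
              EFI_DEVICE_PATH_PROTOCOL *RemainingDevicePath)
{
    /* Attach: install new protocol interfaces or child handles here. */
    return EFI_SUCCESS;
}

STATIC EFI_STATUS EFIAPI
MyDriverStop(EFI_DRIVER_BINDING_PROTOCOL *This,
             EFI_HANDLE Controller,
             UINTN NumberOfChildren,
             EFI_HANDLE *ChildHandleBuffer)
{
    /* Detach: tear down whatever Start() installed. */
    return EFI_SUCCESS;
}

STATIC EFI_DRIVER_BINDING_PROTOCOL gMyDriverBinding = {
    MyDriverSupported, MyDriverStart, MyDriverStop,
    0x10,       /* Version */
    NULL, NULL  /* ImageHandle, DriverBindingHandle (filled at install) */
};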

Refactoring Summary

As I mentioned, my work on this was originally a strategy for implementing GELI support.  A problem with the two-phase boot process is that it’s difficult to get information between the two phases, particularly in EFI, where all code is position-independent, no hard addresses are guaranteed, and components are expected to talk through abstract interfaces.  (In other words, it rules out the sort of hacks that the non-EFI loader uses!)  This is a problem for something like GELI, which has to ask for a password to unlock the filesystem (we don’t want to ask for a password multiple times).  Also, much of what I had to implement for GELI, with abstract devices and a GPT partition driver and such, ended up mirroring things that already existed in the EFI framework.

I ended up refactoring the EFI boot and loader to make more use of the EFI framework, particularly its protocol interfaces.  The following is a summary of the changes:

  • The boot and loader programs now look for instances of the EFI_SIMPLE_FILE_SYSTEM_PROTOCOL, and use that interface to load files.
  • The filesystem backend code from loader was moved into a driver which does the same initialization as before, then attaches EFI_SIMPLE_FILE_SYSTEM_PROTOCOL interfaces to all device handles that host supported filesystems.
  • This is accomplished through a pair of wrapper interfaces that translate between EFI_SIMPLE_FILE_SYSTEM_PROTOCOL and the FreeBSD loader framework’s filesystem interface (a rough sketch of this wrapper idea follows the list).
  • I originally wanted to move all device probing and filesystem detection into the EFI driver model, where probing and detection would be done in callbacks.  However, this didn’t work primarily because the bcache framework is strongly coupled to the static detection way of doing things.
  • Interfaces and device handles installed in boot can be used by loader without problems.  This provides a way to pass information between phases.
  • The boot and loader programs can also make use of interfaces installed by other programs, such as GRUB, or custom interfaces provided by open-source firmware.
  • The boot and loader programs now use the same filesystem backend code; the minimal versions used by boot have been discarded.
  • Drivers for things like GELI, custom partition schemes, and similar things can work by creating new device nodes and attaching device paths and protocol interfaces to them.
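
To give a flavor of the wrapper idea mentioned above, here is a rough sketch of exposing a loader-style filesystem backend as an EFI_SIMPLE_FILE_SYSTEM_PROTOCOL instance on a device handle.  The names (LoaderFsOpenVolume, AttachLoaderFs) are invented for this sketch and the actual patch is more involved; the point is simply that the standard InstallMultipleProtocolInterfaces() boot service is what lets interfaces installed in boot be picked up later by loader.

/*
 * Rough illustration of the wrapper idea: expose the loader's fs_ops
 * backend as an EFI_SIMPLE_FILE_SYSTEM_PROTOCOL on a device handle.
 * LoaderFsOpenVolume and AttachLoaderFs are invented names for this
 * sketch; the actual patch differs.
 */
#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Protocol/SimpleFileSystem.h>

STATIC EFI_STATUS EFIAPI
LoaderFsOpenVolume(EFI_SIMPLE_FILE_SYSTEM_PROTOCOL *This,
                   EFI_FILE_PROTOCOL **Root)
{
    /* Translate into the loader filesystem backend's open call and wrap
     * the result in an EFI_FILE_PROTOCOL implementation (omitted here). */
    return EFI_UNSUPPORTED;
}

STATIC EFI_SIMPLE_FILE_SYSTEM_PROTOCOL gLoaderFs = {
    EFI_SIMPLE_FILE_SYSTEM_PROTOCOL_REVISION,
    LoaderFsOpenVolume
};

/* Attach the wrapper to a handle whose media hosts a supported filesystem. */
EFI_STATUS
AttachLoaderFs(EFI_HANDLE DeviceHandle)
{
    return gBS->InstallMultipleProtocolInterfaces(
        &DeviceHandle,
        &gEfiSimpleFileSystemProtocolGuid, &gLoaderFs,
        NULL);
}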

I sent an email out to -hackers announcing the patch this morning, and I hope to get GELI support up and going in the very near future (the code is all there; I just need to plug it in to the EFI driver binding and get it building and running properly).

For anyone interested, the branch can be found here: https://github.com/emc2/freebsd/tree/efize

The Complex Nature of the Security Problem

This article is an elaboration on ideas I originally developed in a post to the project blog for my pet programming language project here.  The ideas remain as valid now (if not more so) as they did eight months ago when I wrote the original piece.

The year 2015 saw a great deal of publicity surrounding a number of high-profile computer security incidents.  While this trend has been ongoing for some time now, the past year marked a point at which the problem entered the public consciousness to the point where it has become a national news item and is likely to be a key issue in the coming elections and beyond.

“The Security Problem” as I have taken to calling it is not a simple issue and it does not have a simple solution.  It is a complex, multi-faceted problem with a number of root causes, and it cannot be solved without adequately addressing each of those causes in turn.  It is also a crucial issue that must be solved in order for technological civilization to continue its forward progress and not slip into stagnation or regression.  If there is a single message I would want to convey on the subject, it is this: the security problem can only be adequately addressed by a multitude of different approaches working in concert, each addressing an aspect of the problem.

Trust: The Critical Element

In late September, I did a “ride-along” of a training program for newly-hired security consultants.  Just before leaving, I spoke briefly to the group, encouraging them to reach out to us and collaborate.  My final words, however, were broader in scope: “I think every era in history has its critical problems that civilization has to solve in order to keep moving forward, and I think the security problem is one of those problems for our era.”

Why is this problem so important, and why would its existence have the potential to block forward progress?  The answer is trust.  Trust, specifically the ability to trust people about whom we know almost nothing and whom, indeed, we may never meet, is arguably the critical element that allows civilization to exist at all.  Consider what might happen if that kind of trust did not exist: we would be unable to create and sustain basic institutions such as governments, hospitals, markets, banks, and public transportation.

Technological civilization requires a much higher degree of trust.  Consider, for example, the amount of trust that goes into using something as simple as checking your bank account on your phone.  At a very cursory inspection, you trust the developers who wrote the app that allows you to access your account, the designers of the phone, the hardware manufacturers, the wireless carrier and their backbone providers, the bank’s server software and their system administrators, the third-party vendors that supplied the operating system and database software, the scientists who designed the crypto protecting your transactions and the standards organizations who codified it, the vendors who supplied the networking hardware, and this is just a small portion.  You quite literally trust thousands of technologies and millions of people that you will almost certainly never meet, just to do the simplest of tasks.

The benefits of this kind of trust are clear: the global internet and the growth of computing devices has dramatically increased efficiency and productivity in almost every aspect of life.  However, this trust was not automatic.  It took a long time and a great deal of effort to build.  Moreover, this kind of trust can be lost.  One of the major hurdles for the development of electronic commerce, for example, was the perception that online transactions were inherently insecure.

This kind of progress is not permanent, however; if our technological foundations prove themselves unworthy of this level of trust, then we can expect to see stymied progress or in the worst case, regression.

The Many Aspects of the Security Problem

As with most problems of this scope and nature, the security problem does not have a single root cause.  It is the product of many complex issues interacting to produce a problem, and therefore its solution will necessarily involve committed efforts on multiple fronts and multiple complementary approaches to address the issues.  There is no simple cause, and no “magic bullet” solution.

The contributing factors to the security problem range from highly technical (with many aspects in that domain), to logistical, to policy issues, to educational and social.  In fact, a complete characterization of the problem could very well be the subject of a graduate thesis; the exposition I give here is therefore only intended as a brief survey of the broad areas.

Technological Factors

As the security problem concerns computer security (I have dutifully avoided gratuitous use of the phrase “cyber”), it comes as no surprise that many of the contributing factors to the problem are technological in nature.  However, even within the scope of technological factors, we see a wide variety of specific issues.

Risky Languages, Tools, and APIs

Inherently dangerous or risky programming language or API features are one of the most common factors that contribute to vulnerabilities.  Languages that lack memory safety can lead to buffer overruns and other such errors (which are among the most common exploits in systems), and untyped languages admit a much larger class of errors, many of which lead to vulnerabilities like injection attacks.  Additionally, many APIs are improperly designed and lead to vulnerabilities, or are designed in such a way that safe use is needlessly difficult.  Lastly, many tools can be difficult to use in a secure manner.
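
To make “lack of memory safety” concrete, here is the classic example in miniature.  The function and buffer are contrived, but the pattern is the root of a huge class of real vulnerabilities.

/* Classic overrun: strcpy() writes past the end of 'name' whenever the
 * attacker-controlled input exceeds 15 characters plus the terminator. */
#include <string.h>

void
save_name(const char *input)
{
    char name[16];
    strcpy(name, input);    /* no bounds check: adjacent stack memory is overwritten */
}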

We have made some headway in this area.  Many modern frameworks are designed in such a way that they are “safe by default”, requiring no special configuration to satisfy many safety concerns and requiring the necessary configuration to address the others.  Programming language research over the past 30 years has produced many advanced type systems that can make stronger guarantees, and we are starting to see these enter common use through languages like Rust.  My current employer, Codiscope, is working to bring advanced program analysis research into the static program analysis space.  Initiatives like the NSF DeepSpec expedition are working to develop practical software verification methods.

However, we still have a way to go here.  No mature engineering discipline relies solely on testing: civil engineering, for example, accurately predicts the tolerances of a bridge long before it is built.  Software engineering has yet to develop methods with this level of sophistication.

Configuration Management

Modern systems involve a dizzying array of configuration options.  In multi-level architectures, there are many different components interacting in order to implement each bit of functionality, and all of these need to be configured properly in order to operate securely.

Misconfigurations are a very frequent cause of vulnerabilities.  Enterprise software components can have hundreds of configuration options per component, and we often string dozens of components together.  In this environment, it becomes very easy to miss a configuration option or accidentally fail to account for a particular case.  The fact that there are so many possible configurations, most of which are invalid, further exacerbates the problem.

Crypto has also tended to suffer from usability problems.  Crypto is particularly sensitive to misconfigurations: a single weak link undermines the security of the entire system.  However, it can be quite difficult to develop and maintain hardened crypto configurations over time, even for the technologically adept.  The difficulty of setting up software like GPG for non-technical users has been the subject of actual research papers.  I can personally attest to this as well, having guided multiple non-technical people through the setup.

This problem can be addressed, however.  Configuration management tools allow configurations to be set up from a central location, and managed automatically by various services (CFEngine, Puppet, Chef, Ansible, etc.).  Looking farther afield, we can begin to imagine tools that construct configurations for each component from a master configuration, and to apply type-like notions to the task of identifying invalid configurations.  These suggestions are just the beginning; configuration management is a serious technical challenge, and can and should be the focus of serious technical work.

Legacy Systems

Legacy systems have long been a source of pain for technologists.  They represent a kind of debt that is often too expensive to pay off in full, but which exacts a recurring tax on resources in the form of legacy costs (compatibility issues, bad performance, blocked upgrades, unusable systems, and so on).  To those most directly involved in the development of technology, legacy systems tend to be a source of chronic pain; from the standpoint of budgets and limited resources, however, they are often a kind of pain to be managed as opposed to cured, as wholesale replacement is far too expensive and risky to consider.

In the context of security, however, the picture is often different.  These kinds of systems are often extremely vulnerable, having been designed in a time when networked systems were rare or nonexistent.  In this context, they are more akin to rotten timbers at the core of a building.  Yes, they are expensive and time-consuming to replace, but the risk of not replacing them is far worse.

The real danger is that the infrastructure where vulnerable legacy systems are most prevalent (power grids, industrial facilities, mass transit, and the like) is precisely the sort of infrastructure where a breach can do catastrophic damage.  We have already seen an example of this in the real world: the Stuxnet malware was employed to destroy uranium-enrichment centrifuges.

Replacing these legacy systems with more secure implementations is a long and expensive proposition, and doing it in a way that minimizes costs is a very challenging technological problem.  However, this is not a problem that can be neglected.

Cultural and Policy Factors

Though computer security is technological in nature, its causes and solutions are not limited solely to technological issues.  Policy, cultural, and educational factors also affect the problem, and must be a part of the solution.

Policy

The most obvious non-technical influence on the security problem is policy.  The various policy debates that have sprung up in the past years are evidence of this; however, the problem goes much deeper than these debates.

For starters, we are currently in the midst of a number of policy debates regarding strong encryption and how we as a society deal with the fact that such a technology exists.  I make my stance on the matter quite clear: I am an unwavering advocate of unescrowed, uncompromised strong encryption as a fundamental right (yes, there are possible abuses of the technology, but the same is true of such things as due process and freedom of speech).  Despite my hard-line pro-crypto stance, I can understand how those who don’t understand the technology might find the opposing position compelling.  Things like golden keys and abuse-proof backdoors certainly sound nice.  However, the real effect of pursuing such policies would be to fundamentally compromise systems and infrastructure within the US and to turn defending against data breaches and cyberattacks into an impossible problem.  In the long run, this erodes the kind of trust in technological infrastructure of which I spoke earlier and bars forward progress, leaving us to be outclassed in the international marketplace.

In a broader context, we face a problem here that requires rethinking our policy process.  The security problem is a complex technological issue, too complex for even the most astute and deliberative legislator to develop true expertise on through part-time study, yet one where the effects of uninformed policy can be disastrous.  In the context of public debate, it does not lend itself to two-sided thinking or simple solutions, and attempting to force it into such a model loses too much information to be effective.

Additionally, the problem goes deeper than issues like encryption, backdoors, and dragnet surveillance.  Much of the US infrastructure runs on vulnerable legacy systems, as I mentioned earlier, and replacing these systems with more secure, modern software is an expensive and time-consuming task.  Moreover, the need to invest in our infrastructure in this way barely registers in public debate, if at all.  However, doing so is essential to fixing one of the most significant sources of vulnerabilities.

Education

Education, or the lack thereof, also plays a key role in the security problem.  Even top-level computer science curricula fail to teach students how to think securely and develop secure applications, or even to impress upon students the importance of doing so.  This is understandable: even a decade ago, the threat level to most applications was nowhere near where it is today.  The world has changed dramatically in this regard in a rather short span of time.  The proliferation of mobile devices and connectedness, combined with a tremendous upturn in the number and sophistication of attacks launched against systems, has led to a very different sort of environment than what existed even ten years ago (when I was finishing my undergraduate education).

College curricula are necessarily a conservative institution; knowledge is expected to prove its worth and go through a process of refinement and sanding off of rough edges before it reaches the point where it can be taught in an undergraduate curriculum.  By contrast, much of the knowledge of how to avoid building vulnerable systems is new, volatile, and thorny: not the sort of thing traditional academia likes to mix into a curriculum, especially in a mandatory course.

Such a change is necessary, however, and this means that educational institutions must develop new processes for effectively educating people about topics such as these.

Culture

While it is critical to have infrastructure and systems built on sound technological approaches, it is also true that a significant number of successful attacks on large enterprises and individuals alike make primary use of human factors and social engineering.  This is exacerbated by the fact that we, culturally speaking, are quite naive about security.  There are security-conscious individuals, of course, but most people are naive to the point that an attacker can typically rely on social engineering with a high success rate in all but the most secure of settings.

Moreover, this naivety affects everything else, ranging from policy decisions to what priorities are deemed most important in product development.  The lack of public understanding of computer security allows bad policy such as back doors to be taken seriously, and allows insecure and invasive products to thrive by publishing marketing claims that simply don’t reflect reality (SnapChat remains one of the worst offenders in this regard, in my opinion).

The root cause behind this is that cultures adapt even more slowly than the other factors I’ve mentioned, and our culture has yet to develop effective ways of thinking about these issues.  But cultures do adapt; we all remember sayings like “look both ways” and “stop, drop, and roll” from our childhood, both of which teach simple but effective ways of managing basic risks that arise from technological society.  This sort of adaptation also responds to need.  During my own youth and adolescence, the danger of HIV drove a number of significant cultural changes in a relatively short period of time that proved effective in curbing the epidemic.  While the issues surrounding the security problem represent a very different sort of danger, they are still pressing issues that require a degree of cultural adaptation to address.  A key step in addressing the cultural aspects of the security problem comes down to developing similar kinds of cultural understanding and awareness, and promoting behavior changes that help reduce risk.

Conclusion

I have presented only a portion of the issues that make up what I call the “computer security problem”.  These issues are varied, ranging from deep technological issues obviously focused on security to cultural and policy issues.  There is not one single root cause to the problem, and as a result, there is no one single “silver bullet” that can solve it.

Moreover, if the problem is this varied and complex, then we can expect the solutions to each aspect of the problem to likewise require multiple different approaches coming from different angles and reflecting different ways of thinking.  My own work, for example, focuses on the language and tooling issue, coming mostly from the direction of building tools to write better software.  However, there are other approaches to this same problem, such as sandboxing and changing the fundamental execution model.  All of these angles deserve consideration, and the eventual resolution to that part of the security problem will likely incorporate developments from each angle of approach.

If there is a final takeaway from this, it is that the problem is large and complex enough that it cannot be solved by the efforts or approach of a single person or team.  It is a monumental challenge requiring the combined tireless efforts of a generation’s worth of minds and at least a generation’s worth of time.

Distributed Package and Trust Management

I presented a lightning talk at last night’s Boston Haskell meetup on an idea I’ve been working on for some time now, concerning features for a distributed package and trust manager system.  I had previously written an internal blog post on this matter, which I am now publishing here.

Package Management Background

Anyone who has used or written open-source software or modern languages is familiar with the idea of package managers.  Nearly all modern languages provide some kind of package management facility.  Haskell has Hackage, Ruby has RubyGems, Rust has Cargo, and so on.  These package managers allow users to quickly and easily install packages from a central repository, and they provide a way for developers to publish new packages.  While this sort of system is a step up from the older method of manually fetching and installing libraries that is necessary in languages like C and Java, most implementations are limited to the use-case of open-source development for applications without high security, trust, and auditing requirements.

These systems were never designed for industrial and high-trust applications, so there are some key shortcomings for those uses:

  • No Organizational Repositories: The use of a central package repository is handy, but it fails to address the use case of an organization wanting to set up its own internal package repository.
  • Lack of Support for Closed-Source Packages: Package systems usually work by distributing source.  If you can’t push your packages up to the world, then you default back to the manual installation model.
  • Inconsistent Quality: The central repository tends to accumulate a lot of junk: low-quality, half-finished, or abandoned packages, or as my former colleague John Rose once said, “a shanty-town of bikesheds”.
  • No Verifiable Certification/Accountability: In most of these package systems, there is very little in the way of an accountability or certification system.  Some systems provide a voting or review system, and all of them provide author attribution, but this is insufficient for organizations that want to know about things like certified releases and builds.

Distributed Package Management

There has been some ongoing work in the Haskell community to build a more advanced package management library called Skete (pronounced “skeet”).  The model used for this library is a distributed model that functions more like Git (in fact, it uses Git as a backend).  This allows organizations to create their own internal repositories that receive updates from a central repository and can host internal-only projects as well.  Alec Heller, whom I know through the Haskell community, is one of the developers on the project.  He gave a talk about it at the Haskell meetup back in May (note: the library has progressed quite a bit since then), which you can find here.

This work is interesting, because it solves a lot of the problems with the current central repository package systems.  With a little engineering effort, the following can be accomplished:

  • Ability to maintain internal package repositories that receive updates from a master, but also contain internal-only packages
  • Ability to publish binary-only distributions up to the public repositories, but keep the source distributions internal
  • Option to publish packages directly through git push rather than a web interface
  • Ability to create “labels” which essentially amount to package sets.

This is definitely an improvement on existing package management technologies, and can serve as a basis for building an even better system.  With this in hand, we can think about building a system for accountability and certification.

Building in Accountability and Certification

My main side project is a dependently-typed systems language.  In such a language, we are able to prove facts about a program, as its type system includes a logic for doing so.  This provides much stronger guarantees about the quality of a program; however, publishing the source code, proof obligations, and proof scripts may not always be feasible for a number of reasons (most significantly, they likely provide enough information to reverse-compile the program).  The next best thing is to establish a system of accountability and certification that allows various entities to certify that the proof scripts succeed.  This would be built atop a foundation that uses strong crypto to create unforgable certificates, issued by the entities that check the code.

This same use case also works for the kinds of security audits done by security consulting firms in the modern world.  These firms conduct security audits on applications, applying a number of methods such as penetration testing, code analysis, and threat modeling to identify flaws and recommend fixes.

This brings us at last to the idea that’s been growing in my head: what if we had a distributed package management system (like Skete) that also included a certification system, so that users could check whether or not a particular entity has granted a particular certification to a particular package?  Specific use cases might look like this:

  • When I create a version of a package, I create a certification that it was authored by me.
  • A third-party entity might conduct an audit of the source code, then certify the binary artifacts of a particular source branch.  This would be pushed upstream to the public package repository along with the binaries, but the source would remain closed.
  • Such an entity could also certify an open-source package.
  • An public CI system could pick up on changes pushed to a package repository (public or private) and run tests/scans, certifying the package if they succeed.
  • A mechanism similar to a block-chain could be used to allow entities to update their certifications of a package (or revoke them)
  • Negative properties (like known vulnerabilities, deprecation, etc) could also be asserted through this mechanism (this would require additional engineering to prevent package owners from deleting certifications about their packages).
  • Users can require that certain certifications exist for all packages they install (or conversely, that certain properties are not true).

This would be fairly straightforward to implement using the Skete library:

  • Every package has a descriptor, which includes information about the package, a UUID, and hashes for all the actual data.
  • The package repositories essentially double as a CA, and manage granting/revocation of keys using the package manager as a distribution system.  Keys are granted to any package author, and any entity which wishes to certify packages.
  • Packages include a set of signed records, each containing a description of the properties being assigned to the package along with a hash of the package’s descriptor.  These records can be organized as a block-chain to allow organizations to provide updates at a later date (a rough sketch of these records follows).
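
As a rough sketch of what those records might look like, consider something along these lines.  The field names, sizes, and algorithm choices are invented for illustration; a real implementation would need to pin down serialization, key distribution, and the chaining rules.

/*
 * Sketch of the records described above.  Field names and sizes are
 * invented for illustration only.
 */
#include <stdint.h>

#define HASH_LEN 32     /* e.g. SHA-256 */
#define SIG_LEN  64     /* e.g. Ed25519 */

/* Describes one version of a package; the data itself is stored elsewhere. */
struct pkg_descriptor {
    uint8_t  uuid[16];                  /* stable package identity */
    char     version[32];
    uint8_t  artifact_hash[HASH_LEN];   /* hash of the source or binary payload */
};

/* One certification: an entity asserts a property of a descriptor. */
struct pkg_cert {
    uint8_t  descriptor_hash[HASH_LEN]; /* which descriptor this applies to */
    uint8_t  prev_cert_hash[HASH_LEN];  /* previous record by this entity (chain) */
    char     property[64];              /* e.g. "authored-by", "audit-passed" */
    uint8_t  signer_key_id[HASH_LEN];   /* key granted via the repository CA */
    uint8_t  signature[SIG_LEN];        /* signature over all fields above */
};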

Implementation Plans

After I gave my brief talk about this idea, I had a discussion with one of the Skete developers about the possibility of rolling these ideas up into that project.  Based on that discussion, it all seems feasible, and hopefully a system that works this way will be coming to life in the not-too-distant future.

How to Test Software, Part 3: Measurement and Metrics

This post is the conclusion of my series of posts about testing software.  In the first post of the series, I established scientific methods as the foundation for how we build our testing methodology.  In the second post, I discussed methods for writing quality tests and hinted at how to measure their effectiveness.  In this post, I will discuss the issues surrounding accurately measuring quality and cover some of the important measurements and methods that should be employed.

Metrics: Good and Bad

I have often compared metrics to prescription painkillers: they can be useful tools for assessing quality; however, they are also highly prone to misuse and abuse, and can cause significant harm when abused in this way.  As with many other things relating to testing and quality, this is a problem that science deals with on a continual basis.  One of the primary tasks of a scientific model is to make predictions based on measurements.  Therefore, it is key that we be able to make good measurements and avoid the common pitfalls that occur when designing metrics.

Common pitfalls include the following (we’ll assume that we are attempting to measure quality with these metrics):

  1. Assuming correlation implies causation (ex: “ice cream sales are correlated to crime rates, therefore ice cream causes criminal activity”)
  2. Reversing the direction of causation (ex: “wet streets cause rainfall”)
  3. Metrics with large systematic errors (inaccuracy)
  4. Metrics with large random errors (imprecision)
  5. Metrics that don’t indicate anything at all about quality (irrelevant metrics)
  6. Metrics that don’t necessarily increase as quality increases (inconsistent metrics)
  7. Metrics that may increase even when quality falls (unsound metrics)
  8. Metrics that increase at a different rate than quality after a point (diminishing returns)
  9. Using metrics in conditions that differ dramatically from the assumptions under which they were developed (violating boundary conditions)
  10. Directly comparing metrics that measure different things

Inconsistency and unsoundness are extremely common flaws in older software quality (and productivity) metrics.  For example, “lines of code” was a common metric for productivity in the 80s and early 90s in software development (some very misguided firms still use it today).  This metric is flawed because it doesn’t actually correlate to real productivity at all for numerous reasons (chief among them being that low-quality code is often much longer than a well-designed and engineered solution).  Likewise, “number of bugs for a given developer” has been employed by several firms as a quality metric, and consistently has the ultimate result of dramatically reducing quality.

There are many more examples of the dangers of bad metrics, and of relying solely on metrics.  Because of the dangers associated with their use, I recommend the following points when evaluating and using metrics:

  • Consult someone with scientific training on the design and use of all metrics
  • Be watchful for warning signs that a metric is not working as intended
  • Understand the conditions under which a given metric applies, and when those conditions don’t hold
  • Understand the principle of diminishing returns and apply it to the use of metrics
  • Understand that a metric only measures a portion of the world, and watch for phenomena for which it fails to account

Examples of Measurements

The following are examples of various measurements of quality, and the factors governing their effective use.

Quality of Test Design: Case Coverage

The previous post covered various testing strategies in considerable detail and discussed their relative levels of quality.  That discussion covered various issues affecting test quality; however, the key benefit provided by the more advanced testing methods was better case coverage.  Case coverage is an abstract metric that measures the percentage of the cases in which a given component or system can operate that are covered by the tests.  In the case of simple, finite (and stateless) components, case coverage can be directly measured.  In most cases, however, it is notoriously difficult to analyze, as the case spaces for most components and systems are infinite.

With very large or infinite case spaces, we need to devote careful thought to what portion of the case space is covered by the test suites.  In infinite spaces, we have some kind of equivalence structure.  We can define a notion of “depth”, where equivalent problem instances all lie on a particular trajectory and “deeper” problems grow more complex.  We would like to build test suites that cover the entire surface and go down to a uniform depth.  Methods like combinatorial testing are quite powerful in this regard and can achieve this result for many testing problems; however, they are not infallible.  Testing problems with very complex case spaces can require a prohibitively large combinatorial test in order to avoid missing certain parts of the surface.
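
As a toy illustration of combinatorial case generation, consider a component whose behavior depends on three small parameter dimensions.  Enumerating the full cross product covers that (tiny) case space completely; the parameter names and run_case() are invented for the sketch, and real combinatorial tools (pairwise and t-way generators) exist precisely because the full product becomes prohibitively large as the dimensions grow.

/*
 * Toy illustration of combinatorial case generation: enumerate the full
 * cross product of three small parameter dimensions and hand each tuple
 * to a checker.  run_case() and its parameters are invented for this
 * sketch; real combinatorial tools prune this product intelligently.
 */
#include <stdio.h>

enum input_kind  { EMPTY, SMALL, LARGE };
enum buffering   { UNBUFFERED, LINE_BUFFERED, BLOCK_BUFFERED };
enum concurrency { SINGLE_THREAD, MULTI_THREAD };

static void
run_case(enum input_kind in, enum buffering buf, enum concurrency conc)
{
    /* Build the precondition described by the tuple, exercise the
     * component, and check the expected postcondition. */
    printf("case: input=%d buffering=%d concurrency=%d\n", in, buf, conc);
}

int
main(void)
{
    for (int in = EMPTY; in <= LARGE; in++)
        for (int buf = UNBUFFERED; buf <= BLOCK_BUFFERED; buf++)
            for (int conc = SINGLE_THREAD; conc <= MULTI_THREAD; conc++)
                run_case(in, buf, conc);
    return 0;
}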

In the most complex cases, the case space has a recursive structure, a highly complex equivalence class structure, or both.  Examples of this often arise in the context of compilers, interpreters, and database systems; we encountered them frequently in compiler and programming language work.  The best example of this kind of case from my own work would be the expanded selection/resolution logic in the VM spec in JDK8.  In that case, exercising every spec rule through combinatorial testing produced a prohibitively large space.  Thus, we had to employ enumeration-based methods to explore all of the many possible branch points in the case space and avoid generating redundant instances.

The takeaway is that it is critical to consider the nature of the case space.  If we were to visualize the case space as a kind of surface, then problems that can be described (and tested) via combinatorial methods would look like a relatively simple geometric object, and a combinatorial test would look like a cubical volume.  Thus, it is relatively straightforward to capture large portions of the case space.  A problem like selection/resolution would look more like a highly complex fractal-like structure.  Problems such as these require different methods to achieve reasonable case coverage.

Effectiveness of a Test Suite on an Implementation: Code Coverage

Case coverage is a measure of the quality of a test suite’s design, and is derived from the specification of the thing being tested.  Code coverage addresses a different problem: the effectiveness of a test suite on a particular implementation.  An important point of this is that I do not believe these two metrics to be alternatives for one another.  They measure completely different things, and thus they both must be employed to give a broader view of the test quality picture.

Code coverage is essential because the implementation will likely change more readily than the behavior specification.  Serious gaps in code coverage indicate a problem: either something is wrong with the implementation, or the test suite is missing some portion of the case space.  Coverage gaps can emerge when neither of these is true, but when that happens, the reason should be understood.

Moreover, gaps in code coverage cast doubt on the viability of the code.  The worst example of this comes from my first job, where I once found an entire file full of callbacks that looked like this:

getFoo(Thing* thing) {
  if (thing == NULL) {
    return thing->foo;
  } else {
    return NULL;
  }
}

Note that the null-check is inverted.  Clearly this code had never been run, because there is no way that it could possibly work.  Gaps in code coverage allow cases like this to slip through undetected.

Stability Over Time

As previously discussed, stress testing seeks to test guarantees about the stability of the product.  The most important point about stress-testing is that the properties it tests are not discrete properties: they cannot be stated in terms of a single point in time.  Rather, they are continuous: they are expressed in terms of a continuous interval of time.  This is a key point, and is the reason that stress-testing is essential.  Unit and system testing can only establish discrete properties.  In order to get a sense of things like reliability and performance which are inherently continuous properties, it is necessary to do stress-testing.

A very important point is that this notion also applies to incoming bug reports.  In the OpenJDK project, we generally did not write internal stress-testing suites of the kind I advocate here.  We did, however, have a community of early adopters trying out the bleeding-edge repos constantly throughout the development cycle, which had the effect of stressing the codebase continually.  Whether one considers failures generated by an automated stress test or bugs filed by early adopters, there comes a point in the release cycle where the number of outstanding bugs hits zero (this is sometimes known as the zero-bug build or point).  However, this is not an indicator of readiness to release, because it is only a discrete point in time.  The typical pattern one sees is that the number of bugs hits zero and then immediately goes back up.  The zero-bug point is an indicator that the backlog is cleared out, not that the product is ready for release.  This is because the zero-bug point is a discrete property.  The property we want for a release is a continuous one: namely, that over some interval of time, no bugs were reported or outstanding.

Performance

The issues associated with performance measurement are worthy of a Ph.D. thesis (or five), and thus are well outside the scope of this post.  This section is written more to draw attention to them, and to point out a few of the many ways that performance testing can produce bad results.

Effective performance testing is HARD.  Modern computing hardware is extremely complex, with highly discontinuous, nonlinear performance functions, chaotic behavior, and many unknowns.  The degree to which this can affect performance testing is just starting to come to light, and it has cast doubt on a large number of published results.  For example, it has been shown that altering the link order of a program can affect performance by up to 5%: the typical size of performance gain that is suitable to secure publication in top-level computer architecture conferences.

The following are common problems that affect performance testing:

  • Assuming compositionality: the idea that good performance for isolated components of a system implies that the combined system will perform well.
  • Contrived microbenchmarks (small contrived cases that perform well).  This is a dual of the previous problem, as performing well on isolated parts of a problem instance doesn’t imply you’ll perform well on the combined problem.
  • Cherry-picking
  • Sample sizes that are too small, insufficient randomness in selections, or bad or predictable random generators
  • Failing to account for the impact of system and environmental factors (environment variables, link order, caches, etc)
  • Non-uniform preconditions for tests (failing to clear out caches, etc.)
  • Lack of repeatability

The takeaway from this is that performance testing needs to be treated as a scientific activity, and approached from the same level of discipline that one would apply in a lab setting.  Its results need to be viewed with skepticism until they can be reliably repeated many times, in many different environments.  Failure to do this casts serious doubt on any result the tests produce.

Sadly, this fact is often neglected, even in top-level conferences; however, this is not an excuse to continue to neglect it.
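
As a minimal sketch of what treating a measurement scientifically looks like at the smallest scale, the following runs a workload with warm-up passes and repeated trials and reports variance rather than a single number.  The workload here is a trivial stand-in, and a real harness would also control the environmental factors listed above.

/*
 * Minimal sketch of a disciplined micro-measurement: warm up, run many
 * trials, and report mean and standard deviation rather than a single
 * number.  workload() is a trivial stand-in for the code under test.
 */
#include <math.h>
#include <stdio.h>
#include <time.h>

#define TRIALS 30

static volatile unsigned long sink;

static void
workload(void)
{
    /* Stand-in for the code being measured. */
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        acc += i;
    sink = acc;
}

static double
elapsed_seconds(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int
main(void)
{
    double sum = 0.0, sumsq = 0.0;

    for (int i = 0; i < 5; i++)          /* warm-up runs, discarded */
        workload();

    for (int i = 0; i < TRIALS; i++) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        workload();
        clock_gettime(CLOCK_MONOTONIC, &end);
        double t = elapsed_seconds(start, end);
        sum += t;
        sumsq += t * t;
    }

    double mean = sum / TRIALS;
    double var = sumsq / TRIALS - mean * mean;
    double stddev = sqrt(var > 0.0 ? var : 0.0);
    printf("mean %.6fs stddev %.6fs over %d trials\n", mean, stddev, TRIALS);
    return 0;
}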

Conclusion

In this series, I have described an approach to testing that has its foundations in the scientific method.  I have discussed different views from which tests must be written.  I have described advanced methods for building tests that achieve very high case coverage.  Finally, I have described the principles of how to effectively measure quality, and the many pitfalls that must be avoided.

The single most important takeaway from this series is this:

Effective testing is a difficult multifaceted problem, deserving of serious intellectual effort by dedicated, high-level professionals.

Testing should not consist of mindlessly grinding out single-case tests.  It should employ sophisticated analysis and implementation methods to examine the case space and explore it to a satisfactory degree, to generate effective workloads for stress testing, and to analyze the performance of programs.  These are very difficult tasks; they require the attention of people with advanced skills and should be viewed with the respect that solving problems of this difficulty deserves.

Moreover, within each organization, testing and quality should be seen as an essential part of the development process, and something requiring serious attention and effort.  Time and resources must be budgeted, and large undertakings for the purpose of building testing infrastructure, improving existing tests, and building new tests should be encouraged and rewarded.

Lastly, a culture similar to what we had in the langtools team, where we were constantly looking for ways to improve our testing and quality practices, pays off in a big way.  Effort put into developing high-quality tests, testing frameworks, and testing methods saves tremendous amounts of time and effort in the form of detecting and avoiding bugs, preventing regressions, and making refactoring a much easier process.  We should therefore seek to cultivate this kind of attitude in our own organizations.

How to Test Software, Part 2: Quality of Tests

In the first post in this series, I discussed an overall approach to testing based on the scientific method.  I also discussed the need for multiple views in our testing methodology as well as three important views that our testing regimen should incorporate.  Unit testing is important, as it tests the kind of guarantees that developers rely upon when using components.  System testing is important because it tests software from the same view as the end users.  Finally, stress and performance testing are important as they answer questions about the continuous operation of the system.

However, I only talked about the general approach to writing tests, and the views from which we write them.  I said nothing about the actual quality of the tests, but rather deferred the topic to a later post: namely this one.

Test Quality: Basics

In scientific investigations, experimental quality is of paramount importance.  Bad experiments lead to bad conclusions; thus it is important to design experiments that are sound, repeatable, and which convey enough information to establish the conclusions we draw.  Similar criteria govern how we should write our tests.  Specifically, tests should establish some set of guarantees to a high degree of certainty.  Thus, we must design our tests as we would an experiment: with considerable thought to the credibility of the test in establishing the guarantees we wish to establish.

Many test suites fail miserably when it comes to their experimental methodology.  Among the most common problems are the following:

  • Low Coverage
  • Random or unreliable generation of cases
  • Lack of repeatability/intermittent failures

We want to establish a rigorous testing methodology that consistently produces high-quality, credible tests that test their hypotheses to a high degree of certainty.  We can derive the general guidelines for any test suite from the principles of sound experiments.  The following is a list of these principles:

  • Tests should be consistently repeatable, and should not have intermittent failures
  • Tests should give informative and actionable results when they fail
  • Tests should achieve high coverage, and should automatically generate a large set of cases wherever possible
  • Correctness tests should never have any kind of randomness in their operation
  • Stress and performance tests should minimize entropy from out-of-scope factors

With these general principles in mind, we can look at what specifically makes for quality tests in each of the views we discussed in the previous post in this series.

Unit Test Quality

Unit tests examine guarantees about individual components.  One of their chief advantages over other views is the ability to directly exercise cases and codepaths that may be difficult to trigger in whole-system tests.  As such, case coverage is of paramount importance for writing good unit tests.

A less obvious factor in the quality of a unit test is the set of preconditions under which the tests run.  Very few components have a purely functional specification; most interact with parts of the system in a stateful fashion.  There is often a temptation to write synthetic harnesses that simulate the behavior of the system in a very small number of cases; however, this leads to low-quality tests.  High-quality tests will explore the behavior of the component under a wide variety of preconditions.
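
To make this concrete, here is a minimal sketch of one way to check the same guarantee under several explicitly constructed starting states.  The JUnit 5 parameterized-test mechanism is real, but the “component” (a plain HashMap) and the particular precondition states are stand-ins chosen purely for illustration:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.junit.jupiter.params.ParameterizedTest;
    import org.junit.jupiter.params.provider.MethodSource;

    // Hypothetical unit test: a HashMap stands in for a stateful component.
    // Each precondition puts it into a different starting state before the
    // same guarantee is checked.
    class PutThenGetTest {

        static List<Map<String, Integer>> preconditions() {
            Map<String, Integer> partiallyFilled = new HashMap<>();
            partiallyFilled.put("a", 1);

            Map<String, Integer> manyEntries = new HashMap<>();
            for (int i = 0; i < 64; i++) {
                manyEntries.put("k" + i, i);   // enough entries to force internal resizing
            }

            return List.of(new HashMap<>(), partiallyFilled, manyEntries);
        }

        @ParameterizedTest
        @MethodSource("preconditions")
        void putThenGetReturnsValue(Map<String, Integer> map) {
            map.put("key", 42);
            assertEquals(Integer.valueOf(42), map.get("key"));   // must hold in every state
        }
    }

The same test body runs once per precondition, so extending the set of starting states is a matter of adding an element to the factory method rather than writing a new test.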

In summary, the additional criteria for unit tests are as follows:

  • Explore the case space for components completely
  • Simulate wide varieties of preconditions that affect the behavior of the components

System Test Quality

System tests examine guarantees about the system as a whole.  The purpose of system tests is to test the software from the same point of view as the users who will eventually use it.  The key difficulty with this view is repeatability, particularly for complex systems that interact with things like databases or the network.  For the most complex systems, considerable care must be taken in order to engineer repeatable tests.

Additionally, it is necessary to consider system-specific behaviors like character sets, filesystem conventions, date and time handling, and other platform-specific concerns.

The following are common problems that need to be considered in writing system tests:

  • OS-specific factors (encoding, filesystem behaviors, etc)
  • OS-level preconditions (existing files, environment variables, etc)
  • Interactions with other services (databases, authentication servers, etc)
  • Network issues ((in)accessibility, configurations, changing IPs, etc.)

Stress/Performance Test Quality

Stress tests examine guarantees about stability under certain kinds of load.  Performance tests likewise examine performance under certain kinds of load.  Both of these differ from other kinds of testing in that the properties they examine are about continuous intervals of time as opposed to discrete points.

Both stress and performance tests tend to involve some degree of entropy (stress tests do so deliberately; performance tests do so more out of a need to measure real performance).  This is a key difference from correctness-oriented tests, which should avoid entropy at all costs.  The key to quality testing when entropy is unavoidable is to keep it limited to relevant entropy and isolate the test from irrelevant entropy - that is, maximize the signal and minimize the noise.  In stress testing, we want to measure stability under “characteristic” workloads; thus, it is critical that we generate loads that are statistically similar to a characteristic load, or at the very minimum have statistical properties that we understand.  Additionally, it is important that we don’t accidentally neglect certain aspects of the desired workload.
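
As a rough sketch of what “entropy that we understand” can look like in practice, the generator below draws inter-arrival times from an exponential distribution and request sizes from a heavy-tailed one, with a fixed seed so that runs are reproducible.  The class name, the distributions, and the parameter values are illustrative assumptions rather than a prescription:

    import java.util.Random;

    // Minimal workload generator: inter-arrival times are exponentially
    // distributed (a common model for independent request arrivals) and
    // request sizes are log-normal, i.e. heavy-tailed.  The fixed seed makes
    // the generated workload reproducible from run to run.
    public class WorkloadGenerator {
        private final Random rng = new Random(0xC0FFEE);   // fixed seed
        private final double meanArrivalMillis = 5.0;      // assumed mean gap between requests

        /** Milliseconds until the next request, exponentially distributed. */
        public double nextArrivalDelay() {
            return -meanArrivalMillis * Math.log(1.0 - rng.nextDouble());
        }

        /** Request payload size in bytes, heavy-tailed around roughly 1 KiB. */
        public int nextRequestSize() {
            return (int) Math.round(1024 * Math.exp(0.5 * rng.nextGaussian()));
        }

        public static void main(String[] args) {
            WorkloadGenerator gen = new WorkloadGenerator();
            for (int i = 0; i < 5; i++) {
                System.out.printf("delay=%.2fms size=%dB%n",
                                  gen.nextArrivalDelay(), gen.nextRequestSize());
            }
        }
    }

Because the distributions and the seed are explicit, we can state exactly what statistical properties the workload has, and we can reproduce any run that exposes a stability problem.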

In performance testing, we must also avoid accidental biases in our tests arising from factors like caching.  This may seem simple, but in fact it is much more difficult than a first glance would suggest.  For example, the content of environment variables can significantly affect cache behavior, as can the link order of the application.  The contents of caches, both CPU caches and filesystem and page caches, can likewise have a significant effect on performance and can accidentally bias the tests.  It is important to think carefully about performance tests and all the factors that affect performance in order to avoid these kinds of bias.

The following are important factors for writing stress and performance tests:

  • Ensure that the statistical properties of the synthetic workload accurately reproduce the desired properties
  • Ensure that the space of generated cases does not exclude any cases that we desire to include
  • Ensure that non-obvious biases are eliminated or minimized in performance tests

Coverage

The problem of coverage is central to the problem of correctness testing.  Older literature on testing describes two dual methodologies: blackbox and whitebox (sometimes called glassbox).  The difference between the two can be stated in terms of case coverage and code coverage.  I prefer not to talk about whitebox and blackbox testing, because both case and code coverage are important.  They also don’t represent concepts that can be directly compared.  Code coverage is a measurable quantity, which can and should be determined using coverage tools.  Case coverage, on the other hand, is a conceptual idea, and does not lend itself to direct measurement except in the simplest of cases.

Put another way, case coverage is useful for evaluating the quality of the design of an individual test or the quality of a given test-writing methodology.  We can clearly talk about what kinds of cases a test generates and tests, how many of them are generated, how varied or redundant they are, and we can reason to some extent about how closely they approximate complete testing of the entire case space (which is often infinite).  Thus, case coverage is a measure of the quality of the design of a test.

Code coverage, on the other hand, generally cannot be directly inferred from a given test; rather, it is a measure that is obtained by running the test and collecting and analyzing profiling data after the fact.  It functions as a performance metric, and indicates the adequacy of a test.  Even a very well-designed test suite with good case coverage may leave gaps in the code coverage either because those gaps come from very obscure edge cases, or because for some reason those code paths cannot be exercised by any test case (which can indicate underlying problems in the implementation).  Thus, code coverage is a measure of the adequacy of a test suite.

The remainder of this post will focus on case coverage, and how different test design methodologies achieve different levels of coverage.  I will discuss code coverage in a future post.

Test Design Methodologies

The technical difficulty of designing high-quality tests is often underestimated.  As a consequence, many test suites contain large numbers of low-quality tests.  In one of many discussions about testing during my time working on OpenJDK, I described a system of tiers for testing, focused on the degree to which each tier provides high case coverage and the sorts of problem spaces it is equipped to handle.  This section describes these tiers in detail.

Single-Case Tests

Single-case tests are the most common method for writing tests.  They are also the least effective method, as they achieve extremely low case coverage (and often very low code coverage).  The single-case testing methodology is bad for a number of reasons:

  • It does not scale either in terms of engineering effort or in terms of execution.  Any automated case generation method can achieve coverage levels that would require hundreds of thousands of person-hours with the single-case methodology.
  • There is an inherent bias toward writing simple cases, which tends to result in the tests missing the complex cases.
  • It tends to result in a significant amount of copy-pasting, which leads to errors in the test.
  • It results in an unmaintainable test suite, often with many duplicated tests.

For these and other reasons, single-case tests were very strongly discouraged in the langtools group, and would usually fail code review without some justification.

Template Tests

Template tests are a method for quickly generating and testing large numbers of very similar cases.  With template testing, we create a template which constructs a test case from a parameter value.  This template is then applied to a range of parameter values, generating and testing the various cases.
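
A minimal sketch of the shape of such a test follows.  The property being checked (integer parsing round-trips through its string form) is merely a stand-in for whatever the template would really exercise:

    import java.util.stream.IntStream;

    // Template testing in miniature: one template turns a parameter value
    // into a test case (input plus expected result), and the driver applies
    // it across a range of parameter values.
    public class ParseRoundTripTemplateTest {

        // The "template": build a case from the parameter and check it.
        static void runCase(int value) {
            String input = Integer.toString(value);   // construct the case
            int parsed = Integer.parseInt(input);     // exercise the code under test
            if (parsed != value) {
                throw new AssertionError("round-trip failed for " + value);
            }
        }

        public static void main(String[] args) {
            // Apply the template to every parameter value in the range.
            IntStream.rangeClosed(-10_000, 10_000)
                     .forEach(ParseRoundTripTemplateTest::runCase);
            System.out.println("all cases passed");
        }
    }

Twenty thousand cases cost roughly the same engineering effort as a single handwritten case, which is the whole appeal of the method.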

This method was frequently employed in the javac test suite to test relatively simple problems that we encountered.  It is most effective for problems with a relatively “flat” structure; combinatorial testing is often required for more complex problem spaces.

A common variation on this style of testing was to create a “little language”, which describes a test case in a very concise format.  This was used to test bridge method generation in Lambda for JDK8 (this test suite is now part of the OpenJDK repository).

Combinatorial Tests

Combinatorial tests, or “combotests”, were the most common methodology used by the javac team as our testing practices matured.  Combotests work similarly to template tests, except that they have multiple parameters.  The test has a range of possible inputs for each parameter and runs on every possible combination of those inputs.
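
The sketch below shows the combinatorial pattern in miniature.  The dimensions (modifiers, types, names) and the trivial check are invented for illustration; a real combotest would typically compile or execute each generated case and compare it against an independently computed expected result:

    import java.util.List;

    // Combinatorial testing in miniature: every combination of the parameter
    // dimensions is generated and checked.
    public class CombinatorialSketch {

        static final List<String> MODIFIERS = List.of("", "static", "final");
        static final List<String> TYPES     = List.of("int", "long", "String");
        static final List<String> NAMES     = List.of("x", "value");

        public static void main(String[] args) {
            int cases = 0;
            for (String mod : MODIFIERS) {
                for (String type : TYPES) {
                    for (String name : NAMES) {
                        String decl = (mod + " " + type + " " + name + ";").trim();
                        // Placeholder check; a real test would verify behavior here.
                        if (decl.isEmpty()) {
                            throw new AssertionError("empty case generated");
                        }
                        cases++;
                    }
                }
            }
            System.out.println("generated and checked " + cases + " cases");
        }
    }

Adding a new dimension, or a new value to an existing dimension, multiplies the number of generated cases while costing only a line or two of code.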

Combinatorial tests achieve a very high level of coverage for many problems, and can generate and test tens of thousands of problem instances in an efficient manner.  This methodology is sufficient for many problems.  Only the most complex problems require the more advanced method of enumeration testing.

For nearly all testing problems, combinatorial tests represent the “sweet spot” of the diminishing returns curve.  They achieve high coverage, but are relatively easy to implement.  For this reason, combinatorial testing should be the preferred method of writing tests.

Enumeration Testing

Combinatorial testing is a powerful method, but it is predicated on the idea that a problem can be expressed in terms of a small set of independent dimensions, where each combination of dimension values is a unique problem instance whose expected result can be easily determined.  It breaks down in the presence of certain conditions, including the following:

  • When it is difficult to determine the expected result from a problem instance without re-implementing the thing being tested
  • When the problem has a highly recursive structure to its specification
  • When there is a complex equivalence class structure among the problem instances

When these conditions are in effect, combinatorial testing fails either because it does not explore enough of the problem space, or because it must explore a prohibitively large space in order to achieve reasonable case coverage.

Examples of where these kinds of conditions manifest include type systems, symbol resolution in the presence of inherited and nested scopes, and dispatch logic in the implementation of object-oriented languages.  In all these cases, we see the features I listed above.  As I work on compilers quite a bit, I encounter these kinds of problems frequently; thus I have moved over time to using enumeration testing methods to deal with them.

Enumeration testing is based on the notion of proof trees in logic: each rule in a specification or a type system implies something about the test cases that exercise it.  For example, in symbol resolution in Java, there is a rule which states that if a class does not define the desired symbol, then we recursively search for the symbol in its superclass.  This implies that we need (at least) two test cases for this rule: one in which a class defines a symbol, and one in which it does not.

Enumeration testing creates a builder for test cases, and a set of “templates” which potentially operate on a builder to add data to the test case.  We then use tree-enumeration to explore all possible cases out to a certain depth.  In essence, we turn the testing problem into a branching search problem.
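
A heavily condensed sketch of that structure follows (it assumes Java 16+ for the record syntax).  A builder holds a partial test case, each template extends it, and a recursive enumerator explores every sequence of template applications out to a fixed depth.  The templates and their labels are invented placeholders; a real enumeration test would also need an oracle that computes the expected result for each generated case:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.UnaryOperator;

    // Enumeration testing in miniature: test generation becomes a branching
    // search over sequences of template applications.
    public class EnumerationSketch {

        // A trivially simple builder: the test case is just a list of
        // declarations describing a class hierarchy.
        record CaseBuilder(List<String> decls) {
            CaseBuilder add(String decl) {
                List<String> next = new ArrayList<>(decls);
                next.add(decl);
                return new CaseBuilder(next);
            }
        }

        // One template per rule we want exercised.
        static final List<UnaryOperator<CaseBuilder>> TEMPLATES = List.of(
            b -> b.add("class defines the symbol"),
            b -> b.add("class inherits the symbol from its superclass"),
            b -> b.add("nested scope shadows the symbol")
        );

        static void enumerate(CaseBuilder partial, int depth, List<CaseBuilder> out) {
            if (depth == 0) {
                out.add(partial);     // a complete case: build, run, and check it here
                return;
            }
            for (UnaryOperator<CaseBuilder> template : TEMPLATES) {
                enumerate(template.apply(partial), depth - 1, out);
            }
        }

        public static void main(String[] args) {
            List<CaseBuilder> cases = new ArrayList<>();
            enumerate(new CaseBuilder(List.of()), 3, cases);   // 3 templates, depth 3: 27 cases
            System.out.println("enumerated " + cases.size() + " cases");
        }
    }

The depth bound keeps the search finite; raising it explores deeper combinations of rules at the cost of exponentially more cases.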

In summary, enumeration testing is an advanced method which is difficult to implement.  However, it is the only method able to adequately address the hardest testing problems.

Conclusion

A common misconception about testing is that it is an inherently simple task.  In reality, writing high-quality tests is technically challenging, and achieving very high quality requires knowledge of advanced programming techniques and computer science theory.  In this post, I have discussed the general principles of high-quality testing, the quality criteria specific to each testing view, and a series of increasingly advanced methodologies for writing tests.  The final piece of the testing picture deals with measurements and metrics, which I will discuss in the next post.