Librem 13 FreeBSD Port

When the Librem laptops were announced last year, I was quite excited and I ordered both the 15 and 13-inch models.  My 13-inch model arrived last week, and I have begun the process of porting FreeBSD to it.

I have to say, I am very excited to finally have a laptop from a fully-cooperative manufacturer, where I can get my hands on all the hardware specs and possibly even upstream fixes.  This is a very welcome boon after a decade of having to deal with flaky BIOS issues, black-box hardware, and other difficulties.

The Laptop

The physical laptop itself is very solid and rather light.  It doesn’t creak, and the lid stays put even better than a MacBook’s.  My only complaints are that the camera/microphone and wireless kill-switches are unlabeled, and that Ethernet cables tend to fall out of the drop-down port.  Aside from those minor issues, I’m quite pleased with the physical unit.


It’s hard to see the kill-switches in the photo below, but they are on the hinge under the screen.

My only other regret is that the dvorak keyboard option became available after I’d ordered mine.  Oh well; maybe I can sweet-talk them into swapping it for me at a conference 😉

It was also very nice to unpack a laptop without implicitly accepting a Microsoft license agreement by opening the box!

BIOS and FreeBSD Installation

The first thing I do when I get a new laptop is poke around in the BIOS menu (no photos yet).  The Librem has a coreboot port, but I decided to get FreeBSD installed and check the system out a bit before diving into the art of flashing my BIOS, so I was looking at the proprietary American Megatrends BIOS menu.  Even so, I was pleased by the features it presented, most notably the ability to set up custom signing keys.  I am going to have to do some work on a signed FreeBSD boot and loader chain.

My FreeBSD installation went off without any serious issues.  I installed FreeBSD 11 from a bootable memstick image, setting up a pure-ZFS system.  I had ordered a 1TB spindle drive and a 250GB SSD.  I reserved 48GB of the SSD for swap (the machine has 64GB of memory).  I then set up a ZFS pool with the spindle drive as the main storage, a 16GB intent log on the SSD, and the rest of the SSD as an L2ARC cache device.  (I will eventually set up the ZFS volume to make all writes synchronous, so as to really use the intent log.)  I realize some might consider ZFS on a laptop to be overkill; however, I have found it to be an extremely versatile and stable filesystem.  It is incredibly crash-resistant and corruption-resistant, and its snapshotting is invaluable for risky updates.  The transparent compression features are useful as well, and can effectively increase your available space by a sizable amount.  Lastly, I have used the ability to serialize and deserialize the entire filesystem more than once.

I did encounter one issue in this process: a sporadic boot hang with USB timeouts, which I now strongly suspect to be a timing bug in the FreeBSD boot process (more on that below).

FreeBSD did handle the hardware kill-switches rather well (I’ve heard reports of the Linux kernel panicking when they are flipped).  Flipping them off causes some kernel messages about timeouts, but the bus re-initializes upon flipping them back on.  If you boot with them off, then flip them on, the kernel detects the hardware properly.

FreeBSD Setup

The first thing I do on a new FreeBSD system is grab the source tree and build world, followed by kernel customization.  I noticed that building Clang has gotten pretty slow these days (which doesn’t bother me too much; I’d rather the compiler have a lot of optimization machinery than not).

After that, I grabbed the latest ports tree and started building the usual suspects to test the system (also, to get to where I could test X11).  I also grabbed Jean-Sébastien’s Intel graphics patch to see if that driver worked with the Broadwell card.  Sadly, it didn’t.

Working Hardware

Most of the hardware Just Works™, which is nice.  I was particularly pleased that all the fn-key combinations work out-of-the-box.  I have never seen that happen with any other vendor.

The following is a list of the working hardware:

  • The EFI boot/loader
  • SD card reader (mmc driver)
  • Realtek Ethernet (re driver)
  • System management bus and CPU frequency/temperature (smb, smbus, ichsmb, coretemp, cpufreq drivers)
  • Intel High-Def Audio (snd_hda driver), though I haven’t tested the microphone yet.  Also, plugging into the headphone jack properly switches to headphones from the speakers (I’ve seen that not work).
  • Hard Drive and SSD (obviously)
  • USB ports
  • Bluetooth

Unfortunately, the Intel accelerated graphics drivers don’t support the Broadwell cards.  This will come eventually, but FreeBSD is in the midst of a graphics framework overhaul to better track the Linux drivers.  Looks like it’s going to be VESA for now.

Current Issues

There are currently some issues, which I will be working to fix:

  • The Atheros AR9462 wireless card is detected, but the radio doesn’t seem to be working.  The pciconf tool reports a few errors, and scans seem to run, but they don’t pick up anything.  I have confirmed this is not a hardware issue by booting with a Kali Linux memstick.
  • Blank screen on resume.  My initial investigations reveal some ACPI execution errors during resume, which may be related.  I need to get up in the kernel source and add some logging to see what’s going on.
  • VESA weirdness with X11.  The VESA X driver mostly works, but if you switch back to the terminal, a couple of pixels around the border of the screen stay the way they looked in X.  Also, when you shut down X, the screen freezes and the logs indicate some kind of timeout.  Both of these seem to implicate the VGA BIOS.
  • Sporadic boot-hang and USB timeouts.  These seem to be specific to a kernel configuration, and go away when changing the verbosity level.  This strongly indicates a timing-related bug in the kernel initialization procedures.

Of these issues, the wireless card and blank screen are the most critical, followed by the X11 weirdness.  I will be in contact with the Librem developers should my initial attempts to fix these issues prove unsuccessful.

Following that, I want to see if there’s a way to make the kill-switches behave more gracefully.  Ideally, the USB driver could be taught to treat those devices as hot-pluggable, or else to assume that timeouts are disconnects.

In any case, stay tuned for updates…


The Complex Nature of the Security Problem

This article is an elaboration on ideas I originally developed in a post to the project blog for my pet programming language project here.  The ideas remain as valid (if not more so) now as they were eight months ago when I wrote the original piece.

The year 2015 saw a great deal of publicity surrounding a number of high-profile computer security incidents.  While this trend has been ongoing for some time now, the past year marked a point at which the problem entered the public consciousness to the point where it has become a national news item and is likely to be a key issue in the coming elections and beyond.

“The Security Problem” as I have taken to calling it is not a simple issue and it does not have a simple solution.  It is a complex, multi-faceted problem with a number of root causes, and it cannot be solved without adequately addressing each of those causes in turn.  It is also a crucial issue that must be solved in order for technological civilization to continue its forward progress and not slip into stagnation or regression.  If there is a single message I would want to convey on the subject, it is this: the security problem can only be adequately addressed by a multitude of different approaches working in concert, each addressing an aspect of the problem.

Trust: The Critical Element

In late September, I did a “ride-along” of a training program for newly-hired security consultants.  Just before leaving, I spoke briefly to the group, encouraging them to reach out to us and collaborate.  My final words, however, were broader in scope: “I think every era in history has its critical problems that civilization has to solve in order to keep moving forward, and I think the security problem is one of those problems for our era.”

Why is this problem so important, and why would its existence have the potential to block forward progress?  The answer is trust.  Specifically, the ability to trust people about whom we know almost nothing, and whom we may indeed never meet, is arguably the critical element that allows civilization to exist at all.  Consider what might happen, for example, if that kind of trust did not exist: we would be unable to create and sustain basic institutions such as governments, hospitals, markets, banks, and public transportation.

Technological civilization requires a much higher degree of trust.  Consider, for example, the amount of trust that goes into using something as simple as checking your bank account on your phone.  At a very cursory inspection, you trust the developers who wrote the app that allows you to access your account, the designers of the phone, the hardware manufacturers, the wireless carrier and their backbone providers, the bank’s server software and their system administrators, the third-party vendors that supplied the operating system and database software, the scientists who designed the crypto protecting your transactions and the standards organizations who codified it, the vendors who supplied the networking hardware, and this is just a small portion.  You quite literally trust thousands of technologies and millions of people that you will almost certainly never meet, just to do the simplest of tasks.

The benefits of this kind of trust are clear: the global internet and the growth of computing devices have dramatically increased efficiency and productivity in almost every aspect of life.  However, this trust was not automatic.  It took a long time and a great deal of effort to build.  Moreover, this kind of trust can be lost.  One of the major hurdles for the development of electronic commerce, for example, was the perception that online transactions were inherently insecure.

This kind of progress is not permanent, however; if our technological foundations prove themselves unworthy of this level of trust, then we can expect to see stymied progress or, in the worst case, regression.

The Many Aspects of the Security Problem

As with most problems of this scope and nature, the security problem does not have a single root cause.  It is the product of many complex issues interacting, and therefore its solution will necessarily involve committed efforts on multiple fronts and multiple complementary approaches to address the issues.  There is no simple cause, and no “magic bullet” solution.

The contributing factors to the security problem range from highly technical (with many aspects in that domain), to logistical, to policy issues, to educational and social.  In fact, a complete characterization of the problem could very well be the subject of a graduate thesis; the exposition I give here is therefore only intended as a brief survey of the broad areas.

Technological Factors

As the security problem concerns computer security (I have dutifully avoided gratuitous use of the phrase “cyber”), it comes as no surprise that many of the contributing factors to the problem are technological in nature.  However, even within the scope of technological factors, we see a wide variety of specific issues.

Risky Languages, Tools, and APIs

Inherently dangerous or risky programming language or API features are one of the most common factors that contribute to vulnerabilities.  Languages that lack memory safety can lead to buffer overruns and other such errors (which are among the most common exploits in systems), and untyped languages admit a much larger class of errors, many of which lead to vulnerabilities like injection attacks.  Additionally, many APIs are improperly designed and lead to vulnerabilities, or are designed in such a way that safe use is needlessly difficult.  Lastly, many tools can be difficult to use in a secure manner.
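
To make the point concrete, here is a minimal Haskell sketch of an API that is safe by construction (my own illustration, not something from the original post): the Query constructor is hidden, so the only way to build a query is through a template whose parameters are escaped, and the risky operation of splicing untrusted strings directly into query text is simply not expressible from outside the module.

module SafeQuery (Query, renderQuery, fromTemplate) where

-- The Query constructor is not exported, so callers cannot assemble a
-- query by concatenating untrusted strings.
newtype Query = Query { renderQuery :: String }

-- Untrusted values are escaped and substituted for '?' placeholders;
-- 'escape' stands in for a real escaping or parameter-binding routine.
fromTemplate :: String -> [String] -> Query
fromTemplate template params = Query (substitute template (map escape params))
  where
    escape = concatMap (\c -> if c == '\'' then "''" else [c])
    substitute ('?':rest) (p:ps) = "'" ++ p ++ "'" ++ substitute rest ps
    substitute (c:rest)   ps     = c : substitute rest ps
    substitute []         _      = []

A real parameterized-query API would hand the parameters to the database driver rather than escaping them in-process, but the type-level point is the same: the unsafe path is not representable.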

We have made some headway in this area.  Many modern frameworks are designed in such a way that they are “safe by default”, requiring no special configuration to satisfy many safety concerns and requiring the necessary configuration to address the others.  Programming language research over the past 30 years has produced many advanced type systems that can make stronger guarantees, and we are starting to see these enter common use through languages like Rust.  My current employer, Codiscope, is working to bring advanced program analysis research into the static analysis space.  Initiatives like the NSF DeepSpec expedition are working to develop practical software verification methods.

However, we still have a way to go here.  No mature engineering discipline relies solely on testing: civil engineering, for example, accurately predicts the tolerances of a bridge long before it is built.  Software engineering has yet to develop methods with this level of sophistication.

Configuration Management

Modern systems involve a dizzying array of configuration options.  In multi-level architectures, there are many different components interacting in order to implement each bit of functionality, and all of these need to be configured properly in order to operate securely.

Misconfigurations are a very frequent cause of vulnerabilities.  Enterprise software components can have hundreds of configuration options per component, and we often string dozens of components together.  In this environment, it becomes very easy to miss a configuration option or accidentally fail to account for a particular case.  The fact that there are so many possible configurations, most of which are invalid, further exacerbates the problem.

Crypto has also tended to suffer from usability problems.  Crypto is particularly sensitive to misconfigurations: a single weak link undermines the security of the entire system.  However, it can be quite difficult to develop and maintain hardened crypto configurations over time, even for the technologically adept.  The difficulty of setting up software like GPG for non-technical users has been the subject of actual research papers.  I can personally attest to this as well, having guided multiple non-technical people through the setup.

This problem can be addressed, however.  Configuration management tools allow configurations to be set up from a central location and managed automatically by various services (CFEngine, Puppet, Chef, Ansible, etc.).  Looking farther afield, we can begin to imagine tools that construct configurations for each component from a master configuration, and that apply type-like notions to the task of identifying invalid configurations.  These suggestions are just the beginning; configuration management is a serious technical challenge, and it can and should be the focus of serious technical work.
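
To illustrate the kind of “type-like notions” I have in mind, here is a small Haskell sketch (an invented example, not an existing tool): the configuration is represented so that an invalid combination, such as TLS enabled with no certificate, cannot even be written down, and per-component settings are derived from a single master record so that they cannot drift out of sync.

data TLSVersion = TLS12 | TLS13
  deriving (Show, Eq, Ord)

-- A listener is either explicitly plaintext, or TLS with a certificate
-- and a minimum protocol version; "TLS on, but no certificate" is not
-- representable.
data Listener
  = PlainListener { port :: Int }
  | TLSListener   { port            :: Int
                  , certificatePath :: FilePath
                  , minimumVersion  :: TLSVersion
                  }
  deriving Show

-- One master configuration from which component configurations are
-- constructed, rather than dozens of hand-maintained copies.
data MasterConfig = MasterConfig
  { publicPort :: Int
  , certPath   :: FilePath
  } deriving Show

webListener :: MasterConfig -> Listener
webListener m = TLSListener (publicPort m) (certPath m) TLS13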

Legacy Systems

Legacy systems have long been a source of pain for technologists.  They represent a kind of debt that is often too expensive to pay off in full, but which exacts a recurring tax on resources in the form of legacy costs (compatibility issues, bad performance, blocking upgrades, unusable systems, and so on).  To those most directly involved in the development of technology, legacy systems tend to be a source of chronic pain; however, from the standpoint of budgets and limited resources, they are often a kind of pain to be managed rather than cured, as wholesale replacement is far too expensive and risky to consider.

In the context of security, however, the picture is often different.  These kinds of systems are often extremely vulnerable, having been designed in a time when networked systems were rare or nonexistent.  In this context, they are more akin to rotten timbers at the core of a building.  Yes, they are expensive and time-consuming to replace, but the risk of not replacing them is far worse.

The real danger is that the infrastructure where vulnerable legacy systems are most prevalent (power grids, industrial facilities, mass transit, and the like) is precisely the sort of system where a breach can do catastrophic damage.  We have already seen an example of this in the real world: the Stuxnet malware was employed to destroy uranium-enrichment centrifuges.

Replacing these legacy systems with more secure implementations is a long and expensive proposition, and doing it in a way that minimizes costs is a very challenging technological problem.  However, this is not a problem that can be neglected.

Cultural and Policy Factors

Though computer security is technological in nature, its causes and solutions are not limited solely to technological issues.  Policy, cultural, and educational factors also affect the problem, and must be a part of the solution.


Policy

The most obvious non-technical influence on the security problem is policy.  The various policy debates that have sprung up in the past years are evidence of this; however, the problem goes much deeper than these debates.

For starters, we are currently in the midst of a number of policy debates regarding strong encryption and how we as a society deal with the fact that such a technology exists.  I make my stance on the matter quite clear: I am an unwavering advocate of unescrowed, uncompromised strong encryption as a fundamental right (yes, there are possible abuses of the technology, but the same is true of such things as due process and freedom of speech).  Despite my hard-line pro-crypto stance, I can understand how those who don’t understand the technology might find the opposing position compelling.  Things like golden keys and abuse-proof backdoors certainly sound nice.  However, the real effect of pursuing such policies would be to fundamentally compromise systems and infrastructure within the US and turn defending against data breaches and cyberattacks into an impossible problem.  In the long run, this erodes the kind of trust in technological infrastructure of which I spoke earlier and bars forward progress, leaving us to be outclassed in the international marketplace.

In a broader context, we face a problem here that requires rethinking our policy process.  We have in the security problem a complex technological issue (too complex for even the most astute and deliberative legislator to develop true expertise on the subject through part-time study), but one where the effects of uninformed policy can be disastrous.  In the context of public debate, it does not lend itself to two-sided thinking or simple solutions, and attempting to force it into such a model loses too much information to be effective.

Additionally, the problem goes deeper than issues like encryption, backdoors, and dragnet surveillance.  As I mentioned earlier, much of the US infrastructure runs on vulnerable legacy systems, and replacing these systems with more secure, modern software is an expensive and time-consuming task.  Moreover, the need to invest in our infrastructure in this way barely registers in public debate, if at all.  However, doing so is essential to fixing one of the most significant sources of vulnerabilities.


Education

Education, or the lack thereof, also plays a key role in the security problem.  Even top-level computer science curricula fail to teach students how to think securely and develop secure applications, or even to impress upon students the importance of doing so.  This is understandable: even a decade ago, the threat level to most applications was nowhere near where it is today.  The world has changed dramatically in this regard in a rather short span of time.  The proliferation of mobile devices and connectedness, combined with a tremendous upturn in the number and sophistication of attacks launched against systems, has led to a very different sort of environment than what existed even ten years ago (when I was finishing my undergraduate education).

College curricula are necessarily conservative; knowledge is expected to prove its worth and go through a process of refinement and sanding off of rough edges before it reaches the point where it can be taught in an undergraduate curriculum.  By contrast, much of the knowledge of how to avoid building vulnerable systems is new, volatile, and thorny: not the sort of thing traditional academia likes to mix into a curriculum, especially in a mandatory course.

Such a change is necessary, however, and this means that educational institutions must develop new processes for effectively educating people about topics such as these.


Culture

While it is critical to have infrastructure and systems built on sound technological approaches, it is also true that a significant number of successful attacks on large enterprises and individuals alike make primary use of human factors and social engineering.  This is exacerbated by the fact that we, culturally speaking, are quite naive about security.  There are security-conscious individuals, of course, but most people are naive to the point that an attacker can typically rely on social engineering with a high success rate in all but the most secure of settings.

Moreover, this naivety affects everything else, ranging from policy decisions to what priorities are deemed most important in product development.  The lack of public understanding of computer security allows bad policy such as backdoors to be taken seriously, and allows insecure and invasive products to thrive on marketing claims that simply don’t reflect reality (SnapChat remains one of the worst offenders in this regard, in my opinion).

The root cause behind this is that cultures adapt even more slowly than the other factors I’ve mentioned, and our culture has yet to develop effective ways of thinking about these issues.  But cultures do adapt; we all remember sayings like “look both ways” and “stop, drop, and roll” from our childhood, both of which teach simple but effective ways of managing the more basic risks that arise in a technological society.  This sort of adaptation also responds to need.  During my own youth and adolescence, the danger of HIV drove a number of significant cultural changes in a relatively short period of time that proved effective in curbing the epidemic.  While the issues surrounding the security problem represent a very different sort of danger, they are still pressing issues that require an amount of cultural adaptation to address.  A key step in addressing the cultural aspects of the security problem comes down to developing similar kinds of cultural understanding and awareness, and promoting behavior changes that help reduce risk.


Conclusion

I have presented only a portion of the issues that make up what I call the “computer security problem”.  These issues are varied, ranging from deep technological issues obviously focused on security to cultural and policy issues.  There is not one single root cause to the problem, and as a result, there is no one single “silver bullet” that can solve it.

Moreover, if the problem is this varied and complex, then we can expect the solutions to each aspect of the problem to likewise require multiple different approaches coming from different angles and reflecting different ways of thinking.  My own work, for example, focuses on the language and tooling issue, coming mostly from the direction of building tools to write better software.  However, there are other approaches to this same problem, such as sandboxing and changing the fundamental execution model.  All of these angles deserve consideration, and the eventual resolution to that part of the security problem will likely incorporate developments from each angle of approach.

If there is a final takeaway from this, it is that the problem is large and complex enough that it cannot be solved by the efforts or approach of a single person or team.  It is a monumental challenge requiring the combined tireless efforts of a generation’s worth of minds and at least a generation’s worth of time.

Distributed Package and Trust Management

I presented a lightning talk at last night’s Boston Haskell meetup on an idea I’ve been working on for some time now, concerning features for a distributed package and trust manager system.  I had previously written an internal blog post on this matter, which I am now publishing here.

Package Management Background

Anyone who has used or written open-source software in a modern language is familiar with the idea of package managers.  Nearly all modern languages provide some kind of package management facility.  Haskell has Hackage, Ruby has RubyGems, Rust has Cargo, and so on.  These package managers allow users to quickly and easily install packages from a central repository, and they provide a way for developers to publish new packages.  While this sort of system is a step up from the older method of manually fetching and installing libraries that is necessary in languages like C and Java, most implementations are limited to the use-case of open-source development for applications without high security, trust, and auditing requirements.

These systems were never designed for industrial and high-trust applications, so there are some key shortcomings for those uses:

  • No Organizational Repositories: The use of a central package repository is handy, but it fails to address the use case of an organization wanting to set up their own internal package repository.
  • Lack of Support for Closed-Source Packages: Package systems usually work by distributing source.  If you can’t push your packages up to the world, then you default back to the manual installation model.
  • Inconsistent Quality: The central repository tends to accumulate a lot of junk: low-quality, half-finished, or abandoned packages, or as my former colleague John Rose once said, “a shanty-town of bikesheds”.
  • No Verifiable Certification/Accountability: In most of these package systems, there is very little in the way of an accountability or certification system.  Some systems provide a voting or review system, and all of them provide author attribution, but this is insufficient for organizations that want to know about things like certified releases and builds.

Distributed Package Management

There has been some ongoing work in the Haskell community to build a more advanced package management library called Skete (pronounced “skeet”).  The model used for this library is a distributed one that functions more like Git (in fact, it uses Git as a backend).  This allows organizations to create their own internal repositories that receive updates from a central repository and can host internal-only projects as well.  Alec Heller, whom I know through the Haskell community, is one of the developers on the project.  He gave a talk about it at the Haskell meetup back in May (note: the library has progressed quite a bit since then), which you can find here.

This work is interesting, because it solves a lot of the problems with the current central repository package systems.  With a little engineering effort, the following can be accomplished:

  • Ability to maintain internal package repositories that receive updates from a master, but also contain internal-only packages
  • Ability to publish binary-only distributions up to the public repositories, but keep the source distributions internal
  • Option to publish packages directly through git push rather than a web interface
  • Ability to create “labels” which essentially amount to package sets.

This is definitely an improvement on existing package management technologies, and can serve as a basis for building an even better system.  With this in hand, we can think about building a system for accountability and certification.

Building in Accountability and Certification

My main side project is a dependently-typed systems language.  In such a language, we are able to prove facts about a program, as its type system includes a logic for doing so.  This provides much stronger guarantees about the quality of a program; however, publishing the source code, proof obligations, and proof scripts may not always be feasible for a number of reasons (most significantly, they likely provide enough information to reverse-compile the program).  The next best thing is to establish a system of accountability and certification that allows various entities to certify that the proof scripts succeed.  This would be built atop a foundation that uses strong crypto to create unforgeable certificates, issued by the entities that check the code.

This same use case also works for the kinds of security audits done by security consulting firms in the modern world.  These firms conduct security audits on applications, applying a number of methods such as penetration testing, code analysis, and threat modeling to identify flaws and recommend fixes.

This brings us at last to the idea that’s been growing in my head: what if we had a distributed package management system (like Skete) that also included a certification system, so that users could check whether or not a particular entity has granted a particular certification to a particular package?  Specific use cases might look like this:

  • When I create a version of a package, I create a certification that it was authored by me.
  • A third-party entity might conduct an audit of the source code, then certify the binary artifacts of a particular source branch.  This would be pushed upstream to the public package repository along with the binaries, but the source would remain closed.
  • Such an entity could also certify an open-source package.
  • A public CI system could pick up on changes pushed to a package repository (public or private) and run tests/scans, certifying the package if they succeed.
  • A mechanism similar to a block-chain could be used to allow entities to update their certifications of a package (or revoke them).
  • Negative properties (like known vulnerabilities, deprecation, etc) could also be asserted through this mechanism (this would require additional engineering to prevent package owners from deleting certifications about their packages).
  • Users can require that certain certifications exist for all packages they install (or conversely, that certain properties are not true).

This would be fairly straightforward to implement using the Skete library:

  • Every package has a descriptor, which includes information about the package, a UUID, and hashes for all the actual data.
  • The package repositories essentially double as a CA, and manage granting/revocation of keys using the package manager as a distribution system.  Keys are granted to any package author, and any entity which wishes to certify packages.
  • Packages include a set of signed records, which include a description of the properties being assigned to the package along with a hash of the package’s descriptor.  These records can be organized as a block-chain to allow organizations to provide updates at a later date.
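
To make this a bit more concrete, here is roughly how I picture the data model in Haskell.  These types are my own illustration of the idea, not the actual Skete API, and signature verification is elided.

import qualified Data.ByteString as BS

type Hash      = BS.ByteString   -- e.g. a SHA-256 digest
type Signature = BS.ByteString
type KeyId     = BS.ByteString

-- Every package version has a descriptor: metadata, a UUID, and hashes
-- of the actual artifacts.
data Descriptor = Descriptor
  { pkgName        :: String
  , pkgUUID        :: String
  , artifactHashes :: [Hash]
  }

-- A signed record asserts a property of a specific descriptor.  Each
-- record can reference its predecessor, chaining an entity's
-- certifications so that updates and revocations are appended rather
-- than silently rewritten.
data CertRecord = CertRecord
  { property       :: String      -- e.g. "security-audit-passed"
  , descriptorHash :: Hash
  , previousRecord :: Maybe Hash
  , certifier      :: KeyId
  , recordSig      :: Signature
  }

-- An installation policy: every required (entity, property) pair must be
-- asserted by some record for the package being installed.
satisfies :: [CertRecord] -> [(KeyId, String)] -> Bool
satisfies records required =
  all (\(key, prop) ->
        any (\r -> certifier r == key && property r == prop) records)
      required

A real implementation would verify each record’s signature against the certifier’s key (and walk the record chain for revocations) before counting it toward the policy.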

Implementation Plans

After I gave my brief talk about this idea, I had a discussion with one of the Skete developers about the possibility of rolling these ideas up into that project.  Based on that discussion, it all seems feasible, and hopefully a system that works this way will be coming to life in the not-too-distant future.

ZFS Support for EFI Now in FreeBSD

Sometime last year, I started working on a patch to add ZFS support to the UEFI boot facilities for FreeBSD.

Backstory: I’ve been a FreeBSD fan and user since my junior year of undergrad (2003), and I run it as my OS whenever I can.  I first started looking into UEFI support as a GSoC project.  Unfortunately, I had to drop the project due to a combination of a sudden job search and my grandfather’s brain cancer diagnosis.

Fast forward a few years, and I circled back to see what remained to be done on the UEFI front.  The boot process was there, but only for UFS.  So over the holidays, I started poking around to see what could be done.

I started out by refactoring boot1 (the program that resides in the EFI partition and pulls loader off the main filesystem and runs it), putting it into a more modular form to support multiple filesystem modules.  I then started writing the ZFS module.  I hit a tipping point some time in April, and got it working completely shortly thereafter.

The next task was loader itself.  This proved trickier, but I eventually figured out what needed to be done.  To my delight, the modified loader worked fine with the GRUB2 bootloader as well as FreeBSD’s boot1.

For most of the rest of the year, it was passed around and used by various people, and it was picked up by NextBSD and PCBSD.  It entered the formal review process in late autumn, and several people contributed changes that helped out immensely in the integration effort.  In particular, several people addressed stylistic issues (I am not terribly familiar with FreeBSD’s style guide) and integrated libstand support (which I had thought to be a problem due to the need for Win32 ABI binaries in EFI).

I was informed on the way home from the gym that it’s been committed to HEAD, and will hopefully make it into 10.3.  I’m glad to see it now officially in FreeBSD, and I’m grateful to the people who helped out with the integration.

I have future plans in this arena, too.  I deliberately modularized the boot1 program in preparation for some other efforts.  First, I plan to look into adding GELI (the full-disk encryption mechanism for FreeBSD) support.  I would also like to see support for checking cryptographic signatures of loader and kernel at boot-time (I’ve heard others are working on something like that).  In the very long run, I’d like to see a completely trusted custody chain from boot to kernel, but that is something that will take multiple steps to realize.

Boston-Area PL/Type Theory

Last night saw the first meeting of the Boston-Area PL/Type Theory group that I put together on (link).  This was an initial meet-and-greet and organizing meeting, intended to serve as a brainstorming session for what to do next.

I’m pleased with the outcome of this meeting.  We were joined by a number of folks from the Boston Haskell community as well as Adam Chlipala of MIT.  Adam suggested that we use space in the MIT computer science department for our events, which seems to be the most advantageous option for several reasons.

We also had a productive discussion about the mission of the group, in particular how to deal with the fact that we will have a rather wide variation in the level of knowledge among members.  The idea came forward that we have different “tracks” of events geared towards different experience levels and goals.  Three distinct tracks emerged from the discussion:

  • Beginners: Featuring events like introductory lectures and group dial-ins to the internet type theory group’s sessions
  • Experienced: Featuring events like a reading group and discussions of and/or lectures on advanced topics
  • “Do Stuff”: Geared towards active work on research and projects, featuring unconference-style events and specific project groups

Some first steps emerged as well.  We decided to have an initial unconference/hackathon (on the “do stuff” track) at some point in February.  We also decided to set up a GitHub group for maintaining the group page, as well as any other projects that happen.  We will surely find other venues for organizing as time goes on.

It looks like we’re off to a good start, and hopefully we’ll see some interesting developments grow out of this!

How to Test Software, Part 3: Measurement and Metrics

This post is the conclusion of my series of posts about testing software.  In the first post of the series, I established scientific methods as the foundation for how we build our testing methodology.  In the second post, I discussed methods for writing quality tests, and hinted at how to measure their effectiveness.  In this post, I will discuss the issues surrounding accurately measuring quality and cover some of the important measurements and methods that should be employed.

Metrics: Good and Bad

I have often compared metrics to prescription pain-killers: they can be useful tools for assessing quality; however, they are also highly prone to misuse and abuse, and can cause significant harm when misused.  As with many other things relating to testing and quality, this is a problem that science deals with on a continual basis.  One of the primary tasks of a scientific model is to be able to make predictions based on measurements.  Therefore, it is key that we be able to make good measurements and avoid the common pitfalls that occur when designing metrics.

Common pitfalls include the following (we’ll assume that we are attempting to measure quality with these metrics):

  1. Assuming correlation implies causation (ex: “ice cream sales are correlated to crime rates, therefore ice cream causes criminal activity”)
  2. Reversing the direction of causation (ex: “wet streets cause rainfall”)
  3. Metrics with large systematic errors (inaccuracy)
  4. Metrics with large random errors (imprecision)
  5. Metrics that don’t indicate anything at all about quality (irrelevant metrics)
  6. Metrics that don’t necessarily increase as quality increases (inconsistent metrics)
  7. Metrics that may increase even when quality falls (unsound metrics)
  8. Metrics that increase at a different rate than quality after a point (diminishing returns)
  9. Using metrics in conditions that differ dramatically from the assumptions under which they were developed (violating boundary conditions)
  10. Directly comparing metrics that measure different things

Inconsistency and unsoundness are extremely common flaws in older software quality (and productivity) metrics.  For example, “lines of code” was a common metric for productivity in the 80s and early 90s in software development (some very misguided firms still use it today).  This metric is flawed because it doesn’t actually correlate to real productivity at all for numerous reasons (chief among them being that low-quality code is often much longer than a well-designed and engineered solution).  Likewise, “number of bugs for a given developer” has been employed by several firms as a quality metric, and consistently has the ultimate result of dramatically reducing quality.

There are many more examples of the dangers of bad metrics, and of relying solely on metrics.  Because of the dangers associated with their use, I recommend the following points when evaluating and using metrics:

  • Consult someone with scientific training on the design and use of all metrics
  • Be watchful for warning signs that a metric is not working as intended
  • Understand the conditions under which a given metric applies, and when those conditions don’t hold
  • Understand the principle of diminishing returns and apply it to the use of metrics
  • Understand that a metric only measures a portion of the world, and watch for phenomena for which it fails to account

Examples of Measurements

The following are examples of various measurements of quality, and the factors governing their effective use.

Quality of Test Design: Case Coverage

The previous post covered various testing strategies in considerable detail, and discussed their relative levels of quality.  That discussion covered various issues affecting test quality; however, the key benefit provided by the more advanced testing methods was better case coverage.  Case coverage is an abstract metric that measures the percentage of the cases in which a given component or system can operate that are covered by the tests.  In the case of simple, finite (and stateless) components, case coverage can be directly measured.  However, in most cases, it is notoriously difficult to analyze, as the case spaces for most components and systems are infinite.

With very large or infinite case spaces, we need to devote careful thought to what portion of the case space is covered by the test suites.  In infinite spaces, we typically have some kind of equivalence structure.  We can define a notion of “depth”, where equivalent problem instances all lie on a particular trajectory and “deeper” problems grow more complex.  We would like to build test suites that cover the entire surface and go down to a uniform depth.  Methods like combinatorial testing are quite powerful in this regard and can achieve this result for many testing problems; however, they are not infallible.  Testing problems with very complex case spaces can require a prohibitively large combinatorial test in order to avoid missing certain parts of the surface.
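
As a small illustration of what combinatorial generation buys you, even a few orthogonal parameters multiply out into a case count nobody would write by hand.  The dimensions below are invented for the example, and the sketch is Haskell rather than any particular test framework:

data Access    = Public | Protected | Private deriving (Show, Enum, Bounded)
data Storage   = Static | Instance            deriving (Show, Enum, Bounded)
data Inherited = Direct | ViaSuperclass       deriving (Show, Enum, Bounded)

data TestCase = TestCase Access Storage Inherited deriving Show

-- The full cross product: 3 * 2 * 2 = 12 cases here, and every added
-- dimension multiplies the count, which is also why very complex spaces
-- eventually call for enumeration with pruning instead.
allCases :: [TestCase]
allCases =
  [ TestCase a s i
  | a <- [minBound ..]
  , s <- [minBound ..]
  , i <- [minBound ..]
  ]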

In the most complex cases, the case space has a recursive structure, a highly complex equivalence class structure, or both.  Examples of this often arise in the context of compilers, interpreters, and database systems.  The best example of this kind of case from my own work would be the expanded selection/resolution logic in the VM spec in JDK8.  In that case, exercising every spec rule through combinatorial testing produced a prohibitively large space, so we had to employ enumeration-based methods to explore all of the many possible branch-points in the case space while avoiding the generation of redundant instances.

The takeaway is that it is critical to consider the nature of the case space.  If we were to visualize the case space as a kind of surface, then problems that can be described (and tested) via combinatorial methods would look like a relatively simple geometric object, and a combinatorial test would look like a cubical volume.  Thus, it is relatively straightforward to capture large portions of the case space.  A problem like selection/resolution would look more like a highly complex fractal-like structure.  Problems such as these require different methods to achieve reasonable case coverage.

Effectiveness of a Test Suite on an Implementation: Code Coverage

Case coverage is a measure of the quality of a test suite’s design, and is derived from the specification of the thing being tested.  Code coverage addresses a different problem: the effectiveness of a test suite on a particular implementation.  An important point of this is that I do not believe these two metrics to be alternatives for one another.  They measure completely different things, and thus they both must be employed to give a broader view of the test quality picture.

Code coverage is essential because the implementation will likely change more readily than the behavior specification.  Serious gaps in code coverage indicate a problem: either something is wrong with the implementation, or the test suite is missing some portion of the case space.  Coverage gaps can emerge when neither of these is true, but when that happens, the reason should be understood.

Moreover, gaps in code coverage cast doubt on the viability of the code.  The worst example of this comes from my first job, where I once found an entire file full of callbacks that looked like this:

getFoo(Thing* thing) {
  if (thing == NULL) {
    return thing->foo;
  } else {
    return NULL;
  }
}

Note that the null-check is inverted.  Clearly this code had never been run, because there is no way that it could possibly work.  Gaps in code coverage allow cases like this to slip through undetected.

Stability Over Time

As previously discussed, stress testing seeks to test guarantees about the stability of the product.  The most important point about stress-testing is that the properties it tests are not discrete properties: they cannot be stated in terms of a single point in time.  Rather, they are continuous: they are expressed in terms of a continuous interval of time.  This is a key point, and is the reason that stress-testing is essential.  Unit and system testing can only establish discrete properties.  In order to get a sense of things like reliability and performance which are inherently continuous properties, it is necessary to do stress-testing.

A very important point is that this notion also applies to incoming bug reports.  In the OpenJDK project, we generally did not write internal stress-testing suites of the kind I advocate here.  We did, however, have a community of early adopters trying out the bleeding-edge repos constantly throughout the development cycle, which had the effect of stressing the codebase continually.  Whether one considers failures generated by an automated stress-test or bugs filed by early adopters, there comes a point in the release cycle where the number of outstanding bugs hits zero (this is sometimes known as the zero-bug build or point).  However, this is not an indicator of readiness to release, because it is only a discrete point in time.  The typical pattern one sees is that the number of bugs hits zero, and then immediately goes back up.  The zero-bug point is an indicator that the backlog is cleared out, but not that the product is ready for release.  This is because the zero-bug point is a discrete property.  The property we want for a release is a continuous one: namely, that in some interval of time, there were no bugs reported or outstanding.
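
To put the discrete/continuous distinction in code form, here is a toy Haskell predicate (the data shape and the interval are made up for illustration) capturing the release property we actually want: not “zero bugs today”, but “zero outstanding bugs over an entire trailing interval”.

-- history holds (day number, open bug count) pairs, most recent first.
readyForRelease :: Int -> [(Int, Int)] -> Bool
readyForRelease quietDays history =
  spansInterval && all ((== 0) . snd) window
  where
    currentDay    = case history of
                      ((d, _) : _) -> d
                      []           -> 0
    -- The entries inside the trailing interval must all be zero...
    window        = takeWhile (\(day, _) -> day > currentDay - quietDays) history
    -- ...and the history must actually extend past the interval,
    -- so the whole window is really covered.
    spansInterval = any (\(day, _) -> day <= currentDay - quietDays) history

The zero-bug point is merely the first day on which the count touches zero; the release criterion is that the whole trailing window stays there.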


Performance Testing

The issues associated with performance measurement are worthy of a Ph.D. thesis (or five), and thus are well outside the scope of this post.  This section is written more to draw attention to them, and to point out a few of the many ways that performance testing can produce bad results.

Effective performance testing is HARD.  Modern computing hardware is extremely complex, with highly discontinuous, nonlinear performance functions, chaotic behavior, and many unknowns.  The degree to which this can affect performance testing is just starting to come to light, and it has cast doubt on a large number of published results.  For example, it has been shown that altering the linking order of a program can affect performance by up to 5%: the typical magnitude of performance gain needed to secure publication in top-level computer architecture conferences.

The following are common problems that affect performance testing:

  • Assuming compositionality: the idea that good performance for isolated components of a system implies that the combined system will perform well.
  • Contrived microbenchmarks (small contrived cases that perform well).  This is a dual of the previous problem, as performing well on isolated parts of a problem instance doesn’t imply you’ll perform well on the combined problem.
  • Cherry-picking
  • Not large enough sample size, not enough randomness in selections, bad or predictable random generators
  • Failing to account for the impact of system and environmental factors (environment variables, link order, caches, etc)
  • Non-uniform preconditions for tests (failing to clear out caches, etc.)
  • Lack of repeatability

The takeaway from this is that performance testing needs to be treated as a scientific activity, and approached from the same level of discipline that one would apply in a lab setting.  Its results need to be viewed with skepticism until they can be reliably repeated many times, in many different environments.  Failure to do this casts serious doubt on any result the tests produce.

Sadly, this fact is often neglected, even in top-level conferences; however, this is not an excuse to continue to neglect it.


Conclusion

In this series, I have described an approach to testing that has its foundations in the scientific method.  I have discussed different views from which tests must be written.  I have described advanced methods for building tests that achieve very high case coverage.  Finally, I have described the principles of how to effectively measure quality, and the many pitfalls that must be avoided.

The single most important takeaway from this series is this:

Effective testing is a difficult multifaceted problem, deserving of serious intellectual effort by dedicated, high-level professionals.

Testing should not consist of mindlessly grinding out single-case tests.  It should employ sophisticated analysis and implementation methods to examine the case space and explore it to a satisfactory degree, to generate effective workloads for stress testing, and to analyze the performance of programs.  These are very difficult tasks that require the attention of people with advanced skills, and they should be viewed with the respect that solving problems of this difficulty deserves.

Moreover, within each organization, testing and quality should be seen as an essential part of the development process, and something requiring serious attention and effort.  Time and resources must be budgeted, and large undertakings for the purpose of building testing infrastructure, improving existing tests, and building new tests should be encouraged and rewarded.

Lastly, a culture similar to what we had in the langtools team, where we were constantly looking for ways to improve our testing and quality practices, pays off in a big way.  Effort put into developing high-quality tests, testing frameworks, and testing methods saves tremendous amounts of time and effort in the form of detecting and avoiding bugs, preventing regressions, and making refactoring a much easier process.  We should therefore seek to cultivate this kind of attitude in our own organizations.

How to Test Software, Part 2: Quality of Tests

In the first post in this series, I discussed an overall approach to testing based on the scientific method.  I also discussed the need for multiple views in our testing methodology as well as three important views that our testing regimen should incorporate.  Unit testing is important, as it tests the kind of guarantees that developers rely upon when using components.  System testing is important because it tests software from the same view as the end users.  Finally, stress and performance testing are important as they answer questions about the continuous operation of the system.

However, I only talked about the general approach to writing tests, and the views from which we write them.  I said nothing about the actual quality of the tests, but rather deferred the topic to a later post: namely this one.

Test Quality: Basics

In scientific investigations, experimental quality is of paramount importance.  Bad experiments lead to bad conclusions; thus it is important to design experiments that are sound, repeatable, and which convey enough information to establish the conclusions we draw.  Similar criteria govern how we should write our tests.  Specifically, tests should establish some set of guarantees to a high degree of certainty.  Thus, we must design our tests as we would an experiment: with considerable thought to the credibility of the test in establishing the guarantees we wish to establish.

Many test suites fail miserably when it comes to their experimental methodology.  Among the most common reasons are the following:

  • Low Coverage
  • Random or unreliable generation of cases
  • Lack of repeatability/intermittent failures

We want to establish a rigorous testing methodology that consistently produces high-quality, credible tests that test their hypotheses to a high degree of certainty.  We can derive the general guidelines for any test suite from the principles of sound experiments.  The following is a list of these principles:

  • Tests should be consistently repeatable, and should not have intermittent failures
  • Tests should give informative and actionable results when they fail
  • Tests should achieve high coverage, and should automatically generate a large set of cases wherever possible
  • Correctness tests should never have any kind of randomness in their operation
  • Stress and performance tests should minimize entropy from out-of-scope factors

With these general principles in mind, we can look at what specifically makes for quality tests in each of the views we discussed in the previous post in this series.

Unit Test Quality

Unit tests examine guarantees about individual components.  One of their chief advantages over other views is the ability to directly exercise cases and codepaths that may be difficult to trigger in whole-system tests.  As such, case coverage is of paramount importance for writing good unit tests.

A less obvious factor in the quality of a unit test is the set of preconditions under which the tests run.  Very few components have a purely-functional specification; most interact with parts of the system in a stateful fashion.  There is often a temptation to write synthetic harnesses which simulate the behavior of the system in a very small number of cases; however, this leads to low-quality tests.  High-quality tests will explore the behavior of the system with a wide variety of preconditions.

In summary, the additional criteria for unit tests are as follows:

  • Explore the case space for components completely
  • Simulate wide varieties of preconditions that affect the behavior of the components

System Test Quality

System tests examine guarantees about the system as a whole.  The purpose of system tests is to test the software from the same point of view as the users who will eventually use it.  The key difficulty with this view is repeatability, particularly for complex systems that interact with things like databases or the network.  For the most complex systems, considerable care must be taken in order to engineer repeatable tests.

Additionally, it is necessary to consider system-specific behaviors like character sets, filesystem conventions, date and time issues, and other such issues.

The following are common problems that need to be considered in writing system tests:

  • OS-specific factors (encoding, filesystem behaviors, etc)
  • OS-level preconditions (existing files, environment variables, etc)
  • Interactions with other services (databases, authentication servers, etc)
  • Network issues ((in)accessibility, configurations, changing IPs, etc.)

Stress/Performance Test Quality

Stress tests examine guarantees about stability under certain kinds of load.  Performance tests likewise examine performance under certain kinds of load.  Both of these differ from other kinds of testing in that the properties they examine are about continuous intervals of time as opposed to discrete points.

Both stress and performance tests tend to involve some degree of entropy (stress tests do so deliberately; performance tests do so more out of a need to measure real performance).  This is a key difference from correctness-oriented tests, which should avoid entropy at all costs.  The key to quality testing when entropy is unavoidable is to keep it limited to relevant entropy and to isolate the test from irrelevant entropy: that is, to maximize the signal and minimize the noise.  In stress testing, we want to measure stability under “characteristic” workloads; thus, it is critical that we generate loads that are statistically similar to a characteristic load, or at the very minimum have statistical properties that we understand.  Additionally, it is important that we don’t accidentally neglect certain aspects of the desired workload.

In performance testing, we must also avoid accidental biases in our tests arising from factors like caching.  This may seem simple, but in fact it is much more difficult than a first glance would suggest.  For example, the content of environment variables can significantly affect cache behavior, as can the link order of the application.  The contents of caches, both CPU caches as well as filesystem and page caches, can likewise have a significant effect on performance, and can accidentally bias the tests.  It is important to think carefully about performance tests and all the factors that affect performance in order to avoid these kinds of bias.
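
As one concrete way of controlling the entropy, the sketch below (Haskell, with a made-up 70/20/10 read/write/scan mix) makes the workload’s statistical properties an explicit input and seeds the generator, so the exact same operation sequence can be replayed across runs and environments:

import System.Random (StdGen, mkStdGen, randomR)

data Op = ReadOp | WriteOp | ScanOp deriving Show

-- Weighted choice over a declared operation mix.
pickOp :: [(Op, Int)] -> StdGen -> (Op, StdGen)
pickOp mix gen = (choose mix 0, gen')
  where
    total        = sum (map snd mix)
    (roll, gen') = randomR (1 :: Int, total) gen
    choose ((op, w) : rest) acc
      | roll <= acc + w = op
      | otherwise       = choose rest (acc + w)
    choose [] _ = fst (head mix)   -- unreachable for a non-empty mix

-- A repeatable synthetic workload: same seed, same operations.
workload :: Int -> Int -> [Op]
workload seed n = go (mkStdGen seed) n
  where
    mix = [(ReadOp, 70), (WriteOp, 20), (ScanOp, 10)]
    go _ 0 = []
    go g k = let (op, g') = pickOp mix g in op : go g' (k - 1)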

The following are important factors for writing stress and performance tests:

  • Ensure that the statistical properties of the synthetic workload accurately reproduce the desired properties
  • Ensure that the space of generated cases does not exclude any cases that we desire to include
  • Ensure that non-obvious biases are eliminated or minimized in performance tests


Case Coverage and Code Coverage

The problem of coverage is central to the problem of correctness testing.  Older literature on testing describes two dual methodologies: blackbox and whitebox (sometimes called glassbox).  The difference between the two can be stated in terms of case coverage and code coverage.  I prefer not to talk about whitebox and blackbox testing, because both case and code coverage are important.  They also don’t represent concepts that can be directly compared.  Code coverage is a measurable quantity, which can and should be determined using coverage tools.  Case coverage, on the other hand, is a conceptual idea, and does not lend itself to direct measurement except in the simplest of cases.

Put another way, case coverage is useful for evaluating the quality of the design of an individual test or of a given test-writing methodology.  We can talk clearly about what kinds of cases a test generates and checks, how many of them there are, and how varied or redundant they are, and we can reason to some extent about how closely they approximate complete testing of the entire case space (which is often infinite).  Thus, case coverage is a measure of the quality of the design of a test.

Code coverage, on the other hand, generally cannot be directly inferred from a given test; rather, it is a measure that is obtained by running the test and collecting and analyzing profiling data after the fact.  It functions as a performance metric, and indicates the adequacy of a test.  Even a very well-designed test suite with good case coverage may leave gaps in the code coverage either because those gaps come from very obscure edge cases, or because for some reason those code paths cannot be exercised by any test case (which can indicate underlying problems in the implementation).  Thus, code coverage is a measure of the adequacy of a test suite.

The remainder of this post will focus on case coverage, and how different test design methodologies achieve different levels of coverage.  I will discuss code coverage in a future post.

Test Design Methodologies

The technical difficulty of designing high-quality tests is often underestimated.  As a consequence, many test suites contain large numbers of low-quality tests.  In one of many discussions about testing during my time working on OpenJDK, I described a system of tiers for testing methodologies, ranked by the degree of case coverage they can provide and the sorts of problem spaces they are equipped to handle.  This section describes these tiers in detail.

Single-Case Tests

Single-case tests are the most common method for writing tests: each test hard-codes a single input and its expected result.  They are also the least effective method, as they achieve extremely low case coverage (and often very low code coverage).  The single-case testing methodology is bad for a number of reasons:

  • It does not scale either in terms of engineering effort or in terms of execution.  Any automated case generation method can achieve coverage levels that would require hundreds of thousands of person-hours with the single-case methodology.
  • There is an inherent bias toward writing simple cases, which tends to result in the tests missing the complex cases.
  • It tends to result in a significant amount of copy-pasting, which leads to errors in the tests.
  • It results in an unmaintainable test suite, often with many duplicated tests.

For these and other reasons, single-case tests were strongly discouraged in the langtools group, and would usually fail code review without some justification.
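
For concreteness, this is the shape of test being discouraged, shown here as a minimal sketch against a JDK API (my own illustration, assuming a JUnit 4 harness); it exercises exactly one point in an enormous input space.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// A single-case test: one hand-written input, one hard-coded expected result.
public class ParseIntSingleCaseTest {
    @Test
    public void parsesFortyTwo() {
        assertEquals(42, Integer.parseInt("42"));
    }
}
```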

Template Tests

Template tests are a method for quickly generating and testing a large number of very similar cases.  With template testing, we write a template that constructs a test case from a parameter value.  The template is then applied across a range of parameter values, generating and checking each of the resulting cases.

This method was frequently employed in the javac test suite for relatively simple problems that we encountered.  It is most effective for problems with a relatively “flat” structure; more complex problem spaces often call for combinatorial testing instead.
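
As an illustration of the shape of a template test (my own sketch, assuming a JUnit 4 harness, not code from the javac suite), the template below constructs each case from a single integer parameter, so the expected result is known by construction.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Template test sketch: one template, applied across a whole range of
// parameter values.  Each case is generated from its parameter, so the
// expected result is known by construction.
public class ParseIntTemplateTest {
    // The template: given a value, build the input string and check the result.
    private void runCase(int value) {
        String input = Integer.toString(value);
        assertEquals("parsing " + input, value, Integer.parseInt(input));
    }

    @Test
    public void parseIntRoundTrips() {
        // Apply the template across a range of values, including boundaries.
        for (int v = -10_000; v <= 10_000; v++) {
            runCase(v);
        }
        runCase(Integer.MIN_VALUE);
        runCase(Integer.MAX_VALUE);
    }
}
```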

A common variation on this style of testing was to create a “little language”, which describes a test case in a very concise format.  This was used to test bridge method generation in Lambda for JDK8 (this test suite is now part of the OpenJDK repository).

Combinatorial Tests

Combinatorial tests, or “combotests”, were the most common methodology used by the javac team as our practice matured.  Combotests work like template tests, except that the template takes multiple parameters.  The test defines a range of possible inputs for each parameter and runs the template on every possible combination of those inputs.
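
Continuing the same toy example (again my own sketch, assuming a JUnit 4 harness), a combotest simply nests one loop per parameter dimension and checks every combination of values.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Combinatorial ("combotest") sketch: several independent parameter
// dimensions, with the check run on every combination.
public class ParseIntComboTest {
    private static final int[] VALUES = {
        0, 1, 7, 255, 4096, 65_535, Integer.MAX_VALUE
    };
    private static final int[] RADIXES = { 2, 8, 10, 16, 36 };
    private static final boolean[] NEGATE = { false, true };

    @Test
    public void parseIntAllCombinations() {
        for (int value : VALUES) {
            for (int radix : RADIXES) {
                for (boolean negate : NEGATE) {
                    int expected = negate ? -value : value;
                    String input = Integer.toString(expected, radix);
                    assertEquals("parsing \"" + input + "\" in radix " + radix,
                                 expected, Integer.parseInt(input, radix));
                }
            }
        }
    }
}
```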

Combinatorial tests achieve a very high level of coverage for many problems, and can generate and test tens of thousands of problem instances in an efficient manner.  This methodology is sufficient for many problems.  Only the most complex problems require the more advanced method of enumeration testing.

For nearly all testing problems, combinatorial tests represent the “sweet spot” of the diminishing returns curve.  They achieve high coverage, but are relatively easy to implement.  For this reason, combinatorial testing should be the preferred method of writing tests.

Enumeration Testing

Combinatorial testing is a powerful method, but it is predicated on the idea that a problem can be expressed in terms of a small set of independent dimensions, each combination of which is a unique problem instance and whose expected result can be easily determined.  It breaks down in the presence of certain conditions, including the following:

  • When it is difficult to determine the expected result from a problem instance without re-implementing the thing being tested
  • When the problem has a highly recursive structure to its specification
  • When there is a complex equivalence class structure among the problem instances

When these conditions are in effect, combinatorial testing fails either because it does not explore enough of the problem space, or because it must explore a prohibitively large space in order to achieve reasonable case coverage.

Examples of where these kinds of conditions manifest include type systems, symbol resolution in the presence of inherited and nested scopes, and dispatch logic in the implementation of object-oriented languages.  In all these cases, we see the features I listed above.  As I work on compilers quite a bit, I encounter these kinds of problems frequently; thus I have moved over time to using enumeration testing methods to deal with them.

Enumeration testing is rooted in the notion of proof trees in logic: each rule in a specification or a type system implies something about the test cases that exercise it.  For example, symbol resolution in Java has a rule which states that if a class does not define the desired symbol, then we recursively search for the symbol in its superclass.  This implies (at least) two test cases for this rule: one in which a class defines the symbol, and one in which it does not.

Enumeration testing creates a builder for test cases and a set of “templates”, each of which operates on a builder to add data to the test case.  We then use tree enumeration to explore all possible cases out to a certain depth.  In essence, we turn the testing problem into a branching search problem.
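
The following is a self-contained sketch of that mechanism (my own illustration, not the javac team’s actual harness): a tiny builder for class-hierarchy test cases, two “templates” mirroring the defines-the-symbol and does-not-define-it branches of the rule above, and a recursive enumeration out to a fixed depth.  A real test would compile and check each generated case rather than merely counting them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Enumeration-testing sketch: a builder for partial test cases, a set of
// "templates" that each extend a partial case, and a recursive search that
// enumerates every sequence of template applications out to a fixed depth.
public class ClassHierarchyEnumTest {
    // A (very) simplified builder: a chain of class declarations, where each
    // class either defines the field we will resolve or does not.
    static final class CaseBuilder {
        final List<String> decls = new ArrayList<>();

        CaseBuilder copy() {
            CaseBuilder b = new CaseBuilder();
            b.decls.addAll(decls);
            return b;
        }

        String render() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < decls.size(); i++) {
                String sup = (i == 0) ? "" : " extends C" + (i - 1);
                sb.append("class C").append(i).append(sup)
                  .append(" { ").append(decls.get(i)).append(" }\n");
            }
            return sb.toString();
        }
    }

    // The "templates": each mirrors one branch of the resolution rule.
    static final List<UnaryOperator<CaseBuilder>> TEMPLATES = Arrays.asList(
        b -> { CaseBuilder c = b.copy(); c.decls.add("int x = 0;"); return c; },
        b -> { CaseBuilder c = b.copy(); c.decls.add("/* no x */"); return c; }
    );

    // Recursively enumerate all template applications up to the given depth.
    static void enumerate(CaseBuilder partial, int depth, List<String> out) {
        if (depth == 0) {
            out.add(partial.render());  // a real test would compile and check this
            return;
        }
        for (UnaryOperator<CaseBuilder> template : TEMPLATES) {
            enumerate(template.apply(partial), depth - 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> cases = new ArrayList<>();
        enumerate(new CaseBuilder(), 3, cases);  // 2^3 = 8 class hierarchies
        System.out.println(cases.size() + " cases generated; first case:");
        System.out.println(cases.get(0));
    }
}
```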

In summary, enumeration testing is an advanced method which is difficult to implement.  However, it is the only method able to adequately address the hardest testing problems.


A common misconception about testing is that it is an inherently simple task.  In reality, writing high-quality tests is technically challenging, and achieving very high quality requires knowledge of advanced programming techniques and computer science theory.  In this post, I have discussed the general principles of high-quality testing, the role of different kinds of quality in testing, and a series of increasingly advanced methodologies for writing tests.  The final piece of the testing picture deals with measurements and metrics, which I will discuss in the next post.