Industrial Programming Languages and Pragmatics

Applying the pragmatics and principles of industrial programming language design to formal methods, verification, and dependently-typed languages is something I care a great deal about.  It’s the focus of my programming language project Salt, and it was the subject of my recent lightning talk at IEEE SecDev and many of the discussions that followed.

Those who interact with me know I tend to talk about this subject quite a bit, and that I have a particular concern for the pragmatic aspects of language design.  The origin of this concern no doubt comes from my time with the Java Platform Group at Oracle; there I saw firsthand the issues that a successful industrial language must confront and the demands that arise from real-world industrial use.  I’ve also seen these sorts of issues elsewhere, such as in my work on the FreeBSD and Linux kernels.

A common question I get is “what do you mean by ‘industrial language’?”  It comes up most often in the context of the academic programming language and Haskell communities, and frequently enough that it has prompted me to write an article on the topic.

Academic and Industrial PL

In my experience, there’s something of a gulf between academic knowledge about programming languages and industrial knowledge.  Academic PL is primarily concerned with things like semantics, idealized language features, and type systems (particularly type systems).  Academic PL knowledge is available through the usual academic conferences: POPL, PLDI, OOPSLA, ICFP, and the like, as well as various books and journals.

Industrial PL, by contrast, is generally less structured and less accessible in its knowledge, existing primarily as institutional knowledge in organizations like OpenJDK and others dedicated to maintaining major industrial languages.  Some of this makes it out into industrial conferences (Brian Goetz, the Java language architect, has given a number of talks on these subjects, such as this one).

In recent history, there has been a concerted effort by the industrial PL world to tap into academic knowledge.  This manifests in recent and upcoming Java language features, as well as in languages like Rust that incorporate a number of advanced type system features.  I saw this firsthand in my time on the Java Platform Group, particularly in the work that went into the Lambda project and is now going into the Valhalla project.

Academic PL, on the other hand, has tended to be dismissive of the knowledge base of industrial PL, as it lacks the kind of formalism that modern academic PL work demands.  The source of this, I believe, is rooted in differing goals and scientific foundations.  The concerns of academic PL are well-addressed by higher mathematics, whereas the concerns of industrial PL are better answered through disciplines such as organizational psychology.  In truth, both are important.

It is my belief that both worlds need to make an effort to get out of their silos.  The industrial PL world needs to do a better job at describing the problems it solves and reporting its solutions and findings.  The academic PL world needs to acknowledge the intellectual merit of creating real-world realizations of the idealized features presented in academic PL conferences.

Principles of “Industrial Programming Languages”

A first step towards getting out of the silo is to clearly express what the notion of an “industrial programming language” means.  This phrase is an example of what might be called a “common-sense” term.  Any of my colleagues from OpenJDK could certainly expound at length on its meaning, industrial programmers could probably list a number of examples, and even academics have some notion of what it means (often some variation on “icky details we don’t want to deal with”).

The problem with “common sense” notions is that everyone has their own ideas of what they mean.  To effectively study an issue, we must refine such notions into more concrete and unambiguous ideas.  A truly principled effort in this direction would, in my opinion, be worthy of peer-reviewed publication.  My goal here is merely to begin to explore the ideas.

As a first draft, we can list a number of issues that “industrial” software development must confront, and which “industrial” programming languages must provide tools to address:

  • Very large codebases, running from hundreds of thousands to tens of millions of lines of code
  • Long-running development, spanning many development cycles and lasting years to decades
  • Very large staff headcounts, distributed teams, staff turnover, varying skill-levels
  • Withstanding refactoring and architectural changes, and resisting “bit-rot”
  • Large and complex APIs, protocols, interfaces, etc., possibly consisting of hundreds of methods/variables
  • Interaction with other complex systems, implementation of complex standards and protocols
  • Need for highly-configurable systems, large numbers of possible configurations
  • Large third-party dependency sets and management thereof

The common theme here is unavoidable complexity and inelegance.  It is well known in many disciplines that different factors emerge and become dominant at different scales.  In algorithms and data structures, for example, code simplicity is often the dominant factor in execution time at small scales, asymptotic complexity dominates at mid-scale, and I/O cost dominates all other concerns at large scales.  Similarly, in software development, certain stresses (such as the ones listed above) emerge and become increasingly important as the scope of a development effort increases.

Industrial practice and industrial languages aim to withstand these factors and the stresses they produce as development scope scales up to modern levels.  An important corollary is that the failure to provide mechanisms for dealing with these factors effectively limits the scope of development, as has been the case throughout the development of industrial practice.

Dealing with Development at Scale

I routinely use the term “industrial pragmatics” to refer to the various methods that have been developed to cope with the stresses that emerge in large-scale software development.  In particular, I use the word “pragmatics” because the efficacy of these techniques generally can’t be evaluated using theory alone.  They involve both complex interactions at scale and human behavior: two phenomena that defy theoretical abstraction.

It is worth exploring some of these techniques and why they have been so successful.

The Path of Least Eventual Cost/Pain

At very large scales, the only successful management strategy quickly becomes the path of least eventual cost (or, more bluntly, pain).  The key word here is eventual: it is critically important to realize that some decisions may minimize short-term cost or maximize short-term gain, but eventually lead to a much higher cost in the long run.  At large scales and in long-running projects, short-sighted decision-making can dramatically increase the eventual cost, to the point of compromising the entire effort.

This is, in my opinion, one of the critically-important principles of industrial pragmatics.  Any tool, technique, language, or practice must be evaluated in terms of its effect on eventual cost.

Modularity and Interfaces

More than any other technique, modularity has proven itself invaluable in managing the complexity of industrial-scale software development.  The most widely-successful adoption of this technique came in the form of Object-Oriented programming, which removed the barriers that prevented software development from scaling up.  Since the advent of OO, other adaptations have emerged: Haskell’s typeclasses, various module systems, and the like, all of which share common features.

A key feature of these systems is that they manage complexity by encapsulation: walling off part of a huge and complex system, thereby limiting the concerns of both the author of that component and its users.  This significantly limits the set of concerns developers must address, thereby limiting the size and complexity of any “local” view of the system.
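To make the encapsulation point concrete, here is a minimal Java sketch (the names are hypothetical, not from any real codebase): callers see only a small contract, and everything behind it can change freely.

```java
import java.util.ArrayDeque;

// Callers depend only on this contract, never on implementation details.
public interface MessageQueue {
    void enqueue(String message);
    String dequeue();   // returns null when empty
}

// The implementation is walled off.  Its internals (here a deque, but it
// could just as well be a ring buffer or a file) can be swapped out
// without touching a single caller, which is exactly what keeps the
// "local" view of a huge system small.
final class InMemoryQueue implements MessageQueue {
    private final ArrayDeque<String> buffer = new ArrayDeque<>();

    @Override
    public void enqueue(String message) {
        buffer.addLast(message);
    }

    @Override
    public String dequeue() {
        return buffer.pollFirst();
    }
}
```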

Constraining Use

Large systems quickly build up a huge number of interfaces and configurations, and the fraction of possible uses of those interfaces and configurations that represent “valid” or “good” use quickly shrinks to the point where the vast majority of uses are incorrect in some way.  Well-designed systems provide mechanisms to restrict or check use of interfaces or configurations to identify misuse, or else restrict use to the “good” cases in some way.  Badly designed systems adopt an “anything goes” mentality.

A prime example comes in the form of cryptographic APIs.  Older crypto APIs (such as OpenSSL’s) provide a dizzying array of methods that can result in an insecure system at the slightest misuse.  This problem has been identified in recent academic work, and a series of new crypto APIs that restrict use to correct patterns have been created.  Type systems themselves are also a technology that restricts usage, greatly constraining the space of possible cases and greatly improving the degree to which a program can be analyzed and reasoned about.
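As a rough illustration of what a use-constraining crypto API looks like, consider this Java sketch built on the standard javax.crypto classes (the SecretBox name and its design are mine, for illustration only, not a vetted scheme).  The only operations are seal and open; cipher modes, padding, and nonce handling are simply not caller-visible, so the classic misuses cannot be expressed through the interface:

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public final class SecretBox {
    private static final int NONCE_BYTES = 12;
    private static final int TAG_BITS = 128;

    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    public SecretBox(SecretKey key) { this.key = key; }

    // Authenticated encryption only; a fresh nonce is chosen per message
    // and prepended to the ciphertext, so callers can't reuse one.
    public byte[] seal(byte[] plaintext) throws GeneralSecurityException {
        byte[] nonce = new byte[NONCE_BYTES];
        random.nextBytes(nonce);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, nonce));
        byte[] ct = c.doFinal(plaintext);
        byte[] out = new byte[NONCE_BYTES + ct.length];
        System.arraycopy(nonce, 0, out, 0, NONCE_BYTES);
        System.arraycopy(ct, 0, out, NONCE_BYTES, ct.length);
        return out;
    }

    // Decryption authenticates the data; tampering throws an exception
    // rather than silently returning garbage.
    public byte[] open(byte[] sealed) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(TAG_BITS, sealed, 0, NONCE_BYTES));
        return c.doFinal(sealed, NONCE_BYTES, sealed.length - NONCE_BYTES);
    }
}
```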

Managing Implicit Constraints

In a sufficiently-large system, a significant effort must be dedicated to maintaining knowledge about the system itself and preventing the introduction of flaws by violating the implicit logic of the system.  Well-managed systems minimize the externally-facing implicit constraints, expose APIs that further minimize the degree to which they can be violated, and ensure that the unavoidable constraints are easily detected and well-known.  Badly-designed systems are rife with such implicit constraints and require expert knowledge in order to avoid them.

This is one area where I believe formal methods can make a huge impact if adapted correctly.  There are many ways of approaching this problem (documentation, assertions, unit tests, and so on), but they are all deficient for various reasons.  Even the ability to explicitly codify invariants and preconditions would be a considerable boon.  More modern languages and tools such as Rust and JML approach this, but only formal methods can provide a truly universal solution.
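Even without formal methods, a little of this can be done today by funneling all construction of a value through a single checked point.  A small Java sketch (the Range class is hypothetical) of turning an implicit constraint into an explicit, enforced one:

```java
public final class Range {
    private final int lo;
    private final int hi;

    private Range(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    // The invariant lo <= hi would otherwise live only in documentation
    // and in developers' heads; here it is enforced at the one place a
    // Range can be created.
    public static Range of(int lo, int hi) {
        if (lo > hi) {
            throw new IllegalArgumentException(
                "violated invariant lo <= hi: " + lo + " > " + hi);
        }
        return new Range(lo, hi);
    }

    public boolean contains(int x) {
        // Safe to assume lo <= hi here: of() is the only way in.
        return lo <= x && x <= hi;
    }
}
```

This is, of course, a far cry from codifying arbitrary invariants; it is the sort of thing formal methods would generalize.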

The Art of API (and Protocol/Standard) Design

The design of APIs is an art that is, in my opinion, underappreciated.  It takes considerable skill to design a good API.  It requires the ability to generalize, and more importantly, to know when to generalize and when to hammer down details.  It requires the ability to foresee where people will want to change things and how to make that process easy.  It requires a good sense for how people will build upon the API.  It requires a good sense of necessary vs. unnecessary complexity.  Lastly, it requires a knack for human factors and for designing things that make doing the right thing easy and the wrong thing hard or impossible.

Managing Failure

Failure of some kind becomes unavoidable at large scales.  Software systems subject to real use will eventually be used incorrectly, and it will become necessary to diagnose failures quickly and efficiently.  This is unavoidable even in a perfect world where all of our own developers are perfect: users and third-party developers can and will do things wrong, and they need to be able to figure out why.  In the real world, our own developers also must diagnose failures, if only in the internal development and testing processes.

This, I think, is one of the most undervalued principles in both academic and even industrial circles.  Far too many people want to answer the problem of managing failure with “don’t fail”.  But failure can’t be avoided or wished away, and failing to provide tools to quickly and effectively diagnose failure translates to increased cost in terms of development times, testing times, and diagnosis of failure in the field (in other words, much more eventual cost and pain).

In my own career, Java and other JVM languages have proven to have the best track record for easy diagnosis of failure of any language I’ve seen in common use, both for me and for others.  By contrast, languages that don’t provide the kind of diagnostic functionality that JVM languages do, either because they can’t (like C) or because they choose not to (like Haskell), tend to slow things down, as it takes longer to diagnose issues.  Lastly, I’ve also dealt with some truly nightmarish platforms, such as embedded systems and boot loaders, where diagnosis is a task for experts and takes considerable effort and time.
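Part of what makes failures on the JVM cheap to diagnose is a simple discipline the platform encourages: wrap low-level failures with context and preserve the original cause, so the full causal chain survives into the stack trace.  A minimal sketch (the config-loading scenario is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class ConfigLoader {
    public static String loadConfig(Path path) {
        try {
            return Files.readString(path);
        } catch (IOException e) {
            // The rethrown exception carries the high-level context
            // (which file, during what operation) and the root cause.
            throw new IllegalStateException(
                "failed to load configuration from " + path, e);
        }
    }

    public static void main(String[] args) {
        // The resulting trace shows the whole causal chain, which is
        // exactly what speeds up diagnosis in the field.
        loadConfig(Path.of("/nonexistent/app.conf"));
    }
}
```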

Human Factors

In addition to the principles I’ve discussed thus far, human factors (particularly the difficulty of modifying human behavior) play a key role in the design of industrial languages.  A considerable amount of thought must go into how to influence user behavior and how to get users to adopt new techniques and methodologies, often with little chance of recovering from mistakes.  Unlike the other principles, these concerns are rooted in psychology and therefore cannot be the subject of the kinds of mathematical theories that underlie many aspects of PL.  They are nonetheless critical to success.

Making it Easy to be Good/Hard to be Bad

One of the most successful principles I’ve learned is that of making it easy to be good and hard to be bad.  This means that “good” use of the tool or language should be easy and intuitive, and it should be difficult and non-obvious how to do “bad” things (this of course presupposes an accurate notion of good vs. bad, but that’s another discussion).

An excellent example of this comes from Standard ML.  Functional programming with immutable data is the notion of “good” in SML, and imperative programming with mutable data is strongly discouraged.  The latter is tolerated, but the syntax for declaring and using mutable state is gnarly and awkward, encouraging programmers to avoid it in most cases and to encapsulate it when they do use it.  Java and other languages likewise reflect in their design a notion of object-orientation as “good”, with global state strongly discouraged and made deliberately difficult to use.
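The same principle shows up in miniature in everyday Java style: an immutable value type is the path of least resistance, while shared mutable state requires visible ceremony (static fields, synchronization) that stands out in review.  A hypothetical sketch of the easy, “good” path:

```java
public final class Point {
    private final double x;
    private final double y;

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // "Mutation" is expressed the easy way: by producing a new value
    // rather than changing this one in place.
    public Point translate(double dx, double dy) {
        return new Point(x + dx, y + dy);
    }

    public double x() { return x; }
    public double y() { return y; }
}
```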

Regimentation

Coordinating human activity at scale is an extremely difficult problem.  Regimentation (creating common idioms that serve to make behavior predictable and facilitate implicit communication) is a very common technique for dealing with the problem at scale.  In the context of software development, this takes the form of things like design patterns and anti-patterns, well-defined interfaces and standards, style documents, and programming paradigms.

Languages that succeed at large-scale development tend to provide facilities for this kind of regimentation in one way or another.  This, incidentally, is one major argument in favor of types: they provide a considerable amount of implicit communication between the developers and the users of an interface.  Similarly, languages with well-known idioms and built-in usage patterns tend to produce more coherent codebases; Java is one example of this.  Haskell is quite good in some ways (its typeclasses, tightly-controlled use of side-effects, and very powerful type system) and deficient in others (its five string types).  Python achieves good regimentation despite its lack of types, which I believe is a crucial factor in its popularity.
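To see how much implicit communication a signature can carry, consider this hypothetical Java interface (all names invented for illustration).  Without a line of documentation, it tells a maintainer that lookups can fail, that the timeout has units, and that identifiers aren’t interchangeable bare strings:

```java
import java.time.Duration;
import java.util.Optional;

public interface SessionStore {
    // Optional says "this can fail" without resorting to null;
    // Duration says what the number means; SessionId can't be
    // accidentally swapped with some other string-valued ID.
    Optional<Session> find(SessionId id, Duration timeout);

    record SessionId(String value) {}
    record Session(SessionId id, long expiresAtEpochMillis) {}
}
```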

Making Complex Ideas Accessible

A key to success in industrial development is the ability to make a complex idea accessible.  In my own experience, this was one of the core design principles of the Lambda project in Java.  We had the challenge of introducing higher-order functional programming and all the notions that come along with it in a way that was accessible to programmers used to the stateful OO style of thinking.  Conversely, many efforts have gotten this wrong, as exemplified in the infamous quote “a monad is just a monoid in the category of endofunctors”.

This translation between idioms is quite difficult; it requires one to deeply understand both sides of the fence, as well as which use cases and concerns are most common on each side.  However, it’s critical: people don’t use what they can’t understand.  The best way to facilitate understanding is to present an idea or principle in a context the audience already understands.
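A tiny sketch of how this played out syntactically in Java 8 (my example is illustrative, not from the JDK): a lambda arrives dressed as a compact instance of a familiar interface, and a stream pipeline reads much like the loops programmers already write:

```java
import java.util.List;

public final class Totals {
    // Sum the amounts above a threshold: a higher-order pipeline
    // expressed in terms an OO programmer already knows.
    public static double totalOver(List<Double> amounts, double threshold) {
        return amounts.stream()
                      .filter(a -> a > threshold)       // a predicate as a value
                      .mapToDouble(Double::doubleValue)
                      .sum();
    }

    public static void main(String[] args) {
        System.out.println(totalOver(List.of(5.0, 25.0, 75.0), 10.0)); // 100.0
    }
}
```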

Harm-Reduction

Just as failure must be dealt with in the context of software systems, changing behavior and practices must be dealt with in the context of human factors.  As an example, programming practices based on the functional paradigm are becoming increasingly recommended practice due to a number of changing factors in the world.  Many languages face the challenge of adapting their user base to these new practices.

The mentality of harm-reduction often proves the most effective attitude when it comes to changing human behavior, both here and elsewhere.  Harm-reduction accepts that “bad” behavior takes place and focuses its efforts on minimizing the negative impacts on oneself and others and encouraging a transition towards better behavior in the long-run.  This reflects the realities of industrial programming: software can’t be rewritten into a new paradigm all at once, and some legacy systems can’t be eliminated or rewritten at all.

In the design of industrial languages and tools, this takes the form of several adages I’ll list here: provide options instead of demands, minimize upfront costs and facilitate gradual transitions, work with the old but encourage the new.
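One concrete, low-cost form of “work with the old but encourage the new” in Java terms is deprecation with a migration path rather than removal.  A hypothetical sketch:

```java
import java.net.URI;

public final class Api {
    /** @deprecated use {@link #connect(URI)} instead. */
    @Deprecated(since = "2.0")
    public void connect(String hostAndPort) {
        // The old entry point keeps working, but steers callers toward
        // the new one at their own pace.
        connect(URI.create("tcp://" + hostAndPort));
    }

    public void connect(URI endpoint) {
        // ... new, preferred entry point ...
    }
}
```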

On Seeming Contradictions

It might seem, particularly to the more academically and philosophically inclined, that many of these principles I’ve elaborated contradict each other.  How can we provide regimentation while working with the mentality of harm-reduction?  How can we restrict use to the “good” behaviors while gracefully handling failure?

The solution is in itself another critically important principle: the notion of avoiding absolutist conflicts of ideas.  In the practical world, ideas tend to be useful in certain contexts, and a given principle needs to be evaluated in the context of use.  Put another way, “it’s important to use the right tools for the job”.

On a deeper level, ideological wars tend to end with both sides being “wrong”, and often some synthesis of the two being right.  In physics, the debate over whether light was a wave or a particle raged for centuries; in truth, it’s both.  There are many other such examples throughout history.

Conclusion

If we reduce everything I’ve discussed down to its essence, we arrive at the following statement: as the scale of industrial programming increases, the dominant concerns become irreducible complexity and inelegance, and human factors associated with coordinating effort.  Thus, “industrial pragmatics” refers to a set of techniques and principles for managing these concerns, and industrial programming languages are those languages that consider these techniques in their design.

Slides and Notes from Last Year’s Denotational Semantics Introduction

Last year, I gave a talk at Boston Haskell introducing people to the basics of denotational semantics, starting with Scott’s domain theory and touching on the metric space approaches as well.  I never did post the slides or notes from that talk.

The slides can be found here, and the notes here.

Slides from Making Category Theory Accessible Talk

I have been working on ideas for how to make category theory more accessible and easier to learn, with the belief that it could eventually be reorganized to the point where it could be taught to high school students interested in math.

I gave a 20-minute talk yesterday at Boston Haskell about the progress of my ideas so far.  This stimulated some interesting discussions.  I am posting the slides here.

Slides from My IEEE SecDev Talk

I gave a talk at IEEE SecDev on Nov 3 about my vision for how to combine industrial programming language pragmatics with formal methods.  The slides can be found here.

This was a 5-minute talk, but I will be expanding it into a 30-minute talk with more content.

InfoSec and IoT: A Sustainability Analogy

Yesterday saw a major distributed denial-of-service (DDoS) attack against the DNS infrastructure that crippled internet access for much of the east coast, disabling it for much of the Northeastern US as well as other areas.  These sorts of attacks are nothing new; in fact, this one came on the anniversary of a similar attack fourteen years ago.  Yesterday’s attack is nonetheless significant, both in its scope and in the role of the growing internet of things (IoT) in the attack.

The attack was facilitated by the Mirai malware suite, which specifically targets insecure IoT devices, applying a brute-force password attack to gain access to the machines and deploy its malware.  Such an attack would almost certainly fail if directed against machines with appropriate security measures in place and on which passwords had been correctly set.  IoT devices, however, often lack such protections, are often left with their default login credentials, and often go unpatched (after all, who among even the most eager adopters of IoT can say that they routinely log in to every lightbulb in their house to change the passwords and download patches?).  Yesterday, we saw the negative consequences of the proliferation of these kinds of devices.

Public Health and Pollution Analogies

Industry regulation, whether self-imposed or imposed by the state, is a widely-accepted practice among modern societies.  The case for this practice lies in the reality that some actions are not limited in their effect to oneself and one’s customers; rather, they have a tangible effect on the entire world.  Bad practices in these areas lead to systemic risks that threaten even those who have nothing to do with the underlying culprits.  In such a situation, industry faces a choice between two options, one of which will eventually come to pass: self-regulate, or have regulations imposed from without.

Two classic examples of such a situation come in the form of public health concerns and environmental pollution.  Both of these have direct analogs to the situation we now face with insecure IoT devices and software (in)security in the broader context.

IoT and Pollution

After the third attack yesterday, I posted a series of remarks on Twitter that gave rise to this article, beginning with “IoT is the carbon emissions of infosec. Today’s incident is the climate change analog. It won’t be the last”.  I went on to criticize the current trend of gratuitously deploying huge numbers of “smart” devices without concern for the information security implications.

The ultimate point I sought to advance is that releasing huge numbers of insecure, connected devices into the world is effectively a form of pollution, one with serious negative impacts on information security for the entire internet.  We saw one such result yesterday in the form of one of the largest DDoS attacks and the loss of internet usability for significant portions of the US.  As serious as this attack was, however, it could have been far worse.  Such a botnet could easily be used in far more serious attacks, possibly to the point of causing real damage.  And of course, we’ve already seen cases of “smart” devices equipped with cameras being used to surreptitiously capture videos of unsuspecting people, which are then used for blackmail.

These negative effects, like pollution, affect the world as a whole, not just the subset of those who decide they need smart lightbulbs and smart brooms.  They create a swarm of devices ripe for the plucking for malware, which in turn compromises basic infrastructure and harms everyone.  It is not hard to see the analogies between this and a dirty coal-burning furnace contaminating the air, leading to maladies like acid rain and brown-lung.

Platforms, Methodologies, and Public Health

Anyone who follows me on Twitter or interacts with me in person knows I am harshly critical of the current state of software methodologies, Scrum in particular, and of platforms based on untyped languages, NodeJS in particular.  Make no mistake: Scrum is snake-oil as far as I’m concerned, and NodeJS is a huge step backward as a programming language and a development platform.  The popularity of both has an obvious-enough root cause: the extreme bias towards developing minimally-functional prototypes, or minimum-viable products (MVPs) in Silicon Valley VC lingo.  Scrum is essentially a process for managing “war-room” emergencies, and languages like JavaScript do allow one to throw together a barely-working prototype faster than a language like Java, Haskell, or Rust.  This expedience has a cost, of course: such languages are far harder to secure, to test, and to maintain.

Of course, few consumers really care what sort of language or development methodology is used, so long as they get their product, or so the current conventional wisdom goes.  When we consider the widespread information security implications, however, the picture begins to look altogether different.  Put another way, Zuckerberg’s adage “move fast and break things” becomes irresponsible and unacceptable when the potential exists to break the entire internet.

Since the early 1900s, the US has had laws governing healthcare-related products, as well as food, drugs, and others.  The reasons for this are twofold: first, to protect consumers who lack insight into the manufacturing process, and second, to protect the public from health crises, such as epidemics arising from contaminated products.  In the case of the Pure Food and Drug Act, the call for regulation was driven in large part by the extremely poor quality standards of large-scale industrial food processing, as documented in Upton Sinclair’s work The Jungle.

The root cause of the conditions that led to the regulation of food industries and the conditions that have led to the popularization of insecure platforms and unsound development methodologies is, I believe, the same.  The cause is the competition-induced drive to lower costs and production times combined with a pathological lack of accountability for the quality of products and the negative effects of quality defects.  When combined, these factors consistently lead nowhere good.

Better Development Practices and Sustainability

These trends are simply not sustainable.  They exacerbate an already severe information security crisis, and on a long enough timeline, they stand to cause significant economic damage as a result of attacks like yesterday’s, if not more severe attacks that pose real material risk.

I do not believe government-imposed regulations are a solution to this problem.  In fact, in the current political climate, I suspect such a regulatory effort would end up imposing regulations such as back-doors and other measures that would do more damage to the state of information security than they would help.

The answer, I believe, must come from industry itself and must be led by infosec professionals.  The key is realizing that as is the case with sustainable manufacturing, better development practices are actually more viable and lead to lower eventual costs.  Sloppy practices and bad platforms may cut costs and development times in the now, but in the long run they end up costing much more.  This sort of paradigm shift is neither implausible nor unprecedented.  Driving it is a matter of educating industrial colleagues about these issues and the benefits of more sound platforms and development processes.

Summary

Yesterday’s attack brought to the forefront the potential for the proliferation of insecure devices and software to have a profound negative effect on the entire world.  A key root cause is an outdated software development paradigm that ignores these factors in favor of the short-term view.  It falls to the infosec community to bring about the necessary change toward a more accurate view and more sound and sustainable practices.

FreeBSD OS-Level Tamper-Resilience

I’ve posted about my work on EFI GELI support.  This project is actually the first step in a larger series of changes that I’ve been sketching out since April.  The goal of the larger effort is to implement tamper-resilience features at the OS level for FreeBSD.  The full-disk encryption capabilities provided by GELI boot support represent the first step in this process.

OS-Level Tamper-Resilience

Before I talk about the work I’m planning to do, it’s worth discussing the goals and the rationale for them.  One of the keys to effective security is an accurate and effective threat model; another is identifying the scope of the security controls to be put in place.  This kind of thinking is important for this project in particular, where it’s easy to conflate threats stemming from vulnerable or malicious hardware with vulnerabilities at the OS level.

Regarding terminology: “tamper-resistance” means the ability of a device to resist a threat agent who seeks to gain access to the device while it is inactive (in a suspended or powered-off state) in order to exfiltrate data or install malware of some kind.  I specifically use the term “tamper-resilience” to refer to tamper-resistance features confined to the OS layer to acknowledge the fact that these features fundamentally cannot defeat threats based on hardware or firmware.

Threat Model

In our threat model, we have the following assets:

  • The operating system kernel, modules, and boot programs.
  • Specifically, a boot/resume program to be loaded by hardware, which must be stored as plaintext.
  • The userland operating system programs and configuration data.
  • The user’s data.

We assume a single threat agent with the following capabilities:

  • Read from and write to any permanent storage medium (such as a disk) while the device is suspended or powered off.
  • Make copies of any volatile memory (such as RAM) while the device is suspended.
  • Defeat any sort of physical security or detection mechanisms to do so.

Specifically, the following capabilities are considered out-of-scope (they are to be handled by other mechanisms):

  • Accessing the device while powered on and in use.
  • Attacks based on hardware or firmware tampering.
  • Attacks based on things like bug devices, reading EM radiation (van Eck phreaking), and the like.
  • Attacks based on causing users to install malware while using the device.

Thus, the threat model is based on an attacker gaining access to the device while powered-off or suspended and tampering with it at the OS level and up.

It is important to note that hardware/firmware tampering is a real and legitimate threat, one deserving of effort.  However, it is a separate and parallel concern that requires its own effort.  Moreover, if the OS level has weaknesses, no amount of hardware or firmware hardening can compensate for them.

Tamper-Resilience Plan

The tamper resilience plan is based around the notion of protecting as much data as possible through authenticated encryption, using cryptographic verification to ensure that any part of the boot/resume process whose program must be stored as plaintext is not tampered with, and ensuring that no other data is accessible as plaintext while suspended or powered off.

The work on this breaks down into roughly three phases, one of which I’ve already finished.

Data Protection and Integrity

All data aside from the boot program loaded by the hardware (known in FreeBSD as boot1) can be effectively protected at rest by a combination of ZFS with SHA256 verification and the GELI disk encryption scheme.  Full-disk encryption protects data from theft, and combining it with ZFS’ integrity checks based on a cryptographically-secure hash function prevents an attacker from tampering with the contents (such tampering is actually possible even on encrypted data when no authentication scheme is in play).

Secure Boot

There is always at least one program that must remain unprotected by full-disk encryption: the boot entry-point program.  Fortunately, the EFI platform provides a mechanism for ensuring the integrity of the boot program.  EFI secure boot uses public-key crypto to allow the boot program to be signed by a private key and verified by a public key that is provided to the firmware.  If the verification fails, then the firmware informs the user that their boot program has been tampered with and aborts the boot.

In an open-source OS like FreeBSD, this presents an effective protection scheme along with full-disk encryption.  On most desktops and laptops, we build the kernel and boot loaders on the machine itself.  We can simply store a machine-specific signing key on the encrypted partition and use it to sign the boot loader for that machine.  The only way an attacker could forge the signature would be to gain access to the signing key, which is stored on an encrypted partition.  Thus, the attacker would have to already have access to the encrypted volume in order to forge a signature and tamper with the boot program.

To achieve the baseline level of protection, we need to ensure that the plaintext boot program is signed, and that it verifies the signature of a boot stage that is stored on an encrypted volume.  Because of the way the EFI boot process works, it is enough to sign the EFI boot1 and loader programs.  The loader program is typically stored on the boot device itself (which would be encrypted), and loaded by the EFI LOAD_IMAGE_PROTOCOL interface, which performs signature verification.  Thus, it should be possible to achieve baseline protection without having to modify boot1 and loader beyond what I’ve already done.

There is, of course, a case for doing signature verification on the kernel and modules.  One can even imagine signature verification on userland programs.  However, this is out-of-scope for the discussion here.

Secure Suspend/Resume

Suspend/resume represents the most significant tamper weakness at present.  Suspend/resume in FreeBSD is currently only implemented for the suspend-to-memory sleep state.  This means that an attacker who gains access to the device while suspended effectively has access to the device at runtime.  More specifically, they have all of the following:

  • Access to the entire RAM memory state
  • Sufficient data to decrypt all mounted filesystems
  • Sufficient data to decrypt any encrypted swap partitions
  • Possibly the signing key for signing kernels

There really isn’t a way to protect a system that’s suspended to memory.  Even if you were to implement what amounts to suspend-to-disk by unmounting all filesystems and writing the kernel and all program state out to encrypted disk storage, you would still resume by starting execution at a specified memory address.  The attacker can simply implant malware in that process if they have the ability to tamper with RAM.

Thus, the only secure way to do suspend/resume is to tackle suspend-to-disk support for FreeBSD.  Of course, it also has to be done securely.  The scheme I have in mind for doing so looks something like this:

  • Allow users to specify a secure suspend partition and set a resume password.  This can be done with a standard GELI partition.
  • Use the dump functionality to write out the entire kernel state to the suspend partition (because we intend to resume, we can’t do the usual trick of dumping to the swap space, as we need the data that’s stored there)
  • Alternatively, since the dump is being done voluntarily, it might be possible to write out to a filesystem (normally, dumps are done in response to a kernel panic, so the filesystem drivers are assumed to be corrupted).
  • Have the suspend-to-disk functionality sign the dumped state with a resume key (this can be the signing key for boot1, or it can be another key that’s generated during the build process)
  • Make boot1 aware of whatever it needs to know for detecting when resuming from disk and have it request a password, load the encrypted dumped state, and resume.

There are, of course, a lot of issues to be resolved in doing this sort of thing, and I imagine it will take quite some time to implement fully.

Going Beyond

Once these three things are implemented, we’d have a baseline of tamper-resilience in FreeBSD.  Of course, there are ways we could go further.  For one, signed kernels and modules are a good idea.  There has also been talk of a signed executable and libraries framework.

Current Status

My GELI EFI work is complete and waiting for testing before going through the integration process.  There are already some EFI signing utilities in existence.  I’m currently testing too many things to feel comfortable about trying out EFI signing (and I want to have a second laptop around before I do that kind of thing!); however, I plan on getting the baseline signed boot1 and loader scheme working, then trying to alter the build process to support automatically generating signed boot1 and loader programs.

The kernel crypto framework currently lacks public-key crypto support, and it needs some work anyway.  I’ve started working on a design for a new crypto library, which I intend to use to replace the boot_crypto code in my GELI work and eventually the code in the kernel.  I’ve also heard of others working on integrating LibreSSL.  I view this as a precursor to the more advanced work, like secure suspend/resume and kernel/module signing.

However, we’re currently in the middle of the 11 release process and there are several major outstanding projects (my GELI work, the i915 graphics work).  In general, I’m reluctant to move forward until those things calm down a bit.

FreeBSD Librem Update

I’ve been working on my FreeBSD setup for Purism’s Librem 13 laptop ever since receiving it back in April.  I’m relatively pleased with the way things have progressed, and most of the critical issues have been addressed.  However, the setup still has a way to go in my opinion before it gets to the point of being the “ideal” setup.

Current State

Three of the four critical issues I identified back in April have been addressed:

  • Matt Macy’s i915 graphics patch works well on the Librem 13, and I personally made sure that the suspend/resume support works.  The patch is very stable on the Librem, and I’ve only had one kernel panic the entire time testing it.
  • The HDMI output Just Works™ with the i915 driver.  Even better, it works for both X11 and console modes.
  • Full support for the Atheros 9462 card has been merged in.  I’ve had some occasional issues, but it works for the most part.
  • The VESA weirdness is obviated by i915 support, but it was resolved by using the scfb driver.

While not Librem-specific, I have been working primarily on GELI support for EFI.  This is also part of a larger set of tamper-resilience features I have planned to implement for FreeBSD.

Additional Configuration Notes

I’ve played around with the configuration a bit since April, and discovered a few things in the process:

  • The mouse is frozen after suspend.  The way to un-stick it is to restart moused (for some reason, this doesn’t work when done in rc.resume), then move your finger on the pad while pressing the Fn+F1 (mouse lock) keys until it un-sticks.
  • PulseAudio works fine, but the default settings end up sending output to the HDMI port even if nothing is plugged in.  Simply change the default sink, and it should work fine.  Sound without PulseAudio also works fine, as I previously reported.

Current Plans

I have plans to do more work on this port, especially with my Librem 15 hopefully arriving soon.  I hope to eventually turn this platform into a fully-supported FreeBSD laptop.

Current To-Do’s

There are still a number of items to be addressed, of course.  Some of these are issues, and some are longer-term work I’ve planned.

Sporadic Boot Hang

The only real “issue” at this point is a boot hang that manifests sporadically, causing boots to stall for several minutes before ending in a kernel panic.  Fortunately, the machine tends not to experience this problem after rebooting from such a panic.  Thus, it’s an annoying problem, but not a show-stopper.

After observing this phenomenon many times, I suspect a bug in either the AML execution engine or the AML itself that causes AML execution to go into an infinite recursion before crashing from a stack overflow.  After this happens, the hardware seems to be in an indeterminate state and causes a kernel panic later on.  Resume also sometimes causes the machine to freeze, and I suspect this to be the root cause of that as well.

This ought to be fairly straightforward to hunt down; the only confounding factor is the fact that it is sporadic, and tends to only happen after the machine has been shut down for a time.

Device Support

All the important devices are supported.  Support is missing for some minor things, though:

  • Synaptics support is not present, as there is no support for the BYD touchpad beyond the regular PS/2 mouse driver.  I am currently starting an effort to port Purism’s work on the Linux PS/2 driver to FreeBSD.
  • I also plan to look into support for things like screen brightness adjustment, either bringing over support from the Linux drivers or working out the configs to make the brightness keys work if the driver support is already there.
  • A few of the hotkeys work (the mouse lock and the screen blank), others don’t.  Currently non-functional hotkeys include the volume keys, the brightness keys, the bluetooth and wireless keys, the suspend key, and the video output key.  This probably involves writing an ACPI extras driver similar to acpi_ibm and company.

Longer-Term Plans

My longer-term plans mostly revolve around security, integrity, and tamper-resilience.  Some are related to the hardware platform, while others are general FreeBSD features.

FreeBSD Tamper Resilience

My GELI EFI work was the first step in a series of projects I have planned regarding tamper-resilience features at the OS level for FreeBSD.  I’ve hinted at this, and will write a full post on my plans, but here is a sketch of the process:

  1. Full-Disk Encryption for EFI (done): This was addressed by my GELI EFI work
  2. Secure Boot and Loading: Provide support in EFI boot1/loader and the kernel for signed images.  Wire it in with EFI Secure Boot architecture.
  3. Secure Hibernate: Implement the ability to suspend to disk and resume in a secure fashion.  This is the coup de grâce, as it means the machine’s data is fully encrypted when powered off or suspended.

I am currently looking at building a new, better crypto library, as the current crypto framework is in need of some cleanup.

CoreBoot and TianoCore

Once my Librem 15 arrives and I have two machines to work with, I plan to replace the proprietary BIOS image with CoreBoot and an open-source EFI implementation (likely TianoCore).  In addition to the obvious benefits in terms of openness and control over my own hardware, this affords certain possibilities like baking extra EFI filesystem drivers into the firmware, allowing me to get rid of the unencrypted FAT-formatted EFI system partition.

It also might be possible to boot straight into FreeBSD and avoid the EFI stuff altogether (this is possible with Linux, and leads to super-fast boots).

Disabling Intel ME

I have a good document on disabling the Intel Management Engine altogether.  This is the one goal that Purism was unable to realize for their Librem laptops (which is fine in my opinion; the push for open hardware is a process, and Purism’s success was a major step forward in that process).

The final stroke in all this is to try and set up a method to disable the ME completely.  While this is obviously platform-specific, it would represent a major victory for the forces of openness.

Conclusions

The Librem platform represents the best chance I’ve ever seen of creating the “perfect” FreeBSD setup.  It’s going to take some work to get to the laptop I’ve always wanted, of course, but stay tuned and hopefully we’ll get there eventually.