Slides from my BSDCan Talk

These are the slides from my BSDCan talk on a system-level public-key trust system for FreeBSD.


A Potential Meltdown/Spectre Countermeasure

I posted a proposed countermeasure for the meltdown and spectre attacks to the freebsd-security mailing list last night.  Having slept on it, I believe the reasoning is sound, but I still want to get input on it.

MAJOR DISCLAIMER: This is an idea I only just came up with last night, and it still needs to be analyzed.  There may very well be flaws in my reasoning here.

Countermeasure: Non-Cacheable Sensitive Assets

The crux of the countermeasure is to move sensitive assets (i.e. keys, passwords, crypto state, and the like) into a separate memory region, and to mark that region non-cacheable using MTRRs or the equivalent functionality on other architectures.  I’ll assume for now that the rationale for why this should work (given below) holds.

This approach has two significant downsides:

  1. It requires modification of applications, and it’s susceptible to information leaks from careless programming, missing sensitive assets, old code, and other such problems.
  2. It drastically increases the cost of accessing sensitive assets (a full main-memory access), which is especially punitive if you end up using the sensitive-asset storage as scratchpad space.

The upside of the approach is that it’s compatible with a move toward storing sensitive assets in secure memory or in special devices, such as a TPM, or the flash device I suggested in a previous post.

Programmatically, I could see this looking like a kind of “malloc with attributes” interface, one of the attributes being something like “sensitive”.  I’ll save the API design for later, though.
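
Purely to make the shape of such an interface concrete, here is a minimal sketch.  The names (malloc_attr, MALLOC_ATTR_SENSITIVE) are hypothetical placeholders rather than a proposed API, and the fallback implementation simply wraps malloc() where a real one would allocate from a pool backed by non-cacheable memory.

    #include <stdlib.h>

    #define MALLOC_ATTR_NONE      0x0u
    #define MALLOC_ATTR_SENSITIVE 0x1u  /* request non-cacheable backing */

    /* Placeholder implementation: just wraps malloc().  A real implementation
     * would satisfy sensitive allocations from a region marked non-cacheable
     * via MTRRs (or the platform equivalent). */
    static void *
    malloc_attr(size_t size, unsigned int attrs)
    {
        (void)attrs;
        return malloc(size);
    }

    struct session_keys {
        unsigned char enc_key[32];
        unsigned char mac_key[32];
    };

    int
    main(void)
    {
        /* Key material the countermeasure would keep out of the cache hierarchy. */
        struct session_keys *keys =
            malloc_attr(sizeof(*keys), MALLOC_ATTR_SENSITIVE);

        /* ... use keys, then scrub and free ... */
        free(keys);
        return 0;
    }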

Rationale

In this rationale, I’ll borrow the terminology “transient operations” to refer to any instruction, or portion thereof, which is executed but whose effects will eventually be cancelled due to a fault.  In architecture terminology, this is called “squashing” the operation.  The rationale for why this will work hinges on three assumptions about how any processor pipeline and any potential side-channel must necessarily work:

  1. Execution must obey dependency relationships (I can’t magically acquire data for a dependent computation, unless it’s cached somewhere)
  2. Data which never reaches the CPU core as input to a transient operation cannot make it into any side-channel
  3. CPU architects will squash all transient operations as quickly as possible once a fault or mispredicted branch is discovered, so as to recover execution units for use on other speculative work

Defeating Meltdown

The Meltdown attack depends on being able to execute transient operations that depend on data loaded from a protected address, in order to inject information into a side-channel before the fault is detected.  The cache and TLB states are critical to this process.  For this analysis, assume the cache is virtually indexed (see below for the physically-indexed case), and break down the outcomes based on whether the given location is in the cache and the TLB:

  • Cache Hit, TLB Hit: There is a race between the TLB and the cache coming back.  TLBs are typically smaller, so the TLB lookup is unlikely to come back after the cache access; the fault is therefore detected almost immediately.
  • Cache Hit, TLB Miss: There is a race between a page-table walk (potentially thousands of cycles) and a cache hit.  This means you get the data back and have a long time to execute transient operations.  This is the main case for Meltdown.
  • Cache Miss, TLB Hit: The cache fill operation strongly depends on address translation, which signals a fault almost immediately.
  • Cache Miss, TLB Miss: The cache fill operation strongly depends on address translation, which signals a fault after a page-table walk.  You’re stalled for potentially thousands of cycles, but you cannot fetch the data from memory until the address translation completes.

Note that both of the cache-miss cases defeat the attack.  Thus, storing sensitive assets in non-cacheable memory should prevent it.  Now, if your cache is physically indexed, then every lookup depends on an address translation, and therefore on fault detection, so you’re still safe.
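
For concreteness, the heart of the published Meltdown attack is a transient sequence along the lines of the following sketch (simplified to illustrate the dependency argument above; it is not a working exploit, and the names are made up):

    #include <stdint.h>

    /* Attacker-controlled probe array: one page (and thus one distinct cache
     * line) per possible byte value, flushed from the cache beforehand. */
    static volatile uint8_t probe_array[256 * 4096];

    /* The first load faults, but may transiently return data if that data is
     * already in the cache; the dependent load then leaves a cache footprint
     * indexed by the secret byte.  If *kernel_ptr is not cacheable, the first
     * load cannot complete before address translation (and with it the
     * permission check), so nothing ever reaches the dependent load. */
    void
    transient_gadget(const volatile uint8_t *kernel_ptr)
    {
        uint8_t secret = *kernel_ptr;
        (void)probe_array[secret * 4096];
    }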

A New Attack?

In my original posting to the FreeBSD lists, I had evidently misunderstood the Spectre attack, and ended up making up a whole new attack (!!).  This attack is still defeated by the non-cacheable memory trick.  The attack works as follows:

  1. Locate code in another address space which can potentially be used to gather information about sensitive assets
  2. Pre-load registers to force the code to access sensitive information
  3. Jump to the code, causing your branch predictors to soak up data about the sensitive information
  4. When the fault kicks you back out, harvest information from branch predictors

This is defeated by the non-cacheable store as well, by the same reasoning.

Aside: this really is a whole new class of attack.

High-Probability Defense Against the Spectre Attack

The actual Spectre attack relies on causing speculative execution in another process to produce cache effects that are detectable from our own process.  The non-cacheable store is not an absolute defense against this; however, it does defeat the attack with very high probability.

The reasoning here is that any branch of speculative execution is highly unlikely to last longer than a full main memory access (which is possibly thousands of cycles).  Branch mispredictions in particular will likely last a few dozen cycles at most, and execution of the mispredicted branch will almost certainly be squashed before the data actually arrives.  Thus, it can’t make it into any side-channel.

Summary

This is a potential defense against speculative-execution-based side-channel attacks.  It works by restoring the dependency between fault detection and memory accesses to sensitive assets, at the cost of a general access delay for those assets.

This blocks any speculative branch which will eventually cause a fault from accessing sensitive information, since doing so necessarily depends on fault detection.  This has the effect of defeating attacks that rely on speculative execution of transient operations on this data before the fault is detected.

This also defeats attacks which observe side-channels manipulated by speculative branches in other processes that can be made to access sensitive data, as the delay makes it extremely unlikely that data will arrive before the branch is squashed.

Future

Assuming the reasoning here is sound, I plan to start working on implementing this for FreeBSD immediately.  Additionally, this defense is a rather coarse mechanism which repurposes MTRRs to effectively mark data as “not to be speculatively accessed”.  A new architecture such as RISC-V could design more refined mechanisms.  Finally, the API for this defense ought to be designed to provide a general mechanism for storing sensitive assets in “a secure location”, which could be non-cacheable memory, a TPM, a programmed flash device, or something else.

Meltdown/Spectre and Implications for OS Security

This week has seen the disclosure of a devastating vulnerability: Meltdown, and its sister vulnerability, Spectre.  Both are symptoms of a larger problem that has evidently gone unnoticed for the better part of 50 years of computer architecture: the potential for side-channels in processor pipeline designs.

Aside: I will confess that I am rather ashamed that I didn’t notice this, despite having a background in all the knowledge areas that should have revealed it.  It took me all of three sentences of the paper before I realized what was going on.  Then again, this somehow got by every other computer scientist for almost half a century.  The conclusion seems to be that we are, all of us, terrible at our jobs…

A New Class of Attack, and Implications

I won’t go into the details of the attacks themselves; there are already plenty of good descriptions of them, not the least of which is the paper itself.  My purpose here is to analyze the broader implications, with particular focus on the operating-systems security perspective.

To be absolutely clear, this is a very serious class of vulnerabilities.  It demands immediate and serious attention from both hardware architects and OS developers, not to mention people further up the stack.

The most haunting thing about this disclosure is that it suggests the existence of an entire new class of attack, of which the Meltdown/Spectre attacks are merely the most obvious instances.  The problem, which has evidently been lurking in processor architecture design for half a century, has to do with two central techniques in processor design:

  1. Processors tend to enhance performance by making commonly-executed things faster.  This is done by a whole host of features: memory caches, trace caches, and branch predictors, to name the more common techniques.
  2. Processor design has followed a general paradigm of altering the order in which operations are executed, whether by overlapping multiple instructions (pipelining), executing multiple instructions in parallel (superscalar), executing them out of order and then reconciling the results (out-of-order execution), or executing multiple possible outcomes and keeping only one result (speculative execution).

The security community is quite familiar with the impact of (1): it is the basis for a host of side-channel vulnerabilities.  Much of the work in computer architecture deals with implementing the techniques in (2) while maintaining the illusion that everything is executed one instruction at a time.  CPUs are carefully designed to preserve that illusion for the architectural state defined by the ISA, tracking which operations depend on which and keeping track of the multiple possible worlds that arise from speculation.  What evidently got by us is that the side-channels provided by (1) give an attacker the ability to violate the illusion of executing one instruction at a time.

What this almost certainly means is that this is not the last attack of this kind we will see.  Out-of-order speculative execution happens to be the most straightforward to attack.  However, I can sketch out ways in which even in-order pipelines could potentially be vulnerable.  When we consider more obscure architectural features such as trace-caches, branch predictors, instruction fusion/fission, and translation units[0], it becomes almost a certainty that we will see new attacks in this family published well into the future.

A complicating factor in this is that CPU pipelines are extraordinarily complicated, particularly these days.  (I recall a tech talk I saw about the SPARC processor, and its scope was easily as large as the FreeBSD core kernel, or the HotSpot virtual machine.)  Moreover, these designs are typically closed.  There are a lot of different ways a core can be designed, even for a simple ISA, and we can expect entire classes of these design decisions to be vulnerable.


[0]: The x86 architecture tends to have translation units that are significantly more complex than the cores themselves, and are likely to be a nest of side-channels.

OS Security

The implications of this for OS (and even application) security are rather dismal.  The one saving virtue in this is that these attacks are read-only: they can be used to exfiltrate data from any memory immediately accessible to the pipeline, but it’s a pretty safe assumption that they cannot be used to write to memory.

That boon aside, this is devastating to OS security.  As a kernel developer, there aren’t any absolute defenses that we can implement without a complete overhaul of the entire paradigm of modern operating systems and major revisions to the POSIX specification.  (Even then, it’s not certain that this would work, as we can’t know how the processors implement things under the hood.)

Therefore, OS developers are left with a choice among partial solutions.  We can make the attacks harder, but we cannot stop them in all cases.  The only perfect defense is to replace the hardware, and hope nobody discovers a new side-channel attack.

Attack Surface and Vulnerable Assets

The primary function of these kinds of attacks is to exfiltrate data across various isolation boundaries.  The following is the most general description of the capabilities of the attack, as far as I can tell:

  1. Any data that can be loaded into a CPU register can potentially influence the execution of some number of transient operations before a fault is detected
  2. These execution patterns can affect the performance of future operations, giving rise to a side-channel
  3. These side-channels can be read through various means (note that this need not be done by the same process)

This gives rise to the ability to violate various isolation boundaries (though only for the purposes of observing state):

  • Reads of kernel/hypervisor memory from a guest (aka “meltdown”)
  • Reads of another process’ address space (aka “spectre”)
  • Reads across intra-process isolation mechanisms, such as different VM instances (this attack is not named, but constitutes, among other things, a perfect XSS attack)

A salient feature of these attacks is that they are somewhat slow: as currently described, the attack will incur 2n + 1 cache faults to read out n bits (not counting setup costs).  I do not expect this to last, however.
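
To make the readout step concrete, the following is a sketch of the widely published Flush+Reload probing technique, which is the usual way these cache side-channels are read.  It is illustrative only: the threshold is a placeholder that would have to be calibrated per machine, and the intrinsics are x86-specific.

    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush(), __rdtscp() */

    #define CACHE_HIT_THRESHOLD 100  /* cycles; placeholder, calibrate per machine */

    static uint8_t probe_array[256 * 4096];

    /* Flush every probe line, so that a later fast access can only mean the
     * victim's transient execution touched that line. */
    void
    flush_probe_array(void)
    {
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe_array[i * 4096]);
    }

    /* Time a single load; a fast load indicates the line is cached. */
    int
    is_cached(const volatile uint8_t *addr)
    {
        unsigned int aux;
        uint64_t start = __rdtscp(&aux);
        (void)*addr;
        uint64_t end = __rdtscp(&aux);
        return (end - start) < CACHE_HIT_THRESHOLD;
    }

    /* After the victim's transient access, scan the probe array for the one
     * hot line; its index is the exfiltrated byte value. */
    int
    recover_byte(void)
    {
        for (int value = 0; value < 256; value++)
            if (is_cached(&probe_array[value * 4096]))
                return value;
        return -1;
    }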

The most significant danger of these attacks is the exfiltration of secret data, particularly small and long-lived data objects.  Examples of key assets include:

  • Master keys for disk encryption (example: the FreeBSD GELI disk encryption scheme)
  • TLS session keys
  • Kerberos tickets
  • Cached passwords
  • Keyboard input buffers, or textual input buffers

More transient information is also vulnerable, though there is an attack window.  The most vulnerable assets are those keys which necessarily must remain in memory and unencrypted for long periods of time.  The best example of this I can think of is a disk encryption key.

Imperfect OS Defense Mechanisms

Unfortunately, operating systems are limited in their ability to adequately defend against these attacks, and moreover, many of these mechanisms incur a performance penalty.

Separate Kernel and User Page Tables

The solution adopted by the Linux KAISER patches is to unmap most of the kernel from user space, keeping only essential CPU metadata and the code to switch page tables mapped.  Unfortunately, this requires a TLB flush for every system call, incurring a rather severe performance penalty.  Additionally, this only protects against accesses to kernel addresses; it cannot stop accesses to other processes’ address spaces or crossings of isolation boundaries within a process.

Make Access Attempts to Kernel Memory Non-Recoverable

An idea I proposed on the FreeBSD mailing lists is to make attempted memory accesses to a kernel address range result in an immediate cache flush followed by non-recoverable process termination.  This avoids the cost of separate kernel and user page tables, but it is not a perfect defense, in that there is a small window of time in which the attack can still be carried out.

Special Handling of Sensitive Assets

Another potential middle ground is to handle sensitive kernel assets specially, storing them in a designated location in memory (or better yet, outside of main memory).  If a main-memory store is used, this range alone can be mapped and unmapped when crossing into kernel space, thus avoiding most of the overhead of a full TLB flush.  This would only protect assets stored in this manner, however; anything else (most notably the stacks of kernel threads) would remain vulnerable.

Userland Solutions

Userland programs are even less able to defend against such attacks than the kernel; however, there are architectural measures that can help.

Avoid Holding Sensitive Information for Long Periods

As with kernel-space sensitive assets, one mitigation is to avoid retaining sensitive assets for long periods of time.  An example of this is GPG, which keeps the user’s keys decrypted only for a short period of time (usually 5 minutes) after they request to use a key.  While this is not perfect, it limits attacks to a brief window, and gives users the ability to adopt practices that ensure other means of attack are shut off during this window.

Minimize Potential for Remote Execution

While this only works on systems that are effectively single-user, minimizing the degree to which remote code can execute on one’s system closes many avenues of attack.  This is obviously difficult in the modern world, where most websites pull in JavaScript from dozens if not hundreds of sources, but it can be done with browser plugins like NoScript.

JIT/Interpreter Mitigations

As JIT compilers and interpreters are a common mechanism for limited execution of remote code, they are in a position to attempt to mitigate many of these attacks (there is an LLVM patch out to do this).  This is once again an imperfect defense, as they are on the wrong side of Rice’s theorem.  There is no general mechanism for detecting these kinds of attacks, particularly where multiple threads of execution are involved.
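
As an illustration of the kind of transformation a compiler or JIT can apply, one widely described mitigation for the bounds-check-bypass variant is to clamp array indices with a branchless mask, in the same spirit as the Linux kernel’s array_index_nospec() helper.  The sketch below is only illustrative: the helper name is made up, and a production version uses inline assembly so the compiler cannot turn the comparison back into a branch.

    #include <stddef.h>
    #include <stdint.h>

    static uint8_t table[256];

    /* Branchless clamp: returns idx when idx < size, and 0 otherwise. */
    static inline size_t
    clamp_index_nospec(size_t idx, size_t size)
    {
        size_t mask = (size_t)0 - (size_t)(idx < size);
        return idx & mask;
    }

    uint8_t
    lookup(size_t idx, size_t len)
    {
        if (idx >= len || len > sizeof(table))
            return 0;
        /* Even if the bounds check above is speculated past, the masked
         * index cannot form an out-of-bounds address. */
        return table[clamp_index_nospec(idx, len)];
    }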

Interim Hardware Mitigations

The only real solution is to wait for hardware updates that fix the vulnerabilities.  This takes time, however, and we would like some interim solution.

One possible mitigation strategy is to store and process sensitive assets outside of main memory.  HSMs and TPMs are an example of this (though not one I’m apt to trust), and the growth of the open-hardware movement offers additional options.  One option in particular, which I may put into effect myself, is a small FPGA device (such as the PicoEVB) programmed with an open design to serve this role.

More generally, I believe this incident highlights the value of these sorts of hardware solutions, and I hope to see increased interest in developing open implementations in the future.

Summary

The Meltdown and Spectre attacks, severe as they are by themselves, represent the beginning of something much larger.  A new class of attack has come to the forefront, one that enables data exfiltration in spite of hardware security mechanisms, and we can and should expect to see more vulnerabilities of this kind in the future.  What makes these vulnerabilities so dangerous is that they directly compromise the hardware protection mechanisms on which OS security depends, and thus there is no perfect defense against them at the OS level.

What can and should be done is to adapt and rethink security approaches at all levels.  At the hardware level, a paradigm shift is necessary to avoid vulnerabilities of this kind, as is the consideration that hardware security mechanisms are not as absolute as has been assumed.  At the OS level, architectural changes and new approaches can harden an OS against these vulnerabilities, but they generally cannot repel all attacks.  At the user level, similar architectural changes as well as minimization of attack surfaces can likewise reduce the likelihood and ease of attack, but cannot repel all such attacks.

As a final note, the trend in information security has been an increasing focus on application security, and particularly web-application security.  This event, as well as the fact that a relatively simple vulnerability managed to evade detection by the entire field of computer science for half a century, strongly suggests that systems security is not at all the solved problem it is commonly assumed to be, and that new thinking and new approaches are needed in this area.

Slides from My IEEE SecDev Talk

I gave a talk at IEEE SecDev on Nov 3 about my vision for how to combine industrial programming language pragmatics with formal methods.  The slides can be found here.

This was a 5-minute talk, but I will be expanding it into a 30-minute talk with more content.

InfoSec and IoT: A Sustainability Analogy

Yesterday saw a major distributed denial-of-service (DDoS) attack against the DNS infrastructure that crippled internet access for much of the east coast, particularly the Northeastern US, as well as other areas.  These sorts of attacks are nothing new; in fact, this attack came on the anniversary of a similar attack fourteen years ago.  Yesterday’s attack is nonetheless significant, both in its scope and in the role of the growing internet of things (IoT) in the attack.

The attack was facilitated by the Mirai malware suite, which specifically targets insecure IoT devices, applying a brute-force password attack to gain access to the machines and deploy its malware.  Such an attack would almost certainly fail if directed against machines with appropriate security measures in place and on which passwords had been correctly set.  IoT devices, however, often lack such protections, are often left with their default login credentials, and often go unpatched (after all, who among even the most eager adopters of IoT can say that they routinely log in to every lightbulb in their house to change the passwords and download patches?).  Yesterday, we saw the negative consequences of the proliferation of these kinds of devices.

Public Health and Pollution Analogies

Industry regulation, whether self-imposed or imposed by the state, is a widely accepted practice among modern societies.  The case for this practice lies in the reality that some actions are not limited in their effects to oneself and one’s customers, but rather have a tangible effect on the entire world.  Bad practices in these areas lead to systemic risks that threaten even those who have nothing to do with the underlying culprits.  In such a situation, industry faces a choice between two options, one of which will eventually come to pass: self-regulate, or have regulations imposed from without.

Two classic examples of such a situation come in the form of public health concerns and environmental pollution.  Both of these have direct analogs to the situation we now face with insecure IoT devices and software (in)security in the broader context.

IoT and Pollution

After the third attack yesterday, I posted a series of remarks on Twitter that gave rise to this article, beginning with “IoT is the carbon emissions of infosec. Today’s incident is the climate change analog. It won’t be the last”.  I went on to criticize the current trend of gratuitously deploying huge numbers of “smart” devices without concern for the information security implications.

The ultimate point I sought to advance is that releasing huge numbers of insecure, connected devices into the world is effectively a form of pollution, and it has serious negative impacts on information security for the entire internet.  We saw one such result yesterday in the form of one of the largest DDoS attacks to date and the loss of internet usability for significant portions of the US.  As serious as this attack was, however, it could have been far worse.  Such a botnet could easily be used in far more serious attacks, possibly to the point of causing real damage.  And of course, we’ve already seen cases of “smart” devices equipped with cameras being used to surreptitiously capture videos of unsuspecting people, which are then used for blackmail.

These negative effects, like pollution, affect the world as a whole, not just the subset of people who decide they need smart lightbulbs and smart brooms.  They create a swarm of devices ripe for the plucking by malware, which in turn compromises basic infrastructure and harms everyone.  It is not hard to see the analogy between this and a dirty coal-burning furnace contaminating the air, leading to maladies like acid rain and brown lung.

Platforms, Methodologies, and Public Health

Anyone who follows me on Twitter or interacts with me in person knows I am harshly critical of the current state of software methodologies, Scrum in particular, and of platforms based on untyped languages, NodeJS in particular.  Make no mistake: Scrum is snake oil as far as I’m concerned, and NodeJS is a huge step backward as a programming language and development platform.  The popularity of both has an obvious enough root cause: the extreme bias toward developing minimally-functional prototypes, or minimum viable products (MVPs) in Silicon Valley VC lingo.  Scrum is essentially a process for managing “war-room” emergencies, and languages like JavaScript do allow one to throw together a barely-working prototype faster than a language like Java, Haskell, or Rust.  This expedience has a cost, of course: such languages are far harder to secure, to test, and to maintain.

Of course, few consumers really care what sort of language or development methodology is used, so long as they get their product; or so the current conventional wisdom goes.  When we consider the widespread information-security implications, however, the picture begins to look altogether different.  Put another way, Zuckerberg’s adage “move fast and break things” becomes irresponsible and unacceptable when the potential exists to break the entire internet.

Since the early 1900s, the US has had laws governing healthcare-related products, as well as food, drugs, and others.  The reasons for this are twofold: first, to protect consumers who lack insight into the manufacturing process, and second, to protect the public from health crises such as epidemics that arise from contaminated products.  In the case of the Pure Food and Drug Act, the call for this regulation was driven in large part by the extremely poor quality standards of large-scale industrial food processing, as documented in Upton Sinclair’s work The Jungle.

The root cause of the conditions that led to the regulation of food industries and the conditions that have led to the popularization of insecure platforms and unsound development methodologies is, I believe, the same.  The cause is the competition-induced drive to lower costs and production times combined with a pathological lack of accountability for the quality of products and the negative effects of quality defects.  When combined, these factors consistently lead nowhere good.

Better Development Practices and Sustainability

These trends are simply not sustainable.  They exacerbate an already severe information-security crisis, and on a long enough timeline they stand to cause significant economic damage as a result of attacks like yesterday’s, if not more severe attacks that pose a real material risk.

I do not believe government-imposed regulations are a solution to this problem.  In fact, in the current political climate, I suspect such a regulatory effort would end up imposing requirements such as back-doors and other measures that would do more damage to the state of information security than they would help.

The answer, I believe, must come from industry itself and must be led by infosec professionals.  The key is realizing that, as with sustainable manufacturing, better development practices are actually more viable and lead to lower eventual costs.  Sloppy practices and bad platforms may cut costs and development times now, but in the long run they end up costing much more.  This sort of paradigm shift is neither implausible nor unprecedented.  Driving it is a matter of educating industry colleagues about these issues and about the benefits of sounder platforms and development processes.

Summary

Yesterday’s attack brought to the forefront the potential for the proliferation of insecure devices and software to have a profound negative effect on the entire world.  A key root cause of this is an outdated paradigm in software development that ignores these factors in favor of the short-term view.  It falls to the infosec community to bring about the necessary change toward a more accurate view and more sound and sustainable practices.

FreeBSD EFI boot/loader Refactor

I have just completed (for some value of “complete”) a project to refactor the FreeBSD EFI boot and loader code.  This originally started out as an investigation of a possible avenue in my work on GELI full-disk encryption support for the EFI boot and loader, and grew into a project in its own right.

More generally, this fits into a bunch of work I’m pursuing or planning to pursue in order to increase the overall tamper-resistance of FreeBSD, but that’s another article.

Background

To properly explain all this, I need to briefly introduce both the FreeBSD boot and loader architecture as well as EFI.

FreeBSD Boot Architecture

When an operating system starts, something has to do the work of getting the kernel (and modules, and often other stuff) off the disk and into memory, setting everything up, and then actually starting it.  This is the boot loader.  Boot loaders are often in a somewhat awkward position: they need to do things like read filesystems, detect some devices, load configurations, and do setup, but they don’t have the usual support of the operating system to get it done.  Most notably, they are difficult to work with because if something goes wrong, there is very little in the way of recovery, debugging, or even logging.

Moreover, back in the old days of x86 BIOS, space was a major concern: the BIOS pulled in the first disk sector, meaning the program had to fit into less than 512 bytes.  Even once a larger program was loaded, you were still in 16-bit execution mode.

To deal with this, FreeBSD adopted a multi-stage approach.  The initial boot loader, called “boot”, had the sole purpose of pulling in a more featureful loader program, called “loader”.  In truth, boot consisted of two stages itself: the tiny boot block, and then a slightly more powerful program loaded from a designated part of the BSD disklabel.

The loader program is much more powerful, having a full suite of filesystem drivers, a shell, facilities for loading and unloading the kernel, and other things.  This two-phase architecture overcame the severe limitations of the x86 BIOS environment.  It also allowed the platform-specific boot details to be separated from both the loader program and the kernel.  This sort of niceness is the hallmark of a sound architectural choice.

Inside the loader program, the code uses a set of abstracted interfaces to talk about devices.  Devices are detected, bound to a device switch structure, and then filesystem modules provide a way to access the filesystems those devices contain.  Devices themselves are referred to by strings that identify the device switch managing them.  This abstraction allows loader to support a huge variety of configurations and platforms in a uniform way.
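
Schematically, the device switch is just a table of callbacks.  The following is a simplified sketch in the spirit of the loader’s device-switch abstraction; the real structure in the boot code has more fields and different types, and this is shown only to illustrate the idea.

    #include <stddef.h>
    #include <stdint.h>

    struct devsw_sketch {
        const char *dv_name;                    /* e.g. "disk", "net", "zfs" */
        int  (*dv_init)(void);                  /* one-time probe at startup */
        int  (*dv_open)(void *of, void *dev);   /* open an instance of the device */
        int  (*dv_close)(void *of);
        int  (*dv_strategy)(void *devdata, int rw, uint64_t blk,
                            size_t size, char *buf, size_t *rsize);  /* block I/O */
        void (*dv_cleanup)(void);               /* teardown before kernel handoff */
    };

    /* Filesystem modules sit on top of this interface, and devices are named
     * by strings (for example "disk0p3:") that select a device switch entry. */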

The Extensible Firmware Interface

In the mid-2000s, the Extensible Firmware Interface started to replace the BIOS as the boot environment on x86 platforms.  EFI is far more modern, featureful, abstracted, and easy to work with than the archaic, crufty, and often unstandardized or undocumented BIOS.  I’ve written boot loaders for both; EFI is pretty straightforward, whereas BIOS is a tarpit of nightmares.

One thing EFI does is remove the draconian constraints on the initial boot loader.  The firmware loads a specific file from a filesystem, rather than a single block from a disk.  The EFI spec guarantees support for the FAT32 filesystem and the GUID Partition Table, and individual platforms are free to support others.

Another thing EFI does is provide abstracted interfaces for things like device I/O, filesystems, and much else.  Devices, both concrete hardware and derived devices such as disk partitions and network filesystems, are represented using “device handles”, which support various operational interfaces through “protocol interfaces” and are named using “device paths”.  Moreover, vendors and operating-system authors alike are able to provide their own drivers through a driver binding interface, which can create new device handles or bind new protocol interfaces to existing ones.

FreeBSD Loader and EFI Similarities

The FreeBSD loader and the EFI framework do many of the same things, and they do them in similar ways most of the time.  Both have an abstracted representation of devices, interfaces for interacting with them, and a way of naming them.  In many ways, the FreeBSD loader framework is prescient in that it did many of the things that EFI ended up doing.

The one shortcoming of the FreeBSD loader is its lack of support for dynamic device detection, also known as “hotplugging”.  When FreeBSD’s boot architecture was created (circa 1994), hotplugging was extremely uncommon: most hardware was expected to be connected permanently and to remain connected for the duration of operation.  Hence, the architecture was designed around a model of one-time static detection of all devices, and the code evolved around that assumption.  Hot-plugging was added to the operating system itself, of course, but there was little need for it in the boot architecture.  When EFI was born (mid-2000s), hot-pluggable devices were common, and so supporting them was an obvious design choice.

EFI does this through its driver model, where drivers register a set of callbacks that check whether a device is supported and then attempt to attach to it.  When a device is disconnected, another callback is invoked to detach from it.  FreeBSD’s loader, on the other hand, expects to detect all devices in a probing phase during its initialization.  It then sets up additional structure (most notably, its new bcache framework) based on the list of detected devices.  Some phases of detection may rely on earlier ones; for example, the ZFS driver may update some devices that were initially detected as block devices.
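
Paraphrasing the shape of that interface (the real definition is EFI_DRIVER_BINDING_PROTOCOL in the UEFI headers, with more members and parameters than shown here; the typedefs below are stand-ins):

    typedef void          *efi_handle_t;   /* stand-in for EFI_HANDLE */
    typedef unsigned long  efi_status_t;   /* stand-in for EFI_STATUS */

    /* The firmware calls supported() against candidate handles, start() to
     * attach the driver to a device, and stop() to detach it (e.g. on unplug). */
    struct efi_driver_binding_sketch {
        efi_status_t (*supported)(struct efi_driver_binding_sketch *self,
                                  efi_handle_t controller);
        efi_status_t (*start)(struct efi_driver_binding_sketch *self,
                              efi_handle_t controller);
        efi_status_t (*stop)(struct efi_driver_binding_sketch *self,
                             efi_handle_t controller);
    };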

Refactoring Summary

As I mentioned, my work on this was originally a strategy for implementing GELI support.  A problem with the two-phase boot process is that it’s difficult to get information between the two phases, particularly in EFI, where all code is position-independent, no hard addresses are guaranteed, and components are expected to talk through abstract interfaces.  (In other words, it rules out the sort of hacks that the non-EFI loader uses!)  This is a problem for something like GELI, which has to ask for a password to unlock the filesystem (we don’t want to ask for a password multiple times).  Also, much of what I was having to implement for GELI with abstract devices and a GPT partition driver and such ended up mirroring things that already existed in the EFI framework.

I ended up refactoring the EFI boot and loader to make more use of the EFI framework, particularly its protocol interfaces.  The following is a summary of the changes:

  • The boot and loader programs now look for instances of the EFI_SIMPLE_FILE_SYSTEM_PROTOCOL, and use that interface to load files (see the sketch after this list).
  • The filesystem backend code from loader was moved into a driver which does the same initialization as before, then attaches EFI_SIMPLE_FILE_SYSTEM_PROTOCOL interfaces to all device handles that host supported filesystems.
  • This is accomplished through a pair of wrapper interfaces that translate EFI_SIMPLE_FILE_SYSTEM_PROTOCOL and the FreeBSD loader framework’s filesystem interface back and forth.
  • I originally wanted to move all device probing and filesystem detection into the EFI driver model, where probing and detection would be done in callbacks.  However, this didn’t work, primarily because the bcache framework is strongly coupled to the static-detection way of doing things.
  • Interfaces and device handles installed in boot can be used by loader without problems.  This provides a way to pass information between phases.
  • The boot and loader programs can also make use of interfaces installed by other programs, such as GRUB, or custom interfaces provided by open-source firmware.
  • The boot and loader programs now use the same filesystem backend code; the minimal versions used by boot have been discarded.
  • Drivers for things like GELI, custom partition schemes, and similar things can work by creating new device nodes and attaching device paths and protocol interfaces to them.

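As a rough sketch of the filesystem lookup mentioned in the first item above (paraphrased: BS is the boot-services table pointer, called gBS in EDK2 parlance; the exact type and GUID macro spellings depend on which EFI headers are in use; error handling is omitted):

    EFI_STATUS
    load_from_any_filesystem(CHAR16 *path)
    {
        EFI_GUID fs_guid = EFI_SIMPLE_FILE_SYSTEM_PROTOCOL_GUID;
        EFI_HANDLE *handles;
        UINTN i, nhandles;

        /* Ask the firmware for every handle offering the filesystem protocol. */
        BS->LocateHandleBuffer(ByProtocol, &fs_guid, NULL, &nhandles, &handles);

        for (i = 0; i < nhandles; i++) {
            EFI_SIMPLE_FILE_SYSTEM_PROTOCOL *fs;
            EFI_FILE_PROTOCOL *root, *file;

            BS->HandleProtocol(handles[i], &fs_guid, (void **)&fs);
            fs->OpenVolume(fs, &root);
            if (root->Open(root, &file, path, EFI_FILE_MODE_READ, 0) == EFI_SUCCESS) {
                /* ... read the file via file->Read(), then return ... */
            }
        }
        return EFI_NOT_FOUND;
    }
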
I sent an email out to -hackers announcing the patch this morning, and I hope to get GELI support up and going in the very near future (the code is all there; I just need to plug it in to the EFI driver binding and get it building and running properly).

For anyone interested, the branch can be found here: https://github.com/emc2/freebsd/tree/efize

The Complex Nature of the Security Problem

This article is an elaboration on ideas I originally developed in a post to the project blog for my pet programming-language project here.  The ideas remain as valid now (if not more so) as they were eight months ago when I wrote the original piece.

The year 2015 saw a great deal of publicity surrounding a number of high-profile computer security incidents.  While this trend has been ongoing for some time now, the past year marked a point at which the problem entered the public consciousness to the point where it has become a national news item and is likely to be a key issue in the coming elections and beyond.

“The Security Problem”, as I have taken to calling it, is not a simple issue, and it does not have a simple solution.  It is a complex, multi-faceted problem with a number of root causes, and it cannot be solved without adequately addressing each of those causes in turn.  It is also a crucial issue that must be solved in order for technological civilization to continue its forward progress and not slip into stagnation or regression.  If there is a single message I would want to convey on the subject, it is this: the security problem can only be adequately addressed by a multitude of different approaches working in concert, each addressing an aspect of the problem.

Trust: The Critical Element

In late September, I did a “ride-along” of a training program for newly-hired security consultants.  Just before leaving, I spoke briefly to the group, encouraging them to reach out to us and collaborate.  My final words, however, were broader in scope: “I think every era in history has its critical problems that civilization has to solve in order to keep moving forward, and I think the security problem is one of those problems for our era.”

Why is this problem so important, and why would its existence have the potential to block forward progress?  The answer is trust.  Trust, specifically the ability to trust people about whom we know almost nothing and whom, indeed, we may never meet, is arguably the critical element that allows civilization to exist at all.  Consider what might happen if that kind of trust did not exist: we would be unable to create and sustain basic institutions such as governments, hospitals, markets, banks, and public transportation.

Technological civilization requires a much higher degree of trust.  Consider, for example, the amount of trust that goes into using something as simple as checking your bank account on your phone.  At a very cursory inspection, you trust the developers who wrote the app that allows you to access your account, the designers of the phone, the hardware manufacturers, the wireless carrier and their backbone providers, the bank’s server software and their system administrators, the third-party vendors that supplied the operating system and database software, the scientists who designed the crypto protecting your transactions and the standards organizations who codified it, the vendors who supplied the networking hardware, and this is just a small portion.  You quite literally trust thousands of technologies and millions of people that you will almost certainly never meet, just to do the simplest of tasks.

The benefits of this kind of trust are clear: the global internet and the growth of computing devices has dramatically increased efficiency and productivity in almost every aspect of life.  However, this trust was not automatic.  It took a long time and a great deal of effort to build.  Moreover, this kind of trust can be lost.  One of the major hurdles for the development of electronic commerce, for example, was the perception that online transactions were inherently insecure.

This kind of progress is not permanent, however; if our technological foundations prove themselves unworthy of this level of trust, then we can expect to see stymied progress or in the worst case, regression.

The Many Aspects of the Security Problem

As with most problems of this scope and nature, the security problem does not have a single root cause.  It is the product of many complex issues interacting to produce a problem, and therefore its solution will necessarily involve committed efforts on multiple fronts and multiple complementary approaches to address the issues.  There is no simple cause, and no “magic bullet” solution.

The contributing factors to the security problem range from highly technical (with many aspects in that domain), to logistical, to policy issues, to educational and social.  In fact, a complete characterization of the problem could very well be the subject of a graduate thesis; the exposition I give here is therefore only intended as a brief survey of the broad areas.

Technological Factors

As the security problem concerns computer security (I have dutifully avoided gratuitous use of the phrase “cyber”), it comes as no surprise that many of the contributing factors to the problem are technological in nature.  However, even within the scope of technological factors, we see a wide variety of specific issues.

Risky Languages, Tools, and APIs

Inherently dangerous or risky programming language or API features are one of the most common factors that contribute to vulnerabilities.  Languages that lack memory safety can lead to buffer overruns and other such errors (which are among the most common exploits in systems), and untyped languages admit a much larger class of errors, many of which lead to vulnerabilities like injection attacks.  Additionally, many APIs are improperly designed and lead to vulnerabilities, or are designed in such a way that safe use is needlessly difficult.  Lastly, many tools can be difficult to use in a secure manner.
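
As a textbook illustration of the memory-safety point, consider the difference between an unbounded and a bounded string copy in C; the function names are mine, but the pattern is the classic one behind a large share of real-world overrun exploits.

    #include <stdio.h>
    #include <string.h>

    /* Classic overrun: if name is 16 bytes or longer, strcpy() writes past
     * the end of buf. */
    void
    greet_unsafe(const char *name)
    {
        char buf[16];
        strcpy(buf, name);
        printf("hello, %s\n", buf);
    }

    /* Bounded version: snprintf() truncates instead of overrunning, and the
     * result is always NUL-terminated. */
    void
    greet_safer(const char *name)
    {
        char buf[16];
        snprintf(buf, sizeof(buf), "%s", name);
        printf("hello, %s\n", buf);
    }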

We have made some headway in this area.  Many modern frameworks are designed in such a way that they are “safe by default”, requiring no special configuration to satisfy many safety concerns and requiring the necessary configuration to address the others.  Programming language research over the past 30 years has produced many advanced type systems that can make stronger guarantees, and we are starting to see these enter common use through languages like Rust.  My current employer, Codiscope, is working to bring advanced program analysis research into the static program analysis space.  Initiatives like the NSF DeepSpec expedition are working to develop practical software verification methods.

However, we still have a way to go here.  No mature engineering discipline relies solely on testing: civil engineering, for example, accurately predicts the tolerances of a bridge long before it is built.  Software engineering has yet to develop methods with this level of sophistication.

Configuration Management

Modern systems involve a dizzying array of configuration options.  In multi-level architectures, there are many different components interacting in order to implement each bit of functionality, and all of these need to be configured properly in order to operate securely.

Misconfigurations are a very frequent cause of vulnerabilities.  Enterprise software components can have hundreds of configuration options per component, and we often string dozens of components together.  In this environment, it becomes very easy to miss a configuration option or accidentally fail to account for a particular case.  The fact that there are so many possible configurations, most of which are invalid, further exacerbates the problem.

Crypto has also tended to suffer from usability problems.  Crypto is particularly sensitive to misconfigurations: a single weak link undermines the security of the entire system.  However, it can be quite difficult to develop and maintain hardened crypto configurations over time, even for the technologically adept.  The difficulty of setting up software like GPG for non-technical users has been the subject of actual research papers.  I can personally attest to this as well, having guided multiple non-technical people through the setup.

This problem can be addressed, however.  Configuration management tools allow configurations to be set up from a central location, and managed automatically by various services (CFEngine, Puppet, Chef, Ansible, etc.).  Looking farther afield, we can begin to imagine tools that construct configurations for each component from a master configuration, and to apply type-like notions to the task of identifying invalid configurations.  These suggestions are just the beginning; configuration management is a serious technical challenge, and can and should be the focus of serious technical work.

Legacy Systems

Legacy systems have long been a source of pain for technologists.  They represent a kind of debt that is often too expensive to pay off in full, but which exacts a recurring tax on resources in the form of legacy costs (compatibility issues, bad performance, blocked upgrades, unusable systems, and so on).  To those most directly involved in the development of technology, legacy systems tend to be a source of chronic pain; however, from the standpoint of budgets and limited resources, they are often a kind of pain to be managed as opposed to cured, as wholesale replacement is far too expensive and risky to consider.

In the context of security, however, the picture is often different.  These kinds of systems are often extremely vulnerable, having been designed in a time when networked systems were rare or nonexistent.  In this context, they are more akin to rotten timbers at the core of a building.  Yes, they are expensive and time-consuming to replace, but the risk of not replacing them is far worse.

The real danger is that the infrastructure where vulnerable legacy systems are most prevalent (power grids, industrial facilities, mass transit, and the like) is precisely the sort of infrastructure where a breach can do catastrophic damage.  We have already seen an example of this in the real world: the Stuxnet malware was employed to destroy uranium-processing centrifuges.

Replacing these legacy systems with more secure implementations is a long and expensive proposition, and doing it in a way that minimizes costs is a very challenging technological problem.  However, this is not a problem that can be neglected.

Cultural and Policy Factors

Though computer security is technological in nature, its causes and solutions are not limited solely to technological issues.  Policy, cultural, and educational factors also affect the problem, and must be a part of the solution.

Policy

The most obvious non-technical influence on the security problem is policy.  The various policy debates that have sprung up in the past years are evidence of this; however, the problem goes much deeper than these debates.

For starters, we are currently in the midst of a number of policy debates regarding strong encryption and how we as a society deal with the fact that such a technology exists.  I make my stance on the matter quite clear: I am an unwavering advocate of unescrowed, uncompromised strong encryption as a fundamental right (yes, there are possible abuses of the technology, but the same is true of such things as due process and freedom of speech).  Despite my hard-line pro-crypto stance, I can understand how those that don’t understand the technology might find the opposing position compelling.  Things like golden keys and abuse-proof backdoors certainly sound nice.  However, the real effects of pursuing such policies would be to fundamentally compromise systems and infrastructure within the US and turn defending against data breaches and cyberattacks into an impossible problem.  In the long run, this erodes the kind of trust in technological infrastructure of which I spoke earlier and bars forward progress, leaving us to be outclassed in the international marketplace.

In a broader context, we face a problem here that requires rethinking our policy process.  The security problem is a complex technological issue (too complex for even the most astute and deliberative legislator to develop true expertise on through part-time study), but one where the effects of uninformed policy can be disastrous.  In the context of public debate, it does not lend itself to two-sided thinking or simple solutions, and attempting to force it into such a model loses too much information to be effective.

Additionally, the problem goes deeper than issues like encryption, backdoors, and dragnet surveillance.  Much of the US infrastructure runs on vulnerable legacy systems, as I mentioned earlier, and replacing these systems with more secure, modern software is an expensive and time-consuming task.  Moreover, the need to invest in our infrastructure in this way barely registers in public debate, if at all.  Doing so, however, is essential to fixing one of the most significant sources of vulnerabilities.

Education

Education, or the lack thereof, also plays a key role in the security problem.  Even top-level computer science curricula fail to teach students how to think securely and develop secure applications, or even to impress upon students the importance of doing so.  This is understandable: even a decade ago, the threat level to most applications was nowhere near what it is today.  The world has changed dramatically in this regard in a rather short span of time.  The proliferation of mobile devices and connectedness, combined with a tremendous upturn in the number and sophistication of attacks launched against systems, has led to a very different environment from what existed even ten years ago (when I was finishing my undergraduate education).

College curricula are necessarily a conservative institution; knowledge is expected to prove its worth and go through a process of refinement and sanding off of rough edges before it reaches the point where it can be taught in an undergraduate curriculum.  By contrast, much of the knowledge of how to avoid building vulnerable systems is new, volatile, and thorny: not the sort of thing traditional academia likes to mix into a curriculum, especially in a mandatory course.

Such a change is necessary, however, and this means that educational institutions must develop new processes for effectively educating people about topics such as these.

Culture

While it is critical to have infrastructure and systems built on sound technological approaches, it is also true that a significant number of successful attacks on both large enterprises and individuals alike make primary use of human factors and social engineering.  This is exacerbated by the fact that we, culturally speaking, are quite naive about security.  There are security-conscious individuals, of course, but most people are naive to the point that an attacker can typically rely on social engineering with a high success rate in all but the most secure of settings.

Moreover, this naivety affects everything else, ranging from policy decisions to what priorities are deemed most important in product development.  The lack of public understanding of computer security allows bad policy such as back doors to be taken seriously, and allows insecure and invasive products to thrive on marketing claims that simply don’t reflect reality (SnapChat remains one of the worst offenders in this regard, in my opinion).

The root cause behind this is that cultures adapt even more slowly than the other factors I’ve mentioned, and our culture has yet to develop effective ways of thinking about these issues.  But cultures do adapt; we all remember sayings like “look both ways” and “stop, drop, and roll” from our childhood, both of which teach simple but effective ways of managing the more basic risks that arise in a technological society.  This sort of adaptation also responds to need.  During my own youth and adolescence, the danger of HIV drove a number of significant cultural changes in a relatively short period of time that proved effective in curbing the epidemic.  While the issues surrounding the security problem represent a very different sort of danger, they are still pressing issues that require a degree of cultural adaptation to address.  A key step in addressing the cultural aspects of the security problem is developing similar kinds of cultural understanding and awareness, and promoting behavior changes that help reduce risk.

Conclusion

I have presented only a portion of the issues that make up what I call the “computer security problem”.  These issues are varied, ranging from deep technological issues obviously focused on security to cultural and policy issues.  There is not one single root cause to the problem, and as a result, there is no one single “silver bullet” that can solve it.

Moreover, if the problem is this varied and complex, then we can expect the solutions to each aspect of the problem to likewise require multiple different approaches coming from different angles and reflecting different ways of thinking.  My own work, for example, focuses on the language and tooling issue, coming mostly from the direction of building tools to write better software.  However, there are other approaches to this same problem, such as sandboxing and changing the fundamental execution model.  All of these angles deserve consideration, and the eventual resolution to that part of the security problem will likely incorporate developments from each angle of approach.

If there is a final takeaway from this, it is that the problem is large and complex enough that it cannot be solved by the efforts or approach of a single person or team.  It is a monumental challenge requiring the combined tireless efforts of a generation’s worth of minds and at least a generation’s worth of time.