Meltdown/Spectre and Implications for OS Security

This week has seen the disclosure of a devastating vulnerability: meltdown and its sister vulnerability, spectre.  Both are a symptom of a larger problem that has evidently gone unnoticed for the better part of 50 years of computer architecture: the potential for side-channels in processor pipeline designs.

Aside: I will confess that I am rather ashamed that I didn’t notice this, despite having a background in all the knowledge areas that should have revealed it.  It took me all of three sentences of the paper before I realized what was going on.  Then again, this somehow got by every other computer scientist for almost half a century.  The conclusion seems to be that we are, all of us, terrible at our jobs…

A New Class of Attack, and Implications

I won’t go in to the details of the attacks themselves; there are already plenty of good descriptions of that, not the least of which is the paper.  My purpose here is to analyze the broader implications with particular focus on the operating systems security perspective.

To be absolutely clear, this is a very serious class of vulnerabilities.  It demands immediate and serious attention from both hardware architects and OS developers, not to mention people further up the stack

The most haunting thing about this disclosure is that it suggests the existence of an entire new class of attack, of which the meltdown/spectre attacks are merely the most obvious.  The problem, which has evidently been lurking in processor architecture design for half a century has to do with two central techniques in processor design:

  1. The fact that processors tend to enhance performance by making commonly-executed things faster.  This is done by a whole host of features: memory caches, trace caches, and branch predictors, to name the more common techniques.
  2. The fact that processor design has followed a general paradigm of altering the order in which operations are executed, either by overlapping multiple instructions (pipelining), executing multiple instructions in parallel (superscalar), executing them out of order and then reconciling the results (out-of-order), and executing multiple possible outcomes and keeping only one result (speculative).

The security community is quite familiar with the impacts of (1): it is the basis for a host of side-channel vulnerabilities.  Much of the work in computer architecture deals with implementing the techniques in (2) while maintaining the illusion that everything is executed one instruction at a time.  CPUs are carefully designed for this with regard to processor state according to the ISA, carefully tracking which operations depend on which and keeping track of the multiple possible worlds that arise from speculation.  What evidently got by us is that the side-channels provided by (1) provide an attacker with the ability to violate the illusion of executing one instruction at a time.

What this almost certainly means is that this is not the last attack of this kind we will see.  Out-of-order speculative execution happens to be the most straightforward to attack.  However, I can sketch out ways in which even in-order pipelines could potentially be vulnerable.  When we consider more obscure architectural features such as trace-caches, branch predictors, instruction fusion/fission, and translation units[0], it becomes almost a certainty that we will see new attacks in this family published well into the future.

A complicating factor in this is that CPU pipelines are extraordinarily complicated, particularly these days.  (I recall a tech talk I saw about the SPARC processor, and its scope was easily as large as the FreeBSD core kernel, or the HotSpot virtual machine.)  Moreover, these designs are typically closed.  There are a lot of different ways a core can be designed, even for a simple ISA, and we can expect entire classes of these design decisions to be vulnerable.


[0]: The x86 architecture tends to have translation units that are significantly more complex than the cores themselves, and are likely to be a nest of side-channels.

OS Security

The implications of this for OS (and even application) security are rather dismal.  The one saving virtue in this is that these attacks are read-only: they can be used to exfiltrate data from any memory immediately accessible to the pipeline, but it’s a pretty safe assumption that they cannot be used to write to memory.

That boon aside, this is devastating to OS security.  As a kernel developer, there aren’t any absolute defenses that we can implement without a complete overhaul of the entire paradigm of modern operating systems and major revisions to the POSIX specification.  (Even then, it’s not certain that this would work, as we can’t know how the processors implement things under the hood.)

Therefore, OS developers are left with a choice among partial solutions.  We can make the attacks harder, but we cannot stop them in all cases.  The only perfect defense is to replace the hardware, and hope nobody discovers a new side-channel attack.

Attack Surface and Vulnerable Assets

The primary function of these kinds of attacks is to exfiltrate data across various isolation boundaries.  The following is the most general description of the capabilities of the attack, as far as I can tell:

  1. Any data that can be loaded into a CPU register can potentially be converted into the execution of some number of events before a fault is detected
  2. These execution patterns can affect the performance of future operations, giving rise to a side-channel
  3. These side-channels can be read through various means (note that this need not be done by the same process)

This gives rise to the ability to violate various isolation boundaries (though only for the purposes of observing state):

  • Reads of kernel/hypervisor memory from a guest (aka “meltdown”)
  • Reads of another process’ address space (aka “spectre”)
  • Reads across intra-process isolation mechanisms, such as different VM instances (this attack is not named, but constitutes, among other things, a perfect XSS attack)

A salient feature of these attacks is that they are somewhat slow: as currently described, the attack will incur 2n + 1 cache faults to read out n bits (not counting setup costs).  I do not expect this to last, however.

The most significant danger of this attack is the exfiltration of secret data, ideally small and long-lived data objects.  Examples of key assets include:

  • Master keys for disk encryption (example: the FreeBSD GELI disk encryption scheme)
  • TLS session keys
  • Kerberos tickets
  • Cached passwords
  • Keyboard input buffers, or textual input buffers

More transient information is also vulnerable, though there is an attack window.  The most vulnerable assets are those keys which necessarily must remain in memory and unencrypted for long periods of time.  The best example of this I can think of is a disk encryption key.

Imperfect OS Defense Mechanisms

Unfortunately, operating systems are limited in their ability to adequately defend against these attacks, and moreover, many of these mechanisms incur a performance penalty.

Separate Kernel and User Page Tables

The solution adopted by the Linux KAISER patches is to unmap most of the kernel from user space, keeping only essential CPU metadata and the code to switch page tables mapped.  Unfortunately, this requires a TLB flush for every system call, incurring a rather severe performance penalty.  Additionally, this only protects from access to a kernel address; it cannot stop accesses to other process address spaces or crossing isolation boundaries within a process.

Make Access Attempts to Kernel Memory Non-Recoverable

An idea I proposed on the FreeBSD mailing lists is to cause attempted memory accesses to a kernel address range result in an immediate cache flush followed by non-recoverable process termination.  This avoids the cost of separate kernel and user page tables, but is not a perfect defense in that there is a small window of time in which the attack can be carried out.

Special Handling of Sensitive Assets

Another potential middle-ground is to handle sensitive kernel assets specially, storing them in a designated location in memory (or better yet, outside of main memory).  If a main-memory store is used, this range alone can be mapped and unmapped when crossing into kernel space, thus avoiding most of the overhead of a TLB flush.  This would only protect assets that are stored in this manner; however.  Anything else (most notably the stacks for kernel threads) would remain vulnerable.

Userland Solutions

Userland programs are even less able to defend against such attacks than the kernel; however, there are architectural considerations that can be made.

Avoid Holding Sensitive Information for Long Periods

As with kernel-space sensitive assets, one mitigation mechanism is to avoid retaining sensitive assets for long periods of time.  An example of this is GPG, which keeps the users’ keys decrypted only for a short period of time (usually 5 minutes) after they request to use a key.  While this is not perfect, it limits attacks to a brief window, and presents users with the ability create practices which ensure that they shut off other means of attack during this window.

Minimize Potential for Remote Execution

While this only works on systems that are effectively single-user, minimizing the degree to which remote code can execute on ones system closes many vulnerabilities.  This is obviously difficult in the modern world, where most websites pull in Javascript from dozens if not hundreds of sources, but it can be done with browser plugins like noscript.

JIT/Interpreter Mitigations

As JIT compilers and interpreters are a common mechanism for limited execution of remote code, they are in a position to attempt to mitigate many of these attacks (there is an LLVM patch out to do this).  This is once again an imperfect defense, as they are on the wrong side of Rice’s theorem.  There is no general mechanism for detecting these kinds of attacks, particularly where multiple threads of execution are involved.

Interim Hardware Mitigations

The only real solution is to wait for hardware updates which fix the vulnerabilities.  This take time, however, and we would like some interim solution.

One possible mitigation strategy is to store and process sensitive assets outside of the main memory.  HSMs and TPMs are an example of this (though not one I’m apt to trust), and the growth of the open hardware movement offers additional options.  One in particular- which I may put into effect myself -uses a small FPGA device (such as the PicoEVB) programmed with an open design as such a device.

More generally, I believe this incident highlights the value of these sorts of hardware solutions, and I hope to see increased interest in developing open implementations in the future.


The meltdown and spectre attacks- severe as they are by themselves -represent the beginning of something much larger.  A new class of attack has come to the forefront, which enables data exfiltration in spite of hardware security mechanisms, and we can and should expect to see more vulnerabilities of this kind in the future.  What makes these vulnerabilities so dangerous is that they directly compromise hardware protection mechanisms on which OS security depends, and thus there is no perfect defense at the OS level against them.

What can and should be done is to adapt and rethink security approaches at all levels.  At the hardware level, a paradigm shift is necessary to avoid vulnerabilities of this kind, as is the consideration that hardware security mechanisms are not as absolute as has been assumed.  At the OS level, architectural changes and new approaches can harden an OS against these vulnerabilities, but they generally cannot repel all attacks.  At the user level, similar architectural changes as well as minimization of attack surfaces can likewise reduce the likelihood and ease of attack, but cannot repel all such attacks.

As a final note, the trend in information security has been increasing focus on application, and particularly web application security.  This event, as well as the fact that a relatively simple vulnerability managed to evade detection by the entire field of computer science for half a century strongly suggests that systems security is not at all the solved problem it is commonly assumed to be, and that new thinking and new approaches are needed in this area.


RISC-V (Crypto) Engines Extension

I discovered the RISC-V project over the holidays, and promptly fell in love with it.  RISC-V represents an expansion of the open-source ethos into the hardware space, and I believe it has the potential to be one of the most important open hardware projects in the long run.

Crypto is something I care about, and an open hardware project like RISC-V presents an excellent opportunity to introduce high-quality extensible hardware cryptographic functions.  As I’m no stranger to computer architecture, I decided to roll up my sleeves and throw together a cryptographic instruction set extension for RISC-V.

This article is an overview of my design as it stands.  I have a detailed draft specification and the beginnings of an implementation in a fork of Berkeley’s rocket-chip repository.


An on-chip crypto implementation affords a number of possibilities that can improve security.  For one, hardware crypto implementations are able to control many kinds of  side-channel attacks much more effectively than software.  Hardware implementations can completely avoid timing, cache, and branch-predictor side-channels.  In addition, my designs allow for fuzzing of physical side-channels through techniques such as pipelining and injection of random “dummy” operations.

In addition to side-channel mitigation, hardware crypto potentially allows for designs which specifically account for insecure memory and disk, keeping all unencrypted key material in the processor core and not allowing it to be exported as plaintext.  This is a key principle in the design of hardware security modules (HSMs), and it would be a welcome feature in a general CPU.

Hardware Crypto Paradigms

There are roughly three dominant paradigms for hardware crypto.  The first is a wholly-separate device connected via a system bus (such as PCI), which implements various functions.  One of the key advantages of this is that the sensitive data can remain on the device, never accessible to the rest of the system (of course, this is also a disadvantage in the case of closed-source hardware, as we can never be certain that the system isn’t weakened or back-doored).  However, this can’t rightly be considered an ISA extension, as it’s wholly-separate.

The other end of the spectrum is occupied by cipher-specific instructions such as Intel’s AESNI instruction set.  These are often very efficient, as many cryptographic ciphers can be implemented very efficiently in hardware.  However, they don’t do much for protection of sensitive data.  Moreover, writing specific ciphers into the ISA is generally a bad idea: ciphers are sometimes broken, and more often phased out and replaced by newer, better algorithms.  Moreover, such a practice can enshrine weak crypto, as is seen in the continuing use of weak and broken crypto like RC4, MD5, SHA1, DES, 3DES, and 1024-bit RSA in many hardware crypto offerings.

Coprocessors are a third possibility; however, a coprocessor still must design its own instruction set, and that design must still cope with the reality of changing cryptographic algorithms.  Moreover, the interface between a general CPU and a coprocessor is complicated and difficult to design well.


I began by attempting to generalize the instruction-based approach, initially planning for special secure registers and a generalized framework for crypto instructions.  This ended up evolving into a framework I call “engines” which is most similar to the device-based approach, except that it lives in the processor core and is directly integrated into the pipeline.  The engines instruction set is also designed to allow the entire mechanism to be virtualized in an OS, and to allow for any engine to be implemented in software within a kernel.

An engine is essentially a larger, more complex functional unit which is capable of performing a single complex operation or a limited set of them.  In a typical pipeline, an engine looks and feels essentially like a memory unit, and for most intents and purposes can be treated like one.  After an engine has been configured, it is interacted with by means of a set of commands, which may supply arguments and may return results.  These behave exactly like load and store instructions in that they may generate faults, and commands yielding results may stall until data is available.

Engines also exist in a number of states, and can be moved between states by a transition instruction.  The uninitialized state represents an engine that is being configured (for example, a crypto engine needs to be supplied its key).  Initialization may performing preprocessing on initialization data, and moves the engine into the ready state (for example, the AES cipher does precomputation of the key schedules).  A ready engine can be started, causing it to enter the running state.  This allows a distinction between engines that are simply prepared, and engines that may be executing random operations continuously to fuzz side-channels.   To facilitate fast context-switching, a pause transition moves a running engine into the paused state, and ignores all other states, and the unpause transition restarts a paused engine.  Lastly, engines can be transitioned into a saving state, where their state can be serialized, and an uninitialized engine can be placed in the resuming state, where a saved state can be reloaded.

Each core has a number of engine resources, which are referenced through engine handle registers.  An acquire instruction attempts to acquire an engine resource of a particular type, storing it to an engine handle register.  The namespace for engine resources is quite large, and can be further extended using CSRs to select a particular namespace.  This allows the engines framework to function as a flexible mechanism for the indefinite future.

Engine Instruction Detail

The following is a description of each instructions the proposed engine ISA extension.

Engine Management

eng.acquire   eh, code

The acquire instruction attempts to acquire an engine resource of type code, binding it to the engine handle eh if such a resource is available.  If no such resource is available, it generates a fault trap (an OS can possibly use this along with the rebind instruction to implement engines in software).

eng.release   eh

The release instruction releases the engine resource bound to eh.

eng.ehsave   eh, rd

The ehsave instruction saves the binding in eh to the ordinary register rd.  For all hardware engine resources, this is guaranteed to be represented as a 32-bit number with the lowest bit clear.  This convention allows an operating system to represent bindings to software implementations as numbers with the lowest bit set.

eng.rebind   eh, rs

The rebind instruction re-establishes a binding to eh using the data in the in ordinary register rs.  If the lowest bit is clear in rs, then the binding is checked against the actual hardware engine resources.  Otherwise, it is taken to refer to a software implementation.

State Transitions

eng.trans   eh, code

The trans instruction executes the state transition represented by code.  It may generate a fault trap for bad transitions.

Saving and Restoring

eng.savecnt   eh, rd

The savecnt instruction writes into rd the number of state words that needs to be saved in order to save the entire state of the engine handle eh.  This can only be executed if the engine resource bound to eh is in the saving state.   eh, rd, rs

The save instruction writes the state word for the engine resource bound to eh at the index given in rs into the register rd.  The highest valid index is equal to one less than the value given by the savecnt instruction.  This can only be executed if the engine resource bound to eh is in the saving state.

eng.restore   eh, rs1, rs2

The restore instruction writes the state word in rs2 to the index rs1 in the engine handle eh.  The restore instruction must be executed for all indexes corresponding to a particular saved state in strictly ascending order.  This instruction can only be executed if the engine resource bound to eh is in the restoring state.

Command Instructions

The command instructions allow for varying numbers of arguments and results.  All command instructions may stall for a finite amount of time, and may generate faults.  Some command codes may be restricted to certain states.

eng.icmd   eh, code

The icmd instruction executes the imperative command given by code on the engine resource bound to eh.

eng.rcmd   eh, code, rd

The rcmd instruction executes the receive-only command given by code on the engine resource bound to eh.  The result of the command is stored into rd.

eng.rs1cmd   eh, code, rd, rs1

The rs1cmd instruction executes the send-receive command given by code on the engine resource bound to eh.  The argument to the command is given in the rs1 register.  The result of the command is stored into rd.

eng.rs2cmd   eh, code, rd, rs1, rs2

The rs2cmd instruction executes the send-receive command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.  The result of the command is stored into rd.

eng.s1cmd   eh, code, rs1

The s1cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The argument to the command is given in the rs1 register.

eng.s2cmd   eh, code, rs1, rs2

The s2cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.

eng.s3cmd   eh, code, rs1, rs2, rs3

The s2cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.

Example Crypto Engines

There are several sketches for example crypto engines, which show how this framework can be used for that purpose.

True Random Number Generator

A true random number generator using a physical process (such as electron or photon polarization, thermal noise, or other mechanisms) to generate random bits, which it accumulates in a ring-buffer.  The generator is started with the start transition, and randomness can be read off with a receive-format command that blocks until enough randomness is available.

Symmetric Cipher Encryption/Decryption Pipeline

Symmetric cipher encryption and decryption demonstrates the side-channel fuzzing capabilities of hardware engines.  The key material is loaded during the uninitialized state, and initialization does whatever preprocessing is necessary.  When the engine is in the running state, it constantly generates “dummy” operations using pseudorandomly-generated keys, IVs, and data which are discarded from the pipeline upon completion.  The implementation uses a pipeline to allow very high throughput of operations.  Data is added to the pipeline with a two-argument send command, and read off with a receive command.  Send and receive commands can generate deadlock faults if there is insufficient buffer space or data available.

Elliptic-Curve Point Multiplier

Elliptic curve point multiplication for a restricted set of elliptic curves can be implemented in a fashion similar to the symmetric cipher pipeline, except that for an elliptic curve multiplier, the pipeline typically cannot be completely unrolled.  Finite field arithmetic modulo pseudo-mersenne primes can be implemented as a configurable circuit.  Point multiplication can then be implemented using a ladder algorithm (such as the Montgomery ladder).  The same random “dummy” operation injection suggested for symmetric ciphers can also be used here to fuzz side-channels.

Hash/MAC Algorithm

Cryptographic hashes and message authentication codes transform an arbitrary amount of data into a fixed-size code.  This is straightforward to implement in hardware, although the algorithms in question generally cannot be pipelined.  In order to fuzz side-channels, we simply maintain some number of “dummy” internal states and keys in combination with the “real” one, and feed data to them at the same time as real data.


My original intent was to provide an extensible, general mechanism for supporting crypto on RISC-V hardware.  The engines extension ended up becoming a much more general mechanism, however.  In its most general form, this could even be considered an entire I/O instruction set.  Regardless, the examples I have given clearly demonstrate that it can serve its original purpose very well.