Slides and Poster from 7th RISC-V Workshop

I recently presented a poster at the 7th RISC-V Workshop.  Here are my preview slides and poster.

Advertisements

RISC-V (Crypto) Engines Extension

I discovered the RISC-V project over the holidays, and promptly fell in love with it.  RISC-V represents an expansion of the open-source ethos into the hardware space, and I believe it has the potential to be one of the most important open hardware projects in the long run.

Crypto is something I care about, and an open hardware project like RISC-V presents an excellent opportunity to introduce high-quality extensible hardware cryptographic functions.  As I’m no stranger to computer architecture, I decided to roll up my sleeves and throw together a cryptographic instruction set extension for RISC-V.

This article is an overview of my design as it stands.  I have a detailed draft specification and the beginnings of an implementation in a fork of Berkeley’s rocket-chip repository.

Possibilities

An on-chip crypto implementation affords a number of possibilities that can improve security.  For one, hardware crypto implementations are able to control many kinds of  side-channel attacks much more effectively than software.  Hardware implementations can completely avoid timing, cache, and branch-predictor side-channels.  In addition, my designs allow for fuzzing of physical side-channels through techniques such as pipelining and injection of random “dummy” operations.

In addition to side-channel mitigation, hardware crypto potentially allows for designs which specifically account for insecure memory and disk, keeping all unencrypted key material in the processor core and not allowing it to be exported as plaintext.  This is a key principle in the design of hardware security modules (HSMs), and it would be a welcome feature in a general CPU.

Hardware Crypto Paradigms

There are roughly three dominant paradigms for hardware crypto.  The first is a wholly-separate device connected via a system bus (such as PCI), which implements various functions.  One of the key advantages of this is that the sensitive data can remain on the device, never accessible to the rest of the system (of course, this is also a disadvantage in the case of closed-source hardware, as we can never be certain that the system isn’t weakened or back-doored).  However, this can’t rightly be considered an ISA extension, as it’s wholly-separate.

The other end of the spectrum is occupied by cipher-specific instructions such as Intel’s AESNI instruction set.  These are often very efficient, as many cryptographic ciphers can be implemented very efficiently in hardware.  However, they don’t do much for protection of sensitive data.  Moreover, writing specific ciphers into the ISA is generally a bad idea: ciphers are sometimes broken, and more often phased out and replaced by newer, better algorithms.  Moreover, such a practice can enshrine weak crypto, as is seen in the continuing use of weak and broken crypto like RC4, MD5, SHA1, DES, 3DES, and 1024-bit RSA in many hardware crypto offerings.

Coprocessors are a third possibility; however, a coprocessor still must design its own instruction set, and that design must still cope with the reality of changing cryptographic algorithms.  Moreover, the interface between a general CPU and a coprocessor is complicated and difficult to design well.

Engines

I began by attempting to generalize the instruction-based approach, initially planning for special secure registers and a generalized framework for crypto instructions.  This ended up evolving into a framework I call “engines” which is most similar to the device-based approach, except that it lives in the processor core and is directly integrated into the pipeline.  The engines instruction set is also designed to allow the entire mechanism to be virtualized in an OS, and to allow for any engine to be implemented in software within a kernel.

An engine is essentially a larger, more complex functional unit which is capable of performing a single complex operation or a limited set of them.  In a typical pipeline, an engine looks and feels essentially like a memory unit, and for most intents and purposes can be treated like one.  After an engine has been configured, it is interacted with by means of a set of commands, which may supply arguments and may return results.  These behave exactly like load and store instructions in that they may generate faults, and commands yielding results may stall until data is available.

Engines also exist in a number of states, and can be moved between states by a transition instruction.  The uninitialized state represents an engine that is being configured (for example, a crypto engine needs to be supplied its key).  Initialization may performing preprocessing on initialization data, and moves the engine into the ready state (for example, the AES cipher does precomputation of the key schedules).  A ready engine can be started, causing it to enter the running state.  This allows a distinction between engines that are simply prepared, and engines that may be executing random operations continuously to fuzz side-channels.   To facilitate fast context-switching, a pause transition moves a running engine into the paused state, and ignores all other states, and the unpause transition restarts a paused engine.  Lastly, engines can be transitioned into a saving state, where their state can be serialized, and an uninitialized engine can be placed in the resuming state, where a saved state can be reloaded.

Each core has a number of engine resources, which are referenced through engine handle registers.  An acquire instruction attempts to acquire an engine resource of a particular type, storing it to an engine handle register.  The namespace for engine resources is quite large, and can be further extended using CSRs to select a particular namespace.  This allows the engines framework to function as a flexible mechanism for the indefinite future.

Engine Instruction Detail

The following is a description of each instructions the proposed engine ISA extension.

Engine Management

eng.acquire   eh, code

The acquire instruction attempts to acquire an engine resource of type code, binding it to the engine handle eh if such a resource is available.  If no such resource is available, it generates a fault trap (an OS can possibly use this along with the rebind instruction to implement engines in software).

eng.release   eh

The release instruction releases the engine resource bound to eh.

eng.ehsave   eh, rd

The ehsave instruction saves the binding in eh to the ordinary register rd.  For all hardware engine resources, this is guaranteed to be represented as a 32-bit number with the lowest bit clear.  This convention allows an operating system to represent bindings to software implementations as numbers with the lowest bit set.

eng.rebind   eh, rs

The rebind instruction re-establishes a binding to eh using the data in the in ordinary register rs.  If the lowest bit is clear in rs, then the binding is checked against the actual hardware engine resources.  Otherwise, it is taken to refer to a software implementation.

State Transitions

eng.trans   eh, code

The trans instruction executes the state transition represented by code.  It may generate a fault trap for bad transitions.

Saving and Restoring

eng.savecnt   eh, rd

The savecnt instruction writes into rd the number of state words that needs to be saved in order to save the entire state of the engine handle eh.  This can only be executed if the engine resource bound to eh is in the saving state.

eng.save   eh, rd, rs

The save instruction writes the state word for the engine resource bound to eh at the index given in rs into the register rd.  The highest valid index is equal to one less than the value given by the savecnt instruction.  This can only be executed if the engine resource bound to eh is in the saving state.

eng.restore   eh, rs1, rs2

The restore instruction writes the state word in rs2 to the index rs1 in the engine handle eh.  The restore instruction must be executed for all indexes corresponding to a particular saved state in strictly ascending order.  This instruction can only be executed if the engine resource bound to eh is in the restoring state.

Command Instructions

The command instructions allow for varying numbers of arguments and results.  All command instructions may stall for a finite amount of time, and may generate faults.  Some command codes may be restricted to certain states.

eng.icmd   eh, code

The icmd instruction executes the imperative command given by code on the engine resource bound to eh.

eng.rcmd   eh, code, rd

The rcmd instruction executes the receive-only command given by code on the engine resource bound to eh.  The result of the command is stored into rd.

eng.rs1cmd   eh, code, rd, rs1

The rs1cmd instruction executes the send-receive command given by code on the engine resource bound to eh.  The argument to the command is given in the rs1 register.  The result of the command is stored into rd.

eng.rs2cmd   eh, code, rd, rs1, rs2

The rs2cmd instruction executes the send-receive command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.  The result of the command is stored into rd.

eng.s1cmd   eh, code, rs1

The s1cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The argument to the command is given in the rs1 register.

eng.s2cmd   eh, code, rs1, rs2

The s2cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.

eng.s3cmd   eh, code, rs1, rs2, rs3

The s2cmd instruction executes the send-only command given by code on the engine resource bound to eh.  The arguments to the command are given in the rs1 and rs2 register.

Example Crypto Engines

There are several sketches for example crypto engines, which show how this framework can be used for that purpose.

True Random Number Generator

A true random number generator using a physical process (such as electron or photon polarization, thermal noise, or other mechanisms) to generate random bits, which it accumulates in a ring-buffer.  The generator is started with the start transition, and randomness can be read off with a receive-format command that blocks until enough randomness is available.

Symmetric Cipher Encryption/Decryption Pipeline

Symmetric cipher encryption and decryption demonstrates the side-channel fuzzing capabilities of hardware engines.  The key material is loaded during the uninitialized state, and initialization does whatever preprocessing is necessary.  When the engine is in the running state, it constantly generates “dummy” operations using pseudorandomly-generated keys, IVs, and data which are discarded from the pipeline upon completion.  The implementation uses a pipeline to allow very high throughput of operations.  Data is added to the pipeline with a two-argument send command, and read off with a receive command.  Send and receive commands can generate deadlock faults if there is insufficient buffer space or data available.

Elliptic-Curve Point Multiplier

Elliptic curve point multiplication for a restricted set of elliptic curves can be implemented in a fashion similar to the symmetric cipher pipeline, except that for an elliptic curve multiplier, the pipeline typically cannot be completely unrolled.  Finite field arithmetic modulo pseudo-mersenne primes can be implemented as a configurable circuit.  Point multiplication can then be implemented using a ladder algorithm (such as the Montgomery ladder).  The same random “dummy” operation injection suggested for symmetric ciphers can also be used here to fuzz side-channels.

Hash/MAC Algorithm

Cryptographic hashes and message authentication codes transform an arbitrary amount of data into a fixed-size code.  This is straightforward to implement in hardware, although the algorithms in question generally cannot be pipelined.  In order to fuzz side-channels, we simply maintain some number of “dummy” internal states and keys in combination with the “real” one, and feed data to them at the same time as real data.

Conclusions

My original intent was to provide an extensible, general mechanism for supporting crypto on RISC-V hardware.  The engines extension ended up becoming a much more general mechanism, however.  In its most general form, this could even be considered an entire I/O instruction set.  Regardless, the examples I have given clearly demonstrate that it can serve its original purpose very well.