1099 lines
52 KiB
Markdown
1099 lines
52 KiB
Markdown
|
# Speculative Load Hardening
|
||
|
|
||
|
### A Spectre Variant #1 Mitigation Technique
|
||
|
|
||
|
Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)
|
||
|
|
||
|
## Problem Statement
|
||
|
|
||
|
Recently, Google Project Zero and other researchers have found information leak
|
||
|
vulnerabilities by exploiting speculative execution in modern CPUs. These
|
||
|
exploits are currently broken down into three variants:
|
||
|
* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
|
||
|
* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
|
||
|
* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load
|
||
|
|
||
|
For more details, see the Google Project Zero blog post and the Spectre research
|
||
|
paper:
|
||
|
* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
|
||
|
* https://spectreattack.com/spectre.pdf
|
||
|
|
||
|
The core problem of GPZ Variant #1 is that speculative execution uses branch
|
||
|
prediction to select the path of instructions speculatively executed. This path
|
||
|
is speculatively executed with the available data, and may load from memory and
|
||
|
leak the loaded values through various side channels that survive even when the
|
||
|
speculative execution is unwound due to being incorrect. Mispredicted paths can
|
||
|
cause code to be executed with data inputs that never occur in correct
|
||
|
executions, making checks against malicious inputs ineffective and allowing
|
||
|
attackers to use malicious data inputs to leak secret data. Here is an example,
|
||
|
extracted and simplified from the Project Zero paper:
|
||
|
```
|
||
|
struct array {
|
||
|
unsigned long length;
|
||
|
unsigned char data[];
|
||
|
};
|
||
|
struct array *arr1 = ...; // small array
|
||
|
struct array *arr2 = ...; // array of size 0x400
|
||
|
unsigned long untrusted_offset_from_caller = ...;
|
||
|
if (untrusted_offset_from_caller < arr1->length) {
|
||
|
unsigned char value = arr1->data[untrusted_offset_from_caller];
|
||
|
unsigned long index2 = ((value&1)*0x100)+0x200;
|
||
|
unsigned char value2 = arr2->data[index2];
|
||
|
}
|
||
|
```
|
||
|
|
||
|
The key of the attack is to call this with `untrusted_offset_from_caller` that
|
||
|
is far outside of the bounds when the branch predictor will predict that it
|
||
|
will be in-bounds. In that case, the body of the `if` will be executed
|
||
|
speculatively, and may read secret data into `value` and leak it via a
|
||
|
cache-timing side channel when a dependent access is made to populate `value2`.
|
||
|
|
||
|
## High Level Mitigation Approach
|
||
|
|
||
|
While several approaches are being actively pursued to mitigate specific
|
||
|
branches and/or loads inside especially risky software (most notably various OS
|
||
|
kernels), these approaches require manual and/or static analysis aided auditing
|
||
|
of code and explicit source changes to apply the mitigation. They are unlikely
|
||
|
to scale well to large applications. We are proposing a comprehensive
|
||
|
mitigation approach that would apply automatically across an entire program
|
||
|
rather than through manual changes to the code. While this is likely to have a
|
||
|
high performance cost, some applications may be in a good position to take this
|
||
|
performance / security tradeoff.
|
||
|
|
||
|
The specific technique we propose is to cause loads to be checked using
|
||
|
branchless code to ensure that they are executing along a valid control flow
|
||
|
path. Consider the following C-pseudo-code representing the core idea of a
|
||
|
predicate guarding potentially invalid loads:
|
||
|
```
|
||
|
void leak(int data);
|
||
|
void example(int* pointer1, int* pointer2) {
|
||
|
if (condition) {
|
||
|
// ... lots of code ...
|
||
|
leak(*pointer1);
|
||
|
} else {
|
||
|
// ... more code ...
|
||
|
leak(*pointer2);
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
This would get transformed into something resembling the following:
|
||
|
```
|
||
|
uintptr_t all_ones_mask = std::numerical_limits<uintptr_t>::max();
|
||
|
uintptr_t all_zeros_mask = 0;
|
||
|
void leak(int data);
|
||
|
void example(int* pointer1, int* pointer2) {
|
||
|
uintptr_t predicate_state = all_ones_mask;
|
||
|
if (condition) {
|
||
|
// Assuming ?: is implemented using branchless logic...
|
||
|
predicate_state = !condition ? all_zeros_mask : predicate_state;
|
||
|
// ... lots of code ...
|
||
|
//
|
||
|
// Harden the pointer so it can't be loaded
|
||
|
pointer1 &= predicate_state;
|
||
|
leak(*pointer1);
|
||
|
} else {
|
||
|
predicate_state = condition ? all_zeros_mask : predicate_state;
|
||
|
// ... more code ...
|
||
|
//
|
||
|
// Alternative: Harden the loaded value
|
||
|
int value2 = *pointer2 & predicate_state;
|
||
|
leak(value2);
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
The result should be that if the `if (condition) {` branch is mis-predicted,
|
||
|
there is a *data* dependency on the condition used to zero out any pointers
|
||
|
prior to loading through them or to zero out all of the loaded bits. Even
|
||
|
though this code pattern may still execute speculatively, *invalid* speculative
|
||
|
executions are prevented from leaking secret data from memory (but note that
|
||
|
this data might still be loaded in safe ways, and some regions of memory are
|
||
|
required to not hold secrets, see below for detailed limitations). This
|
||
|
approach only requires the underlying hardware have a way to implement a
|
||
|
branchless and unpredicted conditional update of a register's value. All modern
|
||
|
architectures have support for this, and in fact such support is necessary to
|
||
|
correctly implement constant time cryptographic primitives.
|
||
|
|
||
|
Crucial properties of this approach:
|
||
|
* It is not preventing any particular side-channel from working. This is
|
||
|
important as there are an unknown number of potential side channels and we
|
||
|
expect to continue discovering more. Instead, it prevents the observation of
|
||
|
secret data in the first place.
|
||
|
* It accumulates the predicate state, protecting even in the face of nested
|
||
|
*correctly* predicted control flows.
|
||
|
* It passes this predicate state across function boundaries to provide
|
||
|
[interprocedural protection](#interprocedural-checking).
|
||
|
* When hardening the address of a load, it uses a *destructive* or
|
||
|
*non-reversible* modification of the address to prevent an attacker from
|
||
|
reversing the check using attacker-controlled inputs.
|
||
|
* It does not completely block speculative execution, and merely prevents
|
||
|
*mis*-speculated paths from leaking secrets from memory (and stalls
|
||
|
speculation until this can be determined).
|
||
|
* It is completely general and makes no fundamental assumptions about the
|
||
|
underlying architecture other than the ability to do branchless conditional
|
||
|
data updates and a lack of value prediction.
|
||
|
* It does not require programmers to identify all possible secret data using
|
||
|
static source code annotations or code vulnerable to a variant #1 style
|
||
|
attack.
|
||
|
|
||
|
Limitations of this approach:
|
||
|
* It requires re-compiling source code to insert hardening instruction
|
||
|
sequences. Only software compiled in this mode is protected.
|
||
|
* The performance is heavily dependent on a particular architecture's
|
||
|
implementation strategy. We outline a potential x86 implementation below and
|
||
|
characterize its performance.
|
||
|
* It does not defend against secret data already loaded from memory and
|
||
|
residing in registers or leaked through other side-channels in
|
||
|
non-speculative execution. Code dealing with this, e.g cryptographic
|
||
|
routines, already uses constant-time algorithms and code to prevent
|
||
|
side-channels. Such code should also scrub registers of secret data following
|
||
|
[these
|
||
|
guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
|
||
|
* To achieve reasonable performance, many loads may not be checked, such as
|
||
|
those with compile-time fixed addresses. This primarily consists of accesses
|
||
|
at compile-time constant offsets of global and local variables. Code which
|
||
|
needs this protection and intentionally stores secret data must ensure the
|
||
|
memory regions used for secret data are necessarily dynamic mappings or heap
|
||
|
allocations. This is an area which can be tuned to provide more comprehensive
|
||
|
protection at the cost of performance.
|
||
|
* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
|
||
|
_valid_ addresses if not _attacker-controlled_ addresses. To prevent these
|
||
|
from reading secret data, the low 2gb of the address space and 2gb above and
|
||
|
below any executable pages should be protected.
|
||
|
|
||
|
Credit:
|
||
|
* The core idea of tracing misspeculation through data and marking pointers to
|
||
|
block misspeculated loads was developed as part of a HACS 2018 discussion
|
||
|
between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
|
||
|
individuals.
|
||
|
* Core idea of masking out loaded bits was part of the original mitigation
|
||
|
suggested by Jann Horn when these attacks were reported.
|
||
|
|
||
|
|
||
|
### Indirect Branches, Calls, and Returns
|
||
|
|
||
|
It is possible to attack control flow other than conditional branches with
|
||
|
variant #1 style mispredictions.
|
||
|
* A prediction towards a hot call target of a virtual method can lead to it
|
||
|
being speculatively executed when an expected type is used (often called
|
||
|
"type confusion").
|
||
|
* A hot case may be speculatively executed due to prediction instead of the
|
||
|
correct case for a switch statement implemented as a jump table.
|
||
|
* A hot common return address may be predicted incorrectly when returning from
|
||
|
a function.
|
||
|
|
||
|
These code patterns are also vulnerable to Spectre variant #2, and as such are
|
||
|
best mitigated with a
|
||
|
[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
|
||
|
When a mitigation technique like retpoline is used, speculation simply cannot
|
||
|
proceed through an indirect control flow edge (or it cannot be mispredicted in
|
||
|
the case of a filled RSB) and so it is also protected from variant #1 style
|
||
|
attacks. However, some architectures, micro-architectures, or vendors do not
|
||
|
employ the retpoline mitigation, and on future x86 hardware (both Intel and
|
||
|
AMD) it is expected to become unnecessary due to hardware-based mitigation.
|
||
|
|
||
|
When not using a retpoline, these edges will need independent protection from
|
||
|
variant #1 style attacks. The analogous approach to that used for conditional
|
||
|
control flow should work:
|
||
|
```
|
||
|
uintptr_t all_ones_mask = std::numerical_limits<uintptr_t>::max();
|
||
|
uintptr_t all_zeros_mask = 0;
|
||
|
void leak(int data);
|
||
|
void example(int* pointer1, int* pointer2) {
|
||
|
uintptr_t predicate_state = all_ones_mask;
|
||
|
switch (condition) {
|
||
|
case 0:
|
||
|
// Assuming ?: is implemented using branchless logic...
|
||
|
predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
|
||
|
// ... lots of code ...
|
||
|
//
|
||
|
// Harden the pointer so it can't be loaded
|
||
|
pointer1 &= predicate_state;
|
||
|
leak(*pointer1);
|
||
|
break;
|
||
|
|
||
|
case 1:
|
||
|
predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
|
||
|
// ... more code ...
|
||
|
//
|
||
|
// Alternative: Harden the loaded value
|
||
|
int value2 = *pointer2 & predicate_state;
|
||
|
leak(value2);
|
||
|
break;
|
||
|
|
||
|
// ...
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
The core idea remains the same: validate the control flow using data-flow and
|
||
|
use that validation to check that loads cannot leak information along
|
||
|
misspeculated paths. Typically this involves passing the desired target of such
|
||
|
control flow across the edge and checking that it is correct afterwards. Note
|
||
|
that while it is tempting to think that this mitigates variant #2 attacks, it
|
||
|
does not. Those attacks go to arbitrary gadgets that don't include the checks.
|
||
|
|
||
|
|
||
|
### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"
|
||
|
|
||
|
Beyond the core variant #1 attack, there are techniques to extend this attack.
|
||
|
The primary technique is known as "Bounds Check Bypass Store" and is discussed
|
||
|
in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf
|
||
|
|
||
|
We will analyze these two variants independently. First, variant #1.1 works by
|
||
|
speculatively storing over the return address after a bounds check bypass. This
|
||
|
speculative store then ends up being used by the CPU during speculative
|
||
|
execution of the return, potentially directing speculative execution to
|
||
|
arbitrary gadgets in the binary. Let's look at an example.
|
||
|
```
|
||
|
unsigned char local_buffer[4];
|
||
|
unsigned char *untrusted_data_from_caller = ...;
|
||
|
unsigned long untrusted_size_from_caller = ...;
|
||
|
if (untrusted_size_from_caller < sizeof(local_buffer)) {
|
||
|
// Speculative execution enters here with a too-large size.
|
||
|
memcpy(local_buffer, untrusted_data_from_caller,
|
||
|
untrusted_size_from_caller);
|
||
|
// The stack has now been smashed, writing an attacker-controlled
|
||
|
// address over the return address.
|
||
|
minor_processing(local_buffer);
|
||
|
return;
|
||
|
// Control will speculate to the attacker-written address.
|
||
|
}
|
||
|
```
|
||
|
|
||
|
However, this can be mitigated by hardening the load of the return address just
|
||
|
like any other load. This is sometimes complicated because x86 for example
|
||
|
*implicitly* loads the return address off the stack. However, the
|
||
|
implementation technique below is specifically designed to mitigate this
|
||
|
implicit load by using the stack pointer to communicate misspeculation between
|
||
|
functions. This additionally causes a misspeculation to have an invalid stack
|
||
|
pointer and never be able to read the speculatively stored return address. See
|
||
|
the detailed discussion below.
|
||
|
|
||
|
For variant #1.2, the attacker speculatively stores into the vtable or jump
|
||
|
table used to implement an indirect call or indirect jump. Because this is
|
||
|
speculative, this will often be possible even when these are stored in
|
||
|
read-only pages. For example:
|
||
|
```
|
||
|
class FancyObject : public BaseObject {
|
||
|
public:
|
||
|
void DoSomething() override;
|
||
|
};
|
||
|
void f(unsigned long attacker_offset, unsigned long attacker_data) {
|
||
|
FancyObject object = getMyObject();
|
||
|
unsigned long *arr[4] = getFourDataPointers();
|
||
|
if (attacker_offset < 4) {
|
||
|
// We have bypassed the bounds check speculatively.
|
||
|
unsigned long *data = arr[attacker_offset];
|
||
|
// Now we have computed a pointer inside of `object`, the vptr.
|
||
|
*data = attacker_data;
|
||
|
// The vptr points to the virtual table and we speculatively clobber that.
|
||
|
g(object); // Hand the object to some other routine.
|
||
|
}
|
||
|
}
|
||
|
// In another file, we call a method on the object.
|
||
|
void g(BaseObject &object) {
|
||
|
object.DoSomething();
|
||
|
// This speculatively calls the address stored over the vtable.
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Mitigating this requires hardening loads from these locations, or mitigating
|
||
|
the indirect call or indirect jump. Any of these are sufficient to block the
|
||
|
call or jump from using a speculatively stored value that has been read back.
|
||
|
|
||
|
For both of these, using retpolines would be equally sufficient. One possible
|
||
|
hybrid approach is to use retpolines for indirect call and jump, while relying
|
||
|
on SLH to mitigate returns.
|
||
|
|
||
|
Another approach that is sufficient for both of these is to harden all of the
|
||
|
speculative stores. However, as most stores aren't interesting and don't
|
||
|
inherently leak data, this is expected to be prohibitively expensive given the
|
||
|
attack it is defending against.
|
||
|
|
||
|
|
||
|
## Implementation Details
|
||
|
|
||
|
There are a number of complex details impacting the implementation of this
|
||
|
technique, both on a particular architecture and within a particular compiler.
|
||
|
We discuss proposed implementation techniques for the x86 architecture and the
|
||
|
LLVM compiler. These are primarily to serve as an example, as other
|
||
|
implementation techniques are very possible.
|
||
|
|
||
|
|
||
|
### x86 Implementation Details
|
||
|
|
||
|
On the x86 platform we break down the implementation into three core
|
||
|
components: accumulating the predicate state through the control flow graph,
|
||
|
checking the loads, and checking control transfers between procedures.
|
||
|
|
||
|
|
||
|
#### Accumulating Predicate State
|
||
|
|
||
|
Consider baseline x86 instructions like the following, which test three
|
||
|
conditions and if all pass, loads data from memory and potentially leaks it
|
||
|
through some side channel:
|
||
|
```
|
||
|
# %bb.0: # %entry
|
||
|
pushq %rax
|
||
|
testl %edi, %edi
|
||
|
jne .LBB0_4
|
||
|
# %bb.1: # %then1
|
||
|
testl %esi, %esi
|
||
|
jne .LBB0_4
|
||
|
# %bb.2: # %then2
|
||
|
testl %edx, %edx
|
||
|
je .LBB0_3
|
||
|
.LBB0_4: # %exit
|
||
|
popq %rax
|
||
|
retq
|
||
|
.LBB0_3: # %danger
|
||
|
movl (%rcx), %edi
|
||
|
callq leak
|
||
|
popq %rax
|
||
|
retq
|
||
|
```
|
||
|
|
||
|
When we go to speculatively execute the load, we want to know whether any of
|
||
|
the dynamically executed predicates have been misspeculated. To track that,
|
||
|
along each conditional edge, we need to track the data which would allow that
|
||
|
edge to be taken. On x86, this data is stored in the flags register used by the
|
||
|
conditional jump instruction. Along both edges after this fork in control flow,
|
||
|
the flags register remains alive and contains data that we can use to build up
|
||
|
our accumulated predicate state. We accumulate it using the x86 conditional
|
||
|
move instruction which also reads the flag registers where the state resides.
|
||
|
These conditional move instructions are known to not be predicted on any x86
|
||
|
processors, making them immune to misprediction that could reintroduce the
|
||
|
vulnerability. When we insert the conditional moves, the code ends up looking
|
||
|
like the following:
|
||
|
```
|
||
|
# %bb.0: # %entry
|
||
|
pushq %rax
|
||
|
xorl %eax, %eax # Zero out initial predicate state.
|
||
|
movq $-1, %r8 # Put all-ones mask into a register.
|
||
|
testl %edi, %edi
|
||
|
jne .LBB0_1
|
||
|
# %bb.2: # %then1
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
testl %esi, %esi
|
||
|
jne .LBB0_1
|
||
|
# %bb.3: # %then2
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
testl %edx, %edx
|
||
|
je .LBB0_4
|
||
|
.LBB0_1:
|
||
|
cmoveq %r8, %rax # Conditionally update predicate state.
|
||
|
popq %rax
|
||
|
retq
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
...
|
||
|
```
|
||
|
|
||
|
Here we create the "empty" or "correct execution" predicate state by zeroing
|
||
|
`%rax`, and we create a constant "incorrect execution" predicate value by
|
||
|
putting `-1` into `%r8`. Then, along each edge coming out of a conditional
|
||
|
branch we do a conditional move that in a correct execution will be a no-op,
|
||
|
but if misspeculated, will replace the `%rax` with the value of `%r8`.
|
||
|
Misspeculating any one of the three predicates will cause `%rax` to hold the
|
||
|
"incorrect execution" value from `%r8` as we preserve incoming values when
|
||
|
execution is correct rather than overwriting it.
|
||
|
|
||
|
We now have a value in `%rax` in each basic block that indicates if at some
|
||
|
point previously a predicate was mispredicted. And we have arranged for that
|
||
|
value to be particularly effective when used below to harden loads.
|
||
|
|
||
|
|
||
|
##### Indirect Call, Branch, and Return Predicates
|
||
|
|
||
|
There is no analogous flag to use when tracing indirect calls, branches, and
|
||
|
returns. The predicate state must be accumulated through some other means.
|
||
|
Fundamentally, this is the reverse of the problem posed in CFI: we need to
|
||
|
check where we came from rather than where we are going. For function-local
|
||
|
jump tables, this is easily arranged by testing the input to the jump table
|
||
|
within each destination (not yet implemented, use retpolines):
|
||
|
```
|
||
|
pushq %rax
|
||
|
xorl %eax, %eax # Zero out initial predicate state.
|
||
|
movq $-1, %r8 # Put all-ones mask into a register.
|
||
|
jmpq *.LJTI0_0(,%rdi,8) # Indirect jump through table.
|
||
|
.LBB0_2: # %sw.bb
|
||
|
testq $0, %rdi # Validate index used for jump table.
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
...
|
||
|
jmp _Z4leaki # TAILCALL
|
||
|
|
||
|
.LBB0_3: # %sw.bb1
|
||
|
testq $1, %rdi # Validate index used for jump table.
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
...
|
||
|
jmp _Z4leaki # TAILCALL
|
||
|
|
||
|
.LBB0_5: # %sw.bb10
|
||
|
testq $2, %rdi # Validate index used for jump table.
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
...
|
||
|
jmp _Z4leaki # TAILCALL
|
||
|
...
|
||
|
|
||
|
.section .rodata,"a",@progbits
|
||
|
.p2align 3
|
||
|
.LJTI0_0:
|
||
|
.quad .LBB0_2
|
||
|
.quad .LBB0_3
|
||
|
.quad .LBB0_5
|
||
|
...
|
||
|
```
|
||
|
|
||
|
Returns have a simple mitigation technique on x86-64 (or other ABIs which have
|
||
|
what is called a "red zone" region beyond the end of the stack). This region is
|
||
|
guaranteed to be preserved across interrupts and context switches, making the
|
||
|
return address used in returning to the current code remain on the stack and
|
||
|
valid to read. We can emit code in the caller to verify that a return edge was
|
||
|
not mispredicted:
|
||
|
```
|
||
|
callq other_function
|
||
|
return_addr:
|
||
|
testq -8(%rsp), return_addr # Validate return address.
|
||
|
cmovneq %r8, %rax # Update predicate state.
|
||
|
```
|
||
|
|
||
|
For an ABI without a "red zone" (and thus unable to read the return address
|
||
|
from the stack), we can compute the expected return address prior to the call
|
||
|
into a register preserved across the call and use that similarly to the above.
|
||
|
|
||
|
Indirect calls (and returns in the absence of a red zone ABI) pose the most
|
||
|
significant challenge to propagate. The simplest technique would be to define a
|
||
|
new ABI such that the intended call target is passed into the called function
|
||
|
and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
|
||
|
in C and C++. While the target function could be passed in TLS, we would still
|
||
|
require complex logic to handle a mixture of functions compiled with and
|
||
|
without this extra logic (essentially, making the ABI backwards compatible).
|
||
|
Currently, we suggest using retpolines here and will continue to investigate
|
||
|
ways of mitigating this.
|
||
|
|
||
|
|
||
|
##### Optimizations, Alternatives, and Tradeoffs
|
||
|
|
||
|
Merely accumulating predicate state involves significant cost. There are
|
||
|
several key optimizations we employ to minimize this and various alternatives
|
||
|
that present different tradeoffs in the generated code.
|
||
|
|
||
|
First, we work to reduce the number of instructions used to track the state:
|
||
|
* Rather than inserting a `cmovCC` instruction along every conditional edge in
|
||
|
the original program, we track each set of condition flags we need to capture
|
||
|
prior to entering each basic block and reuse a common `cmovCC` sequence for
|
||
|
those.
|
||
|
* We could further reuse suffixes when there are multiple `cmovCC`
|
||
|
instructions required to capture the set of flags. Currently this is
|
||
|
believed to not be worth the cost as paired flags are relatively rare and
|
||
|
suffixes of them are exceedingly rare.
|
||
|
* A common pattern in x86 is to have multiple conditional jump instructions
|
||
|
that use the same flags but handle different conditions. Naively, we could
|
||
|
consider each fallthrough between them an "edge" but this causes a much more
|
||
|
complex control flow graph. Instead, we accumulate the set of conditions
|
||
|
necessary for fallthrough and use a sequence of `cmovCC` instructions in a
|
||
|
single fallthrough edge to track it.
|
||
|
|
||
|
Second, we trade register pressure for simpler `cmovCC` instructions by
|
||
|
allocating a register for the "bad" state. We could read that value from memory
|
||
|
as part of the conditional move instruction, however, this creates more
|
||
|
micro-ops and requires the load-store unit to be involved. Currently, we place
|
||
|
the value into a virtual register and allow the register allocator to decide
|
||
|
when the register pressure is sufficient to make it worth spilling to memory
|
||
|
and reloading.
|
||
|
|
||
|
|
||
|
#### Hardening Loads
|
||
|
|
||
|
Once we have the predicate accumulated into a special value for correct vs.
|
||
|
misspeculated, we need to apply this to loads in a way that ensures they do not
|
||
|
leak secret data. There are two primary techniques for this: we can either
|
||
|
harden the loaded value to prevent observation, or we can harden the address
|
||
|
itself to prevent the load from occurring. These have significantly different
|
||
|
performance tradeoffs.
|
||
|
|
||
|
|
||
|
##### Hardening loaded values
|
||
|
|
||
|
The most appealing way to harden loads is to mask out all of the bits loaded.
|
||
|
The key requirement is that for each bit loaded, along the misspeculated path
|
||
|
that bit is always fixed at either 0 or 1 regardless of the value of the bit
|
||
|
loaded. The most obvious implementation uses either an `and` instruction with
|
||
|
an all-zero mask along misspeculated paths and an all-one mask along correct
|
||
|
paths, or an `or` instruction with an all-one mask along misspeculated paths
|
||
|
and an all-zero mask along correct paths. Other options become less appealing
|
||
|
such as multiplying by zero, or multiple shift instructions. For reasons we
|
||
|
elaborate on below, we end up suggesting you use `or` with an all-ones mask,
|
||
|
making the x86 instruction sequence look like the following:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
movl (%rsi), %edi # Load potentially secret data from %rsi.
|
||
|
orl %eax, %edi
|
||
|
```
|
||
|
|
||
|
Other useful patterns may be to fold the load into the `or` instruction itself
|
||
|
at the cost of a register-to-register copy.
|
||
|
|
||
|
There are some challenges with deploying this approach:
|
||
|
1. Many loads on x86 are folded into other instructions. Separating them would
|
||
|
add very significant and costly register pressure with prohibitive
|
||
|
performance cost.
|
||
|
1. Loads may not target a general purpose register requiring extra instructions
|
||
|
to map the state value into the correct register class, and potentially more
|
||
|
expensive instructions to mask the value in some way.
|
||
|
1. The flags registers on x86 are very likely to be live, and challenging to
|
||
|
preserve cheaply.
|
||
|
1. There are many more values loaded than pointers & indices used for loads. As
|
||
|
a consequence, hardening the result of a load requires substantially more
|
||
|
instructions than hardening the address of the load (see below).
|
||
|
|
||
|
Despite these challenges, hardening the result of the load critically allows
|
||
|
the load to proceed and thus has dramatically less impact on the total
|
||
|
speculative / out-of-order potential of the execution. There are also several
|
||
|
interesting techniques to try and mitigate these challenges and make hardening
|
||
|
the results of loads viable in at least some cases. However, we generally
|
||
|
expect to fall back when unprofitable from hardening the loaded value to the
|
||
|
next approach of hardening the address itself.
|
||
|
|
||
|
|
||
|
###### Loads folded into data-invariant operations can be hardened after the operation
|
||
|
|
||
|
The first key to making this feasible is to recognize that many operations on
|
||
|
x86 are "data-invariant". That is, they have no (known) observable behavior
|
||
|
differences due to the particular input data. These instructions are often used
|
||
|
when implementing cryptographic primitives dealing with private key data
|
||
|
because they are not believed to provide any side-channels. Similarly, we can
|
||
|
defer hardening until after them as they will not in-and-of-themselves
|
||
|
introduce a speculative execution side-channel. This results in code sequences
|
||
|
that look like:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
addl (%rsi), %edi # Load and accumulate without leaking.
|
||
|
orl %eax, %edi
|
||
|
```
|
||
|
|
||
|
While an addition happens to the loaded (potentially secret) value, that
|
||
|
doesn't leak any data and we then immediately harden it.
|
||
|
|
||
|
|
||
|
###### Hardening of loaded values deferred down the data-invariant expression graph
|
||
|
|
||
|
We can generalize the previous idea and sink the hardening down the expression
|
||
|
graph across as many data-invariant operations as desirable. This can use very
|
||
|
conservative rules for whether something is data-invariant. The primary goal
|
||
|
should be to handle multiple loads with a single hardening instruction:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
addl (%rsi), %edi # Load and accumulate without leaking.
|
||
|
addl 4(%rsi), %edi # Continue without leaking.
|
||
|
addl 8(%rsi), %edi
|
||
|
orl %eax, %edi # Mask out bits from all three loads.
|
||
|
```
|
||
|
|
||
|
|
||
|
###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors
|
||
|
|
||
|
Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
|
||
|
without touching the flag registers. However, we can harden loaded values that
|
||
|
are narrower than a word (fewer than 32-bits on 32-bit systems and fewer than
|
||
|
64-bits on 64-bit systems) by zero-extending the value to the full word size
|
||
|
and then shifting right by at least the number of original bits using the BMI2
|
||
|
`shrx` instruction:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
addl (%rsi), %edi # Load and accumulate 32 bits of data.
|
||
|
shrxq %rax, %rdi, %rdi # Shift out all 32 bits loaded.
|
||
|
```
|
||
|
|
||
|
Because on x86 the zero-extend is free, this can efficiently harden the loaded
|
||
|
value.
|
||
|
|
||
|
|
||
|
##### Hardening the address of the load
|
||
|
|
||
|
When hardening the loaded value is inapplicable, most often because the
|
||
|
instruction directly leaks information (like `cmp` or `jmpq`), we switch to
|
||
|
hardening the _address_ of the load instead of the loaded value. This avoids
|
||
|
increasing register pressure by unfolding the load or paying some other high
|
||
|
cost.
|
||
|
|
||
|
To understand how this works in practice, we need to examine the exact
|
||
|
semantics of the x86 addressing modes which, in its fully general form, looks
|
||
|
like `(%base,%index,scale)offset`. Here `%base` and `%index` are 64-bit
|
||
|
registers that can potentially be any value, and may be attacker controlled,
|
||
|
and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
|
||
|
`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
|
||
|
computation performed to find the address is then: `%base + (scale * %index) +
|
||
|
offset` under 64-bit 2's complement modular arithmetic.
|
||
|
|
||
|
One issue with this approach is that, after hardening, the `%base + (scale *
|
||
|
%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
|
||
|
then a large, positive `offset` will index into memory within the first two
|
||
|
gigabytes of address space. While these offsets are not attacker controlled,
|
||
|
the attacker could chose to attack a load which happens to have the desired
|
||
|
offset and then successfully read memory in that region. This significantly
|
||
|
raises the burden on the attacker and limits the scope of attack but does not
|
||
|
eliminate it. To fully close the attack we must work with the operating system
|
||
|
to preclude mapping memory in the low two gigabytes of address space.
|
||
|
|
||
|
|
||
|
###### 64-bit load checking instructions
|
||
|
|
||
|
We can use the following instruction sequences to check loads. We set up `%r8`
|
||
|
in these examples to hold the special value of `-1` which will be `cmov`ed over
|
||
|
`%rax` in misspeculated paths.
|
||
|
|
||
|
Single register addressing mode:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
orq %rax, %rsi # Mask the pointer if misspeculating.
|
||
|
movl (%rsi), %edi
|
||
|
```
|
||
|
|
||
|
Two register addressing mode:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
orq %rax, %rsi # Mask the pointer if misspeculating.
|
||
|
orq %rax, %rcx # Mask the index if misspeculating.
|
||
|
movl (%rsi,%rcx), %edi
|
||
|
```
|
||
|
|
||
|
This will result in a negative address near zero or in `offset` wrapping the
|
||
|
address space back to a small positive address. Small, negative addresses will
|
||
|
fault in user-mode for most operating systems, but targets which need the high
|
||
|
address space to be user accessible may need to adjust the exact sequence used
|
||
|
above. Additionally, the low addresses will need to be marked unreadable by the
|
||
|
OS to fully harden the load.
|
||
|
|
||
|
|
||
|
###### RIP-relative addressing is even easier to break
|
||
|
|
||
|
There is a common addressing mode idiom that is substantially harder to check:
|
||
|
addressing relative to the instruction pointer. We cannot change the value of
|
||
|
the instruction pointer register and so we have the harder problem of forcing
|
||
|
`%base + scale * %index + offset` to be an invalid address, by *only* changing
|
||
|
`%index`. The only advantage we have is that the attacker also cannot modify
|
||
|
`%base`. If we use the fast instruction sequence above, but only apply it to
|
||
|
the index, we will always access `%rip + (scale * -1) + offset`. If the
|
||
|
attacker can find a load which with this address happens to point to secret
|
||
|
data, then they can reach it. However, the loader and base libraries can also
|
||
|
simply refuse to map the heap, data segments, or stack within 2gb of any of the
|
||
|
text in the program, much like it can reserve the low 2gb of address space.
|
||
|
|
||
|
|
||
|
###### The flag registers again make everything hard
|
||
|
|
||
|
Unfortunately, the technique of using `orq`-instructions has a serious flaw on
|
||
|
x86. The very thing that makes it easy to accumulate state, the flag registers
|
||
|
containing predicates, causes serious problems here because they may be alive
|
||
|
and used by the loading instruction or subsequent instructions. On x86, the
|
||
|
`orq` instruction **sets** the flags and will override anything already there.
|
||
|
This makes inserting them into the instruction stream very hazardous.
|
||
|
Unfortunately, unlike when hardening the loaded value, we have no fallback here
|
||
|
and so we must have a fully general approach available.
|
||
|
|
||
|
The first thing we must do when generating these sequences is try to analyze
|
||
|
the surrounding code to prove that the flags are not in fact alive or being
|
||
|
used. Typically, it has been set by some other instruction which just happens
|
||
|
to set the flags register (much like ours!) with no actual dependency. In those
|
||
|
cases, it is safe to directly insert these instructions. Alternatively we may
|
||
|
be able to move them earlier to avoid clobbering the used value.
|
||
|
|
||
|
However, this may ultimately be impossible. In that case, we need to preserve
|
||
|
the flags around these instructions:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
pushfq
|
||
|
orq %rax, %rcx # Mask the pointer if misspeculating.
|
||
|
orq %rax, %rdx # Mask the index if misspeculating.
|
||
|
popfq
|
||
|
movl (%rcx,%rdx), %edi
|
||
|
```
|
||
|
|
||
|
Using the `pushf` and `popf` instructions saves the flags register around our
|
||
|
inserted code, but comes at a high cost. First, we must store the flags to the
|
||
|
stack and reload them. Second, this causes the stack pointer to be adjusted
|
||
|
dynamically, requiring a frame pointer be used for referring to temporaries
|
||
|
spilled to the stack, etc.
|
||
|
|
||
|
On newer x86 processors we can use the `lahf` and `sahf` instructions to save
|
||
|
all of the flags besides the overflow flag in a register rather than on the
|
||
|
stack. We can then use `seto` and `add` to save and restore the overflow flag
|
||
|
in a register. Combined, this will save and restore flags in the same manner as
|
||
|
above but using two registers rather than the stack. That is still very
|
||
|
expensive if slightly less expensive than `pushf` and `popf` in most cases.
|
||
|
|
||
|
|
||
|
###### A flag-less alternative on Haswell, Zen and newer processors
|
||
|
|
||
|
Starting with the BMI2 x86 instruction set extensions available on Haswell and
|
||
|
Zen processors, there is an instruction for shifting that does not set any
|
||
|
flags: `shrx`. We can use this and the `lea` instruction to implement analogous
|
||
|
code sequences to the above ones. However, these are still very marginally
|
||
|
slower, as there are fewer ports able to dispatch shift instructions in most
|
||
|
modern x86 processors than there are for `or` instructions.
|
||
|
|
||
|
Fast, single register addressing mode:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
shrxq %rax, %rsi, %rsi # Shift away bits if misspeculating.
|
||
|
movl (%rsi), %edi
|
||
|
```
|
||
|
|
||
|
This will collapse the register to zero or one, and everything but the offset
|
||
|
in the addressing mode to be less than or equal to 9. This means the full
|
||
|
address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may wish
|
||
|
to protect an extra page of the low address space to account for this
|
||
|
|
||
|
|
||
|
##### Optimizations
|
||
|
|
||
|
A very large portion of the cost for this approach comes from checking loads in
|
||
|
this way, so it is important to work to optimize this. However, beyond making
|
||
|
the instruction sequences to *apply* the checks efficient (for example by
|
||
|
avoiding `pushfq` and `popfq` sequences), the only significant optimization is
|
||
|
to check fewer loads without introducing a vulnerability. We apply several
|
||
|
techniques to accomplish that.
|
||
|
|
||
|
|
||
|
###### Don't check loads from compile-time constant stack offsets
|
||
|
|
||
|
We implement this optimization on x86 by skipping the checking of loads which
|
||
|
use a fixed frame pointer offset.
|
||
|
|
||
|
The result of this optimization is that patterns like reloading a spilled
|
||
|
register or accessing a global field don't get checked. This is a very
|
||
|
significant performance win.
|
||
|
|
||
|
|
||
|
###### Don't check dependent loads
|
||
|
|
||
|
A core part of why this mitigation strategy works is that it establishes a
|
||
|
data-flow check on the loaded address. However, this means that if the address
|
||
|
itself was already loaded using a checked load, there is no need to check a
|
||
|
dependent load provided it is within the same basic block as the checked load,
|
||
|
and therefore has no additional predicates guarding it. Consider code like the
|
||
|
following:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
movq (%rcx), %rdi
|
||
|
movl (%rdi), %edx
|
||
|
```
|
||
|
|
||
|
This will get transformed into:
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
orq %rax, %rcx # Mask the pointer if misspeculating.
|
||
|
movq (%rcx), %rdi # Hardened load.
|
||
|
movl (%rdi), %edx # Unhardened load due to dependent addr.
|
||
|
```
|
||
|
|
||
|
This doesn't check the load through `%rdi` as that pointer is dependent on a
|
||
|
checked load already.
|
||
|
|
||
|
|
||
|
###### Protect large, load-heavy blocks with a single lfence
|
||
|
|
||
|
It may be worth using a single `lfence` instruction at the start of a block
|
||
|
which begins with a (very) large number of loads that require independent
|
||
|
protection *and* which require hardening the address of the load. However, this
|
||
|
is unlikely to be profitable in practice. The latency hit of the hardening
|
||
|
would need to exceed that of an `lfence` when *correctly* speculatively
|
||
|
executed. But in that case, the `lfence` cost is a complete loss of speculative
|
||
|
execution (at a minimum). So far, the evidence we have of the performance cost
|
||
|
of using `lfence` indicates few if any hot code patterns where this trade off
|
||
|
would make sense.
|
||
|
|
||
|
|
||
|
###### Tempting optimizations that break the security model
|
||
|
|
||
|
Several optimizations were considered which didn't pan out due to failure to
|
||
|
uphold the security model. One in particular is worth discussing as many others
|
||
|
will reduce to it.
|
||
|
|
||
|
We wondered whether only the *first* load in a basic block could be checked. If
|
||
|
the check works as intended, it forms an invalid pointer that doesn't even
|
||
|
virtual-address translate in the hardware. It should fault very early on in its
|
||
|
processing. Maybe that would stop things in time for the misspeculated path to
|
||
|
fail to leak any secrets. This doesn't end up working because the processor is
|
||
|
fundamentally out-of-order, even in its speculative domain. As a consequence,
|
||
|
the attacker could cause the initial address computation itself to stall and
|
||
|
allow an arbitrary number of unrelated loads (including attacked loads of
|
||
|
secret data) to pass through.
|
||
|
|
||
|
|
||
|
#### Interprocedural Checking
|
||
|
|
||
|
Modern x86 processors may speculate into called functions and out of functions
|
||
|
to their return address. As a consequence, we need a way to check loads that
|
||
|
occur after a misspeculated predicate but where the load and the misspeculated
|
||
|
predicate are in different functions. In essence, we need some interprocedural
|
||
|
generalization of the predicate state tracking. A primary challenge to passing
|
||
|
the predicate state between functions is that we would like to not require a
|
||
|
change to the ABI or calling convention in order to make this mitigation more
|
||
|
deployable, and further would like code mitigated in this way to be easily
|
||
|
mixed with code not mitigated in this way and without completely losing the
|
||
|
value of the mitigation.
|
||
|
|
||
|
|
||
|
##### Embed the predicate state into the high bit(s) of the stack pointer
|
||
|
|
||
|
We can use the same technique that allows hardening pointers to pass the
|
||
|
predicate state into and out of functions. The stack pointer is trivially
|
||
|
passed between functions and we can test for it having the high bits set to
|
||
|
detect when it has been marked due to misspeculation. The callsite instruction
|
||
|
sequence looks like (assuming a misspeculated state value of `-1`):
|
||
|
```
|
||
|
...
|
||
|
|
||
|
.LBB0_4: # %danger
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
shlq $47, %rax
|
||
|
orq %rax, %rsp
|
||
|
callq other_function
|
||
|
movq %rsp, %rax
|
||
|
sarq 63, %rax # Sign extend the high bit to all bits.
|
||
|
```
|
||
|
|
||
|
This first puts the predicate state into the high bits of `%rsp` before calling
|
||
|
the function and then reads it back out of high bits of `%rsp` afterward. When
|
||
|
correctly executing (speculatively or not), these are all no-ops. When
|
||
|
misspeculating, the stack pointer will end up negative. We arrange for it to
|
||
|
remain a canonical address, but otherwise leave the low bits alone to allow
|
||
|
stack adjustments to proceed normally without disrupting this. Within the
|
||
|
called function, we can extract this predicate state and then reset it on
|
||
|
return:
|
||
|
```
|
||
|
other_function:
|
||
|
# prolog
|
||
|
callq other_function
|
||
|
movq %rsp, %rax
|
||
|
sarq 63, %rax # Sign extend the high bit to all bits.
|
||
|
# ...
|
||
|
|
||
|
.LBB0_N:
|
||
|
cmovneq %r8, %rax # Conditionally update predicate state.
|
||
|
shlq $47, %rax
|
||
|
orq %rax, %rsp
|
||
|
retq
|
||
|
```
|
||
|
|
||
|
This approach is effective when all code is mitigated in this fashion, and can
|
||
|
even survive very limited reaches into unmitigated code (the state will
|
||
|
round-trip in and back out of an unmitigated function, it just won't be
|
||
|
updated). But it does have some limitations. There is a cost to merging the
|
||
|
state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
|
||
|
an unmitigated caller.
|
||
|
|
||
|
There is also an advantage to using this form of interprocedural mitigation: by
|
||
|
forming these invalid stack pointer addresses we can prevent speculative
|
||
|
returns from successfully reading speculatively written values to the actual
|
||
|
stack. This works first by forming a data-dependency between computing the
|
||
|
address of the return address on the stack and our predicate state. And even
|
||
|
when satisfied, if a misprediction causes the state to be poisoned the
|
||
|
resulting stack pointer will be invalid.
|
||
|
|
||
|
|
||
|
##### Rewrite API of internal functions to directly propagate predicate state
|
||
|
|
||
|
(Not yet implemented.)
|
||
|
|
||
|
We have the option with internal functions to directly adjust their API to
|
||
|
accept the predicate as an argument and return it. This is likely to be
|
||
|
marginally cheaper than embedding into `%rsp` for entering functions.
|
||
|
|
||
|
|
||
|
##### Use `lfence` to guard function transitions
|
||
|
|
||
|
An `lfence` instruction can be used to prevent subsequent loads from
|
||
|
speculatively executing until all prior mispredicted predicates have resolved.
|
||
|
We can use this broader barrier to speculative loads executing between
|
||
|
functions. We emit it in the entry block to handle calls, and prior to each
|
||
|
return. This approach also has the advantage of providing the strongest degree
|
||
|
of mitigation when mixed with unmitigated code by halting all misspeculation
|
||
|
entering a function which is mitigated, regardless of what occurred in the
|
||
|
caller. However, such a mixture is inherently more risky. Whether this kind of
|
||
|
mixture is a sufficient mitigation requires careful analysis.
|
||
|
|
||
|
Unfortunately, experimental results indicate that the performance overhead of
|
||
|
this approach is very high for certain patterns of code. A classic example is
|
||
|
any form of recursive evaluation engine. The hot, rapid call and return
|
||
|
sequences exhibit dramatic performance loss when mitigated with `lfence`. This
|
||
|
component alone can regress performance by 2x or more, making it an unpleasant
|
||
|
tradeoff even when only used in a mixture of code.
|
||
|
|
||
|
|
||
|
##### Use an internal TLS location to pass predicate state
|
||
|
|
||
|
We can define a special thread-local value to hold the predicate state between
|
||
|
functions. This avoids direct ABI implications by using a side channel between
|
||
|
callers and callees to communicate the predicate state. It also allows implicit
|
||
|
zero-initialization of the state, which allows non-checked code to be the first
|
||
|
code executed.
|
||
|
|
||
|
However, this requires a load from TLS in the entry block, a store to TLS
|
||
|
before every call and every ret, and a load from TLS after every call. As a
|
||
|
consequence it is expected to be substantially more expensive even than using
|
||
|
`%rsp` and potentially `lfence` within the function entry block.
|
||
|
|
||
|
|
||
|
##### Define a new ABI and/or calling convention
|
||
|
|
||
|
We could define a new ABI and/or calling convention to explicitly pass the
|
||
|
predicate state in and out of functions. This may be interesting if none of the
|
||
|
alternatives have adequate performance, but it makes deployment and adoption
|
||
|
dramatically more complex, and potentially infeasible.
|
||
|
|
||
|
|
||
|
## High-Level Alternative Mitigation Strategies
|
||
|
|
||
|
There are completely different alternative approaches to mitigating variant 1
|
||
|
attacks. [Most](https://lwn.net/Articles/743265/)
|
||
|
[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
|
||
|
specific known attackable components in the Linux kernel (or other kernels) by
|
||
|
manually rewriting the code to contain an instruction sequence that is not
|
||
|
vulnerable. For x86 systems this is done by either injecting an `lfence`
|
||
|
instruction along the code path which would leak data if executed speculatively
|
||
|
or by rewriting memory accesses to have branch-less masking to a known safe
|
||
|
region. On Intel systems, `lfence` [will prevent the speculative load of secret
|
||
|
data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
|
||
|
On AMD systems `lfence` is currently a no-op, but can be made
|
||
|
dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
|
||
|
code path ([mitigation G-2 +
|
||
|
V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).
|
||
|
|
||
|
However, this relies on finding and enumerating all possible points in code
|
||
|
which could be attacked to leak information. While in some cases static
|
||
|
analysis is effective at doing this at scale, in many cases it still relies on
|
||
|
human judgement to evaluate whether code might be vulnerable. Especially for
|
||
|
software systems which receive less detailed scrutiny but remain sensitive to
|
||
|
these attacks, this seems like an impractical security model. We need an
|
||
|
automatic and systematic mitigation strategy.
|
||
|
|
||
|
|
||
|
### Automatic `lfence` on Conditional Edges
|
||
|
|
||
|
A natural way to scale up the existing hand-coded mitigations is simply to
|
||
|
inject an `lfence` instruction into both the target and fallthrough
|
||
|
destinations of every conditional branch. This ensures that no predicate or
|
||
|
bounds check can be bypassed speculatively. However, the performance overhead
|
||
|
of this approach is, simply put, catastrophic. Yet it remains the only truly
|
||
|
"secure by default" approach known prior to this effort and serves as the
|
||
|
baseline for performance.
|
||
|
|
||
|
One attempt to address the performance overhead of this and make it more
|
||
|
realistic to deploy is [MSVC's /Qspectre
|
||
|
switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
|
||
|
Their technique is to use static analysis within the compiler to only insert
|
||
|
`lfence` instructions into conditional edges at risk of attack. However,
|
||
|
[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
|
||
|
[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
|
||
|
has shown that this approach is incomplete and only catches a small and limited
|
||
|
subset of attackable patterns which happen to resemble very closely the initial
|
||
|
proofs of concept. As such, while its performance is acceptable, it does not
|
||
|
appear to be an adequate systematic mitigation.
|
||
|
|
||
|
|
||
|
## Performance Overhead
|
||
|
|
||
|
The performance overhead of this style of comprehensive mitigation is very
|
||
|
high. However, it compares very favorably with previously recommended
|
||
|
approaches such as the `lfence` instruction. Just as users can restrict the
|
||
|
scope of `lfence` to control its performance impact, this mitigation technique
|
||
|
could be restricted in scope as well.
|
||
|
|
||
|
However, it is important to understand what it would cost to get a fully
|
||
|
mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
|
||
|
using all of the tricks to improve performance (so leaves the low 2gb
|
||
|
unprotected and +/- 2gb surrounding any PC in the program). We ran both
|
||
|
Google's microbenchmark suite and a large highly-tuned server built using
|
||
|
ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
|
||
|
instructions, and benchmarks were run on large Haswell servers. We collected
|
||
|
data both with an `lfence`-based mitigation and load hardening as presented
|
||
|
here. The summary is that mitigating with load hardening is 1.77x faster than
|
||
|
mitigating with `lfence`, and the overhead of load hardening compared to a
|
||
|
normal program is likely between a 10% overhead and a 50% overhead with most
|
||
|
large applications seeing a 30% overhead or less.
|
||
|
|
||
|
| Benchmark | `lfence` | Load Hardening | Mitigated Speedup |
|
||
|
| -------------------------------------- | -------: | -------------: | ----------------: |
|
||
|
| Google microbenchmark suite | -74.8% | -36.4% | **2.5x** |
|
||
|
| Large server QPS (using ThinLTO & PGO) | -62% | -29% | **1.8x** |
|
||
|
|
||
|
Below is a visualization of the microbenchmark suite results which helps show
|
||
|
the distribution of results that is somewhat lost in the summary. The y-axis is
|
||
|
a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
|
||
|
-> better). Each box-and-whiskers represents one microbenchmark which may have
|
||
|
many different metrics measured. The red line marks the median, the box marks
|
||
|
the first and third quartiles, and the whiskers mark the min and max.
|
||
|
|
||
|
![Microbenchmark result visualization](speculative_load_hardening_microbenchmarks.png)
|
||
|
|
||
|
We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
|
||
|
work on getting that. Still, the above should give a pretty clear
|
||
|
characterization of the performance, and specific benchmarks are unlikely to
|
||
|
reveal especially interesting properties.
|
||
|
|
||
|
|
||
|
### Future Work: Fine Grained Control and API-Integration
|
||
|
|
||
|
The performance overhead of this technique is likely to be very significant and
|
||
|
something users wish to control or reduce. There are interesting options here
|
||
|
that impact the implementation strategy used.
|
||
|
|
||
|
One particularly appealing option is to allow both opt-in and opt-out of this
|
||
|
mitigation at reasonably fine granularity such as on a per-function basis,
|
||
|
including intelligent handling of inlining decisions -- protected code can be
|
||
|
prevented from inlining into unprotected code, and unprotected code will become
|
||
|
protected when inlined into protected code. For systems where only a limited
|
||
|
set of code is reachable by externally controlled inputs, it may be possible to
|
||
|
limit the scope of mitigation through such mechanisms without compromising the
|
||
|
application's overall security. The performance impact may also be focused in a
|
||
|
few key functions that can be hand-mitigated in ways that have lower
|
||
|
performance overhead while the remainder of the application receives automatic
|
||
|
protection.
|
||
|
|
||
|
For both limiting the scope of mitigation or manually mitigating hot functions,
|
||
|
there needs to be some support for mixing mitigated and unmitigated code
|
||
|
without completely defeating the mitigation. For the first use case, it would
|
||
|
be particularly desirable that mitigated code remains safe when being called
|
||
|
during misspeculation from unmitigated code.
|
||
|
|
||
|
For the second use case, it may be important to connect the automatic
|
||
|
mitigation technique to explicit mitigation APIs such as what is described in
|
||
|
http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
|
||
|
to switch from automatic to manual mitigation without immediately exposing a
|
||
|
hole. However, the design for how to do this is hard to come up with until the
|
||
|
APIs are better established. We will revisit this as those APIs mature.
|