Copyright (c) 2015-2016 Linaro Ltd.

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

Introduction
============

This document outlines the design for multi-threaded TCG system-mode
emulation. The current user-mode emulation mirrors the thread
structure of the translated executable. Some of the work will be
applicable to both system and linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This will be enabled by default for all FE/BE
combinations that have had the required work done to support this
safely.

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise lock contention while doing
so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.

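As an illustration of the lock-free hot-path, the following is a
minimal sketch of a per-vCPU jump cache whose entries are read and
written with C11 atomics. It is not QEMU's implementation: the
SketchTB and SketchJmpCache types, the hash function and the cache
size are all assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_BITS 12
#define SKETCH_CACHE_SIZE (1u << SKETCH_CACHE_BITS)

/* Hypothetical, heavily reduced stand-in for a translation block. */
typedef struct SketchTB {
    uint64_t pc;          /* guest PC the block was translated from */
    atomic_bool invalid;  /* set once the block must not be used */
} SketchTB;

/* One cache per vCPU; every slot is accessed atomically. */
typedef struct SketchJmpCache {
    _Atomic(SketchTB *) entry[SKETCH_CACHE_SIZE];
} SketchJmpCache;

/* Hypothetical hash: drop the low bits and mask to the table size. */
static size_t sketch_hash(uint64_t pc)
{
    return (size_t)((pc >> 2) & (SKETCH_CACHE_SIZE - 1));
}

/* Lock-free fast path. Returns NULL on a miss so that the caller can
 * fall back to the slower global lookup. */
static SketchTB *sketch_cache_lookup(SketchJmpCache *c, uint64_t pc)
{
    assert(c);
    SketchTB *tb = atomic_load_explicit(&c->entry[sketch_hash(pc)],
                                        memory_order_acquire);
    if (tb && tb->pc == pc &&
        !atomic_load_explicit(&tb->invalid, memory_order_acquire)) {
        return tb;
    }
    return NULL;
}

/* Publish a block; the release store pairs with the acquire load. */
static void sketch_cache_insert(SketchJmpCache *c, SketchTB *tb)
{
    assert(c && tb);
    atomic_store_explicit(&c->entry[sketch_hash(tb->pc)], tb,
                          memory_order_release);
}
```

A reader never blocks: it either sees a fully initialised block or a
miss. Checking the invalid flag on every hit is what lets an
invalidation take effect without clearing every cache first.
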
Global TCG State
----------------

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Mainly as part of the linux-user work, all code generation is
serialised with a tb_lock(). For the SoftMMU, tb_lock() also takes the
place of mmap_lock() in linux-user.

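The pattern of a lock-free lookup backed by serialised generation can
be sketched as follows. This is an illustration only: the slot table,
the integer "translation ids" and the sketch_tb_lock mutex are
assumptions standing in for the real lookup structures and tb_lock().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_SLOTS 64

/* Hypothetical stand-in for the lock serialising code generation. */
static pthread_mutex_t sketch_tb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared lookup table: 0 means "no translation yet". */
static _Atomic(int) sketch_table[SKETCH_SLOTS];

/* Number of translations generated; only touched under the lock. */
static int sketch_generated;

/* Return the translation id for a slot, generating it on a miss. */
static int sketch_find_or_generate(size_t slot)
{
    assert(slot < SKETCH_SLOTS);

    /* Fast path: no lock is taken when the translation exists. */
    int id = atomic_load_explicit(&sketch_table[slot],
                                  memory_order_acquire);
    if (id != 0) {
        return id;
    }

    /* Slow path: serialise generation, then re-check under the lock
     * in case another thread generated the block in the meantime. */
    pthread_mutex_lock(&sketch_tb_lock);
    id = atomic_load_explicit(&sketch_table[slot], memory_order_relaxed);
    if (id == 0) {
        id = ++sketch_generated;   /* "generate" the translation */
        atomic_store_explicit(&sketch_table[slot], id,
                              memory_order_release);
    }
    pthread_mutex_unlock(&sketch_tb_lock);
    return id;
}
```

The re-check under the lock is the important step: without it two
vCPU threads that miss at the same time would both generate a
translation for the same slot.
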
Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and start
from scratch again. Some operations also force a full flush of
translations including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated
including the:

  - jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of the
tb_lock().

The global page table is protected by the tb_lock() in system-mode and
mmap_lock() in linux-user mode.

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent-safe lookup.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than their own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating the TLBs of a number of vCPUs at the
same time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - cross vCPU TLB flush may need other vCPUs brought to a halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state when
it next runs its work (in a few instructions' time).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.

TLB flag updates are all done atomically and are also protected by the
tb_lock() which is used by the functions that update the TLB in bulk.

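The deferral idea can be pictured as a per-vCPU work queue: any thread
may queue work, but only the owning vCPU thread runs it, so the
per-vCPU TLB is only ever modified by its own thread. The sketch below
is an assumption-laden illustration, not the code behind
async_run_on_cpu(); SketchCPU, the queue layout and the
tlb_generation counter are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct SketchCPU SketchCPU;
typedef void (*sketch_work_fn)(SketchCPU *cpu, void *opaque);

typedef struct SketchWork {
    struct SketchWork *next;
    sketch_work_fn fn;
    void *opaque;
} SketchWork;

struct SketchCPU {
    pthread_mutex_t work_lock;  /* protects head/tail only */
    SketchWork *head, *tail;
    int tlb_generation;         /* bumped by each (pretend) TLB flush */
};

static void sketch_cpu_init(SketchCPU *cpu)
{
    assert(cpu);
    pthread_mutex_init(&cpu->work_lock, NULL);
    cpu->head = cpu->tail = NULL;
    cpu->tlb_generation = 0;
}

/* May be called from any thread. Returns 0 on success, -1 on OOM. */
static int sketch_async_run_on_cpu(SketchCPU *cpu, sketch_work_fn fn,
                                   void *opaque)
{
    SketchWork *w = malloc(sizeof(*w));
    if (!w) {
        return -1;
    }
    w->next = NULL;
    w->fn = fn;
    w->opaque = opaque;

    pthread_mutex_lock(&cpu->work_lock);
    if (cpu->tail) {
        cpu->tail->next = w;
    } else {
        cpu->head = w;
    }
    cpu->tail = w;
    pthread_mutex_unlock(&cpu->work_lock);
    return 0;
}

/* Called only by the owning vCPU thread, between translated blocks.
 * Returns the number of work items that were run. */
static int sketch_process_queued_work(SketchCPU *cpu)
{
    int ran = 0;
    for (;;) {
        pthread_mutex_lock(&cpu->work_lock);
        SketchWork *w = cpu->head;
        if (w) {
            cpu->head = w->next;
            if (!cpu->head) {
                cpu->tail = NULL;
            }
        }
        pthread_mutex_unlock(&cpu->work_lock);
        if (!w) {
            break;
        }
        w->fn(cpu, w->opaque);  /* run outside the queue lock */
        free(w);
        ran++;
    }
    return ran;
}

/* Example work item: a stand-in for flushing this vCPU's TLB. */
static void sketch_tlb_flush_work(SketchCPU *cpu, void *opaque)
{
    (void)opaque;
    cpu->tlb_generation++;
}
```

Work items run outside the queue lock, so a flush that itself queues
further work cannot deadlock on work_lock.
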
(Known limitation)

Not really a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently thanks to KVM work any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is
expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest requires it.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or to just loads or just stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, a change to a signal flag will only be visible once the
changes to the payload are.

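The signal flag and payload example can be sketched with C11 fences.
This is a generic illustration of the pattern rather than anything
taken from QEMU; the SketchMailbox type and function names are
assumptions.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct SketchMailbox {
    int payload;       /* plain data, published via the flag */
    atomic_int ready;  /* signal flag: non-zero once payload is valid */
} SketchMailbox;

/* Producer: write the payload, then raise the flag. The release fence
 * stops the payload store being reordered after the flag store. */
static void sketch_publish(SketchMailbox *m, int value)
{
    assert(m);
    m->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&m->ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and fills *out if the payload was published,
 * otherwise returns 0 and leaves *out untouched. The acquire fence
 * stops the payload load being reordered before the flag load. */
static int sketch_try_consume(SketchMailbox *m, int *out)
{
    assert(m && out);
    if (!atomic_load_explicit(&m->ready, memory_order_relaxed)) {
        return 0;
    }
    atomic_thread_fence(memory_order_acquire);
    *out = m->payload;
    return 1;
}
```

The two fences form a pair: a consumer that observes the flag set is
guaranteed to observe the payload written before it, with no lock
involved.
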
DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas ARM has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips

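Merging barriers "up to their strongest form" can be illustrated with
a bitmask per barrier, where each bit names one ordering that must be
preserved and a merged barrier is the bitwise OR of its parts. The
sketch below is not tcg_optimize(): the SK_MO_* flags, the op
representation and the choice to merge only strictly adjacent barriers
are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ordering bits: "earlier X before later Y". */
enum {
    SK_MO_LD_LD = 1 << 0,
    SK_MO_ST_LD = 1 << 1,
    SK_MO_LD_ST = 1 << 2,
    SK_MO_ST_ST = 1 << 3,
};

typedef enum { SK_OP_MB, SK_OP_LOAD, SK_OP_STORE, SK_OP_ALU } SketchOpc;

typedef struct SketchOp {
    SketchOpc opc;
    unsigned mo;   /* ordering bits, only meaningful for SK_OP_MB */
} SketchOp;

/* Collapse runs of adjacent barriers in place. The surviving barrier
 * carries the union of the orderings, so it is at least as strong as
 * each barrier it replaces. Returns the new number of ops. */
static size_t sketch_merge_barriers(SketchOp *ops, size_t n)
{
    assert(ops || n == 0);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i].opc == SK_OP_MB && out > 0 &&
            ops[out - 1].opc == SK_OP_MB) {
            ops[out - 1].mo |= ops[i].mo;  /* fold into previous one */
        } else {
            ops[out++] = ops[i];
        }
    }
    return out;
}
```

A fuller pass could also merge barriers separated only by operations
that do not touch memory; this sketch deliberately stops at any
intervening op to keep the invariant easy to see.
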
Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is ARM's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
ARM's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg like semantics.

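To make the ABA caveat concrete, here is a minimal sketch of a
load/store exclusive pair built on a compare-and-swap, using C11
atomics. It is only an illustration of the general technique: the
SketchMonitor type and function names are assumptions, and it is not
the code generated by the tcg_gen_atomic_* helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of the last load-exclusive. */
typedef struct SketchMonitor {
    _Atomic(uint32_t) *addr;  /* marked address, or NULL if none */
    uint32_t val;             /* value observed by the load */
} SketchMonitor;

/* Load-exclusive: remember the address and the value that was seen. */
static uint32_t sketch_ldrex(SketchMonitor *mon, _Atomic(uint32_t) *addr)
{
    assert(mon && addr);
    mon->addr = addr;
    mon->val = atomic_load(addr);
    return mon->val;
}

/* Store-exclusive: succeeds only if memory still holds the remembered
 * value. Returns 0 on success and 1 on failure, mirroring the strex
 * status convention. */
static int sketch_strex(SketchMonitor *mon, _Atomic(uint32_t) *addr,
                        uint32_t newval)
{
    assert(mon && addr);
    int fail = 1;
    if (mon->addr == addr) {
        uint32_t expected = mon->val;
        if (atomic_compare_exchange_strong(addr, &expected, newval)) {
            fail = 0;
        }
    }
    mon->addr = NULL;  /* the exclusive region ends either way */
    return fail;
}
```

Because the store compares values rather than tracking accesses, a
location that changes from A to B and back to A between the load and
the store still lets the store succeed, which real exclusive monitors
would not allow. That is the ABA problem referred to above.
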
The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.

==========

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt