CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like.  The two tools that
are provided by qemu/atomic.h are memory barriers and atomic operations.

Macros defined by qemu/atomic.h fall in three camps:

- compiler barriers: barrier();

- weak atomic access and manual memory barriers: atomic_read(),
  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(),
  smp_mb_release(), smp_read_barrier_depends();

- sequentially consistent atomic access: everything else.


COMPILER MEMORY BARRIER
=======================

barrier() prevents the compiler from moving the memory accesses on either
side of it to the other side.  The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.

barrier() is mostly used within qemu/atomic.h itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, qemu/atomic.h will reduce stronger memory barriers to simple
compiler barriers.

Still, barrier() can be useful when writing code that can be interrupted
by signal handlers.

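For example, a loop that polls a flag written by a signal handler can use
barrier() to keep the compiler from hoisting the load of the flag out of
the loop.  The following is a minimal sketch, not actual QEMU code; the
flag and function names are illustrative, and it assumes qemu/atomic.h is
on the include path:

    #include "qemu/atomic.h"    /* for barrier() */
    #include <signal.h>

    static int pending;

    static void handler(int sig)
    {
        pending = 1;            /* runs in the interrupted thread */
    }

    static void wait_for_signal(void)
    {
        signal(SIGUSR1, handler);
        while (!pending) {
            barrier();          /* force a reload of 'pending' on
                                 * every iteration */
        }
    }

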
SEQUENTIALLY CONSISTENT ATOMIC ACCESS
=====================================

Most of the operations in the qemu/atomic.h header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

qemu/atomic.h provides the following set of atomic read-modify-write
operations:

    void atomic_inc(ptr)
    void atomic_dec(ptr)
    void atomic_add(ptr, val)
    void atomic_sub(ptr, val)
    void atomic_and(ptr, val)
    void atomic_or(ptr, val)

    typeof(*ptr) atomic_fetch_inc(ptr)
    typeof(*ptr) atomic_fetch_dec(ptr)
    typeof(*ptr) atomic_fetch_add(ptr, val)
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
    typeof(*ptr) atomic_fetch_xor(ptr, val)
    typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)

The atomic_fetch_* operations, atomic_xchg and atomic_cmpxchg return
the old value of *ptr; the operations in the first group return void.
These operations are polymorphic; they operate on any type that is as
wide as a pointer.

Similar operations return the new value of *ptr:

    typeof(*ptr) atomic_inc_fetch(ptr)
    typeof(*ptr) atomic_dec_fetch(ptr)
    typeof(*ptr) atomic_add_fetch(ptr, val)
    typeof(*ptr) atomic_sub_fetch(ptr, val)
    typeof(*ptr) atomic_and_fetch(ptr, val)
    typeof(*ptr) atomic_or_fetch(ptr, val)
    typeof(*ptr) atomic_xor_fetch(ptr, val)

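As an example, a reference count manipulated from multiple threads can
be built directly on these primitives.  This is a minimal sketch, not an
existing QEMU interface; the type and function names are illustrative:

    #include "qemu/atomic.h"
    #include <stdlib.h>

    typedef struct Obj {
        int refcount;
        /* ... payload ... */
    } Obj;

    static void obj_ref(Obj *obj)
    {
        atomic_inc(&obj->refcount);
    }

    static void obj_unref(Obj *obj)
    {
        /* atomic_fetch_dec() returns the value *before* the
         * decrement, so the last holder observes 1 and frees. */
        if (atomic_fetch_dec(&obj->refcount) == 1) {
            free(obj);
        }
    }
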
Sequentially consistent loads and stores can be done using:

    atomic_fetch_add(ptr, 0) for loads
    atomic_xchg(ptr, val) for stores

However, they are quite expensive on some platforms, notably POWER and
ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
weaker constraints:

    typeof(*ptr) atomic_mb_read(ptr)
    void         atomic_mb_set(ptr, val)

The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).

As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.

However, and this is the important difference between
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
for both threads to access the same volatile variable.  It is not the
case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g.  The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.

atomic_mb_read() and atomic_mb_set() operate on any type that is as
wide as an int or smaller.

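For example, one thread can hand data to another through a matching
store and load on the same flag.  This is a minimal sketch with
illustrative variable names:

    #include "qemu/atomic.h"
    #include <stdbool.h>

    static int data;
    static bool ready;

    /* Producer: the write to 'data' cannot be reordered after the
     * flag update. */
    static void publish(int value)
    {
        data = value;
        atomic_mb_set(&ready, true);
    }

    /* Consumer: reads the *same* flag, so once it observes 'ready'
     * it also observes the matching write to 'data'. */
    static int consume(void)
    {
        while (!atomic_mb_read(&ready)) {
            /* spin */
        }
        return data;
    }

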
WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
In general, if the algorithm you are writing includes both writes
and reads on the same side, it is simpler to use sequentially
consistent primitives.

When using this model, variables are accessed with:

- atomic_read() and atomic_set(); these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- atomic_load_acquire(), which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before atomic_load_acquire() can still be
  reordered after it.

- atomic_store_release(), which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before it.
  Operations coming after atomic_store_release() can still be
  reordered before it.

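A typical use of acquire/release is publishing a message from one thread
and consuming it in another.  This is a minimal sketch; the structure
and function names are illustrative, not an existing QEMU interface:

    #include "qemu/atomic.h"
    #include <stdbool.h>

    typedef struct Message {
        int payload;
        bool posted;
    } Message;

    /* Producer: the release store keeps the payload write from
     * being reordered after the 'posted' update. */
    static void post(Message *msg, int value)
    {
        msg->payload = value;
        atomic_store_release(&msg->posted, true);
    }

    /* Consumer: the acquire load keeps the payload read from
     * being reordered before the 'posted' check. */
    static bool try_receive(Message *msg, int *value)
    {
        if (!atomic_load_acquire(&msg->posted)) {
            return false;
        }
        *value = msg->payload;
        return true;
    }
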
Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(),
smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends().

Memory barriers control the order of references to shared memory.
They come in six kinds:

- smp_rmb() guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_rmb() puts a partial ordering on loads, but is not
  required to have any effect on stores.

- smp_wmb() guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, smp_wmb() puts a partial ordering on stores, but is not
  required to have any effect on loads.

- smp_mb_acquire() guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- smp_mb_release() guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- smp_mb() guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  smp_mb() puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- smp_read_barrier_depends() is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors.  Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and smp_read_barrier_depends() can be used instead of smp_rmb().

  Note that the first load really has to have a _data_ dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a _control_
  dependency and a full read barrier or better is required.

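The data dependency case just described is the classic pointer-publication
pattern.  Here is a minimal sketch of it using the macros from this
document; the names are illustrative:

    #include "qemu/atomic.h"

    typedef struct Foo {
        int val;
    } Foo;

    static Foo *foop;

    /* Writer: initialize the structure fully, then publish the
     * pointer.  smp_wmb() orders the initialization before the
     * publication. */
    static void publish_foo(Foo *f)
    {
        f->val = 42;
        smp_wmb();
        atomic_set(&foop, f);
    }

    /* Reader: the second load (f->val) depends on the data loaded
     * by the first (f), so the cheaper dependency barrier suffices
     * instead of smp_rmb(). */
    static int read_foo(void)
    {
        Foo *f = atomic_read(&foop);
        smp_read_barrier_depends();
        return f->val;
    }

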
This is the set of barriers that is required *between* two atomic_read()
and atomic_set() operations to achieve sequential consistency:

                    |               2nd operation                   |
                    |-----------------------------------------------|
     1st operation  | (after last)   | atomic_read | atomic_set     |
     ---------------+----------------+-------------+----------------|
     (before first) |                | none        | smp_mb_release |
     ---------------+----------------+-------------+----------------|
     atomic_read    | smp_mb_acquire | smp_rmb*    | **             |
     ---------------+----------------+-------------+----------------|
     atomic_set     | none           | smp_mb()*** | smp_wmb()      |
     ---------------+----------------+-------------+----------------|

       * Or smp_read_barrier_depends().

      ** This requires a load-store barrier.  This is achieved by
         either smp_mb_acquire() or smp_mb_release().

     *** This requires a store-load barrier.  On most machines, the only
         way to achieve this is a full barrier.

You can see that the two possible definitions of atomic_mb_read()
and atomic_mb_set() are the following:

    1) atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
       atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()

    2) atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_mb_acquire()
       atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v)

Usually the former is used, because smp_mb() is expensive and a program
normally has more reads than writes.  Therefore it makes more sense to
make atomic_mb_set() the more expensive operation.

There are two common cases in which atomic_mb_read and atomic_mb_set
generate too many memory barriers, and thus it can be useful to manually
place barriers, or use atomic_load_acquire/atomic_store_release instead:

- when a data structure has one thread that is always a writer
  and one thread that is always a reader, manual placement of
  memory barriers makes the write side faster.  Furthermore,
  correctness is easy to check for in this case using the "pairing"
  trick that is explained below:

     thread 1                                thread 1
     -------------------------               ------------------------
     (other writes)
     atomic_mb_set(&a, x)                    atomic_store_release(&a, x)
     atomic_mb_set(&b, y)                    atomic_store_release(&b, y)

                                       =>
     thread 2                                thread 2
     -------------------------               ------------------------
     y = atomic_mb_read(&b)                  y = atomic_load_acquire(&b)
     x = atomic_mb_read(&a)                  x = atomic_load_acquire(&a)
     (other reads)

  Note that the barrier between the stores in thread 1, and between
  the loads in thread 2, has been optimized here to a write or a
  read memory barrier respectively.  On some architectures, notably
  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
  smp_mb, but smp_rmb and/or smp_wmb are more efficient.

- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables).  In this case, it is possible to "hoist" the implicit
  barriers provided by atomic_mb_read() and atomic_mb_set() outside
  a loop.  For example, the above definition of atomic_mb_read() gives
  the following transformation:

     n = 0;                                  n = 0;
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
                                             smp_mb_acquire();

  Similarly, atomic_mb_set() can be transformed as follows:

                                             smp_mb_release();
     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
                                             smp_mb();

  The other thread can still use atomic_mb_read()/atomic_mb_set().

The two tricks can be combined.  In this case, splitting a loop in
two lets you hoist the barriers out of the loops _and_ eliminate the
expensive smp_mb():

                                             smp_mb_release();
     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
       atomic_mb_set(&b[i], false);          smp_wmb();
     }                                       for (i = 0; i < 10; i++)
                                               atomic_set(&b[i], false);
                                             smp_mb();


Memory barrier pairing
----------------------

A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier.  In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!

For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
both count as read barriers.  A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier.  A full barrier can pair with anything.
For example:

        thread 1             thread 2
        ===============      ===============
        a = 1;
        smp_wmb();
        b = 2;               x = b;
                             smp_rmb();
                             y = a;

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread.  This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa.  The same is true for more than two
accesses and for data dependency barriers:

        thread 1             thread 2
        ===============      ===============
        b[2] = 1;
        smp_wmb();
        x->i = 2;
        smp_wmb();
        a = x;               x = a;
                             smp_read_barrier_depends();
                             y = x->i;
                             smp_read_barrier_depends();
                             z = b[y];

smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire(),
and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release().

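In QEMU code, the shared variables in the first example above would
normally be accessed with atomic_read()/atomic_set(), so that the
compiler cannot merge, tear or hoist the accesses.  A minimal sketch of
that spelling, with illustrative names:

    #include "qemu/atomic.h"

    static int a, b;

    static void writer(void)
    {
        atomic_set(&a, 1);
        smp_wmb();              /* pairs with smp_rmb() in reader() */
        atomic_set(&b, 2);
    }

    static void reader(int *x, int *y)
    {
        *x = atomic_read(&b);
        smp_rmb();              /* pairs with smp_wmb() in writer() */
        *y = atomic_read(&a);
    }

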
COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
============================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, atomic_read and atomic_set in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile
  semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE).

  QEMU's atomic_read/set implement, if the compiler supports it, C11
  atomic relaxed semantics, and volatile semantics otherwise (a sketch
  of this mapping follows the list).
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by atomic_read/set in either Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

         atomic_OP          returns void
         atomic_OP_return   returns new value of the variable
         atomic_fetch_OP    returns the old value of the variable
         atomic_cmpxchg     returns the old value of the variable

  In QEMU, the second kind is spelled atomic_OP_fetch (see above).
  Currently Linux has atomic_fetch_or only; QEMU provides and, or, xor,
  inc, dec, add, sub.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency, which means they imply full memory barriers
  before and after the operation.

- Linux does not have an equivalent of atomic_mb_set().  In particular,
  note that smp_store_mb() is a little weaker than atomic_mb_set().
  atomic_mb_read() compiles to the same instructions as Linux's
  smp_load_acquire(), but this should be treated as an implementation
  detail.

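In C11 terms, the relaxed mapping mentioned above corresponds roughly to
the __atomic built-ins with __ATOMIC_RELAXED.  The following sketch shows
the idea only; QEMU's real macros add type checks and fallbacks for
compilers without C11-style atomics:

    /* Hypothetical illustration of relaxed semantics, not QEMU's
     * actual macro definitions. */
    #define my_atomic_read(ptr) \
        __atomic_load_n(ptr, __ATOMIC_RELAXED)
    #define my_atomic_set(ptr, val) \
        __atomic_store_n(ptr, val, __ATOMIC_RELAXED)
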
SOURCES
=======

* Documentation/memory-barriers.txt from the Linux kernel

* "The JSR-133 Cookbook for Compiler Writers", available at
  http://g.oswego.edu/dl/jmm/cookbook.html