DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt)
===================================================

QEMU often uses reference counts to track data structures that are being
accessed and should not be freed.  For example, a loop that invokes
callbacks like this is not safe:

    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
        if (ioh->revents & G_IO_OUT) {
            ioh->fd_write(ioh->opaque);
        }
    }

QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
by stashing away its "next" pointer.  However, ioh->fd_write could
actually delete the next node from the list.  The simplest way to
avoid this is to mark the node as deleted, and remove it from the
list in the above loop:

    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
        if (ioh->deleted) {
            QLIST_REMOVE(ioh, next);
            g_free(ioh);
        } else {
            if (ioh->revents & G_IO_OUT) {
                ioh->fd_write(ioh->opaque);
            }
        }
    }

If however this loop must also be reentrant, i.e. it is possible that
ioh->fd_write invokes the loop again, some kind of counting is needed:

    walking_handlers++;
    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
        if (ioh->deleted) {
            if (walking_handlers == 1) {
                QLIST_REMOVE(ioh, next);
                g_free(ioh);
            }
        } else {
            if (ioh->revents & G_IO_OUT) {
                ioh->fd_write(ioh->opaque);
            }
        }
    }
    walking_handlers--;

One may think of using the RCU primitives, rcu_read_lock() and
rcu_read_unlock(); effectively, the RCU nesting count would take
the place of the walking_handlers global variable.  Indeed,
reference counting and RCU have similar purposes, but their usage in
general is complementary:

- reference counting is fine-grained and limited to a single data
  structure; RCU delays reclamation of *all* RCU-protected data
  structures;

- reference counting works even in the presence of code that keeps
  a reference for a long time; RCU critical sections in principle
  should be kept short;

- reference counting is often applied to code that is not thread-safe
  but is reentrant; in fact, usage of reference counting in QEMU predates
  the introduction of threads by many years.  RCU is generally used to
  protect readers from other threads freeing memory after concurrent
  modifications to a data structure.

- reclaiming data can be done by a separate thread in the case of RCU;
  this can improve performance, but also delay reclamation undesirably.
  With reference counting, reclamation is deterministic.

This file documents QemuLockCnt, an abstraction for using reference
counting in code that has to be both thread-safe and reentrant.


QemuLockCnt concepts
--------------------

A QemuLockCnt comprises both a counter and a mutex; it has primitives
to increment and decrement the counter, and to take and release the
mutex.  The counter notes how many visits to the data structures are
taking place (the visits could be from different threads, or there could
be multiple reentrant visits from the same thread).  The basic rules
governing the counter/mutex pair then are the following:

- Data protected by the QemuLockCnt must not be freed unless the
  counter is zero and the mutex is taken.

- A new visit cannot be started while the counter is zero and the
  mutex is taken.
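
Concretely, the pair can be pictured as nothing more than a mutex sitting
next to an integer; a minimal sketch of the idea (the field names here are
illustrative only, the real type is defined by QEMU's headers):

    typedef struct QemuLockCntSketch {
        QemuMutex mutex;      /* taken around frees and, usually, writes */
        unsigned count;       /* number of visits currently in progress */
    } QemuLockCntSketch;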

Most of the time, the mutex protects all writes to the data structure,
not just frees, though there could be cases where this is not necessary.

Reads, instead, can be done without taking the mutex, as long as the
readers and writers use the same macros that are used for RCU, for
example atomic_rcu_read, atomic_rcu_set, QLIST_FOREACH_RCU, etc.  This is
because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
can happen concurrently with the read.  The RCU API ensures that the
processor and the compiler see all required memory barriers.

This could be implemented simply by protecting the counter with the
mutex, for example:

    // (1)
    qemu_mutex_lock(&walking_handlers_mutex);
    walking_handlers++;
    qemu_mutex_unlock(&walking_handlers_mutex);

    ...

    // (2)
    qemu_mutex_lock(&walking_handlers_mutex);
    if (--walking_handlers == 0) {
        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
            if (ioh->deleted) {
                QLIST_REMOVE(ioh, next);
                g_free(ioh);
            }
        }
    }
    qemu_mutex_unlock(&walking_handlers_mutex);

Here, no frees can happen in the code represented by the ellipsis.
If another thread is executing critical section (2), that part of
the code cannot be entered, because the thread will not be able
to increment the walking_handlers variable.  And of course
during the visit any other thread will see a nonzero value for
walking_handlers, as in the single-threaded code.

Note that it is possible for multiple concurrent accesses to delay
the cleanup arbitrarily; in other words, for the walking_handlers
counter to never become zero.  For this reason, this technique is
more easily applicable if concurrent access to the structure is rare.

However, critical sections are easy to forget since you have to do
them for each modification of the counter.  QemuLockCnt ensures that
all modifications of the counter take the lock appropriately, and it
can also be more efficient in two ways:

- it avoids taking the lock for many operations (for example
  incrementing the counter while it is non-zero);

- on some platforms, one can implement QemuLockCnt to hold the counter
  and the mutex in a single word, making the fast path no more expensive
  than simply managing a counter using atomic operations (see
  docs/atomics.txt).  This can be very helpful if concurrent access to
  the data structure is expected to be rare.
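
As a concrete illustration of the second point, the counter and the lock
state could share one word, for example like this (a sketch only; the
names and the exact layout are made up for this example and need not
match QEMU's implementation):

    /* Low two bits: lock state; remaining bits: number of visits.  */
    #define LOCKCNT_STATE_MASK    3        /* 0: free, 1: locked, 2: locked, with waiters */
    #define LOCKCNT_COUNT_SHIFT   2
    #define LOCKCNT_COUNT_STEP    (1 << LOCKCNT_COUNT_SHIFT)

    unsigned lockcnt_word;                 /* the packed counter/lock word */

    /* Reading the counter is a shift, and incrementing it while it is
     * already nonzero is a single atomic add that leaves the lock bits
     * untouched; only the 0 -> 1 transition has to look at them.
     */
    unsigned count = atomic_read(&lockcnt_word) >> LOCKCNT_COUNT_SHIFT;
    atomic_add(&lockcnt_word, LOCKCNT_COUNT_STEP);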


Using the same mutex for frees and writes can still incur some small
inefficiencies; for example, a visit can never start if the counter is
zero and the mutex is taken---even if the mutex is taken by a write,
which in principle need not block a visit of the data structure.
However, these are usually not a problem if any of the following
assumptions are valid:

- concurrent access is possible but rare

- writes are rare

- writes are frequent, but this kind of write (e.g. appending to a
  list) has a very small critical section.

For example, QEMU uses QemuLockCnt to manage an AioContext's list of
bottom halves and file descriptor handlers.  Modifications to the list
of file descriptor handlers are rare.  Creation of a new bottom half is
frequent and can happen on a fast path; however: 1) it is almost never
concurrent with a visit to the list of bottom halves; 2) it only has
three instructions in the critical path, two assignments and a smp_wmb().
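
For illustration, that critical section looks roughly like this (simplified;
the field names are only indicative of the actual bottom half code):

    qemu_lockcnt_lock(&ctx->list_lock);
    bh->next = ctx->first_bh;
    /* Make sure bh's fields are visible before bh is reachable from the list.  */
    smp_wmb();
    ctx->first_bh = bh;
    qemu_lockcnt_unlock(&ctx->list_lock);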


QemuLockCnt API
---------------

The QemuLockCnt API is described in include/qemu/thread.h.
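
As a convenience, here is a rough summary of the functions used in the
examples below; the header is the authoritative reference for the exact
prototypes:

    void qemu_lockcnt_init(QemuLockCnt *lockcnt);
    void qemu_lockcnt_destroy(QemuLockCnt *lockcnt);

    void qemu_lockcnt_inc(QemuLockCnt *lockcnt);       /* start a visit */
    void qemu_lockcnt_dec(QemuLockCnt *lockcnt);       /* end a visit */

    void qemu_lockcnt_lock(QemuLockCnt *lockcnt);      /* take the mutex */
    void qemu_lockcnt_unlock(QemuLockCnt *lockcnt);    /* release the mutex */

    /* Decrement the count; if it drops to zero, also take the mutex
     * and return true.  */
    bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt);

    /* If the count is 1, decrement it to zero, take the mutex and
     * return true; otherwise return false and leave the count alone.  */
    bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt);

    /* Increment the count and release the mutex; pairs with a
     * successful qemu_lockcnt_dec_if_lock.  */
    void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);

    unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt); /* current count */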


QemuLockCnt usage
-----------------

This section explains the typical usage patterns for QemuLockCnt functions.
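
The examples below assume a structure and a QemuLockCnt that have been
set up once, along these lines (xyz and xyz_lockcnt are placeholder
names used throughout this section):

    static XYZ *xyz;
    static QemuLockCnt xyz_lockcnt;

    ...
    qemu_lockcnt_init(&xyz_lockcnt);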

Setting a variable to a non-NULL value can be done between
qemu_lockcnt_lock and qemu_lockcnt_unlock:

    qemu_lockcnt_lock(&xyz_lockcnt);
    if (!xyz) {
        new_xyz = g_new(XYZ, 1);
        ...
        atomic_rcu_set(&xyz, new_xyz);
    }
    qemu_lockcnt_unlock(&xyz_lockcnt);

Accessing the value can be done between qemu_lockcnt_inc and
qemu_lockcnt_dec:

    qemu_lockcnt_inc(&xyz_lockcnt);
    if (xyz) {
        XYZ *p = atomic_rcu_read(&xyz);
        ...
        /* Accesses can now be done through "p".  */
    }
    qemu_lockcnt_dec(&xyz_lockcnt);

Freeing the object can similarly use qemu_lockcnt_lock and
qemu_lockcnt_unlock, but you also need to ensure that the count
is zero (i.e. there is no concurrent visit).  Because qemu_lockcnt_inc
takes the QemuLockCnt's lock, the count cannot become non-zero while
the object is being freed.  Freeing an object looks like this:

    qemu_lockcnt_lock(&xyz_lockcnt);
    if (!qemu_lockcnt_count(&xyz_lockcnt)) {
        g_free(xyz);
        xyz = NULL;
    }
    qemu_lockcnt_unlock(&xyz_lockcnt);

If an object has to be freed right after a visit, you can combine
the decrement, the locking and the check on count as follows:

    qemu_lockcnt_inc(&xyz_lockcnt);
    if (xyz) {
        XYZ *p = atomic_rcu_read(&xyz);
        ...
        /* Accesses can now be done through "p".  */
    }
    if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) {
        g_free(xyz);
        xyz = NULL;
        qemu_lockcnt_unlock(&xyz_lockcnt);
    }

QemuLockCnt can also be used to access a list as follows:

    qemu_lockcnt_inc(&io_handlers_lockcnt);
    QLIST_FOREACH_RCU(ioh, &io_handlers, next) {
        if (ioh->revents & G_IO_OUT) {
            ioh->fd_write(ioh->opaque);
        }
    }

    if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) {
        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
            if (ioh->deleted) {
                QLIST_REMOVE(ioh, next);
                g_free(ioh);
            }
        }
        qemu_lockcnt_unlock(&io_handlers_lockcnt);
    }

Again, the RCU primitives are used because new items can be added to the
list during the walk.  QLIST_FOREACH_RCU ensures that the processor and
the compiler see the appropriate memory barriers.

An alternative pattern uses qemu_lockcnt_dec_if_lock:

    qemu_lockcnt_inc(&io_handlers_lockcnt);
    QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) {
        if (ioh->deleted) {
            if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) {
                QLIST_REMOVE(ioh, next);
                g_free(ioh);
                qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt);
            }
        } else {
            if (ioh->revents & G_IO_OUT) {
                ioh->fd_write(ioh->opaque);
            }
        }
    }
    qemu_lockcnt_dec(&io_handlers_lockcnt);

Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock,
because there is no special task to do if the count goes from 1 to 0.