Copyright (c) 2014-2017 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later.  See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop.  The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.

The default event loop is called the main loop (see main-loop.c).  It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
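
For example, the following invocation creates an IOThread and binds a
virtio-blk-pci device to it (a sketch: the disk image and device choice are
placeholders, but the "iothread" property is a real virtio-blk-pci option):

    qemu-system-x86_64 \
        -object iothread,id=my-iothread \
        -drive if=none,id=drive0,file=disk.img,format=raw \
        -device virtio-blk-pci,iothread=my-iothread,drive=drive0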

Side note: The main loop and IOThreads are both event loops, but their code is
not shared completely.  Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work.  The main loop is a
scalability bottleneck on hosts with many CPUs.  Work can be spread across
several IOThreads instead of just one main loop.  When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself.  vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code.  This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h).  Code that only works in the main loop
implicitly uses the main loop's AioContext.  Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_bh_new_guarded() - create a BH with a device re-entrancy guard
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread.  They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_bh_new_guarded() - create a BH with a device re-entrancy guard
 * aio_poll() - run an event loop iteration
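
The following sketch shows the AioContext flavour of these calls; "ctx" is an
AioContext obtained as described further below, and the callbacks, opaque
pointer, and schedule_work() helper are made up for illustration (freeing the
BH and timer is omitted for brevity):

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "qemu/timer.h"

    static void my_bh_cb(void *opaque)
    {
        /* Runs in the thread of the AioContext the BH was created in */
    }

    static void my_timer_cb(void *opaque)
    {
        /* Also runs in the AioContext's thread */
    }

    static void schedule_work(AioContext *ctx, void *opaque)
    {
        /* Deferred callback, explicitly bound to ctx */
        QEMUBH *bh = aio_bh_new(ctx, my_bh_cb, opaque);
        qemu_bh_schedule(bh);

        /* Timer that fires in ctx's event loop after 100 ms */
        QEMUTimer *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_MS,
                                         my_timer_cb, opaque);
        timer_mod(timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 100);
    }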

The qemu_bh_new_guarded/aio_bh_new_guarded APIs accept a "MemReentrancyGuard"
argument, which is used to check for and prevent re-entrancy problems.  For
BHs associated with devices, the re-entrancy guard is contained in the
corresponding DeviceState and named "mem_reentrancy_guard".
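
For instance, a device might create a guarded BH in its realize function like
this (a sketch: MyDevState, my_dev_bh_cb, and my_dev_realize are hypothetical
names invented for illustration):

    #include "qemu/osdep.h"
    #include "qemu/main-loop.h"
    #include "hw/qdev-core.h"

    typedef struct MyDevState {     /* hypothetical device state */
        DeviceState parent_obj;
        QEMUBH *bh;
    } MyDevState;

    static void my_dev_bh_cb(void *opaque)
    {
        /* The guard is held while this runs, so a guest MMIO access that
         * re-enters the device from this callback will be caught. */
    }

    static void my_dev_realize(DeviceState *dev, Error **errp)
    {
        MyDevState *s = (MyDevState *)dev;  /* stand-in for a QOM cast */

        /* The guard lives in the DeviceState itself */
        s->bh = qemu_bh_new_guarded(my_dev_bh_cb, s,
                                    &dev->mem_reentrancy_guard);
    }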

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
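
For example (a sketch; "iothread" is assumed to be an IOThread pointer the
caller already holds, and schedule_work() is the helper from the earlier
sketch):

    #include "sysemu/iothread.h"

    AioContext *ctx = iothread ? iothread_get_aio_context(iothread)
                               : qemu_get_aio_context();
    schedule_work(ctx, opaque);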

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can always be called safely.  They handle their
own locking internally.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.
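
A minimal sketch of rule 2 (do_something_with_ctx() is a placeholder):

    AioContext *ctx = iothread_get_aio_context(iothread);

    aio_context_acquire(ctx);
    do_something_with_ctx(ctx);  /* no other thread can run ctx meanwhile */
    aio_context_release(ctx);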

Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
Do not use nesting anymore; it is incompatible with the BDRV_POLL_WHILE() macro
used in the block layer and can lead to hangs.

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously.  Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.

Side note: the best way to schedule a function call across threads is to call
aio_bh_schedule_oneshot().  No acquire/release or locking is needed.
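
For example (a sketch; target_ctx, run_in_iothread(), and opaque are
placeholders):

    static void run_in_iothread(void *opaque)
    {
        /* Executes in the target AioContext's thread */
    }

    /* Safe to call from any thread; no locking required */
    aio_bh_schedule_oneshot(target_ctx, run_in_iothread, opaque);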

AioContext and the block layer
------------------------------
The AioContext originates from the QEMU block layer, even though nowadays
AioContext is a generic event loop that can be used by any QEMU subsystem.

The block layer has support for AioContext integrated.  Each BlockDriverState
is associated with an AioContext using bdrv_try_change_aio_context() and
bdrv_get_aio_context().  This allows block layer code to process I/O inside the
right AioContext.  Other subsystems may wish to follow a similar approach.

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop.  See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState
it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
that callbacks in the IOThread do not run in parallel.

Code running in the monitor typically needs to ensure that past
requests from the guest are completed.  When a block device is running
in an IOThread, the IOThread can also process requests from the guest
(via ioeventfd).  To achieve both objectives, wrap the code between
bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
section".  The functions must be called between aio_context_acquire()
and aio_context_release().  You can freely release and re-acquire the
AioContext within a drained section.
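
Putting the last two paragraphs together, monitor code accessing a
BlockDriverState might look like this (a sketch; the work done inside the
drained section is a placeholder):

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    bdrv_drained_begin(bs);    /* quiesce in-flight guest requests */

    /* ... safely inspect or modify bs here ... */

    bdrv_drained_end(bs);
    aio_context_release(ctx);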

Long-running jobs (usually in the form of coroutines) are best scheduled in
the BlockDriverState's AioContext to avoid the need to acquire/release around
each bdrv_*() call.  The functions bdrv_add/remove_aio_context_notifier,
or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
can be used to get a notification whenever bdrv_try_change_aio_context() moves a
BlockDriverState to a different AioContext.
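
A sketch of the BlockBackend variant (the notifier callbacks and the opaque
pointer are placeholders):

    static void attached_aio_context(AioContext *new_context, void *opaque)
    {
        /* Re-register timers, BHs, fd handlers, etc. in new_context */
    }

    static void detach_aio_context(void *opaque)
    {
        /* Tear down per-AioContext resources before the switch */
    }

    blk_add_aio_context_notifier(blk, attached_aio_context,
                                 detach_aio_context, opaque);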