Now that objects actually obey the rules, document them. Reviewed-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memory API
==============

The memory API models the memory and I/O buses and controllers of a QEMU
machine.  It attempts to allow modelling of:

 - ordinary RAM
 - memory-mapped I/O (MMIO)
 - memory controllers that can dynamically reroute physical memory regions
   to different destinations

The memory model provides support for

 - tracking RAM changes by the guest
 - setting up coalesced memory for kvm
 - setting up ioeventfd regions for kvm

Memory is modelled as an acyclic graph of MemoryRegion objects.  Sinks
(leaves) are RAM and MMIO regions, while other nodes represent
buses, memory controllers, and memory regions that have been rerouted.

In addition to MemoryRegion objects, the memory API provides AddressSpace
objects for every root and possibly for intermediate MemoryRegions too.
These represent memory as seen from the CPU or a device's viewpoint.

Types of regions
----------------

There are four types of memory regions (all represented by a single C type
MemoryRegion):

- RAM: a RAM region is simply a range of host memory that can be made available
  to the guest.

- MMIO: a range of guest memory that is implemented by host callbacks;
  each read or write causes a callback to be called on the host.

- container: a container simply includes other memory regions, each at
  a different offset.  Containers are useful for grouping several regions
  into one unit.  For example, a PCI BAR may be composed of a RAM region
  and an MMIO region.

  A container's subregions are usually non-overlapping.  In some cases it is
  useful to have overlapping regions; for example a memory controller that
  can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
  that does not prevent cards from claiming overlapping BARs.

- alias: a subsection of another region.  Aliases allow a region to be
  split apart into discontiguous regions.  Examples of uses are memory banks
  used when the guest address space is smaller than the amount of RAM
  addressed, or a memory controller that splits main memory to expose a "PCI
  hole".  Aliases may point to any type of region, including other aliases,
  but an alias may not point back to itself, directly or indirectly.

It is valid to add subregions to a region which is not a pure container
(that is, to an MMIO, RAM or ROM region). This means that the region
will act like a container, except that any addresses within the container's
region which are not claimed by any subregion are handled by the
container itself (i.e. by its MMIO callbacks or RAM backing). However,
it is generally possible to achieve the same effect with a pure container,
one of whose subregions is a low priority "background" region covering
the whole address range; this is often clearer and is preferred.
Subregions cannot be added to an alias region.

Region names
------------

Regions are assigned names by the constructor.  For most regions these are
only used for debugging purposes, but RAM regions also use the name to identify
live migration sections.  This means that RAM region names need to have ABI
stability.

Region lifecycle
----------------

A region is created by one of the memory_region_init*() functions and
attached to an object, which acts as its owner or parent.  QEMU ensures
that the owner object remains alive as long as the region is visible to
the guest, or as long as the region is in use by a virtual CPU or another
device.  For example, the owner object will not die between an
address_space_map operation and the corresponding address_space_unmap.

After creation, a region can be added to an address space or a
container with memory_region_add_subregion(), and removed using
memory_region_del_subregion().

Various region attributes (read-only, dirty logging, coalesced mmio,
ioeventfd) can be changed during the region lifecycle.  They take effect
as soon as the region is made visible.  This can be immediately, later,
or never.

Destruction of a memory region happens automatically when the owner
object dies.

If however the memory region is part of a dynamically allocated data
structure, you should call object_unparent() to destroy the memory region
before the data structure is freed.  For an example see VFIOMSIXInfo
and VFIOQuirk in hw/vfio/pci.c.

You must not destroy a memory region as long as it may be in use by a
device or CPU.  In order to do this, as a general rule do not create or
destroy memory regions dynamically during a device's lifetime, and only
call object_unparent() in the memory region owner's instance_finalize
callback.  The dynamically allocated data structure that contains the
memory region then should obviously be freed in the instance_finalize
callback as well.

If you break this rule, the following situation can happen:

- the memory region's owner had a reference taken via memory_region_ref
  (for example by address_space_map)

- the region is unparented, and has no owner anymore

- when address_space_unmap is called, the reference to the memory region's
  owner is leaked.


There is an exception to the above rule: it is okay to call
object_unparent at any time for an alias or a container region.  It is
therefore also okay to create or destroy alias and container regions
dynamically during a device's lifetime.

This exceptional usage is valid because aliases and containers only help
QEMU build the guest's memory map; they are never accessed directly.
memory_region_ref and memory_region_unref are never called on aliases
or containers, and the above situation then cannot happen.  Exploiting
this exception is rarely necessary, and therefore it is discouraged,
but nevertheless it is used in a few places.

For regions that "have no owner" (NULL is passed at creation time), the
machine object is actually used as the owner.  Since instance_finalize is
never called for the machine object, you must never call object_unparent
on regions that have no owner, unless they are aliases or containers.


Overlapping regions and priority
--------------------------------
Usually, regions may not overlap each other; a memory address decodes into
exactly one target.  In some cases it is useful to allow regions to overlap,
and sometimes to control which of the overlapping regions is visible to the
guest.  This is done with memory_region_add_subregion_overlap(), which
allows the region to overlap any other region in the same container, and
specifies a priority that allows the core to decide which of two regions at
the same address is visible (highest wins).
Priority values are signed, and the default value is zero. This means that
you can use memory_region_add_subregion_overlap() both to specify a region
that must sit 'above' any others (with a positive priority) and also a
background region that sits 'below' others (with a negative priority).

If the higher priority region in an overlap is a container or alias, then
the lower priority region will appear in any "holes" that the higher priority
region has left by not mapping subregions to that area of its address range.
(This applies recursively -- if the subregions are themselves containers or
aliases that leave holes then the lower priority region will appear in these
holes too.)

For example, suppose we have a container A of size 0x8000 with two subregions
B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
offset 0x2000. As a diagram:

        0      1000   2000   3000   4000   5000   6000   7000    8000
        |------|------|------|------|------|------|------|-------|
  A:    [                                                       ]
  C:    [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
  B:                  [                          ]
  D:                  [DDDDD]
  E:                                [EEEEE]

The regions that will be seen within this address range then are:
        [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]

Since B has higher priority than C, its subregions appear in the flat map
even where they overlap with C. In ranges where B has not mapped anything
C's region appears.

If B had provided its own MMIO operations (i.e. it was not a pure container)
then these would be used for any addresses in its range not handled by
D or E, and the result would be:
        [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]

Priority values are local to a container, because the priorities of two
regions are only compared when they are both children of the same container.
This means that the device in charge of the container (typically modelling
a bus or a memory controller) can use them to manage the interaction of
its child regions without any side effects on other parts of the system.
In the example above, the priorities of D and E are unimportant because
they do not overlap each other. It is the relative priority of B and C
that causes D and E to appear on top of C: D and E's priorities are never
compared against the priority of C.

Visibility
----------
The memory core uses the following rules to select a memory region when the
guest accesses an address:

- all direct subregions of the root region are matched against the address, in
  descending priority order
  - if the address lies outside the region offset/size, the subregion is
    discarded
  - if the subregion is a leaf (RAM or MMIO), the search terminates, returning
    this leaf region
  - if the subregion is a container, the same algorithm is used within the
    subregion (after the address is adjusted by the subregion offset)
  - if the subregion is an alias, the search is continued at the alias target
    (after the address is adjusted by the subregion offset and alias offset)
  - if a recursive search within a container or alias subregion does not
    find a match (because of a "hole" in the container's coverage of its
    address range), then if this is a container with its own MMIO or RAM
    backing the search terminates, returning the container itself. Otherwise
    we continue with the next subregion in priority order
- if none of the subregions match the address then the search terminates
  with no match found

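The rules above can be sketched as a small standalone C model.  This is
only an illustration, not QEMU code: struct region, resolve() and the
has_backing flag are hypothetical stand-ins for MemoryRegion and the real
lookup, child arrays are assumed to be pre-sorted by descending priority,
and alias targets are not bounds-checked.  The regions A to E reproduce the
example from the previous section, with B at priority 2 and C at priority 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for QEMU's MemoryRegion. */
enum kind { LEAF, CONTAINER, ALIAS };

struct region {
    const char *name;
    enum kind kind;
    uint64_t offset;                  /* offset within the parent container */
    uint64_t size;
    int priority;                     /* informational; subs[] is pre-sorted */
    int has_backing;                  /* container with its own MMIO/RAM */
    const struct region *target;      /* alias only */
    uint64_t alias_offset;            /* alias only */
    const struct region *const *subs; /* children, descending priority */
    size_t nsubs;
};

const struct region *resolve(const struct region *r, uint64_t addr);

/* Try each child in priority order; addr is relative to container c. */
static const struct region *search(const struct region *c, uint64_t addr)
{
    for (size_t i = 0; i < c->nsubs; i++) {
        const struct region *s = c->subs[i];
        if (addr < s->offset || addr - s->offset >= s->size) {
            continue;                 /* outside offset/size: discard */
        }
        const struct region *hit = resolve(s, addr - s->offset);
        if (hit) {
            return hit;
        }
        /* hole in a pure container or alias: try the next child */
    }
    return NULL;
}

/* addr is relative to the start of region r. */
const struct region *resolve(const struct region *r, uint64_t addr)
{
    switch (r->kind) {
    case LEAF:
        return r;
    case ALIAS:
        return resolve(r->target, addr + r->alias_offset);
    case CONTAINER: {
        const struct region *hit = search(r, addr);
        if (hit) {
            return hit;
        }
        return r->has_backing ? r : NULL;
    }
    }
    return NULL;
}

/* The A/B/C/D/E example from "Overlapping regions and priority". */
static const struct region D = { .name = "D", .kind = LEAF,
                                 .offset = 0x0000, .size = 0x1000 };
static const struct region E = { .name = "E", .kind = LEAF,
                                 .offset = 0x2000, .size = 0x1000 };
static const struct region *const B_subs[] = { &D, &E };
static const struct region B = { .name = "B", .kind = CONTAINER,
                                 .offset = 0x2000, .size = 0x4000,
                                 .priority = 2,
                                 .subs = B_subs, .nsubs = 2 };
static const struct region C = { .name = "C", .kind = LEAF,
                                 .offset = 0x0000, .size = 0x6000,
                                 .priority = 1 };
static const struct region *const A_subs[] = { &B, &C };
static const struct region A = { .name = "A", .kind = CONTAINER,
                                 .offset = 0x0000, .size = 0x8000,
                                 .subs = A_subs, .nsubs = 2 };
```

In this model resolve(&A, 0x3000) falls through the hole in B and returns C,
while resolve(&A, 0x7000) returns NULL because nothing is mapped there.
Setting has_backing on B would make the holes in B resolve to B itself,
which corresponds to the second flat map shown in the previous section.
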
Example memory map
------------------

system_memory: container@0-2^48-1
 |
 +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
 |
 +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
 |
 +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
 |      (prio 1)
 |
 +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)

pci (0-2^32-1)
 |
 +--- vga-area: container@0xa0000-0xbffff
 |      |
 |      +--- alias@0x00000-0x7fff  ---> #vram (0x010000-0x017fff)
 |      |
 |      +--- alias@0x08000-0xffff  ---> #vram (0x020000-0x027fff)
 |
 +---- vram: ram@0xe1000000-0xe1ffffff
 |
 +---- vga-mmio: mmio@0xe2000000-0xe200ffff

ram: ram@0x00000000-0xffffffff

This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
system address space via two aliases: "lomem" is a 1:1 mapping of the first
3.5GB; "himem" maps the last 0.5GB at address 4GB.  This leaves 0.5GB for the
so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with
4GB of memory.

The memory controller diverts addresses in the range 640K-768K to the PCI
address space.  This is modelled using the "vga-window" alias, mapped at a
higher priority so it obscures the RAM at the same addresses.  The vga window
can be removed by programming the memory controller; this is modelled by
removing the alias and exposing the RAM underneath.

The pci address space is not a direct child of the system address space, since
we only want parts of it to be visible (we accomplish this using aliases).
It has two subregions: vga-area models the legacy vga window and is occupied
by two 32K memory banks pointing at two sections of the framebuffer.
In addition the vram is mapped as a BAR at address e1000000, and an additional
BAR containing MMIO registers is mapped after it.

Note that if the guest maps a BAR outside the PCI hole, it would not be
visible as the pci-hole alias clips it to a 0.5GB range.

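The address arithmetic behind the four system_memory aliases can be checked
with a short standalone C sketch.  The struct alias_map type and the
lookup_alias() and translate() helpers are hypothetical names introduced
for illustration; they model only the top level of the map above and are
not part of the QEMU API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one alias in system_memory. */
struct alias_map {
    const char *name;
    uint64_t base;           /* where the alias sits in system_memory */
    uint64_t size;
    const char *target;      /* name of the aliased region */
    uint64_t target_offset;  /* offset of the alias within the target */
    int priority;
};

static const struct alias_map system_memory[] = {
    { "lomem",      0x000000000ULL, 0xe0000000ULL, "ram", 0x00000000ULL, 0 },
    { "himem",      0x100000000ULL, 0x20000000ULL, "ram", 0xe0000000ULL, 0 },
    { "vga-window", 0x0000a0000ULL, 0x00020000ULL, "pci", 0x000a0000ULL, 1 },
    { "pci-hole",   0x0e0000000ULL, 0x20000000ULL, "pci", 0xe0000000ULL, 0 },
};

/* Return the highest-priority alias covering addr, or NULL if none does. */
const struct alias_map *lookup_alias(uint64_t addr)
{
    const struct alias_map *best = NULL;
    size_t n = sizeof(system_memory) / sizeof(system_memory[0]);

    for (size_t i = 0; i < n; i++) {
        const struct alias_map *a = &system_memory[i];
        if (addr < a->base || addr - a->base >= a->size) {
            continue;
        }
        if (!best || a->priority > best->priority) {
            best = a;
        }
    }
    return best;
}

/* Adjust a system address by the alias base and the alias offset. */
uint64_t translate(const struct alias_map *a, uint64_t addr)
{
    return addr - a->base + a->target_offset;
}
```

With these tables, system address 0x100000000 (4GB) lands at offset
0xe0000000 of "ram" through himem, and 0xa0000 is claimed by vga-window
rather than lomem because of its higher priority.
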
MMIO Operations
---------------

MMIO regions are provided with ->read() and ->write() callbacks; in addition
various constraints can be supplied to control how these callbacks are called:

 - .valid.min_access_size, .valid.max_access_size define the access sizes
   (in bytes) which the device accepts; accesses outside this range will
   have device and bus specific behaviour (ignored, or machine check)
 - .valid.aligned specifies that the device only accepts naturally aligned
   accesses.  Unaligned accesses invoke device and bus specific behaviour.
 - .impl.min_access_size, .impl.max_access_size define the access sizes
   (in bytes) supported by the *implementation*; other access sizes will be
   emulated using the ones available.  For example a 4-byte write will be
   emulated using four 1-byte writes, if .impl.max_access_size = 1.
 - .impl.unaligned specifies that the *implementation* supports unaligned
   accesses; if false, unaligned accesses will be emulated by two aligned
   accesses.
 - .old_mmio can be used to ease porting from code using
   cpu_register_io_memory(). It should not be used in new code.
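
The .impl.max_access_size behaviour can be illustrated with a standalone C
sketch of how a wide write might be split into narrower callback
invocations.  This is not the memory core's implementation: write_fn,
write_with_impl_size(), struct byte_dev and byte_dev_write() are
hypothetical names, little-endian byte order is assumed, and
.impl.min_access_size and alignment handling are left out.

```c
#include <stdint.h>

/* Hypothetical write callback, loosely shaped like an MMIO ->write(). */
typedef void (*write_fn)(void *opaque, uint64_t addr,
                         uint64_t data, unsigned size);

/*
 * Emulate a write of 'size' bytes using pieces of at most 'impl_max'
 * bytes.  Assumes little-endian ordering: the least significant piece
 * goes to the lowest address.
 */
void write_with_impl_size(write_fn write, void *opaque, uint64_t addr,
                          uint64_t value, unsigned size, unsigned impl_max)
{
    unsigned step = size < impl_max ? size : impl_max;
    if (step == 0) {
        return;                       /* nothing to do, avoid looping */
    }
    uint64_t mask = step >= 8 ? UINT64_MAX : ((1ULL << (step * 8)) - 1);

    for (unsigned i = 0; i < size; i += step) {
        write(opaque, addr + i, (value >> (i * 8)) & mask, step);
    }
}

/* Hypothetical device whose implementation only handles 1-byte writes. */
struct byte_dev {
    uint8_t regs[8];
    unsigned calls;                   /* number of callback invocations */
};

void byte_dev_write(void *opaque, uint64_t addr, uint64_t data,
                    unsigned size)
{
    struct byte_dev *d = opaque;

    (void)size;                       /* always 1 for this device */
    if (addr < sizeof(d->regs)) {
        d->regs[addr] = (uint8_t)data;
    }
    d->calls++;
}
```

Calling write_with_impl_size(byte_dev_write, &dev, 0, 0x11223344, 4, 1)
invokes the callback four times, once per byte, which matches the 4-byte
example in the list above.
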