 76ca4b58c2
			
		
	
	
		76ca4b58c2
		
	
	
	
	
		
			
			I found that there are many spelling errors in the comments of qemu, so I used the spellcheck tool to check the spelling errors and finally found some spelling errors in the docs folder. Signed-off-by: zhaolichang <zhaolichang@huawei.com> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Message-Id: <20200917075029.313-4-zhaolichang@huawei.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
		
			
				
	
	
		
			410 lines
		
	
	
		
			18 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			410 lines
		
	
	
		
			18 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| = sPAPR Dynamic Reconfiguration =
 | |
| 
 | |
| sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
 | |
| to handle hotplugging of dynamic "physical" resources like PCI cards, or
 | |
| "logical"/paravirtual resources like memory, CPUs, and "physical"
 | |
| host-bridges, which are generally managed by the host/hypervisor and provided
 | |
| to guests as virtualized resources. The specifics of dynamic-reconfiguration
 | |
| are documented extensively in PAPR+ v2.7, Section 13.1. This document
 | |
| provides a summary of that information as it applies to the implementation
 | |
| within QEMU.
 | |
| 
 | |
| == Dynamic-reconfiguration Connectors ==
 | |
| 
 | |
| To manage hotplug/unplug of these resources, a firmware abstraction known as
 | |
| a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
 | |
| resource to the guest, and provide an interface for the guest to manage
 | |
| configuration/removal of the resource associated with it.
 | |
| 
 | |
| == Device-tree description of DRCs ==
 | |
| 
 | |
| A set of 4 Open Firmware device tree array properties are used to describe
 | |
| the name/index/power-domain/type of each DRC allocated to a guest at
 | |
| boot-time. There may be multiple sets of these arrays, rooted at different
 | |
| paths in the device tree depending on the type of resource the DRCs manage.
 | |
| 
 | |
| In some cases, the DRCs themselves may be provided by a dynamic resource,
 | |
| such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
 | |
| arrays would be fetched as part of the device tree retrieval interfaces
 | |
| for hotplugged resources described under "Guest->Host interface".
 | |
| 
 | |
| The array properties are described below. Each entry/element in an array
 | |
| describes the DRC identified by the element in the corresponding position
 | |
| of ibm,drc-indexes:
 | |
| 
 | |
| ibm,drc-names:
 | |
|   first 4-bytes: BE-encoded integer denoting the number of entries
 | |
|   each entry: a NULL-terminated <name> string encoded as a byte array
 | |
| 
 | |
|   <name> values for logical/virtual resources are defined in PAPR+ v2.7,
 | |
|   Section 13.5.2.4, and basically consist of the type of the resource
 | |
|   followed by a space and a numerical value that's unique across resources
 | |
|   of that type.
 | |
| 
 | |
|   <name> values for "physical" resources such as PCI or VIO devices are
 | |
|   defined as being "location codes", which are the "location labels" of
 | |
|   each encapsulating device, starting from the chassis down to the
 | |
|   individual slot for the device, concatenated by a hyphen. This provides
 | |
|   a mapping of resources to a physical location in a chassis for debugging
 | |
|   purposes. For QEMU, this mapping is less important, so we assign a
 | |
|   location code that conforms to naming specifications, but is simply a
 | |
|   location label for the slot by itself to simplify the implementation.
 | |
|   The naming convention for location labels is documented in detail in
 | |
|   PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
 | |
|   for PCI/VIO device slots, where <n> is unique across all PCI/VIO
 | |
|   device slots.
 | |
| 
 | |
| ibm,drc-indexes:
 | |
|   first 4-bytes: BE-encoded integer denoting the number of entries
 | |
|   each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
 | |
|     in the machine
 | |
| 
 | |
|   <index> is arbitrary, but in the case of QEMU we try to maintain the
 | |
|   convention used to assign them to pSeries guests on pHyp:
 | |
| 
 | |
|     bit[31:28]: integer encoding of <type>, where <type> is:
 | |
|                   1 for CPU resource
 | |
|                   2 for PHB resource
 | |
|                   3 for VIO resource
 | |
|                   4 for PCI resource
 | |
|                   8 for Memory resource
 | |
|     bit[27:0]: integer encoding of <id>, where <id> is unique across
 | |
|                  all resources of specified type
 | |
| 
 | |
| ibm,drc-power-domains:
 | |
|   first 4-bytes: BE-encoded integer denoting the number of entries
 | |
|   each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
 | |
|     power domain the resource will be assigned to. In the case of QEMU
 | |
|     we associated all resources with a "live insertion" domain, where the
 | |
|     power is assumed to be managed automatically. The integer value for
 | |
|     this domain is a special value of -1.
 | |
| 
 | |
| 
 | |
| ibm,drc-types:
 | |
|   first 4-bytes: BE-encoded integer denoting the number of entries
 | |
|   each entry: a NULL-terminated <type> string encoded as a byte array
 | |
| 
 | |
|   <type> is assigned as follows:
 | |
|     "CPU" for a CPU
 | |
|     "PHB" for a physical host-bridge
 | |
|     "SLOT" for a VIO slot
 | |
|     "28" for a PCI slot
 | |
|     "MEM" for memory resource
 | |
| 
 | |
| == Guest->Host interface to manage dynamic resources ==
 | |
| 
 | |
| Each DRC is given a globally unique DRC Index, and resources associated with
 | |
| a particular DRC are configured/managed by the guest via a number of RTAS
 | |
| calls which reference individual DRCs based on the DRC index. This can be
 | |
| considered the guest->host interface.
 | |
| 
 | |
| rtas-set-power-level:
 | |
|   arg[0]: integer identifying power domain
 | |
|   arg[1]: new power level for the domain, 0-100
 | |
|   output[0]: status, 0 on success
 | |
|   output[1]: power level after command
 | |
| 
 | |
|   Set the power level for a specified power domain
 | |
| 
 | |
| rtas-get-power-level:
 | |
|   arg[0]: integer identifying power domain
 | |
|   output[0]: status, 0 on success
 | |
|   output[1]: current power level
 | |
| 
 | |
|   Get the power level for a specified power domain
 | |
| 
 | |
| rtas-set-indicator:
 | |
|   arg[0]: integer identifying sensor/indicator type
 | |
|   arg[1]: index of sensor, for DR-related sensors this is generally the
 | |
|           DRC index
 | |
|   arg[2]: desired sensor value
 | |
|   output[0]: status, 0 on success
 | |
| 
 | |
|   Set the state of an indicator or sensor. For the purpose of this document we
 | |
|   focus on the indicator/sensor types associated with a DRC. The types are:
 | |
| 
 | |
|     9001: isolation-state, controls/indicates whether a device has been made
 | |
|           accessible to a guest
 | |
| 
 | |
|           supported sensor values:
 | |
|             0: isolate, device is made unaccessible by guest OS
 | |
|             1: unisolate, device is made available to guest OS
 | |
| 
 | |
|     9002: dr-indicator, controls "visual" indicator associated with device
 | |
| 
 | |
|           supported sensor values:
 | |
|             0: inactive, resource may be safely removed
 | |
|             1: active, resource is in use and cannot be safely removed
 | |
|             2: identify, used to visually identify slot for interactive hotplug
 | |
|             3: action, in most cases, used in the same manner as identify
 | |
| 
 | |
|     9003: allocation-state, generally only used for "logical" DR resources to
 | |
|           request the allocation/deallocation of a resource prior to acquiring
 | |
|           it via isolation-state->unisolate, or after releasing it via
 | |
|           isolation-state->isolate, respectively. for "physical" DR (like PCI
 | |
|           hotplug/unplug) the pre-allocation of the resource is implied and
 | |
|           this sensor is unused.
 | |
| 
 | |
|           supported sensor values:
 | |
|             0: unusable, tell firmware/system the resource can be
 | |
|                unallocated/reclaimed and added back to the system resource pool
 | |
|             1: usable, request the resource be allocated/reserved for use by
 | |
|                guest OS
 | |
|             2: exchange, used to allocate a spare resource to use for fail-over
 | |
|                in certain situations. unused in QEMU
 | |
|             3: recover, used to reclaim a previously allocated resource that's
 | |
|                not currently allocated to the guest OS. unused in QEMU
 | |
| 
 | |
| rtas-get-sensor-state:
 | |
|   arg[0]: integer identifying sensor/indicator type
 | |
|   arg[1]: index of sensor, for DR-related sensors this is generally the
 | |
|           DRC index
 | |
|   output[0]: status, 0 on success
 | |
| 
 | |
|   Used to read an indicator or sensor value.
 | |
| 
 | |
|   For DR-related operations, the only noteworthy sensor is dr-entity-sense,
 | |
|   which has a type value of 9003, as allocation-state does in the case of
 | |
|   rtas-set-indicator. The semantics/encodings of the sensor values are distinct
 | |
|   however:
 | |
| 
 | |
|   supported sensor values for dr-entity-sense (9003) sensor:
 | |
|     0: empty,
 | |
|          for physical resources: DRC/slot is empty
 | |
|          for logical resources: unused
 | |
|     1: present,
 | |
|          for physical resources: DRC/slot is populated with a device/resource
 | |
|          for logical resources: resource has been allocated to the DRC
 | |
|     2: unusable,
 | |
|          for physical resources: unused
 | |
|          for logical resources: DRC has no resource allocated to it
 | |
|     3: exchange,
 | |
|          for physical resources: unused
 | |
|          for logical resources: resource available for exchange (see
 | |
|            allocation-state sensor semantics above)
 | |
|     4: recovery,
 | |
|          for physical resources: unused
 | |
|          for logical resources: resource available for recovery (see
 | |
|            allocation-state sensor semantics above)
 | |
| 
 | |
| rtas-ibm-configure-connector:
 | |
|   arg[0]: guest physical address of 4096-byte work area buffer
 | |
|   arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
 | |
|           if a prior RTAS response indicated a need for additional memory
 | |
|   output[0]: status:
 | |
|                0: completed transmittal of device-tree node
 | |
|                1: instruct guest to prepare for next DT sibling node
 | |
|                2: instruct guest to prepare for next DT child node
 | |
|                3: instruct guest to prepare for next DT property
 | |
|                4: instruct guest to ascend to parent DT node
 | |
|                5: instruct guest to provide additional work-area buffer
 | |
|                   via arg[1]
 | |
|             990x: instruct guest that operation took too long and to try
 | |
|                   again later
 | |
| 
 | |
|   Used to fetch an OF device-tree description of the resource associated with
 | |
|   a particular DRC. The DRC index is encoded in the first 4-bytes of the first
 | |
|   work area buffer.
 | |
| 
 | |
|   Work area layout, using 4-byte offsets:
 | |
|     wa[0]: DRC index of the DRC to fetch device-tree nodes from
 | |
|     wa[1]: 0 (hard-coded)
 | |
|     wa[2]: for next-sibling/next-child response:
 | |
|              wa offset of null-terminated string denoting the new node's name
 | |
|            for next-property response:
 | |
|              wa offset of null-terminated string denoting new property's name
 | |
|     wa[3]: for next-property response (unused otherwise):
 | |
|              byte-length of new property's value
 | |
|     wa[4]: for next-property response (unused otherwise):
 | |
|              new property's value, encoded as an OFDT-compatible byte array
 | |
| 
 | |
| == hotplug/unplug events ==
 | |
| 
 | |
| For most DR operations, the hypervisor will issue host->guest add/remove events
 | |
| using the EPOW/check-exception notification framework, where the host issues a
 | |
| check-exception interrupt, then provides an RTAS event log via an
 | |
| rtas-check-exception call issued by the guest in response. This framework is
 | |
| documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown
 | |
| requests via EPOW events.
 | |
| 
 | |
| For DR, this framework has been extended to include hotplug events, which were
 | |
| previously unneeded due to direct manipulation of DR-related guest userspace
 | |
| tools by host-level management such as an HMC. This level of management is not
 | |
| applicable to PowerKVM, hence the reason for extending the notification
 | |
| framework to support hotplug events.
 | |
| 
 | |
| The format for these EPOW-signalled events is described below under
 | |
| "hotplug/unplug event structure". Note that these events are not
 | |
| formally part of the PAPR+ specification, and have been superseded by a
 | |
| newer format, also described below under "hotplug/unplug event structure",
 | |
| and so are now deemed a "legacy" format. The formats are similar, but the
 | |
| "modern" format contains additional fields/flags, which are denoted for the
 | |
| purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.
 | |
| 
 | |
| QEMU should assume support only for "legacy" fields/flags unless the guest
 | |
| advertises support for the "modern" format via ibm,client-architecture-support
 | |
| hcall by setting byte 5, bit 6 of it's ibm,architecture-vec-5 option vector
 | |
| structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
 | |
| "modern" format events are surfaced to the guest via check-exception RTAS calls,
 | |
| but use a dedicated event source to signal the guest. This event source is
 | |
| advertised to the guest by the addition of a "hot-plug-events" node under
 | |
| "/event-sources" node of the guest's device tree using the standard format
 | |
| described in LoPAPR v11, B.6.12.1.
 | |
| 
 | |
| == hotplug/unplug event structure ==
 | |
| 
 | |
| The hotplug-specific payload in QEMU is implemented as follows (with all values
 | |
| encoded in big-endian format):
 | |
| 
 | |
| struct rtas_event_log_v6_hp {
 | |
| #define SECTION_ID_HOTPLUG              0x4850 /* HP */
 | |
|     struct section_header {
 | |
|         uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
 | |
|         uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
 | |
|                                          * plus the length of the DRC name
 | |
|                                          * if a DRC name identifier is
 | |
|                                          * specified for hotplug_identifier
 | |
|                                          */
 | |
|         uint8_t section_version;        /* version 1 */
 | |
|         uint8_t section_subtype;        /* unused */
 | |
|         uint16_t creator_component_id;  /* unused */
 | |
|     } hdr;
 | |
| #define RTAS_LOG_V6_HP_TYPE_CPU         1
 | |
| #define RTAS_LOG_V6_HP_TYPE_MEMORY      2
 | |
| #define RTAS_LOG_V6_HP_TYPE_SLOT        3
 | |
| #define RTAS_LOG_V6_HP_TYPE_PHB         4
 | |
| #define RTAS_LOG_V6_HP_TYPE_PCI         5
 | |
|     uint8_t hotplug_type;               /* type of resource/device */
 | |
| #define RTAS_LOG_V6_HP_ACTION_ADD       1
 | |
| #define RTAS_LOG_V6_HP_ACTION_REMOVE    2
 | |
|     uint8_t hotplug_action;             /* action (add/remove) */
 | |
| #define RTAS_LOG_V6_HP_ID_DRC_NAME          1
 | |
| #define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
 | |
| #define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
 | |
| #ifdef GUEST_SUPPORTS_MODERN
 | |
| #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
 | |
| #endif
 | |
|     uint8_t hotplug_identifier;         /* type of the resource identifier,
 | |
|                                          * which serves as the discriminator
 | |
|                                          * for the 'drc' union field below
 | |
|                                          */
 | |
| #ifdef GUEST_SUPPORTS_MODERN
 | |
|     uint8_t capabilities;               /* capability flags, currently unused
 | |
|                                          * by QEMU
 | |
|                                          */
 | |
| #else
 | |
|     uint8_t reserved;
 | |
| #endif
 | |
|     union {
 | |
|         uint32_t index;                 /* DRC index of resource to take action
 | |
|                                          * on
 | |
|                                          */
 | |
|         uint32_t count;                 /* number of DR resources to take
 | |
|                                          * action on (guest chooses which)
 | |
|                                          */
 | |
| #ifdef GUEST_SUPPORTS_MODERN
 | |
|         struct {
 | |
|             uint32_t count;             /* number of DR resources to take
 | |
|                                          * action on
 | |
|                                          */
 | |
|             uint32_t index;             /* DRC index of first resource to take
 | |
|                                          * action on. guest will take action
 | |
|                                          * on DRC index <index> through
 | |
|                                          * DRC index <index + count - 1> in
 | |
|                                          * sequential order
 | |
|                                          */
 | |
|         } count_indexed;
 | |
| #endif
 | |
|         char name[1];                   /* string representing the name of the
 | |
|                                          * DRC to take action on
 | |
|                                          */
 | |
|     } drc;
 | |
| } QEMU_PACKED;
 | |
| 
 | |
| == ibm,lrdr-capacity ==
 | |
| 
 | |
| ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
 | |
| the dynamic reconfiguration capabilities of the guest. It consists of a triple
 | |
| consisting of <phys>, <size> and <maxcpus>.
 | |
| 
 | |
|   <phys>, encoded in BE format represents the maximum address in bytes and
 | |
|   hence the maximum memory that can be allocated to the guest.
 | |
| 
 | |
|   <size>, encoded in BE format represents the size increments in which
 | |
|   memory can be hot-plugged to the guest.
 | |
| 
 | |
|   <maxcpus>, a BE-encoded integer, represents the maximum number of
 | |
|   processors that the guest can have.
 | |
| 
 | |
| pseries guests use this property to note the maximum allowed CPUs for the
 | |
| guest.
 | |
| 
 | |
| == ibm,dynamic-reconfiguration-memory ==
 | |
| 
 | |
| ibm,dynamic-reconfiguration-memory is a device tree node that represents
 | |
| dynamically reconfigurable logical memory blocks (LMB). This node
 | |
| is generated only when the guest advertises the support for it via
 | |
| ibm,client-architecture-support call. Memory that is not dynamically
 | |
| reconfigurable is represented by /memory nodes. The properties of this
 | |
| node that are of interest to the sPAPR memory hotplug implementation
 | |
| in QEMU are described here.
 | |
| 
 | |
| ibm,lmb-size
 | |
| 
 | |
| This 64bit integer defines the size of each dynamically reconfigurable LMB.
 | |
| 
 | |
| ibm,associativity-lookup-arrays
 | |
| 
 | |
| This property defines a lookup array in which the NUMA associativity
 | |
| information for each LMB can be found. It is a property encoded array
 | |
| that begins with an integer M, the number of associativity lists followed
 | |
| by an integer N, the number of entries per associativity list and terminated
 | |
| by M associativity lists each of length N integers.
 | |
| 
 | |
| This property provides the same information as given by ibm,associativity
 | |
| property in a /memory node. Each assigned LMB has an index value between
 | |
| 0 and M-1 which is used as an index into this table to select which
 | |
| associativity list to use for the LMB. This index value for each LMB
 | |
| is defined in ibm,dynamic-memory property.
 | |
| 
 | |
| ibm,dynamic-memory
 | |
| 
 | |
| This property describes the dynamically reconfigurable memory. It is a
 | |
| property encoded array that has an integer N, the number of LMBs followed
 | |
| by N LMB list entries.
 | |
| 
 | |
| Each LMB list entry consists of the following elements:
 | |
| 
 | |
| - Logical address of the start of the LMB encoded as a 64bit integer. This
 | |
|   corresponds to reg property in /memory node.
 | |
| - DRC index of the LMB that corresponds to ibm,my-drc-index property
 | |
|   in a /memory node.
 | |
| - Four bytes reserved for expansion.
 | |
| - Associativity list index for the LMB that is used as an index into
 | |
|   ibm,associativity-lookup-arrays property described earlier. This
 | |
|   is used to retrieve the right associativity list to be used for this
 | |
|   LMB.
 | |
| - A 32bit flags word. The bit at bit position 0x00000008 defines whether
 | |
|   the LMB is assigned to the partition as of boot time.
 | |
| 
 | |
| ibm,dynamic-memory-v2
 | |
| 
 | |
| This property describes the dynamically reconfigurable memory. This is
 | |
| an alternate and newer way to describe dynamically reconfigurable memory.
 | |
| It is a property encoded array that has an integer N (the number of
 | |
| LMB set entries) followed by N LMB set entries. There is an LMB set entry
 | |
| for each sequential group of LMBs that share common attributes.
 | |
| 
 | |
| Each LMB set entry consists of the following elements:
 | |
| 
 | |
| - Number of sequential LMBs in the entry represented by a 32bit integer.
 | |
| - Logical address of the first LMB in the set encoded as a 64bit integer.
 | |
| - DRC index of the first LMB in the set.
 | |
| - Associativity list index that is used as an index into
 | |
|   ibm,associativity-lookup-arrays property described earlier. This
 | |
|   is used to retrieve the right associativity list to be used for all
 | |
|   the LMBs in this set.
 | |
| - A 32bit flags word that applies to all the LMBs in the set.
 | |
| 
 | |
| [1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
 |