===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles these
requirements. The first two sections describe the concept of Partitionable
Endpoints and its implementation on P8 (IODA2). The next two sections talk
about considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, etc.,
but that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

* Inbound

  For DMA, MSIs and inbound PCIe error messages, we have a table (in
  memory but accessed in HW by the chip) that provides a direct
  correspondence between a PCIe RID (bus/dev/fn) and a PE number.
  We call this the RTT. (A sketch of this lookup follows the list below.)

  - For DMA we then provide an entire address space for each PE that can
    contain two "windows", depending on the value of PCI address bit 59.
    Each window can be configured to be remapped via a "TCE table" (IOMMU
    translation table), which has various configurable characteristics
    not described here.

  - For MSIs, we have two windows in the address space (one at the top of
    the 32-bit space and one much higher) which, via a combination of the
    address and MSI value, will result in one of the 2048 interrupts per
    bridge being triggered. There's a PE# in the interrupt controller
    descriptor table as well, which is compared with the PE# obtained from
    the RTT to "authorize" the device to emit that specific interrupt.

  - Error messages just use the RTT.
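
  As a rough illustration (the names and sizes below are illustrative, not
  the kernel's actual data structures), the RID-to-PE lookup boils down to
  indexing a flat per-PHB table with the 16-bit RID::

    #include <stdint.h>

    #define RTT_ENTRIES 65536          /* one entry per possible 16-bit RID */

    typedef uint16_t pe_number_t;      /* 0..255 on IODA2 */

    /* Per-PHB table consulted by HW for DMA, MSIs and error messages. */
    static pe_number_t rtt[RTT_ENTRIES];

    static inline uint16_t pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
    {
            return (uint16_t)(bus << 8) | (uint16_t)((dev & 0x1f) << 3) |
                   (uint16_t)(fn & 0x7);
    }

    static inline pe_number_t rid_to_pe(uint8_t bus, uint8_t dev, uint8_t fn)
    {
            return rtt[pci_rid(bus, dev, fn)];
    }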

* Outbound. That's where the tricky part is.

  Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
  from the CPU address space to the PCI address space. There is one M32
  window and sixteen M64 windows. They have different characteristics.
  First, what they have in common: they forward a configurable portion of
  the CPU address space to the PCIe bus and must be a naturally aligned
  power of two in size. The rest is different:

  - The M32 window:

    * Is limited to 4GB in size.

    * Drops the top bits of the address (above the size) and replaces
      them with a configurable value. This is typically used to generate
      32-bit PCIe accesses. We configure that window at boot from FW and
      don't touch it from Linux; it's usually set to forward a 2GB
      portion of address space from the CPU to PCIe
      0x8000_0000..0xffff_ffff. (Note: the top 64KB are actually
      reserved for MSIs, but this is not a problem at this point; we just
      need to ensure Linux doesn't assign anything there. The M32 logic,
      however, ignores that and will forward accesses in that space if we
      try.)

    * Is divided into 256 segments of equal size. A table in the chip
      maps each segment to a PE#. That allows portions of the MMIO space
      to be assigned to PEs on a segment granularity. For a 2GB window,
      the segment granularity is 2GB/256 = 8MB. (A sketch of this mapping
      appears at the end of this item.)

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV). We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.
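
    To make the arithmetic concrete, here is a minimal sketch of the M32
    segment lookup, assuming a 2GB window; the table and helper names are
    illustrative only, not the actual kernel code::

      #include <stdint.h>

      #define M32_WINDOW_SIZE  0x80000000ULL                     /* 2GB */
      #define M32_SEGMENTS     256
      #define M32_SEGMENT_SIZE (M32_WINDOW_SIZE / M32_SEGMENTS)  /* 8MB */

      /* Hypothetical SW mirror of the in-chip segment-to-PE# table. */
      static uint8_t m32_segment_to_pe[M32_SEGMENTS];

      /* PE# for an MMIO address falling inside the M32 window. */
      static uint8_t m32_addr_to_pe(uint64_t addr, uint64_t window_base)
      {
              uint64_t segment = (addr - window_base) / M32_SEGMENT_SIZE;

              return m32_segment_to_pe[segment];
      }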

  - The M64 windows:

    * Must be at least 256MB in size.

    * Do not translate addresses (the address on PCIe is the same as the
      address on the PowerBus). There is a way to also set the top 14
      bits, which are not conveyed by PowerBus, but we don't use this.

    * Can be configured to be segmented. When not segmented, we can
      specify the PE# for the entire window. When segmented, a window
      has 256 segments; however, there is no table for mapping a segment
      to a PE#. The segment number *is* the PE#. (A sketch contrasting
      this with the M32 case follows this list.)

    * Support overlaps. If an address is covered by multiple windows,
      there's a defined ordering for which window applies.
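
  For contrast with the M32 case, a segmented M64 window needs no lookup
  table at all: the segment index is the PE#. A minimal sketch, again with
  illustrative names only::

    #include <stdint.h>

    #define M64_SEGMENTS 256

    /*
     * PE# for an MMIO address inside a segmented M64 window of the given
     * size. There is no remapping: the segment number is the PE#.
     */
    static uint8_t m64_addr_to_pe(uint64_t addr, uint64_t window_base,
                                  uint64_t window_size)
    {
            uint64_t segment_size = window_size / M64_SEGMENTS;

            return (uint8_t)((addr - window_base) / segment_size);
    }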

  We have code (fairly new compared to the M32 stuff) that exploits that
  for large BARs in 64-bit space:

  We configure an M64 window to cover the entire region of address space
  that has been assigned by FW for the PHB (about 64GB, ignoring the space
  for the M32, which comes out of a different "reserve"). We configure it
  as segmented.

  Then we do the same thing as with M32, using the bridge alignment
  trick, to match to those giant segments.

  Since we cannot remap, we have two additional constraints:

  - We do the PE# allocation *after* the 64-bit space has been assigned,
    because the addresses we use directly determine the PE#. We then
    update the M32 PE# for the devices that use both 32-bit and 64-bit
    spaces, or assign the remaining PE#s to 32-bit-only devices.

  - We cannot "group" segments in HW, so if a device ends up using more
    than one segment, we end up with more than one PE#. There is a HW
    mechanism to make the freeze state cascade to "companion" PEs, but
    that only works for PCIe error messages (typically used so that if
    you freeze a switch, it freezes all its children). So we do it in
    SW. We lose a bit of effectiveness of EEH in that case, but that's
    the best we have found. So when any of the PEs freezes, we freeze the
    other ones for that "domain". We thus introduce the concept of a
    "master PE", which is the one used for DMA, MSIs, etc., and "secondary
    PEs" that are used for the remaining M64 segments.

  We would like to investigate using additional M64 windows in "single
  PE" mode to overlay specific BARs to work around some of that, for
  example for devices with very large BARs, e.g., GPUs. It would make
  sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

* SR-IOV Background

  The PCIe SR-IOV feature allows a single Physical Function (PF) to
  support several Virtual Functions (VFs). Registers in the PF's SR-IOV
  Capability control the number of VFs and whether they are enabled.

  When VFs are enabled, they appear in Configuration Space like normal
  PCI devices, but the BARs in VF config space headers are unusual. For
  a non-VF device, software uses BARs in the config space header to
  discover the BAR sizes and assign addresses for them. For VF devices,
  software uses VF BAR registers in the *PF* SR-IOV Capability to
  discover sizes and assign addresses. The BARs in the VF's config space
  header are read-only zeros.

  When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
  base address for all the corresponding VF(n) BARs. For example, if the
  PF SR-IOV Capability is programmed to enable eight VFs, and it has a
  1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
  This region is divided into eight contiguous 1MB regions, each of which
  is a BAR0 for one of the VFs. Note that even though the VF BAR
  describes an 8MB region, the alignment requirement is for a single VF,
  i.e., 1MB in this example.
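
  The resulting address layout can be written down directly. A minimal
  sketch of the eight-VF, 1MB VF BAR0 example above (vf_bar0_base stands
  for whatever value software programmed into the PF's VF BAR0; the helper
  name is illustrative only)::

    #include <stdint.h>

    #define NUM_VFS      8
    #define VF_BAR0_SIZE 0x100000ULL      /* 1MB per VF */

    /* MMIO base of BAR0 for VF "n", 0 <= n < NUM_VFS. */
    static uint64_t vf_bar0_addr(uint64_t vf_bar0_base, unsigned int n)
    {
            /* The 8MB region is simply NUM_VFS contiguous 1MB slices. */
            return vf_bar0_base + (uint64_t)n * VF_BAR0_SIZE;
    }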

There are several strategies for isolating VFs in PEs:

- M32 window: There's one M32 window, and it is split into 256
  equally-sized segments. The finest granularity possible is a 256MB
  window with 1MB segments. VF BARs that are 1MB or larger could be
  mapped to separate PEs in this window. Each segment can be
  individually mapped to a PE via the lookup table, so this is quite
  flexible, but it works best when all the VF BARs are the same size. If
  they are different sizes, the entire window has to be small enough that
  the segment size matches the smallest VF BAR, which means larger VF
  BARs span several segments.

- Non-segmented M64 window: A non-segmented M64 window is mapped entirely
  to a single PE, so it could only isolate one VF.

- Single segmented M64 window: A segmented M64 window could be used just
  like the M32 window, but the segments can't be individually mapped to
  PEs (the segment number is the PE#), so there isn't as much
  flexibility. A VF with multiple BARs would have to be in a "domain" of
  multiple PEs, which is not as well isolated as a single PE.

- Multiple segmented M64 windows: As usual, each window is split into 256
  equally-sized segments, and the segment number is the PE#. But if we
  use several M64 windows, they can be set to different base addresses
  and different segment sizes. If we have VFs that each have a 1MB BAR
  and a 32MB BAR, we could use one M64 window to assign 1MB segments and
  another M64 window to assign 32MB segments.
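
To illustrate the last strategy, the per-BAR-size choice of window can be
captured in a small helper. This is a sketch only; the window descriptors,
addresses and names are hypothetical, not the kernel's structures::

  #include <stdint.h>

  struct m64_window {
          uint64_t base;          /* illustrative, naturally aligned base */
          uint64_t segment_size;  /* also the per-VF BAR size it serves */
  };

  /* One hypothetical window per distinct VF BAR size. */
  static const struct m64_window win_1mb  = { 0x3fe0000000ULL, 0x100000ULL };
  static const struct m64_window win_32mb = { 0x4000000000ULL, 0x2000000ULL };

  /*
   * Because the segment number is the PE#, placing VF n's 1MB BAR in
   * segment n of win_1mb and its 32MB BAR in segment n of win_32mb puts
   * both BARs of that VF in the same PE.
   */
  static uint64_t vf_bar_addr(const struct m64_window *win, unsigned int n)
  {
          return win->base + (uint64_t)n * win->segment_size;
  }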

Finally, there is the plan to use M64 windows for SR-IOV, which is described
in more detail in the rest of this section and in the next one. For a given
VF BAR, we need to effectively reserve the entire 256 segments (256 * VF BAR
size) and position the VF BAR to start at the beginning of a free range of
segments/PEs inside that M64 window.

The goal is of course to be able to give each VF a separate PE.

The IODA2 platform has 16 M64 windows, which are used to map MMIO
ranges to PE#s. Each M64 window defines one MMIO range and this range is
divided into 256 segments, with each segment corresponding to one PE.

We decided to leverage these M64 windows to map VFs to individual PEs, since
SR-IOV VF BARs are all the same size.

But doing so introduces another problem: total_VFs is usually smaller
than the number of M64 window segments, so if we map one VF BAR directly
to one M64 window, some part of the M64 window will map to another
device's MMIO range.

IODA supports 256 PEs, so a segmented window contains 256 segments; thus, if
total_VFs is less than 256, we have the situation in Figure 1.0, where
segments [total_VFs, 255] of the M64 window may map to some MMIO range on
other devices::

     0      1                total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                VF(n) BAR space

     0      1                total_VFs - 1                  255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                             M64 window

                Figure 1.0 Direct map VF(n) BAR space

Our current solution is to allocate 256 segments even if the VF(n) BAR
space doesn't need that much, as shown in Figure 1.1::

     0      1                total_VFs - 1                  255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                        VF(n) BAR space + extra

     0      1                total_VFs - 1                  255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                             M64 window

                Figure 1.1 Map VF(n) BAR space + extra

Allocating the extra space ensures that the entire M64 window will be
assigned to this one SR-IOV device and none of the space will be
available for other devices. Note that this only expands the space
reserved in software; there are still only total_VFs VFs, and they only
respond to segments [0, total_VFs - 1]. There's nothing in hardware that
responds to segments [total_VFs, 255].
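
The arithmetic behind this over-allocation is simple. A minimal sketch,
assuming an illustrative total_VFs of 16 and a 1MB VF BAR (the names are
hypothetical)::

  #include <stdint.h>

  #define M64_SEGMENTS 256

  static const unsigned int total_VFs   = 16;
  static const uint64_t     vf_bar_size = 0x100000ULL;   /* 1MB */

  /* Space the VFs actually respond to: segments [0, total_VFs - 1]. */
  static uint64_t vf_bar_space(void)
  {
          return (uint64_t)total_VFs * vf_bar_size;       /* 16MB */
  }

  /* Space we reserve anyway, so no other device lands in the window. */
  static uint64_t m64_window_reserve(void)
  {
          return (uint64_t)M64_SEGMENTS * vf_bar_size;    /* 256MB */
  }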

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
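
In other words, sliding the VF BAR base inside the over-allocated region
selects the starting PE#. A minimal sketch, with hypothetical names for the
window parameters and the chosen starting PE#::

  #include <assert.h>
  #include <stdint.h>

  #define M64_SEGMENTS 256

  /*
   * Choose where VF0 lands: VF0 goes to PE pe_start, VF(n) to
   * PE(pe_start + n). Returns the value to program into the PF's VF BAR.
   */
  static uint64_t place_vf_bars(uint64_t m64_window_base, uint64_t vf_bar_size,
                                unsigned int numVFs, unsigned int pe_start)
  {
          /* All numVFs segments must fit inside the 256-segment window. */
          assert(pe_start + numVFs <= M64_SEGMENTS);

          return m64_window_base + (uint64_t)pe_start * vf_bar_size;
  }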

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.