powerpc/pci: Add PCI resource alignment documentation

In order to enable SR-IOV on the PowerNV platform, the PF's IOV BAR needs to
be adjusted:

1. size expanded
2. aligned to M64BT size

This patch documents the reason for this change and how it is done.

[bhelgaas: reformat, clarify, expand]
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Documentation/powerpc/pci_iov_resource_on_powernv.txt
Wei Yang <weiyang@linux.vnet.ibm.com>
Benjamin Herrenschmidt <benh@au1.ibm.com>
Bjorn Helgaas <bhelgaas@google.com>
26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
these requirements.  The first two sections describe the concept of
Partitionable Endpoints and the implementation on P8 (IODA2).  The next
two sections talk about considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the
possibility of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of
"frozen" state bits (one for MMIO and one for DMA; they get set together
but can be cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's.  MSIs are also blocked.  There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each
PHB is a completely separate HW entity that replicates the entire logic,
so it has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.  We
    call this the RTT.  (A sketch of this lookup follows the list below.)

    - For DMA we then provide an entire address space for each PE that
      can contain two "windows", depending on the value of PCI address
      bit 59.  Each window can be configured to be remapped via a "TCE
      table" (IOMMU translation table), which has various configurable
      characteristics not described here.

    - For MSIs, we have two windows in the address space (one at the top
      of the 32-bit space and one much higher) which, via a combination
      of the address and MSI value, will result in one of the 2048
      interrupts per bridge being triggered.  There's a PE# in the
      interrupt controller descriptor table as well which is compared
      with the PE# obtained from the RTT to "authorize" the device to
      emit that specific interrupt.

    - Error messages just use the RTT.
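    The following is a minimal C sketch of the RID -> PE# lookup that the
    RTT implements.  The rtt[] array, its size, and the helper names are
    illustrative only (the real table lives in system memory and is
    walked by the chip, not by software on each transaction); the RID bit
    layout is the standard PCIe one:

	#include <stdint.h>

	#define RTT_ENTRIES 65536		/* one entry per 16-bit RID */

	static uint8_t rtt[RTT_ENTRIES];	/* RID -> PE# (up to 256 PEs) */

	/* Standard PCIe RID layout: bus[15:8] | dev[7:3] | fn[2:0] */
	static inline uint16_t pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
	{
		return (bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x07);
	}

	static inline uint8_t pe_for_rid(uint16_t rid)
	{
		return rtt[rid];
	}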
  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one
    M32 window and sixteen M64 windows.  They have different
    characteristics.  First, what they have in common: they forward a
    configurable portion of the CPU address space to the PCIe bus and
    must be naturally aligned and a power of two in size.  The rest is
    different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to
        generate 32-bit PCIe accesses.  We configure that window at boot
        from FW and don't touch it from Linux; it's usually set to
        forward a 2GB portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we
        just need to ensure Linux doesn't assign anything there.  The M32
        logic itself ignores that, however, and will forward accesses in
        that space if we try.)

      * It is divided into 256 segments of equal size.  A table in the
        chip maps each segment to a PE#.  That allows portions of the
        MMIO space to be assigned to PEs on a segment granularity.  For a
        2GB window, the segment granularity is 2GB/256 = 8MB.

      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV).  We basically use the trick of forcing the bridge MMIO
      windows onto a segment alignment/granularity so that the space
      behind a bridge can be assigned to a PE.

      Ideally we would like to be able to have individual functions in
      PEs, but that would mean using a completely different address
      allocation scheme where individual function BARs can be "grouped"
      to fit in one or more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as
        the address on the PowerBus).  There is a way to also set the top
        14 bits, which are not conveyed by PowerBus, but we don't use
        this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a
        segment to a PE#.  The segment number *is* the PE# (see the
        sketch below).

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.
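    Because the segment number of a segmented M64 window is the PE#, the
    PE# for an MMIO address falls directly out of the address arithmetic.
    A minimal sketch (the helper name and window parameters are
    illustrative, not the kernel's):

	#include <stdint.h>

	#define M64_SEGMENTS 256

	/* Caller must ensure win_base <= addr < win_base + win_size and
	 * that win_size is a naturally aligned power of two (M64 windows
	 * are at least 256MB, so seg_size is never zero). */
	static inline unsigned int m64_pe_for_addr(uint64_t win_base,
						   uint64_t win_size,
						   uint64_t addr)
	{
		uint64_t seg_size = win_size / M64_SEGMENTS;

		return (addr - win_base) / seg_size;
	}

    For example, a 64GB segmented window yields 64GB/256 = 256MB
    segments.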
    We have code (fairly new compared to the M32 stuff) that exploits
    that for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address
    space that has been assigned by FW for the PHB (about 64GB, ignoring
    the space for the M32; it comes out of a different "reserve").  We
    configure it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that,
    for example for devices with very large BARs, e.g., GPUs.  It would
    make sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's
    SR-IOV Capability control the number of VFs and whether they are
    enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.
    For a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF
    devices, software uses VF BAR registers in the *PF* SR-IOV Capability
    to discover sizes and assign addresses.  The BARs in the VF's config
    space header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if
    the PF SR-IOV Capability is programmed to enable eight VFs, and it
    has a 1MB VF BAR0, the address in that VF BAR sets the base of an 8MB
    region.  This region is divided into eight contiguous 1MB regions,
    each of which is a BAR0 for one of the VFs.  Note that even though
    the VF BAR describes an 8MB region, the alignment requirement is for
    a single VF, i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments.  The finest granularity possible is a 256MB
      window with 1MB segments.  VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window.  Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size.
      If they are different sizes, the entire window has to be small
      enough that the segment size matches the smallest VF BAR, which
      means larger VF BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped
      entirely to a single PE, so it could only isolate one VF.

    - Single segmented M64 window: A segmented M64 window could be used
      just like the M32 window, but the segments can't be individually
      mapped to PEs (the segment number is the PE#), so there isn't as
      much flexibility.  A VF with multiple BARs would have to be in a
      "domain" of multiple PEs, which is not as well isolated as a single
      PE.

    - Multiple segmented M64 windows: As usual, each window is split into
      256 equally-sized segments, and the segment number is the PE#.  But
      if we use several M64 windows, they can be set to different base
      addresses and different segment sizes.  If we have VFs that each
      have a 1MB BAR and a 32MB BAR, we could use one M64 window to
      assign 1MB segments and another M64 window to assign 32MB segments
      (see the sketch below).
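    A small C sketch of the sizing arithmetic behind that last strategy:
    one window per VF BAR size, sized so that one segment equals one VF
    BAR and therefore one PE.  The struct and helper names are
    illustrative only:

	#include <stdint.h>

	#define M64_SEGMENTS 256

	struct m64_plan {
		uint64_t seg_size;	/* one segment == one VF BAR == one PE */
		uint64_t win_size;	/* whole window: 256 * vf_bar_size */
	};

	static struct m64_plan plan_m64_for_vf_bar(uint64_t vf_bar_size)
	{
		struct m64_plan p = {
			.seg_size = vf_bar_size,
			.win_size = vf_bar_size * M64_SEGMENTS,
		};
		return p;
	}

	/*
	 * For the example above, a VF with a 1MB BAR and a 32MB BAR needs
	 * two windows:
	 *   plan_m64_for_vf_bar(1ULL << 20)  -> 256MB window, 1MB segments
	 *   plan_m64_for_vf_bar(32ULL << 20) -> 8GB window, 32MB segments
	 */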
    Finally, there is the plan to use M64 windows for SR-IOV, which will
    be described in more detail in the rest of this section and in the
    next one.  For a given VF BAR, we need to effectively reserve the
    entire 256 segments (256 * VF BAR size) and position the VF BAR to
    start at the beginning of a free range of segments/PEs inside that
    M64 window.

    The goal is, of course, to be able to give a separate PE to each VF.

    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s.  Each M64 window defines one MMIO range, and this
    range is divided into 256 segments, with each segment corresponding
    to one PE.

    We decided to leverage these M64 windows to map VFs to individual
    PEs, since SR-IOV VF BARs are all the same size.

    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR
    directly to one M64 window, some part of the M64 window will map to
    another device's MMIO range.

    IODA supports 256 PEs, so segmented windows contain 256 segments, so
    if total_VFs is less than 256, we have the situation in Figure 1.0,
    where segments [total_VFs, 255] of the M64 window may map to some
    MMIO range on other devices:

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                              VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                              M64 window

                Figure 1.0 Direct map VF(n) BAR space

    Our current solution is to allocate 256 segments even if the VF(n)
    BAR space doesn't need that much, as shown in Figure 1.1:

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                              VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                              M64 window

                Figure 1.1 Map VF(n) BAR space + extra

    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices.  Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they
    only respond to segments [0, total_VFs - 1].  There's nothing in
    hardware that responds to segments [total_VFs, 255].
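    In other words, the PF's IOV BAR resource is expanded from the size
    the VFs strictly need to the size of the whole window.  A minimal
    sketch of that arithmetic (the helper names are illustrative, not the
    kernel's):

	#include <stdint.h>

	#define M64_SEGMENTS 256

	/* Space the VFs themselves need vs. space we reserve so that the
	 * whole segmented M64 window belongs to this one SR-IOV device. */
	static inline uint64_t vf_bar_needed(uint64_t vf_bar_size,
					     unsigned int total_vfs)
	{
		return vf_bar_size * total_vfs;
	}

	static inline uint64_t vf_bar_reserved(uint64_t vf_bar_size)
	{
		return vf_bar_size * M64_SEGMENTS;	/* Figure 1.1 */
	}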
4. Implications for the Generic PCI Code

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an
M32 window, we can set the PE# by updating the table that translates
segments to PE#s.  Similarly, if the address is in an unsegmented M64
window, we can set the PE# for the window.  But if it's in a segmented
M64 window, the segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the
exact amount of space required for the VF(n) BAR space, the VF BAR value
is fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of
VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of
PE# choices because instead of consuming only numVFs segments, the VF(n)
BAR space will consume (numVFs * n) segments, where n is the number of
segments needed to cover one VF BAR.  That means there aren't as many
available segments for adjusting the base of the VF(n) BAR space.
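A small sketch of the base-shifting arithmetic in the ideal case above
(one segment per VF BAR); the helper name and the failure convention are
illustrative only:

	#include <stdint.h>

	#define M64_SEGMENTS 256

	/*
	 * Pick the VF BAR base that puts VF0 in PE 'pe_x'; VF(n) then
	 * lands in PE(pe_x + n).  Returns 0 if the VFs would not fit in
	 * the remaining segments, which is why there are only
	 * (256 - num_vfs) valid choices of pe_x.
	 */
	static uint64_t vf_bar_base_for_pe(uint64_t win_base,
					   uint64_t seg_size,
					   unsigned int pe_x,
					   unsigned int num_vfs)
	{
		if (pe_x + num_vfs > M64_SEGMENTS)
			return 0;
		return win_base + (uint64_t)pe_x * seg_size;
	}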