Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

cxl: Documentation/driver-api/cxl: Describe the x86 Low Memory Hole solution

Add documentation on how to resolve conflicts between CXL Fixed Memory
Windows, Platform Low Memory Holes, intermediate Switch and Endpoint
Decoders.

[dj]: Fixed inconsistent spacing after '.'
[dj]: Fixed subject line from Alison.
[dj]: Removed '::' before table from Bagas.

Reviewed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Authored by Fabio M. De Francesco and committed by Dave Jiang
c5dca386 c4272905

+135
Documentation/driver-api/cxl/conventions.rst
···
 ----------------------------------
 
 <Propose spec language that corrects the conflict.>
+
+
+Resolve conflict between CFMWS, Platform Memory Holes, and Endpoint Decoders
+============================================================================
+
+Document
+--------
+
+CXL Revision 3.2, Version 1.0
+
+License
+-------
+
+SPDX-License Identifier: CC-BY-4.0
+
+Creator/Contributors
+--------------------
+
+- Fabio M. De Francesco, Intel
+- Dan J. Williams, Intel
+- Mahesh Natu, Intel
+
+Summary of the Change
+---------------------
+
+According to the current Compute Express Link (CXL) Specifications (Revision
+3.2, Version 1.0), the CXL Fixed Memory Window Structure (CFMWS) describes zero
+or more Host Physical Address (HPA) windows associated with each CXL Host
+Bridge. Each window represents a contiguous HPA range that may be interleaved
+across one or more targets, including CXL Host Bridges. Each window has a set
+of restrictions that govern its usage. It is the Operating System-directed
+configuration and Power Management (OSPM) responsibility to utilize each window
+for the specified use.
+
+Table 9-22 of the current CXL Specifications states that the Window Size field
+contains the total number of consecutive bytes of HPA this window describes.
+This value must be a multiple of the Number of Interleave Ways (NIW) * 256 MB.
+
+Platform Firmware (BIOS) might reserve physical addresses below 4 GB where a
+memory gap such as the Low Memory Hole for PCIe MMIO may exist. In such cases,
+the CFMWS Range Size may not adhere to the NIW * 256 MB rule.
+
+The HPA represents the actual physical memory address space that the CXL devices
+can decode and respond to, while the System Physical Address (SPA), a related
+but distinct concept, represents the system-visible address space that users can
+direct transactions to, and so it excludes reserved regions.
+
+BIOS publishes CFMWS to communicate the active SPA ranges that, on platforms
+with Low Memory Holes (LMHs), map to a strict subset of the HPA. The SPA range
+trims out the hole, resulting in lost capacity in the Endpoints with no SPA to
+map to that part of the HPA range that intersects the hole.
+
+E.g., an x86 platform with two CFMWS and an LMH starting at 2 GB:
+
++--------+------------+-------------------+------------------+-------------------+------+
+| Window | CFMWS Base | CFMWS Size        | HDM Decoder Base | HDM Decoder Size  | Ways |
++========+============+===================+==================+===================+======+
+| 0      | 0 GB       | 2 GB              | 0 GB             | 3 GB              | 12   |
++--------+------------+-------------------+------------------+-------------------+------+
+| 1      | 4 GB       | NIW*256MB Aligned | 4 GB             | NIW*256MB Aligned | 12   |
++--------+------------+-------------------+------------------+-------------------+------+
+
+HDM decoder base and HDM decoder size represent all the 12 Endpoint Decoders of
+a 12-way region and all the intermediate Switch Decoders. They are configured
+by the BIOS according to the NIW * 256MB rule, resulting in an HPA range size of
+3GB. In contrast, the CFMWS Base and CFMWS Size are used to configure the Root
+Decoder HPA range, which ends up smaller (2GB) than that of the Switch and
+Endpoint Decoders in the hierarchy (3GB).
+
+This creates 2 issues which lead to a failure to construct a region:
+
+1) A mismatch in region size between root and any HDM decoder. The root decoders
+   will always be smaller due to the trim.
+
+2) The trim causes the root decoder to violate the (NIW * 256MB) rule.
+
+This change allows a region with a base address of 0GB to bypass these checks to
+allow for region creation with the trimmed root decoder address range.
+
+This change does not allow for any other arbitrary region to violate these
+checks - it is intended exclusively to enable x86 platforms which map CXL memory
+under 4GB.
+
+Despite the HDM decoders covering the PCIe hole HPA region, it is expected that
+the platform will never route accesses within the hole to the CXL complex
+because the root decoder only covers the trimmed region (which excludes the
+hole). This is outside the ability of Linux to enforce.
+
+On the example platform, only the first 2GB will be potentially usable, but
+Linux, aiming to adhere to the current specifications, fails to construct
+Regions and attach Endpoint and intermediate Switch Decoders to them.
+
+There are several points of failure, all due to the expectation that the Root
+Decoder HPA size, which is equal to that of the CFMWS from which it is
+configured, has to be greater than or equal to that of the matching Switch and
+Endpoint HDM Decoders.
+
+In order to succeed with construction and attachment, Linux must construct a
+Region with the Root Decoder HPA range size, and then attach to it all the
+intermediate Switch Decoders and Endpoint Decoders that belong to the hierarchy
+regardless of their range sizes.
+
+Benefits of the Change
+----------------------
+
+Without the change, the OSPM wouldn't match intermediate Switch and Endpoint
+Decoders with Root Decoders configured with CFMWS HPA sizes that don't align
+with the NIW * 256MB constraint, which leads to lost memdev capacity.
+
+This change allows the OSPM to construct Regions and attach intermediate Switch
+and Endpoint Decoders to them, so that the addressable part of the memory
+devices' total capacity is made available to the users.
+
+References
+----------
+
+Compute Express Link Specification Revision 3.2, Version 1.0
+<https://www.computeexpresslink.org/>
+
+Detailed Description of the Change
+----------------------------------
+
+The description of the Window Size field in table 9-22 needs to account for
+platforms with Low Memory Holes, where SPA ranges might be subsets of the
+endpoints' HPA. Therefore, it has to be changed to the following:
+
+"The total number of consecutive bytes of HPA this window represents. This value
+shall be a multiple of NIW * 256 MB.
+
+On platforms that reserve physical addresses below 4 GB, such as the Low Memory
+Hole for PCIe MMIO on x86, an instance of CFMWS whose Base HPA is 0 might
+have a size that doesn't align with the NIW * 256 MB constraint.
+
+Note that the matching intermediate Switch Decoders and the Endpoint Decoders
+HPA range sizes must still align to the above-mentioned rule, but the memory
+capacity that exceeds the CFMWS window size won't be accessible."
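
The arithmetic behind the example table can be checked with a short sketch. This is not part of the patch and is not the kernel's implementation. The helper names (`is_niw_aligned`, `lmh_match`, `usable_capacity`) are hypothetical, and the exact set of conditions in `lmh_match` is an assumption drawn from the prose above (base address of 0, window below 4 GB, decoders still aligned to NIW * 256 MB).

```python
MiB = 1 << 20
GiB = 1 << 30
GRANULE = 256 * MiB  # the "256 MB" unit of the Window Size rule


def is_niw_aligned(size, ways):
    """Return True if size is a multiple of ways * 256 MB."""
    return size % (ways * GRANULE) == 0


def lmh_match(root_base, root_size, dec_base, dec_size, ways):
    """Hypothetical relaxed match for a root decoder trimmed by a Low Memory Hole.

    Assumed conditions, taken from the description above:
    - the window starts at HPA 0 and ends below 4 GB,
    - the HDM decoder starts at the same base and is larger than the window,
    - the HDM decoder size still obeys the NIW * 256 MB rule.
    """
    return (
        root_base == 0
        and dec_base == root_base
        and root_base + root_size < 4 * GiB
        and root_size < dec_size
        and is_niw_aligned(dec_size, ways)
    )


def usable_capacity(root_size, dec_size):
    """Only the part of the decoder range covered by the window is addressable."""
    return min(root_size, dec_size)


# Window 0 from the example table: 12 ways, 2 GB CFMWS, 3 GB HDM decoders.
ways = 12
print(is_niw_aligned(2 * GiB, ways))                 # trimmed window breaks the rule
print(is_niw_aligned(3 * GiB, ways))                 # decoders still obey it
print(lmh_match(0, 2 * GiB, 0, 3 * GiB, ways))       # relaxed match applies
print(usable_capacity(2 * GiB, 3 * GiB) // GiB)      # capacity left after the trim, in GB
```

With 12 ways the rule requires multiples of 12 * 256 MB = 3 GB, so the 2 GB window of Window 0 cannot satisfy it while the 3 GB decoders do. A window based at 4 GB, such as Window 1, is never matched by this relaxed path and must pass the usual checks.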