Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

libnvdimm: documentation clarifications

A bunch of changes that I hope will help first-time readers
understand it better.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Authored by Konrad Rzeszutek Wilk, committed by Dan Williams
8de5dff8 589e75d1

+28 -21
Documentation/nvdimm/nvdimm.txt
···
 mmap persistent memory, from a PMEM block device, directly into a
 process address space.
 
+DSM: Device Specific Method: ACPI method to control specific
+device - in this case the firmware.
+
+DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
+It defines a vendor-id, device-id, and interface format for a given DIMM.
+
 BTT: Block Translation Table: Persistent memory is byte addressable.
 Existing software may have an expectation that the power-fail-atomicity
 of writes is at least one sector, 512 bytes. The BTT is an indirection
···
 registered, can be immediately attached to nd_pmem.
 
 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
-defined apertures. A set of apertures will all access just one DIMM.
-Multiple windows allow multiple concurrent accesses, much like
+defined apertures. A set of apertures will access just one DIMM.
+Multiple windows (apertures) allow multiple concurrent accesses, much like
 tagged-command-queuing, and would likely be used by different threads or
 different CPUs.
 
 The NFIT specification defines a standard format for a BLK-aperture, but
 the spec also allows for vendor specific layouts, and non-NFIT BLK
-implementations may other designs for BLK I/O. For this reason "nd_blk"
-calls back into platform-specific code to perform the I/O. One such
-implementation is defined in the "Driver Writer's Guide" and "DSM
+implementations may have other designs for BLK I/O. For this reason
+"nd_blk" calls back into platform-specific code to perform the I/O.
+One such implementation is defined in the "Driver Writer's Guide" and "DSM
 Interface Example".
 
 
···
 While PMEM provides direct byte-addressable CPU-load/store access to
 NVDIMM storage, it does not provide the best system RAS (recovery,
 availability, and serviceability) model. An access to a corrupted
-system-physical-address address causes a cpu exception while an access
+system-physical-address address causes a CPU exception while an access
 to a corrupted address through an BLK-aperture causes that block window
 to raise an error status in a register. The latter is more aligned with
 the standard error model that host-bus-adapter attached disks present.
···
 several DIMMs.
 
 PMEM vs BLK
-BLK-apertures solve this RAS problem, but their presence is also the
+BLK-apertures solve these RAS problems, but their presence is also the
 major contributing factor to the complexity of the ND subsystem. They
 complicate the implementation because PMEM and BLK alias in DPA space.
 Any given DIMM's DPA-range may contribute to one or more
···
 by a region device with a dynamically assigned id (REGION0 - REGION5).
 
 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
-single PMEM namespace is created in the REGION0-SPA-range that spans
-DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+single PMEM namespace is created in the REGION0-SPA-range that spans most
+of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
 interleaved system-physical-address range is reclaimed as BLK-aperture
 accessed space starting at DPA-offset (a) into each DIMM. In that
 reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
···
 
 2. In the last portion of DIMM0 and DIMM1 we have an interleaved
 system-physical-address range, REGION1, that spans those two DIMMs as
-well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace
-named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for
+well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
+named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
 each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
 "blk5.0".
 
 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
-interleaved system-physical-address range (i.e. the DPA address below
+interleaved system-physical-address range (i.e. the DPA address past
 offset (b) are also included in the "blk4.0" and "blk5.0" namespaces.
 Note, that this example shows that BLK-aperture namespaces don't need to
 be contiguous in DPA-space.
···
 
 What follows is a description of the LIBNVDIMM sysfs layout and a
 corresponding object hierarchy diagram as viewed through the LIBNDCTL
-api. The example sysfs paths and diagrams are relative to the Example
+API. The example sysfs paths and diagrams are relative to the Example
 NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
 test.
 
 LIBNDCTL: Context
-Every api call in the LIBNDCTL library requires a context that holds the
+Every API call in the LIBNDCTL library requires a context that holds the
 logging parameters and other library instance state. The library is
 based on the libabc template:
-https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
 
 LIBNDCTL: instantiate a new library context example
 
···
 LIBNVDIMM/LIBNDCTL: Region
 ----------------------
 
-A generic REGION device is registered for each PMEM range orBLK-aperture
+A generic REGION device is registered for each PMEM range or BLK-aperture
 set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
 sets on the "nfit_test.0" bus. The primary role of regions are to be a
 container of "mappings". A mapping is a tuple of <DIMM,
···
 types that we should simply name REGION devices with something derived
 from those type names. However, the ND subsystem explicitly keeps the
 REGION name generic and expects userspace to always consider the
-region-attributes for 4 reasons:
+region-attributes for four reasons:
 
 1. There are already more than two REGION and "namespace" types. For
 PMEM there are two subtypes. As mentioned previously we have PMEM where
···
 
 Why the Term "namespace"?
 
-1. Why not "volume" for instance? "volume" ran the risk of confusing ND
-as a volume manager like device-mapper.
+1. Why not "volume" for instance? "volume" ran the risk of confusing
+ND (libnvdimm subsystem) to a volume manager like device-mapper.
 
 2. The term originated to describe the sub-devices that can be created
 within a NVME controller (see the nvme specification:
···
 needs to be written in raw mode. By default, the kernel will autodetect
 the presence of a BTT and disable raw mode. This autodetect behavior
 can be suppressed by enabling raw mode for the namespace via the
-ndctl_namespace_set_raw_mode() api.
+ndctl_namespace_set_raw_mode() API.
 
 
 Summary LIBNDCTL Diagram
 ------------------------
 
-For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+For the given example above, here is the view of the objects as seen by the
+LIBNDCTL API:
 +---+
 |CTX|    +---------+   +--------------+  +---------------+
 +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |