
nd_btt: atomic sector updates

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completion at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We compared two lane-lock strategies. In the
first, we kept an atomic counter that tracked the last lane used, and
'our' lane was determined by atomically incrementing it; that way, for
the nr_cpus > nr_lanes case, theoretically, no CPU would be blocked
waiting for a lane. The other strategy was to take the cpu number we're
scheduled on and hash it to a lane number. Theoretically, this could
block an IO that could've otherwise run using a different, free lane.
But some fio workloads showed that the direct cpu -> lane hash performed
faster than tracking 'last lane' - my reasoning is that the cache
thrashing caused by bouncing the atomic variable between CPUs made that
approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.
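
As a rough illustration, a condensed sketch of the winning cpu -> lane
scheme (names here are assumed; the real helpers are
nd_region_acquire_lane()/nd_region_release_lane(), backed by the per-cpu
struct nd_percpu_lane added to nd.h in this series):

	/*
	 * Sketch only: hash the current cpu to a lane, and serialize
	 * sharers with a per-lane spinlock when lanes are scarce.
	 */
	static unsigned int example_acquire_lane(struct nd_region *nd_region)
	{
		unsigned int cpu = get_cpu();
		unsigned int lane = cpu % nd_region->num_lanes;

		if (nd_region->num_lanes < nr_cpu_ids)
			spin_lock(&per_cpu_ptr(nd_region->lane, lane)->lock);
		return lane;
	}

	static void example_release_lane(struct nd_region *nd_region,
			unsigned int lane)
	{
		if (nd_region->num_lanes < nr_cpu_ids)
			spin_unlock(&per_cpu_ptr(nd_region->lane, lane)->lock);
		put_cpu();
	}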

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Authored by Vishal Verma; committed by Dan Williams
5212e11f 8c2f7e86

+1950 -25
+273
Documentation/nvdimm/btt.txt

BTT - Block Translation Table
=============================


1. Introduction
---------------

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
----------------

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout:


  Backing Store     +------->  Arena
+---------------+   |   +------------------+
|               |   |   | Arena info block |
|    Arena 0    +---+   |       4K         |
|     512G      |       +------------------+
|               |       |                  |
+---------------+       |                  |
|               |       |                  |
|    Arena 1    |       |   Data Blocks    |
|     512G      |       |                  |
|               |       |                  |
+---------------+       |                  |
|       .       |       |                  |
|       .       |       |                  |
|       .       |       |                  |
|               |       |                  |
|               |       |                  |
+---------------+       +------------------+
                        |                  |
                        |     BTT Map      |
                        |                  |
                        |                  |
                        +------------------+
                        |                  |
                        |     BTT Flog     |
                        |                  |
                        +------------------+
                        | Info block copy  |
                        |       4K         |
                        +------------------+


3. Theory of Operation
----------------------


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.

Bit      Description
31     : TRIM flag - marks if the block was trimmed or discarded
30     : ERROR flag - marks an error block. Cleared on write.
29 - 0 : Mappings to internal 'postmap' blocks


Some of the terminology that will be subsequently used:

External LBA  : LBA as made visible to upper layers.
ABA           : Arena Block Address - Block offset/number within an arena
Premap ABA    : The block offset into an arena, which was decided upon by range
                checking the External LBA
Postmap ABA   : The block number in the "Data Blocks" area obtained after
                indirection from the map
nfree         : The number of free blocks that are maintained at any given time.
                This is the number of concurrent writes that can happen to the
                arena.


For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find out the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
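
As a worked sketch of the entry format (pseudo-C; the constants this patch
defines for these bits in btt.h are MAP_TRIM_SHIFT, MAP_ERR_SHIFT and
MAP_LBA_MASK):

	u32 ent = le32_to_cpu(raw_map_entry);
	int trim = !!(ent & (1U << 31));     /* bit 31: TRIM */
	int error = !!(ent & (1U << 30));    /* bit 30: ERROR */
	u32 postmap = ent & ~(3U << 30);     /* bits 29-0: postmap block */

	/*
	 * Both flag bits set marks a 'normal' mapped entry; an all-zero
	 * entry is the initial state and is treated as an identity
	 * (premap == postmap) mapping.
	 */
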

b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

lba     : The premap ABA that is being written to
old_map : The old postmap ABA - after 'this' write completes, this will be a
          free block.
new_map : The new postmap ABA. The map will be updated to reflect this
          lba->postmap_aba mapping, but we log it here in case we have to
          recover.
seq     : Sequence number to mark which of the 2 sections of this flog entry is
          valid/newest. It cycles between 01->10->11->01 (binary) under normal
          operation, with 00 indicating an uninitialized state.
lba'    : alternate lba entry
old_map': alternate old postmap entry
new_map': alternate new postmap entry
seq'    : alternate sequence number.

Each of the above fields is 32-bit, making one entry 16 bytes. Flog updates are
done such that for any entry being written, it:
a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the new entry such that the sequence number is written last.
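
The sequence cycle is what the nd_inc_seq() helper in nd.h encodes with a
small lookup table; conceptually:

	/* 01 -> 10 -> 11 -> 01, with 00 (uninitialized) mapping to itself */
	static const unsigned next[] = { 0, 2, 3, 1 };

	static inline unsigned inc_seq(unsigned seq)
	{
		return next[seq & 3];
	}
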

c. The concept of lanes
-----------------------

While 'nfree' describes the number of IOs an arena can process concurrently,
'nlanes' is the number of IOs the BTT device as a whole can process.
	nlanes = min(nfree, num_cpus)
A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. It is
protected by a spinlock.


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number], the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps:

	free[lane] = map[premap_aba]
	map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
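
Note that the implementation in btt.c below (lock_map()) actually derives the
lock index from the map entry's cacheline rather than the raw premap ABA, so
map entries that share a cacheline also share a lock:

	static void lock_map(struct arena_info *arena, u32 premap)
	{
		u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree;

		spin_lock(&arena->map_locks[idx].lock);
	}
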

f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:
- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1.  Convert external LBA to arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Read map to get the entry for this pre-map ABA
4.  Enter post-map ABA into RTT[lane]
5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6.  If ERROR flag set in map, end IO with EIO (go to step 8)
7.  Read data from this block
8.  Remove post-map ABA entry from RTT[lane]
9.  Release lane (and lane_lock)

Write:

1.  Convert external LBA to Arena number + pre-map ABA
2.  Get a lane (and take lane_lock)
3.  Use lane to index into in-memory free list and obtain a new block, next
    flog index, next sequence number
4.  Scan the RTT to check if free block is present, and spin/wait if it is.
5.  Write data to this free block
6.  Read map to get the existing post-map ABA entry for this pre-map ABA
7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8.  Write new post-map ABA into map.
9.  Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)


4. Error Handling
-----------------

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:
- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
  sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions are encountered, the arena is put into a read
only state using a flag in the info block.


5. In-kernel usage
------------------

Any block driver that supports byte granularity IO to the storage may register
with the BTT. It will have to provide the rw_bytes interface in its
block_device_operations struct:

	int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);

It may register with the BTT after it adds its own gendisk, using btt_init:

	struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
			u32 lbasize, u8 uuid[], int maxlane);

Note that maxlane is the maximum amount of concurrency the driver wishes to
allow the BTT to use.

The BTT 'disk' appears as a stacked block device that grabs the underlying block
device in O_EXCL mode.

When the driver wishes to remove the backing disk, it should similarly call
btt_fini using the same struct btt* handle that was provided to it by btt_init.

	void btt_fini(struct btt *btt);
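
A hedged sketch of that registration flow, using only the prototypes quoted
above (note that in the code as merged in this patch, btt_init() is actually
static and is driven through nvdimm_namespace_attach_btt(); this section
describes the generic interface):

	/* hypothetical backing driver: attach a BTT after add_disk() */
	static struct btt *example_attach_btt(struct gendisk *disk,
			unsigned long long rawsize, u8 uuid[16], int maxlane)
	{
		/* 512 byte external sectors assumed for this example */
		return btt_init(disk, rawsize, 512, uuid, maxlane);
	}

	/* ... and on removal of the backing disk: btt_fini(btt); */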
+1
drivers/acpi/nfit.c
··· 902 902 } else { 903 903 nd_mapping->size = nfit_mem->bdw->capacity; 904 904 nd_mapping->start = nfit_mem->bdw->start_address; 905 + ndr_desc->num_lanes = nfit_mem->bdw->windows; 905 906 blk_valid = 1; 906 907 } 907 908
+22 -6
drivers/nvdimm/Kconfig
··· 8 8 NFIT, or otherwise can discover NVDIMM resources, a libnvdimm 9 9 bus is registered to advertise PMEM (persistent memory) 10 10 namespaces (/dev/pmemX) and BLK (sliding mmio window(s)) 11 - namespaces (/dev/ndX). A PMEM namespace refers to a memory 12 - resource that may span multiple DIMMs and support DAX (see 13 - CONFIG_DAX). A BLK namespace refers to an NVDIMM control 14 - region which exposes an mmio register set for windowed 15 - access mode to non-volatile memory. 11 + namespaces (/dev/ndblkX.Y). A PMEM namespace refers to a 12 + memory resource that may span multiple DIMMs and support DAX 13 + (see CONFIG_DAX). A BLK namespace refers to an NVDIMM control 14 + region which exposes an mmio register set for windowed access 15 + mode to non-volatile memory. 16 16 17 17 if LIBNVDIMM 18 18 ··· 20 20 tristate "PMEM: Persistent memory block device support" 21 21 default LIBNVDIMM 22 22 depends on HAS_IOMEM 23 + select ND_BTT if BTT 23 24 help 24 25 Memory ranges for PMEM are described by either an NFIT 25 26 (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a ··· 34 33 35 34 Say Y if you want to use an NVDIMM 36 35 36 + config ND_BTT 37 + tristate 38 + 37 39 config BTT 38 - def_bool y 40 + bool "BTT: Block Translation Table (atomic sector updates)" 41 + default y if LIBNVDIMM 42 + help 43 + The Block Translation Table (BTT) provides atomic sector 44 + update semantics for persistent memory devices, so that 45 + applications that rely on sector writes not being torn (a 46 + guarantee that typical disks provide) can continue to do so. 47 + The BTT manifests itself as an alternate personality for an 48 + NVDIMM namespace, i.e. a namespace can be in raw mode (pmemX, 49 + ndblkX.Y, etc...), or 'sectored' mode, (pmemXs, ndblkX.Ys, 50 + etc...). 51 + 52 + Select Y if unsure 39 53 40 54 endif
+3
drivers/nvdimm/Makefile
··· 1 1 obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o 2 2 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o 3 + obj-$(CONFIG_ND_BTT) += nd_btt.o 3 4 4 5 nd_pmem-y := pmem.o 6 + 7 + nd_btt-y := btt.o 5 8 6 9 libnvdimm-y := core.o 7 10 libnvdimm-y += bus.o
+1371
drivers/nvdimm/btt.c
··· 1 + /* 2 + * Block Translation Table 3 + * Copyright (c) 2014-2015, Intel Corporation. 4 + * 5 + * This program is free software; you can redistribute it and/or modify it 6 + * under the terms and conditions of the GNU General Public License, 7 + * version 2, as published by the Free Software Foundation. 8 + * 9 + * This program is distributed in the hope it will be useful, but WITHOUT 10 + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 11 + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 12 + * more details. 13 + */ 14 + #include <linux/highmem.h> 15 + #include <linux/debugfs.h> 16 + #include <linux/blkdev.h> 17 + #include <linux/module.h> 18 + #include <linux/device.h> 19 + #include <linux/mutex.h> 20 + #include <linux/hdreg.h> 21 + #include <linux/genhd.h> 22 + #include <linux/sizes.h> 23 + #include <linux/ndctl.h> 24 + #include <linux/fs.h> 25 + #include <linux/nd.h> 26 + #include "btt.h" 27 + #include "nd.h" 28 + 29 + enum log_ent_request { 30 + LOG_NEW_ENT = 0, 31 + LOG_OLD_ENT 32 + }; 33 + 34 + static int btt_major; 35 + 36 + static int arena_read_bytes(struct arena_info *arena, resource_size_t offset, 37 + void *buf, size_t n) 38 + { 39 + struct nd_btt *nd_btt = arena->nd_btt; 40 + struct nd_namespace_common *ndns = nd_btt->ndns; 41 + 42 + /* arena offsets are 4K from the base of the device */ 43 + offset += SZ_4K; 44 + return nvdimm_read_bytes(ndns, offset, buf, n); 45 + } 46 + 47 + static int arena_write_bytes(struct arena_info *arena, resource_size_t offset, 48 + void *buf, size_t n) 49 + { 50 + struct nd_btt *nd_btt = arena->nd_btt; 51 + struct nd_namespace_common *ndns = nd_btt->ndns; 52 + 53 + /* arena offsets are 4K from the base of the device */ 54 + offset += SZ_4K; 55 + return nvdimm_write_bytes(ndns, offset, buf, n); 56 + } 57 + 58 + static int btt_info_write(struct arena_info *arena, struct btt_sb *super) 59 + { 60 + int ret; 61 + 62 + ret = arena_write_bytes(arena, arena->info2off, super, 63 + sizeof(struct btt_sb)); 64 + if (ret) 65 + return ret; 66 + 67 + return arena_write_bytes(arena, arena->infooff, super, 68 + sizeof(struct btt_sb)); 69 + } 70 + 71 + static int btt_info_read(struct arena_info *arena, struct btt_sb *super) 72 + { 73 + WARN_ON(!super); 74 + return arena_read_bytes(arena, arena->infooff, super, 75 + sizeof(struct btt_sb)); 76 + } 77 + 78 + /* 79 + * 'raw' version of btt_map write 80 + * Assumptions: 81 + * mapping is in little-endian 82 + * mapping contains 'E' and 'Z' flags as desired 83 + */ 84 + static int __btt_map_write(struct arena_info *arena, u32 lba, __le32 mapping) 85 + { 86 + u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE); 87 + 88 + WARN_ON(lba >= arena->external_nlba); 89 + return arena_write_bytes(arena, ns_off, &mapping, MAP_ENT_SIZE); 90 + } 91 + 92 + static int btt_map_write(struct arena_info *arena, u32 lba, u32 mapping, 93 + u32 z_flag, u32 e_flag) 94 + { 95 + u32 ze; 96 + __le32 mapping_le; 97 + 98 + /* 99 + * This 'mapping' is supposed to be just the LBA mapping, without 100 + * any flags set, so strip the flag bits. 
101 + */ 102 + mapping &= MAP_LBA_MASK; 103 + 104 + ze = (z_flag << 1) + e_flag; 105 + switch (ze) { 106 + case 0: 107 + /* 108 + * We want to set neither of the Z or E flags, and 109 + * in the actual layout, this means setting the bit 110 + * positions of both to '1' to indicate a 'normal' 111 + * map entry 112 + */ 113 + mapping |= MAP_ENT_NORMAL; 114 + break; 115 + case 1: 116 + mapping |= (1 << MAP_ERR_SHIFT); 117 + break; 118 + case 2: 119 + mapping |= (1 << MAP_TRIM_SHIFT); 120 + break; 121 + default: 122 + /* 123 + * The case where Z and E are both sent in as '1' could be 124 + * construed as a valid 'normal' case, but we decide not to, 125 + * to avoid confusion 126 + */ 127 + WARN_ONCE(1, "Invalid use of Z and E flags\n"); 128 + return -EIO; 129 + } 130 + 131 + mapping_le = cpu_to_le32(mapping); 132 + return __btt_map_write(arena, lba, mapping_le); 133 + } 134 + 135 + static int btt_map_read(struct arena_info *arena, u32 lba, u32 *mapping, 136 + int *trim, int *error) 137 + { 138 + int ret; 139 + __le32 in; 140 + u32 raw_mapping, postmap, ze, z_flag, e_flag; 141 + u64 ns_off = arena->mapoff + (lba * MAP_ENT_SIZE); 142 + 143 + WARN_ON(lba >= arena->external_nlba); 144 + 145 + ret = arena_read_bytes(arena, ns_off, &in, MAP_ENT_SIZE); 146 + if (ret) 147 + return ret; 148 + 149 + raw_mapping = le32_to_cpu(in); 150 + 151 + z_flag = (raw_mapping & MAP_TRIM_MASK) >> MAP_TRIM_SHIFT; 152 + e_flag = (raw_mapping & MAP_ERR_MASK) >> MAP_ERR_SHIFT; 153 + ze = (z_flag << 1) + e_flag; 154 + postmap = raw_mapping & MAP_LBA_MASK; 155 + 156 + /* Reuse the {z,e}_flag variables for *trim and *error */ 157 + z_flag = 0; 158 + e_flag = 0; 159 + 160 + switch (ze) { 161 + case 0: 162 + /* Initial state. Return postmap = premap */ 163 + *mapping = lba; 164 + break; 165 + case 1: 166 + *mapping = postmap; 167 + e_flag = 1; 168 + break; 169 + case 2: 170 + *mapping = postmap; 171 + z_flag = 1; 172 + break; 173 + case 3: 174 + *mapping = postmap; 175 + break; 176 + default: 177 + return -EIO; 178 + } 179 + 180 + if (trim) 181 + *trim = z_flag; 182 + if (error) 183 + *error = e_flag; 184 + 185 + return ret; 186 + } 187 + 188 + static int btt_log_read_pair(struct arena_info *arena, u32 lane, 189 + struct log_entry *ent) 190 + { 191 + WARN_ON(!ent); 192 + return arena_read_bytes(arena, 193 + arena->logoff + (2 * lane * LOG_ENT_SIZE), ent, 194 + 2 * LOG_ENT_SIZE); 195 + } 196 + 197 + static struct dentry *debugfs_root; 198 + 199 + static void arena_debugfs_init(struct arena_info *a, struct dentry *parent, 200 + int idx) 201 + { 202 + char dirname[32]; 203 + struct dentry *d; 204 + 205 + /* If for some reason, parent bttN was not created, exit */ 206 + if (!parent) 207 + return; 208 + 209 + snprintf(dirname, 32, "arena%d", idx); 210 + d = debugfs_create_dir(dirname, parent); 211 + if (IS_ERR_OR_NULL(d)) 212 + return; 213 + a->debugfs_dir = d; 214 + 215 + debugfs_create_x64("size", S_IRUGO, d, &a->size); 216 + debugfs_create_x64("external_lba_start", S_IRUGO, d, 217 + &a->external_lba_start); 218 + debugfs_create_x32("internal_nlba", S_IRUGO, d, &a->internal_nlba); 219 + debugfs_create_u32("internal_lbasize", S_IRUGO, d, 220 + &a->internal_lbasize); 221 + debugfs_create_x32("external_nlba", S_IRUGO, d, &a->external_nlba); 222 + debugfs_create_u32("external_lbasize", S_IRUGO, d, 223 + &a->external_lbasize); 224 + debugfs_create_u32("nfree", S_IRUGO, d, &a->nfree); 225 + debugfs_create_u16("version_major", S_IRUGO, d, &a->version_major); 226 + debugfs_create_u16("version_minor", S_IRUGO, d, &a->version_minor); 227 
+ debugfs_create_x64("nextoff", S_IRUGO, d, &a->nextoff); 228 + debugfs_create_x64("infooff", S_IRUGO, d, &a->infooff); 229 + debugfs_create_x64("dataoff", S_IRUGO, d, &a->dataoff); 230 + debugfs_create_x64("mapoff", S_IRUGO, d, &a->mapoff); 231 + debugfs_create_x64("logoff", S_IRUGO, d, &a->logoff); 232 + debugfs_create_x64("info2off", S_IRUGO, d, &a->info2off); 233 + debugfs_create_x32("flags", S_IRUGO, d, &a->flags); 234 + } 235 + 236 + static void btt_debugfs_init(struct btt *btt) 237 + { 238 + int i = 0; 239 + struct arena_info *arena; 240 + 241 + btt->debugfs_dir = debugfs_create_dir(dev_name(&btt->nd_btt->dev), 242 + debugfs_root); 243 + if (IS_ERR_OR_NULL(btt->debugfs_dir)) 244 + return; 245 + 246 + list_for_each_entry(arena, &btt->arena_list, list) { 247 + arena_debugfs_init(arena, btt->debugfs_dir, i); 248 + i++; 249 + } 250 + } 251 + 252 + /* 253 + * This function accepts two log entries, and uses the 254 + * sequence number to find the 'older' entry. 255 + * It also updates the sequence number in this old entry to 256 + * make it the 'new' one if the mark_flag is set. 257 + * Finally, it returns which of the entries was the older one. 258 + * 259 + * TODO The logic feels a bit kludge-y. make it better.. 260 + */ 261 + static int btt_log_get_old(struct log_entry *ent) 262 + { 263 + int old; 264 + 265 + /* 266 + * the first ever time this is seen, the entry goes into [0] 267 + * the next time, the following logic works out to put this 268 + * (next) entry into [1] 269 + */ 270 + if (ent[0].seq == 0) { 271 + ent[0].seq = cpu_to_le32(1); 272 + return 0; 273 + } 274 + 275 + if (ent[0].seq == ent[1].seq) 276 + return -EINVAL; 277 + if (le32_to_cpu(ent[0].seq) + le32_to_cpu(ent[1].seq) > 5) 278 + return -EINVAL; 279 + 280 + if (le32_to_cpu(ent[0].seq) < le32_to_cpu(ent[1].seq)) { 281 + if (le32_to_cpu(ent[1].seq) - le32_to_cpu(ent[0].seq) == 1) 282 + old = 0; 283 + else 284 + old = 1; 285 + } else { 286 + if (le32_to_cpu(ent[0].seq) - le32_to_cpu(ent[1].seq) == 1) 287 + old = 1; 288 + else 289 + old = 0; 290 + } 291 + 292 + return old; 293 + } 294 + 295 + static struct device *to_dev(struct arena_info *arena) 296 + { 297 + return &arena->nd_btt->dev; 298 + } 299 + 300 + /* 301 + * This function copies the desired (old/new) log entry into ent if 302 + * it is not NULL. It returns the sub-slot number (0 or 1) 303 + * where the desired log entry was found. Negative return values 304 + * indicate errors. 305 + */ 306 + static int btt_log_read(struct arena_info *arena, u32 lane, 307 + struct log_entry *ent, int old_flag) 308 + { 309 + int ret; 310 + int old_ent, ret_ent; 311 + struct log_entry log[2]; 312 + 313 + ret = btt_log_read_pair(arena, lane, log); 314 + if (ret) 315 + return -EIO; 316 + 317 + old_ent = btt_log_get_old(log); 318 + if (old_ent < 0 || old_ent > 1) { 319 + dev_info(to_dev(arena), 320 + "log corruption (%d): lane %d seq [%d, %d]\n", 321 + old_ent, lane, log[0].seq, log[1].seq); 322 + /* TODO set error state? */ 323 + return -EIO; 324 + } 325 + 326 + ret_ent = (old_flag ? 
old_ent : (1 - old_ent)); 327 + 328 + if (ent != NULL) 329 + memcpy(ent, &log[ret_ent], LOG_ENT_SIZE); 330 + 331 + return ret_ent; 332 + } 333 + 334 + /* 335 + * This function commits a log entry to media 336 + * It does _not_ prepare the freelist entry for the next write 337 + * btt_flog_write is the wrapper for updating the freelist elements 338 + */ 339 + static int __btt_log_write(struct arena_info *arena, u32 lane, 340 + u32 sub, struct log_entry *ent) 341 + { 342 + int ret; 343 + /* 344 + * Ignore the padding in log_entry for calculating log_half. 345 + * The entry is 'committed' when we write the sequence number, 346 + * and we want to ensure that that is the last thing written. 347 + * We don't bother writing the padding as that would be extra 348 + * media wear and write amplification 349 + */ 350 + unsigned int log_half = (LOG_ENT_SIZE - 2 * sizeof(u64)) / 2; 351 + u64 ns_off = arena->logoff + (((2 * lane) + sub) * LOG_ENT_SIZE); 352 + void *src = ent; 353 + 354 + /* split the 16B write into atomic, durable halves */ 355 + ret = arena_write_bytes(arena, ns_off, src, log_half); 356 + if (ret) 357 + return ret; 358 + 359 + ns_off += log_half; 360 + src += log_half; 361 + return arena_write_bytes(arena, ns_off, src, log_half); 362 + } 363 + 364 + static int btt_flog_write(struct arena_info *arena, u32 lane, u32 sub, 365 + struct log_entry *ent) 366 + { 367 + int ret; 368 + 369 + ret = __btt_log_write(arena, lane, sub, ent); 370 + if (ret) 371 + return ret; 372 + 373 + /* prepare the next free entry */ 374 + arena->freelist[lane].sub = 1 - arena->freelist[lane].sub; 375 + if (++(arena->freelist[lane].seq) == 4) 376 + arena->freelist[lane].seq = 1; 377 + arena->freelist[lane].block = le32_to_cpu(ent->old_map); 378 + 379 + return ret; 380 + } 381 + 382 + /* 383 + * This function initializes the BTT map to the initial state, which is 384 + * all-zeroes, and indicates an identity mapping 385 + */ 386 + static int btt_map_init(struct arena_info *arena) 387 + { 388 + int ret = -EINVAL; 389 + void *zerobuf; 390 + size_t offset = 0; 391 + size_t chunk_size = SZ_2M; 392 + size_t mapsize = arena->logoff - arena->mapoff; 393 + 394 + zerobuf = kzalloc(chunk_size, GFP_KERNEL); 395 + if (!zerobuf) 396 + return -ENOMEM; 397 + 398 + while (mapsize) { 399 + size_t size = min(mapsize, chunk_size); 400 + 401 + ret = arena_write_bytes(arena, arena->mapoff + offset, zerobuf, 402 + size); 403 + if (ret) 404 + goto free; 405 + 406 + offset += size; 407 + mapsize -= size; 408 + cond_resched(); 409 + } 410 + 411 + free: 412 + kfree(zerobuf); 413 + return ret; 414 + } 415 + 416 + /* 417 + * This function initializes the BTT log with 'fake' entries pointing 418 + * to the initial reserved set of blocks as being free 419 + */ 420 + static int btt_log_init(struct arena_info *arena) 421 + { 422 + int ret; 423 + u32 i; 424 + struct log_entry log, zerolog; 425 + 426 + memset(&zerolog, 0, sizeof(zerolog)); 427 + 428 + for (i = 0; i < arena->nfree; i++) { 429 + log.lba = cpu_to_le32(i); 430 + log.old_map = cpu_to_le32(arena->external_nlba + i); 431 + log.new_map = cpu_to_le32(arena->external_nlba + i); 432 + log.seq = cpu_to_le32(LOG_SEQ_INIT); 433 + ret = __btt_log_write(arena, i, 0, &log); 434 + if (ret) 435 + return ret; 436 + ret = __btt_log_write(arena, i, 1, &zerolog); 437 + if (ret) 438 + return ret; 439 + } 440 + 441 + return 0; 442 + } 443 + 444 + static int btt_freelist_init(struct arena_info *arena) 445 + { 446 + int old, new, ret; 447 + u32 i, map_entry; 448 + struct log_entry log_new, log_old; 449 + 450 
+ arena->freelist = kcalloc(arena->nfree, sizeof(struct free_entry), 451 + GFP_KERNEL); 452 + if (!arena->freelist) 453 + return -ENOMEM; 454 + 455 + for (i = 0; i < arena->nfree; i++) { 456 + old = btt_log_read(arena, i, &log_old, LOG_OLD_ENT); 457 + if (old < 0) 458 + return old; 459 + 460 + new = btt_log_read(arena, i, &log_new, LOG_NEW_ENT); 461 + if (new < 0) 462 + return new; 463 + 464 + /* sub points to the next one to be overwritten */ 465 + arena->freelist[i].sub = 1 - new; 466 + arena->freelist[i].seq = nd_inc_seq(le32_to_cpu(log_new.seq)); 467 + arena->freelist[i].block = le32_to_cpu(log_new.old_map); 468 + 469 + /* This implies a newly created or untouched flog entry */ 470 + if (log_new.old_map == log_new.new_map) 471 + continue; 472 + 473 + /* Check if map recovery is needed */ 474 + ret = btt_map_read(arena, le32_to_cpu(log_new.lba), &map_entry, 475 + NULL, NULL); 476 + if (ret) 477 + return ret; 478 + if ((le32_to_cpu(log_new.new_map) != map_entry) && 479 + (le32_to_cpu(log_new.old_map) == map_entry)) { 480 + /* 481 + * Last transaction wrote the flog, but wasn't able 482 + * to complete the map write. So fix up the map. 483 + */ 484 + ret = btt_map_write(arena, le32_to_cpu(log_new.lba), 485 + le32_to_cpu(log_new.new_map), 0, 0); 486 + if (ret) 487 + return ret; 488 + } 489 + 490 + } 491 + 492 + return 0; 493 + } 494 + 495 + static int btt_rtt_init(struct arena_info *arena) 496 + { 497 + arena->rtt = kcalloc(arena->nfree, sizeof(u32), GFP_KERNEL); 498 + if (arena->rtt == NULL) 499 + return -ENOMEM; 500 + 501 + return 0; 502 + } 503 + 504 + static int btt_maplocks_init(struct arena_info *arena) 505 + { 506 + u32 i; 507 + 508 + arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock), 509 + GFP_KERNEL); 510 + if (!arena->map_locks) 511 + return -ENOMEM; 512 + 513 + for (i = 0; i < arena->nfree; i++) 514 + spin_lock_init(&arena->map_locks[i].lock); 515 + 516 + return 0; 517 + } 518 + 519 + static struct arena_info *alloc_arena(struct btt *btt, size_t size, 520 + size_t start, size_t arena_off) 521 + { 522 + struct arena_info *arena; 523 + u64 logsize, mapsize, datasize; 524 + u64 available = size; 525 + 526 + arena = kzalloc(sizeof(struct arena_info), GFP_KERNEL); 527 + if (!arena) 528 + return NULL; 529 + arena->nd_btt = btt->nd_btt; 530 + 531 + if (!size) 532 + return arena; 533 + 534 + arena->size = size; 535 + arena->external_lba_start = start; 536 + arena->external_lbasize = btt->lbasize; 537 + arena->internal_lbasize = roundup(arena->external_lbasize, 538 + INT_LBASIZE_ALIGNMENT); 539 + arena->nfree = BTT_DEFAULT_NFREE; 540 + arena->version_major = 1; 541 + arena->version_minor = 1; 542 + 543 + if (available % BTT_PG_SIZE) 544 + available -= (available % BTT_PG_SIZE); 545 + 546 + /* Two pages are reserved for the super block and its copy */ 547 + available -= 2 * BTT_PG_SIZE; 548 + 549 + /* The log takes a fixed amount of space based on nfree */ 550 + logsize = roundup(2 * arena->nfree * sizeof(struct log_entry), 551 + BTT_PG_SIZE); 552 + available -= logsize; 553 + 554 + /* Calculate optimal split between map and data area */ 555 + arena->internal_nlba = div_u64(available - BTT_PG_SIZE, 556 + arena->internal_lbasize + MAP_ENT_SIZE); 557 + arena->external_nlba = arena->internal_nlba - arena->nfree; 558 + 559 + mapsize = roundup((arena->external_nlba * MAP_ENT_SIZE), BTT_PG_SIZE); 560 + datasize = available - mapsize; 561 + 562 + /* 'Absolute' values, relative to start of storage space */ 563 + arena->infooff = arena_off; 564 + arena->dataoff = arena->infooff 
+ BTT_PG_SIZE; 565 + arena->mapoff = arena->dataoff + datasize; 566 + arena->logoff = arena->mapoff + mapsize; 567 + arena->info2off = arena->logoff + logsize; 568 + return arena; 569 + } 570 + 571 + static void free_arenas(struct btt *btt) 572 + { 573 + struct arena_info *arena, *next; 574 + 575 + list_for_each_entry_safe(arena, next, &btt->arena_list, list) { 576 + list_del(&arena->list); 577 + kfree(arena->rtt); 578 + kfree(arena->map_locks); 579 + kfree(arena->freelist); 580 + debugfs_remove_recursive(arena->debugfs_dir); 581 + kfree(arena); 582 + } 583 + } 584 + 585 + /* 586 + * This function checks if the metadata layout is valid and error free 587 + */ 588 + static int arena_is_valid(struct arena_info *arena, struct btt_sb *super, 589 + u8 *uuid, u32 lbasize) 590 + { 591 + u64 checksum; 592 + 593 + if (memcmp(super->uuid, uuid, 16)) 594 + return 0; 595 + 596 + checksum = le64_to_cpu(super->checksum); 597 + super->checksum = 0; 598 + if (checksum != nd_btt_sb_checksum(super)) 599 + return 0; 600 + super->checksum = cpu_to_le64(checksum); 601 + 602 + if (lbasize != le32_to_cpu(super->external_lbasize)) 603 + return 0; 604 + 605 + /* TODO: figure out action for this */ 606 + if ((le32_to_cpu(super->flags) & IB_FLAG_ERROR_MASK) != 0) 607 + dev_info(to_dev(arena), "Found arena with an error flag\n"); 608 + 609 + return 1; 610 + } 611 + 612 + /* 613 + * This function reads an existing valid btt superblock and 614 + * populates the corresponding arena_info struct 615 + */ 616 + static void parse_arena_meta(struct arena_info *arena, struct btt_sb *super, 617 + u64 arena_off) 618 + { 619 + arena->internal_nlba = le32_to_cpu(super->internal_nlba); 620 + arena->internal_lbasize = le32_to_cpu(super->internal_lbasize); 621 + arena->external_nlba = le32_to_cpu(super->external_nlba); 622 + arena->external_lbasize = le32_to_cpu(super->external_lbasize); 623 + arena->nfree = le32_to_cpu(super->nfree); 624 + arena->version_major = le16_to_cpu(super->version_major); 625 + arena->version_minor = le16_to_cpu(super->version_minor); 626 + 627 + arena->nextoff = (super->nextoff == 0) ? 0 : (arena_off + 628 + le64_to_cpu(super->nextoff)); 629 + arena->infooff = arena_off; 630 + arena->dataoff = arena_off + le64_to_cpu(super->dataoff); 631 + arena->mapoff = arena_off + le64_to_cpu(super->mapoff); 632 + arena->logoff = arena_off + le64_to_cpu(super->logoff); 633 + arena->info2off = arena_off + le64_to_cpu(super->info2off); 634 + 635 + arena->size = (super->nextoff > 0) ? 
(le64_to_cpu(super->nextoff)) : 636 + (arena->info2off - arena->infooff + BTT_PG_SIZE); 637 + 638 + arena->flags = le32_to_cpu(super->flags); 639 + } 640 + 641 + static int discover_arenas(struct btt *btt) 642 + { 643 + int ret = 0; 644 + struct arena_info *arena; 645 + struct btt_sb *super; 646 + size_t remaining = btt->rawsize; 647 + u64 cur_nlba = 0; 648 + size_t cur_off = 0; 649 + int num_arenas = 0; 650 + 651 + super = kzalloc(sizeof(*super), GFP_KERNEL); 652 + if (!super) 653 + return -ENOMEM; 654 + 655 + while (remaining) { 656 + /* Alloc memory for arena */ 657 + arena = alloc_arena(btt, 0, 0, 0); 658 + if (!arena) { 659 + ret = -ENOMEM; 660 + goto out_super; 661 + } 662 + 663 + arena->infooff = cur_off; 664 + ret = btt_info_read(arena, super); 665 + if (ret) 666 + goto out; 667 + 668 + if (!arena_is_valid(arena, super, btt->nd_btt->uuid, 669 + btt->lbasize)) { 670 + if (remaining == btt->rawsize) { 671 + btt->init_state = INIT_NOTFOUND; 672 + dev_info(to_dev(arena), "No existing arenas\n"); 673 + goto out; 674 + } else { 675 + dev_info(to_dev(arena), 676 + "Found corrupted metadata!\n"); 677 + ret = -ENODEV; 678 + goto out; 679 + } 680 + } 681 + 682 + arena->external_lba_start = cur_nlba; 683 + parse_arena_meta(arena, super, cur_off); 684 + 685 + ret = btt_freelist_init(arena); 686 + if (ret) 687 + goto out; 688 + 689 + ret = btt_rtt_init(arena); 690 + if (ret) 691 + goto out; 692 + 693 + ret = btt_maplocks_init(arena); 694 + if (ret) 695 + goto out; 696 + 697 + list_add_tail(&arena->list, &btt->arena_list); 698 + 699 + remaining -= arena->size; 700 + cur_off += arena->size; 701 + cur_nlba += arena->external_nlba; 702 + num_arenas++; 703 + 704 + if (arena->nextoff == 0) 705 + break; 706 + } 707 + btt->num_arenas = num_arenas; 708 + btt->nlba = cur_nlba; 709 + btt->init_state = INIT_READY; 710 + 711 + kfree(super); 712 + return ret; 713 + 714 + out: 715 + kfree(arena); 716 + free_arenas(btt); 717 + out_super: 718 + kfree(super); 719 + return ret; 720 + } 721 + 722 + static int create_arenas(struct btt *btt) 723 + { 724 + size_t remaining = btt->rawsize; 725 + size_t cur_off = 0; 726 + 727 + while (remaining) { 728 + struct arena_info *arena; 729 + size_t arena_size = min_t(u64, ARENA_MAX_SIZE, remaining); 730 + 731 + remaining -= arena_size; 732 + if (arena_size < ARENA_MIN_SIZE) 733 + break; 734 + 735 + arena = alloc_arena(btt, arena_size, btt->nlba, cur_off); 736 + if (!arena) { 737 + free_arenas(btt); 738 + return -ENOMEM; 739 + } 740 + btt->nlba += arena->external_nlba; 741 + if (remaining >= ARENA_MIN_SIZE) 742 + arena->nextoff = arena->size; 743 + else 744 + arena->nextoff = 0; 745 + cur_off += arena_size; 746 + list_add_tail(&arena->list, &btt->arena_list); 747 + } 748 + 749 + return 0; 750 + } 751 + 752 + /* 753 + * This function completes arena initialization by writing 754 + * all the metadata. 755 + * It is only called for an uninitialized arena when a write 756 + * to that arena occurs for the first time. 
757 + */ 758 + static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid) 759 + { 760 + int ret; 761 + struct btt_sb *super; 762 + 763 + ret = btt_map_init(arena); 764 + if (ret) 765 + return ret; 766 + 767 + ret = btt_log_init(arena); 768 + if (ret) 769 + return ret; 770 + 771 + super = kzalloc(sizeof(struct btt_sb), GFP_NOIO); 772 + if (!super) 773 + return -ENOMEM; 774 + 775 + strncpy(super->signature, BTT_SIG, BTT_SIG_LEN); 776 + memcpy(super->uuid, uuid, 16); 777 + super->flags = cpu_to_le32(arena->flags); 778 + super->version_major = cpu_to_le16(arena->version_major); 779 + super->version_minor = cpu_to_le16(arena->version_minor); 780 + super->external_lbasize = cpu_to_le32(arena->external_lbasize); 781 + super->external_nlba = cpu_to_le32(arena->external_nlba); 782 + super->internal_lbasize = cpu_to_le32(arena->internal_lbasize); 783 + super->internal_nlba = cpu_to_le32(arena->internal_nlba); 784 + super->nfree = cpu_to_le32(arena->nfree); 785 + super->infosize = cpu_to_le32(sizeof(struct btt_sb)); 786 + super->nextoff = cpu_to_le64(arena->nextoff); 787 + /* 788 + * Subtract arena->infooff (arena start) so numbers are relative 789 + * to 'this' arena 790 + */ 791 + super->dataoff = cpu_to_le64(arena->dataoff - arena->infooff); 792 + super->mapoff = cpu_to_le64(arena->mapoff - arena->infooff); 793 + super->logoff = cpu_to_le64(arena->logoff - arena->infooff); 794 + super->info2off = cpu_to_le64(arena->info2off - arena->infooff); 795 + 796 + super->flags = 0; 797 + super->checksum = cpu_to_le64(nd_btt_sb_checksum(super)); 798 + 799 + ret = btt_info_write(arena, super); 800 + 801 + kfree(super); 802 + return ret; 803 + } 804 + 805 + /* 806 + * This function completes the initialization for the BTT namespace 807 + * such that it is ready to accept IOs 808 + */ 809 + static int btt_meta_init(struct btt *btt) 810 + { 811 + int ret = 0; 812 + struct arena_info *arena; 813 + 814 + mutex_lock(&btt->init_lock); 815 + list_for_each_entry(arena, &btt->arena_list, list) { 816 + ret = btt_arena_write_layout(arena, btt->nd_btt->uuid); 817 + if (ret) 818 + goto unlock; 819 + 820 + ret = btt_freelist_init(arena); 821 + if (ret) 822 + goto unlock; 823 + 824 + ret = btt_rtt_init(arena); 825 + if (ret) 826 + goto unlock; 827 + 828 + ret = btt_maplocks_init(arena); 829 + if (ret) 830 + goto unlock; 831 + } 832 + 833 + btt->init_state = INIT_READY; 834 + 835 + unlock: 836 + mutex_unlock(&btt->init_lock); 837 + return ret; 838 + } 839 + 840 + /* 841 + * This function calculates the arena in which the given LBA lies 842 + * by doing a linear walk. This is acceptable since we expect only 843 + * a few arenas. If we have backing devices that get much larger, 844 + * we can construct a balanced binary tree of arenas at init time 845 + * so that this range search becomes faster. 
846 + */ 847 + static int lba_to_arena(struct btt *btt, sector_t sector, __u32 *premap, 848 + struct arena_info **arena) 849 + { 850 + struct arena_info *arena_list; 851 + __u64 lba = div_u64(sector << SECTOR_SHIFT, btt->sector_size); 852 + 853 + list_for_each_entry(arena_list, &btt->arena_list, list) { 854 + if (lba < arena_list->external_nlba) { 855 + *arena = arena_list; 856 + *premap = lba; 857 + return 0; 858 + } 859 + lba -= arena_list->external_nlba; 860 + } 861 + 862 + return -EIO; 863 + } 864 + 865 + /* 866 + * The following (lock_map, unlock_map) are mostly just to improve 867 + * readability, since they index into an array of locks 868 + */ 869 + static void lock_map(struct arena_info *arena, u32 premap) 870 + __acquires(&arena->map_locks[idx].lock) 871 + { 872 + u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree; 873 + 874 + spin_lock(&arena->map_locks[idx].lock); 875 + } 876 + 877 + static void unlock_map(struct arena_info *arena, u32 premap) 878 + __releases(&arena->map_locks[idx].lock) 879 + { 880 + u32 idx = (premap * MAP_ENT_SIZE / L1_CACHE_BYTES) % arena->nfree; 881 + 882 + spin_unlock(&arena->map_locks[idx].lock); 883 + } 884 + 885 + static u64 to_namespace_offset(struct arena_info *arena, u64 lba) 886 + { 887 + return arena->dataoff + ((u64)lba * arena->internal_lbasize); 888 + } 889 + 890 + static int btt_data_read(struct arena_info *arena, struct page *page, 891 + unsigned int off, u32 lba, u32 len) 892 + { 893 + int ret; 894 + u64 nsoff = to_namespace_offset(arena, lba); 895 + void *mem = kmap_atomic(page); 896 + 897 + ret = arena_read_bytes(arena, nsoff, mem + off, len); 898 + kunmap_atomic(mem); 899 + 900 + return ret; 901 + } 902 + 903 + static int btt_data_write(struct arena_info *arena, u32 lba, 904 + struct page *page, unsigned int off, u32 len) 905 + { 906 + int ret; 907 + u64 nsoff = to_namespace_offset(arena, lba); 908 + void *mem = kmap_atomic(page); 909 + 910 + ret = arena_write_bytes(arena, nsoff, mem + off, len); 911 + kunmap_atomic(mem); 912 + 913 + return ret; 914 + } 915 + 916 + static void zero_fill_data(struct page *page, unsigned int off, u32 len) 917 + { 918 + void *mem = kmap_atomic(page); 919 + 920 + memset(mem + off, 0, len); 921 + kunmap_atomic(mem); 922 + } 923 + 924 + static int btt_read_pg(struct btt *btt, struct page *page, unsigned int off, 925 + sector_t sector, unsigned int len) 926 + { 927 + int ret = 0; 928 + int t_flag, e_flag; 929 + struct arena_info *arena = NULL; 930 + u32 lane = 0, premap, postmap; 931 + 932 + while (len) { 933 + u32 cur_len; 934 + 935 + lane = nd_region_acquire_lane(btt->nd_region); 936 + 937 + ret = lba_to_arena(btt, sector, &premap, &arena); 938 + if (ret) 939 + goto out_lane; 940 + 941 + cur_len = min(btt->sector_size, len); 942 + 943 + ret = btt_map_read(arena, premap, &postmap, &t_flag, &e_flag); 944 + if (ret) 945 + goto out_lane; 946 + 947 + /* 948 + * We loop to make sure that the post map LBA didn't change 949 + * from under us between writing the RTT and doing the actual 950 + * read. 
951 + */ 952 + while (1) { 953 + u32 new_map; 954 + 955 + if (t_flag) { 956 + zero_fill_data(page, off, cur_len); 957 + goto out_lane; 958 + } 959 + 960 + if (e_flag) { 961 + ret = -EIO; 962 + goto out_lane; 963 + } 964 + 965 + arena->rtt[lane] = RTT_VALID | postmap; 966 + /* 967 + * Barrier to make sure this write is not reordered 968 + * to do the verification map_read before the RTT store 969 + */ 970 + barrier(); 971 + 972 + ret = btt_map_read(arena, premap, &new_map, &t_flag, 973 + &e_flag); 974 + if (ret) 975 + goto out_rtt; 976 + 977 + if (postmap == new_map) 978 + break; 979 + 980 + postmap = new_map; 981 + } 982 + 983 + ret = btt_data_read(arena, page, off, postmap, cur_len); 984 + if (ret) 985 + goto out_rtt; 986 + 987 + arena->rtt[lane] = RTT_INVALID; 988 + nd_region_release_lane(btt->nd_region, lane); 989 + 990 + len -= cur_len; 991 + off += cur_len; 992 + sector += btt->sector_size >> SECTOR_SHIFT; 993 + } 994 + 995 + return 0; 996 + 997 + out_rtt: 998 + arena->rtt[lane] = RTT_INVALID; 999 + out_lane: 1000 + nd_region_release_lane(btt->nd_region, lane); 1001 + return ret; 1002 + } 1003 + 1004 + static int btt_write_pg(struct btt *btt, sector_t sector, struct page *page, 1005 + unsigned int off, unsigned int len) 1006 + { 1007 + int ret = 0; 1008 + struct arena_info *arena = NULL; 1009 + u32 premap = 0, old_postmap, new_postmap, lane = 0, i; 1010 + struct log_entry log; 1011 + int sub; 1012 + 1013 + while (len) { 1014 + u32 cur_len; 1015 + 1016 + lane = nd_region_acquire_lane(btt->nd_region); 1017 + 1018 + ret = lba_to_arena(btt, sector, &premap, &arena); 1019 + if (ret) 1020 + goto out_lane; 1021 + cur_len = min(btt->sector_size, len); 1022 + 1023 + if ((arena->flags & IB_FLAG_ERROR_MASK) != 0) { 1024 + ret = -EIO; 1025 + goto out_lane; 1026 + } 1027 + 1028 + new_postmap = arena->freelist[lane].block; 1029 + 1030 + /* Wait if the new block is being read from */ 1031 + for (i = 0; i < arena->nfree; i++) 1032 + while (arena->rtt[i] == (RTT_VALID | new_postmap)) 1033 + cpu_relax(); 1034 + 1035 + 1036 + if (new_postmap >= arena->internal_nlba) { 1037 + ret = -EIO; 1038 + goto out_lane; 1039 + } else 1040 + ret = btt_data_write(arena, new_postmap, page, 1041 + off, cur_len); 1042 + if (ret) 1043 + goto out_lane; 1044 + 1045 + lock_map(arena, premap); 1046 + ret = btt_map_read(arena, premap, &old_postmap, NULL, NULL); 1047 + if (ret) 1048 + goto out_map; 1049 + if (old_postmap >= arena->internal_nlba) { 1050 + ret = -EIO; 1051 + goto out_map; 1052 + } 1053 + 1054 + log.lba = cpu_to_le32(premap); 1055 + log.old_map = cpu_to_le32(old_postmap); 1056 + log.new_map = cpu_to_le32(new_postmap); 1057 + log.seq = cpu_to_le32(arena->freelist[lane].seq); 1058 + sub = arena->freelist[lane].sub; 1059 + ret = btt_flog_write(arena, lane, sub, &log); 1060 + if (ret) 1061 + goto out_map; 1062 + 1063 + ret = btt_map_write(arena, premap, new_postmap, 0, 0); 1064 + if (ret) 1065 + goto out_map; 1066 + 1067 + unlock_map(arena, premap); 1068 + nd_region_release_lane(btt->nd_region, lane); 1069 + 1070 + len -= cur_len; 1071 + off += cur_len; 1072 + sector += btt->sector_size >> SECTOR_SHIFT; 1073 + } 1074 + 1075 + return 0; 1076 + 1077 + out_map: 1078 + unlock_map(arena, premap); 1079 + out_lane: 1080 + nd_region_release_lane(btt->nd_region, lane); 1081 + return ret; 1082 + } 1083 + 1084 + static int btt_do_bvec(struct btt *btt, struct page *page, 1085 + unsigned int len, unsigned int off, int rw, 1086 + sector_t sector) 1087 + { 1088 + int ret; 1089 + 1090 + if (rw == READ) { 1091 + ret = 
btt_read_pg(btt, page, off, sector, len); 1092 + flush_dcache_page(page); 1093 + } else { 1094 + flush_dcache_page(page); 1095 + ret = btt_write_pg(btt, sector, page, off, len); 1096 + } 1097 + 1098 + return ret; 1099 + } 1100 + 1101 + static void btt_make_request(struct request_queue *q, struct bio *bio) 1102 + { 1103 + struct btt *btt = q->queuedata; 1104 + struct bvec_iter iter; 1105 + struct bio_vec bvec; 1106 + int err = 0, rw; 1107 + 1108 + rw = bio_data_dir(bio); 1109 + bio_for_each_segment(bvec, bio, iter) { 1110 + unsigned int len = bvec.bv_len; 1111 + 1112 + BUG_ON(len > PAGE_SIZE); 1113 + /* Make sure len is in multiples of sector size. */ 1114 + /* XXX is this right? */ 1115 + BUG_ON(len < btt->sector_size); 1116 + BUG_ON(len % btt->sector_size); 1117 + 1118 + err = btt_do_bvec(btt, bvec.bv_page, len, bvec.bv_offset, 1119 + rw, iter.bi_sector); 1120 + if (err) { 1121 + dev_info(&btt->nd_btt->dev, 1122 + "io error in %s sector %lld, len %d,\n", 1123 + (rw == READ) ? "READ" : "WRITE", 1124 + (unsigned long long) iter.bi_sector, len); 1125 + goto out; 1126 + } 1127 + } 1128 + 1129 + out: 1130 + bio_endio(bio, err); 1131 + } 1132 + 1133 + static int btt_rw_page(struct block_device *bdev, sector_t sector, 1134 + struct page *page, int rw) 1135 + { 1136 + struct btt *btt = bdev->bd_disk->private_data; 1137 + 1138 + btt_do_bvec(btt, page, PAGE_CACHE_SIZE, 0, rw, sector); 1139 + page_endio(page, rw & WRITE, 0); 1140 + return 0; 1141 + } 1142 + 1143 + 1144 + static int btt_getgeo(struct block_device *bd, struct hd_geometry *geo) 1145 + { 1146 + /* some standard values */ 1147 + geo->heads = 1 << 6; 1148 + geo->sectors = 1 << 5; 1149 + geo->cylinders = get_capacity(bd->bd_disk) >> 11; 1150 + return 0; 1151 + } 1152 + 1153 + static const struct block_device_operations btt_fops = { 1154 + .owner = THIS_MODULE, 1155 + .rw_page = btt_rw_page, 1156 + .getgeo = btt_getgeo, 1157 + }; 1158 + 1159 + static int btt_blk_init(struct btt *btt) 1160 + { 1161 + struct nd_btt *nd_btt = btt->nd_btt; 1162 + struct nd_namespace_common *ndns = nd_btt->ndns; 1163 + 1164 + /* create a new disk and request queue for btt */ 1165 + btt->btt_queue = blk_alloc_queue(GFP_KERNEL); 1166 + if (!btt->btt_queue) 1167 + return -ENOMEM; 1168 + 1169 + btt->btt_disk = alloc_disk(0); 1170 + if (!btt->btt_disk) { 1171 + blk_cleanup_queue(btt->btt_queue); 1172 + return -ENOMEM; 1173 + } 1174 + 1175 + nvdimm_namespace_disk_name(ndns, btt->btt_disk->disk_name); 1176 + btt->btt_disk->driverfs_dev = &btt->nd_btt->dev; 1177 + btt->btt_disk->major = btt_major; 1178 + btt->btt_disk->first_minor = 0; 1179 + btt->btt_disk->fops = &btt_fops; 1180 + btt->btt_disk->private_data = btt; 1181 + btt->btt_disk->queue = btt->btt_queue; 1182 + btt->btt_disk->flags = GENHD_FL_EXT_DEVT; 1183 + 1184 + blk_queue_make_request(btt->btt_queue, btt_make_request); 1185 + blk_queue_logical_block_size(btt->btt_queue, btt->sector_size); 1186 + blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX); 1187 + blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY); 1188 + queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue); 1189 + btt->btt_queue->queuedata = btt; 1190 + 1191 + set_capacity(btt->btt_disk, 1192 + btt->nlba * btt->sector_size >> SECTOR_SHIFT); 1193 + add_disk(btt->btt_disk); 1194 + 1195 + return 0; 1196 + } 1197 + 1198 + static void btt_blk_cleanup(struct btt *btt) 1199 + { 1200 + del_gendisk(btt->btt_disk); 1201 + put_disk(btt->btt_disk); 1202 + blk_cleanup_queue(btt->btt_queue); 1203 + } 1204 + 1205 + /** 1206 + * btt_init - initialize a block 
translation table for the given device 1207 + * @nd_btt: device with BTT geometry and backing device info 1208 + * @rawsize: raw size in bytes of the backing device 1209 + * @lbasize: lba size of the backing device 1210 + * @uuid: A uuid for the backing device - this is stored on media 1211 + * @maxlane: maximum number of parallel requests the device can handle 1212 + * 1213 + * Initialize a Block Translation Table on a backing device to provide 1214 + * single sector power fail atomicity. 1215 + * 1216 + * Context: 1217 + * Might sleep. 1218 + * 1219 + * Returns: 1220 + * Pointer to a new struct btt on success, NULL on failure. 1221 + */ 1222 + static struct btt *btt_init(struct nd_btt *nd_btt, unsigned long long rawsize, 1223 + u32 lbasize, u8 *uuid, struct nd_region *nd_region) 1224 + { 1225 + int ret; 1226 + struct btt *btt; 1227 + struct device *dev = &nd_btt->dev; 1228 + 1229 + btt = kzalloc(sizeof(struct btt), GFP_KERNEL); 1230 + if (!btt) 1231 + return NULL; 1232 + 1233 + btt->nd_btt = nd_btt; 1234 + btt->rawsize = rawsize; 1235 + btt->lbasize = lbasize; 1236 + btt->sector_size = ((lbasize >= 4096) ? 4096 : 512); 1237 + INIT_LIST_HEAD(&btt->arena_list); 1238 + mutex_init(&btt->init_lock); 1239 + btt->nd_region = nd_region; 1240 + 1241 + ret = discover_arenas(btt); 1242 + if (ret) { 1243 + dev_err(dev, "init: error in arena_discover: %d\n", ret); 1244 + goto out_free; 1245 + } 1246 + 1247 + if (btt->init_state != INIT_READY) { 1248 + btt->num_arenas = (rawsize / ARENA_MAX_SIZE) + 1249 + ((rawsize % ARENA_MAX_SIZE) ? 1 : 0); 1250 + dev_dbg(dev, "init: %d arenas for %llu rawsize\n", 1251 + btt->num_arenas, rawsize); 1252 + 1253 + ret = create_arenas(btt); 1254 + if (ret) { 1255 + dev_info(dev, "init: create_arenas: %d\n", ret); 1256 + goto out_free; 1257 + } 1258 + 1259 + ret = btt_meta_init(btt); 1260 + if (ret) { 1261 + dev_err(dev, "init: error in meta_init: %d\n", ret); 1262 + return NULL; 1263 + } 1264 + } 1265 + 1266 + ret = btt_blk_init(btt); 1267 + if (ret) { 1268 + dev_err(dev, "init: error in blk_init: %d\n", ret); 1269 + goto out_free; 1270 + } 1271 + 1272 + btt_debugfs_init(btt); 1273 + 1274 + return btt; 1275 + 1276 + out_free: 1277 + kfree(btt); 1278 + return NULL; 1279 + } 1280 + 1281 + /** 1282 + * btt_fini - de-initialize a BTT 1283 + * @btt: the BTT handle that was generated by btt_init 1284 + * 1285 + * De-initialize a Block Translation Table on device removal 1286 + * 1287 + * Context: 1288 + * Might sleep. 
1289 + */ 1290 + static void btt_fini(struct btt *btt) 1291 + { 1292 + if (btt) { 1293 + btt_blk_cleanup(btt); 1294 + free_arenas(btt); 1295 + debugfs_remove_recursive(btt->debugfs_dir); 1296 + kfree(btt); 1297 + } 1298 + } 1299 + 1300 + int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns) 1301 + { 1302 + struct nd_btt *nd_btt = to_nd_btt(ndns->claim); 1303 + struct nd_region *nd_region; 1304 + struct btt *btt; 1305 + size_t rawsize; 1306 + 1307 + if (!nd_btt->uuid || !nd_btt->ndns || !nd_btt->lbasize) 1308 + return -ENODEV; 1309 + 1310 + rawsize = nvdimm_namespace_capacity(ndns) - SZ_4K; 1311 + if (rawsize < ARENA_MIN_SIZE) { 1312 + return -ENXIO; 1313 + } 1314 + nd_region = to_nd_region(nd_btt->dev.parent); 1315 + btt = btt_init(nd_btt, rawsize, nd_btt->lbasize, nd_btt->uuid, 1316 + nd_region); 1317 + if (!btt) 1318 + return -ENOMEM; 1319 + nd_btt->btt = btt; 1320 + 1321 + return 0; 1322 + } 1323 + EXPORT_SYMBOL(nvdimm_namespace_attach_btt); 1324 + 1325 + int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns) 1326 + { 1327 + struct nd_btt *nd_btt = to_nd_btt(ndns->claim); 1328 + struct btt *btt = nd_btt->btt; 1329 + 1330 + btt_fini(btt); 1331 + nd_btt->btt = NULL; 1332 + 1333 + return 0; 1334 + } 1335 + EXPORT_SYMBOL(nvdimm_namespace_detach_btt); 1336 + 1337 + static int __init nd_btt_init(void) 1338 + { 1339 + int rc; 1340 + 1341 + BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K); 1342 + 1343 + btt_major = register_blkdev(0, "btt"); 1344 + if (btt_major < 0) 1345 + return btt_major; 1346 + 1347 + debugfs_root = debugfs_create_dir("btt", NULL); 1348 + if (IS_ERR_OR_NULL(debugfs_root)) { 1349 + rc = -ENXIO; 1350 + goto err_debugfs; 1351 + } 1352 + 1353 + return 0; 1354 + 1355 + err_debugfs: 1356 + unregister_blkdev(btt_major, "btt"); 1357 + 1358 + return rc; 1359 + } 1360 + 1361 + static void __exit nd_btt_exit(void) 1362 + { 1363 + debugfs_remove_recursive(debugfs_root); 1364 + unregister_blkdev(btt_major, "btt"); 1365 + } 1366 + 1367 + MODULE_ALIAS_ND_DEVICE(ND_DEVICE_BTT); 1368 + MODULE_AUTHOR("Vishal Verma <vishal.l.verma@linux.intel.com>"); 1369 + MODULE_LICENSE("GPL v2"); 1370 + module_init(nd_btt_init); 1371 + module_exit(nd_btt_exit);
+141
drivers/nvdimm/btt.h
··· 19 19 20 20 #define BTT_SIG_LEN 16 21 21 #define BTT_SIG "BTT_ARENA_INFO\0" 22 + #define MAP_ENT_SIZE 4 23 + #define MAP_TRIM_SHIFT 31 24 + #define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT) 25 + #define MAP_ERR_SHIFT 30 26 + #define MAP_ERR_MASK (1 << MAP_ERR_SHIFT) 27 + #define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT))) 28 + #define MAP_ENT_NORMAL 0xC0000000 29 + #define LOG_ENT_SIZE sizeof(struct log_entry) 30 + #define ARENA_MIN_SIZE (1UL << 24) /* 16 MB */ 31 + #define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */ 32 + #define RTT_VALID (1UL << 31) 33 + #define RTT_INVALID 0 34 + #define INT_LBASIZE_ALIGNMENT 256 35 + #define BTT_PG_SIZE 4096 36 + #define BTT_DEFAULT_NFREE ND_MAX_LANES 37 + #define LOG_SEQ_INIT 1 38 + 39 + #define IB_FLAG_ERROR 0x00000001 40 + #define IB_FLAG_ERROR_MASK 0x00000001 41 + 42 + enum btt_init_state { 43 + INIT_UNCHECKED = 0, 44 + INIT_NOTFOUND, 45 + INIT_READY 46 + }; 47 + 48 + struct log_entry { 49 + __le32 lba; 50 + __le32 old_map; 51 + __le32 new_map; 52 + __le32 seq; 53 + __le64 padding[2]; 54 + }; 22 55 23 56 struct btt_sb { 24 57 u8 signature[BTT_SIG_LEN]; ··· 75 42 __le64 checksum; 76 43 }; 77 44 45 + struct free_entry { 46 + u32 block; 47 + u8 sub; 48 + u8 seq; 49 + }; 50 + 51 + struct aligned_lock { 52 + union { 53 + spinlock_t lock; 54 + u8 cacheline_padding[L1_CACHE_BYTES]; 55 + }; 56 + }; 57 + 58 + /** 59 + * struct arena_info - handle for an arena 60 + * @size: Size in bytes this arena occupies on the raw device. 61 + * This includes arena metadata. 62 + * @external_lba_start: The first external LBA in this arena. 63 + * @internal_nlba: Number of internal blocks available in the arena 64 + * including nfree reserved blocks 65 + * @internal_lbasize: Internal and external lba sizes may be different as 66 + * we can round up 'odd' external lbasizes such as 520B 67 + * to be aligned. 68 + * @external_nlba: Number of blocks contributed by the arena to the number 69 + * reported to upper layers. (internal_nlba - nfree) 70 + * @external_lbasize: LBA size as exposed to upper layers. 71 + * @nfree: A reserve number of 'free' blocks that is used to 72 + * handle incoming writes. 73 + * @version_major: Metadata layout version major. 74 + * @version_minor: Metadata layout version minor. 75 + * @nextoff: Offset in bytes to the start of the next arena. 76 + * @infooff: Offset in bytes to the info block of this arena. 77 + * @dataoff: Offset in bytes to the data area of this arena. 78 + * @mapoff: Offset in bytes to the map area of this arena. 79 + * @logoff: Offset in bytes to the log area of this arena. 80 + * @info2off: Offset in bytes to the backup info block of this arena. 81 + * @freelist: Pointer to in-memory list of free blocks 82 + * @rtt: Pointer to in-memory "Read Tracking Table" 83 + * @map_locks: Spinlocks protecting concurrent map writes 84 + * @nd_btt: Pointer to parent nd_btt structure. 85 + * @list: List head for list of arenas 86 + * @debugfs_dir: Debugfs dentry 87 + * @flags: Arena flags - may signify error states. 88 + * 89 + * arena_info is a per-arena handle. Once an arena is narrowed down for an 90 + * IO, this struct is passed around for the duration of the IO. 
91 + */ 92 + struct arena_info { 93 + u64 size; /* Total bytes for this arena */ 94 + u64 external_lba_start; 95 + u32 internal_nlba; 96 + u32 internal_lbasize; 97 + u32 external_nlba; 98 + u32 external_lbasize; 99 + u32 nfree; 100 + u16 version_major; 101 + u16 version_minor; 102 + /* Byte offsets to the different on-media structures */ 103 + u64 nextoff; 104 + u64 infooff; 105 + u64 dataoff; 106 + u64 mapoff; 107 + u64 logoff; 108 + u64 info2off; 109 + /* Pointers to other in-memory structures for this arena */ 110 + struct free_entry *freelist; 111 + u32 *rtt; 112 + struct aligned_lock *map_locks; 113 + struct nd_btt *nd_btt; 114 + struct list_head list; 115 + struct dentry *debugfs_dir; 116 + /* Arena flags */ 117 + u32 flags; 118 + }; 119 + 120 + /** 121 + * struct btt - handle for a BTT instance 122 + * @btt_disk: Pointer to the gendisk for BTT device 123 + * @btt_queue: Pointer to the request queue for the BTT device 124 + * @arena_list: Head of the list of arenas 125 + * @debugfs_dir: Debugfs dentry 126 + * @nd_btt: Parent nd_btt struct 127 + * @nlba: Number of logical blocks exposed to the upper layers 128 + * after removing the amount of space needed by metadata 129 + * @rawsize: Total size in bytes of the available backing device 130 + * @lbasize: LBA size as requested and presented to upper layers. 131 + * This is sector_size + size of any metadata. 132 + * @sector_size: The Linux sector size - 512 or 4096 133 + * @nd_region: Parent region that provides the per-lane locking 134 + * @init_lock: Mutex used for the BTT initialization 135 + * @init_state: Flag describing the initialization state for the BTT 136 + * @num_arenas: Number of arenas in the BTT instance 137 + */ 138 + struct btt { 139 + struct gendisk *btt_disk; 140 + struct request_queue *btt_queue; 141 + struct list_head arena_list; 142 + struct dentry *debugfs_dir; 143 + struct nd_btt *nd_btt; 144 + u64 nlba; 145 + unsigned long long rawsize; 146 + u32 lbasize; 147 + u32 sector_size; 148 + struct nd_region *nd_region; 149 + struct mutex init_lock; 150 + int init_state; 151 + int num_arenas; 152 + }; 78 153 #endif
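Tying the MAP_* definitions back to the map-entry layout documented above: bit 31 is the TRIM flag, bit 30 is the ERROR flag, and the low 30 bits are the postmap block, so MAP_ENT_NORMAL (0xC0000000) is simply both flag bits set. A sketch of decoding a raw on-media entry with these masks; the helper is hypothetical, built only from definitions in this header:

    /*
     * Hypothetical decode helper; map entries are little-endian
     * on-media, hence the le32_to_cpu() before the masks apply.
     */
    static u32 example_map_decode(__le32 raw, bool *trim, bool *error)
    {
            u32 mapping = le32_to_cpu(raw);

            *trim = !!(mapping & MAP_TRIM_MASK);
            *error = !!(mapping & MAP_ERR_MASK);
            return mapping & MAP_LBA_MASK; /* 30-bit postmap block */
    }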
+2 -1
drivers/nvdimm/btt_devs.c
··· 348 348 */ 349 349 u64 nd_btt_sb_checksum(struct btt_sb *btt_sb) 350 350 { 351 - u64 sum, sum_save; 351 + u64 sum; 352 + __le64 sum_save; 352 353 353 354 sum_save = btt_sb->checksum; 354 355 btt_sb->checksum = 0;
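The one-line type change above is a sparse endianness fix: the on-media checksum field is __le64, so the temporary that stashes it should keep that __bitwise type rather than decay to a plain u64. The surrounding save/zero/sum/restore pattern, sketched under the assumption that the libnvdimm fletcher64 helper computes the sum (its body is not shown in this hunk):

    u64 sum;
    __le64 sum_save;                /* same __bitwise type as the field */

    sum_save = btt_sb->checksum;    /* stash the stored checksum */
    btt_sb->checksum = 0;           /* sum is computed over a zeroed field */
    sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1); /* assumed helper */
    btt_sb->checksum = sum_save;    /* restore for the caller */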
+24
drivers/nvdimm/namespace_devs.c
··· 76 76 return dev ? dev->type == &namespace_io_device_type : false; 77 77 } 78 78 79 + const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns, 80 + char *name) 81 + { 82 + struct nd_region *nd_region = to_nd_region(ndns->dev.parent); 83 + const char *suffix = ""; 84 + 85 + if (ndns->claim && is_nd_btt(ndns->claim)) 86 + suffix = "s"; 87 + 88 + if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev)) 89 + sprintf(name, "pmem%d%s", nd_region->id, suffix); 90 + else if (is_namespace_blk(&ndns->dev)) { 91 + struct nd_namespace_blk *nsblk; 92 + 93 + nsblk = to_nd_namespace_blk(&ndns->dev); 94 + sprintf(name, "ndblk%d.%d%s", nd_region->id, nsblk->id, suffix); 95 + } else { 96 + return NULL; 97 + } 98 + 99 + return name; 100 + } 101 + EXPORT_SYMBOL(nvdimm_namespace_disk_name); 102 + 79 103 static ssize_t nstype_show(struct device *dev, 80 104 struct device_attribute *attr, char *buf) 81 105 {
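The new helper centralizes disk naming so a BTT-fronted namespace is distinguishable from the raw one by an 's' suffix (the sector-atomic variant). Illustratively, the names produced for region 0:

    pmem0        raw pmem namespace
    pmem0s       the same namespace when claimed by a BTT
    ndblk0.0     raw blk namespace 0 in region 0
    ndblk0.0s    the same blk namespace when claimed by a BTT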
+21 -1
drivers/nvdimm/nd.h
··· 20 20 #include "label.h" 21 21 22 22 enum { 23 + /* 24 + * Limits the maximum number of block apertures a dimm can 25 + * support and is an input to the geometry/on-disk-format of a 26 + * BTT instance 27 + */ 28 + ND_MAX_LANES = 256, 23 29 SECTOR_SHIFT = 9, 24 30 }; 25 31 ··· 81 75 for (res = (ndd)->dpa.child, next = res ? res->sibling : NULL; \ 82 76 res; res = next, next = next ? next->sibling : NULL) 83 77 78 + struct nd_percpu_lane { 79 + int count; 80 + spinlock_t lock; 81 + }; 82 + 84 83 struct nd_region { 85 84 struct device dev; 86 85 struct ida ns_ida; ··· 95 84 u16 ndr_mappings; 96 85 u64 ndr_size; 97 86 u64 ndr_start; 98 - int id; 87 + int id, num_lanes; 99 88 void *provider_data; 100 89 struct nd_interleave_set *nd_set; 90 + struct nd_percpu_lane __percpu *lane; 101 91 struct nd_mapping mapping[0]; 102 92 }; 103 93 ··· 112 100 return next[seq & 3]; 113 101 } 114 102 103 + struct btt; 115 104 struct nd_btt { 116 105 struct device dev; 117 106 struct nd_namespace_common *ndns; 107 + struct btt *btt; 118 108 unsigned long lbasize; 119 109 u8 *uuid; 120 110 int id; ··· 171 157 172 158 #endif 173 159 struct nd_region *to_nd_region(struct device *dev); 160 + unsigned int nd_region_acquire_lane(struct nd_region *nd_region); 161 + void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane); 174 162 int nd_region_to_nstype(struct nd_region *nd_region); 175 163 int nd_region_register_namespaces(struct nd_region *nd_region, int *err); 176 164 u64 nd_region_interleave_set_cookie(struct nd_region *nd_region); ··· 188 172 resource_size_t n); 189 173 resource_size_t nvdimm_namespace_capacity(struct nd_namespace_common *ndns); 190 174 struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev); 175 + int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns); 176 + int nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns); 177 + const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns, 178 + char *name); 191 179 #endif /* __ND_H__ */
+1 -13
drivers/nvdimm/pmem.c
··· 160 160 static int pmem_attach_disk(struct nd_namespace_common *ndns, 161 161 struct pmem_device *pmem) 162 162 { 163 - struct nd_region *nd_region = to_nd_region(ndns->dev.parent); 164 163 struct gendisk *disk; 165 164 166 165 pmem->pmem_queue = blk_alloc_queue(GFP_KERNEL); ··· 182 183 disk->private_data = pmem; 183 184 disk->queue = pmem->pmem_queue; 184 185 disk->flags = GENHD_FL_EXT_DEVT; 185 - sprintf(disk->disk_name, "pmem%d", nd_region->id); 186 + nvdimm_namespace_disk_name(ndns, disk->disk_name); 186 187 disk->driverfs_dev = &ndns->dev; 187 188 set_capacity(disk, pmem->size >> 9); 188 189 pmem->pmem_disk = disk; ··· 208 209 memcpy(pmem->virt_addr + offset, buf, size); 209 210 210 211 return 0; 211 - } 212 - 213 - static int nvdimm_namespace_attach_btt(struct nd_namespace_common *ndns) 214 - { 215 - /* TODO */ 216 - return -ENXIO; 217 - } 218 - 219 - static void nvdimm_namespace_detach_btt(struct nd_namespace_common *ndns) 220 - { 221 - /* TODO */ 222 212 } 223 213 224 214 static void pmem_free(struct pmem_device *pmem)
+12
drivers/nvdimm/region.c
··· 10 10 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU 11 11 * General Public License for more details. 12 12 */ 13 + #include <linux/cpumask.h> 13 14 #include <linux/module.h> 14 15 #include <linux/device.h> 15 16 #include <linux/nd.h> ··· 19 18 static int nd_region_probe(struct device *dev) 20 19 { 21 20 int err; 21 + static unsigned long once; 22 22 struct nd_region_namespaces *num_ns; 23 23 struct nd_region *nd_region = to_nd_region(dev); 24 24 int rc = nd_region_register_namespaces(nd_region, &err); 25 + 26 + if (nd_region->num_lanes > num_online_cpus() 27 + && nd_region->num_lanes < num_possible_cpus() 28 + && !test_and_set_bit(0, &once)) { 29 + dev_info(dev, "online cpus (%d) < concurrent i/o lanes (%d) < possible cpus (%d)\n", 30 + num_online_cpus(), nd_region->num_lanes, 31 + num_possible_cpus()); 32 + dev_info(dev, "setting nr_cpus=%d may yield better libnvdimm device performance\n", 33 + nd_region->num_lanes); 34 + } 25 35 26 36 num_ns = devm_kzalloc(dev, sizeof(*num_ns), GFP_KERNEL); 27 37 if (!num_ns)
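The advisory fires once per boot on configurations where the lane count falls between the online and possible CPU counts: nr_cpu_ids tracks possible CPUs, so nd_region_acquire_lane() below takes the shared-lane spinlock path even though the online CPUs alone could each have had a private lane. Capping the CPU count at boot (numbers illustrative) keeps nr_cpu_ids at or below num_lanes so the lock-free path is taken:

    # illustrative kernel command line for a 256-lane region
    nr_cpus=256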
+78 -4
drivers/nvdimm/region_devs.c
··· 32 32 33 33 put_device(&nvdimm->dev); 34 34 } 35 + free_percpu(nd_region->lane); 35 36 ida_simple_remove(&region_ida, nd_region->id); 36 37 kfree(nd_region); 37 38 } ··· 532 531 } 533 532 EXPORT_SYMBOL_GPL(nd_region_provider_data); 534 533 534 + /** 535 + * nd_region_acquire_lane - allocate and lock a lane 536 + * @nd_region: region id and number of lanes possible 537 + * 538 + * A lane correlates to a BLK-data-window and/or a log slot in the BTT. 539 + * We optimize for the common case where there are 256 lanes, one 540 + * per-cpu. For larger systems we need to lock to share lanes. For now 541 + * this implementation assumes the cost of maintaining an allocator for 542 + * free lanes is on the order of the lock hold time, so it implements a 543 + * static lane = cpu % num_lanes mapping. 544 + * 545 + * In the case of a BTT instance on top of a BLK namespace a lane may be 546 + * acquired recursively. We lock on the first instance. 547 + * 548 + * In the case of a BTT instance on top of PMEM, we only acquire a lane 549 + * for the BTT metadata updates. 550 + */ 551 + unsigned int nd_region_acquire_lane(struct nd_region *nd_region) 552 + { 553 + unsigned int cpu, lane; 554 + 555 + cpu = get_cpu(); 556 + if (nd_region->num_lanes < nr_cpu_ids) { 557 + struct nd_percpu_lane *ndl_lock, *ndl_count; 558 + 559 + lane = cpu % nd_region->num_lanes; 560 + ndl_count = per_cpu_ptr(nd_region->lane, cpu); 561 + ndl_lock = per_cpu_ptr(nd_region->lane, lane); 562 + if (ndl_count->count++ == 0) 563 + spin_lock(&ndl_lock->lock); 564 + } else 565 + lane = cpu; 566 + 567 + return lane; 568 + } 569 + EXPORT_SYMBOL(nd_region_acquire_lane); 570 + 571 + void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane) 572 + { 573 + if (nd_region->num_lanes < nr_cpu_ids) { 574 + unsigned int cpu = get_cpu(); 575 + struct nd_percpu_lane *ndl_lock, *ndl_count; 576 + 577 + ndl_count = per_cpu_ptr(nd_region->lane, cpu); 578 + ndl_lock = per_cpu_ptr(nd_region->lane, lane); 579 + if (--ndl_count->count == 0) 580 + spin_unlock(&ndl_lock->lock); 581 + put_cpu(); 582 + } 583 + put_cpu(); 584 + } 585 + EXPORT_SYMBOL(nd_region_release_lane); 586 + 535 587 static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus, 536 588 struct nd_region_desc *ndr_desc, struct device_type *dev_type, 537 589 const char *caller) 538 590 { 539 591 struct nd_region *nd_region; 540 592 struct device *dev; 541 - u16 i; 593 + unsigned int i; 542 594 543 595 for (i = 0; i < ndr_desc->num_mappings; i++) { 544 596 struct nd_mapping *nd_mapping = &ndr_desc->nd_mapping[i]; ··· 611 557 if (!nd_region) 612 558 return NULL; 613 559 nd_region->id = ida_simple_get(&region_ida, 0, 0, GFP_KERNEL); 614 - if (nd_region->id < 0) { 615 - kfree(nd_region); 616 - return NULL; 560 + if (nd_region->id < 0) 561 + goto err_id; 562 + 563 + nd_region->lane = alloc_percpu(struct nd_percpu_lane); 564 + if (!nd_region->lane) 565 + goto err_percpu; 566 + 567 + for (i = 0; i < nr_cpu_ids; i++) { 568 + struct nd_percpu_lane *ndl; 569 + 570 + ndl = per_cpu_ptr(nd_region->lane, i); 571 + spin_lock_init(&ndl->lock); 572 + ndl->count = 0; 617 573 } 618 574 619 575 memcpy(nd_region->mapping, ndr_desc->nd_mapping, ··· 637 573 nd_region->ndr_mappings = ndr_desc->num_mappings; 638 574 nd_region->provider_data = ndr_desc->provider_data; 639 575 nd_region->nd_set = ndr_desc->nd_set; 576 + nd_region->num_lanes = ndr_desc->num_lanes; 640 577 ida_init(&nd_region->ns_ida); 641 578 ida_init(&nd_region->btt_ida); 642 579 dev = &nd_region->dev; ··· 650 585 
nd_device_register(dev); 651 586 652 587 return nd_region; 588 + 589 + err_percpu: 590 + ida_simple_remove(&region_ida, nd_region->id); 591 + err_id: 592 + kfree(nd_region); 593 + return NULL; 653 594 } 654 595 655 596 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, 656 597 struct nd_region_desc *ndr_desc) 657 598 { 599 + ndr_desc->num_lanes = ND_MAX_LANES; 658 600 return nd_region_create(nvdimm_bus, ndr_desc, &nd_pmem_device_type, 659 601 __func__); 660 602 } ··· 672 600 { 673 601 if (ndr_desc->num_mappings > 1) 674 602 return NULL; 603 + ndr_desc->num_lanes = min(ndr_desc->num_lanes, ND_MAX_LANES); 675 604 return nd_region_create(nvdimm_bus, ndr_desc, &nd_blk_device_type, 676 605 __func__); 677 606 } ··· 681 608 struct nd_region *nvdimm_volatile_region_create(struct nvdimm_bus *nvdimm_bus, 682 609 struct nd_region_desc *ndr_desc) 683 610 { 611 + ndr_desc->num_lanes = ND_MAX_LANES; 684 612 return nd_region_create(nvdimm_bus, ndr_desc, &nd_volatile_device_type, 685 613 __func__); 686 614 }
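Putting the lane API together: each IO brackets its work with acquire/release, relying on the static lane = cpu % num_lanes mapping described in the kernel-doc above. A sketch of a caller, illustrative rather than lifted from btt.c:

    /* Illustrative caller of the lane API added above */
    static void example_io(struct nd_region *nd_region)
    {
            unsigned int lane = nd_region_acquire_lane(nd_region);

            /*
             * Preemption is disabled here, and when lanes are shared
             * (num_lanes < nr_cpu_ids) the per-lane spinlock is held,
             * so 'lane' safely indexes per-lane resources such as the
             * BTT free list and flog slots.
             */

            nd_region_release_lane(nd_region, lane);
    }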
+1
include/linux/libnvdimm.h
··· 85 85 const struct attribute_group **attr_groups; 86 86 struct nd_interleave_set *nd_set; 87 87 void *provider_data; 88 + int num_lanes; 88 89 }; 89 90 90 91 struct nvdimm_bus;