Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: kill frontswap

The only user of frontswap is zswap, and has been for a long time. Have
swap call into zswap directly and remove the indirection.

[hannes@cmpxchg.org: remove obsolete comment, per Yosry]
Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org
[fengwei.yin@intel.com: don't warn if a non-swapcache folio is passed to zswap_load]
Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com
Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Johannes Weiner, committed by Andrew Morton
42c06a0e b8cf32dc

+121 -991
+7 -7
Documentation/admin-guide/mm/zswap.rst
···
 49  49  Design
 50  50  ======
 51  51  
 52      -  Zswap receives pages for compression through the Frontswap API and is able to
     52  +  Zswap receives pages for compression from the swap subsystem and is able to
 53  53  evict pages from its own compressed pool on an LRU basis and write them back to
 54  54  the backing swap device in the case that the compressed pool is full.
 55  55  
···
 70  70  zbud pages).  The zsmalloc type zpool has a more complex compressed page
 71  71  storage method, and it can achieve greater storage densities.
 72  72  
 73      -  When a swap page is passed from frontswap to zswap, zswap maintains a mapping
     73  +  When a swap page is passed from swapout to zswap, zswap maintains a mapping
 74  74  of the swap entry, a combination of the swap type and swap offset, to the zpool
 75  75  handle that references that compressed swap page.  This mapping is achieved
 76  76  with a red-black tree per swap type.  The swap offset is the search key for the
 77  77  tree nodes.
 78  78  
 79      -  During a page fault on a PTE that is a swap entry, frontswap calls the zswap
 80      -  load function to decompress the page into the page allocated by the page fault
 81      -  handler.
     79  +  During a page fault on a PTE that is a swap entry, the swapin code calls the
     80  +  zswap load function to decompress the page into the page allocated by the page
     81  +  fault handler.
 82  82  
 83  83  Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
 84      -  in the swap_map goes to 0) the swap code calls the zswap invalidate function,
 85      -  via frontswap, to free the compressed entry.
     84  +  in the swap_map goes to 0) the swap code calls the zswap invalidate function
     85  +  to free the compressed entry.
 86  86  
 87  87  Zswap seeks to be simple in its policies.  Sysfs attributes allow for one user
 88  88  controlled policy:
-264
Documentation/mm/frontswap.rst
···
  1 - =========
  2 - Frontswap
  3 - =========
  4 - 
  5 - Frontswap provides a "transcendent memory" interface for swap pages.
  6 - In some environments, dramatic performance savings may be obtained because
  7 - swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
  8 - 
  9 - .. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
 10 - 
 11 - Frontswap is so named because it can be thought of as the opposite of
 12 - a "backing" store for a swap device.  The storage is assumed to be
 13 - a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
 14 - to the requirements of transcendent memory (such as Xen's "tmem", or
 15 - in-kernel compressed memory, aka "zcache", or future RAM-like devices);
 16 - this pseudo-RAM device is not directly accessible or addressable by the
 17 - kernel and is of unknown and possibly time-varying size.  The driver
 18 - links itself to frontswap by calling frontswap_register_ops to set the
 19 - frontswap_ops funcs appropriately and the functions it provides must
 20 - conform to certain policies as follows:
 21 - 
 22 - An "init" prepares the device to receive frontswap pages associated
 23 - with the specified swap device number (aka "type").  A "store" will
 24 - copy the page to transcendent memory and associate it with the type and
 25 - offset associated with the page.  A "load" will copy the page, if found,
 26 - from transcendent memory into kernel memory, but will NOT remove the page
 27 - from transcendent memory.  An "invalidate_page" will remove the page
 28 - from transcendent memory and an "invalidate_area" will remove ALL pages
 29 - associated with the swap type (e.g., like swapoff) and notify the "device"
 30 - to refuse further stores with that swap type.
 31 - 
 32 - Once a page is successfully stored, a matching load on the page will normally
 33 - succeed.  So when the kernel finds itself in a situation where it needs
 34 - to swap out a page, it first attempts to use frontswap.  If the store returns
 35 - success, the data has been successfully saved to transcendent memory and
 36 - a disk write and, if the data is later read back, a disk read are avoided.
 37 - If a store returns failure, transcendent memory has rejected the data, and the
 38 - page can be written to swap as usual.
 39 - 
 40 - Note that if a page is stored and the page already exists in transcendent memory
 41 - (a "duplicate" store), either the store succeeds and the data is overwritten,
 42 - or the store fails AND the page is invalidated.  This ensures stale data may
 43 - never be obtained from frontswap.
 44 - 
 45 - If properly configured, monitoring of frontswap is done via debugfs in
 46 - the `/sys/kernel/debug/frontswap` directory.  The effectiveness of
 47 - frontswap can be measured (across all swap devices) with:
 48 - 
 49 - ``failed_stores``
 50 - 	how many store attempts have failed
 51 - 
 52 - ``loads``
 53 - 	how many loads were attempted (all should succeed)
 54 - 
 55 - ``succ_stores``
 56 - 	how many store attempts have succeeded
 57 - 
 58 - ``invalidates``
 59 - 	how many invalidates were attempted
 60 - 
 61 - A backend implementation may provide additional metrics.
 62 - 
 63 - FAQ
 64 - ===
 65 - 
 66 - * Where's the value?
 67 - 
 68 - When a workload starts swapping, performance falls through the floor.
 69 - Frontswap significantly increases performance in many such workloads by
 70 - providing a clean, dynamic interface to read and write swap pages to
 71 - "transcendent memory" that is otherwise not directly addressable to the kernel.
 72 - This interface is ideal when data is transformed to a different form
 73 - and size (such as with compression) or secretly moved (as might be
 74 - useful for write-balancing for some RAM-like devices).  Swap pages (and
 75 - evicted page-cache pages) are a great use for this kind of slower-than-RAM-
 76 - but-much-faster-than-disk "pseudo-RAM device".
 77 - 
 78 - Frontswap with a fairly small impact on the kernel,
 79 - provides a huge amount of flexibility for more dynamic, flexible RAM
 80 - utilization in various system configurations:
 81 - 
 82 - In the single kernel case, aka "zcache", pages are compressed and
 83 - stored in local memory, thus increasing the total anonymous pages
 84 - that can be safely kept in RAM.  Zcache essentially trades off CPU
 85 - cycles used in compression/decompression for better memory utilization.
 86 - Benchmarks have shown little or no impact when memory pressure is
 87 - low while providing a significant performance improvement (25%+)
 88 - on some workloads under high memory pressure.
 89 - 
 90 - "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
 91 - support for clustered systems.  Frontswap pages are locally compressed
 92 - as in zcache, but then "remotified" to another system's RAM.  This
 93 - allows RAM to be dynamically load-balanced back-and-forth as needed,
 94 - i.e. when system A is overcommitted, it can swap to system B, and
 95 - vice versa.  RAMster can also be configured as a memory server so
 96 - many servers in a cluster can swap, dynamically as needed, to a single
 97 - server configured with a large amount of RAM... without pre-configuring
 98 - how much of the RAM is available for each of the clients!
 99 - 
100 - In the virtual case, the whole point of virtualization is to statistically
101 - multiplex physical resources across the varying demands of multiple
102 - virtual machines.  This is really hard to do with RAM and efforts to do
103 - it well with no kernel changes have essentially failed (except in some
104 - well-publicized special-case workloads).
105 - Specifically, the Xen Transcendent Memory backend allows otherwise
106 - "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
107 - virtual machines, but the pages can be compressed and deduplicated to
108 - optimize RAM utilization.  And when guest OS's are induced to surrender
109 - underutilized RAM (e.g. with "selfballooning"), sudden unexpected
110 - memory pressure may result in swapping; frontswap allows those pages
111 - to be swapped to and from hypervisor RAM (if overall host system memory
112 - conditions allow), thus mitigating the potentially awful performance impact
113 - of unplanned swapping.
114 - 
115 - A KVM implementation is underway and has been RFC'ed to lkml.  And,
116 - using frontswap, investigation is also underway on the use of NVM as
117 - a memory extension technology.
118 - 
119 - * Sure there may be performance advantages in some situations, but
120 - what's the space/time overhead of frontswap?
121 - 
122 - If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
123 - nothingness and the only overhead is a few extra bytes per swapon'ed
124 - swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
125 - registers, there is one extra global variable compared to zero for
126 - every swap page read or written.  If CONFIG_FRONTSWAP is enabled
127 - AND a frontswap backend registers AND the backend fails every "store"
128 - request (i.e. provides no memory despite claiming it might),
129 - CPU overhead is still negligible -- and since every frontswap fail
130 - precedes a swap page write-to-disk, the system is highly likely
131 - to be I/O bound and using a small fraction of a percent of a CPU
132 - will be irrelevant anyway.
133 - 
134 - As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
135 - registers, one bit is allocated for every swap page for every swap
136 - device that is swapon'd.  This is added to the EIGHT bits (which
137 - was sixteen until about 2.6.34) that the kernel already allocates
138 - for every swap page for every swap device that is swapon'd.  (Hugh
139 - Dickins has observed that frontswap could probably steal one of
140 - the existing eight bits, but let's worry about that minor optimization
141 - later.)  For very large swap disks (which are rare) on a standard
142 - 4K pagesize, this is 1MB per 32GB swap.
143 - 
144 - When swap pages are stored in transcendent memory instead of written
145 - out to disk, there is a side effect that this may create more memory
146 - pressure that can potentially outweigh the other advantages.  A
147 - backend, such as zcache, must implement policies to carefully (but
148 - dynamically) manage memory limits to ensure this doesn't happen.
149 - 
150 - * OK, how about a quick overview of what this frontswap patch does
151 - in terms that a kernel hacker can grok?
152 - 
153 - Let's assume that a frontswap "backend" has registered during
154 - kernel initialization; this registration indicates that this
155 - frontswap backend has access to some "memory" that is not directly
156 - accessible by the kernel.  Exactly how much memory it provides is
157 - entirely dynamic and random.
158 - 
159 - Whenever a swap-device is swapon'd frontswap_init() is called,
160 - passing the swap device number (aka "type") as a parameter.
161 - This notifies frontswap to expect attempts to "store" swap pages
162 - associated with that number.
163 - 
164 - Whenever the swap subsystem is readying a page to write to a swap
165 - device (c.f swap_writepage()), frontswap_store is called.  Frontswap
166 - consults with the frontswap backend and if the backend says it does NOT
167 - have room, frontswap_store returns -1 and the kernel swaps the page
168 - to the swap device as normal.  Note that the response from the frontswap
169 - backend is unpredictable to the kernel; it may choose to never accept a
170 - page, it could accept every ninth page, or it might accept every
171 - page.  But if the backend does accept a page, the data from the page
172 - has already been copied and associated with the type and offset,
173 - and the backend guarantees the persistence of the data.  In this case,
174 - frontswap sets a bit in the "frontswap_map" for the swap device
175 - corresponding to the page offset on the swap device to which it would
176 - otherwise have written the data.
177 - 
178 - When the swap subsystem needs to swap-in a page (swap_readpage()),
179 - it first calls frontswap_load() which checks the frontswap_map to
180 - see if the page was earlier accepted by the frontswap backend.  If
181 - it was, the page of data is filled from the frontswap backend and
182 - the swap-in is complete.  If not, the normal swap-in code is
183 - executed to obtain the page of data from the real swap device.
184 - 
185 - So every time the frontswap backend accepts a page, a swap device read
186 - and (potentially) a swap device write are replaced by a "frontswap backend
187 - store" and (possibly) a "frontswap backend loads", which are presumably much
188 - faster.
189 - 
190 - * Can't frontswap be configured as a "special" swap device that is
191 - just higher priority than any real swap device (e.g. like zswap,
192 - or maybe swap-over-nbd/NFS)?
193 - 
194 - No.  First, the existing swap subsystem doesn't allow for any kind of
195 - swap hierarchy.  Perhaps it could be rewritten to accommodate a hierarchy,
196 - but this would require fairly drastic changes.  Even if it were
197 - rewritten, the existing swap subsystem uses the block I/O layer which
198 - assumes a swap device is fixed size and any page in it is linearly
199 - addressable.  Frontswap barely touches the existing swap subsystem,
200 - and works around the constraints of the block I/O subsystem to provide
201 - a great deal of flexibility and dynamicity.
202 - 
203 - For example, the acceptance of any swap page by the frontswap backend is
204 - entirely unpredictable.  This is critical to the definition of frontswap
205 - backends because it grants completely dynamic discretion to the
206 - backend.  In zcache, one cannot know a priori how compressible a page is.
207 - "Poorly" compressible pages can be rejected, and "poorly" can itself be
208 - defined dynamically depending on current memory constraints.
209 - 
210 - Further, frontswap is entirely synchronous whereas a real swap
211 - device is, by definition, asynchronous and uses block I/O.  The
212 - block I/O layer is not only unnecessary, but may perform "optimizations"
213 - that are inappropriate for a RAM-oriented device including delaying
214 - the write of some pages for a significant amount of time.  Synchrony is
215 - required to ensure the dynamicity of the backend and to avoid thorny race
216 - conditions that would unnecessarily and greatly complicate frontswap
217 - and/or the block I/O subsystem.  That said, only the initial "store"
218 - and "load" operations need be synchronous.  A separate asynchronous thread
219 - is free to manipulate the pages stored by frontswap.  For example,
220 - the "remotification" thread in RAMster uses standard asynchronous
221 - kernel sockets to move compressed frontswap pages to a remote machine.
222 - Similarly, a KVM guest-side implementation could do in-guest compression
223 - and use "batched" hypercalls.
224 - 
225 - In a virtualized environment, the dynamicity allows the hypervisor
226 - (or host OS) to do "intelligent overcommit".  For example, it can
227 - choose to accept pages only until host-swapping might be imminent,
228 - then force guests to do their own swapping.
229 - 
230 - There is a downside to the transcendent memory specifications for
231 - frontswap:  Since any "store" might fail, there must always be a real
232 - slot on a real swap device to swap the page.  Thus frontswap must be
233 - implemented as a "shadow" to every swapon'd device with the potential
234 - capability of holding every page that the swap device might have held
235 - and the possibility that it might hold no pages at all.  This means
236 - that frontswap cannot contain more pages than the total of swapon'd
237 - swap devices.  For example, if NO swap device is configured on some
238 - installation, frontswap is useless.  Swapless portable devices
239 - can still use frontswap but a backend for such devices must configure
240 - some kind of "ghost" swap device and ensure that it is never used.
241 - 
242 - * Why this weird definition about "duplicate stores"?  If a page
243 - has been previously successfully stored, can't it always be
244 - successfully overwritten?
245 - 
246 - Nearly always it can, but no, sometimes it cannot.  Consider an example
247 - where data is compressed and the original 4K page has been compressed
248 - to 1K.  Now an attempt is made to overwrite the page with data that
249 - is non-compressible and so would take the entire 4K.  But the backend
250 - has no more space.  In this case, the store must be rejected.  Whenever
251 - frontswap rejects a store that would overwrite, it also must invalidate
252 - the old data and ensure that it is no longer accessible.  Since the
253 - swap subsystem then writes the new data to the read swap device,
254 - this is the correct course of action to ensure coherency.
255 - 
256 - * Why does the frontswap patch create the new include file swapfile.h?
257 - 
258 - The frontswap code depends on some swap-subsystem-internal data
259 - structures that have, over the years, moved back and forth between
260 - static and global.  This seemed a reasonable compromise:  Define
261 - them as global but declare them in a new include file that isn't
262 - included by the large number of source files that include swap.h.
263 - 
264 - Dan Magenheimer, last updated April 9, 2012
-1
Documentation/mm/index.rst
···
 44  44     balance
 45  45     damon/index
 46  46     free_page_reporting
 47      -     frontswap
 48  47     hmm
 49  48     hwpoison
 50  49     hugetlbfs_reserv
-196
Documentation/translations/zh_CN/mm/frontswap.rst
···
  1 - :Original: Documentation/mm/frontswap.rst
  2 - 
  3 - :翻译:
  4 - 
  5 -  司延腾 Yanteng Si <siyanteng@loongson.cn>
  6 - 
  7 - :校译:
  8 - 
  9 - =========
 10 - Frontswap
 11 - =========
 12 - 
 13 - Frontswap为交换页提供了一个 “transcendent memory” 的接口。在一些环境中,由
 14 - 于交换页被保存在RAM(或类似RAM的设备)中,而不是交换磁盘,因此可以获得巨大的性能
 15 - 节省(提高)。
 16 - 
 17 - .. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
 18 - 
 19 - Frontswap之所以这么命名,是因为它可以被认为是与swap设备的“back”存储相反。存
 20 - 储器被认为是一个同步并发安全的面向页面的“伪RAM设备”,符合transcendent memory
 21 - (如Xen的“tmem”,或内核内压缩内存,又称“zcache”,或未来的类似RAM的设备)的要
 22 - 求;这个伪RAM设备不能被内核直接访问或寻址,其大小未知且可能随时间变化。驱动程序通过
 23 - 调用frontswap_register_ops将自己与frontswap链接起来,以适当地设置frontswap_ops
 24 - 的功能,它提供的功能必须符合某些策略,如下所示:
 25 - 
 26 - 一个 “init” 将设备准备好接收与指定的交换设备编号(又称“类型”)相关的frontswap
 27 - 交换页。一个 “store” 将把该页复制到transcendent memory,并与该页的类型和偏移
 28 - 量相关联。一个 “load” 将把该页,如果找到的话,从transcendent memory复制到内核
 29 - 内存,但不会从transcendent memory中删除该页。一个 “invalidate_page” 将从
 30 - transcendent memory中删除该页,一个 “invalidate_area” 将删除所有与交换类型
 31 - 相关的页(例如,像swapoff)并通知 “device” 拒绝进一步存储该交换类型。
 32 - 
 33 - 一旦一个页面被成功存储,在该页面上的匹配加载通常会成功。因此,当内核发现自己处于需
 34 - 要交换页面的情况时,它首先尝试使用frontswap。如果存储的结果是成功的,那么数据就已
 35 - 经成功的保存到了transcendent memory中,并且避免了磁盘写入,如果后来再读回数据,
 36 - 也避免了磁盘读取。如果存储返回失败,transcendent memory已经拒绝了该数据,且该页
 37 - 可以像往常一样被写入交换空间。
 38 - 
 39 - 请注意,如果一个页面被存储,而该页面已经存在于transcendent memory中(一个 “重复”
 40 - 的存储),要么存储成功,数据被覆盖,要么存储失败,该页面被废止。这确保了旧的数据永远
 41 - 不会从frontswap中获得。
 42 - 
 43 - 如果配置正确,对frontswap的监控是通过 `/sys/kernel/debug/frontswap` 目录下的
 44 - debugfs完成的。frontswap的有效性可以通过以下方式测量(在所有交换设备中):
 45 - 
 46 - ``failed_stores``
 47 - 	有多少次存储的尝试是失败的
 48 - 
 49 - ``loads``
 50 - 	尝试了多少次加载(应该全部成功)
 51 - 
 52 - ``succ_stores``
 53 - 	有多少次存储的尝试是成功的
 54 - 
 55 - ``invalidates``
 56 - 	尝试了多少次作废
 57 - 
 58 - 后台实现可以提供额外的指标。
 59 - 
 60 - 经常问到的问题
 61 - ==============
 62 - 
 63 - * 价值在哪里?
 64 - 
 65 - 当一个工作负载开始交换时,性能就会下降。Frontswap通过提供一个干净的、动态的接口来
 66 - 读取和写入交换页到 “transcendent memory”,从而大大增加了许多这样的工作负载的性
 67 - 能,否则内核是无法直接寻址的。当数据被转换为不同的形式和大小(比如压缩)或者被秘密
 68 - 移动(对于一些类似RAM的设备来说,这可能对写平衡很有用)时,这个接口是理想的。交换
 69 - 页(和被驱逐的页面缓存页)是这种比RAM慢但比磁盘快得多的“伪RAM设备”的一大用途。
 70 - 
 71 - Frontswap对内核的影响相当小,为各种系统配置中更动态、更灵活的RAM利用提供了巨大的
 72 - 灵活性:
 73 - 
 74 - 在单一内核的情况下,又称“zcache”,页面被压缩并存储在本地内存中,从而增加了可以安
 75 - 全保存在RAM中的匿名页面总数。Zcache本质上是用压缩/解压缩的CPU周期换取更好的内存利
 76 - 用率。Benchmarks测试显示,当内存压力较低时,几乎没有影响,而在高内存压力下的一些
 77 - 工作负载上,则有明显的性能改善(25%以上)。
 78 - 
 79 - “RAMster” 在zcache的基础上增加了对集群系统的 “peer-to-peer” transcendent memory
 80 - 的支持。Frontswap页面像zcache一样被本地压缩,但随后被“remotified” 到另一个系
 81 - 统的RAM。这使得RAM可以根据需要动态地来回负载平衡,也就是说,当系统A超载时,它可以
 82 - 交换到系统B,反之亦然。RAMster也可以被配置成一个内存服务器,因此集群中的许多服务器
 83 - 可以根据需要动态地交换到配置有大量内存的单一服务器上......而不需要预先配置每个客户
 84 - 有多少内存可用
 85 - 
 86 - 在虚拟情况下,虚拟化的全部意义在于统计地将物理资源在多个虚拟机的不同需求之间进行复
 87 - 用。对于RAM来说,这真的很难做到,而且在不改变内核的情况下,要做好这一点的努力基本上
 88 - 是失败的(除了一些广为人知的特殊情况下的工作负载)。具体来说,Xen Transcendent Memory
 89 - 后端允许管理器拥有的RAM “fallow”,不仅可以在多个虚拟机之间进行“time-shared”,
 90 - 而且页面可以被压缩和重复利用,以优化RAM的利用率。当客户操作系统被诱导交出未充分利用
 91 - 的RAM时(如 “selfballooning”),突然出现的意外内存压力可能会导致交换;frontswap
 92 - 允许这些页面被交换到管理器RAM中或从管理器RAM中交换(如果整体主机系统内存条件允许),
 93 - 从而减轻计划外交换可能带来的可怕的性能影响。
 94 - 
 95 - 一个KVM的实现正在进行中,并且已经被RFC'ed到lkml。而且,利用frontswap,对NVM作为
 96 - 内存扩展技术的调查也在进行中。
 97 - 
 98 - * 当然,在某些情况下可能有性能上的优势,但frontswap的空间/时间开销是多少?
 99 - 
100 - 如果 CONFIG_FRONTSWAP 被禁用,每个 frontswap 钩子都会编译成空,唯一的开销是每
101 - 个 swapon'ed swap 设备的几个额外字节。如果 CONFIG_FRONTSWAP 被启用,但没有
102 - frontswap的 “backend” 寄存器,每读或写一个交换页就会有一个额外的全局变量,而不
103 - 是零。如果 CONFIG_FRONTSWAP 被启用,并且有一个frontswap的backend寄存器,并且
104 - 后端每次 “store” 请求都失败(即尽管声称可能,但没有提供内存),CPU 的开销仍然可以
105 - 忽略不计 - 因为每次frontswap失败都是在交换页写到磁盘之前,系统很可能是 I/O 绑定
106 - 的,无论如何使用一小部分的 CPU 都是不相关的。
107 - 
108 - 至于空间,如果CONFIG_FRONTSWAP被启用,并且有一个frontswap的backend注册,那么
109 - 每个交换设备的每个交换页都会被分配一个比特。这是在内核已经为每个交换设备的每个交换
110 - 页分配的8位(在2.6.34之前是16位)上增加的。(Hugh Dickins观察到,frontswap可能
111 - 会偷取现有的8个比特,但是我们以后再来担心这个小的优化问题)。对于标准的4K页面大小的
112 - 非常大的交换盘(这很罕见),这是每32GB交换盘1MB开销。
113 - 
114 - 当交换页存储在transcendent memory中而不是写到磁盘上时,有一个副作用,即这可能会
115 - 产生更多的内存压力,有可能超过其他的优点。一个backend,比如zcache,必须实现策略
116 - 来仔细(但动态地)管理内存限制,以确保这种情况不会发生。
117 - 
118 - * 好吧,那就用内核骇客能理解的术语来快速概述一下这个frontswap补丁的作用如何?
119 - 
120 - 我们假设在内核初始化过程中,一个frontswap 的 “backend” 已经注册了;这个注册表
121 - 明这个frontswap 的 “backend” 可以访问一些不被内核直接访问的“内存”。它到底提
122 - 供了多少内存是完全动态和随机的。
123 - 
124 - 每当一个交换设备被交换时,就会调用frontswap_init(),把交换设备的编号(又称“类
125 - 型”)作为一个参数传给它。这就通知了frontswap,以期待 “store” 与该号码相关的交
126 - 换页的尝试。
127 - 
128 - 每当交换子系统准备将一个页面写入交换设备时(参见swap_writepage()),就会调用
129 - frontswap_store。Frontswap与frontswap backend协商,如果backend说它没有空
130 - 间,frontswap_store返回-1,内核就会照常把页换到交换设备上。注意,来自frontswap
131 - backend的响应对内核来说是不可预测的;它可能选择从不接受一个页面,可能接受每九个
132 - 页面,也可能接受每一个页面。但是如果backend确实接受了一个页面,那么这个页面的数
133 - 据已经被复制并与类型和偏移量相关联了,而且backend保证了数据的持久性。在这种情况
134 - 下,frontswap在交换设备的“frontswap_map” 中设置了一个位,对应于交换设备上的
135 - 页面偏移量,否则它就会将数据写入该设备。
136 - 
137 - 当交换子系统需要交换一个页面时(swap_readpage()),它首先调用frontswap_load(),
138 - 检查frontswap_map,看这个页面是否早先被frontswap backend接受。如果是,该页
139 - 的数据就会从frontswap后端填充,换入就完成了。如果不是,正常的交换代码将被执行,
140 - 以便从真正的交换设备上获得这一页的数据。
141 - 
142 - 所以每次frontswap backend接受一个页面时,交换设备的读取和(可能)交换设备的写
143 - 入都被 “frontswap backend store” 和(可能)“frontswap backend loads”
144 - 所取代,这可能会快得多。
145 - 
146 - * frontswap不能被配置为一个 “特殊的” 交换设备,它的优先级要高于任何真正的交换
147 - 设备(例如像zswap,或者可能是swap-over-nbd/NFS)?
148 - 
149 - 首先,现有的交换子系统不允许有任何种类的交换层次结构。也许它可以被重写以适应层次
150 - 结构,但这将需要相当大的改变。即使它被重写,现有的交换子系统也使用了块I/O层,它
151 - 假定交换设备是固定大小的,其中的任何页面都是可线性寻址的。Frontswap几乎没有触
152 - 及现有的交换子系统,而是围绕着块I/O子系统的限制,提供了大量的灵活性和动态性。
153 - 
154 - 例如,frontswap backend对任何交换页的接受是完全不可预测的。这对frontswap backend
155 - 的定义至关重要,因为它赋予了backend完全动态的决定权。在zcache中,人们无法预
156 - 先知道一个页面的可压缩性如何。可压缩性 “差” 的页面会被拒绝,而 “差” 本身也可
157 - 以根据当前的内存限制动态地定义。
158 - 
159 - 此外,frontswap是完全同步的,而真正的交换设备,根据定义,是异步的,并且使用
160 - 块I/O。块I/O层不仅是不必要的,而且可能进行 “优化”,这对面向RAM的设备来说是
161 - 不合适的,包括将一些页面的写入延迟相当长的时间。同步是必须的,以确保后端的动
162 - 态性,并避免棘手的竞争条件,这将不必要地大大增加frontswap和/或块I/O子系统的
163 - 复杂性。也就是说,只有最初的 “store” 和 “load” 操作是需要同步的。一个独立
164 - 的异步线程可以自由地操作由frontswap存储的页面。例如,RAMster中的 “remotification”
165 - 线程使用标准的异步内核套接字,将压缩的frontswap页面移动到远程机器。同样,
166 - KVM的客户方实现可以进行客户内压缩,并使用 “batched” hypercalls。
167 - 
168 - 在虚拟化环境中,动态性允许管理程序(或主机操作系统)做“intelligent overcommit”。
169 - 例如,它可以选择只接受页面,直到主机交换可能即将发生,然后强迫客户机做他们
170 - 自己的交换。
171 - 
172 - transcendent memory规格的frontswap有一个坏处。因为任何 “store” 都可
173 - 能失败,所以必须在一个真正的交换设备上有一个真正的插槽来交换页面。因此,
174 - frontswap必须作为每个交换设备的 “影子” 来实现,它有可能容纳交换设备可能
175 - 容纳的每一个页面,也有可能根本不容纳任何页面。这意味着frontswap不能包含比
176 - swap设备总数更多的页面。例如,如果在某些安装上没有配置交换设备,frontswap
177 - 就没有用。无交换设备的便携式设备仍然可以使用frontswap,但是这种设备的
178 - backend必须配置某种 “ghost” 交换设备,并确保它永远不会被使用。
179 - 
180 - 
181 - * 为什么会有这种关于 “重复存储” 的奇怪定义?如果一个页面以前被成功地存储过,
182 - 难道它不能总是被成功地覆盖吗?
183 - 
184 - 几乎总是可以的,不,有时不能。考虑一个例子,数据被压缩了,原来的4K页面被压
185 - 缩到了1K。现在,有人试图用不可压缩的数据覆盖该页,因此会占用整个4K。但是
186 - backend没有更多的空间了。在这种情况下,这个存储必须被拒绝。每当frontswap
187 - 拒绝一个会覆盖的存储时,它也必须使旧的数据作废,并确保它不再被访问。因为交
188 - 换子系统会把新的数据写到读交换设备上,这是确保一致性的正确做法。
189 - 
190 - * 为什么frontswap补丁会创建新的头文件swapfile.h?
191 - 
192 - frontswap代码依赖于一些swap子系统内部的数据结构,这些数据结构多年来一直
193 - 在静态和全局之间来回移动。这似乎是一个合理的妥协:将它们定义为全局,但在一
194 - 个新的包含文件中声明它们,该文件不被包含swap.h的大量源文件所包含。
195 - 
196 - Dan Magenheimer,最后更新于2012年4月9日
-1
Documentation/translations/zh_CN/mm/index.rst
···
 42  42     damon/index
 43  43     free_page_reporting
 44  44     ksm
 45      -     frontswap
 46  45     hmm
 47  46     hwpoison
 48  47     hugetlbfs_reserv
-7
MAINTAINERS
···
 8404  8404  F:	include/linux/freezer.h
 8405  8405  F:	kernel/freezer.c
 8406  8406  
 8407       -  FRONTSWAP API
 8408       -  M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 8409       -  L:	linux-kernel@vger.kernel.org
 8410       -  S:	Maintained
 8411       -  F:	include/linux/frontswap.h
 8412       -  F:	mm/frontswap.c
 8413       -  
 8414  8407  FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS
 8415  8408  M:	David Howells <dhowells@redhat.com>
 8416  8409  L:	linux-cachefs@redhat.com (moderated for non-subscribers)
+1
fs/proc/meminfo.c
···
 17  17  #ifdef CONFIG_CMA
 18  18  #include <linux/cma.h>
 19  19  #endif
     20  +  #include <linux/zswap.h>
 20  21  #include <asm/page.h>
 21  22  #include "internal.h"
 22  23  
-91
include/linux/frontswap.h
···
  1 - /* SPDX-License-Identifier: GPL-2.0 */
  2 - #ifndef _LINUX_FRONTSWAP_H
  3 - #define _LINUX_FRONTSWAP_H
  4 - 
  5 - #include <linux/swap.h>
  6 - #include <linux/mm.h>
  7 - #include <linux/bitops.h>
  8 - #include <linux/jump_label.h>
  9 - 
 10 - struct frontswap_ops {
 11 - 	void (*init)(unsigned); /* this swap type was just swapon'ed */
 12 - 	int (*store)(unsigned, pgoff_t, struct page *); /* store a page */
 13 - 	int (*load)(unsigned, pgoff_t, struct page *, bool *); /* load a page */
 14 - 	void (*invalidate_page)(unsigned, pgoff_t); /* page no longer needed */
 15 - 	void (*invalidate_area)(unsigned); /* swap type just swapoff'ed */
 16 - };
 17 - 
 18 - int frontswap_register_ops(const struct frontswap_ops *ops);
 19 - 
 20 - extern void frontswap_init(unsigned type, unsigned long *map);
 21 - extern int __frontswap_store(struct page *page);
 22 - extern int __frontswap_load(struct page *page);
 23 - extern void __frontswap_invalidate_page(unsigned, pgoff_t);
 24 - extern void __frontswap_invalidate_area(unsigned);
 25 - 
 26 - #ifdef CONFIG_FRONTSWAP
 27 - extern struct static_key_false frontswap_enabled_key;
 28 - 
 29 - static inline bool frontswap_enabled(void)
 30 - {
 31 - 	return static_branch_unlikely(&frontswap_enabled_key);
 32 - }
 33 - 
 34 - static inline void frontswap_map_set(struct swap_info_struct *p,
 35 - 				     unsigned long *map)
 36 - {
 37 - 	p->frontswap_map = map;
 38 - }
 39 - 
 40 - static inline unsigned long *frontswap_map_get(struct swap_info_struct *p)
 41 - {
 42 - 	return p->frontswap_map;
 43 - }
 44 - #else
 45 - /* all inline routines become no-ops and all externs are ignored */
 46 - 
 47 - static inline bool frontswap_enabled(void)
 48 - {
 49 - 	return false;
 50 - }
 51 - 
 52 - static inline void frontswap_map_set(struct swap_info_struct *p,
 53 - 				     unsigned long *map)
 54 - {
 55 - }
 56 - 
 57 - static inline unsigned long *frontswap_map_get(struct swap_info_struct *p)
 58 - {
 59 - 	return NULL;
 60 - }
 61 - #endif
 62 - 
 63 - static inline int frontswap_store(struct page *page)
 64 - {
 65 - 	if (frontswap_enabled())
 66 - 		return __frontswap_store(page);
 67 - 
 68 - 	return -1;
 69 - }
 70 - 
 71 - static inline int frontswap_load(struct page *page)
 72 - {
 73 - 	if (frontswap_enabled())
 74 - 		return __frontswap_load(page);
 75 - 
 76 - 	return -1;
 77 - }
 78 - 
 79 - static inline void frontswap_invalidate_page(unsigned type, pgoff_t offset)
 80 - {
 81 - 	if (frontswap_enabled())
 82 - 		__frontswap_invalidate_page(type, offset);
 83 - }
 84 - 
 85 - static inline void frontswap_invalidate_area(unsigned type)
 86 - {
 87 - 	if (frontswap_enabled())
 88 - 		__frontswap_invalidate_area(type);
 89 - }
 90 - 
 91 - #endif /* _LINUX_FRONTSWAP_H */
-9
include/linux/swap.h
···
 302  302  	struct file *swap_file;		/* seldom referenced */
 303  303  	unsigned int old_block_size;	/* seldom referenced */
 304  304  	struct completion comp;		/* seldom referenced */
 305       -  #ifdef CONFIG_FRONTSWAP
 306       -  	unsigned long *frontswap_map;	/* frontswap in-use, one bit per page */
 307       -  	atomic_t frontswap_pages;	/* frontswap pages in-use counter */
 308       -  #endif
 309  305  	spinlock_t lock;		/*
 310  306  					 * protect map scan related fields like
 311  307  					 * swap_map, lowest_bit, highest_bit,
···
 624  628  {
 625  629  	return READ_ONCE(vm_swappiness);
 626  630  }
 627       -  #endif
 628       -  
 629       -  #ifdef CONFIG_ZSWAP
 630       -  extern u64 zswap_pool_total_size;
 631       -  extern atomic_t zswap_stored_pages;
 632  631  #endif
 633  632  
 634  633  #if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-5
include/linux/swapfile.h
···
  2   2  #ifndef _LINUX_SWAPFILE_H
  3   3  #define _LINUX_SWAPFILE_H
  4   4  
  5      -  /*
  6      -   * these were static in swapfile.c but frontswap.c needs them and we don't
  7      -   * want to expose them to the dozens of source files that include swap.h
  8      -   */
  9      -  extern struct swap_info_struct *swap_info[];
 10   5  extern unsigned long generic_max_swapfile_size(void);
 11   6  unsigned long arch_max_swapfile_size(void);
 12   7  
+37
include/linux/zswap.h
···
  1 + /* SPDX-License-Identifier: GPL-2.0 */
  2 + #ifndef _LINUX_ZSWAP_H
  3 + #define _LINUX_ZSWAP_H
  4 + 
  5 + #include <linux/types.h>
  6 + #include <linux/mm_types.h>
  7 + 
  8 + extern u64 zswap_pool_total_size;
  9 + extern atomic_t zswap_stored_pages;
 10 + 
 11 + #ifdef CONFIG_ZSWAP
 12 + 
 13 + bool zswap_store(struct page *page);
 14 + bool zswap_load(struct page *page);
 15 + void zswap_invalidate(int type, pgoff_t offset);
 16 + void zswap_swapon(int type);
 17 + void zswap_swapoff(int type);
 18 + 
 19 + #else
 20 + 
 21 + static inline bool zswap_store(struct page *page)
 22 + {
 23 + 	return false;
 24 + }
 25 + 
 26 + static inline bool zswap_load(struct page *page)
 27 + {
 28 + 	return false;
 29 + }
 30 + 
 31 + static inline void zswap_invalidate(int type, pgoff_t offset) {}
 32 + static inline void zswap_swapon(int type) {}
 33 + static inline void zswap_swapoff(int type) {}
 34 + 
 35 + #endif
 36 + 
 37 + #endif /* _LINUX_ZSWAP_H */
-4
mm/Kconfig
···
 25   25  config ZSWAP
 26   26  	bool "Compressed cache for swap pages"
 27   27  	depends on SWAP
 28       -  	select FRONTSWAP
 29   28  	select CRYPTO
 30   29  	select ZPOOL
 31   30  	help
···
 870  871  	bool
 871  872  
 872  873  config HAVE_SETUP_PER_CPU_AREA
 873       -  	bool
 874       -  
 875       -  config FRONTSWAP
 876  874  	bool
 877  875  
 878  876  config CMA
-1
mm/Makefile
···
 72  72  endif
 73  73  
 74  74  obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o swap_slots.o
 75      -  obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 76  75  obj-$(CONFIG_ZSWAP)	+= zswap.o
 77  76  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 78  77  obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
-283
mm/frontswap.c
···
  1 - // SPDX-License-Identifier: GPL-2.0-only
  2 - /*
  3 -  * Frontswap frontend
  4 -  *
  5 -  * This code provides the generic "frontend" layer to call a matching
  6 -  * "backend" driver implementation of frontswap.  See
  7 -  * Documentation/mm/frontswap.rst for more information.
  8 -  *
  9 -  * Copyright (C) 2009-2012 Oracle Corp.  All rights reserved.
 10 -  * Author: Dan Magenheimer
 11 -  */
 12 - 
 13 - #include <linux/mman.h>
 14 - #include <linux/swap.h>
 15 - #include <linux/swapops.h>
 16 - #include <linux/security.h>
 17 - #include <linux/module.h>
 18 - #include <linux/debugfs.h>
 19 - #include <linux/frontswap.h>
 20 - #include <linux/swapfile.h>
 21 - 
 22 - DEFINE_STATIC_KEY_FALSE(frontswap_enabled_key);
 23 - 
 24 - /*
 25 -  * frontswap_ops are added by frontswap_register_ops, and provide the
 26 -  * frontswap "backend" implementation functions.  Multiple implementations
 27 -  * may be registered, but implementations can never deregister.  This
 28 -  * is a simple singly-linked list of all registered implementations.
 29 -  */
 30 - static const struct frontswap_ops *frontswap_ops __read_mostly;
 31 - 
 32 - #ifdef CONFIG_DEBUG_FS
 33 - /*
 34 -  * Counters available via /sys/kernel/debug/frontswap (if debugfs is
 35 -  * properly configured).  These are for information only so are not protected
 36 -  * against increment races.
 37 -  */
 38 - static u64 frontswap_loads;
 39 - static u64 frontswap_succ_stores;
 40 - static u64 frontswap_failed_stores;
 41 - static u64 frontswap_invalidates;
 42 - 
 43 - static inline void inc_frontswap_loads(void)
 44 - {
 45 - 	data_race(frontswap_loads++);
 46 - }
 47 - static inline void inc_frontswap_succ_stores(void)
 48 - {
 49 - 	data_race(frontswap_succ_stores++);
 50 - }
 51 - static inline void inc_frontswap_failed_stores(void)
 52 - {
 53 - 	data_race(frontswap_failed_stores++);
 54 - }
 55 - static inline void inc_frontswap_invalidates(void)
 56 - {
 57 - 	data_race(frontswap_invalidates++);
 58 - }
 59 - #else
 60 - static inline void inc_frontswap_loads(void) { }
 61 - static inline void inc_frontswap_succ_stores(void) { }
 62 - static inline void inc_frontswap_failed_stores(void) { }
 63 - static inline void inc_frontswap_invalidates(void) { }
 64 - #endif
 65 - 
 66 - /*
 67 -  * Due to the asynchronous nature of the backends loading potentially
 68 -  * _after_ the swap system has been activated, we have chokepoints
 69 -  * on all frontswap functions to not call the backend until the backend
 70 -  * has registered.
 71 -  *
 72 -  * This would not guards us against the user deciding to call swapoff right as
 73 -  * we are calling the backend to initialize (so swapon is in action).
 74 -  * Fortunately for us, the swapon_mutex has been taken by the callee so we are
 75 -  * OK. The other scenario where calls to frontswap_store (called via
 76 -  * swap_writepage) is racing with frontswap_invalidate_area (called via
 77 -  * swapoff) is again guarded by the swap subsystem.
 78 -  *
 79 -  * While no backend is registered all calls to frontswap_[store|load|
 80 -  * invalidate_area|invalidate_page] are ignored or fail.
 81 -  *
 82 -  * The time between the backend being registered and the swap file system
 83 -  * calling the backend (via the frontswap_* functions) is indeterminate as
 84 -  * frontswap_ops is not atomic_t (or a value guarded by a spinlock).
 85 -  * That is OK as we are comfortable missing some of these calls to the newly
 86 -  * registered backend.
 87 -  *
 88 -  * Obviously the opposite (unloading the backend) must be done after all
 89 -  * the frontswap_[store|load|invalidate_area|invalidate_page] start
 90 -  * ignoring or failing the requests.  However, there is currently no way
 91 -  * to unload a backend once it is registered.
 92 -  */
 93 - 
 94 - /*
 95 -  * Register operations for frontswap
 96 -  */
 97 - int frontswap_register_ops(const struct frontswap_ops *ops)
 98 - {
 99 - 	if (frontswap_ops)
100 - 		return -EINVAL;
101 - 
102 - 	frontswap_ops = ops;
103 - 	static_branch_inc(&frontswap_enabled_key);
104 - 	return 0;
105 - }
106 - 
107 - /*
108 -  * Called when a swap device is swapon'd.
109 -  */
110 - void frontswap_init(unsigned type, unsigned long *map)
111 - {
112 - 	struct swap_info_struct *sis = swap_info[type];
113 - 
114 - 	VM_BUG_ON(sis == NULL);
115 - 
116 - 	/*
117 - 	 * p->frontswap is a bitmap that we MUST have to figure out which page
118 - 	 * has gone in frontswap. Without it there is no point of continuing.
119 - 	 */
120 - 	if (WARN_ON(!map))
121 - 		return;
122 - 	/*
123 - 	 * Irregardless of whether the frontswap backend has been loaded
124 - 	 * before this function or it will be later, we _MUST_ have the
125 - 	 * p->frontswap set to something valid to work properly.
126 - */ 127 - frontswap_map_set(sis, map); 128 - 129 - if (!frontswap_enabled()) 130 - return; 131 - frontswap_ops->init(type); 132 - } 133 - 134 - static bool __frontswap_test(struct swap_info_struct *sis, 135 - pgoff_t offset) 136 - { 137 - if (sis->frontswap_map) 138 - return test_bit(offset, sis->frontswap_map); 139 - return false; 140 - } 141 - 142 - static inline void __frontswap_set(struct swap_info_struct *sis, 143 - pgoff_t offset) 144 - { 145 - set_bit(offset, sis->frontswap_map); 146 - atomic_inc(&sis->frontswap_pages); 147 - } 148 - 149 - static inline void __frontswap_clear(struct swap_info_struct *sis, 150 - pgoff_t offset) 151 - { 152 - clear_bit(offset, sis->frontswap_map); 153 - atomic_dec(&sis->frontswap_pages); 154 - } 155 - 156 - /* 157 - * "Store" data from a page to frontswap and associate it with the page's 158 - * swaptype and offset. Page must be locked and in the swap cache. 159 - * If frontswap already contains a page with matching swaptype and 160 - * offset, the frontswap implementation may either overwrite the data and 161 - * return success or invalidate the page from frontswap and return failure. 162 - */ 163 - int __frontswap_store(struct page *page) 164 - { 165 - int ret = -1; 166 - swp_entry_t entry = { .val = page_private(page), }; 167 - int type = swp_type(entry); 168 - struct swap_info_struct *sis = swap_info[type]; 169 - pgoff_t offset = swp_offset(entry); 170 - 171 - VM_BUG_ON(!frontswap_ops); 172 - VM_BUG_ON(!PageLocked(page)); 173 - VM_BUG_ON(sis == NULL); 174 - 175 - /* 176 - * If a dup, we must remove the old page first; we can't leave the 177 - * old page no matter if the store of the new page succeeds or fails, 178 - * and we can't rely on the new page replacing the old page as we may 179 - * not store to the same implementation that contains the old page. 
180 - */ 181 - if (__frontswap_test(sis, offset)) { 182 - __frontswap_clear(sis, offset); 183 - frontswap_ops->invalidate_page(type, offset); 184 - } 185 - 186 - ret = frontswap_ops->store(type, offset, page); 187 - if (ret == 0) { 188 - __frontswap_set(sis, offset); 189 - inc_frontswap_succ_stores(); 190 - } else { 191 - inc_frontswap_failed_stores(); 192 - } 193 - 194 - return ret; 195 - } 196 - 197 - /* 198 - * "Get" data from frontswap associated with swaptype and offset that were 199 - * specified when the data was put to frontswap and use it to fill the 200 - * specified page with data. Page must be locked and in the swap cache. 201 - */ 202 - int __frontswap_load(struct page *page) 203 - { 204 - int ret = -1; 205 - swp_entry_t entry = { .val = page_private(page), }; 206 - int type = swp_type(entry); 207 - struct swap_info_struct *sis = swap_info[type]; 208 - pgoff_t offset = swp_offset(entry); 209 - bool exclusive = false; 210 - 211 - VM_BUG_ON(!frontswap_ops); 212 - VM_BUG_ON(!PageLocked(page)); 213 - VM_BUG_ON(sis == NULL); 214 - 215 - if (!__frontswap_test(sis, offset)) 216 - return -1; 217 - 218 - /* Try loading from each implementation, until one succeeds. */ 219 - ret = frontswap_ops->load(type, offset, page, &exclusive); 220 - if (ret == 0) { 221 - inc_frontswap_loads(); 222 - if (exclusive) { 223 - SetPageDirty(page); 224 - __frontswap_clear(sis, offset); 225 - } 226 - } 227 - return ret; 228 - } 229 - 230 - /* 231 - * Invalidate any data from frontswap associated with the specified swaptype 232 - * and offset so that a subsequent "get" will fail. 
233 - */ 234 - void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 235 - { 236 - struct swap_info_struct *sis = swap_info[type]; 237 - 238 - VM_BUG_ON(!frontswap_ops); 239 - VM_BUG_ON(sis == NULL); 240 - 241 - if (!__frontswap_test(sis, offset)) 242 - return; 243 - 244 - frontswap_ops->invalidate_page(type, offset); 245 - __frontswap_clear(sis, offset); 246 - inc_frontswap_invalidates(); 247 - } 248 - 249 - /* 250 - * Invalidate all data from frontswap associated with all offsets for the 251 - * specified swaptype. 252 - */ 253 - void __frontswap_invalidate_area(unsigned type) 254 - { 255 - struct swap_info_struct *sis = swap_info[type]; 256 - 257 - VM_BUG_ON(!frontswap_ops); 258 - VM_BUG_ON(sis == NULL); 259 - 260 - if (sis->frontswap_map == NULL) 261 - return; 262 - 263 - frontswap_ops->invalidate_area(type); 264 - atomic_set(&sis->frontswap_pages, 0); 265 - bitmap_zero(sis->frontswap_map, sis->max); 266 - } 267 - 268 - static int __init init_frontswap(void) 269 - { 270 - #ifdef CONFIG_DEBUG_FS 271 - struct dentry *root = debugfs_create_dir("frontswap", NULL); 272 - if (root == NULL) 273 - return -ENXIO; 274 - debugfs_create_u64("loads", 0444, root, &frontswap_loads); 275 - debugfs_create_u64("succ_stores", 0444, root, &frontswap_succ_stores); 276 - debugfs_create_u64("failed_stores", 0444, root, 277 - &frontswap_failed_stores); 278 - debugfs_create_u64("invalidates", 0444, root, &frontswap_invalidates); 279 - #endif 280 - return 0; 281 - } 282 - 283 - module_init(init_frontswap);
+3 -3
mm/page_io.c
··· 19 19 #include <linux/bio.h> 20 20 #include <linux/swapops.h> 21 21 #include <linux/writeback.h> 22 - #include <linux/frontswap.h> 23 22 #include <linux/blkdev.h> 24 23 #include <linux/psi.h> 25 24 #include <linux/uio.h> 26 25 #include <linux/sched/task.h> 27 26 #include <linux/delayacct.h> 27 + #include <linux/zswap.h> 28 28 #include "swap.h" 29 29 30 30 static void __end_swap_bio_write(struct bio *bio) ··· 195 195 folio_unlock(folio); 196 196 return ret; 197 197 } 198 - if (frontswap_store(&folio->page) == 0) { 198 + if (zswap_store(&folio->page)) { 199 199 folio_start_writeback(folio); 200 200 folio_unlock(folio); 201 201 folio_end_writeback(folio); ··· 512 512 } 513 513 delayacct_swapin_start(); 514 514 515 - if (frontswap_load(page) == 0) { 515 + if (zswap_load(page)) { 516 516 SetPageUptodate(page); 517 517 unlock_page(page); 518 518 } else if (data_race(sis->flags & SWP_FS_OPS)) {
+10 -23
mm/swapfile.c
··· 35 35 #include <linux/memcontrol.h> 36 36 #include <linux/poll.h> 37 37 #include <linux/oom.h> 38 - #include <linux/frontswap.h> 39 38 #include <linux/swapfile.h> 40 39 #include <linux/export.h> 41 40 #include <linux/swap_slots.h> 42 41 #include <linux/sort.h> 43 42 #include <linux/completion.h> 44 43 #include <linux/suspend.h> 44 + #include <linux/zswap.h> 45 45 46 46 #include <asm/tlbflush.h> 47 47 #include <linux/swapops.h> ··· 95 95 static struct plist_head *swap_avail_heads; 96 96 static DEFINE_SPINLOCK(swap_avail_lock); 97 97 98 - struct swap_info_struct *swap_info[MAX_SWAPFILES]; 98 + static struct swap_info_struct *swap_info[MAX_SWAPFILES]; 99 99 100 100 static DEFINE_MUTEX(swapon_mutex); 101 101 ··· 744 744 swap_slot_free_notify = NULL; 745 745 while (offset <= end) { 746 746 arch_swap_invalidate_page(si->type, offset); 747 - frontswap_invalidate_page(si->type, offset); 747 + zswap_invalidate(si->type, offset); 748 748 if (swap_slot_free_notify) 749 749 swap_slot_free_notify(si->bdev, offset); 750 750 offset++; ··· 2343 2343 2344 2344 static void enable_swap_info(struct swap_info_struct *p, int prio, 2345 2345 unsigned char *swap_map, 2346 - struct swap_cluster_info *cluster_info, 2347 - unsigned long *frontswap_map) 2346 + struct swap_cluster_info *cluster_info) 2348 2347 { 2349 - if (IS_ENABLED(CONFIG_FRONTSWAP)) 2350 - frontswap_init(p->type, frontswap_map); 2348 + zswap_swapon(p->type); 2349 + 2351 2350 spin_lock(&swap_lock); 2352 2351 spin_lock(&p->lock); 2353 2352 setup_swap_info(p, prio, swap_map, cluster_info); ··· 2389 2390 struct swap_info_struct *p = NULL; 2390 2391 unsigned char *swap_map; 2391 2392 struct swap_cluster_info *cluster_info; 2392 - unsigned long *frontswap_map; 2393 2393 struct file *swap_file, *victim; 2394 2394 struct address_space *mapping; 2395 2395 struct inode *inode; ··· 2513 2515 p->swap_map = NULL; 2514 2516 cluster_info = p->cluster_info; 2515 2517 p->cluster_info = NULL; 2516 - frontswap_map = frontswap_map_get(p); 
2517 2518 spin_unlock(&p->lock); 2518 2519 spin_unlock(&swap_lock); 2519 2520 arch_swap_invalidate_area(p->type); 2520 - frontswap_invalidate_area(p->type); 2521 - frontswap_map_set(p, NULL); 2521 + zswap_swapoff(p->type); 2522 2522 mutex_unlock(&swapon_mutex); 2523 2523 free_percpu(p->percpu_cluster); 2524 2524 p->percpu_cluster = NULL; ··· 2524 2528 p->cluster_next_cpu = NULL; 2525 2529 vfree(swap_map); 2526 2530 kvfree(cluster_info); 2527 - kvfree(frontswap_map); 2528 2531 /* Destroy swap account information */ 2529 2532 swap_cgroup_swapoff(p->type); 2530 2533 exit_swap_address_space(p->type); ··· 2990 2995 unsigned long maxpages; 2991 2996 unsigned char *swap_map = NULL; 2992 2997 struct swap_cluster_info *cluster_info = NULL; 2993 - unsigned long *frontswap_map = NULL; 2994 2998 struct page *page = NULL; 2995 2999 struct inode *inode = NULL; 2996 3000 bool inced_nr_rotate_swap = false; ··· 3129 3135 error = nr_extents; 3130 3136 goto bad_swap_unlock_inode; 3131 3137 } 3132 - /* frontswap enabled? set up bit-per-page map for frontswap */ 3133 - if (IS_ENABLED(CONFIG_FRONTSWAP)) 3134 - frontswap_map = kvcalloc(BITS_TO_LONGS(maxpages), 3135 - sizeof(long), 3136 - GFP_KERNEL); 3137 3138 3138 3139 if ((swap_flags & SWAP_FLAG_DISCARD) && 3139 3140 p->bdev && bdev_max_discard_sectors(p->bdev)) { ··· 3181 3192 if (swap_flags & SWAP_FLAG_PREFER) 3182 3193 prio = 3183 3194 (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; 3184 - enable_swap_info(p, prio, swap_map, cluster_info, frontswap_map); 3195 + enable_swap_info(p, prio, swap_map, cluster_info); 3185 3196 3186 - pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s%s\n", 3197 + pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n", 3187 3198 p->pages<<(PAGE_SHIFT-10), name->name, p->prio, 3188 3199 nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), 3189 3200 (p->flags & SWP_SOLIDSTATE) ? "SS" : "", 3190 3201 (p->flags & SWP_DISCARDABLE) ? 
"D" : "", 3191 3202 (p->flags & SWP_AREA_DISCARD) ? "s" : "", 3192 - (p->flags & SWP_PAGE_DISCARD) ? "c" : "", 3193 - (frontswap_map) ? "FS" : ""); 3203 + (p->flags & SWP_PAGE_DISCARD) ? "c" : ""); 3194 3204 3195 3205 mutex_unlock(&swapon_mutex); 3196 3206 atomic_inc(&proc_poll_event); ··· 3219 3231 spin_unlock(&swap_lock); 3220 3232 vfree(swap_map); 3221 3233 kvfree(cluster_info); 3222 - kvfree(frontswap_map); 3223 3234 if (inced_nr_rotate_swap) 3224 3235 atomic_dec(&nr_rotate_swap); 3225 3236 if (swap_file)
+63 -96
mm/zswap.c
··· 2 2 /* 3 3 * zswap.c - zswap driver file 4 4 * 5 - * zswap is a backend for frontswap that takes pages that are in the process 5 + * zswap is a cache that takes pages that are in the process 6 6 * of being swapped out and attempts to compress and store them in a 7 7 * RAM-based memory pool. This can result in a significant I/O reduction on 8 8 * the swap device and, in the case where decompressing from RAM is faster ··· 20 20 #include <linux/spinlock.h> 21 21 #include <linux/types.h> 22 22 #include <linux/atomic.h> 23 - #include <linux/frontswap.h> 24 23 #include <linux/rbtree.h> 25 24 #include <linux/swap.h> 26 25 #include <linux/crypto.h> ··· 27 28 #include <linux/mempool.h> 28 29 #include <linux/zpool.h> 29 30 #include <crypto/acompress.h> 30 - 31 + #include <linux/zswap.h> 31 32 #include <linux/mm_types.h> 32 33 #include <linux/page-flags.h> 33 34 #include <linux/swapops.h> ··· 1083 1084 * 1084 1085 * This can be thought of as a "resumed writeback" of the page 1085 1086 * to the swap device. We are basically resuming the same swap 1086 - * writeback path that was intercepted with the frontswap_store() 1087 + * writeback path that was intercepted with the zswap_store() 1087 1088 * in the first place. After the page has been decompressed into 1088 1089 * the swap cache, the compressed version stored by zswap can be 1089 1090 * freed. 
··· 1223 1224 memset_l(page, value, PAGE_SIZE / sizeof(unsigned long)); 1224 1225 } 1225 1226 1226 - /********************************* 1227 - * frontswap hooks 1228 - **********************************/ 1229 - /* attempts to compress and store an single page */ 1230 - static int zswap_frontswap_store(unsigned type, pgoff_t offset, 1231 - struct page *page) 1227 + bool zswap_store(struct page *page) 1232 1228 { 1229 + swp_entry_t swp = { .val = page_private(page), }; 1230 + int type = swp_type(swp); 1231 + pgoff_t offset = swp_offset(swp); 1233 1232 struct zswap_tree *tree = zswap_trees[type]; 1234 1233 struct zswap_entry *entry, *dupentry; 1235 1234 struct scatterlist input, output; ··· 1235 1238 struct obj_cgroup *objcg = NULL; 1236 1239 struct zswap_pool *pool; 1237 1240 struct zpool *zpool; 1238 - int ret; 1239 1241 unsigned int dlen = PAGE_SIZE; 1240 1242 unsigned long handle, value; 1241 1243 char *buf; 1242 1244 u8 *src, *dst; 1243 1245 gfp_t gfp; 1246 + int ret; 1247 + 1248 + VM_WARN_ON_ONCE(!PageLocked(page)); 1249 + VM_WARN_ON_ONCE(!PageSwapCache(page)); 1244 1250 1245 1251 /* THP isn't supported */ 1246 - if (PageTransHuge(page)) { 1247 - ret = -EINVAL; 1248 - goto reject; 1249 - } 1252 + if (PageTransHuge(page)) 1253 + return false; 1250 1254 1251 - if (!zswap_enabled || !tree) { 1252 - ret = -ENODEV; 1253 - goto reject; 1254 - } 1255 + if (!zswap_enabled || !tree) 1256 + return false; 1255 1257 1256 1258 /* 1257 1259 * XXX: zswap reclaim does not work with cgroups yet. Without a ··· 1258 1262 * local cgroup limits. 
1259 1263 */ 1260 1264 objcg = get_obj_cgroup_from_page(page); 1261 - if (objcg && !obj_cgroup_may_zswap(objcg)) { 1262 - ret = -ENOMEM; 1265 + if (objcg && !obj_cgroup_may_zswap(objcg)) 1263 1266 goto reject; 1264 - } 1265 1267 1266 1268 /* reclaim space if needed */ 1267 1269 if (zswap_is_full()) { ··· 1269 1275 } 1270 1276 1271 1277 if (zswap_pool_reached_full) { 1272 - if (!zswap_can_accept()) { 1273 - ret = -ENOMEM; 1278 + if (!zswap_can_accept()) 1274 1279 goto shrink; 1275 1280 - } else 1280 + else 1276 1281 zswap_pool_reached_full = false; 1277 1282 } ··· 1279 1286 entry = zswap_entry_cache_alloc(GFP_KERNEL); 1280 1287 if (!entry) { 1281 1288 zswap_reject_kmemcache_fail++; 1282 - ret = -ENOMEM; 1283 1289 goto reject; 1284 1290 } 1285 1291 ··· 1295 1303 kunmap_atomic(src); 1296 1304 } 1297 1305 1298 - if (!zswap_non_same_filled_pages_enabled) { 1299 - ret = -EINVAL; 1306 + if (!zswap_non_same_filled_pages_enabled) 1300 1307 goto freepage; 1301 - } 1302 1308 1303 1309 /* if entry is successfully added, it keeps the reference */ 1304 1310 entry->pool = zswap_pool_current_get(); 1305 - if (!entry->pool) { 1306 - ret = -EINVAL; 1311 + if (!entry->pool) 1307 1312 goto freepage; 1308 - } 1309 1313 1310 1314 /* compress */ 1311 1315 acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); ··· 1321 1333 * synchronous in fact. 1322 1334 * Theoretically, acomp supports users send multiple acomp requests in one 1323 1335 * acomp instance, then get those requests done simultaneously. but in this 1324 - * case, frontswap actually does store and load page by page, there is no 1336 + * case, zswap actually does store and load page by page, there is no 1325 1337 * existing method to send the second page before the first page is done 1326 - * in one thread doing frontswap. 1338 + * in one thread doing zswap. 1327 1339 * but in different threads running on different cpu, we have different 1328 1340 * acomp instance, so multiple threads can do (de)compression in parallel. 
1329 1341 */ 1330 1342 ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait); 1331 1343 dlen = acomp_ctx->req->dlen; 1332 1344 1333 - if (ret) { 1334 - ret = -EINVAL; 1345 + if (ret) 1335 1346 goto put_dstmem; 1336 - } 1337 1347 1338 1348 /* store */ 1339 1349 zpool = zswap_find_zpool(entry); ··· 1367 1381 1368 1382 /* map */ 1369 1383 spin_lock(&tree->lock); 1370 - do { 1371 - ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry); 1372 - if (ret == -EEXIST) { 1373 - zswap_duplicate_entry++; 1374 - /* remove from rbtree */ 1375 - zswap_rb_erase(&tree->rbroot, dupentry); 1376 - zswap_entry_put(tree, dupentry); 1377 - } 1378 - } while (ret == -EEXIST); 1384 + while (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) { 1385 + zswap_duplicate_entry++; 1386 + /* remove from rbtree */ 1387 + zswap_rb_erase(&tree->rbroot, dupentry); 1388 + zswap_entry_put(tree, dupentry); 1389 + } 1379 1390 if (entry->length) { 1380 1391 spin_lock(&entry->pool->lru_lock); 1381 1392 list_add(&entry->lru, &entry->pool->lru); ··· 1385 1402 zswap_update_total_size(); 1386 1403 count_vm_event(ZSWPOUT); 1387 1404 1388 - return 0; 1405 + return true; 1389 1406 1390 1407 put_dstmem: 1391 1408 mutex_unlock(acomp_ctx->mutex); ··· 1395 1412 reject: 1396 1413 if (objcg) 1397 1414 obj_cgroup_put(objcg); 1398 - return ret; 1415 + return false; 1399 1416 1400 1417 shrink: 1401 1418 pool = zswap_pool_last_get(); 1402 1419 if (pool) 1403 1420 queue_work(shrink_wq, &pool->shrink_work); 1404 - ret = -ENOMEM; 1405 1421 goto reject; 1406 1422 } 1407 1423 1408 - /* 1409 - * returns 0 if the page was successfully decompressed 1410 - * return -1 on entry not found or error 1411 - */ 1412 - static int zswap_frontswap_load(unsigned type, pgoff_t offset, 1413 - struct page *page, bool *exclusive) 1424 + bool zswap_load(struct page *page) 1414 1425 { 1426 + swp_entry_t swp = { .val = page_private(page), }; 1427 + int type = swp_type(swp); 1428 + pgoff_t offset = swp_offset(swp); 
1415 1429 struct zswap_tree *tree = zswap_trees[type]; 1416 1430 struct zswap_entry *entry; 1417 1431 struct scatterlist input, output; ··· 1416 1436 u8 *src, *dst, *tmp; 1417 1437 struct zpool *zpool; 1418 1438 unsigned int dlen; 1419 - int ret; 1439 + bool ret; 1440 + 1441 + VM_WARN_ON_ONCE(!PageLocked(page)); 1420 1442 1421 1443 /* find */ 1422 1444 spin_lock(&tree->lock); 1423 1445 entry = zswap_entry_find_get(&tree->rbroot, offset); 1424 1446 if (!entry) { 1425 - /* entry was written back */ 1426 1447 spin_unlock(&tree->lock); 1427 - return -1; 1448 + return false; 1428 1449 } 1429 1450 spin_unlock(&tree->lock); 1430 1451 ··· 1433 1452 dst = kmap_atomic(page); 1434 1453 zswap_fill_page(dst, entry->value); 1435 1454 kunmap_atomic(dst); 1436 - ret = 0; 1455 + ret = true; 1437 1456 goto stats; 1438 1457 } 1439 1458 ··· 1441 1460 if (!zpool_can_sleep_mapped(zpool)) { 1442 1461 tmp = kmalloc(entry->length, GFP_KERNEL); 1443 1462 if (!tmp) { 1444 - ret = -ENOMEM; 1463 + ret = false; 1445 1464 goto freeentry; 1446 1465 } 1447 1466 } ··· 1462 1481 sg_init_table(&output, 1); 1463 1482 sg_set_page(&output, page, PAGE_SIZE, 0); 1464 1483 acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen); 1465 - ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait); 1484 + if (crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait)) 1485 + WARN_ON(1); 1466 1486 mutex_unlock(acomp_ctx->mutex); 1467 1487 1468 1488 if (zpool_can_sleep_mapped(zpool)) ··· 1471 1489 else 1472 1490 kfree(tmp); 1473 1491 1474 - BUG_ON(ret); 1492 + ret = true; 1475 1493 stats: 1476 1494 count_vm_event(ZSWPIN); 1477 1495 if (entry->objcg) 1478 1496 count_objcg_event(entry->objcg, ZSWPIN); 1479 1497 freeentry: 1480 1498 spin_lock(&tree->lock); 1481 - if (!ret && zswap_exclusive_loads_enabled) { 1499 + if (ret && zswap_exclusive_loads_enabled) { 1482 1500 zswap_invalidate_entry(tree, entry); 1483 - *exclusive = true; 1501 + SetPageDirty(page); 
1484 1502 } else if (entry->length) { 1485 1503 spin_lock(&entry->pool->lru_lock); 1486 1504 list_move(&entry->lru, &entry->pool->lru); ··· 1492 1510 return ret; 1493 1511 } 1494 1512 1495 - /* frees an entry in zswap */ 1496 - static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset) 1513 + void zswap_invalidate(int type, pgoff_t offset) 1497 1514 { 1498 1515 struct zswap_tree *tree = zswap_trees[type]; 1499 1516 struct zswap_entry *entry; ··· 1509 1528 spin_unlock(&tree->lock); 1510 1529 } 1511 1530 1512 - /* frees all zswap entries for the given swap type */ 1513 - static void zswap_frontswap_invalidate_area(unsigned type) 1531 + void zswap_swapon(int type) 1532 + { 1533 + struct zswap_tree *tree; 1534 + 1535 + tree = kzalloc(sizeof(*tree), GFP_KERNEL); 1536 + if (!tree) { 1537 + pr_err("alloc failed, zswap disabled for swap type %d\n", type); 1538 + return; 1539 + } 1540 + 1541 + tree->rbroot = RB_ROOT; 1542 + spin_lock_init(&tree->lock); 1543 + zswap_trees[type] = tree; 1544 + } 1545 + 1546 + void zswap_swapoff(int type) 1514 1547 { 1515 1548 struct zswap_tree *tree = zswap_trees[type]; 1516 1549 struct zswap_entry *entry, *n; ··· 1541 1546 kfree(tree); 1542 1547 zswap_trees[type] = NULL; 1543 1548 } 1544 - 1545 - static void zswap_frontswap_init(unsigned type) 1546 - { 1547 - struct zswap_tree *tree; 1548 - 1549 - tree = kzalloc(sizeof(*tree), GFP_KERNEL); 1550 - if (!tree) { 1551 - pr_err("alloc failed, zswap disabled for swap type %d\n", type); 1552 - return; 1553 - } 1554 - 1555 - tree->rbroot = RB_ROOT; 1556 - spin_lock_init(&tree->lock); 1557 - zswap_trees[type] = tree; 1558 - } 1559 - 1560 - static const struct frontswap_ops zswap_frontswap_ops = { 1561 - .store = zswap_frontswap_store, 1562 - .load = zswap_frontswap_load, 1563 - .invalidate_page = zswap_frontswap_invalidate_page, 1564 - .invalidate_area = zswap_frontswap_invalidate_area, 1565 - .init = zswap_frontswap_init 1566 - }; 1567 1549 1568 1550 
/********************************* 1569 1551 * debugfs functions ··· 1630 1658 if (!shrink_wq) 1631 1659 goto fallback_fail; 1632 1660 1633 - ret = frontswap_register_ops(&zswap_frontswap_ops); 1634 - if (ret) 1635 - goto destroy_wq; 1636 1661 if (zswap_debugfs_init()) 1637 1662 pr_warn("debugfs initialization failed\n"); 1638 1663 zswap_init_state = ZSWAP_INIT_SUCCEED; 1639 1664 return 0; 1640 1665 1641 - destroy_wq: 1642 - destroy_workqueue(shrink_wq); 1643 1666 fallback_fail: 1644 1667 if (pool) 1645 1668 zswap_pool_destroy(pool);