Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm

Pull frontswap feature from Konrad Rzeszutek Wilk:
"Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained
because swapped pages are saved in RAM (or a RAM-like device) instead
of a swap disk. This tag provides the basic infrastructure along with
some changes to the existing backends."

Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.

This pull request came in before the merge window even opened; it got
delayed to after the merge window by me just wanting to make sure it had
actual users. Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and OpenSUSE
users.

Also acked by Rik van Riel, and Konrad points to other people liking it
too. So in it goes.

By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
frontswap: s/put_page/store/g s/get_page/load
MAINTAINER: Add myself for the frontswap API
mm: frontswap: config and doc files
mm: frontswap: core frontswap functionality
mm: frontswap: core swap subsystem hooks and headers
mm: frontswap: add frontswap header file

+827 -26
+278
Documentation/vm/frontswap.txt
··· 1 + Frontswap provides a "transcendent memory" interface for swap pages. 2 + In some environments, dramatic performance savings may be obtained because 3 + swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. 4 + 5 + (Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" 6 + and the only necessary changes to the core kernel for transcendent memory; 7 + all other supporting code -- the "backends" -- is implemented as drivers. 8 + See the LWN.net article "Transcendent memory in a nutshell" for a detailed 9 + overview of frontswap and related kernel parts: 10 + https://lwn.net/Articles/454795/ ) 11 + 12 + Frontswap is so named because it can be thought of as the opposite of 13 + a "backing" store for a swap device. The storage is assumed to be 14 + a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming 15 + to the requirements of transcendent memory (such as Xen's "tmem", or 16 + in-kernel compressed memory, aka "zcache", or future RAM-like devices); 17 + this pseudo-RAM device is not directly accessible or addressable by the 18 + kernel and is of unknown and possibly time-varying size. The driver 19 + links itself to frontswap by calling frontswap_register_ops to set the 20 + frontswap_ops funcs appropriately and the functions it provides must 21 + conform to certain policies as follows: 22 + 23 + An "init" prepares the device to receive frontswap pages associated 24 + with the specified swap device number (aka "type"). A "store" will 25 + copy the page to transcendent memory and associate it with the type and 26 + offset associated with the page. A "load" will copy the page, if found, 27 + from transcendent memory into kernel memory, but will NOT remove the page 28 + from transcendent memory. 
An "invalidate_page" will remove the page 29 + from transcendent memory and an "invalidate_area" will remove ALL pages 30 + associated with the swap type (e.g., like swapoff) and notify the "device" 31 + to refuse further stores with that swap type. 32 + 33 + Once a page is successfully stored, a matching load on the page will normally 34 + succeed. So when the kernel finds itself in a situation where it needs 35 + to swap out a page, it first attempts to use frontswap. If the store returns 36 + success, the data has been successfully saved to transcendent memory and 37 + a disk write and, if the data is later read back, a disk read are avoided. 38 + If a store returns failure, transcendent memory has rejected the data, and the 39 + page can be written to swap as usual. 40 + 41 + If a backend chooses, frontswap can be configured as a "writethrough 42 + cache" by calling frontswap_writethrough(). In this mode, the reduction 43 + in swap device writes is lost (and also a non-trivial performance advantage) 44 + in order to allow the backend to arbitrarily "reclaim" space used to 45 + store frontswap pages to more completely manage its memory usage. 46 + 47 + Note that if a page is stored and the page already exists in transcendent memory 48 + (a "duplicate" store), either the store succeeds and the data is overwritten, 49 + or the store fails AND the page is invalidated. This ensures stale data may 50 + never be obtained from frontswap. 51 + 52 + If properly configured, monitoring of frontswap is done via debugfs in 53 + the /sys/kernel/debug/frontswap directory. The effectiveness of 54 + frontswap can be measured (across all swap devices) with: 55 + 56 + failed_stores - how many store attempts have failed 57 + loads - how many loads were attempted (all should succeed) 58 + succ_stores - how many store attempts have succeeded 59 + invalidates - how many invalidates were attempted 60 + 61 + A backend implementation may provide additional metrics. 
62 + 63 + FAQ 64 + 65 + 1) Where's the value? 66 + 67 + When a workload starts swapping, performance falls through the floor. 68 + Frontswap significantly increases performance in many such workloads by 69 + providing a clean, dynamic interface to read and write swap pages to 70 + "transcendent memory" that is otherwise not directly addressable to the kernel. 71 + This interface is ideal when data is transformed to a different form 72 + and size (such as with compression) or secretly moved (as might be 73 + useful for write-balancing for some RAM-like devices). Swap pages (and 74 + evicted page-cache pages) are a great use for this kind of slower-than-RAM- 75 + but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and 76 + cleancache) interface to transcendent memory provides a nice way to read 77 + and write -- and indirectly "name" -- the pages. 78 + 79 + Frontswap -- and cleancache -- with a fairly small impact on the kernel, 80 + provides a huge amount of flexibility for more dynamic, flexible RAM 81 + utilization in various system configurations: 82 + 83 + In the single kernel case, aka "zcache", pages are compressed and 84 + stored in local memory, thus increasing the total anonymous pages 85 + that can be safely kept in RAM. Zcache essentially trades off CPU 86 + cycles used in compression/decompression for better memory utilization. 87 + Benchmarks have shown little or no impact when memory pressure is 88 + low while providing a significant performance improvement (25%+) 89 + on some workloads under high memory pressure. 90 + 91 + "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory 92 + support for clustered systems. Frontswap pages are locally compressed 93 + as in zcache, but then "remotified" to another system's RAM. This 94 + allows RAM to be dynamically load-balanced back-and-forth as needed, 95 + i.e. when system A is overcommitted, it can swap to system B, and 96 + vice versa. 
RAMster can also be configured as a memory server so 97 + many servers in a cluster can swap, dynamically as needed, to a single 98 + server configured with a large amount of RAM... without pre-configuring 99 + how much of the RAM is available for each of the clients! 100 + 101 + In the virtual case, the whole point of virtualization is to statistically 102 + multiplex physical resources across the varying demands of multiple 103 + virtual machines. This is really hard to do with RAM and efforts to do 104 + it well with no kernel changes have essentially failed (except in some 105 + well-publicized special-case workloads). 106 + Specifically, the Xen Transcendent Memory backend allows otherwise 107 + "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple 108 + virtual machines, but the pages can be compressed and deduplicated to 109 + optimize RAM utilization. And when guest OS's are induced to surrender 110 + underutilized RAM (e.g. with "selfballooning"), sudden unexpected 111 + memory pressure may result in swapping; frontswap allows those pages 112 + to be swapped to and from hypervisor RAM (if overall host system memory 113 + conditions allow), thus mitigating the potentially awful performance impact 114 + of unplanned swapping. 115 + 116 + A KVM implementation is underway and has been RFC'ed to lkml. And, 117 + using frontswap, investigation is also underway on the use of NVM as 118 + a memory extension technology. 119 + 120 + 2) Sure there may be performance advantages in some situations, but 121 + what's the space/time overhead of frontswap? 122 + 123 + If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into 124 + nothingness and the only overhead is a few extra bytes per swapon'ed 125 + swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" 126 + registers, there is one extra global variable compared to zero for 127 + every swap page read or written. 
If CONFIG_FRONTSWAP is enabled 128 + AND a frontswap backend registers AND the backend fails every "store" 129 + request (i.e. provides no memory despite claiming it might), 130 + CPU overhead is still negligible -- and since every frontswap fail 131 + precedes a swap page write-to-disk, the system is highly likely 132 + to be I/O bound and using a small fraction of a percent of a CPU 133 + will be irrelevant anyway. 134 + 135 + As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend 136 + registers, one bit is allocated for every swap page for every swap 137 + device that is swapon'd. This is added to the EIGHT bits (which 138 + was sixteen until about 2.6.34) that the kernel already allocates 139 + for every swap page for every swap device that is swapon'd. (Hugh 140 + Dickins has observed that frontswap could probably steal one of 141 + the existing eight bits, but let's worry about that minor optimization 142 + later.) For very large swap disks (which are rare) on a standard 143 + 4K pagesize, this is 1MB per 32GB swap. 144 + 145 + When swap pages are stored in transcendent memory instead of written 146 + out to disk, there is a side effect that this may create more memory 147 + pressure that can potentially outweigh the other advantages. A 148 + backend, such as zcache, must implement policies to carefully (but 149 + dynamically) manage memory limits to ensure this doesn't happen. 150 + 151 + 3) OK, how about a quick overview of what this frontswap patch does 152 + in terms that a kernel hacker can grok? 153 + 154 + Let's assume that a frontswap "backend" has registered during 155 + kernel initialization; this registration indicates that this 156 + frontswap backend has access to some "memory" that is not directly 157 + accessible by the kernel. Exactly how much memory it provides is 158 + entirely dynamic and random. 
159 + 160 + Whenever a swap-device is swapon'd frontswap_init() is called, 161 + passing the swap device number (aka "type") as a parameter. 162 + This notifies frontswap to expect attempts to "store" swap pages 163 + associated with that number. 164 + 165 + Whenever the swap subsystem is readying a page to write to a swap 166 + device (cf. swap_writepage()), frontswap_store is called. Frontswap 167 + consults with the frontswap backend and if the backend says it does NOT 168 + have room, frontswap_store returns -1 and the kernel swaps the page 169 + to the swap device as normal. Note that the response from the frontswap 170 + backend is unpredictable to the kernel; it may choose to never accept a 171 + page, it could accept every ninth page, or it might accept every 172 + page. But if the backend does accept a page, the data from the page 173 + has already been copied and associated with the type and offset, 174 + and the backend guarantees the persistence of the data. In this case, 175 + frontswap sets a bit in the "frontswap_map" for the swap device 176 + corresponding to the page offset on the swap device to which it would 177 + otherwise have written the data. 178 + 179 + When the swap subsystem needs to swap-in a page (swap_readpage()), 180 + it first calls frontswap_load() which checks the frontswap_map to 181 + see if the page was earlier accepted by the frontswap backend. If 182 + it was, the page of data is filled from the frontswap backend and 183 + the swap-in is complete. If not, the normal swap-in code is 184 + executed to obtain the page of data from the real swap device. 185 + 186 + So every time the frontswap backend accepts a page, a swap device write 187 + and (potentially) a swap device read are replaced by a "frontswap backend 188 + store" and (possibly) a "frontswap backend load", which are presumably much 189 + faster. 
190 + 191 + 4) Can't frontswap be configured as a "special" swap device that is 192 + just higher priority than any real swap device (e.g. like zswap, 193 + or maybe swap-over-nbd/NFS)? 194 + 195 + No. First, the existing swap subsystem doesn't allow for any kind of 196 + swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, 197 + but this would require fairly drastic changes. Even if it were 198 + rewritten, the existing swap subsystem uses the block I/O layer which 199 + assumes a swap device is fixed size and any page in it is linearly 200 + addressable. Frontswap barely touches the existing swap subsystem, 201 + and works around the constraints of the block I/O subsystem to provide 202 + a great deal of flexibility and dynamicity. 203 + 204 + For example, the acceptance of any swap page by the frontswap backend is 205 + entirely unpredictable. This is critical to the definition of frontswap 206 + backends because it grants completely dynamic discretion to the 207 + backend. In zcache, one cannot know a priori how compressible a page is. 208 + "Poorly" compressible pages can be rejected, and "poorly" can itself be 209 + defined dynamically depending on current memory constraints. 210 + 211 + Further, frontswap is entirely synchronous whereas a real swap 212 + device is, by definition, asynchronous and uses block I/O. The 213 + block I/O layer is not only unnecessary, but may perform "optimizations" 214 + that are inappropriate for a RAM-oriented device including delaying 215 + the write of some pages for a significant amount of time. Synchrony is 216 + required to ensure the dynamicity of the backend and to avoid thorny race 217 + conditions that would unnecessarily and greatly complicate frontswap 218 + and/or the block I/O subsystem. That said, only the initial "store" 219 + and "load" operations need be synchronous. A separate asynchronous thread 220 + is free to manipulate the pages stored by frontswap. 
For example, 221 + the "remotification" thread in RAMster uses standard asynchronous 222 + kernel sockets to move compressed frontswap pages to a remote machine. 223 + Similarly, a KVM guest-side implementation could do in-guest compression 224 + and use "batched" hypercalls. 225 + 226 + In a virtualized environment, the dynamicity allows the hypervisor 227 + (or host OS) to do "intelligent overcommit". For example, it can 228 + choose to accept pages only until host-swapping might be imminent, 229 + then force guests to do their own swapping. 230 + 231 + There is a downside to the transcendent memory specifications for 232 + frontswap: Since any "store" might fail, there must always be a real 233 + slot on a real swap device to swap the page. Thus frontswap must be 234 + implemented as a "shadow" to every swapon'd device with the potential 235 + capability of holding every page that the swap device might have held 236 + and the possibility that it might hold no pages at all. This means 237 + that frontswap cannot contain more pages than the total of swapon'd 238 + swap devices. For example, if NO swap device is configured on some 239 + installation, frontswap is useless. Swapless portable devices 240 + can still use frontswap but a backend for such devices must configure 241 + some kind of "ghost" swap device and ensure that it is never used. 242 + 243 + 5) Why this weird definition about "duplicate stores"? If a page 244 + has been previously successfully stored, can't it always be 245 + successfully overwritten? 246 + 247 + Nearly always it can, but no, sometimes it cannot. Consider an example 248 + where data is compressed and the original 4K page has been compressed 249 + to 1K. Now an attempt is made to overwrite the page with data that 250 + is non-compressible and so would take the entire 4K. But the backend 251 + has no more space. In this case, the store must be rejected. 
Whenever 252 + frontswap rejects a store that would overwrite, it also must invalidate 253 + the old data and ensure that it is no longer accessible. Since the 254 + swap subsystem then writes the new data to the real swap device, 255 + this is the correct course of action to ensure coherency. 256 + 257 + 6) What is frontswap_shrink for? 258 + 259 + When the (non-frontswap) swap subsystem swaps out a page to a real 260 + swap device, that page is only taking up low-value pre-allocated disk 261 + space. But if frontswap has placed a page in transcendent memory, that 262 + page may be taking up valuable real estate. The frontswap_shrink 263 + routine allows code outside of the swap subsystem to force pages out 264 + of the memory managed by frontswap and back into kernel-addressable memory. 265 + For example, in RAMster, a "suction driver" thread will attempt 266 + to "repatriate" pages sent to a remote machine back to the local machine; 267 + this is driven using the frontswap_shrink mechanism when memory pressure 268 + subsides. 269 + 270 + 7) Why does the frontswap patch create the new include file swapfile.h? 271 + 272 + The frontswap code depends on some swap-subsystem-internal data 273 + structures that have, over the years, moved back and forth between 274 + static and global. This seemed a reasonable compromise: Define 275 + them as global but declare them in a new include file that isn't 276 + included by the large number of source files that include swap.h. 277 + 278 + Dan Magenheimer, last updated April 9, 2012
+7
MAINTAINERS
··· 2930 2930 F: include/linux/freezer.h 2931 2931 F: kernel/freezer.c 2932 2932 2933 + FRONTSWAP API 2934 + M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 2935 + L: linux-kernel@vger.kernel.org 2936 + S: Maintained 2937 + F: mm/frontswap.c 2938 + F: include/linux/frontswap.h 2939 + 2933 2940 FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS 2934 2941 M: David Howells <dhowells@redhat.com> 2935 2942 L: linux-cachefs@redhat.com
+4 -4
drivers/staging/ramster/zcache-main.c
··· 3002 3002 return oid; 3003 3003 } 3004 3004 3005 - static int zcache_frontswap_put_page(unsigned type, pgoff_t offset, 3005 + static int zcache_frontswap_store(unsigned type, pgoff_t offset, 3006 3006 struct page *page) 3007 3007 { 3008 3008 u64 ind64 = (u64)offset; ··· 3025 3025 3026 3026 /* returns 0 if the page was successfully gotten from frontswap, -1 if 3027 3027 * was not present (should never happen!) */ 3028 - static int zcache_frontswap_get_page(unsigned type, pgoff_t offset, 3028 + static int zcache_frontswap_load(unsigned type, pgoff_t offset, 3029 3029 struct page *page) 3030 3030 { 3031 3031 u64 ind64 = (u64)offset; ··· 3080 3080 } 3081 3081 3082 3082 static struct frontswap_ops zcache_frontswap_ops = { 3083 - .put_page = zcache_frontswap_put_page, 3084 - .get_page = zcache_frontswap_get_page, 3083 + .store = zcache_frontswap_store, 3084 + .load = zcache_frontswap_load, 3085 3085 .invalidate_page = zcache_frontswap_flush_page, 3086 3086 .invalidate_area = zcache_frontswap_flush_area, 3087 3087 .init = zcache_frontswap_init
+5 -5
drivers/staging/zcache/zcache-main.c
··· 1835 1835 * Swizzling increases objects per swaptype, increasing tmem concurrency 1836 1836 * for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS 1837 1837 * Setting SWIZ_BITS to 27 basically reconstructs the swap entry from 1838 - * frontswap_get_page(), but has side-effects. Hence using 8. 1838 + * frontswap_load(), but has side-effects. Hence using 8. 1839 1839 */ 1840 1840 #define SWIZ_BITS 8 1841 1841 #define SWIZ_MASK ((1 << SWIZ_BITS) - 1) ··· 1849 1849 return oid; 1850 1850 } 1851 1851 1852 - static int zcache_frontswap_put_page(unsigned type, pgoff_t offset, 1852 + static int zcache_frontswap_store(unsigned type, pgoff_t offset, 1853 1853 struct page *page) 1854 1854 { 1855 1855 u64 ind64 = (u64)offset; ··· 1870 1870 1871 1871 /* returns 0 if the page was successfully gotten from frontswap, -1 if 1872 1872 * was not present (should never happen!) */ 1873 - static int zcache_frontswap_get_page(unsigned type, pgoff_t offset, 1873 + static int zcache_frontswap_load(unsigned type, pgoff_t offset, 1874 1874 struct page *page) 1875 1875 { 1876 1876 u64 ind64 = (u64)offset; ··· 1919 1919 } 1920 1920 1921 1921 static struct frontswap_ops zcache_frontswap_ops = { 1922 - .put_page = zcache_frontswap_put_page, 1923 - .get_page = zcache_frontswap_get_page, 1922 + .store = zcache_frontswap_store, 1923 + .load = zcache_frontswap_load, 1924 1924 .invalidate_page = zcache_frontswap_flush_page, 1925 1925 .invalidate_area = zcache_frontswap_flush_area, 1926 1926 .init = zcache_frontswap_init
+4 -4
drivers/xen/tmem.c
··· 269 269 } 270 270 271 271 /* returns 0 if the page was successfully put into frontswap, -1 if not */ 272 - static int tmem_frontswap_put_page(unsigned type, pgoff_t offset, 272 + static int tmem_frontswap_store(unsigned type, pgoff_t offset, 273 273 struct page *page) 274 274 { 275 275 u64 ind64 = (u64)offset; ··· 295 295 * returns 0 if the page was successfully gotten from frontswap, -1 if 296 296 * was not present (should never happen!) 297 297 */ 298 - static int tmem_frontswap_get_page(unsigned type, pgoff_t offset, 298 + static int tmem_frontswap_load(unsigned type, pgoff_t offset, 299 299 struct page *page) 300 300 { 301 301 u64 ind64 = (u64)offset; ··· 362 362 __setup("nofrontswap", no_frontswap); 363 363 364 364 static struct frontswap_ops __initdata tmem_frontswap_ops = { 365 - .put_page = tmem_frontswap_put_page, 366 - .get_page = tmem_frontswap_get_page, 365 + .store = tmem_frontswap_store, 366 + .load = tmem_frontswap_load, 367 367 .invalidate_page = tmem_frontswap_flush_page, 368 368 .invalidate_area = tmem_frontswap_flush_area, 369 369 .init = tmem_frontswap_init
+127
include/linux/frontswap.h
··· 1 + #ifndef _LINUX_FRONTSWAP_H 2 + #define _LINUX_FRONTSWAP_H 3 + 4 + #include <linux/swap.h> 5 + #include <linux/mm.h> 6 + #include <linux/bitops.h> 7 + 8 + struct frontswap_ops { 9 + void (*init)(unsigned); 10 + int (*store)(unsigned, pgoff_t, struct page *); 11 + int (*load)(unsigned, pgoff_t, struct page *); 12 + void (*invalidate_page)(unsigned, pgoff_t); 13 + void (*invalidate_area)(unsigned); 14 + }; 15 + 16 + extern bool frontswap_enabled; 17 + extern struct frontswap_ops 18 + frontswap_register_ops(struct frontswap_ops *ops); 19 + extern void frontswap_shrink(unsigned long); 20 + extern unsigned long frontswap_curr_pages(void); 21 + extern void frontswap_writethrough(bool); 22 + 23 + extern void __frontswap_init(unsigned type); 24 + extern int __frontswap_store(struct page *page); 25 + extern int __frontswap_load(struct page *page); 26 + extern void __frontswap_invalidate_page(unsigned, pgoff_t); 27 + extern void __frontswap_invalidate_area(unsigned); 28 + 29 + #ifdef CONFIG_FRONTSWAP 30 + 31 + static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset) 32 + { 33 + bool ret = false; 34 + 35 + if (frontswap_enabled && sis->frontswap_map) 36 + ret = test_bit(offset, sis->frontswap_map); 37 + return ret; 38 + } 39 + 40 + static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset) 41 + { 42 + if (frontswap_enabled && sis->frontswap_map) 43 + set_bit(offset, sis->frontswap_map); 44 + } 45 + 46 + static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset) 47 + { 48 + if (frontswap_enabled && sis->frontswap_map) 49 + clear_bit(offset, sis->frontswap_map); 50 + } 51 + 52 + static inline void frontswap_map_set(struct swap_info_struct *p, 53 + unsigned long *map) 54 + { 55 + p->frontswap_map = map; 56 + } 57 + 58 + static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 59 + { 60 + return p->frontswap_map; 61 + } 62 + #else 63 + /* all inline routines become no-ops and all externs are 
ignored */ 64 + 65 + #define frontswap_enabled (0) 66 + 67 + static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset) 68 + { 69 + return false; 70 + } 71 + 72 + static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset) 73 + { 74 + } 75 + 76 + static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset) 77 + { 78 + } 79 + 80 + static inline void frontswap_map_set(struct swap_info_struct *p, 81 + unsigned long *map) 82 + { 83 + } 84 + 85 + static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 86 + { 87 + return NULL; 88 + } 89 + #endif 90 + 91 + static inline int frontswap_store(struct page *page) 92 + { 93 + int ret = -1; 94 + 95 + if (frontswap_enabled) 96 + ret = __frontswap_store(page); 97 + return ret; 98 + } 99 + 100 + static inline int frontswap_load(struct page *page) 101 + { 102 + int ret = -1; 103 + 104 + if (frontswap_enabled) 105 + ret = __frontswap_load(page); 106 + return ret; 107 + } 108 + 109 + static inline void frontswap_invalidate_page(unsigned type, pgoff_t offset) 110 + { 111 + if (frontswap_enabled) 112 + __frontswap_invalidate_page(type, offset); 113 + } 114 + 115 + static inline void frontswap_invalidate_area(unsigned type) 116 + { 117 + if (frontswap_enabled) 118 + __frontswap_invalidate_area(type); 119 + } 120 + 121 + static inline void frontswap_init(unsigned type) 122 + { 123 + if (frontswap_enabled) 124 + __frontswap_init(type); 125 + } 126 + 127 + #endif /* _LINUX_FRONTSWAP_H */
+4
include/linux/swap.h
··· 197 197 struct block_device *bdev; /* swap device or bdev of swap file */ 198 198 struct file *swap_file; /* seldom referenced */ 199 199 unsigned int old_block_size; /* seldom referenced */ 200 + #ifdef CONFIG_FRONTSWAP 201 + unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 202 + atomic_t frontswap_pages; /* frontswap pages in-use counter */ 203 + #endif 200 204 }; 201 205 202 206 struct swap_list_t {
+13
include/linux/swapfile.h
··· 1 + #ifndef _LINUX_SWAPFILE_H 2 + #define _LINUX_SWAPFILE_H 3 + 4 + /* 5 + * these were static in swapfile.c but frontswap.c needs them and we don't 6 + * want to expose them to the dozens of source files that include swap.h 7 + */ 8 + extern spinlock_t swap_lock; 9 + extern struct swap_list_t swap_list; 10 + extern struct swap_info_struct *swap_info[]; 11 + extern int try_to_unuse(unsigned int, bool, unsigned long); 12 + 13 + #endif /* _LINUX_SWAPFILE_H */
+17
mm/Kconfig
··· 389 389 in a negligible performance hit. 390 390 391 391 If unsure, say Y to enable cleancache 392 + 393 + config FRONTSWAP 394 + bool "Enable frontswap to cache swap pages if tmem is present" 395 + depends on SWAP 396 + default n 397 + help 398 + Frontswap is so named because it can be thought of as the opposite 399 + of a "backing" store for a swap device. The data is stored into 400 + "transcendent memory", memory that is not directly accessible or 401 + addressable by the kernel and is of unknown and possibly 402 + time-varying size. When space in transcendent memory is available, 403 + a significant swap I/O reduction may be achieved. When none is 404 + available, all frontswap calls are reduced to a single pointer- 405 + compare-against-NULL resulting in a negligible performance hit 406 + and swap data is stored as normal on the matching swap device. 407 + 408 + If unsure, say Y to enable frontswap.
+1
mm/Makefile
··· 29 29 30 30 obj-$(CONFIG_BOUNCE) += bounce.o 31 31 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o 32 + obj-$(CONFIG_FRONTSWAP) += frontswap.o 32 33 obj-$(CONFIG_HAS_DMA) += dmapool.o 33 34 obj-$(CONFIG_HUGETLBFS) += hugetlb.o 34 35 obj-$(CONFIG_NUMA) += mempolicy.o
+314
mm/frontswap.c
··· 1 + /* 2 + * Frontswap frontend 3 + * 4 + * This code provides the generic "frontend" layer to call a matching 5 + * "backend" driver implementation of frontswap. See 6 + * Documentation/vm/frontswap.txt for more information. 7 + * 8 + * Copyright (C) 2009-2012 Oracle Corp. All rights reserved. 9 + * Author: Dan Magenheimer 10 + * 11 + * This work is licensed under the terms of the GNU GPL, version 2. 12 + */ 13 + 14 + #include <linux/mm.h> 15 + #include <linux/mman.h> 16 + #include <linux/swap.h> 17 + #include <linux/swapops.h> 18 + #include <linux/proc_fs.h> 19 + #include <linux/security.h> 20 + #include <linux/capability.h> 21 + #include <linux/module.h> 22 + #include <linux/uaccess.h> 23 + #include <linux/debugfs.h> 24 + #include <linux/frontswap.h> 25 + #include <linux/swapfile.h> 26 + 27 + /* 28 + * frontswap_ops is set by frontswap_register_ops to contain the pointers 29 + * to the frontswap "backend" implementation functions. 30 + */ 31 + static struct frontswap_ops frontswap_ops __read_mostly; 32 + 33 + /* 34 + * This global enablement flag reduces overhead on systems where frontswap_ops 35 + * has not been registered, so is preferred to the slower alternative: a 36 + * function call that checks a non-global. 37 + */ 38 + bool frontswap_enabled __read_mostly; 39 + EXPORT_SYMBOL(frontswap_enabled); 40 + 41 + /* 42 + * If enabled, frontswap_store will return failure even on success. As 43 + * a result, the swap subsystem will always write the page to swap, in 44 + * effect converting frontswap into a writethrough cache. In this mode, 45 + * there is no direct reduction in swap writes, but a frontswap backend 46 + * can unilaterally "reclaim" any pages in use with no data loss, thus 47 + * providing increased control over maximum memory usage due to frontswap. 
48 + */ 49 + static bool frontswap_writethrough_enabled __read_mostly; 50 + 51 + #ifdef CONFIG_DEBUG_FS 52 + /* 53 + * Counters available via /sys/kernel/debug/frontswap (if debugfs is 54 + * properly configured). These are for information only so are not protected 55 + * against increment races. 56 + */ 57 + static u64 frontswap_loads; 58 + static u64 frontswap_succ_stores; 59 + static u64 frontswap_failed_stores; 60 + static u64 frontswap_invalidates; 61 + 62 + static inline void inc_frontswap_loads(void) { 63 + frontswap_loads++; 64 + } 65 + static inline void inc_frontswap_succ_stores(void) { 66 + frontswap_succ_stores++; 67 + } 68 + static inline void inc_frontswap_failed_stores(void) { 69 + frontswap_failed_stores++; 70 + } 71 + static inline void inc_frontswap_invalidates(void) { 72 + frontswap_invalidates++; 73 + } 74 + #else 75 + static inline void inc_frontswap_loads(void) { } 76 + static inline void inc_frontswap_succ_stores(void) { } 77 + static inline void inc_frontswap_failed_stores(void) { } 78 + static inline void inc_frontswap_invalidates(void) { } 79 + #endif 80 + /* 81 + * Register operations for frontswap, returning previous thus allowing 82 + * detection of multiple backends and possible nesting. 83 + */ 84 + struct frontswap_ops frontswap_register_ops(struct frontswap_ops *ops) 85 + { 86 + struct frontswap_ops old = frontswap_ops; 87 + 88 + frontswap_ops = *ops; 89 + frontswap_enabled = true; 90 + return old; 91 + } 92 + EXPORT_SYMBOL(frontswap_register_ops); 93 + 94 + /* 95 + * Enable/disable frontswap writethrough (see above). 96 + */ 97 + void frontswap_writethrough(bool enable) 98 + { 99 + frontswap_writethrough_enabled = enable; 100 + } 101 + EXPORT_SYMBOL(frontswap_writethrough); 102 + 103 + /* 104 + * Called when a swap device is swapon'd. 
105 + */ 106 + void __frontswap_init(unsigned type) 107 + { 108 + struct swap_info_struct *sis = swap_info[type]; 109 + 110 + BUG_ON(sis == NULL); 111 + if (sis->frontswap_map == NULL) 112 + return; 113 + if (frontswap_enabled) 114 + (*frontswap_ops.init)(type); 115 + } 116 + EXPORT_SYMBOL(__frontswap_init); 117 + 118 + /* 119 + * "Store" data from a page to frontswap and associate it with the page's 120 + * swaptype and offset. Page must be locked and in the swap cache. 121 + * If frontswap already contains a page with matching swaptype and 122 + * offset, the frontswap implementation may either overwrite the data and 123 + * return success or invalidate the page from frontswap and return failure. 124 + */ 125 + int __frontswap_store(struct page *page) 126 + { 127 + int ret = -1, dup = 0; 128 + swp_entry_t entry = { .val = page_private(page), }; 129 + int type = swp_type(entry); 130 + struct swap_info_struct *sis = swap_info[type]; 131 + pgoff_t offset = swp_offset(entry); 132 + 133 + BUG_ON(!PageLocked(page)); 134 + BUG_ON(sis == NULL); 135 + if (frontswap_test(sis, offset)) 136 + dup = 1; 137 + ret = (*frontswap_ops.store)(type, offset, page); 138 + if (ret == 0) { 139 + frontswap_set(sis, offset); 140 + inc_frontswap_succ_stores(); 141 + if (!dup) 142 + atomic_inc(&sis->frontswap_pages); 143 + } else if (dup) { 144 + /* 145 + * a failed dup always results in automatic invalidation of 146 + * the (older) page from frontswap 147 + */ 148 + frontswap_clear(sis, offset); 149 + atomic_dec(&sis->frontswap_pages); 150 + inc_frontswap_failed_stores(); 151 + } else 152 + inc_frontswap_failed_stores(); 153 + if (frontswap_writethrough_enabled) 154 + /* report failure so swap also writes to the swap device */ 155 + ret = -1; 156 + return ret; 157 + } 158 + EXPORT_SYMBOL(__frontswap_store); 159 + 160 + /* 161 + * "Load" data from frontswap associated with the swaptype and offset that were 162 + * specified when the data was stored to frontswap and use it to fill the 163 + * specified page with
data. Page must be locked and in the swap cache. 164 + */ 165 + int __frontswap_load(struct page *page) 166 + { 167 + int ret = -1; 168 + swp_entry_t entry = { .val = page_private(page), }; 169 + int type = swp_type(entry); 170 + struct swap_info_struct *sis = swap_info[type]; 171 + pgoff_t offset = swp_offset(entry); 172 + 173 + BUG_ON(!PageLocked(page)); 174 + BUG_ON(sis == NULL); 175 + if (frontswap_test(sis, offset)) 176 + ret = (*frontswap_ops.load)(type, offset, page); 177 + if (ret == 0) 178 + inc_frontswap_loads(); 179 + return ret; 180 + } 181 + EXPORT_SYMBOL(__frontswap_load); 182 + 183 + /* 184 + * Invalidate any data from frontswap associated with the specified swaptype 185 + * and offset so that a subsequent "load" will fail. 186 + */ 187 + void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 188 + { 189 + struct swap_info_struct *sis = swap_info[type]; 190 + 191 + BUG_ON(sis == NULL); 192 + if (frontswap_test(sis, offset)) { 193 + (*frontswap_ops.invalidate_page)(type, offset); 194 + atomic_dec(&sis->frontswap_pages); 195 + frontswap_clear(sis, offset); 196 + inc_frontswap_invalidates(); 197 + } 198 + } 199 + EXPORT_SYMBOL(__frontswap_invalidate_page); 200 + 201 + /* 202 + * Invalidate all data from frontswap associated with all offsets for the 203 + * specified swaptype.
204 + */ 205 + void __frontswap_invalidate_area(unsigned type) 206 + { 207 + struct swap_info_struct *sis = swap_info[type]; 208 + 209 + BUG_ON(sis == NULL); 210 + if (sis->frontswap_map == NULL) 211 + return; 212 + (*frontswap_ops.invalidate_area)(type); 213 + atomic_set(&sis->frontswap_pages, 0); 214 + memset(sis->frontswap_map, 0, sis->max / sizeof(long)); 215 + } 216 + EXPORT_SYMBOL(__frontswap_invalidate_area); 217 + 218 + /* 219 + * Frontswap, like a true swap device, may unnecessarily retain pages 220 + * under certain circumstances; "shrink" frontswap is essentially a 221 + * "partial swapoff" and works by calling try_to_unuse to unuse enough 222 + * frontswap pages to reduce -- subject to memory constraints -- the 223 + * number of pages in frontswap to the number given in the parameter 224 + * target_pages. 225 + */ 226 + void frontswap_shrink(unsigned long target_pages) 227 + { 228 + struct swap_info_struct *si = NULL; 229 + int si_frontswap_pages; 230 + unsigned long total_pages = 0, total_pages_to_unuse; 231 + unsigned long pages = 0, pages_to_unuse = 0; 232 + int type; 233 + bool locked = false; 234 + 235 + /* 236 + * we don't want to hold swap_lock while doing a very 237 + * lengthy try_to_unuse, but swap_list may change 238 + * so restart scan from swap_list.head each time 239 + */ 240 + spin_lock(&swap_lock); 241 + locked = true; 242 + total_pages = 0; 243 + for (type = swap_list.head; type >= 0; type = si->next) { 244 + si = swap_info[type]; 245 + total_pages += atomic_read(&si->frontswap_pages); 246 + } 247 + if (total_pages <= target_pages) 248 + goto out; 249 + total_pages_to_unuse = total_pages - target_pages; 250 + for (type = swap_list.head; type >= 0; type = si->next) { 251 + si = swap_info[type]; 252 + si_frontswap_pages = atomic_read(&si->frontswap_pages); 253 + if (total_pages_to_unuse < si_frontswap_pages) 254 + pages = pages_to_unuse = total_pages_to_unuse; 255 + else { 256 + pages = si_frontswap_pages; 257 +
pages_to_unuse = 0; /* unuse all */ 258 + } 259 + /* ensure there is enough RAM to fetch pages from frontswap */ 260 + if (security_vm_enough_memory_mm(current->mm, pages)) 261 + continue; 262 + vm_unacct_memory(pages); 263 + break; 264 + } 265 + if (type < 0) 266 + goto out; 267 + locked = false; 268 + spin_unlock(&swap_lock); 269 + try_to_unuse(type, true, pages_to_unuse); 270 + out: 271 + if (locked) 272 + spin_unlock(&swap_lock); 273 + return; 274 + } 275 + EXPORT_SYMBOL(frontswap_shrink); 276 + 277 + /* 278 + * Count and return the number of frontswap pages across all 279 + * swap devices. This is exported so that backend drivers can 280 + * determine current usage without reading debugfs. 281 + */ 282 + unsigned long frontswap_curr_pages(void) 283 + { 284 + int type; 285 + unsigned long totalpages = 0; 286 + struct swap_info_struct *si = NULL; 287 + 288 + spin_lock(&swap_lock); 289 + for (type = swap_list.head; type >= 0; type = si->next) { 290 + si = swap_info[type]; 291 + totalpages += atomic_read(&si->frontswap_pages); 292 + } 293 + spin_unlock(&swap_lock); 294 + return totalpages; 295 + } 296 + EXPORT_SYMBOL(frontswap_curr_pages); 297 + 298 + static int __init init_frontswap(void) 299 + { 300 + #ifdef CONFIG_DEBUG_FS 301 + struct dentry *root = debugfs_create_dir("frontswap", NULL); 302 + if (root == NULL) 303 + return -ENXIO; 304 + debugfs_create_u64("loads", S_IRUGO, root, &frontswap_loads); 305 + debugfs_create_u64("succ_stores", S_IRUGO, root, &frontswap_succ_stores); 306 + debugfs_create_u64("failed_stores", S_IRUGO, root, 307 + &frontswap_failed_stores); 308 + debugfs_create_u64("invalidates", S_IRUGO, 309 + root, &frontswap_invalidates); 310 + #endif 311 + return 0; 312 + } 313 + 314 + module_init(init_frontswap);
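The store path above keeps per-device bookkeeping: a bit per offset in frontswap_map plus the frontswap_pages counter, with the subtlety that a failed store of a duplicate entry also invalidates the older copy, since the backend's state for that offset can no longer be trusted. A minimal userspace model of that bookkeeping (illustrative only, not the kernel code; `toy_store`, `map`, `frontswap_pages` and `backend_fail` are invented stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OFFSETS 64

static bool map[MAX_OFFSETS];   /* stands in for sis->frontswap_map */
static int frontswap_pages;     /* stands in for sis->frontswap_pages */
static bool backend_fail;       /* forces the backend ->store to fail */

/* returns 0 on success, -1 on failure, like __frontswap_store() */
static int toy_store(unsigned offset)
{
	bool dup = map[offset];

	if (!backend_fail) {
		map[offset] = true;
		if (!dup)
			frontswap_pages++;	/* only count new entries */
		return 0;
	}
	if (dup) {
		/* failed dup: the older copy is invalidated automatically */
		map[offset] = false;
		frontswap_pages--;
	}
	return -1;
}
```

Note how a successful duplicate store leaves the counter untouched, matching the `if (!dup) atomic_inc(...)` branch above.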
+12
mm/page_io.c
··· 18 18 #include <linux/bio.h> 19 19 #include <linux/swapops.h> 20 20 #include <linux/writeback.h> 21 + #include <linux/frontswap.h> 21 22 #include <asm/pgtable.h> 22 23 23 24 static struct bio *get_swap_bio(gfp_t gfp_flags, ··· 99 98 unlock_page(page); 100 99 goto out; 101 100 } 101 + if (frontswap_store(page) == 0) { 102 + set_page_writeback(page); 103 + unlock_page(page); 104 + end_page_writeback(page); 105 + goto out; 106 + } 102 107 bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write); 103 108 if (bio == NULL) { 104 109 set_page_dirty(page); ··· 129 122 130 123 VM_BUG_ON(!PageLocked(page)); 131 124 VM_BUG_ON(PageUptodate(page)); 125 + if (frontswap_load(page) == 0) { 126 + SetPageUptodate(page); 127 + unlock_page(page); 128 + goto out; 129 + } 132 130 bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read); 133 131 if (bio == NULL) { 134 132 unlock_page(page);
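The hooks added to swap_writepage() and swap_readpage() consult frontswap before issuing any bio, and the writethrough mode from frontswap.c works by making frontswap_store() report failure so the disk write still happens. A sketch of that control flow as a userspace model (not kernel code; `fs_store`, `swap_writepage_model`, `writethrough` and `disk_writes` are invented names):

```c
#include <assert.h>
#include <stdbool.h>

static bool writethrough;	/* models frontswap_writethrough_enabled */
static int disk_writes;	/* counts trips down the bio path */

/* models __frontswap_store(): 0 = page captured by frontswap */
static int fs_store(bool backend_ok)
{
	int ret = backend_ok ? 0 : -1;

	if (writethrough)
		ret = -1;	/* report failure so swap also writes the device */
	return ret;
}

/* models the hooked swap_writepage(): frontswap first, bio on failure */
static void swap_writepage_model(bool backend_ok)
{
	if (fs_store(backend_ok) == 0)
		return;		/* page is in frontswap, skip the bio */
	disk_writes++;		/* fall through to get_swap_bio()/submit */
}
```

The read side is symmetric: a frontswap_load() hit marks the page up to date and skips the bio entirely.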
+41 -13
mm/swapfile.c
··· 31 31 #include <linux/memcontrol.h> 32 32 #include <linux/poll.h> 33 33 #include <linux/oom.h> 34 + #include <linux/frontswap.h> 35 + #include <linux/swapfile.h> 34 36 35 37 #include <asm/pgtable.h> 36 38 #include <asm/tlbflush.h> ··· 44 42 static void free_swap_count_continuations(struct swap_info_struct *); 45 43 static sector_t map_swap_entry(swp_entry_t, struct block_device**); 46 44 47 - static DEFINE_SPINLOCK(swap_lock); 45 + DEFINE_SPINLOCK(swap_lock); 48 46 static unsigned int nr_swapfiles; 49 47 long nr_swap_pages; 50 48 long total_swap_pages; ··· 55 53 static const char Bad_offset[] = "Bad swap offset entry "; 56 54 static const char Unused_offset[] = "Unused swap offset entry "; 57 55 58 - static struct swap_list_t swap_list = {-1, -1}; 56 + struct swap_list_t swap_list = {-1, -1}; 59 57 60 - static struct swap_info_struct *swap_info[MAX_SWAPFILES]; 58 + struct swap_info_struct *swap_info[MAX_SWAPFILES]; 61 59 62 60 static DEFINE_MUTEX(swapon_mutex); 63 61 ··· 558 556 swap_list.next = p->type; 559 557 nr_swap_pages++; 560 558 p->inuse_pages--; 559 + frontswap_invalidate_page(p->type, offset); 561 560 if ((p->flags & SWP_BLKDEV) && 562 561 disk->fops->swap_slot_free_notify) 563 562 disk->fops->swap_slot_free_notify(p->bdev, offset); ··· 988 985 } 989 986 990 987 /* 991 - * Scan swap_map from current position to next entry still in use. 988 + * Scan swap_map (or frontswap_map if frontswap parameter is true) 989 + * from current position to next entry still in use. 992 990 * Recycle to start on reaching the end, returning 0 when empty. 
993 991 */ 994 992 static unsigned int find_next_to_unuse(struct swap_info_struct *si, 995 - unsigned int prev) 993 + unsigned int prev, bool frontswap) 996 994 { 997 995 unsigned int max = si->max; 998 996 unsigned int i = prev; ··· 1019 1015 prev = 0; 1020 1016 i = 1; 1021 1017 } 1018 + if (frontswap) { 1019 + if (frontswap_test(si, i)) 1020 + break; 1021 + else 1022 + continue; 1023 + } 1022 1024 count = si->swap_map[i]; 1023 1025 if (count && swap_count(count) != SWAP_MAP_BAD) 1024 1026 break; ··· 1036 1026 * We completely avoid races by reading each swap page in advance, 1037 1027 * and then search for the process using it. All the necessary 1038 1028 * page table adjustments can then be made atomically. 1029 + * 1030 + * if the boolean frontswap is true, only unuse pages_to_unuse pages; 1031 + * pages_to_unuse==0 means all pages; ignored if frontswap is false 1039 1032 */ 1040 - static int try_to_unuse(unsigned int type) 1033 + int try_to_unuse(unsigned int type, bool frontswap, 1034 + unsigned long pages_to_unuse) 1041 1035 { 1042 1036 struct swap_info_struct *si = swap_info[type]; 1043 1037 struct mm_struct *start_mm; ··· 1074 1060 * one pass through swap_map is enough, but not necessarily: 1075 1061 * there are races when an instance of an entry might be missed. 1076 1062 */ 1077 - while ((i = find_next_to_unuse(si, i)) != 0) { 1063 + while ((i = find_next_to_unuse(si, i, frontswap)) != 0) { 1078 1064 if (signal_pending(current)) { 1079 1065 retval = -EINTR; 1080 1066 break; ··· 1241 1227 * interactive performance. 
1242 1228 */ 1243 1229 cond_resched(); 1230 + if (frontswap && pages_to_unuse > 0) { 1231 + if (!--pages_to_unuse) 1232 + break; 1233 + } 1244 1234 } 1245 1235 1246 1236 mmput(start_mm); ··· 1504 1486 } 1505 1487 1506 1488 static void enable_swap_info(struct swap_info_struct *p, int prio, 1507 - unsigned char *swap_map) 1489 + unsigned char *swap_map, 1490 + unsigned long *frontswap_map) 1508 1491 { 1509 1492 int i, prev; 1510 1493 ··· 1515 1496 else 1516 1497 p->prio = --least_priority; 1517 1498 p->swap_map = swap_map; 1499 + frontswap_map_set(p, frontswap_map); 1518 1500 p->flags |= SWP_WRITEOK; 1519 1501 nr_swap_pages += p->pages; 1520 1502 total_swap_pages += p->pages; ··· 1532 1512 swap_list.head = swap_list.next = p->type; 1533 1513 else 1534 1514 swap_info[prev]->next = p->type; 1515 + frontswap_init(p->type); 1535 1516 spin_unlock(&swap_lock); 1536 1517 } 1537 1518 ··· 1606 1585 spin_unlock(&swap_lock); 1607 1586 1608 1587 oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); 1609 - err = try_to_unuse(type); 1588 + err = try_to_unuse(type, false, 0); /* force all pages to be unused */ 1610 1589 compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); 1611 1590 1612 1591 if (err) { ··· 1617 1596 * sys_swapoff for this swap_info_struct at this point. 
1618 1597 */ 1619 1598 /* re-insert swap space back into swap_list */ 1620 - enable_swap_info(p, p->prio, p->swap_map); 1599 + enable_swap_info(p, p->prio, p->swap_map, frontswap_map_get(p)); 1621 1600 goto out_dput; 1622 1601 } 1623 1602 ··· 1643 1622 swap_map = p->swap_map; 1644 1623 p->swap_map = NULL; 1645 1624 p->flags = 0; 1625 + frontswap_invalidate_area(type); 1646 1626 spin_unlock(&swap_lock); 1647 1627 mutex_unlock(&swapon_mutex); 1648 1628 vfree(swap_map); 1629 + vfree(frontswap_map_get(p)); 1649 1630 /* Destroy swap account information */ 1650 1631 swap_cgroup_swapoff(type); 1651 1632 ··· 2011 1988 sector_t span; 2012 1989 unsigned long maxpages; 2013 1990 unsigned char *swap_map = NULL; 1991 + unsigned long *frontswap_map = NULL; 2014 1992 struct page *page = NULL; 2015 1993 struct inode *inode = NULL; 2016 1994 ··· 2095 2071 error = nr_extents; 2096 2072 goto bad_swap; 2097 2073 } 2074 + /* frontswap enabled? set up bit-per-page map for frontswap */ 2075 + if (frontswap_enabled) 2076 + frontswap_map = vzalloc(maxpages / sizeof(long)); 2098 2077 2099 2078 if (p->bdev) { 2100 2079 if (blk_queue_nonrot(bdev_get_queue(p->bdev))) { ··· 2113 2086 if (swap_flags & SWAP_FLAG_PREFER) 2114 2087 prio = 2115 2088 (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; 2116 - enable_swap_info(p, prio, swap_map); 2089 + enable_swap_info(p, prio, swap_map, frontswap_map); 2117 2090 2118 2091 printk(KERN_INFO "Adding %uk swap on %s. " 2119 - "Priority:%d extents:%d across:%lluk %s%s\n", 2092 + "Priority:%d extents:%d across:%lluk %s%s%s\n", 2120 2093 p->pages<<(PAGE_SHIFT-10), name, p->prio, 2121 2094 nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), 2122 2095 (p->flags & SWP_SOLIDSTATE) ? "SS" : "", 2123 - (p->flags & SWP_DISCARDABLE) ? "D" : ""); 2096 + (p->flags & SWP_DISCARDABLE) ? "D" : "", 2097 + (frontswap_map) ? "FS" : ""); 2124 2098 2125 2099 mutex_unlock(&swapon_mutex); 2126 2100 atomic_inc(&proc_poll_event);
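The bit-per-page map allocated in sys_swapon() above with vzalloc(maxpages / sizeof(long)) is maxpages/sizeof(long) bytes, i.e. maxpages * 8 / sizeof(long) bits: exactly one bit per page with 64-bit longs and two bits per page with 32-bit longs, though the truncating division can leave the last few pages without a bit when maxpages is not a multiple of sizeof(long). A quick check of that arithmetic (illustrative only; `map_bits` is an invented helper):

```c
#include <assert.h>
#include <stddef.h>

/* bits provided by a frontswap_map of (maxpages / sizeof_long) bytes */
static unsigned long map_bits(unsigned long maxpages, size_t sizeof_long)
{
	return (maxpages / sizeof_long) * 8;	/* 8 bits per byte */
}
```

On a 64-bit kernel a 4 GiB swap area (maxpages = 1M with 4 KiB pages) therefore gets a 128 KiB map, one bit per page.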