Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm

Pull frontswap feature from Konrad Rzeszutek Wilk:
"Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained
because swapped pages are saved in RAM (or a RAM-like device) instead
of a swap disk. This tag provides the basic infrastructure along with
some changes to the existing backends."

Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.

This pull request came in before the merge window even opened; it got
delayed to after the merge window by me just wanting to make sure it had
actual users. Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and OpenSUSE
users.

Also acked by Rik van Riel, and Konrad points to other people liking it
too. So in it goes.

By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
frontswap: s/put_page/store/g s/get_page/load
MAINTAINER: Add myself for the frontswap API
mm: frontswap: config and doc files
mm: frontswap: core frontswap functionality
mm: frontswap: core swap subsystem hooks and headers
mm: frontswap: add frontswap header file

+827 -26
+278
Documentation/vm/frontswap.txt
··· 1 + Frontswap provides a "transcendent memory" interface for swap pages. 2 + In some environments, dramatic performance savings may be obtained because 3 + swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. 4 + 5 + (Note, frontswap -- and cleancache (merged at 3.0) -- are the "frontends" 6 + and the only necessary changes to the core kernel for transcendent memory; 7 + all other supporting code -- the "backends" -- is implemented as drivers. 8 + See the LWN.net article "Transcendent memory in a nutshell" for a detailed 9 + overview of frontswap and related kernel parts: 10 + https://lwn.net/Articles/454795/ ) 11 + 12 + Frontswap is so named because it can be thought of as the opposite of 13 + a "backing" store for a swap device. The storage is assumed to be 14 + a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming 15 + to the requirements of transcendent memory (such as Xen's "tmem", or 16 + in-kernel compressed memory, aka "zcache", or future RAM-like devices); 17 + this pseudo-RAM device is not directly accessible or addressable by the 18 + kernel and is of unknown and possibly time-varying size. The driver 19 + links itself to frontswap by calling frontswap_register_ops to set the 20 + frontswap_ops funcs appropriately and the functions it provides must 21 + conform to certain policies as follows: 22 + 23 + An "init" prepares the device to receive frontswap pages associated 24 + with the specified swap device number (aka "type"). A "store" will 25 + copy the page to transcendent memory and associate it with the type and 26 + offset associated with the page. A "load" will copy the page, if found, 27 + from transcendent memory into kernel memory, but will NOT remove the page 28 + from transcendent memory. 
An "invalidate_page" will remove the page 29 + from transcendent memory and an "invalidate_area" will remove ALL pages 30 + associated with the swap type (e.g., like swapoff) and notify the "device" 31 + to refuse further stores with that swap type. 32 + 33 + Once a page is successfully stored, a matching load on the page will normally 34 + succeed. So when the kernel finds itself in a situation where it needs 35 + to swap out a page, it first attempts to use frontswap. If the store returns 36 + success, the data has been successfully saved to transcendent memory and 37 + a disk write and, if the data is later read back, a disk read are avoided. 38 + If a store returns failure, transcendent memory has rejected the data, and the 39 + page can be written to swap as usual. 40 + 41 + If a backend chooses, frontswap can be configured as a "writethrough 42 + cache" by calling frontswap_writethrough(). In this mode, the reduction 43 + in swap device writes is lost (and also a non-trivial performance advantage) 44 + in order to allow the backend to arbitrarily "reclaim" space used to 45 + store frontswap pages to more completely manage its memory usage. 46 + 47 + Note that if a page is stored and the page already exists in transcendent memory 48 + (a "duplicate" store), either the store succeeds and the data is overwritten, 49 + or the store fails AND the page is invalidated. This ensures stale data may 50 + never be obtained from frontswap. 51 + 52 + If properly configured, monitoring of frontswap is done via debugfs in 53 + the /sys/kernel/debug/frontswap directory. The effectiveness of 54 + frontswap can be measured (across all swap devices) with: 55 + 56 + failed_stores - how many store attempts have failed 57 + loads - how many loads were attempted (all should succeed) 58 + succ_stores - how many store attempts have succeeded 59 + invalidates - how many invalidates were attempted 60 + 61 + A backend implementation may provide additional metrics. 
62 + 63 + FAQ 64 + 65 + 1) Where's the value? 66 + 67 + When a workload starts swapping, performance falls through the floor. 68 + Frontswap significantly increases performance in many such workloads by 69 + providing a clean, dynamic interface to read and write swap pages to 70 + "transcendent memory" that is otherwise not directly addressable to the kernel. 71 + This interface is ideal when data is transformed to a different form 72 + and size (such as with compression) or secretly moved (as might be 73 + useful for write-balancing for some RAM-like devices). Swap pages (and 74 + evicted page-cache pages) are a great use for this kind of slower-than-RAM- 75 + but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and 76 + cleancache) interface to transcendent memory provides a nice way to read 77 + and write -- and indirectly "name" -- the pages. 78 + 79 + Frontswap -- and cleancache -- with a fairly small impact on the kernel, 80 + provides a huge amount of flexibility for more dynamic, flexible RAM 81 + utilization in various system configurations: 82 + 83 + In the single kernel case, aka "zcache", pages are compressed and 84 + stored in local memory, thus increasing the total anonymous pages 85 + that can be safely kept in RAM. Zcache essentially trades off CPU 86 + cycles used in compression/decompression for better memory utilization. 87 + Benchmarks have shown little or no impact when memory pressure is 88 + low while providing a significant performance improvement (25%+) 89 + on some workloads under high memory pressure. 90 + 91 + "RAMster" builds on zcache by adding "peer-to-peer" transcendent memory 92 + support for clustered systems. Frontswap pages are locally compressed 93 + as in zcache, but then "remotified" to another system's RAM. This 94 + allows RAM to be dynamically load-balanced back-and-forth as needed, 95 + i.e. when system A is overcommitted, it can swap to system B, and 96 + vice versa. 
RAMster can also be configured as a memory server so 97 + many servers in a cluster can swap, dynamically as needed, to a single 98 + server configured with a large amount of RAM... without pre-configuring 99 + how much of the RAM is available for each of the clients! 100 + 101 + In the virtual case, the whole point of virtualization is to statistically 102 + multiplex physical resources across the varying demands of multiple 103 + virtual machines. This is really hard to do with RAM and efforts to do 104 + it well with no kernel changes have essentially failed (except in some 105 + well-publicized special-case workloads). 106 + Specifically, the Xen Transcendent Memory backend allows otherwise 107 + "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple 108 + virtual machines, but the pages can be compressed and deduplicated to 109 + optimize RAM utilization. And when guest OS's are induced to surrender 110 + underutilized RAM (e.g. with "selfballooning"), sudden unexpected 111 + memory pressure may result in swapping; frontswap allows those pages 112 + to be swapped to and from hypervisor RAM (if overall host system memory 113 + conditions allow), thus mitigating the potentially awful performance impact 114 + of unplanned swapping. 115 + 116 + A KVM implementation is underway and has been RFC'ed to lkml. And, 117 + using frontswap, investigation is also underway on the use of NVM as 118 + a memory extension technology. 119 + 120 + 2) Sure there may be performance advantages in some situations, but 121 + what's the space/time overhead of frontswap? 122 + 123 + If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into 124 + nothingness and the only overhead is a few extra bytes per swapon'ed 125 + swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend" 126 + registers, there is one extra global variable compared to zero for 127 + every swap page read or written. 
If CONFIG_FRONTSWAP is enabled 128 + AND a frontswap backend registers AND the backend fails every "store" 129 + request (i.e. provides no memory despite claiming it might), 130 + CPU overhead is still negligible -- and since every frontswap fail 131 + precedes a swap page write-to-disk, the system is highly likely 132 + to be I/O bound and using a small fraction of a percent of a CPU 133 + will be irrelevant anyway. 134 + 135 + As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend 136 + registers, one bit is allocated for every swap page for every swap 137 + device that is swapon'd. This is added to the EIGHT bits (which 138 + was sixteen until about 2.6.34) that the kernel already allocates 139 + for every swap page for every swap device that is swapon'd. (Hugh 140 + Dickins has observed that frontswap could probably steal one of 141 + the existing eight bits, but let's worry about that minor optimization 142 + later.) For very large swap disks (which are rare) on a standard 143 + 4K pagesize, this is 1MB per 32GB swap. 144 + 145 + When swap pages are stored in transcendent memory instead of written 146 + out to disk, there is a side effect that this may create more memory 147 + pressure that can potentially outweigh the other advantages. A 148 + backend, such as zcache, must implement policies to carefully (but 149 + dynamically) manage memory limits to ensure this doesn't happen. 150 + 151 + 3) OK, how about a quick overview of what this frontswap patch does 152 + in terms that a kernel hacker can grok? 153 + 154 + Let's assume that a frontswap "backend" has registered during 155 + kernel initialization; this registration indicates that this 156 + frontswap backend has access to some "memory" that is not directly 157 + accessible by the kernel. Exactly how much memory it provides is 158 + entirely dynamic and random. 
159 + 160 + Whenever a swap-device is swapon'd frontswap_init() is called, 161 + passing the swap device number (aka "type") as a parameter. 162 + This notifies frontswap to expect attempts to "store" swap pages 163 + associated with that number. 164 + 165 + Whenever the swap subsystem is readying a page to write to a swap 166 + device (cf. swap_writepage()), frontswap_store is called. Frontswap 167 + consults with the frontswap backend and if the backend says it does NOT 168 + have room, frontswap_store returns -1 and the kernel swaps the page 169 + to the swap device as normal. Note that the response from the frontswap 170 + backend is unpredictable to the kernel; it may choose to never accept a 171 + page, it could accept every ninth page, or it might accept every 172 + page. But if the backend does accept a page, the data from the page 173 + has already been copied and associated with the type and offset, 174 + and the backend guarantees the persistence of the data. In this case, 175 + frontswap sets a bit in the "frontswap_map" for the swap device 176 + corresponding to the page offset on the swap device to which it would 177 + otherwise have written the data. 178 + 179 + When the swap subsystem needs to swap-in a page (swap_readpage()), 180 + it first calls frontswap_load() which checks the frontswap_map to 181 + see if the page was earlier accepted by the frontswap backend. If 182 + it was, the page of data is filled from the frontswap backend and 183 + the swap-in is complete. If not, the normal swap-in code is 184 + executed to obtain the page of data from the real swap device. 185 + 186 + So every time the frontswap backend accepts a page, a swap device write 187 + and (potentially) a swap device read are replaced by a "frontswap backend 188 + store" and (possibly) a "frontswap backend load", which are presumably much 189 + faster. 
190 + 191 + 4) Can't frontswap be configured as a "special" swap device that is 192 + just higher priority than any real swap device (e.g. like zswap, 193 + or maybe swap-over-nbd/NFS)? 194 + 195 + No. First, the existing swap subsystem doesn't allow for any kind of 196 + swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, 197 + but this would require fairly drastic changes. Even if it were 198 + rewritten, the existing swap subsystem uses the block I/O layer which 199 + assumes a swap device is fixed size and any page in it is linearly 200 + addressable. Frontswap barely touches the existing swap subsystem, 201 + and works around the constraints of the block I/O subsystem to provide 202 + a great deal of flexibility and dynamicity. 203 + 204 + For example, the acceptance of any swap page by the frontswap backend is 205 + entirely unpredictable. This is critical to the definition of frontswap 206 + backends because it grants completely dynamic discretion to the 207 + backend. In zcache, one cannot know a priori how compressible a page is. 208 + "Poorly" compressible pages can be rejected, and "poorly" can itself be 209 + defined dynamically depending on current memory constraints. 210 + 211 + Further, frontswap is entirely synchronous whereas a real swap 212 + device is, by definition, asynchronous and uses block I/O. The 213 + block I/O layer is not only unnecessary, but may perform "optimizations" 214 + that are inappropriate for a RAM-oriented device including delaying 215 + the write of some pages for a significant amount of time. Synchrony is 216 + required to ensure the dynamicity of the backend and to avoid thorny race 217 + conditions that would unnecessarily and greatly complicate frontswap 218 + and/or the block I/O subsystem. That said, only the initial "store" 219 + and "load" operations need be synchronous. A separate asynchronous thread 220 + is free to manipulate the pages stored by frontswap. 
For example, 221 + the "remotification" thread in RAMster uses standard asynchronous 222 + kernel sockets to move compressed frontswap pages to a remote machine. 223 + Similarly, a KVM guest-side implementation could do in-guest compression 224 + and use "batched" hypercalls. 225 + 226 + In a virtualized environment, the dynamicity allows the hypervisor 227 + (or host OS) to do "intelligent overcommit". For example, it can 228 + choose to accept pages only until host-swapping might be imminent, 229 + then force guests to do their own swapping. 230 + 231 + There is a downside to the transcendent memory specifications for 232 + frontswap: Since any "store" might fail, there must always be a real 233 + slot on a real swap device to swap the page. Thus frontswap must be 234 + implemented as a "shadow" to every swapon'd device with the potential 235 + capability of holding every page that the swap device might have held 236 + and the possibility that it might hold no pages at all. This means 237 + that frontswap cannot contain more pages than the total of swapon'd 238 + swap devices. For example, if NO swap device is configured on some 239 + installation, frontswap is useless. Swapless portable devices 240 + can still use frontswap but a backend for such devices must configure 241 + some kind of "ghost" swap device and ensure that it is never used. 242 + 243 + 5) Why this weird definition about "duplicate stores"? If a page 244 + has been previously successfully stored, can't it always be 245 + successfully overwritten? 246 + 247 + Nearly always it can, but no, sometimes it cannot. Consider an example 248 + where data is compressed and the original 4K page has been compressed 249 + to 1K. Now an attempt is made to overwrite the page with data that 250 + is non-compressible and so would take the entire 4K. But the backend 251 + has no more space. In this case, the store must be rejected. 
Whenever 252 + frontswap rejects a store that would overwrite, it also must invalidate 253 + the old data and ensure that it is no longer accessible. Since the 254 + swap subsystem then writes the new data to the real swap device, 255 + this is the correct course of action to ensure coherency. 256 + 257 + 6) What is frontswap_shrink for? 258 + 259 + When the (non-frontswap) swap subsystem swaps out a page to a real 260 + swap device, that page is only taking up low-value pre-allocated disk 261 + space. But if frontswap has placed a page in transcendent memory, that 262 + page may be taking up valuable real estate. The frontswap_shrink 263 + routine allows code outside of the swap subsystem to force pages out 264 + of the memory managed by frontswap and back into kernel-addressable memory. 265 + For example, in RAMster, a "suction driver" thread will attempt 266 + to "repatriate" pages sent to a remote machine back to the local machine; 267 + this is driven using the frontswap_shrink mechanism when memory pressure 268 + subsides. 269 + 270 + 7) Why does the frontswap patch create the new include file swapfile.h? 271 + 272 + The frontswap code depends on some swap-subsystem-internal data 273 + structures that have, over the years, moved back and forth between 274 + static and global. This seemed a reasonable compromise: Define 275 + them as global but declare them in a new include file that isn't 276 + included by the large number of source files that include swap.h. 277 + 278 + Dan Magenheimer, last updated April 9, 2012
+7
MAINTAINERS
··· 2930 2930 F: include/linux/freezer.h 2931 2931 F: kernel/freezer.c 2932 2932 2933 + FRONTSWAP API 2934 + M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 2935 + L: linux-kernel@vger.kernel.org 2936 + S: Maintained 2937 + F: mm/frontswap.c 2938 + F: include/linux/frontswap.h 2939 + 2933 2940 FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS 2934 2941 M: David Howells <dhowells@redhat.com> 2935 2942 L: linux-cachefs@redhat.com
+4 -4
drivers/staging/ramster/zcache-main.c
··· 3002 3002 return oid; 3003 3003 } 3004 3004 3005 - static int zcache_frontswap_put_page(unsigned type, pgoff_t offset, 3005 + static int zcache_frontswap_store(unsigned type, pgoff_t offset, 3006 3006 struct page *page) 3007 3007 { 3008 3008 u64 ind64 = (u64)offset; ··· 3025 3025 3026 3026 /* returns 0 if the page was successfully gotten from frontswap, -1 if 3027 3027 * was not present (should never happen!) */ 3028 - static int zcache_frontswap_get_page(unsigned type, pgoff_t offset, 3028 + static int zcache_frontswap_load(unsigned type, pgoff_t offset, 3029 3029 struct page *page) 3030 3030 { 3031 3031 u64 ind64 = (u64)offset; ··· 3080 3080 } 3081 3081 3082 3082 static struct frontswap_ops zcache_frontswap_ops = { 3083 - .put_page = zcache_frontswap_put_page, 3084 - .get_page = zcache_frontswap_get_page, 3083 + .store = zcache_frontswap_store, 3084 + .load = zcache_frontswap_load, 3085 3085 .invalidate_page = zcache_frontswap_flush_page, 3086 3086 .invalidate_area = zcache_frontswap_flush_area, 3087 3087 .init = zcache_frontswap_init
+5 -5
drivers/staging/zcache/zcache-main.c
··· 1835 1835 * Swizzling increases objects per swaptype, increasing tmem concurrency 1836 1836 * for heavy swaploads. Later, larger nr_cpus -> larger SWIZ_BITS 1837 1837 * Setting SWIZ_BITS to 27 basically reconstructs the swap entry from 1838 - * frontswap_get_page(), but has side-effects. Hence using 8. 1838 + * frontswap_load(), but has side-effects. Hence using 8. 1839 1839 */ 1840 1840 #define SWIZ_BITS 8 1841 1841 #define SWIZ_MASK ((1 << SWIZ_BITS) - 1) ··· 1849 1849 return oid; 1850 1850 } 1851 1851 1852 - static int zcache_frontswap_put_page(unsigned type, pgoff_t offset, 1852 + static int zcache_frontswap_store(unsigned type, pgoff_t offset, 1853 1853 struct page *page) 1854 1854 { 1855 1855 u64 ind64 = (u64)offset; ··· 1870 1870 1871 1871 /* returns 0 if the page was successfully gotten from frontswap, -1 if 1872 1872 * was not present (should never happen!) */ 1873 - static int zcache_frontswap_get_page(unsigned type, pgoff_t offset, 1873 + static int zcache_frontswap_load(unsigned type, pgoff_t offset, 1874 1874 struct page *page) 1875 1875 { 1876 1876 u64 ind64 = (u64)offset; ··· 1919 1919 } 1920 1920 1921 1921 static struct frontswap_ops zcache_frontswap_ops = { 1922 - .put_page = zcache_frontswap_put_page, 1923 - .get_page = zcache_frontswap_get_page, 1922 + .store = zcache_frontswap_store, 1923 + .load = zcache_frontswap_load, 1924 1924 .invalidate_page = zcache_frontswap_flush_page, 1925 1925 .invalidate_area = zcache_frontswap_flush_area, 1926 1926 .init = zcache_frontswap_init
+4 -4
drivers/xen/tmem.c
··· 269 269 } 270 270 271 271 /* returns 0 if the page was successfully put into frontswap, -1 if not */ 272 - static int tmem_frontswap_put_page(unsigned type, pgoff_t offset, 272 + static int tmem_frontswap_store(unsigned type, pgoff_t offset, 273 273 struct page *page) 274 274 { 275 275 u64 ind64 = (u64)offset; ··· 295 295 * returns 0 if the page was successfully gotten from frontswap, -1 if 296 296 * was not present (should never happen!) 297 297 */ 298 - static int tmem_frontswap_get_page(unsigned type, pgoff_t offset, 298 + static int tmem_frontswap_load(unsigned type, pgoff_t offset, 299 299 struct page *page) 300 300 { 301 301 u64 ind64 = (u64)offset; ··· 362 362 __setup("nofrontswap", no_frontswap); 363 363 364 364 static struct frontswap_ops __initdata tmem_frontswap_ops = { 365 - .put_page = tmem_frontswap_put_page, 366 - .get_page = tmem_frontswap_get_page, 365 + .store = tmem_frontswap_store, 366 + .load = tmem_frontswap_load, 367 367 .invalidate_page = tmem_frontswap_flush_page, 368 368 .invalidate_area = tmem_frontswap_flush_area, 369 369 .init = tmem_frontswap_init
+127
include/linux/frontswap.h
··· 1 + #ifndef _LINUX_FRONTSWAP_H 2 + #define _LINUX_FRONTSWAP_H 3 + 4 + #include <linux/swap.h> 5 + #include <linux/mm.h> 6 + #include <linux/bitops.h> 7 + 8 + struct frontswap_ops { 9 + void (*init)(unsigned); 10 + int (*store)(unsigned, pgoff_t, struct page *); 11 + int (*load)(unsigned, pgoff_t, struct page *); 12 + void (*invalidate_page)(unsigned, pgoff_t); 13 + void (*invalidate_area)(unsigned); 14 + }; 15 + 16 + extern bool frontswap_enabled; 17 + extern struct frontswap_ops 18 + frontswap_register_ops(struct frontswap_ops *ops); 19 + extern void frontswap_shrink(unsigned long); 20 + extern unsigned long frontswap_curr_pages(void); 21 + extern void frontswap_writethrough(bool); 22 + 23 + extern void __frontswap_init(unsigned type); 24 + extern int __frontswap_store(struct page *page); 25 + extern int __frontswap_load(struct page *page); 26 + extern void __frontswap_invalidate_page(unsigned, pgoff_t); 27 + extern void __frontswap_invalidate_area(unsigned); 28 + 29 + #ifdef CONFIG_FRONTSWAP 30 + 31 + static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset) 32 + { 33 + bool ret = false; 34 + 35 + if (frontswap_enabled && sis->frontswap_map) 36 + ret = test_bit(offset, sis->frontswap_map); 37 + return ret; 38 + } 39 + 40 + static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset) 41 + { 42 + if (frontswap_enabled && sis->frontswap_map) 43 + set_bit(offset, sis->frontswap_map); 44 + } 45 + 46 + static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset) 47 + { 48 + if (frontswap_enabled && sis->frontswap_map) 49 + clear_bit(offset, sis->frontswap_map); 50 + } 51 + 52 + static inline void frontswap_map_set(struct swap_info_struct *p, 53 + unsigned long *map) 54 + { 55 + p->frontswap_map = map; 56 + } 57 + 58 + static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 59 + { 60 + return p->frontswap_map; 61 + } 62 + #else 63 + /* all inline routines become no-ops and all externs are 
ignored */ 64 + 65 + #define frontswap_enabled (0) 66 + 67 + static inline bool frontswap_test(struct swap_info_struct *sis, pgoff_t offset) 68 + { 69 + return false; 70 + } 71 + 72 + static inline void frontswap_set(struct swap_info_struct *sis, pgoff_t offset) 73 + { 74 + } 75 + 76 + static inline void frontswap_clear(struct swap_info_struct *sis, pgoff_t offset) 77 + { 78 + } 79 + 80 + static inline void frontswap_map_set(struct swap_info_struct *p, 81 + unsigned long *map) 82 + { 83 + } 84 + 85 + static inline unsigned long *frontswap_map_get(struct swap_info_struct *p) 86 + { 87 + return NULL; 88 + } 89 + #endif 90 + 91 + static inline int frontswap_store(struct page *page) 92 + { 93 + int ret = -1; 94 + 95 + if (frontswap_enabled) 96 + ret = __frontswap_store(page); 97 + return ret; 98 + } 99 + 100 + static inline int frontswap_load(struct page *page) 101 + { 102 + int ret = -1; 103 + 104 + if (frontswap_enabled) 105 + ret = __frontswap_load(page); 106 + return ret; 107 + } 108 + 109 + static inline void frontswap_invalidate_page(unsigned type, pgoff_t offset) 110 + { 111 + if (frontswap_enabled) 112 + __frontswap_invalidate_page(type, offset); 113 + } 114 + 115 + static inline void frontswap_invalidate_area(unsigned type) 116 + { 117 + if (frontswap_enabled) 118 + __frontswap_invalidate_area(type); 119 + } 120 + 121 + static inline void frontswap_init(unsigned type) 122 + { 123 + if (frontswap_enabled) 124 + __frontswap_init(type); 125 + } 126 + 127 + #endif /* _LINUX_FRONTSWAP_H */
+4
include/linux/swap.h
··· 197 197 struct block_device *bdev; /* swap device or bdev of swap file */ 198 198 struct file *swap_file; /* seldom referenced */ 199 199 unsigned int old_block_size; /* seldom referenced */ 200 + #ifdef CONFIG_FRONTSWAP 201 + unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ 202 + atomic_t frontswap_pages; /* frontswap pages in-use counter */ 203 + #endif 200 204 }; 201 205 202 206 struct swap_list_t {
+13
include/linux/swapfile.h
··· 1 + #ifndef _LINUX_SWAPFILE_H 2 + #define _LINUX_SWAPFILE_H 3 + 4 + /* 5 + * these were static in swapfile.c but frontswap.c needs them and we don't 6 + * want to expose them to the dozens of source files that include swap.h 7 + */ 8 + extern spinlock_t swap_lock; 9 + extern struct swap_list_t swap_list; 10 + extern struct swap_info_struct *swap_info[]; 11 + extern int try_to_unuse(unsigned int, bool, unsigned long); 12 + 13 + #endif /* _LINUX_SWAPFILE_H */
+17
mm/Kconfig
··· 389 389 in a negligible performance hit. 390 390 391 391 If unsure, say Y to enable cleancache 392 + 393 + config FRONTSWAP 394 + bool "Enable frontswap to cache swap pages if tmem is present" 395 + depends on SWAP 396 + default n 397 + help 398 + Frontswap is so named because it can be thought of as the opposite 399 + of a "backing" store for a swap device. The data is stored into 400 + "transcendent memory", memory that is not directly accessible or 401 + addressable by the kernel and is of unknown and possibly 402 + time-varying size. When space in transcendent memory is available, 403 + a significant swap I/O reduction may be achieved. When none is 404 + available, all frontswap calls are reduced to a single pointer- 405 + compare-against-NULL resulting in a negligible performance hit 406 + and swap data is stored as normal on the matching swap device. 407 + 408 + If unsure, say Y to enable frontswap.
+1
mm/Makefile
··· 29 29 30 30 obj-$(CONFIG_BOUNCE) += bounce.o 31 31 obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o 32 + obj-$(CONFIG_FRONTSWAP) += frontswap.o 32 33 obj-$(CONFIG_HAS_DMA) += dmapool.o 33 34 obj-$(CONFIG_HUGETLBFS) += hugetlb.o 34 35 obj-$(CONFIG_NUMA) += mempolicy.o
+314
mm/frontswap.c
··· 1 + /* 2 + * Frontswap frontend 3 + * 4 + * This code provides the generic "frontend" layer to call a matching 5 + * "backend" driver implementation of frontswap. See 6 + * Documentation/vm/frontswap.txt for more information. 7 + * 8 + * Copyright (C) 2009-2012 Oracle Corp. All rights reserved. 9 + * Author: Dan Magenheimer 10 + * 11 + * This work is licensed under the terms of the GNU GPL, version 2. 12 + */ 13 + 14 + #include <linux/mm.h> 15 + #include <linux/mman.h> 16 + #include <linux/swap.h> 17 + #include <linux/swapops.h> 18 + #include <linux/proc_fs.h> 19 + #include <linux/security.h> 20 + #include <linux/capability.h> 21 + #include <linux/module.h> 22 + #include <linux/uaccess.h> 23 + #include <linux/debugfs.h> 24 + #include <linux/frontswap.h> 25 + #include <linux/swapfile.h> 26 + 27 + /* 28 + * frontswap_ops is set by frontswap_register_ops to contain the pointers 29 + * to the frontswap "backend" implementation functions. 30 + */ 31 + static struct frontswap_ops frontswap_ops __read_mostly; 32 + 33 + /* 34 + * This global enablement flag reduces overhead on systems where frontswap_ops 35 + * has not been registered, so is preferred to the slower alternative: a 36 + * function call that checks a non-global. 37 + */ 38 + bool frontswap_enabled __read_mostly; 39 + EXPORT_SYMBOL(frontswap_enabled); 40 + 41 + /* 42 + * If enabled, frontswap_store will return failure even on success. As 43 + * a result, the swap subsystem will always write the page to swap, in 44 + * effect converting frontswap into a writethrough cache. In this mode, 45 + * there is no direct reduction in swap writes, but a frontswap backend 46 + * can unilaterally "reclaim" any pages in use with no data loss, thus 47 + * providing increased control over maximum memory usage due to frontswap. 
48 + */ 49 + static bool frontswap_writethrough_enabled __read_mostly; 50 + 51 + #ifdef CONFIG_DEBUG_FS 52 + /* 53 + * Counters available via /sys/kernel/debug/frontswap (if debugfs is 54 + * properly configured). These are for information only so are not protected 55 + * against increment races. 56 + */ 57 + static u64 frontswap_loads; 58 + static u64 frontswap_succ_stores; 59 + static u64 frontswap_failed_stores; 60 + static u64 frontswap_invalidates; 61 + 62 + static inline void inc_frontswap_loads(void) { 63 + frontswap_loads++; 64 + } 65 + static inline void inc_frontswap_succ_stores(void) { 66 + frontswap_succ_stores++; 67 + } 68 + static inline void inc_frontswap_failed_stores(void) { 69 + frontswap_failed_stores++; 70 + } 71 + static inline void inc_frontswap_invalidates(void) { 72 + frontswap_invalidates++; 73 + } 74 + #else 75 + static inline void inc_frontswap_loads(void) { } 76 + static inline void inc_frontswap_succ_stores(void) { } 77 + static inline void inc_frontswap_failed_stores(void) { } 78 + static inline void inc_frontswap_invalidates(void) { } 79 + #endif 80 + /* 81 + * Register operations for frontswap, returning previous thus allowing 82 + * detection of multiple backends and possible nesting. 83 + */ 84 + struct frontswap_ops frontswap_register_ops(struct frontswap_ops *ops) 85 + { 86 + struct frontswap_ops old = frontswap_ops; 87 + 88 + frontswap_ops = *ops; 89 + frontswap_enabled = true; 90 + return old; 91 + } 92 + EXPORT_SYMBOL(frontswap_register_ops); 93 + 94 + /* 95 + * Enable/disable frontswap writethrough (see above). 96 + */ 97 + void frontswap_writethrough(bool enable) 98 + { 99 + frontswap_writethrough_enabled = enable; 100 + } 101 + EXPORT_SYMBOL(frontswap_writethrough); 102 + 103 + /* 104 + * Called when a swap device is swapon'd. 
105 + */ 106 + void __frontswap_init(unsigned type) 107 + { 108 + struct swap_info_struct *sis = swap_info[type]; 109 + 110 + BUG_ON(sis == NULL); 111 + if (sis->frontswap_map == NULL) 112 + return; 113 + if (frontswap_enabled) 114 + (*frontswap_ops.init)(type); 115 + } 116 + EXPORT_SYMBOL(__frontswap_init); 117 + 118 + /* 119 + * "Store" data from a page to frontswap and associate it with the page's 120 + * swaptype and offset. Page must be locked and in the swap cache. 121 + * If frontswap already contains a page with matching swaptype and 122 + * offset, the frontswap implementation may either overwrite the data and 123 + * return success or invalidate the page from frontswap and return failure. 124 + */ 125 + int __frontswap_store(struct page *page) 126 + { 127 + int ret = -1, dup = 0; 128 + swp_entry_t entry = { .val = page_private(page), }; 129 + int type = swp_type(entry); 130 + struct swap_info_struct *sis = swap_info[type]; 131 + pgoff_t offset = swp_offset(entry); 132 + 133 + BUG_ON(!PageLocked(page)); 134 + BUG_ON(sis == NULL); 135 + if (frontswap_test(sis, offset)) 136 + dup = 1; 137 + ret = (*frontswap_ops.store)(type, offset, page); 138 + if (ret == 0) { 139 + frontswap_set(sis, offset); 140 + inc_frontswap_succ_stores(); 141 + if (!dup) 142 + atomic_inc(&sis->frontswap_pages); 143 + } else if (dup) { 144 + /* 145 + * a failed dup always results in automatic invalidation of 146 + * the (older) page from frontswap 147 + */ 148 + frontswap_clear(sis, offset); 149 + atomic_dec(&sis->frontswap_pages); 150 + inc_frontswap_failed_stores(); 151 + } else 152 + inc_frontswap_failed_stores(); 153 + if (frontswap_writethrough_enabled) 154 + /* report failure so swap also writes to the swap device */ 155 + ret = -1; 156 + return ret; 157 + } 158 + EXPORT_SYMBOL(__frontswap_store); 159 + 160 + /* 161 + * "Load" data from frontswap associated with the swaptype and offset that were 162 + * specified when the data was stored to frontswap and use it to fill the 163 + * specified page with
data. Page must be locked and in the swap cache. 164 + */ 165 + int __frontswap_load(struct page *page) 166 + { 167 + int ret = -1; 168 + swp_entry_t entry = { .val = page_private(page), }; 169 + int type = swp_type(entry); 170 + struct swap_info_struct *sis = swap_info[type]; 171 + pgoff_t offset = swp_offset(entry); 172 + 173 + BUG_ON(!PageLocked(page)); 174 + BUG_ON(sis == NULL); 175 + if (frontswap_test(sis, offset)) 176 + ret = (*frontswap_ops.load)(type, offset, page); 177 + if (ret == 0) 178 + inc_frontswap_loads(); 179 + return ret; 180 + } 181 + EXPORT_SYMBOL(__frontswap_load); 182 + 183 + /* 184 + * Invalidate any data from frontswap associated with the specified swaptype 185 + * and offset so that a subsequent "load" will fail. 186 + */ 187 + void __frontswap_invalidate_page(unsigned type, pgoff_t offset) 188 + { 189 + struct swap_info_struct *sis = swap_info[type]; 190 + 191 + BUG_ON(sis == NULL); 192 + if (frontswap_test(sis, offset)) { 193 + (*frontswap_ops.invalidate_page)(type, offset); 194 + atomic_dec(&sis->frontswap_pages); 195 + frontswap_clear(sis, offset); 196 + inc_frontswap_invalidates(); 197 + } 198 + } 199 + EXPORT_SYMBOL(__frontswap_invalidate_page); 200 + 201 + /* 202 + * Invalidate all data from frontswap associated with all offsets for the 203 + * specified swaptype.
204 + */ 205 + void __frontswap_invalidate_area(unsigned type) 206 + { 207 + struct swap_info_struct *sis = swap_info[type]; 208 + 209 + BUG_ON(sis == NULL); 210 + if (sis->frontswap_map == NULL) 211 + return; 212 + (*frontswap_ops.invalidate_area)(type); 213 + atomic_set(&sis->frontswap_pages, 0); 214 + memset(sis->frontswap_map, 0, sis->max / sizeof(long)); 215 + } 216 + EXPORT_SYMBOL(__frontswap_invalidate_area); 217 + 218 + /* 219 + * Frontswap, like a true swap device, may unnecessarily retain pages 220 + * under certain circumstances; "shrink" frontswap is essentially a 221 + * "partial swapoff" and works by calling try_to_unuse to unuse enough 222 + * frontswap pages to reduce -- subject to memory constraints -- the 223 + * number of pages in frontswap to the number given in the parameter 224 + * target_pages. 225 + */ 226 + void frontswap_shrink(unsigned long target_pages) 227 + { 228 + struct swap_info_struct *si = NULL; 229 + int si_frontswap_pages; 230 + unsigned long total_pages = 0, total_pages_to_unuse; 231 + unsigned long pages = 0, pages_to_unuse = 0; 232 + int type; 233 + bool locked = false; 234 + 235 + /* 236 + * we don't want to hold swap_lock while doing a very 237 + * lengthy try_to_unuse, but swap_list may change 238 + * so restart scan from swap_list.head each time 239 + */ 240 + spin_lock(&swap_lock); 241 + locked = true; 242 + total_pages = 0; 243 + for (type = swap_list.head; type >= 0; type = si->next) { 244 + si = swap_info[type]; 245 + total_pages += atomic_read(&si->frontswap_pages); 246 + } 247 + if (total_pages <= target_pages) 248 + goto out; 249 + total_pages_to_unuse = total_pages - target_pages; 250 + for (type = swap_list.head; type >= 0; type = si->next) { 251 + si = swap_info[type]; 252 + si_frontswap_pages = atomic_read(&si->frontswap_pages); 253 + if (total_pages_to_unuse < si_frontswap_pages) 254 + pages = pages_to_unuse = total_pages_to_unuse; 255 + else { 256 + pages = si_frontswap_pages; 257 +
pages_to_unuse = 0; /* unuse all */ 258 + } 259 + /* ensure there is enough RAM to fetch pages from frontswap */ 260 + if (security_vm_enough_memory_mm(current->mm, pages)) 261 + continue; 262 + vm_unacct_memory(pages); 263 + break; 264 + } 265 + if (type < 0) 266 + goto out; 267 + locked = false; 268 + spin_unlock(&swap_lock); 269 + try_to_unuse(type, true, pages_to_unuse); 270 + out: 271 + if (locked) 272 + spin_unlock(&swap_lock); 273 + return; 274 + } 275 + EXPORT_SYMBOL(frontswap_shrink); 276 + 277 + /* 278 + * Count and return the number of frontswap pages across all 279 + * swap devices. This is exported so that backend drivers can 280 + * determine current usage without reading debugfs. 281 + */ 282 + unsigned long frontswap_curr_pages(void) 283 + { 284 + int type; 285 + unsigned long totalpages = 0; 286 + struct swap_info_struct *si = NULL; 287 + 288 + spin_lock(&swap_lock); 289 + for (type = swap_list.head; type >= 0; type = si->next) { 290 + si = swap_info[type]; 291 + totalpages += atomic_read(&si->frontswap_pages); 292 + } 293 + spin_unlock(&swap_lock); 294 + return totalpages; 295 + } 296 + EXPORT_SYMBOL(frontswap_curr_pages); 297 + 298 + static int __init init_frontswap(void) 299 + { 300 + #ifdef CONFIG_DEBUG_FS 301 + struct dentry *root = debugfs_create_dir("frontswap", NULL); 302 + if (root == NULL) 303 + return -ENXIO; 304 + debugfs_create_u64("loads", S_IRUGO, root, &frontswap_loads); 305 + debugfs_create_u64("succ_stores", S_IRUGO, root, &frontswap_succ_stores); 306 + debugfs_create_u64("failed_stores", S_IRUGO, root, 307 + &frontswap_failed_stores); 308 + debugfs_create_u64("invalidates", S_IRUGO, 309 + root, &frontswap_invalidates); 310 + #endif 311 + return 0; 312 + } 313 + 314 + module_init(init_frontswap);
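The store path above keeps per-device bookkeeping: a bit per offset in frontswap_map plus the frontswap_pages counter, with the subtlety that a failed store of a duplicate entry also invalidates the older copy, since the backend's state for that offset can no longer be trusted. A minimal userspace model of that bookkeeping (illustrative only, not the kernel code; `toy_store`, `map`, `frontswap_pages` and `backend_fail` are invented stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OFFSETS 64

static bool map[MAX_OFFSETS];   /* stands in for sis->frontswap_map */
static int frontswap_pages;     /* stands in for sis->frontswap_pages */
static bool backend_fail;       /* forces the backend ->store to fail */

/* returns 0 on success, -1 on failure, like __frontswap_store() */
static int toy_store(unsigned offset)
{
	bool dup = map[offset];

	if (!backend_fail) {
		map[offset] = true;
		if (!dup)
			frontswap_pages++;	/* only count new entries */
		return 0;
	}
	if (dup) {
		/* failed dup: the older copy is invalidated automatically */
		map[offset] = false;
		frontswap_pages--;
	}
	return -1;
}
```

Note how a successful duplicate store leaves the counter untouched, matching the `if (!dup) atomic_inc(...)` branch above.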
+12
mm/page_io.c
··· 18 18 #include <linux/bio.h> 19 19 #include <linux/swapops.h> 20 20 #include <linux/writeback.h> 21 + #include <linux/frontswap.h> 21 22 #include <asm/pgtable.h> 22 23 23 24 static struct bio *get_swap_bio(gfp_t gfp_flags, ··· 99 98 unlock_page(page); 100 99 goto out; 101 100 } 101 + if (frontswap_store(page) == 0) { 102 + set_page_writeback(page); 103 + unlock_page(page); 104 + end_page_writeback(page); 105 + goto out; 106 + } 102 107 bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write); 103 108 if (bio == NULL) { 104 109 set_page_dirty(page); ··· 129 122 130 123 VM_BUG_ON(!PageLocked(page)); 131 124 VM_BUG_ON(PageUptodate(page)); 125 + if (frontswap_load(page) == 0) { 126 + SetPageUptodate(page); 127 + unlock_page(page); 128 + goto out; 129 + } 132 130 bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read); 133 131 if (bio == NULL) { 134 132 unlock_page(page);
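The hooks added to swap_writepage() and swap_readpage() consult frontswap before issuing any bio, and the writethrough mode from frontswap.c works by making frontswap_store() report failure so the disk write still happens. A sketch of that control flow as a userspace model (not kernel code; `fs_store`, `swap_writepage_model`, `writethrough` and `disk_writes` are invented names):

```c
#include <assert.h>
#include <stdbool.h>

static bool writethrough;	/* models frontswap_writethrough_enabled */
static int disk_writes;	/* counts trips down the bio path */

/* models __frontswap_store(): 0 = page captured by frontswap */
static int fs_store(bool backend_ok)
{
	int ret = backend_ok ? 0 : -1;

	if (writethrough)
		ret = -1;	/* report failure so swap also writes the device */
	return ret;
}

/* models the hooked swap_writepage(): frontswap first, bio on failure */
static void swap_writepage_model(bool backend_ok)
{
	if (fs_store(backend_ok) == 0)
		return;		/* page is in frontswap, skip the bio */
	disk_writes++;		/* fall through to get_swap_bio()/submit */
}
```

The read side is symmetric: a frontswap_load() hit marks the page up to date and skips the bio entirely.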
+41 -13
mm/swapfile.c
··· 31 31 #include <linux/memcontrol.h> 32 32 #include <linux/poll.h> 33 33 #include <linux/oom.h> 34 + #include <linux/frontswap.h> 35 + #include <linux/swapfile.h> 34 36 35 37 #include <asm/pgtable.h> 36 38 #include <asm/tlbflush.h> ··· 44 42 static void free_swap_count_continuations(struct swap_info_struct *); 45 43 static sector_t map_swap_entry(swp_entry_t, struct block_device**); 46 44 47 - static DEFINE_SPINLOCK(swap_lock); 45 + DEFINE_SPINLOCK(swap_lock); 48 46 static unsigned int nr_swapfiles; 49 47 long nr_swap_pages; 50 48 long total_swap_pages; ··· 55 53 static const char Bad_offset[] = "Bad swap offset entry "; 56 54 static const char Unused_offset[] = "Unused swap offset entry "; 57 55 58 - static struct swap_list_t swap_list = {-1, -1}; 56 + struct swap_list_t swap_list = {-1, -1}; 59 57 60 - static struct swap_info_struct *swap_info[MAX_SWAPFILES]; 58 + struct swap_info_struct *swap_info[MAX_SWAPFILES]; 61 59 62 60 static DEFINE_MUTEX(swapon_mutex); 63 61 ··· 558 556 swap_list.next = p->type; 559 557 nr_swap_pages++; 560 558 p->inuse_pages--; 559 + frontswap_invalidate_page(p->type, offset); 561 560 if ((p->flags & SWP_BLKDEV) && 562 561 disk->fops->swap_slot_free_notify) 563 562 disk->fops->swap_slot_free_notify(p->bdev, offset); ··· 988 985 } 989 986 990 987 /* 991 - * Scan swap_map from current position to next entry still in use. 988 + * Scan swap_map (or frontswap_map if frontswap parameter is true) 989 + * from current position to next entry still in use. 992 990 * Recycle to start on reaching the end, returning 0 when empty. 
993 991 */ 994 992 static unsigned int find_next_to_unuse(struct swap_info_struct *si, 995 - unsigned int prev) 993 + unsigned int prev, bool frontswap) 996 994 { 997 995 unsigned int max = si->max; 998 996 unsigned int i = prev; ··· 1019 1015 prev = 0; 1020 1016 i = 1; 1021 1017 } 1018 + if (frontswap) { 1019 + if (frontswap_test(si, i)) 1020 + break; 1021 + else 1022 + continue; 1023 + } 1022 1024 count = si->swap_map[i]; 1023 1025 if (count && swap_count(count) != SWAP_MAP_BAD) 1024 1026 break; ··· 1036 1026 * We completely avoid races by reading each swap page in advance, 1037 1027 * and then search for the process using it. All the necessary 1038 1028 * page table adjustments can then be made atomically. 1029 + * 1030 + * if the boolean frontswap is true, only unuse pages_to_unuse pages; 1031 + * pages_to_unuse==0 means all pages; ignored if frontswap is false 1039 1032 */ 1040 - static int try_to_unuse(unsigned int type) 1033 + int try_to_unuse(unsigned int type, bool frontswap, 1034 + unsigned long pages_to_unuse) 1041 1035 { 1042 1036 struct swap_info_struct *si = swap_info[type]; 1043 1037 struct mm_struct *start_mm; ··· 1074 1060 * one pass through swap_map is enough, but not necessarily: 1075 1061 * there are races when an instance of an entry might be missed. 1076 1062 */ 1077 - while ((i = find_next_to_unuse(si, i)) != 0) { 1063 + while ((i = find_next_to_unuse(si, i, frontswap)) != 0) { 1078 1064 if (signal_pending(current)) { 1079 1065 retval = -EINTR; 1080 1066 break; ··· 1241 1227 * interactive performance. 
1242 1228 */ 1243 1229 cond_resched(); 1230 + if (frontswap && pages_to_unuse > 0) { 1231 + if (!--pages_to_unuse) 1232 + break; 1233 + } 1244 1234 } 1245 1235 1246 1236 mmput(start_mm); ··· 1504 1486 } 1505 1487 1506 1488 static void enable_swap_info(struct swap_info_struct *p, int prio, 1507 - unsigned char *swap_map) 1489 + unsigned char *swap_map, 1490 + unsigned long *frontswap_map) 1508 1491 { 1509 1492 int i, prev; 1510 1493 ··· 1515 1496 else 1516 1497 p->prio = --least_priority; 1517 1498 p->swap_map = swap_map; 1499 + frontswap_map_set(p, frontswap_map); 1518 1500 p->flags |= SWP_WRITEOK; 1519 1501 nr_swap_pages += p->pages; 1520 1502 total_swap_pages += p->pages; ··· 1532 1512 swap_list.head = swap_list.next = p->type; 1533 1513 else 1534 1514 swap_info[prev]->next = p->type; 1515 + frontswap_init(p->type); 1535 1516 spin_unlock(&swap_lock); 1536 1517 } 1537 1518 ··· 1606 1585 spin_unlock(&swap_lock); 1607 1586 1608 1587 oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); 1609 - err = try_to_unuse(type); 1588 + err = try_to_unuse(type, false, 0); /* force all pages to be unused */ 1610 1589 compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); 1611 1590 1612 1591 if (err) { ··· 1617 1596 * sys_swapoff for this swap_info_struct at this point. 
1618 1597 */ 1619 1598 /* re-insert swap space back into swap_list */ 1620 - enable_swap_info(p, p->prio, p->swap_map); 1599 + enable_swap_info(p, p->prio, p->swap_map, frontswap_map_get(p)); 1621 1600 goto out_dput; 1622 1601 } 1623 1602 ··· 1643 1622 swap_map = p->swap_map; 1644 1623 p->swap_map = NULL; 1645 1624 p->flags = 0; 1625 + frontswap_invalidate_area(type); 1646 1626 spin_unlock(&swap_lock); 1647 1627 mutex_unlock(&swapon_mutex); 1648 1628 vfree(swap_map); 1629 + vfree(frontswap_map_get(p)); 1649 1630 /* Destroy swap account information */ 1650 1631 swap_cgroup_swapoff(type); 1651 1632 ··· 2011 1988 sector_t span; 2012 1989 unsigned long maxpages; 2013 1990 unsigned char *swap_map = NULL; 1991 + unsigned long *frontswap_map = NULL; 2014 1992 struct page *page = NULL; 2015 1993 struct inode *inode = NULL; 2016 1994 ··· 2095 2071 error = nr_extents; 2096 2072 goto bad_swap; 2097 2073 } 2074 + /* frontswap enabled? set up bit-per-page map for frontswap */ 2075 + if (frontswap_enabled) 2076 + frontswap_map = vzalloc(maxpages / sizeof(long)); 2098 2077 2099 2078 if (p->bdev) { 2100 2079 if (blk_queue_nonrot(bdev_get_queue(p->bdev))) { ··· 2113 2086 if (swap_flags & SWAP_FLAG_PREFER) 2114 2087 prio = 2115 2088 (swap_flags & SWAP_FLAG_PRIO_MASK) >> SWAP_FLAG_PRIO_SHIFT; 2116 - enable_swap_info(p, prio, swap_map); 2089 + enable_swap_info(p, prio, swap_map, frontswap_map); 2117 2090 2118 2091 printk(KERN_INFO "Adding %uk swap on %s. " 2119 - "Priority:%d extents:%d across:%lluk %s%s\n", 2092 + "Priority:%d extents:%d across:%lluk %s%s%s\n", 2120 2093 p->pages<<(PAGE_SHIFT-10), name, p->prio, 2121 2094 nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), 2122 2095 (p->flags & SWP_SOLIDSTATE) ? "SS" : "", 2123 - (p->flags & SWP_DISCARDABLE) ? "D" : ""); 2096 + (p->flags & SWP_DISCARDABLE) ? "D" : "", 2097 + (frontswap_map) ? "FS" : ""); 2124 2098 2125 2099 mutex_unlock(&swapon_mutex); 2126 2100 atomic_inc(&proc_poll_event);
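The bit-per-page map allocated in sys_swapon() above with vzalloc(maxpages / sizeof(long)) is maxpages/sizeof(long) bytes, i.e. maxpages * 8 / sizeof(long) bits: exactly one bit per page with 64-bit longs and two bits per page with 32-bit longs, though the truncating division can leave the last few pages without a bit when maxpages is not a multiple of sizeof(long). A quick check of that arithmetic (illustrative only; `map_bits` is an invented helper):

```c
#include <assert.h>
#include <stddef.h>

/* bits provided by a frontswap_map of (maxpages / sizeof_long) bytes */
static unsigned long map_bits(unsigned long maxpages, size_t sizeof_long)
{
	return (maxpages / sizeof_long) * 8;	/* 8 bits per byte */
}
```

On a 64-bit kernel a 4 GiB swap area (maxpages = 1M with 4 KiB pages) therefore gets a 128 KiB map, one bit per page.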