==============
Memory Hotplug
==============

Last Updated: Jul 28 2007

This document is about memory hotplug, including how to use it and its
current status. Because Memory Hotplug is still under development, the
contents of this text will change often.

1. Introduction
  1.1 purpose of memory hotplug
  1.2. Phases of memory hotplug
  1.3. Unit of Memory online/offline operation
2. Kernel Configuration
3. sysfs files for memory hotplug
4. Physical memory hot-add phase
  4.1 Hardware(Firmware) Support
  4.2 Notify memory hot-add event by hand
5. Logical Memory hot-add phase
  5.1. State of memory
  5.2. How to online memory
6. Logical memory remove
  6.1 Memory offline and ZONE_MOVABLE
  6.2. How to offline memory
7. Physical memory remove
8. Future Work List

Note(1): x86_64 has a special implementation for memory hotplug.
         This text does not describe it.
Note(2): This text assumes that sysfs is mounted at /sys.


---------------
1. Introduction
---------------

1.1 purpose of memory hotplug
------------
Memory Hotplug allows users to increase/decrease the amount of memory.
Generally, there are two purposes.

(A) For changing the amount of memory.
    This is to allow a feature like capacity on demand.
(B) For installing/removing DIMMs or NUMA-nodes physically.
    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.

(A) is required by highly virtualized environments and (B) is required by
hardware which supports memory power management.

Linux memory hotplug is designed for both purposes.


1.2. Phases of memory hotplug
---------------
There are 2 phases in Memory Hotplug.
  1) Physical Memory Hotplug phase
  2) Logical Memory Hotplug phase.

The first phase communicates with hardware/firmware and sets up or tears down
the environment for hotplugged memory.
This phase is mainly needed for purpose (B), but it is also a good place for
communication in highly virtualized environments.

When memory is hotplugged, the kernel recognizes the new memory, creates new
memory management tables, and creates the sysfs files used to operate on it.

If the firmware can notify the OS that new memory has been connected,
this phase is triggered automatically. ACPI can notify this event. If not,
a "probe" operation by the system administrator is used instead.
(see Section 4.).

The Logical Memory Hotplug phase changes the memory state to
available/unavailable for users. The amount of memory seen by the user is
changed by this phase. When a memory range becomes available, the kernel
makes all memory in it free pages.

In this document, this phase is described as online/offline.

The Logical Memory Hotplug phase is triggered when the system administrator
writes to a sysfs file. For the hot-add case, it must be executed by hand
after the Physical Hotplug phase.
(However, if you write udev hotplug scripts for memory hotplug, these
 phases can be executed in a seamless way.)


1.3. Unit of Memory online/offline operation
------------
Memory hotplug uses the SPARSEMEM memory model. SPARSEMEM divides the whole
memory into chunks of the same size. The chunk is called a "section". The size
of a section is architecture dependent. For example, power uses 16MiB, ia64
uses 1GiB. The unit of online/offline operation is "one section".
(see Section 3.)

To determine the size of sections, please read this file:

/sys/devices/system/memory/block_size_bytes

This file shows the size of sections in bytes.

-----------------------
2. Kernel Configuration
-----------------------
To use the memory hotplug feature, the kernel must be compiled with the
following config options.

- For all memory hotplug
    Memory model -> Sparse Memory  (CONFIG_SPARSEMEM)
    Allow for memory hot-add       (CONFIG_MEMORY_HOTPLUG)

- To enable memory removal, the following are also necessary
    Allow for memory hot remove    (CONFIG_MEMORY_HOTREMOVE)
    Page Migration                 (CONFIG_MIGRATION)

- For ACPI memory hotplug, the following is also necessary
    Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY)
    This option can be a kernel module.

- As a related configuration, if your box has a feature of NUMA-node hotplug
  via ACPI, then this option is necessary too.
    ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
    (CONFIG_ACPI_CONTAINER).
    This option can be a kernel module too.

---------------------------------
3. sysfs files for memory hotplug
---------------------------------
All sections have their device information under /sys/devices/system/memory as

/sys/devices/system/memory/memoryXXX
(XXX is section id.)

Now, XXX is defined as start_address_of_section / section_size.

For example, assume 1GiB section size. A device for a memory starting at
0x100000000 is /sys/devices/system/memory/memory4
(0x100000000 / 1GiB = 4)
This device covers address range [0x100000000 ... 0x140000000)

Under each section, you can see 3 files.

/sys/devices/system/memory/memoryXXX/phys_index
/sys/devices/system/memory/memoryXXX/phys_device
/sys/devices/system/memory/memoryXXX/state

'phys_index' : read-only and contains section id, same as XXX.
'state'      : read-write
               at read:  contains online/offline state of memory.
               at write: user can specify "online", "offline" command
'phys_device': read-only: designed to show the name of physical memory device.
               This is not well implemented now.

NOTE:
  These directories/files appear after physical memory hotplug phase.


--------------------------------
4. Physical memory hot-add phase
--------------------------------

4.1 Hardware(Firmware) Support
------------
On x86_64/ia64 platforms, memory hotplug by ACPI is supported.

In general, firmware (ACPI) which supports memory hotplug defines a
memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
Linux's ACPI handler hot-adds the memory to the system and calls a hotplug udev
script. This is done automatically.

But scripts for memory hotplug are not contained in the generic udev package
(now). You may have to write them yourself or online/offline memory by hand.
Please see "How to online memory", "How to offline memory" in this text.

If firmware supports NUMA-node hotplug and defines an object of _HID
"ACPI0004", "PNP0A05", or "PNP0A06", the notification is asserted to it, and
the ACPI handler calls hotplug code for all objects which are defined in it.
If a memory device is found, the memory hotplug code will be called.


4.2 Notify memory hot-add event by hand
------------
In some environments, especially virtualized environments, firmware will not
notify the kernel of a memory hotplug event. For such environments, a "probe"
interface is supported. This interface depends on CONFIG_ARCH_MEMORY_PROBE.

Now, CONFIG_ARCH_MEMORY_PROBE is supported only by powerpc, but it does not
contain highly architecture-specific code. Please add the config if you need
the "probe" interface.

The probe interface is located at
/sys/devices/system/memory/probe

You can tell the physical address of new memory to the kernel by

% echo start_address_of_new_memory > /sys/devices/system/memory/probe

Then, [start_address_of_new_memory, start_address_of_new_memory + section_size)
memory range is hot-added. In this case, the hotplug script is not called (in
the current implementation). You'll have to online the memory yourself.
Please see "How to online memory" in this text.
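
As an illustration of the probe interface together with the memoryXXX naming
from Section 3, the following shell sketch derives the section id for a
physical address and then probes and onlines that section. The names
section_id, probe_and_online and MEMDIR are made up for this example. It also
assumes that block_size_bytes holds a hexadecimal value without a "0x"
prefix; please check this on your kernel before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names and MEMDIR are illustrative, not kernel-provided.
MEMDIR=${MEMDIR:-/sys/devices/system/memory}

# section_id ADDR SIZE
# Print the section id, defined as ADDR / SIZE (0x-prefixed hex is accepted).
section_id() {
    echo $(( $1 / $2 ))
}

# probe_and_online ADDR
# Hot-add the section starting at ADDR via "probe", then online it.
# Assumption: block_size_bytes is hexadecimal without a "0x" prefix.
probe_and_online() {
    addr=$1
    size=$(( 0x$(cat "$MEMDIR/block_size_bytes") ))
    id=$(section_id "$addr" "$size")
    echo "$addr" > "$MEMDIR/probe" || return 1
    echo online > "$MEMDIR/memory$id/state"
}
```

For example, "section_id 0x100000000 0x40000000" prints 4, which matches the
memory4 example in Section 3. probe_and_online must be run as root on a kernel
built with CONFIG_ARCH_MEMORY_PROBE.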



-------------------------------
5. Logical Memory hot-add phase
-------------------------------

5.1. State of memory
------------
To see the (online/offline) state of a memory section, read its 'state' file.

% cat /sys/devices/system/memory/memoryXXX/state


If the memory section is online, you'll read "online".
If the memory section is offline, you'll read "offline".


5.2. How to online memory
------------
Even if the memory is hot-added, it is not in a ready-to-use state.
To use newly added memory, you have to "online" the memory section.

For onlining, you have to write "online" to the section's state file as:

% echo online > /sys/devices/system/memory/memoryXXX/state

After this, section memoryXXX's state will be 'online' and the amount of
available memory will be increased.

Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
This may be changed in future.



------------------------
6. Logical memory remove
------------------------

6.1 Memory offline and ZONE_MOVABLE
------------
Memory offlining is more complicated than memory onlining. Because memory
offline has to make the whole memory section unused, memory offline can fail
if the section includes memory which cannot be freed.

In general, memory offline can use 2 techniques.

(1) reclaim and free all memory in the section.
(2) migrate all pages in the section.

In the current implementation, Linux's memory offline uses method (2), freeing
all pages in the section by page migration. But not all pages are
migratable. Under current Linux, migratable pages are anonymous pages and
page caches. For offlining a section by migration, the kernel has to guarantee
that the section contains only migratable pages.

Now, a boot option for making a section which consists of migratable pages is
supported.
By specifying the "kernelcore=" or "movablecore=" boot option, you can
create ZONE_MOVABLE, a zone which is used only for movable pages.
(See also Documentation/kernel-parameters.txt)

Assuming the system has "TOTAL" amount of memory at boot time, these boot
options create ZONE_MOVABLE as follows.

1) When kernelcore=YYYY boot option is used,
  Size of memory not for movable pages (not for offline) is YYYY.
  Size of memory for movable pages (for offline) is TOTAL-YYYY.

2) When movablecore=ZZZZ boot option is used,
  Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
  Size of memory for movable pages (for offline) is ZZZZ.


Note) Unfortunately, there is no information to show which section belongs
to ZONE_MOVABLE. This is TBD.


6.2. How to offline memory
------------
You can offline a section by using the same sysfs interface that was used in
memory onlining.

% echo offline > /sys/devices/system/memory/memoryXXX/state

If offline succeeds, the state of the memory section is changed to "offline".
If it fails, some error code (like -EBUSY) will be returned by the kernel.
Even if a section does not belong to ZONE_MOVABLE, you can try to offline it.
If it doesn't contain 'unmovable' memory, you'll get success.

A section under ZONE_MOVABLE is considered to be able to be offlined easily.
But under some busy state, it may return -EBUSY. Even if a memory section
cannot be offlined due to -EBUSY, you can retry offlining it and may be able to
offline it (or not).
(For example, a page is referred to by some kernel internal call and released
 soon.)

Consideration:
Memory hotplug's design direction is to make the possibility of memory offlining
higher and to guarantee unplugging memory under any situation. But it needs
more work.
Returning -EBUSY in some situations may be good because the user
can then decide whether to retry. Currently, the memory offlining code
retries for some time, with a 120 second timeout.

-------------------------
7. Physical memory remove
-------------------------
More implementation is still needed:
 - Notifying the firmware when the OS has completed its remove work.
 - Guarding against physical removal before that work is complete.

--------------
8. Future Work
--------------
  - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
    sysctl or new control file.
  - showing memory section and physical device relationship.
  - showing memory section and node relationship (maybe good for NUMA)
  - showing whether a memory section is under ZONE_MOVABLE or not
  - testing memory offlining and making it better.
  - support HugeTLB page migration and offlining.
  - memmap removing at memory offline.
  - physical remove memory.
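
As a supplement to Section 6.2, the advice to retry after -EBUSY can be
sketched as a small shell helper. The name offline_with_retry, the MEMDIR
variable, the default of 3 attempts and the 1 second pause are all choices
made for this example, not kernel behaviour. The helper only sees that the
write failed; it does not tell -EBUSY apart from other errors.

```shell
#!/bin/sh
# Sketch only: helper name, MEMDIR and retry policy are illustrative.
MEMDIR=${MEMDIR:-/sys/devices/system/memory}

# offline_with_retry ID [TRIES]
# Write "offline" to memoryID/state, retrying up to TRIES times (default 3).
# Returns 0 once a write succeeds, 1 if every attempt fails.
offline_with_retry() {
    state="$MEMDIR/memory$1/state"
    tries=${2:-3}
    while :; do
        # A failed write (for example -EBUSY) gives a non-zero exit status.
        if { echo offline > "$state"; } 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep 1
    done
}
```

For example, "offline_with_retry 4 5" tries up to five times to offline
section memory4. Keeping the retry loop in user space leaves the decision of
how long to keep trying to the administrator.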