Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

docs/mm: Physical Memory: Populate the "Zones" section

Briefly describe what zones are and the fields of struct zone.

Link: https://lkml.kernel.org/r/20250315211317.27612-1-jiwen7.qi@gmail.com
Signed-off-by: Jiwen Qi <jiwen7.qi@gmail.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Authored by Jiwen Qi, committed by Andrew Morton
9f171d94 f6a09e68

+264 -2
Documentation/mm/physical_memory.rst
Zones
=====

As we have mentioned, each zone in memory is described by a ``struct zone``,
which is an element of the ``node_zones`` array of the node it belongs to.
``struct zone`` is the core data structure of the page allocator. A zone
represents a range of physical memory and may have holes.

The page allocator uses the GFP flags (see :ref:`mm-api-gfp-flags`) specified
by a memory allocation to determine the highest zone in a node from which the
allocation can be satisfied. The page allocator first allocates memory from
that zone; if it cannot allocate the requested amount of memory from the zone,
it falls back to the next lower zone in the node, and the process continues
down to and including the lowest zone. For example, if a node contains
``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE``, and the highest zone of
a memory allocation is ``ZONE_MOVABLE``, the page allocator tries the zones in
the order ``ZONE_MOVABLE`` > ``ZONE_NORMAL`` > ``ZONE_DMA32``.

At runtime, free pages in a zone sit either in the Per-CPU Pagesets (PCP) or
in the free areas of the zone. The Per-CPU Pagesets are a vital mechanism in
the kernel's memory management: by handling the most frequent allocations and
frees locally on each CPU, they improve performance and scalability,
especially on systems with many cores. The page allocator employs a two-step
strategy for memory allocation, starting with the Per-CPU Pagesets before
falling back to the buddy allocator. Pages are transferred between the Per-CPU
Pagesets and the global free areas (managed by the buddy allocator) in
batches.
This minimizes the overhead of frequent interactions with the global buddy
allocator.

Architecture specific code calls free_area_init() to initialize zones.

Zone structure
--------------

The zone structure, ``struct zone``, is defined in
``include/linux/mmzone.h``. Here we briefly describe the fields of this
structure:

General
~~~~~~~

``_watermark``
  The watermarks for this zone. When the amount of free pages in a zone is
  below the min watermark, boosting is ignored, an allocation may trigger
  direct reclaim and direct compaction, and the watermark is also used to
  throttle direct reclaim. When the amount of free pages in a zone is below
  the low watermark, kswapd is woken up. When the amount of free pages in a
  zone is above the high watermark, kswapd stops reclaiming (the zone is
  balanced) when the ``NUMA_BALANCING_MEMORY_TIERING`` bit of
  ``sysctl_numa_balancing_mode`` is not set. The promo watermark is used for
  memory tiering and NUMA balancing: when the amount of free pages in a zone
  is above the promo watermark, kswapd stops reclaiming when the
  ``NUMA_BALANCING_MEMORY_TIERING`` bit of ``sysctl_numa_balancing_mode`` is
  set. The watermarks are set by ``__setup_per_zone_wmarks()``. The min
  watermark is calculated according to the ``vm.min_free_kbytes`` sysctl. The
  other three watermarks are set according to the distance between two
  watermarks; the distance itself is calculated taking the
  ``vm.watermark_scale_factor`` sysctl into account.

``watermark_boost``
  The number of pages used to boost watermarks, increasing reclaim pressure
  to reduce the likelihood of future fallbacks, and to wake kswapd now
  because the node may be balanced overall and kswapd would not wake
  naturally.

``nr_reserved_highatomic``
  The number of pages which are reserved for high-order atomic allocations.
``nr_free_highatomic``
  The number of free pages in reserved highatomic pageblocks.

``lowmem_reserve``
  The array of the amounts of memory reserved in this zone for memory
  allocations. For example, if the highest zone a memory allocation can
  allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in
  this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when
  attempting to allocate memory from this zone. This is a mechanism the page
  allocator uses to prevent allocations which could use ``highmem`` from
  using too much ``lowmem``. For some specialised workloads on ``highmem``
  machines, it is dangerous for the kernel to allow process memory to be
  allocated from the ``lowmem`` zone, because that memory could then be
  pinned via the ``mlock()`` system call, or by unavailability of swapspace.
  The ``vm.lowmem_reserve_ratio`` sysctl determines how aggressive the kernel
  is in defending these lower zones. This array is recalculated by
  ``setup_per_zone_lowmem_reserve()`` at runtime if the
  ``vm.lowmem_reserve_ratio`` sysctl changes.

``node``
  The index of the node this zone belongs to. Available only when
  ``CONFIG_NUMA`` is enabled, because there is only one zone in a UMA system.

``zone_pgdat``
  Pointer to the ``struct pglist_data`` of the node this zone belongs to.

``per_cpu_pageset``
  Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by
  ``setup_zone_pageset()``. By handling the most frequent allocations and
  frees locally on each CPU, the PCP improves performance and scalability on
  systems with many cores.

``pageset_high_min``
  Copied to the ``high_min`` of the Per-CPU Pagesets for faster access.

``pageset_high_max``
  Copied to the ``high_max`` of the Per-CPU Pagesets for faster access.
``pageset_batch``
  Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The
  ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used
  to calculate the number of elements the Per-CPU Pagesets obtain from the
  buddy allocator under a single hold of the lock for efficiency. They are
  also used to decide whether the Per-CPU Pagesets return pages to the buddy
  allocator in the page free process.

``pageblock_flags``
  The pointer to the flags for the pageblocks in the zone (see
  ``include/linux/pageblock-flags.h`` for the list of flags). The memory is
  allocated in ``setup_usemap()``. Each pageblock occupies
  ``NR_PAGEBLOCK_BITS`` bits. Defined only when ``CONFIG_FLATMEM`` is
  enabled; the flags are stored in ``mem_section`` when ``CONFIG_SPARSEMEM``
  is enabled.

``zone_start_pfn``
  The start PFN of the zone. It is initialized by
  ``calculate_node_totalpages()``.

``managed_pages``
  The present pages managed by the buddy system, which is calculated as:
  ``managed_pages`` = ``present_pages`` - ``reserved_pages``, where
  ``reserved_pages`` includes pages allocated by the memblock allocator. It
  should be used by the page allocator and the VM scanner to calculate all
  kinds of watermarks and thresholds. It is accessed using
  ``atomic_long_xxx()`` functions. It is initialized in
  ``free_area_init_core()`` and is then reinitialized when the memblock
  allocator frees pages into the buddy system.

``spanned_pages``
  The total pages spanned by the zone, including holes, which is calculated
  as: ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is
  initialized by ``calculate_node_totalpages()``.

``present_pages``
  The physical pages existing within the zone, which is calculated as:
  ``present_pages`` = ``spanned_pages`` - ``absent_pages`` (pages in holes).
  It may be used by memory hotplug or memory power management logic to figure
  out unmanaged pages by checking (``present_pages`` - ``managed_pages``).
  Write access to ``present_pages`` at runtime should be protected by
  ``mem_hotplug_begin/done()``. Any reader who can't tolerate drift of
  ``present_pages`` should use ``get_online_mems()`` to get a stable value.
  It is initialized by ``calculate_node_totalpages()``.

``present_early_pages``
  The present pages existing within the zone located on memory available
  since early boot, excluding hotplugged memory. Defined only when
  ``CONFIG_MEMORY_HOTPLUG`` is enabled and initialized by
  ``calculate_node_totalpages()``.

``cma_pages``
  The pages reserved for CMA use. These pages behave like ``ZONE_MOVABLE``
  when they are not used for CMA. Defined only when ``CONFIG_CMA`` is
  enabled.

``name``
  The name of the zone. It is a pointer to the corresponding element of the
  ``zone_names`` array.

``nr_isolate_pageblock``
  Number of isolated pageblocks. It is used to solve an incorrect freepage
  counting problem caused by racy retrieval of the migratetype of a
  pageblock. Protected by ``zone->lock``. Defined only when
  ``CONFIG_MEMORY_ISOLATION`` is enabled.

``span_seqlock``
  The seqlock protecting ``zone_start_pfn`` and ``spanned_pages``. It is a
  seqlock because it has to be read outside of ``zone->lock``, and it is
  done in the main allocator path. However, the seqlock is written quite
  infrequently. Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled.

``initialized``
  The flag indicating whether the zone is initialized. Set by
  ``init_currently_empty_zone()`` during boot.

``free_area``
  The array of free areas, where each element corresponds to a specific
  order, which is a power of two.
  The buddy allocator uses this structure to manage free memory efficiently.
  When allocating, it tries to find the smallest sufficient block; if that
  block is larger than the requested size, it is recursively split into the
  next smaller blocks until the required size is reached. When a page is
  freed, it may be merged with its buddy to form a larger block. The array
  is initialized by ``zone_init_free_lists()``.

``unaccepted_pages``
  The list of pages to be accepted. All pages on the list are
  ``MAX_PAGE_ORDER``. Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is
  enabled.

``flags``
  The zone flags. The least significant three bits are used and defined by
  ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): the zone recently
  boosted watermarks; cleared when kswapd is woken.
  ``ZONE_RECLAIM_ACTIVE`` (bit 1): kswapd may be scanning the zone.
  ``ZONE_BELOW_HIGH`` (bit 2): the zone is below the high watermark.

``lock``
  The main lock protecting the internal data structures of the page
  allocator specific to the zone; in particular, it protects ``free_area``.

``percpu_drift_mark``
  When free pages are below this point, additional steps are taken when
  reading the number of free pages, to avoid per-CPU counter drift allowing
  watermarks to be breached. It is updated in
  ``refresh_zone_stat_thresholds()``.

Compaction control
~~~~~~~~~~~~~~~~~~

``compact_cached_free_pfn``
  The PFN where the compaction free scanner should start in the next scan.

``compact_cached_migrate_pfn``
  The PFNs where the compaction migration scanner should start in the next
  scan. This array has two elements: the first one is used in
  ``MIGRATE_ASYNC`` mode, and the other one is used in ``MIGRATE_SYNC``
  mode.
``compact_init_migrate_pfn``
  The initial migration PFN, which is initialized to 0 at boot time, and to
  the first pageblock with migratable pages in the zone after a full
  compaction finishes. It is used to check whether a scan is a whole zone
  scan or not.

``compact_init_free_pfn``
  The initial free PFN, which is initialized to 0 at boot time and to the
  last pageblock with free ``MIGRATE_MOVABLE`` pages in the zone. It is used
  to check whether it is the start of a scan.

``compact_considered``
  The number of compactions attempted since the last failure. It is reset in
  ``defer_compaction()`` when a compaction fails to result in a page
  allocation success, and increased by 1 in ``compaction_deferred()`` when a
  compaction should be skipped. ``compaction_deferred()`` is called before
  ``compact_zone()`` is called; ``compaction_defer_reset()`` is called when
  ``compact_zone()`` returns ``COMPACT_SUCCESS``; ``defer_compaction()`` is
  called when ``compact_zone()`` returns ``COMPACT_PARTIAL_SKIPPED`` or
  ``COMPACT_COMPLETE``.

``compact_defer_shift``
  The number of compactions skipped before trying again is
  ``1 << compact_defer_shift``. It is increased by 1 in
  ``defer_compaction()`` and reset in ``compaction_defer_reset()`` when a
  direct compaction results in a page allocation success. Its maximum value
  is ``COMPACT_MAX_DEFER_SHIFT``.

``compact_order_failed``
  The minimum compaction failed order. It is set in
  ``compaction_defer_reset()`` when a compaction succeeds, and in
  ``defer_compaction()`` when a compaction fails to result in a page
  allocation success.

``compact_blockskip_flush``
  Set to true when the compaction migration scanner and free scanner meet,
  which means the ``PB_migrate_skip`` bits should be cleared.
``contiguous``
  Set to true when the zone is contiguous (in other words, it has no holes).

Statistics
~~~~~~~~~~

``vm_stat``
  VM statistics for the zone. The items tracked are defined by
  ``enum zone_stat_item``.

``vm_numa_event``
  VM NUMA event statistics for the zone. The items tracked are defined by
  ``enum numa_stat_item``.

``per_cpu_zonestats``
  Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA
  event statistics on a per-CPU basis. It reduces updates to the global
  ``vm_stat`` and ``vm_numa_event`` fields of the zone to improve
  performance.

.. _pages: