Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

proc: report file/anon bit in /proc/pid/pagemap

This is an implementation of Andrew's proposal to extend the pagemap file
bits to report what is missing about tasks' working set.

The problem with working set detection is multifaceted. In the criu
(checkpoint/restore) project we dump the tasks' memory into image files,
and to do it properly we need to detect which pages inside mappings are
really in use. The mincore syscall, which I thought could help with this,
did not. First, it doesn't report swapped pages, so we cannot find out
which parts of anonymous mappings to dump. Next, it reports pages from the
page cache as present even if they are not mapped, and it doesn't
distinguish pages that have been cow-ed from pages that have not.

Note that the issue with swap pages is critical -- we must dump swap pages
to the image file. The issues with file pages are only an optimization --
we could take all file pages into the image and it would be correct, but if
we know that a page is not mapped or not cow-ed, we can leave it out of the
dump file. The dump would still be self-consistent, though significantly
smaller in size (up to 10 times smaller on real apps).

Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
it reports whether a page is present or swapped and it doesn't report
unmapped page cache pages. But it doesn't distinguish cow-ed file pages
from non-cow-ed ones.
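
To make the "present or swapped" part concrete, here is a minimal userspace sketch of locating and decoding one pagemap entry. It relies only on the documented layout (one 64-bit entry per virtual page, bit 63 present, bit 62 swapped, bits 0-54 holding either the PFN or the swap type and offset). The names `pagemap_view`, `pagemap_offset` and `pagemap_decode` are hypothetical helpers made up for this illustration; they are not kernel or criu APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout as documented for /proc/pid/pagemap. */
#define PAGEMAP_PRESENT    (1ULL << 63)          /* page present in RAM */
#define PAGEMAP_SWAPPED    (1ULL << 62)          /* page swapped out */
#define PAGEMAP_LOW_MASK   ((1ULL << 55) - 1)    /* bits 0-54 */
#define PAGEMAP_SWAP_TYPE  0x1fULL               /* bits 0-4 */

/* Hypothetical userspace view of one entry; not a kernel structure. */
struct pagemap_view {
	bool present;
	bool swapped;
	uint64_t pfn;          /* meaningful only if present */
	unsigned swap_type;    /* meaningful only if swapped */
	uint64_t swap_offset;  /* meaningful only if swapped */
};

/* Byte offset in the pagemap file of the entry covering vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Split a raw 64-bit entry into its documented fields. */
static struct pagemap_view pagemap_decode(uint64_t entry)
{
	struct pagemap_view v = {0};
	uint64_t low = entry & PAGEMAP_LOW_MASK;

	v.present = (entry & PAGEMAP_PRESENT) != 0;
	v.swapped = (entry & PAGEMAP_SWAPPED) != 0;
	if (v.present) {
		v.pfn = low;
	} else if (v.swapped) {
		v.swap_type = (unsigned)(low & PAGEMAP_SWAP_TYPE);
		v.swap_offset = low >> 5;   /* bits 5-54 */
	}
	return v;
}
```

A caller would read 8 bytes at `pagemap_offset(addr, page_size)` from the pagemap file of the target task and pass them to `pagemap_decode()`. The I/O is left out to keep the sketch self-contained.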

I would like to use the last unused bit in this file to report whether the
page mapped into the respective pte is PageAnon or not.
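
As an illustration of how a dumper might consume the new bit, the sketch below decides whether a page of a private file-backed mapping has to be written to the image. It is an assumption-laden reading of the reasoning in this message, not criu's actual code: `must_dump_private_file_page` is a hypothetical name, and shared-anon pages (which also set bit 61) would need separate handling that is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status bits as documented for /proc/pid/pagemap after this patch. */
#define PM_BIT_PRESENT  (1ULL << 63)
#define PM_BIT_SWAP     (1ULL << 62)
#define PM_BIT_FILE     (1ULL << 61)   /* file-page or shared-anon */

/*
 * Hypothetical policy for one page of a PRIVATE file-backed mapping.
 * Returns true when the page content must go into the dump image.
 */
static bool must_dump_private_file_page(uint64_t entry)
{
	/* Neither present nor swapped: nothing is mapped at this address. */
	if (!(entry & (PM_BIT_PRESENT | PM_BIT_SWAP)))
		return false;

	/* Still the page-cache copy (not cow-ed): restorable from the file. */
	if (entry & PM_BIT_FILE)
		return false;

	/* Anonymous copy, in RAM or in swap: only the task holds this data. */
	return true;
}
```

The swapped case returns true unconditionally because, in this patch, PM_FILE is only ever set on a swap entry when it is a migration entry that resolves to a non-anonymous page.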

[comment stolen from Pavel Emelyanov's v1 patch]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Konstantin Khlebnikov and committed by Linus Torvalds
052fb0d6 715be1fc

+30 -18
+1 -1
Documentation/vm/pagemap.txt
···
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bits 55-60 page shift (page size = 1<<page shift)
-    * Bit  61    reserved for future use
+    * Bit  61    page is file-page or shared-anon
     * Bit  62    page swapped
     * Bit  63    page present

+29 -17
fs/proc/task_mmu.c
···

 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
+#define PM_FILE             PM_STATUS(1LL)
 #define PM_NOT_PRESENT      PM_PSHIFT(PAGE_SHIFT)
 #define PM_END_OF_BUFFER    1

···
 	return err;
 }

-static u64 swap_pte_to_pagemap_entry(pte_t pte)
+static void pte_to_pagemap_entry(pagemap_entry_t *pme,
+		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
-	swp_entry_t e = pte_to_swp_entry(pte);
-	return swp_type(e) | (swp_offset(e) << MAX_SWAPFILES_SHIFT);
-}
+	u64 frame, flags;
+	struct page *page = NULL;

-static void pte_to_pagemap_entry(pagemap_entry_t *pme, pte_t pte)
-{
-	if (is_swap_pte(pte))
-		*pme = make_pme(PM_PFRAME(swap_pte_to_pagemap_entry(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_SWAP);
-	else if (pte_present(pte))
-		*pme = make_pme(PM_PFRAME(pte_pfn(pte))
-				| PM_PSHIFT(PAGE_SHIFT) | PM_PRESENT);
-	else
+	if (pte_present(pte)) {
+		frame = pte_pfn(pte);
+		flags = PM_PRESENT;
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		frame = swp_type(entry) |
+			(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+		flags = PM_SWAP;
+		if (is_migration_entry(entry))
+			page = migration_entry_to_page(entry);
+	} else {
 		*pme = make_pme(PM_NOT_PRESENT);
+		return;
+	}
+
+	if (page && !PageAnon(page))
+		flags |= PM_FILE;
+
+	*pme = make_pme(PM_PFRAME(frame) | PM_PSHIFT(PAGE_SHIFT) | flags);
 }

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
···
 		if (vma && (vma->vm_start <= addr) &&
 		    !is_vm_hugetlb_page(vma)) {
 			pte = pte_offset_map(pmd, addr);
-			pte_to_pagemap_entry(&pme, *pte);
+			pte_to_pagemap_entry(&pme, vma, addr, *pte);
 			/* unmap before userspace copy */
 			pte_unmap(pte);
 		}
···
  * For each page in the address space, this file contains one 64-bit entry
  * consisting of the following:
  *
- * Bits 0-55  page frame number (PFN) if present
+ * Bits 0-54  page frame number (PFN) if present
  * Bits 0-4   swap type if swapped
- * Bits 5-55  swap offset if swapped
+ * Bits 5-54  swap offset if swapped
  * Bits 55-60 page shift (page size = 1<<page shift)
- * Bit  61    reserved for future use
+ * Bit  61    page is file-page or shared-anon
  * Bit  62    page swapped
  * Bit  63    page present
  *
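
The arithmetic behind `PM_STATUS(1LL)` landing on bit 61, and behind the comment fix from "Bits 0-55" to "Bits 0-54", can be checked with a small userspace replica of the macros. The `PM_STATUS_*`, `PM_PSHIFT_*` and `PM_PFRAME_*` definitions below are not part of this diff; they are written from memory of fs/proc/task_mmu.c of that period and should be treated as assumptions, with `uint64_t` standing in for the kernel's `u64`. `pm_entry` is a hypothetical stand-in for the expression passed to `make_pme()`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definitions (recalled, not shown in the diff): 3 status bits on top. */
#define PM_STATUS_BITS    3
#define PM_STATUS_OFFSET  (64 - PM_STATUS_BITS)                 /* 61 */
#define PM_STATUS_MASK    (((1ULL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
#define PM_STATUS(nr)     (((uint64_t)(nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)

/* Assumed definitions: 6 page-shift bits directly below the status bits. */
#define PM_PSHIFT_BITS    6
#define PM_PSHIFT_OFFSET  (PM_STATUS_OFFSET - PM_PSHIFT_BITS)   /* 55 */
#define PM_PSHIFT_MASK    (((1ULL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
#define PM_PSHIFT(x)      (((uint64_t)(x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)

/* Everything below bit 55 is left for the frame: bits 0-54. */
#define PM_PFRAME_MASK    ((1ULL << PM_PSHIFT_OFFSET) - 1)
#define PM_PFRAME(x)      ((uint64_t)(x) & PM_PFRAME_MASK)

/* Taken from the diff. */
#define PM_PRESENT  PM_STATUS(4LL)   /* bit 63 */
#define PM_SWAP     PM_STATUS(2LL)   /* bit 62 */
#define PM_FILE     PM_STATUS(1LL)   /* bit 61 */

/* Hypothetical helper: composes an entry the way pte_to_pagemap_entry() does. */
static uint64_t pm_entry(uint64_t frame, unsigned page_shift, uint64_t flags)
{
	return PM_PFRAME(frame) | PM_PSHIFT(page_shift) | flags;
}
```

Under these assumptions the three status values 4, 2 and 1 map to bits 63, 62 and 61, and the frame mask covers exactly 55 bits, which is consistent with the corrected comment.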