Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm/memory-failure.c: shift page lock from head page to tail page after thp split

After the thp split in hwpoison_user_mappings(), we hold the page lock on the
raw error page only around try_to_unmap(), so we are in danger of a race
condition.

In RHEL7 MCE-relay testing I found that we hit a "bad page" error when a
memory error happens on a thp tail page used by qemu-kvm:

Triggering MCE exception on CPU 10
mce: [Hardware Error]: Machine check events logged
MCE exception done on CPU 10
MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
MCE 0x38c535: dirty LRU page recovery: Recovered
qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
BUG: Bad page state in process qemu-kvm pfn:38c400
page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00
page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
Call Trace:
dump_stack+0x19/0x1b
bad_page.part.59+0xcf/0xe8
free_pages_prepare+0x148/0x160
free_hot_cold_page+0x31/0x140
free_hot_cold_page_list+0x46/0xa0
release_pages+0x1c1/0x200
free_pages_and_swap_cache+0xad/0xd0
tlb_flush_mmu.part.46+0x4c/0x90
tlb_finish_mmu+0x55/0x60
exit_mmap+0xcb/0x170
mmput+0x67/0xf0
vhost_dev_cleanup+0x231/0x260 [vhost_net]
vhost_net_release+0x3f/0x90 [vhost_net]
__fput+0xe9/0x270
____fput+0xe/0x10
task_work_run+0xc4/0xe0
do_exit+0x2bb/0xa40
do_group_exit+0x3f/0xa0
get_signal_to_deliver+0x1d0/0x6e0
do_signal+0x48/0x5e0
do_notify_resume+0x71/0xc0
retint_signal+0x48/0x8c

The reason for this bug is that a page fault happens before the head page is
unlocked at the end of memory_failure(). This strange page fault tries to
access address 0x20 and I'm not sure why qemu-kvm does this, but in any case
the resulting SIGSEGV makes qemu-kvm exit, and on the way out we hit the bad
page bug/warning because we try to free a locked page (the former head page).

To fix this, this patch shifts the page lock from the head page to the tail
page just after the thp split. The SIGSEGV still happens, but it now affects
only the error-affected VMs, not the whole system.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.9+] # a3e0f9e47d5ef "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Naoya Horiguchi, committed by Linus Torvalds (54b9dd14 54a43d54)

+11 -10
mm/memory-failure.c
···
	 * the pages and send SIGBUS to the processes if the data was dirty.
	 */
	static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
-			int trapno, int flags)
+			int trapno, int flags, struct page **hpagep)
	{
		enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
		struct address_space *mapping;
		LIST_HEAD(tokill);
		int ret;
		int kill = 1, forcekill;
-		struct page *hpage = compound_head(p);
+		struct page *hpage = *hpagep;
		struct page *ppage;

		if (PageReserved(p) || PageSlab(p))
···
		 * We pinned the head page for hwpoison handling,
		 * now we split the thp and we are interested in
		 * the hwpoisoned raw page, so move the refcount
-		 * to it.
+		 * to it. Similarly, page lock is shifted.
		 */
		if (hpage != p) {
			put_page(hpage);
			get_page(p);
+			lock_page(p);
+			unlock_page(hpage);
+			*hpagep = p;
		}
		/* THP is split, so ppage should be the real poisoned page. */
		ppage = p;
···
		if (kill)
			collect_procs(ppage, &tokill);

-		if (hpage != ppage)
-			lock_page(ppage);
-
		ret = try_to_unmap(ppage, ttu);
		if (ret != SWAP_SUCCESS)
			printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
				pfn, page_mapcount(ppage));
-
-		if (hpage != ppage)
-			unlock_page(ppage);

		/*
		 * Now that the dirty bit has been propagated to the
···
		/*
		 * Now take care of user space mappings.
		 * Abort on fail: __delete_from_page_cache() assumes unmapped page.
+		 *
+		 * When the raw error page is thp tail page, hpage points to the raw
+		 * page after thp split.
		 */
-		if (hwpoison_user_mappings(p, pfn, trapno, flags) != SWAP_SUCCESS) {
+		if (hwpoison_user_mappings(p, pfn, trapno, flags, &hpage)
+				!= SWAP_SUCCESS) {
			printk(KERN_ERR "MCE %#lx: cannot unmap page, give up\n", pfn);
			res = -EBUSY;
			goto out;