fix SMP data race in pagetable setup vs walking

There is a possible data race in the page table walking code. After the split
ptlock patches, it actually seems to have been introduced to the core code, but
even before that I think it would have impacted some architectures (powerpc
and sparc64, at least, walk the page tables without taking locks, e.g. see
find_linux_pte()).

The race is as follows:
The pte page is allocated, zeroed, and its struct page gets its spinlock
initialized. The mm-wide ptl is then taken, and then the pte page is inserted
into the pagetables.

At this point, the spinlock is not guaranteed to have ordered the previous
stores to initialize the pte page with the subsequent store to put it in the
page tables. So another Linux page table walker might be walking down (without
any locks, because we have split-leaf-ptls), and find that new pte we've
inserted. It might try to take the spinlock before the store from the other
CPU initializes it. And subsequently it might read a pte_t out before stores
from the other CPU have cleared the memory.

There are also similar races in higher levels of the page tables. They
obviously don't involve the spinlock, but could see uninitialized memory.

Arch code and hardware pagetable walkers that walk the pagetables without
locks could see similar uninitialized memory problems, regardless of whether
split ptes are enabled or not.

I prefer to put the barriers in core code, because that's where the higher
level logic happens, but the page table accessors are per-arch, and I don't
think open-coding them everywhere is an option. I'll put the read-side
barriers in alpha arch code for now (other architectures perform
data-dependent loads in order).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Nick Piggin and committed by Linus Torvalds 362a61ad 73f10281

 include/asm-alpha/pgtable.h |   21 +++++++++++++++++++++--
 mm/memory.c                 |   21 +++++++++++++++++++++
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/asm-alpha/pgtable.h b/include/asm-alpha/pgtable.h
--- a/include/asm-alpha/pgtable.h
+++ b/include/asm-alpha/pgtable.h
@@ -287,17 +287,34 @@
 #define pgd_index(address)	(((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
 #define pgd_offset(mm, address)	((mm)->pgd+pgd_index(address))
 
+/*
+ * The smp_read_barrier_depends() in the following functions are required to
+ * order the load of *dir (the pointer in the top level page table) with any
+ * subsequent load of the returned pmd_t *ret (ret is data dependent on *dir).
+ *
+ * If this ordering is not enforced, the CPU might load an older value of
+ * *ret, which may be uninitialized data. See mm/memory.c:__pte_alloc for
+ * more details.
+ *
+ * Note that we never change the mm->pgd pointer after the task is running, so
+ * pgd_offset does not require such a barrier.
+ */
+
 /* Find an entry in the second-level page table.. */
 extern inline pmd_t * pmd_offset(pgd_t * dir, unsigned long address)
 {
-	return (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+	pmd_t *ret = (pmd_t *) pgd_page_vaddr(*dir) + ((address >> PMD_SHIFT) & (PTRS_PER_PAGE - 1));
+	smp_read_barrier_depends(); /* see above */
+	return ret;
 }
 
 /* Find an entry in the third-level page table.. */
 extern inline pte_t * pte_offset_kernel(pmd_t * dir, unsigned long address)
 {
-	return (pte_t *) pmd_page_vaddr(*dir)
+	pte_t *ret = (pte_t *) pmd_page_vaddr(*dir)
 		+ ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
+	smp_read_barrier_depends(); /* see above */
+	return ret;
 }
 
 #define pte_offset_map(dir,addr)	pte_offset_kernel((dir),(addr))

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -311,6 +311,21 @@
 	if (!new)
 		return -ENOMEM;
 
+	/*
+	 * Ensure all pte setup (eg. pte page lock and page clearing) are
+	 * visible before the pte is made visible to other CPUs by being
+	 * put into page tables.
+	 *
+	 * The other side of the story is the pointer chasing in the page
+	 * table walking code (when walking the page table without locking;
+	 * ie. most of the time). Fortunately, these data accesses consist
+	 * of a chain of data-dependent loads, meaning most CPUs (alpha
+	 * being the notable exception) will already guarantee loads are
+	 * seen in-order. See the alpha page table accessors for the
+	 * smp_read_barrier_depends() barriers in page table walking code.
+	 */
+	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
+
 	spin_lock(&mm->page_table_lock);
 	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
 		mm->nr_ptes++;
@@ -328,6 +343,8 @@
 	pte_t *new = pte_alloc_one_kernel(&init_mm, address);
 	if (!new)
 		return -ENOMEM;
+
+	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
 	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
@@ -2619,6 +2636,8 @@
 	if (!new)
 		return -ENOMEM;
 
+	smp_wmb(); /* See comment in __pte_alloc */
+
 	spin_lock(&mm->page_table_lock);
 	if (pgd_present(*pgd))		/* Another has populated it */
 		pud_free(mm, new);
@@ -2639,6 +2658,8 @@
 	pmd_t *new = pmd_alloc_one(mm, address);
 	if (!new)
 		return -ENOMEM;
+
+	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&mm->page_table_lock);
 #ifndef __ARCH_HAS_4LEVEL_HACK