Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

mm: fix PageUptodate data race

After running SetPageUptodate, the preceding stores to the page contents (the
stores that actually bring it uptodate) may not be ordered with the store that
sets the page uptodate.

Therefore, another CPU that sees PageUptodate return true and then reads the
page contents can get stale data.

Fix this by issuing an smp_wmb before setting PG_uptodate in SetPageUptodate,
and an smp_rmb after testing it in PageUptodate.

Many places that test PageUptodate do so with the page locked, and this would
be enough to ensure memory ordering in those places if SetPageUptodate were
only ever called while the page is locked. Unfortunately that is not the case
for some filesystems, though it could be an idea for the future.

Also bring the handling of anonymous page uptodateness in line with that of
file backed page management, by marking anon pages as uptodate when they
_are_ uptodate, rather than when our implementation requires that they be
marked as such. Doing so allows us to get rid of the smp_wmb's in the page
copying functions, which were added specifically for anonymous pages to solve
an analogous memory ordering problem. Both file and anonymous pages are now
handled with the same barriers.

FAQ:
Q. Why not do this in flush_dcache_page?
A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
ordering protocol; we'd still need an smp_rmb somewhere. Secondly, hiding
memory barriers away in a completely unrelated function is nasty; at least in
the PageUptodate macros they are located together with (half) the operations
involved in the ordering. Thirdly, the smp_wmb is only required when first
bringing the page uptodate, whereas flush_dcache_page should be called each
time the page is written to through the kernel mapping. It is logically the
wrong place to put it.

Q. Why does this increase my text size / reduce my performance / etc.?
A. Because it is adding the necessary instructions to eliminate the data-race.

Q. Can it be improved?
A. Yes. For example, if you were to create a rule that all SetPageUptodate
operations run under the page lock, we could avoid the smp_rmb in places where
PageUptodate is queried under the page lock. That requires an audit of all
filesystems, and at least some would need reworking. Great that you're
interested; I'm eagerly awaiting your patches.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Nick Piggin, committed by Linus Torvalds
commit 0ed361de, parent 62e1c553

6 files changed, 48 insertions(+), 13 deletions(-)
include/linux/highmem.h (+0 -4)

···
 	void *addr = kmap_atomic(page, KM_USER0);
 	clear_user_page(addr, vaddr, page);
 	kunmap_atomic(addr, KM_USER0);
-	/* Make sure this page is cleared on other CPU's too before using it */
-	smp_wmb();
 }
 
 #ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
···
 	copy_user_page(vto, vfrom, vaddr, to);
 	kunmap_atomic(vfrom, KM_USER0);
 	kunmap_atomic(vto, KM_USER1);
-	/* Make sure this page is cleared on other CPU's too before using it */
-	smp_wmb();
 }
 
 #endif
include/linux/page-flags.h (+39 -3)

···
 #define ClearPageReferenced(page)	clear_bit(PG_referenced, &(page)->flags)
 #define TestClearPageReferenced(page) test_and_clear_bit(PG_referenced, &(page)->flags)
 
-#define PageUptodate(page)	test_bit(PG_uptodate, &(page)->flags)
+static inline int PageUptodate(struct page *page)
+{
+	int ret = test_bit(PG_uptodate, &(page)->flags);
+
+	/*
+	 * Must ensure that the data we read out of the page is loaded
+	 * _after_ we've loaded page->flags to check for PageUptodate.
+	 * We can skip the barrier if the page is not uptodate, because
+	 * we wouldn't be reading anything from it.
+	 *
+	 * See SetPageUptodate() for the other side of the story.
+	 */
+	if (ret)
+		smp_rmb();
+
+	return ret;
+}
+
+static inline void __SetPageUptodate(struct page *page)
+{
+	smp_wmb();
+	__set_bit(PG_uptodate, &(page)->flags);
 #ifdef CONFIG_S390
+	page_clear_dirty(page);
+#endif
+}
+
 static inline void SetPageUptodate(struct page *page)
 {
+#ifdef CONFIG_S390
 	if (!test_and_set_bit(PG_uptodate, &page->flags))
 		page_clear_dirty(page);
-}
 #else
-#define SetPageUptodate(page)	set_bit(PG_uptodate, &(page)->flags)
+	/*
+	 * Memory barrier must be issued before setting the PG_uptodate bit,
+	 * so that all previous stores issued in order to bring the page
+	 * uptodate are actually visible before PageUptodate becomes true.
+	 *
+	 * s390 doesn't need an explicit smp_wmb here because the test and
+	 * set bit already provides full barriers.
+	 */
+	smp_wmb();
+	set_bit(PG_uptodate, &(page)->flags);
 #endif
+}
+
 #define ClearPageUptodate(page)	clear_bit(PG_uptodate, &(page)->flags)
 
 #define PageDirty(page)		test_bit(PG_dirty, &(page)->flags)
mm/hugetlb.c (+2 -0)

···
 
 	spin_unlock(&mm->page_table_lock);
 	copy_huge_page(new_page, old_page, address, vma);
+	__SetPageUptodate(new_page);
 	spin_lock(&mm->page_table_lock);
 
 	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
···
 		goto out;
 	}
 	clear_huge_page(page, address);
+	__SetPageUptodate(page);
 
 	if (vma->vm_flags & VM_SHARED) {
 		int err;
mm/memory.c (+5 -4)

···
 		memset(kaddr, 0, PAGE_SIZE);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(dst);
-		return;
-
-	}
-	copy_user_highpage(dst, src, va, vma);
+	} else
+		copy_user_highpage(dst, src, va, vma);
 }
 
 /*
···
 	if (!new_page)
 		goto oom;
 	cow_user_page(new_page, old_page, address, vma);
+	__SetPageUptodate(new_page);
 
 	/*
 	 * Re-check the pte - we dropped the lock
···
 		page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!page)
 			goto oom;
+		__SetPageUptodate(page);
 
 		entry = mk_pte(page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
···
 				goto out;
 			}
 			copy_user_highpage(page, vmf.page, address, vma);
+			__SetPageUptodate(page);
 		} else {
 			/*
 			 * If the page will be shareable, see if the backing
mm/page_io.c (+1 -1)

···
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	ClearPageUptodate(page);
+	BUG_ON(PageUptodate(page));
 	bio = get_swap_bio(GFP_KERNEL, page_private(page), page,
 				end_swap_bio_read);
 	if (bio == NULL) {
mm/swap_state.c (+1 -1)

···
 	int err;
 
 	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageUptodate(page));
 
 	for (;;) {
 		entry = get_swap_page();
···
 
 	switch (err) {
 	case 0:				/* Success */
-		SetPageUptodate(page);
 		SetPageDirty(page);
 		return 1;
 	case -EEXIST: