memcg: fix mem_cgroup_move_lists locking

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kernel os linux

Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
into page freeing, I've hit it from time to time in testing on some machines,
sometimes only after many days. Recently found a machine which could usually
produce it within a few hours, which got me there at last.

The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
arrangement of structures was such that you got page_cgroups from the lru list
neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
independently.

The main problem was that it was missing the lock_page_cgroup it needs to
safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
comments on the constraints.

This patch immediately gets replaced by a simpler one from Hirokazu-san; but
is it just foolish pride that tells me to put this one on record, in case we
need to come back to it later?

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

Hugh Dickins and committed by

Linus Torvalds 18 years ago 2680eed7 6d48ff8b

+43 -6

1 changed file

expand all

memcontrol.c

+43 -6

mm/memcontrol.c

··· 277 277 bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); 278 278 } 279 279 280 + static int try_lock_page_cgroup(struct page *page) 281 + { 282 + return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); 283 + } 284 + 280 285 static void unlock_page_cgroup(struct page *page) 281 286 { 282 287 bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup); ··· 353 348 void mem_cgroup_move_lists(struct page *page, bool active) 354 349 { 355 350 struct page_cgroup *pc; 351 + struct mem_cgroup *mem; 356 352 struct mem_cgroup_per_zone *mz; 357 353 unsigned long flags; 358 354 359 - pc = page_get_page_cgroup(page); 360 - if (!pc) 355 + /* 356 + * We cannot lock_page_cgroup while holding zone's lru_lock, 357 + * because other holders of lock_page_cgroup can be interrupted 358 + * with an attempt to rotate_reclaimable_page. But we cannot 359 + * safely get to page_cgroup without it, so just try_lock it: 360 + * mem_cgroup_isolate_pages allows for page left on wrong list. 361 + */ 362 + if (!try_lock_page_cgroup(page)) 361 363 return; 362 364 363 - mz = page_cgroup_zoneinfo(pc); 364 - spin_lock_irqsave(&mz->lru_lock, flags); 365 - __mem_cgroup_move_lists(pc, active); 366 - spin_unlock_irqrestore(&mz->lru_lock, flags); 365 + /* 366 + * Now page_cgroup is stable, but we cannot acquire mz->lru_lock 367 + * while holding it, because mem_cgroup_force_empty_list does the 368 + * reverse. Get a hold on the mem_cgroup before unlocking, so that 369 + * the zoneinfo remains stable, then take mz->lru_lock; then check 370 + * that page still points to pc and pc (even if freed and reassigned 371 + * to that same page meanwhile) still points to the same mem_cgroup. 372 + * Then we know mz still points to the right spinlock, so it's safe 373 + * to move_lists (page->page_cgroup might be reset while we do so, but 374 + * that doesn't matter: pc->page is stable till we drop mz->lru_lock). 375 + * We're being a little naughty not to try_lock_page_cgroup again 376 + * inside there, but we are safe, aren't we? Aren't we? Whistle... 377 + */ 378 + pc = page_get_page_cgroup(page); 379 + if (pc) { 380 + mem = pc->mem_cgroup; 381 + mz = page_cgroup_zoneinfo(pc); 382 + css_get(&mem->css); 383 + 384 + unlock_page_cgroup(page); 385 + 386 + spin_lock_irqsave(&mz->lru_lock, flags); 387 + if (page_get_page_cgroup(page) == pc && pc->mem_cgroup == mem) 388 + __mem_cgroup_move_lists(pc, active); 389 + spin_unlock_irqrestore(&mz->lru_lock, flags); 390 + 391 + css_put(&mem->css); 392 + } else 393 + unlock_page_cgroup(page); 367 394 } 368 395 369 396 /*