Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

ubifs: Fix to add refcount once page is set private

MM defined the rule [1] very clearly that once page was set with PG_private
flag, we should increment the refcount in that page, also main flows like
pageout(), migrate_page() will assume there is one additional page
reference count if page_has_private() returns true. Otherwise, we may
get a BUG in page migration:

page:0000000080d05b9d refcount:-1 mapcount:0 mapping:000000005f4d82a8
index:0xe2 pfn:0x14c12
aops:ubifs_file_address_operations [ubifs] ino:8f1 dentry name:"f30e"
flags: 0x1fffff80002405(locked|uptodate|owner_priv_1|private|node=0|
zone=1|lastcpupid=0x1fffff)
page dumped because: VM_BUG_ON_PAGE(page_count(page) != 0)
------------[ cut here ]------------
kernel BUG at include/linux/page_ref.h:184!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 38 Comm: kcompactd0 Not tainted 5.15.0-rc5
RIP: 0010:migrate_page_move_mapping+0xac3/0xe70
Call Trace:
ubifs_migrate_page+0x22/0xc0 [ubifs]
move_to_new_page+0xb4/0x600
migrate_pages+0x1523/0x1cc0
compact_zone+0x8c5/0x14b0
kcompactd+0x2bc/0x560
kthread+0x18c/0x1e0
ret_from_fork+0x1f/0x30

Before the time, we should make clean a concept, what does refcount means
in page gotten from grab_cache_page_write_begin(). There are 2 situations:
Situation 1: refcount is 3, page is created by __page_cache_alloc.
TYPE_A - the write process is using this page
TYPE_B - page is assigned to one certain mapping by calling
__add_to_page_cache_locked()
TYPE_C - page is added into pagevec list corresponding current cpu by
calling lru_cache_add()
Situation 2: refcount is 2, page is gotten from the mapping's tree
TYPE_B - page has been assigned to one certain mapping
TYPE_A - the write process is using this page (by calling
page_cache_get_speculative())
Filesystem releases one refcount by calling put_page() in xxx_write_end(),
the released refcount corresponds to TYPE_A (write task is using it). If
there are any processes using a page, page migration process will skip the
page by judging whether expected_page_refs() equals to page refcount.

The BUG is caused by following process:
PA(cpu 0) kcompactd(cpu 1)
compact_zone
ubifs_write_begin
page_a = grab_cache_page_write_begin
add_to_page_cache_lru
lru_cache_add
pagevec_add // put page into cpu 0's pagevec
(refcnf = 3, for page creation process)
ubifs_write_end
SetPagePrivate(page_a) // doesn't increase page count !
unlock_page(page_a)
put_page(page_a) // refcnt = 2
[...]

PB(cpu 0)
filemap_read
filemap_get_pages
add_to_page_cache_lru
lru_cache_add
__pagevec_lru_add // traverse all pages in cpu 0's pagevec
__pagevec_lru_add_fn
SetPageLRU(page_a)
isolate_migratepages
isolate_migratepages_block
get_page_unless_zero(page_a)
// refcnt = 3
list_add(page_a, from_list)
migrate_pages(from_list)
__unmap_and_move
move_to_new_page
ubifs_migrate_page(page_a)
migrate_page_move_mapping
expected_page_refs get 3
(migration[1] + mapping[1] + private[1])
release_pages
put_page_testzero(page_a) // refcnt = 3
page_ref_freeze // refcnt = 0
page_ref_dec_and_test(0 - 1 = -1)
page_ref_unfreeze
VM_BUG_ON_PAGE(-1 != 0, page)

UBIFS doesn't increase the page refcount after setting private flag, which
leads to page migration task believes the page is not used by any other
processes, so the page is migrated. This causes concurrent accessing on
page refcount between put_page() called by other process(eg. read process
calls lru_cache_add) and page_ref_unfreeze() called by migration task.

Actually zhangjun has tried to fix this problem [2] by recalculating page
refcnt in ubifs_migrate_page(). It's better to follow MM rules [1], because
just like Kirill suggested in [2], we need to check all users of
page_has_private() helper. Like f2fs does in [3], fix it by adding/deleting
refcount when setting/clearing private for a page. BTW, according to [4],
we set 'page->private' as 1 because ubifs just simply SetPagePrivate().
And, [5] provided a common helper to set/clear page private, ubifs can
use this helper following the example of iomap, afs, btrfs, etc.

Jump [6] to find a reproducer.

[1] https://lore.kernel.org/lkml/2b19b3c4-2bc4-15fa-15cc-27a13e5c7af1@aol.com
[2] https://www.spinics.net/lists/linux-mtd/msg04018.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1903.0/03313.html
[4] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org
[5] https://lore.kernel.org/all/20200517214718.468-1-guoqing.jiang@cloud.ionos.com
[6] https://bugzilla.kernel.org/show_bug.cgi?id=214961

Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>

authored by

Zhihao Cheng and committed by
Richard Weinberger
3b67db8a 4f2262a3

+7 -7
+7 -7
fs/ubifs/file.c
··· 570 570 } 571 571 572 572 if (!PagePrivate(page)) { 573 - SetPagePrivate(page); 573 + attach_page_private(page, (void *)1); 574 574 atomic_long_inc(&c->dirty_pg_cnt); 575 575 __set_page_dirty_nobuffers(page); 576 576 } ··· 947 947 release_existing_page_budget(c); 948 948 949 949 atomic_long_dec(&c->dirty_pg_cnt); 950 - ClearPagePrivate(page); 950 + detach_page_private(page); 951 951 ClearPageChecked(page); 952 952 953 953 kunmap(page); ··· 1304 1304 release_existing_page_budget(c); 1305 1305 1306 1306 atomic_long_dec(&c->dirty_pg_cnt); 1307 - ClearPagePrivate(page); 1307 + detach_page_private(page); 1308 1308 ClearPageChecked(page); 1309 1309 } 1310 1310 ··· 1471 1471 return rc; 1472 1472 1473 1473 if (PagePrivate(page)) { 1474 - ClearPagePrivate(page); 1475 - SetPagePrivate(newpage); 1474 + detach_page_private(page); 1475 + attach_page_private(newpage, (void *)1); 1476 1476 } 1477 1477 1478 1478 if (mode != MIGRATE_SYNC_NO_COPY) ··· 1496 1496 return 0; 1497 1497 ubifs_assert(c, PagePrivate(page)); 1498 1498 ubifs_assert(c, 0); 1499 - ClearPagePrivate(page); 1499 + detach_page_private(page); 1500 1500 ClearPageChecked(page); 1501 1501 return 1; 1502 1502 } ··· 1567 1567 else { 1568 1568 if (!PageChecked(page)) 1569 1569 ubifs_convert_page_budget(c); 1570 - SetPagePrivate(page); 1570 + attach_page_private(page, (void *)1); 1571 1571 atomic_long_inc(&c->dirty_pg_cnt); 1572 1572 __set_page_dirty_nobuffers(page); 1573 1573 }