Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

bcache: avoid journal no-space deadlock by reserving 1 journal bucket

The journal no-space deadlock has been reported from time to time. Such
a deadlock can happen in the following situation.

When all journal buckets are completely filled by active jsets under
heavy write I/O load, the cache set registration (after a reboot) loads
all active jsets and inserts them into the btree again (which is called
journal replay). If a journaled bkey is inserted into a btree node and
causes a btree node split, a new journal request might be triggered. For
example, if the btree grows one more level after the node split, the
root node record in the cache device super block will be updated by
bch_journal_meta() from bch_btree_set_root(). But there is no space left
in the journal buckets, so the journal replay has to wait for a journal
bucket to be reclaimed, which can only happen after at least one journal
bucket has been replayed. Since the replay itself is what is waiting,
neither side can make progress. This is one example of how the journal
no-space deadlock happens.
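The full-journal state described above can be sketched with a small
stand-alone model of the pre-patch check. This is only an illustration:
struct old_ring, old_take_bucket() and old_fill() are hypothetical names,
not kernel code, and the model ignores everything except the two ring
indexes.

```c
#include <stdbool.h>

/* Hypothetical model of the journal bucket ring; not the kernel structs. */
struct old_ring {
	unsigned int nbuckets;    /* plays the role of sb.njournal_buckets */
	unsigned int cur_idx;     /* bucket currently being written */
	unsigned int discard_idx; /* oldest bucket not yet reclaimed */
};

/* Pre-patch rule: advance unless the next bucket is the oldest live one. */
static bool old_take_bucket(struct old_ring *r)
{
	unsigned int next = (r->cur_idx + 1) % r->nbuckets;

	if (next == r->discard_idx)
		return false; /* no space available */

	r->cur_idx = next;
	return true;
}

/* Keep taking buckets as heavy write load would; returns how many were taken. */
static unsigned int old_fill(struct old_ring *r)
{
	unsigned int taken = 0;

	while (old_take_bucket(r))
		taken++;

	return taken;
}
```

With a 4-bucket ring starting at cur_idx == discard_idx == 0, old_fill()
takes 3 buckets and every later old_take_bucket() call fails. A journal
request raised during replay would be refused in exactly this state.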

The solution to avoid the deadlock is to reserve 1 journal bucket at
run time, and only permit the reserved journal bucket to be used during
the cache set registration procedure for things like journal replay.
Then the journal space will never be completely filled, and there is no
chance for the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" to struct journal. It is
initialized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization is done
in run_cache_set(). At run time, when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() returns 0 to indicate
that there is no free journal bucket. Then journal_reclaim() gives up
and tries again next time to see whether there is a free journal bucket
to allocate. By this method, there is always 1 journal bucket reserved
at run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.
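The run time versus registration behaviour can be sketched as a
user-space model. The counting logic follows free_journal_buckets() from
this patch, but struct ring_state, free_buckets(), take_bucket() and
fill() are hypothetical stand-ins for illustration, not the kernel
interfaces.

```c
#include <stdbool.h>

/* Hypothetical model of the journal bucket ring; not the kernel structs. */
struct ring_state {
	unsigned int nbuckets;    /* plays the role of sb.njournal_buckets */
	unsigned int cur_idx;     /* bucket currently being written */
	unsigned int discard_idx; /* oldest bucket not yet reclaimed */
	bool do_reserve;          /* true at run time, false during registration */
};

/* Buckets that may still be taken, mirroring the patched counting logic. */
static unsigned int free_buckets(const struct ring_state *r)
{
	unsigned int n;

	/* Distance from cur_idx to discard_idx around the ring */
	if (r->cur_idx >= r->discard_idx)
		n = r->nbuckets + r->discard_idx - r->cur_idx;
	else
		n = r->discard_idx - r->cur_idx;

	/* One slot is never usable; one more is held back when reserving */
	if (n > 1u + r->do_reserve)
		return n - (1u + r->do_reserve);

	return 0;
}

/* Patched allocation step; false means the caller gives up for now. */
static bool take_bucket(struct ring_state *r)
{
	if (!free_buckets(r))
		return false;

	r->cur_idx = (r->cur_idx + 1) % r->nbuckets;
	return true;
}

/* Keep taking buckets until refused; returns how many were taken. */
static unsigned int fill(struct ring_state *r)
{
	unsigned int taken = 0;

	while (take_bucket(r))
		taken++;

	return taken;
}
```

In this model a 4-bucket ring with do_reserve set to true stops after
fill() has taken 2 buckets. Clearing do_reserve, as is the case during
registration, makes free_buckets() return 1, so exactly one more
take_bucket() call succeeds before the ring is full again.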

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Authored by Coly Li and committed by Jens Axboe
32feee36 80db4e47

+29 -5
+26 -5
drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@
 	return ret;
 }
 
+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */
 
 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@
 	}
 }
 
+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;
 
@@ -649,11 +672,10 @@
 	if (c->journal.blocks_free)
 		goto out;
 
-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;
 
-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
+2
drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@
 
 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);
 
 #endif /* _BCACHE_JOURNAL_H */
+1
drivers/md/bcache/super.c
@@ -2127,6 +2127,7 @@
 
 	flash_devs_run(c);
 
+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err: