
bcache: reduce gc latency by processing fewer nodes and sleeping less

When a bcache device is busy with high I/O load, there are two ways to
reduce garbage collection latency:
- Process fewer nodes in each loop of incremental garbage collection in
btree_gc_recurse().
- Sleep for less time between two runs of full garbage collection in
bch_btree_gc().

This patch introduces two helper routines to provide different garbage
collection node counts and sleep interval times.
- btree_gc_min_nodes()
If there is no front-end I/O in flight, return 1/128 of the total btree
node count (but no fewer than 10 nodes) to process in each incremental
loop; otherwise only 10 nodes are returned, so front-end I/O can access
the btree earlier.
- btree_gc_sleep_ms()
If no thread is synchronously waiting for bucket allocation, sleep
100 ms between two incremental GC loops. Otherwise sleep only 10 ms, so
a faster GC can make free buckets available earlier and avoid starving
most bcache working threads on bucket allocation.

The idea is inspired by work from Mingzhe Zou and Robert Pang, but is
much simpler, and the expected behavior is more predictable.

Signed-off-by: Coly Li <colyli@fnnas.com>
Signed-off-by: Robert Pang <robertpang@google.com>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Authored by Coly Li, committed by Jens Axboe
70bc173c 7bf90cd7

+29 -24

drivers/md/bcache/alloc.c (+4)
···
 			TASK_UNINTERRUPTIBLE);
 
 		mutex_unlock(&ca->set->bucket_lock);
+
+		atomic_inc(&ca->set->bucket_wait_cnt);
 		schedule();
+		atomic_dec(&ca->set->bucket_wait_cnt);
+
 		mutex_lock(&ca->set->bucket_lock);
 	} while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
 		 !fifo_pop(&ca->free[reserve], r));
drivers/md/bcache/bcache.h (+1)
···
 	 */
 	atomic_t		prio_blocked;
 	wait_queue_head_t	bucket_wait;
+	atomic_t		bucket_wait_cnt;
 
 	/*
 	 * For any bio we don't skip we subtract the number of sectors from
drivers/md/bcache/btree.c (+24 -24)
···
  * Test module load/unload
  */
 
-#define MAX_GC_TIMES		100
-#define MIN_GC_NODES		100
+#define MAX_GC_TIMES_SHIFT	7	/* 128 loops */
+#define GC_NODES_MIN		10
+#define GC_SLEEP_MS_MIN	10
 #define GC_SLEEP_MS		100
 
 #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
···
 static size_t btree_gc_min_nodes(struct cache_set *c)
 {
-	size_t min_nodes;
+	size_t min_nodes = GC_NODES_MIN;
 
-	/*
-	 * Since incremental GC would stop 100ms when front
-	 * side I/O comes, so when there are many btree nodes,
-	 * if GC only processes constant (100) nodes each time,
-	 * GC would last a long time, and the front side I/Os
-	 * would run out of the buckets (since no new bucket
-	 * can be allocated during GC), and be blocked again.
-	 * So GC should not process constant nodes, but varied
-	 * nodes according to the number of btree nodes, which
-	 * realized by dividing GC into constant(100) times,
-	 * so when there are many btree nodes, GC can process
-	 * more nodes each time, otherwise, GC will process less
-	 * nodes each time (but no less than MIN_GC_NODES)
-	 */
-	min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
-	if (min_nodes < MIN_GC_NODES)
-		min_nodes = MIN_GC_NODES;
+	if (atomic_read(&c->search_inflight) == 0) {
+		size_t n = c->gc_stats.nodes >> MAX_GC_TIMES_SHIFT;
+
+		if (min_nodes < n)
+			min_nodes = n;
+	}
 
 	return min_nodes;
 }
 
+static uint64_t btree_gc_sleep_ms(struct cache_set *c)
+{
+	uint64_t sleep_ms;
+
+	if (atomic_read(&c->bucket_wait_cnt) > 0)
+		sleep_ms = GC_SLEEP_MS_MIN;
+	else
+		sleep_ms = GC_SLEEP_MS;
+
+	return sleep_ms;
+}
 
 static int btree_gc_recurse(struct btree *b, struct btree_op *op,
 			    struct closure *writes, struct gc_stat *gc)
···
 		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
 		r->b = NULL;
 
-		if (atomic_read(&b->c->search_inflight) &&
-		    gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
+		if (gc->nodes >= (gc->nodes_pre + btree_gc_min_nodes(b->c))) {
 			gc->nodes_pre = gc->nodes;
 			ret = -EAGAIN;
 			break;
···
 		cond_resched();
 
 		if (ret == -EAGAIN)
-			schedule_timeout_interruptible(msecs_to_jiffies
-						       (GC_SLEEP_MS));
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(btree_gc_sleep_ms(c)));
 		else if (ret)
 			pr_warn("gc failed!\n");
 	} while (ret && !test_bit(CACHE_SET_IO_DISABLE, &c->flags));