Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()

When doing stress test for qp-trie, bpf_mem_alloc() returned NULL
unexpectedly because all qp-trie operations were initiated from
bpf syscalls and there was still available free memory. bpf_obj_new()
has the same problem as shown by the following selftest.

The failure is due to the preemption. irq_work_raise() will invoke
irq_work_claim() first to mark the irq work as pending and then inovke
__irq_work_queue_local() to raise an IPI. So when the current task
which is invoking irq_work_raise() is preempted by other task,
unit_alloc() may return NULL for preemption task as shown below:

task A task B

unit_alloc()
// low_watermark = 32
// free_cnt = 31 after alloc
irq_work_raise()
// mark irq work as IRQ_WORK_PENDING
irq_work_claim()

// task B preempts task A
unit_alloc()
// free_cnt = 30 after alloc
// irq work is already PENDING,
// so just return
irq_work_raise()
// does unit_alloc() 30-times
......
unit_alloc()
// free_cnt = 0 before alloc
return NULL

Fix it by enabling IRQ after irq_work_raise() completes. An alternative
fix is using preempt_{disable|enable}_notrace() pair, but it may have
extra overhead. Another feasible fix is to only disable preemption or
IRQ before invoking irq_work_queue() and enable preemption or IRQ after
the invocation completes, but it can't handle the case when
c->low_watermark is 1.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

authored by

Hou Tao and committed by
Alexei Starovoitov
566f6de3 1e4a6d97

+6 -1
+6 -1
kernel/bpf/memalloc.c
··· 732 732 } 733 733 } 734 734 local_dec(&c->active); 735 - local_irq_restore(flags); 736 735 737 736 WARN_ON(cnt < 0); 738 737 739 738 if (cnt < c->low_watermark) 740 739 irq_work_raise(c); 740 + /* Enable IRQ after the enqueue of irq work completes, so irq work 741 + * will run after IRQ is enabled and free_llist may be refilled by 742 + * irq work before other task preempts current task. 743 + */ 744 + local_irq_restore(flags); 745 + 741 746 return llnode; 742 747 } 743 748