Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

RDMA/addr: Fix race with netevent_callback()/rdma_addr_cancel()

This three-thread race can result in the work being run after the callback
has been set to NULL:

CPU1                    CPU2                          CPU3
netevent_callback()
                        process_one_req()             rdma_addr_cancel()
                        [..]
spin_lock_bh()
  set_timeout()
spin_unlock_bh()

                                                      spin_lock_bh()
                                                      list_del_init(&req->list);
                                                      spin_unlock_bh()

                        req->callback = NULL
                        spin_lock_bh()
                        if (!list_empty(&req->list))
                          // Skipped!
                          // cancel_delayed_work(&req->work);
                        spin_unlock_bh()

                        process_one_req() // again
                        req->callback() // BOOM
                                                      cancel_delayed_work_sync()

The solution is to always cancel the work once it is completed, so that any
set_timeout() that lands in between does not result in it running again.

Cc: stable@vger.kernel.org
Fixes: 44e75052bc2a ("RDMA/rdma_cm: Make rdma_addr_cancel into a fence")
Link: https://lore.kernel.org/r/20200930072007.1009692-1-leon@kernel.org
Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

+5 -6
drivers/infiniband/core/addr.c
@@ -647,13 +647,12 @@
 	req->callback = NULL;
 
 	spin_lock_bh(&lock);
+	/*
+	 * Although the work will normally have been canceled by the workqueue,
+	 * it can still be requeued as long as it is on the req_list.
+	 */
+	cancel_delayed_work(&req->work);
 	if (!list_empty(&req->list)) {
-		/*
-		 * Although the work will normally have been canceled by the
-		 * workqueue, it can still be requeued as long as it is on the
-		 * req_list.
-		 */
-		cancel_delayed_work(&req->work);
 		list_del_init(&req->list);
 		kfree(req);
 	}
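
The interleaving from the commit message can be replayed deterministically in userspace to see why moving the cancel out of the list_empty() check matters. The following is only a minimal sketch, not kernel code: `struct fake_req`, `fake_set_timeout()`, `fake_cancel_delist()`, `fake_complete()` and `replay_race()` are hypothetical names, and the three booleans are assumed stand-ins for list membership, a queued delayed work item, and a non-NULL `req->callback`. Locking and kfree() are left out because the steps run one after another in a single thread.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct addr_req; not the kernel type. */
struct fake_req {
	bool on_list;      /* models !list_empty(&req->list) */
	bool work_pending; /* models a queued delayed work item */
	bool has_callback; /* models req->callback != NULL */
};

/* CPU1: netevent_callback() walks req_list and calls set_timeout(),
 * which requeues the work of every request still on the list. */
static void fake_set_timeout(struct fake_req *req)
{
	if (req->on_list)
		req->work_pending = true;
}

/* CPU3: first half of rdma_addr_cancel(), which takes ownership of the
 * request by removing it from the list. */
static void fake_cancel_delist(struct fake_req *req)
{
	req->on_list = false;
}

/* CPU2: tail of process_one_req(), after the callback has been invoked.
 * fixed == false models the old code, fixed == true the patched code. */
static void fake_complete(struct fake_req *req, bool fixed)
{
	req->has_callback = false;         /* req->callback = NULL */

	if (fixed)
		req->work_pending = false; /* unconditional cancel_delayed_work() */

	if (req->on_list) {
		if (!fixed)
			req->work_pending = false; /* old: cancel only if listed */
		req->on_list = false;      /* list_del_init(); kfree() omitted */
	}
}

/* Replays CPU1 -> CPU3 -> CPU2 in the order shown in the commit message.
 * Returns true if the work is left queued with no callback to call. */
static bool replay_race(bool fixed)
{
	struct fake_req req = {
		.on_list = true,       /* request is still on req_list */
		.work_pending = false, /* work is running on CPU2, not queued */
		.has_callback = true,
	};

	fake_set_timeout(&req);     /* CPU1 requeues the work */
	fake_cancel_delist(&req);   /* CPU3 removes the request from the list */
	fake_complete(&req, fixed); /* CPU2 finishes process_one_req() */

	return req.work_pending && !req.has_callback;
}
```

Under these assumptions, `replay_race(false)` returns true: the requeued work survives because the cancel was skipped along with the list_empty() branch. `replay_race(true)` returns false, because the cancel no longer depends on the request still being on the list.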