Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

RDMA/hns: Do not destroy QP resources in the hw resetting phase

When hns_roce_v2_destroy_qp() is called, the driver's call sequence is, in
brief:

......
hns_roce_v2_destroy_qp
hns_roce_v2_qp_modify
hns_roce_cmd_mbox
hns_roce_qp_destroy

If hns_roce_cmd_mbox() detects a hardware reset while it is executing, the
driver cannot obtain a return value from the hardware, because the firmware
does not respond to the driver's mailbox during the hardware reset phase.

The driver must wait for the hardware reset to complete before it goes on
to execute hns_roce_qp_destroy(); otherwise the driver may release
resources that the hardware is still accessing. To fix this, HNS RoCE adds
code that waits for the hardware reset to complete.

The original interface, get_hw_reset_stat(), reports only the instantaneous
state of the hardware reset and cannot accurately reflect whether the reset
has completed, so it is replaced with the ae_dev_reset_cnt interface.
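
To illustrate why a single sample of an instantaneous flag is ambiguous
while a counter is not, here is a minimal userspace sketch. The struct
hw_model and all helper names are hypothetical stand-ins invented for this
example; they are not part of the hns driver or the hnae3 interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the hardware-side reset state (not driver code). */
struct hw_model {
	bool resetting;          /* instantaneous flag: true only while a reset runs */
	unsigned long reset_cnt; /* monotonic count of completed resets */
};

static void hw_reset_begin(struct hw_model *hw)
{
	hw->resetting = true;
}

static void hw_reset_end(struct hw_model *hw)
{
	hw->resetting = false;
	hw->reset_cnt++;
}

/* Flag-based check: reports only the state at the moment of sampling. */
static bool reset_in_progress(const struct hw_model *hw)
{
	return hw->resetting;
}

/* Counter-based check: true once a reset has completed since recorded_cnt. */
static bool reset_completed_since(const struct hw_model *hw,
				  unsigned long recorded_cnt)
{
	return hw->reset_cnt > recorded_cnt;
}

/* Walks through one reset and checks what each method reports. */
static void flag_vs_counter_demo(void)
{
	struct hw_model hw = { .resetting = false, .reset_cnt = 7 };
	unsigned long recorded_cnt = hw.reset_cnt;

	/* Before the reset: flag is clear, counter has not moved. */
	assert(!reset_in_progress(&hw));
	assert(!reset_completed_since(&hw, recorded_cnt));

	hw_reset_begin(&hw);
	assert(reset_in_progress(&hw));
	assert(!reset_completed_since(&hw, recorded_cnt));

	hw_reset_end(&hw);
	/* After the reset the flag reads exactly as it did before the reset,
	 * so one sample cannot tell "never reset" from "reset finished".
	 * The counter comparison can. */
	assert(!reset_in_progress(&hw));
	assert(reset_completed_since(&hw, recorded_cnt));
}
```

Calling flag_vs_counter_demo() runs the scenario; the last two assertions
are the case the commit message describes.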

The hardware reset is complete once the value returned by the
ae_dev_reset_cnt interface is greater than the reset_cnt value previously
recorded by the driver.
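
The wait can be pictured as a bounded polling loop. The following is a
userspace sketch only, loosely modelled on the poll-until-condition-or-
timeout pattern; it is not the kernel's read_poll_timeout() and not the
driver's code. The struct fake_dev, fake_reset_cnt() and wait_reset_done()
are hypothetical names, and elapsed time is simulated rather than slept.

```c
#include <assert.h>

/* Hypothetical stand-in for the device; not part of the hns driver. */
struct fake_dev {
	unsigned long hw_reset_cnt;     /* counter reported by the "hardware" */
	unsigned long polls_until_done; /* reads left before the reset ends; 0 = never */
};

/* Plays the role of reading the reset counter from the device. */
static unsigned long fake_reset_cnt(struct fake_dev *dev)
{
	if (dev->polls_until_done > 0 && --dev->polls_until_done == 0)
		dev->hw_reset_cnt++;
	return dev->hw_reset_cnt;
}

/*
 * Poll until the counter exceeds recorded_cnt or the time budget is spent.
 * Returns 0 when the reset is seen to complete, -1 on timeout.
 */
static int wait_reset_done(struct fake_dev *dev, unsigned long recorded_cnt,
			   unsigned long sleep_us, unsigned long timeout_us)
{
	unsigned long waited_us = 0;

	for (;;) {
		if (fake_reset_cnt(dev) > recorded_cnt)
			return 0;
		if (waited_us >= timeout_us)
			return -1;
		/* A real driver would sleep here; this sketch only accounts
		 * for the time. A zero interval still advances by 1 us so
		 * the loop always terminates. */
		waited_us += sleep_us ? sleep_us : 1;
	}
}

/* One device that finishes its reset, and one that never does. */
static void wait_reset_done_demo(void)
{
	struct fake_dev finishing = { .hw_reset_cnt = 5, .polls_until_done = 3 };
	struct fake_dev stuck = { .hw_reset_cnt = 5, .polls_until_done = 0 };

	assert(wait_reset_done(&finishing, 5, 1000, 1000000) == 0);
	assert(finishing.hw_reset_cnt == 6);
	assert(wait_reset_done(&stuck, 5, 1000, 3000) == -1);
}
```

The 1000 us interval and 1000000 us budget in the demo match the
HW_RESET_SLEEP_US and HW_RESET_TIMEOUT_US values defined by this patch.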

Fixes: 6a04aed6afae ("RDMA/hns: Fix the chip hanging caused by sending mailbox&CMQ during reset")
Link: https://lore.kernel.org/r/20211123142402.26936-1-liangwenpeng@huawei.com
Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Authored by Yangyang Li and committed by Jason Gunthorpe
b0969f83 52414e27

+11 -1
drivers/infiniband/hw/hns/hns_roce_hw_v2.c
···
 #include <linux/acpi.h>
 #include <linux/etherdevice.h>
 #include <linux/interrupt.h>
+#include <linux/iopoll.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <net/addrconf.h>
···
 				  unsigned long instance_stage,
 				  unsigned long reset_stage)
 {
+#define HW_RESET_TIMEOUT_US 1000000
+#define HW_RESET_SLEEP_US 1000
+
 	struct hns_roce_v2_priv *priv = hr_dev->priv;
 	struct hnae3_handle *handle = priv->handle;
 	const struct hnae3_ae_ops *ops = handle->ae_algo->ops;
+	unsigned long val;
+	int ret;
 
 	/* When hardware reset is detected, we should stop sending mailbox&cmq&
 	 * doorbell to hardware. If now in .init_instance() function, we should
···
 	 * again.
 	 */
 	hr_dev->dis_db = true;
-	if (!ops->get_hw_reset_stat(handle))
+
+	ret = read_poll_timeout(ops->ae_dev_reset_cnt, val,
+				val > hr_dev->reset_cnt, HW_RESET_SLEEP_US,
+				HW_RESET_TIMEOUT_US, false, handle);
+	if (!ret)
 		hr_dev->is_reset = true;
 
 	if (!hr_dev->is_reset || reset_stage == HNS_ROCE_STATE_RST_INIT ||