Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

nvme-tcp: fix premature queue removal and I/O failover

This patch addresses a data corruption issue observed in nvme-tcp during
testing.

In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC's TX queue). From the
user-space application's perspective, that I/O is considered completed
once the retry is acknowledged on the other path, so the buffer
associated with it may be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.

We can reproduce this data corruption by injecting delay on one path and
triggering an I/O timeout.

To prevent this issue, this change ensures that all inflight
transmissions are fully completed from the host's perspective before
returning from queue stop. To handle concurrent I/O timeouts from
multiple namespaces under the same controller, always wait in queue stop
regardless of the queue's state.

This aligns with the behavior of queue stopping in other NVMe fabric
transports.

Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Authored by Michael Liang, committed by Christoph Hellwig
77e40bbc ab35ad95

+29 -2
drivers/nvme/host/tcp.c
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1946,7 +1946,7 @@
 	cancel_work_sync(&queue->io_work);
 }
 
-static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+static void nvme_tcp_stop_queue_nowait(struct nvme_ctrl *nctrl, int qid)
 {
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
@@ -1964,6 +1964,31 @@
 	queue->tls_enabled = false;
 	mutex_unlock(&queue->queue_lock);
 }
+
+static void nvme_tcp_wait_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
+	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
+	int timeout = 100;
+
+	while (timeout > 0) {
+		if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags) ||
+		    !sk_wmem_alloc_get(queue->sock->sk))
+			return;
+		msleep(2);
+		timeout -= 2;
+	}
+	dev_warn(nctrl->device,
+		 "qid %d: timeout draining sock wmem allocation expired\n",
+		 qid);
+}
+
+static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
+{
+	nvme_tcp_stop_queue_nowait(nctrl, qid);
+	nvme_tcp_wait_queue(nctrl, qid);
+}
+
 
 static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
 {
@@ -2032,5 +2057,7 @@
 	int i;
 
 	for (i = 1; i < ctrl->queue_count; i++)
-		nvme_tcp_stop_queue(ctrl, i);
+		nvme_tcp_stop_queue_nowait(ctrl, i);
+	for (i = 1; i < ctrl->queue_count; i++)
+		nvme_tcp_wait_queue(ctrl, i);
 }