Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

scsi: scsi_error: count medium access timeout only once per EH run

The medium access timeout counter is currently increased for
each failed command, so if enough commands fail we hit the
medium access timeout limit even for a single device failure,
and the following kernel message is displayed:

sd H:C:T:L: [sdXY] Medium access timeout failure. Offlining disk!

Fix this by counting the timeout per EH run, i.e. the counter is
only increased once per device and EH run.

Fixes: 18a4d0a ("[SCSI] Handle disk devices which can not process medium access commands")
Cc: Ewan Milne <emilne@redhat.com>
Cc: Lawrence Obermann <loberman@redhat.com>
Cc: Benjamin Block <bblock@linux.vnet.ibm.com>
Cc: Steffen Maier <maier@linux.vnet.ibm.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Authored by Hannes Reinecke, committed by Martin K. Petersen
7a38dc0b 104d9c7f

+46 -1
+18
drivers/scsi/scsi_error.c
···
 }
 
 /**
+ * scsi_eh_reset - call into ->eh_action to reset internal counters
+ * @scmd: scmd to run eh on.
+ *
+ * The scsi driver might be carrying internal state about the
+ * devices, so we need to call into the driver to reset the
+ * internal state once the error handler is started.
+ */
+static void scsi_eh_reset(struct scsi_cmnd *scmd)
+{
+	if (!blk_rq_is_passthrough(scmd->request)) {
+		struct scsi_driver *sdrv = scsi_cmd_to_driver(scmd);
+		if (sdrv->eh_reset)
+			sdrv->eh_reset(scmd);
+	}
+}
+
+/**
  * scsi_eh_scmd_add - add scsi cmd to error handling.
  * @scmd: scmd to run eh on.
  * @eh_flag: optional SCSI_EH flag.
···
 	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
 		eh_flag &= ~SCSI_EH_CANCEL_CMD;
 	scmd->eh_eflags |= eh_flag;
+	scsi_eh_reset(scmd);
 	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
 	shost->host_failed++;
 	scsi_eh_wakeup(shost);
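The dispatch in scsi_eh_reset() can be tried outside the kernel with a small user-space model. This is only a sketch: `model_cmd`, `model_driver`, `model_eh_reset` and `count_reset` are hypothetical names, and the `passthrough` field stands in for blk_rq_is_passthrough(). It shows the hook being skipped for passthrough commands and for drivers that do not provide one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_driver;

/* Hypothetical stand-in for struct scsi_cmnd. */
struct model_cmd {
	bool passthrough;		/* models blk_rq_is_passthrough() */
	const struct model_driver *drv;	/* models scsi_cmd_to_driver() */
	int resets_seen;		/* bookkeeping for this example only */
};

/* Hypothetical stand-in for struct scsi_driver. */
struct model_driver {
	void (*eh_reset)(struct model_cmd *cmd);	/* optional hook */
};

/* Mirrors the shape of scsi_eh_reset(): only non-passthrough commands
 * reach the driver, and only if the driver supplies the hook. */
static void model_eh_reset(struct model_cmd *cmd)
{
	assert(cmd != NULL);
	if (!cmd->passthrough) {
		const struct model_driver *drv = cmd->drv;

		if (drv->eh_reset)
			drv->eh_reset(cmd);
	}
}

/* Example hook: records that it was called. */
static void count_reset(struct model_cmd *cmd)
{
	cmd->resets_seen++;
}
```

A command whose driver sets `.eh_reset = count_reset` ends up with `resets_seen == 1` after one call, while a passthrough command or a driver without the hook leaves it at 0.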
+26 -1
drivers/scsi/sd.c
···
 static int sd_init_command(struct scsi_cmnd *SCpnt);
 static void sd_uninit_command(struct scsi_cmnd *SCpnt);
 static int sd_done(struct scsi_cmnd *);
+static void sd_eh_reset(struct scsi_cmnd *);
 static int sd_eh_action(struct scsi_cmnd *, int);
 static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer);
 static void scsi_disk_release(struct device *cdev);
···
 	.uninit_command = sd_uninit_command,
 	.done = sd_done,
 	.eh_action = sd_eh_action,
+	.eh_reset = sd_eh_reset,
 };
 
 /*
···
 };
 
 /**
+ * sd_eh_reset - reset error handling callback
+ * @scmd: sd-issued command that has failed
+ *
+ * This function is called by the SCSI midlayer before starting
+ * SCSI EH. When counting medium access failures we have to be
+ * careful to register it only only once per device and SCSI EH run;
+ * there might be several timed out commands which will cause the
+ * 'max_medium_access_timeouts' counter to trigger after the first
+ * SCSI EH run already and set the device to offline.
+ * So this function resets the internal counter before starting SCSI EH.
+ **/
+static void sd_eh_reset(struct scsi_cmnd *scmd)
+{
+	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
+
+	/* New SCSI EH run, reset gate variable */
+	sdkp->ignore_medium_access_errors = false;
+}
+
+/**
  * sd_eh_action - error handling callback
  * @scmd: sd-issued command that has failed
  * @eh_disp: The recovery disposition suggested by the midlayer
···
 	 * process of recovering or has it suffered an internal failure
 	 * that prevents access to the storage medium.
 	 */
-	sdkp->medium_access_timed_out++;
+	if (!sdkp->ignore_medium_access_errors) {
+		sdkp->medium_access_timed_out++;
+		sdkp->ignore_medium_access_errors = true;
+	}
 
 	/*
 	 * If the device keeps failing read/write commands but TEST UNIT
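The effect of the gate variable can be illustrated with a user-space model. This is a simplified sketch, not the sd driver itself: `model_disk`, `model_sd_eh_reset`, `model_sd_eh_action` and `model_eh_run` are hypothetical names, the offline check is reduced to a single comparison, and an EH run is assumed to queue every failed command before any of them is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the relevant struct scsi_disk fields. */
struct model_disk {
	unsigned int medium_access_timed_out;
	unsigned int max_medium_access_timeouts;	/* illustrative threshold */
	bool ignore_medium_access_errors;		/* gate added by the patch */
	bool offline;					/* models "Offlining disk!" */
};

/* Models sd_eh_reset(): a command is being queued for a new EH run. */
static void model_sd_eh_reset(struct model_disk *d)
{
	d->ignore_medium_access_errors = false;
}

/* Models the gated increment in sd_eh_action() plus a simplified
 * (assumed) offline check against the threshold. */
static void model_sd_eh_action(struct model_disk *d)
{
	if (!d->ignore_medium_access_errors) {
		d->medium_access_timed_out++;
		d->ignore_medium_access_errors = true;
	}
	if (d->medium_access_timed_out >= d->max_medium_access_timeouts)
		d->offline = true;
}

/* One simplified EH run: all failed commands are queued, then handled. */
static void model_eh_run(struct model_disk *d, unsigned int failed_cmds)
{
	unsigned int i;

	assert(d != NULL);
	for (i = 0; i < failed_cmds; i++)
		model_sd_eh_reset(d);
	for (i = 0; i < failed_cmds; i++)
		model_sd_eh_action(d);
}
```

With a threshold of 2, one run with eight failed commands leaves the counter at 1 and the disk online; a second such run raises it to 2 and marks the disk offline. With the old unconditional increment, the first run alone would have pushed the counter to 8.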
+1
drivers/scsi/sd.h
···
 	unsigned rc_basis: 2;
 	unsigned zoned: 2;
 	unsigned urswrz : 1;
+	unsigned ignore_medium_access_errors : 1;
 };
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,dev)
+1
include/scsi/scsi_driver.h
···
 	void (*uninit_command)(struct scsi_cmnd *);
 	int (*done)(struct scsi_cmnd *);
 	int (*eh_action)(struct scsi_cmnd *, int);
+	void (*eh_reset)(struct scsi_cmnd *);
 };
 #define to_scsi_driver(drv) \
 	container_of((drv), struct scsi_driver, gendrv)
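Adding a member to a callback table does not by itself require touching every user of that table: in C, members omitted from a designated initializer are zero-initialized, so an unset function pointer is NULL. The sketch below uses hypothetical names (`model_driver`, `legacy_driver`, `legacy_eh_action`, `driver_has_eh_reset`) to show a driver definition that predates the new hook and therefore leaves it NULL.

```c
#include <assert.h>
#include <stddef.h>

struct model_cmd;	/* opaque; hypothetical stand-in for struct scsi_cmnd */

/* Hypothetical, reduced stand-in for struct scsi_driver. */
struct model_driver {
	int (*eh_action)(struct model_cmd *, int);
	void (*eh_reset)(struct model_cmd *);	/* newly added, optional */
};

/* Example callback: returns the disposition it was given. */
static int legacy_eh_action(struct model_cmd *cmd, int disp)
{
	(void)cmd;
	return disp;
}

/* A driver written before eh_reset existed: the member is not named,
 * so the language guarantees it is initialized to a null pointer. */
static const struct model_driver legacy_driver = {
	.eh_action = legacy_eh_action,
};

/* Callers can test for the hook before using it. */
static int driver_has_eh_reset(const struct model_driver *drv)
{
	assert(drv != NULL);
	return drv->eh_reset != NULL;
}
```

Here `driver_has_eh_reset(&legacy_driver)` returns 0, which is the case a NULL check before the call is meant to handle.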