Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

scsi: mpt3sas: Irq poll to avoid CPU hard lockups

Issue Description:
We have seen CPU lockup issues in the field on systems with more than 96
logical CPUs. The SAS3.0 controller (Invader series) supports at most 96
MSI-X vectors and the SAS3.5 product (Ventura) supports at most 128 MSI-X
vectors.

This may be a generic issue (for any PCI device that supports completion
on multiple reply queues). It is explained here with respect to the
hardware supported by mpt3sas, to simplify the problem and the possible
changes to handle it. The IT HBA (mpt3sas) supports multiple reply queues
in the completion path. The driver creates MSI-X vectors for the
controller as "min of (FW supported reply queues, logical CPUs)". If the
submitter is not interrupted via completion on the same CPU, a loop can
form in the IO path. This behavior can cause hard/soft CPU lockups, IO
timeouts, system sluggishness, etc.
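
The vector-count rule quoted above can be sketched in a few lines of C.
This is only an illustration: msix_vector_count() is a hypothetical helper
name and is not taken from the driver.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the driver): number of MSI-X vectors
 * created, following the "min of (FW supported reply queues, logical
 * CPUs)" rule described in the commit message.
 */
static unsigned int msix_vector_count(unsigned int fw_reply_queues,
				      unsigned int logical_cpus)
{
	assert(fw_reply_queues > 0 && logical_cpus > 0);
	return fw_reply_queues < logical_cpus ? fw_reply_queues : logical_cpus;
}
```

With 96 firmware reply queues and 128 logical CPUs this yields 96 vectors,
so several CPUs have to share one reply queue.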

Example: one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding reply descriptors from
the reply descriptor queue upon receiving interrupts from the HBA. If
CPU A pumps IOs continuously, CPU B (which is executing the ISR) will
always see valid reply descriptors in the reply descriptor queue and will
keep processing them in a loop without leaving the ISR handler.

The mpt3sas driver exits the ISR handler when it finds an unused reply
descriptor in the reply descriptor queue. Since CPU A sends IOs
continuously, CPU B may always see a valid reply descriptor (posted by the
HBA firmware after processing the IO) in the reply descriptor queue. In
the worst case, the driver never leaves this loop in the ISR handler.
Eventually, the watchdog detects a CPU lockup.
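
The exit condition described above can be modelled in user space. The
sketch below is a simplified stand-in, not driver code: struct reply_ring,
drain_until_unused() and busy_submitter() are hypothetical names, and the
max_iters cap exists only so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8

/* Simplified stand-in for a reply descriptor queue (not the driver layout). */
struct reply_ring {
	bool valid[RING_SIZE];	/* true = descriptor posted by "firmware" */
	size_t head;		/* next slot the "ISR" will examine */
};

/* Hypothetical hook that lets a submitter post more replies mid-drain. */
typedef void (*refill_fn)(struct reply_ring *ring);

/*
 * Drain until an unused descriptor is seen, which is the only exit
 * condition of the loop described above. max_iters is a safety cap for
 * this model only. Returns the number of descriptors processed.
 */
static size_t drain_until_unused(struct reply_ring *ring, refill_fn refill,
				 size_t max_iters)
{
	size_t done = 0;

	assert(ring != NULL);
	while (done < max_iters && ring->valid[ring->head]) {
		ring->valid[ring->head] = false;	/* consume descriptor */
		ring->head = (ring->head + 1) % RING_SIZE;
		done++;
		if (refill)
			refill(ring);	/* "CPU A" keeps submitting */
	}
	return done;
}

/* Submitter that always keeps the slot under the head valid. */
static void busy_submitter(struct reply_ring *ring)
{
	ring->valid[ring->head] = true;
}
```

Without a refill hook the loop stops at the first unused slot. With
busy_submitter() it runs until the artificial cap, which is the situation
the watchdog eventually reports as a lockup.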

The behavior described above is uncommon if "rq_affinity" is set to 2 or
if affinity_hint is honored by irqbalance as "exact". If rq_affinity is
set to 2, the submitter is always interrupted via completion on the same
CPU. If irqbalance uses the "exact" policy, the interrupt is delivered to
the submitter CPU.

If the ratio of CPU count to MSI-X vector (reply descriptor queue) count
is not 1:1, we are still exposed to the issue explained above, and for
that we do not have any solution.

Exposure to soft/hard lockups if the CPU count is more than the number of
MSI-X vectors supported by the device:

If the ratio of CPU count to MSI-X vector count is not 1:1 (in other
words, it is X:1 with X > 1), then the 'exact' irqbalance policy or
rq_affinity = 2 won't help to avoid CPU hard/soft lockups. There is no
one-to-one mapping between CPUs and MSI-X vectors; instead, one MSI-X
interrupt (or reply descriptor queue) is shared by a group of CPUs, so a
loop can form in the IO path within that CPU group and lockups may be
observed.

For example: consider a system with two NUMA nodes, each having four
logical CPUs, and an HBA with two MSI-X vectors enabled. The ratio of CPU
count to MSI-X vector count is then 4:1, e.g. MSI-X vector 0 has affinity
to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and MSI-X vector 1 has
affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7 -->MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB
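
The 4:1 layout in this example can be written down as a small mapping.
This is a sketch under the assumption that CPUs are split into equal,
contiguous groups with one group per vector; cpus_per_vector() and
vector_for_cpu() are hypothetical names, and real affinity spreading in
the kernel is more involved.

```c
#include <assert.h>

/*
 * Assumed model: CPUs are divided into equal contiguous groups, one group
 * per MSI-X vector. Returns X in the "X:1" ratio from the commit message.
 */
static unsigned int cpus_per_vector(unsigned int nr_cpus,
				    unsigned int nr_vectors)
{
	assert(nr_vectors > 0 && nr_cpus % nr_vectors == 0);
	return nr_cpus / nr_vectors;
}

/* MSI-X vector whose affinity group contains the given CPU. */
static unsigned int vector_for_cpu(unsigned int cpu, unsigned int nr_cpus,
				   unsigned int nr_vectors)
{
	assert(cpu < nr_cpus);
	return cpu / cpus_per_vector(nr_cpus, nr_vectors);
}
```

For 8 CPUs and 2 vectors, cpus_per_vector() gives 4, CPUs 0 to 3 map to
vector 0 and CPUs 4 to 7 map to vector 1, matching the numactl listing.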

Assume that the user starts an application which uses all the CPUs of
NUMA node 0 for issuing IOs. Only one CPU from the affinity list, say
CPU 0 (it can be any CPU, since this depends on irqbalance), will receive
the interrupts from MSI-X vector 0 for all the IOs. Over time, CPU 0's IO
submission percentage decreases and its ISR processing percentage
increases, as it becomes busier processing interrupts. Gradually the IO
submission percentage on CPU 0 drops to zero and its ISR processing
percentage reaches 100 percent, because an IO loop has formed within NUMA
node 0: CPU 1, CPU 2 & CPU 3 are continuously busy submitting heavy IO
and only CPU 0 is busy in the ISR path, as it always finds a valid reply
descriptor in the reply descriptor queue. Eventually, we observe a hard
lockup here.

The chance of hard/soft lockups occurring increases with the value of X.
If X is high, the chance of observing CPU lockups is high.

Solution: Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
driver will execute the ISR routine in softirq context and will always
leave the loop based on the budget provided to the IRQ poll interface.
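
The budget idea can be illustrated with a user-space model. The sketch
below is not the driver's implementation: struct poll_ring and poll_once()
are hypothetical names, and the irq_enabled flag only stands in for the
disable_irq()/enable_irq() calls made by the real callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POLL_RING_SIZE 8

/* Simplified reply queue plus a flag modelling the IRQ line state. */
struct poll_ring {
	bool valid[POLL_RING_SIZE];	/* true = descriptor waiting */
	size_t head;			/* next slot to examine */
	bool irq_enabled;		/* stands in for enable/disable_irq() */
};

/*
 * One poll pass: handle at most `budget` descriptors and then return, so
 * the caller always regains control. If fewer than `budget` descriptors
 * were found, the queue is treated as idle and the interrupt is
 * re-enabled; otherwise polling is expected to be called again.
 */
static int poll_once(struct poll_ring *ring, int budget)
{
	int done = 0;

	assert(ring != NULL && budget > 0);
	ring->irq_enabled = false;	/* keep the line off while polling */
	while (done < budget && ring->valid[ring->head]) {
		ring->valid[ring->head] = false;
		ring->head = (ring->head + 1) % POLL_RING_SIZE;
		done++;
	}
	if (done < budget)
		ring->irq_enabled = true;	/* drained: back to interrupts */
	return done;
}
```

Unlike a loop that exits only on an unused descriptor, each pass here is
bounded by the budget regardless of how fast new replies arrive.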

In these scenarios (i.e. where the ratio of CPU count to MSI-X vector
count is X:1 with X > 1), the IRQ poll interface avoids CPU hard lockups
through voluntary exit from reply queue processing based on the budget.
Note: only one MSI-X vector is busy doing the processing.

Irqstat output:

IRQs / 1 second(s)
IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
  44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
  45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1

We use this approach only if the CPU count is more than the number of
MSI-X vectors supported by the FW.

Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Authored by Suganath Prabu and committed by Martin K. Petersen
320e77ac 233af108

+85 -2
+1
drivers/scsi/mpt3sas/Kconfig
···
 	depends on PCI && SCSI
 	select SCSI_SAS_ATTRS
 	select RAID_ATTRS
+	select IRQ_POLL
 	---help---
 	  This driver supports PCI-Express SAS 12Gb/s Host Adapters.
 
+76 -2
drivers/scsi/mpt3sas/mpt3sas_base.c
···
 					MPI2_RPHI_MSIX_INDEX_SHIFT),
 					&ioc->chip->ReplyPostHostIndex);
 			}
-			completed_cmds = 1;
+			if (!reply_q->irq_poll_scheduled) {
+				reply_q->irq_poll_scheduled = true;
+				irq_poll_sched(&reply_q->irqpoll);
+			}
+			atomic_dec(&reply_q->busy);
+			return completed_cmds;
 		}
 		if (request_descript_type == MPI2_RPY_DESCRIPT_FLAGS_UNUSED)
 			goto out;
···
 
 	if (ioc->mask_interrupts)
 		return IRQ_NONE;
-
+	if (reply_q->irq_poll_scheduled)
+		return IRQ_HANDLED;
 	return ((_base_process_reply_queue(reply_q) > 0) ?
 			IRQ_HANDLED : IRQ_NONE);
+}
+
+/**
+ * _base_irqpoll - IRQ poll callback handler
+ * @irqpoll - irq_poll object
+ * @budget - irq poll weight
+ *
+ * returns number of reply descriptors processed
+ */
+static int
+_base_irqpoll(struct irq_poll *irqpoll, int budget)
+{
+	struct adapter_reply_queue *reply_q;
+	int num_entries = 0;
+
+	reply_q = container_of(irqpoll, struct adapter_reply_queue,
+			irqpoll);
+	if (reply_q->irq_line_enable) {
+		disable_irq(reply_q->os_irq);
+		reply_q->irq_line_enable = false;
+	}
+	num_entries = _base_process_reply_queue(reply_q);
+	if (num_entries < budget) {
+		irq_poll_complete(irqpoll);
+		reply_q->irq_poll_scheduled = false;
+		reply_q->irq_line_enable = true;
+		enable_irq(reply_q->os_irq);
+	}
+
+	return num_entries;
+}
+
+/**
+ * _base_init_irqpolls - initliaze IRQ polls
+ * @ioc: per adapter object
+ *
+ * returns nothing
+ */
+static void
+_base_init_irqpolls(struct MPT3SAS_ADAPTER *ioc)
+{
+	struct adapter_reply_queue *reply_q, *next;
+
+	if (list_empty(&ioc->reply_queue_list))
+		return;
+
+	list_for_each_entry_safe(reply_q, next, &ioc->reply_queue_list, list) {
+		irq_poll_init(&reply_q->irqpoll,
+			ioc->hba_queue_depth/4, _base_irqpoll);
+		reply_q->irq_poll_scheduled = false;
+		reply_q->irq_line_enable = true;
+		reply_q->os_irq = pci_irq_vector(ioc->pdev,
+			reply_q->msix_index);
+	}
 }
 
 /**
···
 		/* TMs are on msix_index == 0 */
 		if (reply_q->msix_index == 0)
 			continue;
+		if (reply_q->irq_poll_scheduled) {
+			/* Calling irq_poll_disable will wait for any pending
+			 * callbacks to have completed.
+			 */
+			irq_poll_disable(&reply_q->irqpoll);
+			irq_poll_enable(&reply_q->irqpoll);
+			reply_q->irq_poll_scheduled = false;
+			reply_q->irq_line_enable = true;
+			enable_irq(reply_q->os_irq);
+			continue;
+		}
 		synchronize_irq(pci_irq_vector(ioc->pdev, reply_q->msix_index));
 	}
 }
···
 	if (r)
 		goto out_fail;
 
+	if (!ioc->is_driver_loading)
+		_base_init_irqpolls(ioc);
 	/* Use the Combined reply queue feature only for SAS3 C0 & higher
 	 * revision HBAs and also only when reply queue count is greater than 8
 	 */
···
 	if (r)
 		goto out_free_resources;
 
+	_base_init_irqpolls(ioc);
 	init_waitqueue_head(&ioc->reset_wq);
 
 	/* allocate memory pd handle bitmask list */
+8
drivers/scsi/mpt3sas/mpt3sas_base.h
···
 #include <scsi/scsi_eh.h>
 #include <linux/pci.h>
 #include <linux/poll.h>
+#include <linux/irq_poll.h>
 
 #include "mpt3sas_debug.h"
 #include "mpt3sas_trigger_diag.h"
···
  * @reply_post_free: reply post base virt address
  * @name: the name registered to request_irq()
  * @busy: isr is actively processing replies on another cpu
+ * @os_irq: irq number
+ * @irqpoll: irq_poll object
+ * @irq_poll_scheduled: Tells whether irq poll is scheduled or not
  * @list: this list
  */
 struct adapter_reply_queue {
···
 	Mpi2ReplyDescriptorsUnion_t *reply_post_free;
 	char			name[MPT_NAME_LENGTH];
 	atomic_t		busy;
+	u32			os_irq;
+	struct irq_poll		irqpoll;
+	bool			irq_poll_scheduled;
+	bool			irq_line_enable;
 	struct list_head	list;
 };
 