Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

netpoll: Close race condition between poll_one_napi and napi_disable

Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0 CPU1
ndo_tx_timeout napi_poll_dev
napi_disable poll_one_napi
test_and_set_bit (ret 0)
test_bit (ret 1)
reset adapter napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do. The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable. The disabled napi state is actually identical to the scheduled
state for a given napi instance. The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access. In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy. netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit. That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
Remove a trailing whtiespace
Resubmit with proper subject prefix

V3)
Clean up spacing nits

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>

authored by

Neil Horman and committed by
David S. Miller
2d8bff12 675ee231

+11 -2
+1
include/linux/netdevice.h
··· 507 507 BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state)); 508 508 smp_mb__before_atomic(); 509 509 clear_bit(NAPI_STATE_SCHED, &n->state); 510 + clear_bit(NAPI_STATE_NPSVC, &n->state); 510 511 } 511 512 512 513 #ifdef CONFIG_SMP
+2
net/core/dev.c
··· 4713 4713 4714 4714 while (test_and_set_bit(NAPI_STATE_SCHED, &n->state)) 4715 4715 msleep(1); 4716 + while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state)) 4717 + msleep(1); 4716 4718 4717 4719 hrtimer_cancel(&n->timer); 4718 4720
+8 -2
net/core/netpoll.c
··· 142 142 */ 143 143 static int poll_one_napi(struct napi_struct *napi, int budget) 144 144 { 145 - int work; 145 + int work = 0; 146 146 147 147 /* net_rx_action's ->poll() invocations and our's are 148 148 * synchronized by this test which is only made while ··· 151 151 if (!test_bit(NAPI_STATE_SCHED, &napi->state)) 152 152 return budget; 153 153 154 - set_bit(NAPI_STATE_NPSVC, &napi->state); 154 + /* If we set this bit but see that it has already been set, 155 + * that indicates that napi has been disabled and we need 156 + * to abort this operation 157 + */ 158 + if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state)) 159 + goto out; 155 160 156 161 work = napi->poll(napi, budget); 157 162 WARN_ONCE(work > budget, "%pF exceeded budget in poll\n", napi->poll); ··· 164 159 165 160 clear_bit(NAPI_STATE_NPSVC, &napi->state); 166 161 162 + out: 167 163 return budget - work; 168 164 } 169 165