Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

poll: add poll_requested_events() and poll_does_not_wait() functions

In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem among others.

Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask, any following poll() calls are passed a NULL
poll_table pointer.

Also, the eventpoll implementation always left the key field at ~0 instead
of using the requested events mask.

This was changed in eventpoll.c so the key field now contains the actual
events that should be polled for as set by the caller.

The solution to the NULL poll_table pointer is to set the qproc field to
NULL in poll_table once poll() matches the events, not the poll_table
pointer itself. That way drivers can obtain the mask through a new
poll_requested_events inline.

The poll_table_struct pointer itself can still be NULL, since some kernel
code calls f_op->poll() internally (netfs_state_poll() in
./drivers/staging/pohmelfs/netfs.h). In that case poll_requested_events()
returns ~0 (i.e. all events).

Very rarely drivers might want to know whether poll_wait will actually
wait. If another earlier file descriptor in the set already matched the
events the caller wanted to wait for, then the kernel will return from the
select() call without waiting. This might be useful information in order
to avoid doing expensive work.

A new helper function poll_does_not_wait() is added that drivers can use
to detect this situation. This is now used in sock_poll_wait() in
include/net/sock.h. This was the only place in the kernel that needed
this information.

Drivers should no longer access any of the poll_table internals, but use
the poll_requested_events() and poll_does_not_wait() access functions
instead. In order to enforce that, the poll_table fields are now prefixed
with an underscore, and a comment was added warning against using them
directly.

This required a change in unix_dgram_poll() in net/unix/af_unix.c, which
used the key field to get the requested events. It's been replaced by a
call to poll_requested_events().

Renaming qproc was especially important because this patch changes the
behavior of that field: the function pointer can now be NULL, which
wasn't possible in the past.

Any driver accessing the qproc or key fields directly will now fail to compile.

Some notes regarding the correctness of this patch: the driver's poll()
function is called with a 'struct poll_table_struct *wait' argument. This
pointer may or may not be NULL; drivers can never rely on it being one or
the other, as that depends on whether or not an earlier file descriptor in
the select()'s fdset matched the requested events.

There are only three things a driver can do with the wait argument:

1) obtain the key field:

events = wait ? wait->key : ~0;

This will still work, although it should be replaced with the new
poll_requested_events() function (which does exactly the same).
It will now even work better, since wait is no longer set to NULL
unnecessarily.

2) use the qproc callback. This could be deadly since qproc can now be
NULL. Renaming qproc should prevent this from happening. There are no
kernel drivers that actually access this callback directly, BTW.

3) test whether wait == NULL to determine whether poll would return without
waiting. This is no longer sufficient as the correct test is now
wait == NULL || wait->_qproc == NULL.

However, the worst that can happen here is a slight performance hit in
the case where wait != NULL and wait->_qproc == NULL. In that case the
driver will assume that poll_wait() will actually add the fd to the set
of waiting file descriptors. Of course, poll_wait() will not do that
since it tests for wait->_qproc. This will not break anything, though.

There is only one place in the whole kernel where this happens
(sock_poll_wait() in include/net/sock.h) and that code will be replaced
by a call to poll_does_not_wait() in the next patch.

Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
actually waiting. The next file descriptor from the set might match the
event mask and thus any possible waits will never happen.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Reviewed-by: Jonathan Corbet <corbet@lwn.net>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Hans Verkuil, committed by Linus Torvalds
626cf236 5cde7656

5 files changed: +66 -33

fs/eventpoll.c | +15 -3

···
 		       void *priv)
 {
 	struct epitem *epi, *tmp;
+	poll_table pt;
 
+	init_poll_funcptr(&pt, NULL);
 	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+		pt._key = epi->event.events;
+		if (epi->ffd.file->f_op->poll(epi->ffd.file, &pt) &
 		    epi->event.events)
 			return POLLIN | POLLRDNORM;
 		else {
···
 	/* Initialize the poll table using the queue callback */
 	epq.epi = epi;
 	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
+	epq.pt._key = event->events;
 
 	/*
 	 * Attach the item to the poll hooks and get current event bits.
···
 {
 	int pwake = 0;
 	unsigned int revents;
+	poll_table pt;
+
+	init_poll_funcptr(&pt, NULL);
 
 	/*
 	 * Set the new event interest mask before calling f_op->poll();
···
 	 * f_op->poll() call and the new event set registering.
 	 */
 	epi->event.events = event->events;
+	pt._key = event->events;
 	epi->event.data = event->data; /* protected by mtx */
 
 	/*
 	 * Get current event bits. We can safely use the file* here because
 	 * its usage count has been increased by the caller of this function.
 	 */
-	revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+	revents = epi->ffd.file->f_op->poll(epi->ffd.file, &pt);
 
 	/*
 	 * If the item is "hot" and it is not registered inside the ready
···
 	unsigned int revents;
 	struct epitem *epi;
 	struct epoll_event __user *uevent;
+	poll_table pt;
+
+	init_poll_funcptr(&pt, NULL);
 
 	/*
 	 * We can loop without lock because we are passed a task private list.
···
 
 		list_del_init(&epi->rdllink);
 
-		revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
+		pt._key = epi->event.events;
+		revents = epi->ffd.file->f_op->poll(epi->ffd.file, &pt) &
 			epi->event.events;
 
 		/*
···
fs/select.c | +18 -22

···
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
-	entry->key = p->key;
+	entry->key = p->_key;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
···
 static inline void wait_key_set(poll_table *wait, unsigned long in,
 				unsigned long out, unsigned long bit)
 {
-	if (wait) {
-		wait->key = POLLEX_SET;
-		if (in & bit)
-			wait->key |= POLLIN_SET;
-		if (out & bit)
-			wait->key |= POLLOUT_SET;
-	}
+	wait->_key = POLLEX_SET;
+	if (in & bit)
+		wait->_key |= POLLIN_SET;
+	if (out & bit)
+		wait->_key |= POLLOUT_SET;
 }
 
 int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
···
 	poll_initwait(&table);
 	wait = &table.pt;
 	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
-		wait = NULL;
+		wait->_qproc = NULL;
 		timed_out = 1;
 	}
 
···
 					if ((mask & POLLIN_SET) && (in & bit)) {
 						res_in |= bit;
 						retval++;
-						wait = NULL;
+						wait->_qproc = NULL;
 					}
 					if ((mask & POLLOUT_SET) && (out & bit)) {
 						res_out |= bit;
 						retval++;
-						wait = NULL;
+						wait->_qproc = NULL;
 					}
 					if ((mask & POLLEX_SET) && (ex & bit)) {
 						res_ex |= bit;
 						retval++;
-						wait = NULL;
+						wait->_qproc = NULL;
 					}
 				}
 			}
···
 			*rexp = res_ex;
 		cond_resched();
 	}
-	wait = NULL;
+	wait->_qproc = NULL;
 	if (retval || timed_out || signal_pending(current))
 		break;
 	if (table.error) {
···
 * interested in events matching the pollfd->events mask, and the result
 * matching that mask is both recorded in pollfd->revents and returned. The
 * pwait poll_table will be used by the fd-provided poll handler for waiting,
- * if non-NULL.
+ * if pwait->_qproc is non-NULL.
 */
static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
{
···
 	if (file != NULL) {
 		mask = DEFAULT_POLLMASK;
 		if (file->f_op && file->f_op->poll) {
-			if (pwait)
-				pwait->key = pollfd->events |
-						POLLERR | POLLHUP;
+			pwait->_key = pollfd->events|POLLERR|POLLHUP;
 			mask = file->f_op->poll(file, pwait);
 		}
 		/* Mask out unneeded events. */
···
 
 	/* Optimise the no-wait case */
 	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
-		pt = NULL;
+		pt->_qproc = NULL;
 		timed_out = 1;
 	}
 
···
 		for (; pfd != pfd_end; pfd++) {
 			/*
 			 * Fish for events. If we found one, record it
-			 * and kill the poll_table, so we don't
+			 * and kill poll_table->_qproc, so we don't
 			 * needlessly register any other waiters after
 			 * this. They'll get immediately deregistered
 			 * when we break out and return.
 			 */
 			if (do_pollfd(pfd, pt)) {
 				count++;
-				pt = NULL;
+				pt->_qproc = NULL;
 			}
 		}
 	}
 	/*
 	 * All waiters have already been registered, so don't provide
-	 * a poll_table to them on the next loop iteration.
+	 * a poll_table->_qproc to them on the next loop iteration.
 	 */
-	pt = NULL;
+	pt->_qproc = NULL;
 	if (!count) {
 		count = wait->error;
 		if (signal_pending(current))
···
include/linux/poll.h | +31 -6

···
 */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);

+/*
+ * Do not touch the structure directly, use the access functions
+ * poll_does_not_wait() and poll_requested_events() instead.
+ */
typedef struct poll_table_struct {
-	poll_queue_proc qproc;
-	unsigned long key;
+	poll_queue_proc _qproc;
+	unsigned long _key;
} poll_table;

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
-	if (p && wait_address)
-		p->qproc(filp, wait_address, p);
+	if (p && p->_qproc && wait_address)
+		p->_qproc(filp, wait_address, p);
+}
+
+/*
+ * Return true if it is guaranteed that poll will not wait. This is the case
+ * if the poll() of another file descriptor in the set got an event, so there
+ * is no need for waiting.
+ */
+static inline bool poll_does_not_wait(const poll_table *p)
+{
+	return p == NULL || p->_qproc == NULL;
+}
+
+/*
+ * Return the set of events that the application wants to poll for.
+ * This is useful for drivers that need to know whether a DMA transfer has
+ * to be started implicitly on poll(). You typically only want to do that
+ * if the application is actually polling for POLLIN and/or POLLOUT.
+ */
+static inline unsigned long poll_requested_events(const poll_table *p)
+{
+	return p ? p->_key : ~0UL;
}

static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
{
-	pt->qproc = qproc;
-	pt->key   = ~0UL; /* all events enabled */
+	pt->_qproc = qproc;
+	pt->_key   = ~0UL; /* all events enabled */
}

struct poll_table_entry {
···
include/net/sock.h | +1 -1

···
 static inline void sock_poll_wait(struct file *filp,
 		wait_queue_head_t *wait_address, poll_table *p)
 {
-	if (p && wait_address) {
+	if (!poll_does_not_wait(p) && wait_address) {
 		poll_wait(filp, wait_address, p);
 		/*
 		 * We need to be sure we are in sync with the
···
net/unix/af_unix.c | +1 -1

···
 	}
 
 	/* No write status requested, avoid expensive OUT tests. */
-	if (wait && !(wait->key & (POLLWRBAND | POLLWRNORM | POLLOUT)))
+	if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
 		return mask;
 
 	writable = unix_writable(sk);
···