Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

epoll: use rwlock in order to reduce ep_poll_callback() contention

The goal of this patch is to reduce contention on ep_poll_callback(),
which can be called concurrently from different CPUs under high event
rates with many fds per epoll instance. The problem is easily reproduced
by generating events (writes to a pipe or eventfd) from many threads
while a consumer thread does the polling. In other words, this patch
increases the bandwidth of events which can be delivered from sources to
the poller by adding poll items to the list in a lockless way.

The main change is the replacement of the spinlock with an rwlock, which
is taken on read in ep_poll_callback(); poll items are then added to the
tail of the list using the xchg atomic instruction. The write lock is
taken everywhere else in order to stop list modifications and guarantee
that list updates are fully completed (I assume that the write side of
an rwlock does not starve; the qrwlock implementation appears to provide
this guarantee).

The following are some microbenchmark results based on the test [1],
which starts a number of threads that each generate N events. The test
ends when all events have been successfully fetched by the poller thread:

spinlock
========

threads   events/ms   run-time ms
      8        6402         12495
     16        7045         22709
     32        7395         43268

rwlock + xchg
=============

threads   events/ms   run-time ms
      8       10038          7969
     16       12178         13138
     32       13223         24199

According to these results, the bandwidth of delivered events is
significantly increased, and the execution time is correspondingly reduced.

This patch was tested with various microbenchmarks and with artificial
delays (e.g. "udelay(get_random_int() & 0xff)") introduced in the kernel
on the paths where items are added to the lists.

[1] https://github.com/rouming/test-tools/blob/master/stress-epoll.c

Link: http://lkml.kernel.org/r/20190103150104.17128-5-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Roman Penyaev, committed by Linus Torvalds
a218cc49 c3e320b6

+122 -36
fs/eventpoll.c
···
  *
  * 1) epmutex (mutex)
  * 2) ep->mtx (mutex)
- * 3) ep->wq.lock (spinlock)
+ * 3) ep->lock (rwlock)
  *
  * The acquire order is the one listed above, from 1 to 3.
- * We need a spinlock (ep->wq.lock) because we manipulate objects
+ * We need a rwlock (ep->lock) because we manipulate objects
  * from inside the poll callback, that might be triggered from
  * a wake_up() that in turn might be called from IRQ context.
  * So we can't sleep inside the poll callback and hence we need
···
  * of epoll file descriptors, we use the current recursion depth as
  * the lockdep subkey.
  * It is possible to drop the "ep->mtx" and to use the global
- * mutex "epmutex" (together with "ep->wq.lock") to have it working,
+ * mutex "epmutex" (together with "ep->lock") to have it working,
  * but having "ep->mtx" will make the interface more scalable.
  * Events that require holding "epmutex" are very rare, while for
  * normal operations the epoll private "ep->mtx" will guarantee
···
  * This structure is stored inside the "private_data" member of the file
  * structure and represents the main data structure for the eventpoll
  * interface.
- *
- * Access to it is protected by the lock inside wq.
  */
 struct eventpoll {
         /*
···
         /* List of ready file descriptors */
         struct list_head rdllist;
 
+        /* Lock which protects rdllist and ovflist */
+        rwlock_t lock;
+
         /* RB tree root used to store monitored fd structs */
         struct rb_root_cached rbr;
 
         /*
          * This is a single linked list that chains all the "struct epitem" that
          * happened while transferring ready events to userspace w/out
-         * holding ->wq.lock.
+         * holding ->lock.
          */
         struct epitem *ovflist;
···
          * because we want the "sproc" callback to be able to do it
          * in a lockless way.
          */
-        spin_lock_irq(&ep->wq.lock);
+        write_lock_irq(&ep->lock);
         list_splice_init(&ep->rdllist, &txlist);
         WRITE_ONCE(ep->ovflist, NULL);
-        spin_unlock_irq(&ep->wq.lock);
+        write_unlock_irq(&ep->lock);
 
         /*
          * Now call the callback function.
          */
         res = (*sproc)(ep, &txlist, priv);
 
-        spin_lock_irq(&ep->wq.lock);
+        write_lock_irq(&ep->lock);
         /*
          * During the time we spent inside the "sproc" callback, some
          * other events might have been queued by the poll callback.
···
                  * the ->poll() wait list (delayed after we release the lock).
                  */
                 if (waitqueue_active(&ep->wq))
-                        wake_up_locked(&ep->wq);
+                        wake_up(&ep->wq);
                 if (waitqueue_active(&ep->poll_wait))
                         pwake++;
         }
-        spin_unlock_irq(&ep->wq.lock);
+        write_unlock_irq(&ep->lock);
 
         if (!ep_locked)
                 mutex_unlock(&ep->mtx);
···
 
         rb_erase_cached(&epi->rbn, &ep->rbr);
 
-        spin_lock_irq(&ep->wq.lock);
+        write_lock_irq(&ep->lock);
         if (ep_is_linked(epi))
                 list_del_init(&epi->rdllink);
-        spin_unlock_irq(&ep->wq.lock);
+        write_unlock_irq(&ep->lock);
 
         wakeup_source_unregister(ep_wakeup_source(epi));
         /*
···
  * Walks through the whole tree by freeing each "struct epitem". At this
  * point we are sure no poll callbacks will be lingering around, and also by
  * holding "epmutex" we can be sure that no file cleanup code will hit
- * us during this operation. So we can avoid the lock on "ep->wq.lock".
+ * us during this operation. So we can avoid the lock on "ep->lock".
  * We do not need to lock ep->mtx, either, we only do it to prevent
  * a lockdep warning.
  */
···
                 goto free_uid;
 
         mutex_init(&ep->mtx);
+        rwlock_init(&ep->lock);
         init_waitqueue_head(&ep->wq);
         init_waitqueue_head(&ep->poll_wait);
         INIT_LIST_HEAD(&ep->rdllist);
···
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
 
+/**
+ * Adds a new entry to the tail of the list in a lockless way, i.e.
+ * multiple CPUs are allowed to call this function concurrently.
+ *
+ * Beware: it is necessary to prevent any other modifications of the
+ * existing list until all changes are completed, in other words
+ * concurrent list_add_tail_lockless() calls should be protected
+ * with a read lock, where write lock acts as a barrier which
+ * makes sure all list_add_tail_lockless() calls are fully
+ * completed.
+ *
+ * Also an element can be locklessly added to the list only in one
+ * direction i.e. either to the tail either to the head, otherwise
+ * concurrent access will corrupt the list.
+ *
+ * Returns %false if element has been already added to the list, %true
+ * otherwise.
+ */
+static inline bool list_add_tail_lockless(struct list_head *new,
+                                          struct list_head *head)
+{
+        struct list_head *prev;
+
+        /*
+         * This is simple 'new->next = head' operation, but cmpxchg()
+         * is used in order to detect that same element has been just
+         * added to the list from another CPU: the winner observes
+         * new->next == new.
+         */
+        if (cmpxchg(&new->next, new, head) != new)
+                return false;
+
+        /*
+         * Initially ->next of a new element must be updated with the head
+         * (we are inserting to the tail) and only then pointers are atomically
+         * exchanged.  XCHG guarantees memory ordering, thus ->next should be
+         * updated before pointers are actually swapped and pointers are
+         * swapped before prev->next is updated.
+         */
+
+        prev = xchg(&head->prev, new);
+
+        /*
+         * It is safe to modify prev->next and new->prev, because a new element
+         * is added only to the tail and new->next is updated before XCHG.
+         */
+
+        prev->next = new;
+        new->prev = prev;
+
+        return true;
+}
+
+/**
+ * Chains a new epi entry to the tail of the ep->ovflist in a lockless way,
+ * i.e. multiple CPUs are allowed to call this function concurrently.
+ *
+ * Returns %false if epi element has been already chained, %true otherwise.
+ */
+static inline bool chain_epi_lockless(struct epitem *epi)
+{
+        struct eventpoll *ep = epi->ep;
+
+        /* Check that the same epi has not been just chained from another CPU */
+        if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
+                return false;
+
+        /* Atomically exchange tail */
+        epi->next = xchg(&ep->ovflist, epi);
+
+        return true;
+}
+
 /*
  * This is the callback that is passed to the wait queue wakeup
  * mechanism. It is called by the stored file descriptors when they
  * have events to report.
+ *
+ * This callback takes a read lock in order not to content with concurrent
+ * events from another file descriptors, thus all modifications to ->rdllist
+ * or ->ovflist are lockless.  Read lock is paired with the write lock from
+ * ep_scan_ready_list(), which stops all list modifications and guarantees
+ * that lists state is seen correctly.
+ *
+ * Another thing worth to mention is that ep_poll_callback() can be called
+ * concurrently for the same @epi from different CPUs if poll table was inited
+ * with several wait queues entries.  Plural wakeup from different CPUs of a
+ * single wait queue is serialized by wq.lock, but the case when multiple wait
+ * queues are used should be detected accordingly.  This is detected using
+ * cmpxchg() operation.
  */
 static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 {
         int pwake = 0;
-        unsigned long flags;
         struct epitem *epi = ep_item_from_wait(wait);
         struct eventpoll *ep = epi->ep;
         __poll_t pollflags = key_to_poll(key);
+        unsigned long flags;
         int ewake = 0;
 
-        spin_lock_irqsave(&ep->wq.lock, flags);
+        read_lock_irqsave(&ep->lock, flags);
 
         ep_set_busy_poll_napi_id(epi);
 
···
          * chained in ep->ovflist and requeued later on.
          */
         if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
-                if (epi->next == EP_UNACTIVE_PTR) {
-                        epi->next = READ_ONCE(ep->ovflist);
-                        WRITE_ONCE(ep->ovflist, epi);
+                if (epi->next == EP_UNACTIVE_PTR &&
+                    chain_epi_lockless(epi))
                         ep_pm_stay_awake_rcu(epi);
-                }
                 goto out_unlock;
         }
 
         /* If this file is already in the ready list we exit soon */
-        if (!ep_is_linked(epi)) {
-                list_add_tail(&epi->rdllink, &ep->rdllist);
+        if (!ep_is_linked(epi) &&
+            list_add_tail_lockless(&epi->rdllink, &ep->rdllist)) {
                 ep_pm_stay_awake_rcu(epi);
         }
···
                                 break;
                         }
                 }
-                wake_up_locked(&ep->wq);
+                wake_up(&ep->wq);
         }
         if (waitqueue_active(&ep->poll_wait))
                 pwake++;
 
 out_unlock:
-        spin_unlock_irqrestore(&ep->wq.lock, flags);
+        read_unlock_irqrestore(&ep->lock, flags);
 
         /* We have to call this outside the lock */
         if (pwake)
···
                 goto error_remove_epi;
 
         /* We have to drop the new item inside our item list to keep track of it */
-        spin_lock_irq(&ep->wq.lock);
+        write_lock_irq(&ep->lock);
 
         /* record NAPI ID of new item if present */
         ep_set_busy_poll_napi_id(epi);
···
 
                 /* Notify waiting tasks that events are available */
                 if (waitqueue_active(&ep->wq))
-                        wake_up_locked(&ep->wq);
+                        wake_up(&ep->wq);
                 if (waitqueue_active(&ep->poll_wait))
                         pwake++;
         }
 
-        spin_unlock_irq(&ep->wq.lock);
+        write_unlock_irq(&ep->lock);
 
         atomic_long_inc(&ep->user->epoll_watches);
···
          * list, since that is used/cleaned only inside a section bound by "mtx".
          * And ep_insert() is called with "mtx" held.
          */
-        spin_lock_irq(&ep->wq.lock);
+        write_lock_irq(&ep->lock);
         if (ep_is_linked(epi))
                 list_del_init(&epi->rdllink);
-        spin_unlock_irq(&ep->wq.lock);
+        write_unlock_irq(&ep->lock);
 
         wakeup_source_unregister(ep_wakeup_source(epi));
···
          * 1) Flush epi changes above to other CPUs.  This ensures
          *    we do not miss events from ep_poll_callback if an
          *    event occurs immediately after we call f_op->poll().
-         *    We need this because we did not take ep->wq.lock while
+         *    We need this because we did not take ep->lock while
          *    changing epi above (but ep_poll_callback does take
-         *    ep->wq.lock).
+         *    ep->lock).
          *
          * 2) We also need to ensure we do not miss _past_ events
          *    when calling f_op->poll().  This barrier also
···
          * list, push it inside.
          */
         if (ep_item_poll(epi, &pt, 1)) {
-                spin_lock_irq(&ep->wq.lock);
+                write_lock_irq(&ep->lock);
                 if (!ep_is_linked(epi)) {
                         list_add_tail(&epi->rdllink, &ep->rdllist);
                         ep_pm_stay_awake(epi);
 
                         /* Notify waiting tasks that events are available */
                         if (waitqueue_active(&ep->wq))
-                                wake_up_locked(&ep->wq);
+                                wake_up(&ep->wq);
                         if (waitqueue_active(&ep->poll_wait))
                                 pwake++;
                 }
-                spin_unlock_irq(&ep->wq.lock);
+                write_unlock_irq(&ep->lock);
         }
 
         /* We have to call this outside the lock */
···
          */
                 timed_out = 1;
 
-                spin_lock_irq(&ep->wq.lock);
+                write_lock_irq(&ep->lock);
                 eavail = ep_events_available(ep);
-                spin_unlock_irq(&ep->wq.lock);
+                write_unlock_irq(&ep->lock);
 
                 goto send_events;
         }