Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

epoll: limit paths

The current epoll code can be tickled to run basically indefinitely in
both the loop detection path check (on ep_insert()) and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have already been visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried, it reduced the run-time from 15 minutes (all
in kernel time) to 0.3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner,
by keeping track of nodes that have already been visited, but that is
harder, since there can be concurrent wakeups on different CPUs. Thus,
I've opted to limit the number of possible wakeup paths when the paths
are created.

This is accomplished by noting that the end file descriptors found
during the loop detection pass (from the newly added link) are actually
the sources for wakeup events. I keep a list of these file descriptors
and limit the number and length of the paths that emanate from these
'source file descriptors'. In the current implementation I allow 1000
paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and
10 of length 5. Note that it is sufficient to check only the 'source
file descriptors' reachable from the newly added link, since no other
'source file descriptors' will have newly added links. This allows us to
check only the wakeup paths that may have become too long, and not
re-check all possible wakeup paths on the system.

In terms of the path limit selection, it's first worth noting that the
most common case for epoll is probably the model where you have 1 epoll
file descriptor that is monitoring n 'source file descriptors'. In this
case, each 'source file descriptor' has 1 path of length 1. Thus, I
believe that the limits I'm proposing are quite reasonable and in fact
may be too generous. I'm hoping that the proposed limits will not cause
any workloads that currently work to fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex so that we can correctly
traverse a coherent graph to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the epoll_ctl add operations in the initial path loop
detection code, using the argument that it was not on a critical path.

Another thing to note here is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Jason Baron; committed by Linus Torvalds
Commit 28d82dc1 (parent 2ccd4f4d), +210 -24 overall
fs/eventpoll.c (+208 -24)

···
 	/* The user that created the eventpoll descriptor */
 	struct user_struct *user;
+
+	struct file *file;
+
+	/* used to optimize loop detection check */
+	int visited;
+	struct list_head visited_list_link;
 };

 /* Wait structure used by the poll hooks */
···
 /* Slab cache used to allocate "struct eppoll_entry" */
 static struct kmem_cache *pwq_cache __read_mostly;

+/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
+static LIST_HEAD(visited_list);
+
+/*
+ * List of files with newly added links, where we may need to limit the number
+ * of emanating paths. Protected by the epmutex.
+ */
+static LIST_HEAD(tfile_check_list);
+
 #ifdef CONFIG_SYSCTL

 #include <linux/sysctl.h>
···
 };
 #endif /* CONFIG_SYSCTL */

+static const struct file_operations eventpoll_fops;
+
+static inline int is_file_epoll(struct file *f)
+{
+	return f->f_op == &eventpoll_fops;
+}

 /* Setup the structure that is used as key for the RB tree */
 static inline void ep_set_ffd(struct epoll_filefd *ffd,
···
 	.llseek		= noop_llseek,
 };

-/* Fast test to see if the file is an eventpoll file */
-static inline int is_file_epoll(struct file *f)
-{
-	return f->f_op == &eventpoll_fops;
-}
-
 /*
  * This is called from eventpoll_release() to unlink files from the eventpoll
  * interface. We need to have this facility to cleanup correctly files that are
···
 	rb_insert_color(&epi->rbn, &ep->rbr);
 }

+
+
+#define PATH_ARR_SIZE 5
+/*
+ * These are the number paths of length 1 to 5, that we are allowing to emanate
+ * from a single file of interest. For example, we allow 1000 paths of length
+ * 1, to emanate from each file of interest. This essentially represents the
+ * potential wakeup paths, which need to be limited in order to avoid massive
+ * uncontrolled wakeup storms. The common use case should be a single ep which
+ * is connected to n file sources. In this case each file source has 1 path
+ * of length 1. Thus, the numbers below should be more than sufficient. These
+ * path limits are enforced during an EPOLL_CTL_ADD operation, since a modify
+ * and delete can't add additional paths. Protected by the epmutex.
+ */
+static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
+static int path_count[PATH_ARR_SIZE];
+
+static int path_count_inc(int nests)
+{
+	if (++path_count[nests] > path_limits[nests])
+		return -1;
+	return 0;
+}
+
+static void path_count_init(void)
+{
+	int i;
+
+	for (i = 0; i < PATH_ARR_SIZE; i++)
+		path_count[i] = 0;
+}
+
+static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
+{
+	int error = 0;
+	struct file *file = priv;
+	struct file *child_file;
+	struct epitem *epi;
+
+	list_for_each_entry(epi, &file->f_ep_links, fllink) {
+		child_file = epi->ep->file;
+		if (is_file_epoll(child_file)) {
+			if (list_empty(&child_file->f_ep_links)) {
+				if (path_count_inc(call_nests)) {
+					error = -1;
+					break;
+				}
+			} else {
+				error = ep_call_nested(&poll_loop_ncalls,
+							EP_MAX_NESTS,
+							reverse_path_check_proc,
+							child_file, child_file,
+							current);
+			}
+			if (error != 0)
+				break;
+		} else {
+			printk(KERN_ERR "reverse_path_check_proc: "
+				"file is not an ep!\n");
+		}
+	}
+	return error;
+}
+
+/**
+ * reverse_path_check - The tfile_check_list is list of file *, which have
+ *                      links that are proposed to be newly added. We need to
+ *                      make sure that those added links don't add too many
+ *                      paths such that we will spend all our time waking up
+ *                      eventpoll objects.
+ *
+ * Returns: Returns zero if the proposed links don't create too many paths,
+ *	    -1 otherwise.
+ */
+static int reverse_path_check(void)
+{
+	int length = 0;
+	int error = 0;
+	struct file *current_file;
+
+	/* let's call this for all tfiles */
+	list_for_each_entry(current_file, &tfile_check_list, f_tfile_llink) {
+		length++;
+		path_count_init();
+		error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+					reverse_path_check_proc, current_file,
+					current_file, current);
+		if (error)
+			break;
+	}
+	return error;
+}
+
 /*
  * Must be called with "mtx" held.
  */
···
 	 */
 	ep_rbtree_insert(ep, epi);

+	/* now check if we've created too many backpaths */
+	error = -EINVAL;
+	if (reverse_path_check())
+		goto error_remove_epi;
+
 	/* We have to drop the new item inside our item list to keep track of it */
 	spin_lock_irqsave(&ep->lock, flags);
···
 		ep_poll_safewake(&ep->poll_wait);

 	return 0;
+
+error_remove_epi:
+	spin_lock(&tfile->f_lock);
+	if (ep_is_linked(&epi->fllink))
+		list_del_init(&epi->fllink);
+	spin_unlock(&tfile->f_lock);
+
+	rb_erase(&epi->rbn, &ep->rbr);

 error_unregister:
 	ep_unregister_pollwait(ep, epi);
···
 	int error = 0;
 	struct file *file = priv;
 	struct eventpoll *ep = file->private_data;
+	struct eventpoll *ep_tovisit;
 	struct rb_node *rbp;
 	struct epitem *epi;

 	mutex_lock_nested(&ep->mtx, call_nests + 1);
+	ep->visited = 1;
+	list_add(&ep->visited_list_link, &visited_list);
 	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 		if (unlikely(is_file_epoll(epi->ffd.file))) {
+			ep_tovisit = epi->ffd.file->private_data;
+			if (ep_tovisit->visited)
+				continue;
 			error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
-					       ep_loop_check_proc, epi->ffd.file,
-					       epi->ffd.file->private_data, current);
+					ep_loop_check_proc, epi->ffd.file,
+					ep_tovisit, current);
 			if (error != 0)
 				break;
+		} else {
+			/*
+			 * If we've reached a file that is not associated with
+			 * an ep, then we need to check if the newly added
+			 * links are going to add too many wakeup paths. We do
+			 * this by adding it to the tfile_check_list, if it's
+			 * not already there, and calling reverse_path_check()
+			 * during ep_insert().
+			 */
+			if (list_empty(&epi->ffd.file->f_tfile_llink))
+				list_add(&epi->ffd.file->f_tfile_llink,
+					 &tfile_check_list);
 		}
 	}
 	mutex_unlock(&ep->mtx);
···
 static int ep_loop_check(struct eventpoll *ep, struct file *file)
 {
-	return ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+	int ret;
+	struct eventpoll *ep_cur, *ep_next;
+
+	ret = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
 			      ep_loop_check_proc, file, ep, current);
+	/* clear visited list */
+	list_for_each_entry_safe(ep_cur, ep_next, &visited_list,
+							visited_list_link) {
+		ep_cur->visited = 0;
+		list_del(&ep_cur->visited_list_link);
+	}
+	return ret;
+}
+
+static void clear_tfile_check_list(void)
+{
+	struct file *file;
+
+	/* first clear the tfile_check_list */
+	while (!list_empty(&tfile_check_list)) {
+		file = list_first_entry(&tfile_check_list, struct file,
+					f_tfile_llink);
+		list_del_init(&file->f_tfile_llink);
+	}
+	INIT_LIST_HEAD(&tfile_check_list);
 }
···
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
-	int error;
+	int error, fd;
 	struct eventpoll *ep = NULL;
+	struct file *file;

 	/* Check the EPOLL_* constant for consistency. */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
···
 	 * Creates all the items needed to setup an eventpoll file. That is,
 	 * a file structure and a free file descriptor.
 	 */
-	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
+	if (fd < 0) {
+		error = fd;
+		goto out_free_ep;
+	}
+	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
 				 O_RDWR | (flags & O_CLOEXEC));
-	if (error < 0)
-		ep_free(ep);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out_free_fd;
+	}
+	fd_install(fd, file);
+	ep->file = file;
+	return fd;

+out_free_fd:
+	put_unused_fd(fd);
+out_free_ep:
+	ep_free(ep);
 	return error;
 }
···
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
-	 * better be handled here, than in more critical paths.
+	 * better be handled here, than in more critical paths. While we are
+	 * checking for loops we also determine the list of files reachable
+	 * and hang them on the tfile_check_list, so we can check that we
+	 * haven't created too many possible wakeup paths.
 	 *
-	 * We hold epmutex across the loop check and the insert in this case, in
-	 * order to prevent two separate inserts from racing and each doing the
-	 * insert "at the same time" such that ep_loop_check passes on both
-	 * before either one does the insert, thereby creating a cycle.
+	 * We need to hold the epmutex across both ep_insert and ep_remove
+	 * b/c we want to make sure we are looking at a coherent view of
+	 * epoll network.
 	 */
-	if (unlikely(is_file_epoll(tfile) && op == EPOLL_CTL_ADD)) {
+	if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
 		mutex_lock(&epmutex);
 		did_lock_epmutex = 1;
-		error = -ELOOP;
-		if (ep_loop_check(ep, tfile) != 0)
-			goto error_tgt_fput;
 	}
-
+	if (op == EPOLL_CTL_ADD) {
+		if (is_file_epoll(tfile)) {
+			error = -ELOOP;
+			if (ep_loop_check(ep, tfile) != 0)
+				goto error_tgt_fput;
+		} else
+			list_add(&tfile->f_tfile_llink, &tfile_check_list);
+	}

 	mutex_lock_nested(&ep->mtx, 0);
···
 			error = ep_insert(ep, &epds, tfile, fd);
 		} else
 			error = -EEXIST;
+		clear_tfile_check_list();
 		break;
 	case EPOLL_CTL_DEL:
 		if (epi)
···
 	mutex_unlock(&ep->mtx);

 error_tgt_fput:
-	if (unlikely(did_lock_epmutex))
+	if (did_lock_epmutex)
 		mutex_unlock(&epmutex);

 	fput(tfile);
include/linux/eventpoll.h (+1)

···
 static inline void eventpoll_init_file(struct file *file)
 {
 	INIT_LIST_HEAD(&file->f_ep_links);
+	INIT_LIST_HEAD(&file->f_tfile_llink);
 }
include/linux/fs.h (+1)

···
 #ifdef CONFIG_EPOLL
 	/* Used by fs/eventpoll.c to link all the hooks to this file */
 	struct list_head f_ep_links;
+	struct list_head f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space *f_mapping;
 #ifdef CONFIG_DEBUG_WRITECOUNT