[PATCH] aio: remove unlocked task_list test and resulting race

Only one of the run path or the kick path is supposed to put an iocb on the
run list. If both of them do it, then one of them can end up referencing a
freed iocb. The kick path could delete the task_list item from the wait
queue before getting the ctx_lock and putting the iocb on the run list.
The run path was testing the task_list item outside the lock so that it
could catch ki_retry methods that return -EIOCBRETRY *without* putting the
iocb on a wait queue and promising to call kick_iocb(). This unlocked check
could then race with the kick path and cause both to try to put the iocb on
the run list.

The patch stops the run path from testing task_list by requiring that any
ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be
called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY
users, are updated.
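
The loop-around-partial-progress pattern that aio_p{read,write} adopt can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: `struct demo_iocb`, `demo_short_write()`, and `demo_pwrite()` are hypothetical stand-ins for the kiocb fields and the f_op->aio_write() call, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical stand-in for the kiocb fields the patch touches. */
struct demo_iocb {
	const char *buf;	/* next byte to transfer (like ki_buf) */
	size_t left;		/* bytes still to transfer (like ki_left) */
	size_t nbytes;		/* size of the original request (like ki_nbytes) */
};

/* Hypothetical backend: accepts at most 4 bytes per call, like a short write. */
static char demo_sink[64];
static size_t demo_sink_len;

static ssize_t demo_short_write(const char *buf, size_t len)
{
	size_t n = len < 4 ? len : 4;

	if (demo_sink_len + n > sizeof(demo_sink))
		return -1;	/* out of room: report an error */
	memcpy(demo_sink + demo_sink_len, buf, n);
	demo_sink_len += n;
	return (ssize_t)n;
}

/* Loop over partial progress instead of asking the caller to retry. */
static ssize_t demo_pwrite(struct demo_iocb *iocb)
{
	ssize_t ret;

	do {
		ret = demo_short_write(iocb->buf, iocb->left);
		if (ret > 0) {
			assert((size_t)ret <= iocb->left);
			iocb->buf += ret;
			iocb->left -= (size_t)ret;
		}
	} while (ret > 0);

	/* Zero progress or nothing left: report the total transferred. */
	if (ret == 0 || iocb->left == 0)
		return (ssize_t)(iocb->nbytes - iocb->left);
	return ret;
}
```

Because the method keeps calling the backend until it stops making progress, it never needs to return a retry code, so it makes no promise that someone will kick it later.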

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

authored by Zach Brown and committed by Linus Torvalds 897f15fb 998765e5

+68 -47
+34 -47
fs/aio.c
···
 	ret = retry(iocb);
 	current->io_wait = NULL;

-	if (-EIOCBRETRY != ret) {
-		if (-EIOCBQUEUED != ret) {
-			BUG_ON(!list_empty(&iocb->ki_wait.task_list));
-			aio_complete(iocb, ret, 0);
-			/* must not access the iocb after this */
-		}
-	} else {
-		/*
-		 * Issue an additional retry to avoid waiting forever if
-		 * no waits were queued (e.g. in case of a short read).
-		 */
-		if (list_empty(&iocb->ki_wait.task_list))
-			kiocbSetKicked(iocb);
+	if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
+		BUG_ON(!list_empty(&iocb->ki_wait.task_list));
+		aio_complete(iocb, ret, 0);
 	}
 out:
 	spin_lock_irq(&ctx->ctx_lock);
···
 }

 /*
- * Default retry method for aio_read (also used for first time submit)
- * Responsible for updating iocb state as retries progress
+ * aio_p{read,write} are the default ki_retry methods for
+ * IO_CMD_P{READ,WRITE}. They maintains kiocb retry state around potentially
+ * multiple calls to f_op->aio_read(). They loop around partial progress
+ * instead of returning -EIOCBRETRY because they don't have the means to call
+ * kick_iocb().
  */
 static ssize_t aio_pread(struct kiocb *iocb)
 {
···
 	struct inode *inode = mapping->host;
 	ssize_t ret = 0;

-	ret = file->f_op->aio_read(iocb, iocb->ki_buf,
-		iocb->ki_left, iocb->ki_pos);
-
-	/*
-	 * Can't just depend on iocb->ki_left to determine
-	 * whether we are done. This may have been a short read.
-	 */
-	if (ret > 0) {
-		iocb->ki_buf += ret;
-		iocb->ki_left -= ret;
+	do {
+		ret = file->f_op->aio_read(iocb, iocb->ki_buf,
+			iocb->ki_left, iocb->ki_pos);
 		/*
-		 * For pipes and sockets we return once we have
-		 * some data; for regular files we retry till we
-		 * complete the entire read or find that we can't
-		 * read any more data (e.g short reads).
+		 * Can't just depend on iocb->ki_left to determine
+		 * whether we are done. This may have been a short read.
 		 */
-		if (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))
-			ret = -EIOCBRETRY;
-	}
+		if (ret > 0) {
+			iocb->ki_buf += ret;
+			iocb->ki_left -= ret;
+		}
+
+		/*
+		 * For pipes and sockets we return once we have some data; for
+		 * regular files we retry till we complete the entire read or
+		 * find that we can't read any more data (e.g short reads).
+		 */
+	} while (ret > 0 &&
+		!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode));

 	/* This means we must have transferred all that we could */
 	/* No need to retry anymore */
···
 	return ret;
 }

-/*
- * Default retry method for aio_write (also used for first time submit)
- * Responsible for updating iocb state as retries progress
- */
+/* see aio_pread() */
 static ssize_t aio_pwrite(struct kiocb *iocb)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t ret = 0;

-	ret = file->f_op->aio_write(iocb, iocb->ki_buf,
-		iocb->ki_left, iocb->ki_pos);
+	do {
+		ret = file->f_op->aio_write(iocb, iocb->ki_buf,
+			iocb->ki_left, iocb->ki_pos);
+		if (ret > 0) {
+			iocb->ki_buf += ret;
+			iocb->ki_left -= ret;
+		}
+	} while (ret > 0);

-	if (ret > 0) {
-		iocb->ki_buf += ret;
-		iocb->ki_left -= ret;
-
-		ret = -EIOCBRETRY;
-	}
-
-	/* This means we must have transferred all that we could */
-	/* No need to retry anymore */
 	if ((ret == 0) || (iocb->ki_left == 0))
 		ret = iocb->ki_nbytes - iocb->ki_left;

+34
include/linux/aio.h
···
 #define kiocbIsKicked(iocb)	test_bit(KIF_KICKED, &(iocb)->ki_flags)
 #define kiocbIsCancelled(iocb)	test_bit(KIF_CANCELLED, &(iocb)->ki_flags)

+/* is there a better place to document function pointer methods? */
+/**
+ * ki_retry - iocb forward progress callback
+ * @kiocb: The kiocb struct to advance by performing an operation.
+ *
+ * This callback is called when the AIO core wants a given AIO operation
+ * to make forward progress. The kiocb argument describes the operation
+ * that is to be performed. As the operation proceeds, perhaps partially,
+ * ki_retry is expected to update the kiocb with progress made. Typically
+ * ki_retry is set in the AIO core and it itself calls file_operations
+ * helpers.
+ *
+ * ki_retry's return value determines when the AIO operation is completed
+ * and an event is generated in the AIO event ring. Except the special
+ * return values described below, the value that is returned from ki_retry
+ * is transferred directly into the completion ring as the operation's
+ * resulting status. Once this has happened ki_retry *MUST NOT* reference
+ * the kiocb pointer again.
+ *
+ * If ki_retry returns -EIOCBQUEUED it has made a promise that aio_complete()
+ * will be called on the kiocb pointer in the future. The AIO core will
+ * not ask the method again -- ki_retry must ensure forward progress.
+ * aio_complete() must be called once and only once in the future, multiple
+ * calls may result in undefined behaviour.
+ *
+ * If ki_retry returns -EIOCBRETRY it has made a promise that kick_iocb()
+ * will be called on the kiocb pointer in the future. This may happen
+ * through generic helpers that associate kiocb->ki_wait with a wait
+ * queue head that ki_retry uses via current->io_wait. It can also happen
+ * with custom tracking and manual calls to kick_iocb(), though that is
+ * discouraged. In either case, kick_iocb() must be called once and only
+ * once. ki_retry must ensure forward progress, the AIO core will wait
+ * indefinitely for kick_iocb() to be called.
+ */
 struct kiocb {
 	struct list_head	ki_run_list;
 	long			ki_flags;
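
The return-value contract documented for ki_retry can be modelled in userspace along these lines. This is a sketch, not the kernel's code: the `DEMO_` constants are stand-in values chosen for illustration, and `demo_classify()` and `demo_run_iocb()` are hypothetical helpers that only mirror the post-patch check in the run path.

```c
#include <stdbool.h>

/* Stand-in values for illustration; the kernel defines its own constants. */
#define DEMO_EIOCBQUEUED 529
#define DEMO_EIOCBRETRY  530

enum demo_action {
	DEMO_COMPLETE_NOW,	/* core completes the iocb with ret as its status */
	DEMO_WAIT_FOR_COMPLETE,	/* method promised a later aio_complete() */
	DEMO_WAIT_FOR_KICK,	/* method promised a later kick_iocb() */
};

/* Map a ki_retry-style return value to what the core does next. */
static enum demo_action demo_classify(long ret)
{
	if (ret == -DEMO_EIOCBQUEUED)
		return DEMO_WAIT_FOR_COMPLETE;
	if (ret == -DEMO_EIOCBRETRY)
		return DEMO_WAIT_FOR_KICK;
	return DEMO_COMPLETE_NOW;
}

/*
 * Model of the run path after the patch: call the method once and complete
 * only for ordinary return values. For the two special values the core does
 * nothing further and relies on the method's promise.
 */
static bool demo_run_iocb(long (*retry)(void), long *status)
{
	long ret = retry();

	if (demo_classify(ret) != DEMO_COMPLETE_NOW)
		return false;
	*status = ret;
	return true;
}

/* Hypothetical retry methods used to exercise the model. */
static long demo_retry_done(void)  { return 4096; }
static long demo_retry_again(void) { return -DEMO_EIOCBRETRY; }
```

Note that the model has no fallback for a method that returns the retry value without arranging a kick; that matches the contract, under which the core waits indefinitely.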