Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

NTFS: Fix a mount time deadlock.

Big thanks go to Mathias Kolehmainen for reporting the bug, providing
debug output and testing the patches I sent him to get it working.

The fix is to stop calling ntfs_attr_set() at mount time. That call causes
balance_dirty_pages_ratelimited() to be called, which on systems with
little memory actually tries to balance the dirty pages. Balancing tries
to take the s_umount semaphore, but we are still in fill_super(), across
which the VFS holds s_umount for writing, so the result is a deadlock.
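
The lock cycle described above can be sketched in userspace C. This is a toy model, not kernel code: struct toy_rwsem and every toy_* function are hypothetical names made up for illustration, and a failed trylock stands in for an acquisition that would block forever.

```c
#include <stdbool.h>

/* Hypothetical toy model of a reader/writer semaphore such as s_umount. */
struct toy_rwsem {
	int readers;	/* number of read holders */
	bool writer;	/* true while held for writing */
};

/* Try to take the semaphore for writing; fails if anyone holds it. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

/* Try to take the semaphore for reading; fails while a writer holds it. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	sem->readers--;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	sem->writer = false;
}

/*
 * Stand-in for dirty page balancing: only when memory is tight does it
 * start writeback, which needs the semaphore for reading. Returns false
 * if a blocking acquisition could never succeed.
 */
static bool toy_balance_dirty_pages(struct toy_rwsem *s_umount, bool low_memory)
{
	if (!low_memory)
		return true;	/* nothing to balance, lock never touched */
	if (!toy_down_read_trylock(s_umount))
		return false;	/* would block on our own write lock */
	toy_up_read(s_umount);
	return true;
}

/*
 * Stand-in for a mount: the semaphore is held for writing across the
 * fill_super step, during which pages are dirtied and balanced.
 * Returns true if the mount would deadlock.
 */
bool toy_mount_would_deadlock(bool low_memory)
{
	struct toy_rwsem s_umount = { 0, false };
	bool ok;

	if (!toy_down_write_trylock(&s_umount))
		return true;	/* not expected on a fresh semaphore */
	ok = toy_balance_dirty_pages(&s_umount, low_memory);
	toy_up_write(&s_umount);
	return !ok;
}
```

In this model toy_mount_would_deadlock(true) reports the cycle while toy_mount_would_deadlock(false) does not, which matches the report that only machines with little memory were affected.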

We now do the dirty work by hand by submitting individual buffers. This
has the annoying "feature" that mounting can take a few seconds if the
journal is large, as we have to clear it all. One day someone should improve
on this by deferring the journal clearing to a helper kernel thread so it
can be done in the background, but I don't have time for this at the moment
and the current solution works fine, so I am leaving it like this for now.
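
As a rough illustration of clearing block by block, the following userspace C sketch fills a range of blocks with 0xff and, following the comment in the fs/ntfs/logfile.c change in this commit, checks only the result of the first write. struct toy_dev, toy_write_ff_block() and toy_empty_journal() are hypothetical names; the in-memory array is an assumed stand-in for the block device, and nothing here is asynchronous the way real buffer submission is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a block device. */
struct toy_dev {
	unsigned char *data;	/* nr_blocks * block_size bytes */
	size_t block_size;
	size_t nr_blocks;
	size_t fail_block;	/* block whose write "fails", (size_t)-1 for none */
};

/* Write one block of 0xff bytes; returns false to simulate an I/O error. */
static bool toy_write_ff_block(struct toy_dev *dev, size_t block)
{
	if (block >= dev->nr_blocks || block == dev->fail_block)
		return false;
	memset(dev->data + block * dev->block_size, 0xff, dev->block_size);
	return true;
}

/*
 * Fill blocks [start, end) with 0xff one block at a time. Only the result
 * of the first write is checked; failures of later writes are ignored.
 * Returns false only if the first write fails.
 */
bool toy_empty_journal(struct toy_dev *dev, size_t start, size_t end)
{
	bool should_wait = true;
	size_t block;

	for (block = start; block < end; block++) {
		bool ok = toy_write_ff_block(dev, block);

		if (should_wait) {
			should_wait = false;
			if (!ok)
				return false;
		}
	}
	return true;
}
```

The cost is one write per block, which is why a large journal makes the mount noticeably slower; the trade-off of checking only the first write is that later errors go unreported.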

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Anton Altaparmakov and committed by Linus Torvalds
bfab36e8 f26e51f6

+181 -53
+3 -1
Documentation/filesystems/ntfs.txt
···
 	device	/dev/hda5
 	raid-disk	0
 	device	/dev/hdb1
-	raid-disl	1
+	raid-disk	1
 
 For linear raid, just change the raid-level above to "raid-level linear", for
 mirrors, change it to "raid-level 1", and for stripe sets with parity, change
···
 
 Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
 
+2.1.29:
+	- Fix a deadlock when mounting read-write.
 2.1.28:
 	- Fix a deadlock.
 2.1.27:
+12
fs/ntfs/ChangeLog
···
 	  happen is unclear however so it is worth waiting until someone hits
 	  the problem.
 
+2.1.29 - Fix a deadlock at mount time.
+
+	- During mount the VFS holds s_umount lock on the superblock. So when
+	  we try to empty the journal $LogFile contents by calling
+	  ntfs_attr_set() when the machine does not have much memory and the
+	  journal is large ntfs_attr_set() results in the VM trying to balance
+	  dirty pages which in turn tries to that the s_umount lock and thus we
+	  get a deadlock. The solution is to not use ntfs_attr_set() and
+	  instead do the zeroing by hand at the block level rather than page
+	  cache level.
+	- Fix sparse warnings.
+
 2.1.28 - Fix a deadlock.
 
 	- Fix deadlock in fs/ntfs/inode.c::ntfs_put_inode(). Thanks to Sergey
+1 -1
fs/ntfs/Makefile
···
 	  index.o inode.o mft.o mst.o namei.o runlist.o super.o sysctl.o \
 	  unistr.o upcase.o
 
-EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.28\"
+EXTRA_CFLAGS = -DNTFS_VERSION=\"2.1.29\"
 
 ifeq ($(CONFIG_NTFS_DEBUG),y)
 EXTRA_CFLAGS += -DDEBUG
+11 -11
fs/ntfs/aops.c
···
  * aops.c - NTFS kernel address space operations and page cache handling.
  *	    Part of the Linux-NTFS project.
  *
- * Copyright (c) 2001-2006 Anton Altaparmakov
+ * Copyright (c) 2001-2007 Anton Altaparmakov
  * Copyright (c) 2002 Richard Russon
  *
  * This program/include file is free software; you can redistribute it and/or
···
 	loff_t i_size;
 	struct inode *vi;
 	ntfs_inode *ni, *base_ni;
-	u8 *kaddr;
+	u8 *addr;
 	ntfs_attr_search_ctx *ctx;
 	MFT_RECORD *mrec;
 	unsigned long flags;
···
 		/* Race with shrinking truncate. */
 		attr_len = i_size;
 	}
-	kaddr = kmap_atomic(page, KM_USER0);
+	addr = kmap_atomic(page, KM_USER0);
 	/* Copy the data to the page. */
-	memcpy(kaddr, (u8*)ctx->attr +
+	memcpy(addr, (u8*)ctx->attr +
 			le16_to_cpu(ctx->attr->data.resident.value_offset),
 			attr_len);
 	/* Zero the remainder of the page. */
-	memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
+	memset(addr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
 	flush_dcache_page(page);
-	kunmap_atomic(kaddr, KM_USER0);
+	kunmap_atomic(addr, KM_USER0);
 put_unm_err_out:
 	ntfs_attr_put_search_ctx(ctx);
 unm_err_out:
···
 	loff_t i_size;
 	struct inode *vi = page->mapping->host;
 	ntfs_inode *base_ni = NULL, *ni = NTFS_I(vi);
-	char *kaddr;
+	char *addr;
 	ntfs_attr_search_ctx *ctx = NULL;
 	MFT_RECORD *m = NULL;
 	u32 attr_len;
···
 		/* Shrinking cannot fail. */
 		BUG_ON(err);
 	}
-	kaddr = kmap_atomic(page, KM_USER0);
+	addr = kmap_atomic(page, KM_USER0);
 	/* Copy the data from the page to the mft record. */
 	memcpy((u8*)ctx->attr +
 			le16_to_cpu(ctx->attr->data.resident.value_offset),
-			kaddr, attr_len);
+			addr, attr_len);
 	/* Zero out of bounds area in the page cache page. */
-	memset(kaddr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
-	kunmap_atomic(kaddr, KM_USER0);
+	memset(addr + attr_len, 0, PAGE_CACHE_SIZE - attr_len);
+	kunmap_atomic(addr, KM_USER0);
 	flush_dcache_page(page);
 	flush_dcache_mft_record_page(ctx->ntfs_ino);
 	/* We are done with the page. */
+6 -2
fs/ntfs/attrib.c
···
 /**
  * attrib.c - NTFS attribute operations. Part of the Linux-NTFS project.
  *
- * Copyright (c) 2001-2006 Anton Altaparmakov
+ * Copyright (c) 2001-2007 Anton Altaparmakov
  * Copyright (c) 2002 Richard Russon
  *
  * This program/include file is free software; you can redistribute it and/or
···
 	struct page *page;
 	u8 *kaddr;
 	pgoff_t idx, end;
-	unsigned int start_ofs, end_ofs, size;
+	unsigned start_ofs, end_ofs, size;
 
 	ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.",
 			(long long)ofs, (long long)cnt, val);
···
 		kunmap_atomic(kaddr, KM_USER0);
 		set_page_dirty(page);
 		page_cache_release(page);
+		balance_dirty_pages_ratelimited(mapping);
+		cond_resched();
 		if (idx == end)
 			goto done;
 		idx++;
···
 		kunmap_atomic(kaddr, KM_USER0);
 		set_page_dirty(page);
 		page_cache_release(page);
+		balance_dirty_pages_ratelimited(mapping);
+		cond_resched();
 	}
 done:
 	ntfs_debug("Done.");
+17 -19
fs/ntfs/file.c
···
 /*
  * file.c - NTFS kernel file operations. Part of the Linux-NTFS project.
  *
- * Copyright (c) 2001-2006 Anton Altaparmakov
+ * Copyright (c) 2001-2007 Anton Altaparmakov
  *
  * This program/include file is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as published
···
 #include <linux/swap.h>
 #include <linux/uio.h>
 #include <linux/writeback.h>
-#include <linux/sched.h>
 
 #include <asm/page.h>
 #include <asm/uaccess.h>
···
 	volatile char c;
 
 	/* Set @end to the first byte outside the last page we care about. */
-	end = (const char __user*)PAGE_ALIGN((ptrdiff_t __user)uaddr + bytes);
+	end = (const char __user*)PAGE_ALIGN((unsigned long)uaddr + bytes);
 
 	while (!__get_user(c, uaddr) && (uaddr += PAGE_SIZE, uaddr < end))
 		;
···
 	blocksize_bits = vol->sb->s_blocksize_bits;
 	u = 0;
 	do {
-		struct page *page = pages[u];
+		page = pages[u];
+		BUG_ON(!page);
 		/*
 		 * create_empty_buffers() will create uptodate/dirty buffers if
 		 * the page is uptodate/dirty.
···
 		size_t bytes)
 {
 	struct page **last_page = pages + nr_pages;
-	char *kaddr;
+	char *addr;
 	size_t total = 0;
 	unsigned len;
 	int left;
···
 		len = PAGE_CACHE_SIZE - ofs;
 		if (len > bytes)
 			len = bytes;
-		kaddr = kmap_atomic(*pages, KM_USER0);
-		left = __copy_from_user_inatomic(kaddr + ofs, buf, len);
-		kunmap_atomic(kaddr, KM_USER0);
+		addr = kmap_atomic(*pages, KM_USER0);
+		left = __copy_from_user_inatomic(addr + ofs, buf, len);
+		kunmap_atomic(addr, KM_USER0);
 		if (unlikely(left)) {
 			/* Do it the slow way. */
-			kaddr = kmap(*pages);
-			left = __copy_from_user(kaddr + ofs, buf, len);
+			addr = kmap(*pages);
+			left = __copy_from_user(addr + ofs, buf, len);
 			kunmap(*pages);
 			if (unlikely(left))
 				goto err_out;
···
 		size_t *iov_ofs, size_t bytes)
 {
 	struct page **last_page = pages + nr_pages;
-	char *kaddr;
+	char *addr;
 	size_t copied, len, total = 0;
 
 	do {
 		len = PAGE_CACHE_SIZE - ofs;
 		if (len > bytes)
 			len = bytes;
-		kaddr = kmap_atomic(*pages, KM_USER0);
-		copied = __ntfs_copy_from_user_iovec_inatomic(kaddr + ofs,
+		addr = kmap_atomic(*pages, KM_USER0);
+		copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
 				*iov, *iov_ofs, len);
-		kunmap_atomic(kaddr, KM_USER0);
+		kunmap_atomic(addr, KM_USER0);
 		if (unlikely(copied != len)) {
 			/* Do it the slow way. */
-			kaddr = kmap(*pages);
-			copied = __ntfs_copy_from_user_iovec_inatomic(kaddr + ofs,
+			addr = kmap(*pages);
+			copied = __ntfs_copy_from_user_iovec_inatomic(addr + ofs,
 					*iov, *iov_ofs, len);
 			/*
 			 * Zero the rest of the target like __copy_from_user().
 			 */
-			memset(kaddr + ofs + copied, 0, len - copied);
+			memset(addr + ofs + copied, 0, len - copied);
 			kunmap(*pages);
 			if (unlikely(copied != len))
 				goto err_out;
···
 	read_unlock_irqrestore(&ni->size_lock, flags);
 	BUG_ON(initialized_size != i_size);
 	if (end > initialized_size) {
-		unsigned long flags;
-
 		write_lock_irqsave(&ni->size_lock, flags);
 		ni->initialized_size = end;
 		i_size_write(vi, end);
-3
fs/ntfs/inode.c
···
 #include "dir.h"
 #include "debug.h"
 #include "inode.h"
-#include "attrib.h"
 #include "lcnalloc.h"
 #include "malloc.h"
 #include "mft.h"
···
 	/* Resize the attribute record to best fit the new attribute size. */
 	if (new_size < vol->mft_record_size &&
 			!ntfs_resident_attr_value_resize(m, a, new_size)) {
-		unsigned long flags;
-
 		/* The resize succeeded! */
 		flush_dcache_mft_record_page(ctx->ntfs_ino);
 		mark_mft_record_dirty(ctx->ntfs_ino);
+129 -14
fs/ntfs/logfile.c
···
 /*
  * logfile.c - NTFS kernel journal handling. Part of the Linux-NTFS project.
  *
- * Copyright (c) 2002-2005 Anton Altaparmakov
+ * Copyright (c) 2002-2007 Anton Altaparmakov
  *
  * This program/include file is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License as published
···
  */
 bool ntfs_empty_logfile(struct inode *log_vi)
 {
-	ntfs_volume *vol = NTFS_SB(log_vi->i_sb);
+	VCN vcn, end_vcn;
+	ntfs_inode *log_ni = NTFS_I(log_vi);
+	ntfs_volume *vol = log_ni->vol;
+	struct super_block *sb = vol->sb;
+	runlist_element *rl;
+	unsigned long flags;
+	unsigned block_size, block_size_bits;
+	int err;
+	bool should_wait = true;
 
 	ntfs_debug("Entering.");
-	if (!NVolLogFileEmpty(vol)) {
-		int err;
-
-		err = ntfs_attr_set(NTFS_I(log_vi), 0, i_size_read(log_vi),
-				0xff);
-		if (unlikely(err)) {
-			ntfs_error(vol->sb, "Failed to fill $LogFile with "
-					"0xff bytes (error code %i).", err);
-			return false;
-		}
-		/* Set the flag so we do not have to do it again on remount. */
-		NVolSetLogFileEmpty(vol);
+	if (NVolLogFileEmpty(vol)) {
+		ntfs_debug("Done.");
+		return true;
 	}
+	/*
+	 * We cannot use ntfs_attr_set() because we may be still in the middle
+	 * of a mount operation. Thus we do the emptying by hand by first
+	 * zapping the page cache pages for the $LogFile/$DATA attribute and
+	 * then emptying each of the buffers in each of the clusters specified
+	 * by the runlist by hand.
+	 */
+	block_size = sb->s_blocksize;
+	block_size_bits = sb->s_blocksize_bits;
+	vcn = 0;
+	read_lock_irqsave(&log_ni->size_lock, flags);
+	end_vcn = (log_ni->initialized_size + vol->cluster_size_mask) >>
+			vol->cluster_size_bits;
+	read_unlock_irqrestore(&log_ni->size_lock, flags);
+	truncate_inode_pages(log_vi->i_mapping, 0);
+	down_write(&log_ni->runlist.lock);
+	rl = log_ni->runlist.rl;
+	if (unlikely(!rl || vcn < rl->vcn || !rl->length)) {
+map_vcn:
+		err = ntfs_map_runlist_nolock(log_ni, vcn, NULL);
+		if (err) {
+			ntfs_error(sb, "Failed to map runlist fragment (error "
+					"%d).", -err);
+			goto err;
+		}
+		rl = log_ni->runlist.rl;
+		BUG_ON(!rl || vcn < rl->vcn || !rl->length);
+	}
+	/* Seek to the runlist element containing @vcn. */
+	while (rl->length && vcn >= rl[1].vcn)
+		rl++;
+	do {
+		LCN lcn;
+		sector_t block, end_block;
+		s64 len;
+
+		/*
+		 * If this run is not mapped map it now and start again as the
+		 * runlist will have been updated.
+		 */
+		lcn = rl->lcn;
+		if (unlikely(lcn == LCN_RL_NOT_MAPPED)) {
+			vcn = rl->vcn;
+			goto map_vcn;
+		}
+		/* If this run is not valid abort with an error. */
+		if (unlikely(!rl->length || lcn < LCN_HOLE))
+			goto rl_err;
+		/* Skip holes. */
+		if (lcn == LCN_HOLE)
+			continue;
+		block = lcn << vol->cluster_size_bits >> block_size_bits;
+		len = rl->length;
+		if (rl[1].vcn > end_vcn)
+			len = end_vcn - rl->vcn;
+		end_block = (lcn + len) << vol->cluster_size_bits >>
+				block_size_bits;
+		/* Iterate over the blocks in the run and empty them. */
+		do {
+			struct buffer_head *bh;
+
+			/* Obtain the buffer, possibly not uptodate. */
+			bh = sb_getblk(sb, block);
+			BUG_ON(!bh);
+			/* Setup buffer i/o submission. */
+			lock_buffer(bh);
+			bh->b_end_io = end_buffer_write_sync;
+			get_bh(bh);
+			/* Set the entire contents of the buffer to 0xff. */
+			memset(bh->b_data, -1, block_size);
+			if (!buffer_uptodate(bh))
+				set_buffer_uptodate(bh);
+			if (buffer_dirty(bh))
+				clear_buffer_dirty(bh);
+			/*
+			 * Submit the buffer and wait for i/o to complete but
+			 * only for the first buffer so we do not miss really
+			 * serious i/o errors. Once the first buffer has
+			 * completed ignore errors afterwards as we can assume
+			 * that if one buffer worked all of them will work.
+			 */
+			submit_bh(WRITE, bh);
+			if (should_wait) {
+				should_wait = false;
+				wait_on_buffer(bh);
+				if (unlikely(!buffer_uptodate(bh)))
+					goto io_err;
+			}
+			brelse(bh);
+		} while (++block < end_block);
+	} while ((++rl)->vcn < end_vcn);
+	up_write(&log_ni->runlist.lock);
+	/*
+	 * Zap the pages again just in case any got instantiated whilst we were
+	 * emptying the blocks by hand. FIXME: We may not have completed
+	 * writing to all the buffer heads yet so this may happen too early.
+	 * We really should use a kernel thread to do the emptying
+	 * asynchronously and then we can also set the volume dirty and output
+	 * an error message if emptying should fail.
+	 */
+	truncate_inode_pages(log_vi->i_mapping, 0);
+	/* Set the flag so we do not have to do it again on remount. */
+	NVolSetLogFileEmpty(vol);
 	ntfs_debug("Done.");
 	return true;
+io_err:
+	ntfs_error(sb, "Failed to write buffer. Unmount and run chkdsk.");
+	goto dirty_err;
+rl_err:
+	ntfs_error(sb, "Runlist is corrupt. Unmount and run chkdsk.");
+dirty_err:
+	NVolSetErrors(vol);
+	err = -EIO;
+err:
+	up_write(&log_ni->runlist.lock);
+	ntfs_error(sb, "Failed to fill $LogFile with 0xff bytes (error %d).",
+			-err);
+	return false;
 }
 
 #endif /* NTFS_RW */
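
The shift arithmetic in the new ntfs_empty_logfile() converts cluster numbers to device block numbers and rounds the initialized size up to whole clusters. A small standalone C sketch of the same arithmetic follows; the toy_* function names are hypothetical, and the 4096-byte cluster and 512-byte block sizes in the usage note are assumed values, not taken from the commit.

```c
#include <stdint.h>

/*
 * First device block of logical cluster @lcn, using the same shifts as
 * the diff: lcn << cluster_size_bits >> block_size_bits.
 * Assumes a non-negative lcn and no overflow.
 */
int64_t toy_lcn_to_block(int64_t lcn, unsigned cluster_size_bits,
		unsigned block_size_bits)
{
	return (lcn << cluster_size_bits) >> block_size_bits;
}

/*
 * Number of clusters needed to cover @bytes, rounding up, in the same
 * way end_vcn is derived from initialized_size and cluster_size_mask.
 */
int64_t toy_bytes_to_clusters(int64_t bytes, unsigned cluster_size_bits)
{
	int64_t cluster_size_mask = ((int64_t)1 << cluster_size_bits) - 1;

	return (bytes + cluster_size_mask) >> cluster_size_bits;
}
```

With 4096-byte clusters (12 bits) and 512-byte blocks (9 bits), a run starting at cluster 10 with length 3 covers blocks 80 up to but not including 104, that is 24 blocks, and a journal of 4097 initialized bytes rounds up to 2 clusters.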
+2 -2
fs/ntfs/runlist.c
···
 /**
  * runlist.c - NTFS runlist handling code. Part of the Linux-NTFS project.
  *
- * Copyright (c) 2001-2005 Anton Altaparmakov
+ * Copyright (c) 2001-2007 Anton Altaparmakov
  * Copyright (c) 2002-2005 Richard Russon
  *
  * This program/include file is free software; you can redistribute it and/or
···
 				sizeof(*rl));
 		/* Adjust the beginning of the tail if necessary. */
 		if (end > rl->vcn) {
-			s64 delta = end - rl->vcn;
+			delta = end - rl->vcn;
 			rl->vcn = end;
 			rl->length -= delta;
 			/* Only adjust the lcn if it is real. */