Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

shmem: fix faulting into a hole while it's punched

Trinity finds that mmap access to a hole while it's being punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing until the reader chooses to stop, with the puncher's
hold on i_mutex locking out all other writers until it can complete.

It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.

Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Authored by Hugh Dickins, committed by Linus Torvalds
f00cdc6d f72e7dcd

+52 -4
mm/shmem.c
@@ (struct shmem_falloc) @@
 #define SHORT_SYMLINK_LEN 128
 
 /*
- * shmem_fallocate and shmem_writepage communicate via inode->i_private
- * (with i_mutex making sure that it has only one user at a time):
- * we would prefer not to enlarge the shmem inode just for that.
+ * shmem_fallocate communicates with shmem_fault or shmem_writepage via
+ * inode->i_private (with i_mutex making sure that it has only one user at
+ * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
+	int	mode;		/* FALLOC_FL mode currently operating */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */

@@ shmem_writepage @@
 	spin_lock(&inode->i_lock);
 	shmem_falloc = inode->i_private;
 	if (shmem_falloc &&
+	    !shmem_falloc->mode &&
 	    index >= shmem_falloc->start &&
 	    index < shmem_falloc->next)
 		shmem_falloc->nr_unswapped++;

@@ shmem_fault @@
 	struct inode *inode = file_inode(vma->vm_file);
 	int error;
 	int ret = VM_FAULT_LOCKED;
+
+	/*
+	 * Trinity finds that probing a hole which tmpfs is punching can
+	 * prevent the hole-punch from ever completing: which in turn
+	 * locks writers out with its hold on i_mutex.  So refrain from
+	 * faulting pages into the hole while it's being punched, and
+	 * wait on i_mutex to be released if vmf->flags permits.
+	 */
+	if (unlikely(inode->i_private)) {
+		struct shmem_falloc *shmem_falloc;
+
+		spin_lock(&inode->i_lock);
+		shmem_falloc = inode->i_private;
+		if (!shmem_falloc ||
+		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
+		    vmf->pgoff < shmem_falloc->start ||
+		    vmf->pgoff >= shmem_falloc->next)
+			shmem_falloc = NULL;
+		spin_unlock(&inode->i_lock);
+		/*
+		 * i_lock has protected us from taking shmem_falloc seriously
+		 * once return from shmem_fallocate() went back up that stack.
+		 * i_lock does not serialize with i_mutex at all, but it does
+		 * not matter if sometimes we wait unnecessarily, or sometimes
+		 * miss out on waiting: we just need to make those cases rare.
+		 */
+		if (shmem_falloc) {
+			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
+			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				up_read(&vma->vm_mm->mmap_sem);
+				mutex_lock(&inode->i_mutex);
+				mutex_unlock(&inode->i_mutex);
+				return VM_FAULT_RETRY;
+			}
+			/* cond_resched?  Leave that to GUP or return to user */
+			return VM_FAULT_NOPAGE;
+		}
+	}
 
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)

@@ shmem_fallocate @@
 	mutex_lock(&inode->i_mutex);
 
+	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
+
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
+
+		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
+		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
+		spin_lock(&inode->i_lock);
+		inode->i_private = &shmem_falloc;
+		spin_unlock(&inode->i_lock);
 
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,

@@ shmem_fallocate @@
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
-		goto out;
+		goto undone;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */