Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

NFSv3: handle out-of-order write replies.

NFSv3 includes pre/post wcc attributes which allow the client to
determine if all changes to the file have been made by the client
itself, or if any might have been made by some other client.

If there are gaps in the pre/post ctime sequence it must be assumed that
some other client changed the file in that gap and the local cache must
be suspect. The next time the file is opened the cache should be
invalidated.

Since commit 1c341b777501 ("NFS: Add deferred cache invalidation for
close-to-open consistency violations") in Linux 5.3 the Linux client has
been triggering this invalidation. The chunk in nfs_update_inode() in
particular triggers it.

Unfortunately Linux NFS assumes that all replies will be processed in
the order sent, and will arrive in the order processed. This is not
true in general. Consequently Linux NFS might ignore the wcc info in a
WRITE reply because the reply is in response to a WRITE that was sent
before some other request for which a reply has already been seen. This
is detected by Linux using the gencount tests in nfs_inode_attr_cmp().

Also, when the gencount tests pass it is still possible that the requests
were processed on the server in a different order, and a gap seen in
the ctime sequence might be filled in by a subsequent reply, so gaps
should not immediately trigger delayed invalidation.

The net result is that writing to a server and then reading the file
back can result in going to the server for the read rather than serving
it from cache - all because a couple of replies arrived out-of-order.
This is a performance regression over kernels before 5.3, though the
change in 5.3 is a correctness improvement.

This has been seen with Linux writing to a Netapp server which
occasionally re-orders requests. In testing the majority of requests
were in-order, but a few (maybe two or three at a time) could be
re-ordered.

This patch addresses the problem by recording any gaps seen in the
pre/post ctime sequence and not triggering invalidation until either
there are too many gaps to fit in the table, or until there are no more
active writes and the remaining gaps cannot be resolved.

We allocate a table of 16 gaps on demand. If the allocation fails we
revert to the current behaviour, which costs little as we are unlikely
to be able to cache the writes anyway.

In the table we store a "start->end" pair when iversion is updated, and
"end<-start" pairs for the pre/post pairs reported by the server.
Usually these exactly cancel out and so nothing is stored. When there
are out-of-order replies we do store gaps, and these will eventually be
cancelled against later replies when this client is the only writer.

If the final write is out-of-order there may be one gap remaining when
the file is closed. This will be noticed and, if there is precisely one
gap and the iversion can be advanced to match it, then we do so.

This patch makes no attempt to handle directories correctly. The same
problem potentially exists there: out-of-order replies to create/unlink
requests can cause future lookup requests to be sent to the server
unnecessarily. A similar scheme using the same primitives could be used
to notice and handle such out-of-order replies.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

Authored by NeilBrown, committed by Anna Schumaker
3db63daa 03f5bd75

+144 -15 total

fs/nfs/inode.c (+97 -15)
···
        nfsi->cache_validity |= flags;

-       if (inode->i_mapping->nrpages == 0)
-               nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
-                                         NFS_INO_DATA_INVAL_DEFER);
-       else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
-               nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+       if (inode->i_mapping->nrpages == 0) {
+               nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+               nfs_ooo_clear(nfsi);
+       } else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
+               nfs_ooo_clear(nfsi);
+       }
        trace_nfs_set_cache_invalid(inode, 0);
 }
 EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
···
        trace_nfs_size_truncate(inode, offset);
        i_size_write(inode, offset);
        /* Optimisation */
-       if (offset == 0)
-               NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
-                               NFS_INO_DATA_INVAL_DEFER);
+       if (offset == 0) {
+               NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
+               nfs_ooo_clear(NFS_I(inode));
+       }
        NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;

        spin_unlock(&inode->i_lock);
···
        spin_lock(&inode->i_lock);
        if (list_empty(&nfsi->open_files) &&
-           (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
+           nfs_ooo_test(nfsi))
                nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
                                      NFS_INO_REVAL_FORCED);
        list_add_tail_rcu(&ctx->list, &nfsi->open_files);
···
        set_bit(NFS_INO_INVALIDATING, bitlock);
        smp_wmb();
-       nfsi->cache_validity &=
-               ~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
+       nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
+       nfs_ooo_clear(nfsi);
        spin_unlock(&inode->i_lock);
        trace_nfs_invalidate_mapping_enter(inode);
        ret = nfs_invalidate_mapping(inode, mapping);
···
        return 0;
 }

+static void nfs_ooo_merge(struct nfs_inode *nfsi,
+                         u64 start, u64 end)
+{
+       int i, cnt;
+
+       if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
+               /* No point merging anything */
+               return;
+
+       if (!nfsi->ooo) {
+               nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
+               if (!nfsi->ooo) {
+                       nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+                       return;
+               }
+               nfsi->ooo->cnt = 0;
+       }
+
+       /* add this range, merging if possible */
+       cnt = nfsi->ooo->cnt;
+       for (i = 0; i < cnt; i++) {
+               if (end == nfsi->ooo->gap[i].start)
+                       end = nfsi->ooo->gap[i].end;
+               else if (start == nfsi->ooo->gap[i].end)
+                       start = nfsi->ooo->gap[i].start;
+               else
+                       continue;
+               /* Remove 'i' from table and loop to insert the new range */
+               cnt -= 1;
+               nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
+               i = -1;
+       }
+       if (start != end) {
+               if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
+                       nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+                       kfree(nfsi->ooo);
+                       nfsi->ooo = NULL;
+                       return;
+               }
+               nfsi->ooo->gap[cnt].start = start;
+               nfsi->ooo->gap[cnt].end = end;
+               cnt += 1;
+       }
+       nfsi->ooo->cnt = cnt;
+}
+
+static void nfs_ooo_record(struct nfs_inode *nfsi,
+                          struct nfs_fattr *fattr)
+{
+       /* This reply was out-of-order, so record in the
+        * pre/post change id, possibly cancelling
+        * gaps created when iversion was jumped forward.
+        */
+       if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
+           (fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
+               nfs_ooo_merge(nfsi,
+                             fattr->change_attr,
+                             fattr->pre_change_attr);
+}
+
 static int nfs_refresh_inode_locked(struct inode *inode,
                                    struct nfs_fattr *fattr)
 {
···
        if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
                ret = nfs_update_inode(inode, fattr);
-       else if (attr_cmp == 0)
-               ret = nfs_check_inode_attributes(inode, fattr);
+       else {
+               nfs_ooo_record(NFS_I(inode), fattr);
+
+               if (attr_cmp == 0)
+                       ret = nfs_check_inode_attributes(inode, fattr);
+       }

        trace_nfs_refresh_inode_exit(inode, ret);
        return ret;
···
        if (attr_cmp < 0)
                return 0;
        if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
+               /* Record the pre/post change info before clearing PRECHANGE */
+               nfs_ooo_record(NFS_I(inode), fattr);
                fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
                                | NFS_ATTR_FATTR_PRESIZE
                                | NFS_ATTR_FATTR_PREMTIME
···
        /* More cache consistency checks */
        if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
+               if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
+                   nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
+                       /* There is one remaining gap that hasn't been
+                        * merged into iversion - do that now.
+                        */
+                       inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
+                       kfree(nfsi->ooo);
+                       nfsi->ooo = NULL;
+               }
                if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
                        /* Could it be a race with writeback?
                         */
                        if (!(have_writers || have_delegation)) {
···
                                dprintk("NFS: change_attr change on server for file %s/%ld\n",
                                        inode->i_sb->s_id,
                                        inode->i_ino);
-                       } else if (!have_delegation)
-                               nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
+                       } else if (!have_delegation) {
+                               nfs_ooo_record(nfsi, fattr);
+                               nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
+                                             fattr->change_attr);
+                       }
                        inode_set_iversion_raw(inode, fattr->change_attr);
                }
        } else {
···
                return NULL;
        nfsi->flags = 0UL;
        nfsi->cache_validity = 0UL;
+       nfsi->ooo = NULL;
 #if IS_ENABLED(CONFIG_NFS_V4)
        nfsi->nfs4_acl = NULL;
 #endif /* CONFIG_NFS_V4 */
···
 void nfs_free_inode(struct inode *inode)
 {
+       kfree(NFS_I(inode)->ooo);
        kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
 }
 EXPORT_SYMBOL_GPL(nfs_free_inode);
include/linux/nfs_fs.h (+47)

···
        /* Open contexts for shared mmap writes */
        struct list_head open_files;

+       /* Keep track of out-of-order replies.
+        * The ooo array contains start/end pairs of
+        * numbers from the changeid sequence when
+        * the inode's iversion has been updated.
+        * It also contains end/start pairs (i.e. reverse order)
+        * of sections of the changeid sequence that have
+        * been seen in replies from the server.
+        * Normally these should match and when both
+        * A:B and B:A are found in ooo, they are both removed.
+        * And if a reply with A:B causes an iversion update
+        * of A:B, then neither are added.
+        * When a reply has a pre_change that doesn't match
+        * iversion, then the changeid pair and any consequent
+        * change in iversion ARE added.  Later replies
+        * might fill in the gaps, or possibly a gap is caused
+        * by a change from another client.
+        * When a file or directory is opened, if the ooo table
+        * is not empty, then we assume the gaps were due to
+        * another client and we invalidate the cached data.
+        *
+        * We can only track a limited number of concurrent gaps.
+        * Currently that limit is 16.
+        * We allocate the table on demand.  If there is insufficient
+        * memory, then we probably cannot cache the file anyway
+        * so there is no loss.
+        */
+       struct {
+               int cnt;
+               struct {
+                       u64 start, end;
+               } gap[16];
+       } *ooo;
+
 #if IS_ENABLED(CONFIG_NFS_V4)
        struct nfs4_cached_acl *nfs4_acl;
        /* NFSv4 state */
···
        if (sizeof(ino_t) < sizeof(u64))
                ino ^= fileid >> (sizeof(u64)-sizeof(ino_t)) * 8;
        return ino;
 }

+static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
+{
+       nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
+       kfree(nfsi->ooo);
+       nfsi->ooo = NULL;
+}
+
+static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
+{
+       return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
+              (nfsi->ooo && nfsi->ooo->cnt > 0);
+}
+
 #define NFS_JUKEBOX_RETRY_TIME (5 * HZ)