radix-tree: fix small lockless radix-tree bug

We shrink a radix tree when its root node has only one child, in the
leftmost slot. The child becomes the new root node. To perform this
operation in a manner compatible with concurrent lockless lookups, we
atomically switch the root pointer from the parent to its child.

However, a concurrent lockless lookup may already have loaded a pointer to
the parent (and is presently deciding what to do next). For this reason, we
also have to keep the parent node in a valid state after shrinking the
tree, until the next RCU grace period -- otherwise this lookup with the
parent pointer may not do the right thing. Notably, we need to keep the
child in the leftmost slot there in case that is requested by the lookup.

This is all pretty standard RCU stuff. It is worth repeating because in
my eagerness to obey the radix tree node constructor scheme, I had broken
it by zeroing the radix tree node before the grace period.

What could happen is that a lookup loads the parent pointer, then
decides it wants to follow the leftmost child slot, only to find that the
slot contains NULL because the concurrent shrinker zeroed the parent
node before waiting for a grace period. The lookup would return a false
negative as a result.
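
The interleaving described above can be replayed deterministically in a small
user-space model. This is only a sketch: `struct node`, `struct root`, `SLOTS`,
`shrink_zero_early()` and `replay_stale_lookup()` are hypothetical names
invented for illustration, not the kernel's definitions, and the "concurrent"
steps are simply executed in the problematic order on one thread.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4	/* hypothetical fan-out for this model */

/* Simplified stand-ins for the radix tree node and root (assumptions). */
struct node {
	unsigned int count;
	void *slots[SLOTS];
};

struct root {
	struct node *rnode;
};

/*
 * Buggy shrink: publish the only child as the new root, then zero the
 * old parent immediately instead of waiting for a grace period.
 */
static struct node *shrink_zero_early(struct root *root)
{
	struct node *to_free = root->rnode;

	root->rnode = to_free->slots[0];
	to_free->slots[0] = NULL;
	to_free->count = 0;
	return to_free;
}

/*
 * Replay the race in order: the reader snapshots the root, the writer
 * shrinks the tree, the reader resumes with its stale snapshot.
 * Returns whatever the reader finds in slot 0 of the old parent.
 */
static void *replay_stale_lookup(void)
{
	struct node child = { 0 };
	struct node parent = { 0 };
	struct root root = { &parent };
	struct node *snapshot;

	parent.slots[0] = &child;
	parent.count = 1;

	snapshot = root.rnode;		/* reader: loads the parent pointer */
	shrink_zero_early(&root);	/* writer: shrinks and zeroes parent */
	return snapshot->slots[0];	/* reader: follows the leftmost slot */
}
```

In this model `replay_stale_lookup()` returns NULL even though the child is
still in the tree (it is now the root), which is the false negative.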

Fix it by doing that clearing in the RCU callback. I would normally want
to rip out the constructor entirely, but radix tree nodes are one of those
places where a constructor makes sense (only a few cachelines will be
touched soon after allocation).
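
The shape of the fix can be sketched in the same kind of user-space model. The
deferred list below is a hypothetical stand-in for `call_rcu()` and the grace
period, and every name in the block (`defer_free()`, `run_deferred()`,
`shrink_deferred()`, `replay_deferred_clear()`) is an assumption made for
illustration rather than kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4	/* hypothetical fan-out for this model */

/* Simplified stand-ins for the radix tree node and root (assumptions). */
struct node {
	unsigned int count;
	void *slots[SLOTS];
	struct node *next_deferred;	/* stand-in for struct rcu_head */
};

struct root {
	struct node *rnode;
};

static struct node *deferred_list;	/* nodes awaiting a grace period */

/* Stand-in for call_rcu(): queue the node and leave its contents alone. */
static void defer_free(struct node *node)
{
	node->next_deferred = deferred_list;
	deferred_list = node;
}

/*
 * Stand-in for the end of a grace period: no reader can still hold a
 * queued node, so it is now safe to zero it before it would be freed.
 */
static void run_deferred(void)
{
	while (deferred_list) {
		struct node *node = deferred_list;

		deferred_list = node->next_deferred;
		node->slots[0] = NULL;
		node->count = 0;
	}
}

/* Fixed shrink: switch the root pointer, defer the clearing. */
static void shrink_deferred(struct root *root)
{
	struct node *to_free = root->rnode;

	root->rnode = to_free->slots[0];
	defer_free(to_free);
}

/*
 * Returns 1 if a reader holding the stale parent still finds the child
 * before the deferred callbacks run, and the parent is zeroed afterwards.
 */
static int replay_deferred_clear(void)
{
	struct node child = { 0 };
	struct node parent = { 0 };
	struct root root = { &parent };
	struct node *snapshot;
	void *seen;

	parent.slots[0] = &child;
	parent.count = 1;

	snapshot = root.rnode;		/* reader: loads the parent pointer */
	shrink_deferred(&root);		/* writer: shrinks, defers clearing */
	seen = snapshot->slots[0];	/* reader: slot is still intact */
	run_deferred();			/* grace period over: zero the node */

	return seen == &child && root.rnode == &child &&
	       parent.slots[0] == NULL && parent.count == 0;
}
```

Here the stale reader still reaches the child, and the node is zeroed only
once the deferred work runs, so it would go back to the slab in the state the
constructor scheme expects.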

This was never actually found in any lockless pagecache testing or by the
test harness, but by seeing the odd problem with my scalable vmap rewrite.
I have not tickled the test harness into reproducing it yet, but I'll
keep working at it.

Fortunately, it is not a problem anywhere lockless pagecache is used in
mainline kernels (pagecache probe is not a guarantee, and brd does not
have concurrent lookups and deletes).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by Nick Piggin and committed by Linus Torvalds 643b52b9 d2187ebd

+62 -58
lib/radix-tree.c
···
 	return root->gfp_mask & __GFP_BITS_MASK;
 }
 
+static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
+		int offset)
+{
+	__set_bit(offset, node->tags[tag]);
+}
+
+static inline void tag_clear(struct radix_tree_node *node, unsigned int tag,
+		int offset)
+{
+	__clear_bit(offset, node->tags[tag]);
+}
+
+static inline int tag_get(struct radix_tree_node *node, unsigned int tag,
+		int offset)
+{
+	return test_bit(offset, node->tags[tag]);
+}
+
+static inline void root_tag_set(struct radix_tree_root *root, unsigned int tag)
+{
+	root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
+}
+
+static inline void root_tag_clear(struct radix_tree_root *root, unsigned int tag)
+{
+	root->gfp_mask &= (__force gfp_t)~(1 << (tag + __GFP_BITS_SHIFT));
+}
+
+static inline void root_tag_clear_all(struct radix_tree_root *root)
+{
+	root->gfp_mask &= __GFP_BITS_MASK;
+}
+
+static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag)
+{
+	return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
+}
+
+/*
+ * Returns 1 if any slot in the node has this tag set.
+ * Otherwise returns 0.
+ */
+static inline int any_tag_set(struct radix_tree_node *node, unsigned int tag)
+{
+	int idx;
+	for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
+		if (node->tags[tag][idx])
+			return 1;
+	}
+	return 0;
+}
 /*
  * This assumes that the caller has performed appropriate preallocation, and
  * that the caller has pinned this thread of control to the current CPU.
···
 {
 	struct radix_tree_node *node =
 		container_of(head, struct radix_tree_node, rcu_head);
+
+	/*
+	 * must only free zeroed nodes into the slab. radix_tree_shrink
+	 * can leave us with a non-NULL entry in the first slot, so clear
+	 * that here to make sure.
+	 */
+	tag_clear(node, 0, 0);
+	tag_clear(node, 1, 0);
+	node->slots[0] = NULL;
+	node->count = 0;
+
 	kmem_cache_free(radix_tree_node_cachep, node);
 }
 
···
 	return ret;
 }
 EXPORT_SYMBOL(radix_tree_preload);
-
-static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
-		int offset)
-{
-	__set_bit(offset, node->tags[tag]);
-}
-
-static inline void tag_clear(struct radix_tree_node *node, unsigned int tag,
-		int offset)
-{
-	__clear_bit(offset, node->tags[tag]);
-}
-
-static inline int tag_get(struct radix_tree_node *node, unsigned int tag,
-		int offset)
-{
-	return test_bit(offset, node->tags[tag]);
-}
-
-static inline void root_tag_set(struct radix_tree_root *root, unsigned int tag)
-{
-	root->gfp_mask |= (__force gfp_t)(1 << (tag + __GFP_BITS_SHIFT));
-}
-
-
-static inline void root_tag_clear(struct radix_tree_root *root, unsigned int tag)
-{
-	root->gfp_mask &= (__force gfp_t)~(1 << (tag + __GFP_BITS_SHIFT));
-}
-
-static inline void root_tag_clear_all(struct radix_tree_root *root)
-{
-	root->gfp_mask &= __GFP_BITS_MASK;
-}
-
-static inline int root_tag_get(struct radix_tree_root *root, unsigned int tag)
-{
-	return (__force unsigned)root->gfp_mask & (1 << (tag + __GFP_BITS_SHIFT));
-}
-
-/*
- * Returns 1 if any slot in the node has this tag set.
- * Otherwise returns 0.
- */
-static inline int any_tag_set(struct radix_tree_node *node, unsigned int tag)
-{
-	int idx;
-	for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
-		if (node->tags[tag][idx])
-			return 1;
-	}
-	return 0;
-}
 
 /*
  * Return the maximum key which can be store into a
···
 		newptr = radix_tree_ptr_to_indirect(newptr);
 		root->rnode = newptr;
 		root->height--;
-		/* must only free zeroed nodes into the slab */
-		tag_clear(to_free, 0, 0);
-		tag_clear(to_free, 1, 0);
-		to_free->slots[0] = NULL;
-		to_free->count = 0;
 		radix_tree_node_free(to_free);
 	}
 }