Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

ocfs2/dlm: Clear joining_node on heartbeat node down

Currently the dlm join process has two steps: query join and assert join.
After query join, the joined node sets its joining_node. If the joining node
panics before the second step, the joined node fails to clear joining_node,
because the node-down handling returns before the cleanup when that node
isn't in the domain map. This causes at least two problems:
1. All new join requests fail, so no new node can mount the volume.
2. The joined node can't umount the volume, because umount has to wait for
   joining_node to become unknown, so the umount hangs.

The solution is to clear the joining_node before we check the domain map.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

Authored by Tao Ma and committed by Mark Fasheh
2d4b1cbb 4092d49f

+6 -6
fs/ocfs2/dlm/dlmrecovery.c
···
 		}
 	}
 
+	/* Clean up join state on node death. */
+	if (dlm->joining_node == idx) {
+		mlog(0, "Clearing join state for node %u\n", idx);
+		__dlm_set_joining_node(dlm, DLM_LOCK_RES_OWNER_UNKNOWN);
+	}
+
 	/* check to see if the node is already considered dead */
 	if (!test_bit(idx, dlm->live_nodes_map)) {
 		mlog(0, "for domain %s, node %d is already dead. "
···
 	}
 
 	clear_bit(idx, dlm->live_nodes_map);
-
-	/* Clean up join state on node death. */
-	if (dlm->joining_node == idx) {
-		mlog(0, "Clearing join state for node %u\n", idx);
-		__dlm_set_joining_node(dlm, DLM_LOCK_RES_OWNER_UNKNOWN);
-	}
 
 	/* make sure local cleanup occurs before the heartbeat events */
 	if (!test_bit(idx, dlm->recovery_map))