nixos/tests/consul: stop consul cleanly

This should fix the flakiness of the test.

Forcefully killing the consul process can lead to
a broken `/var/lib/consul/node-id` file, which
will prevent consul from starting on that node again.
See https://github.com/hashicorp/consul/issues/3489

So instead of crashing the whole node, which leads to
this corruption from time to time, we kill the
networking instead, preventing any cluster
communication and then cleanly stop consul.
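
The stop/start ordering described above can be sketched as follows. This is
only an illustration, not the test code itself: `FakeMachine`,
`stop_consul_cleanly` and `bring_consul_back` are hypothetical names, and the
fake machine merely records calls instead of driving a real VM. It assumes a
machine object exposing `block()`, `unblock()` and `systemctl()`, as used in
the diff.

```python
class FakeMachine:
    """Hypothetical stand-in for a test machine; it only records calls."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def block(self):
        self.calls.append("block")

    def unblock(self):
        self.calls.append("unblock")

    def systemctl(self, args):
        self.calls.append(f"systemctl {args}")


def stop_consul_cleanly(server):
    # Cut networking first, preventing any cluster communication ...
    server.block()
    # ... then stop consul through systemd instead of crashing the node.
    server.systemctl("stop consul")


def bring_consul_back(server):
    # Restore networking, then start the service again.
    server.unblock()
    server.systemctl("start consul")


server = FakeMachine("server1")
stop_consul_cleanly(server)
bring_consul_back(server)
print(server.calls)
```

The point of the ordering is that consul is never killed forcefully, so it
gets the chance to shut down on its own while the node is already isolated.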

+20 -10
nixos/tests/consul.nix
@@ -145,7 +145,7 @@
     client2.succeed("[ $(consul kv get testkey) == 42 ]")
 
 
-    def rolling_reboot_test(proper_rolling_procedure=True):
+    def rolling_restart_test(proper_rolling_procedure=True):
         """
         Tests that the cluster can tolearate failures of any single server,
         following the recommended rolling upgrade procedure from
@@ -158,7 +158,13 @@
         """
 
         for server in servers:
-            server.crash()
+            server.block()
+            server.systemctl("stop consul")
+
+            # Make sure the stopped peer is recognized as being down
+            client1.wait_until_succeeds(
+                f"[ $(consul members | grep {server.name} | grep -o -E 'failed|left' | wc -l) == 1 ]"
+            )
 
             # For each client, wait until they have connection again
             # using `kv get -recurse` before issuing commands.
@@ -170,8 +176,8 @@
             client2.succeed("[ $(consul kv get testkey) == 43 ]")
             client2.succeed("consul kv delete testkey")
 
-            # Restart crashed machine.
-            server.start()
+            server.unblock()
+            server.systemctl("start consul")
 
             if proper_rolling_procedure:
                 # Wait for recovery.
@@ -197,10 +203,14 @@
         """
 
         for server in servers:
-            server.crash()
+            server.block()
+            server.systemctl("stop --no-block consul")
 
         for server in servers:
-            server.start()
+            # --no-block is async, so ensure it has been stopped by now
+            server.wait_until_fails("systemctl is-active --quiet consul")
+            server.unblock()
+            server.systemctl("start consul")
 
         # Wait for recovery.
         wait_for_healthy_servers()
@@ -217,13 +227,13 @@
 
     # Run the tests.
 
-    print("rolling_reboot_test()")
-    rolling_reboot_test()
+    print("rolling_restart_test()")
+    rolling_restart_test()
 
     print("all_servers_crash_simultaneously_test()")
     all_servers_crash_simultaneously_test()
 
-    print("rolling_reboot_test(proper_rolling_procedure=False)")
-    rolling_reboot_test(proper_rolling_procedure=False)
+    print("rolling_restart_test(proper_rolling_procedure=False)")
+    rolling_restart_test(proper_rolling_procedure=False)
   '';
 })