md: restart recovery cleanly after device failure.

Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

kernel os linux

When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not be be checkpointed.
- We remove spares and then re-added them which loses important state
information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.

Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

authored by

NeilBrown and committed by

Linus Torvalds 18 years ago dfc70645 90b08710

+44 -19

6 changed files

expand all

drivers

md.c

multipath.c

raid1.c

raid10.c

raid5.c

include

linux

raid

md_k.h

+11 -11

drivers/md/md.c

··· 5434 5434 atomic_sub(blocks, &mddev->recovery_active); 5435 5435 wake_up(&mddev->recovery_wait); 5436 5436 if (!ok) { 5437 - set_bit(MD_RECOVERY_ERR, &mddev->recovery); 5437 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 5438 5438 md_wakeup_thread(mddev->thread); 5439 5439 // stop recovery, signal do_sync .... 5440 5440 } ··· 5690 5690 sectors = mddev->pers->sync_request(mddev, j, &skipped, 5691 5691 currspeed < speed_min(mddev)); 5692 5692 if (sectors == 0) { 5693 - set_bit(MD_RECOVERY_ERR, &mddev->recovery); 5693 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 5694 5694 goto out; 5695 5695 } 5696 5696 ··· 5713 5713 5714 5714 last_check = io_sectors; 5715 5715 5716 - if (test_bit(MD_RECOVERY_INTR, &mddev->recovery) || 5717 - test_bit(MD_RECOVERY_ERR, &mddev->recovery)) 5716 + if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 5718 5717 break; 5719 5718 5720 5719 repeat: ··· 5767 5768 /* tell personality that we are finished */ 5768 5769 mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); 5769 5770 5770 - if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) && 5771 - !test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 5771 + if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 5772 5772 mddev->curr_resync > 2) { 5773 5773 if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 5774 5774 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { ··· 5836 5838 } 5837 5839 5838 5840 if (mddev->degraded) { 5839 - rdev_for_each(rdev, rtmp, mddev) 5841 + rdev_for_each(rdev, rtmp, mddev) { 5842 + if (rdev->raid_disk >= 0 && 5843 + !test_bit(In_sync, &rdev->flags)) 5844 + spares++; 5840 5845 if (rdev->raid_disk < 0 5841 5846 && !test_bit(Faulty, &rdev->flags)) { 5842 5847 rdev->recovery_offset = 0; ··· 5857 5856 } else 5858 5857 break; 5859 5858 } 5859 + } 5860 5860 } 5861 5861 return spares; 5862 5862 } ··· 5871 5869 * to do that as needed. 5872 5870 * When it is determined that resync is needed, we set MD_RECOVERY_RUNNING in 5873 5871 * "->recovery" and create a thread at ->sync_thread. 5874 - * When the thread finishes it sets MD_RECOVERY_DONE (and might set MD_RECOVERY_ERR) 5872 + * When the thread finishes it sets MD_RECOVERY_DONE 5875 5873 * and wakeups up this thread which will reap the thread and finish up. 5876 5874 * This thread also removes any faulty devices (with nr_pending == 0). 5877 5875 * ··· 5946 5944 /* resync has finished, collect result */ 5947 5945 md_unregister_thread(mddev->sync_thread); 5948 5946 mddev->sync_thread = NULL; 5949 - if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) && 5950 - !test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { 5947 + if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery)) { 5951 5948 /* success...*/ 5952 5949 /* activate any spares */ 5953 5950 mddev->pers->spare_active(mddev); ··· 5970 5969 * might be left set 5971 5970 */ 5972 5971 clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 5973 - clear_bit(MD_RECOVERY_ERR, &mddev->recovery); 5974 5972 clear_bit(MD_RECOVERY_INTR, &mddev->recovery); 5975 5973 clear_bit(MD_RECOVERY_DONE, &mddev->recovery); 5976 5974

+2 -1

drivers/md/multipath.c

··· 327 327 if (rdev) { 328 328 if (test_bit(In_sync, &rdev->flags) || 329 329 atomic_read(&rdev->nr_pending)) { 330 - printk(KERN_ERR "hot-remove-disk, slot %d is identified" " but is still operational!\n", number); 330 + printk(KERN_ERR "hot-remove-disk, slot %d is identified" 331 + " but is still operational!\n", number); 331 332 err = -EBUSY; 332 333 goto abort; 333 334 }

+9 -1

drivers/md/raid1.c

··· 1027 1027 /* 1028 1028 * if recovery is running, make sure it aborts. 1029 1029 */ 1030 - set_bit(MD_RECOVERY_ERR, &mddev->recovery); 1030 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 1031 1031 } else 1032 1032 set_bit(Faulty, &rdev->flags); 1033 1033 set_bit(MD_CHANGE_DEVS, &mddev->flags); ··· 1145 1145 if (rdev) { 1146 1146 if (test_bit(In_sync, &rdev->flags) || 1147 1147 atomic_read(&rdev->nr_pending)) { 1148 + err = -EBUSY; 1149 + goto abort; 1150 + } 1151 + /* Only remove non-faulty devices is recovery 1152 + * is not possible. 1153 + */ 1154 + if (!test_bit(Faulty, &rdev->flags) && 1155 + mddev->degraded < conf->raid_disks) { 1148 1156 err = -EBUSY; 1149 1157 goto abort; 1150 1158 }

+12 -2

drivers/md/raid10.c

··· 1020 1020 /* 1021 1021 * if recovery is running, make sure it aborts. 1022 1022 */ 1023 - set_bit(MD_RECOVERY_ERR, &mddev->recovery); 1023 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 1024 1024 } 1025 1025 set_bit(Faulty, &rdev->flags); 1026 1026 set_bit(MD_CHANGE_DEVS, &mddev->flags); ··· 1171 1171 err = -EBUSY; 1172 1172 goto abort; 1173 1173 } 1174 + /* Only remove faulty devices in recovery 1175 + * is not possible. 1176 + */ 1177 + if (!test_bit(Faulty, &rdev->flags) && 1178 + enough(conf)) { 1179 + err = -EBUSY; 1180 + goto abort; 1181 + } 1174 1182 p->rdev = NULL; 1175 1183 synchronize_rcu(); 1176 1184 if (atomic_read(&rdev->nr_pending)) { ··· 1245 1237 1246 1238 if (!uptodate) 1247 1239 md_error(mddev, conf->mirrors[d].rdev); 1240 + 1248 1241 update_head_pos(i, r10_bio); 1249 1242 1250 1243 while (atomic_dec_and_test(&r10_bio->remaining)) { ··· 1853 1844 if (rb2) 1854 1845 atomic_dec(&rb2->remaining); 1855 1846 r10_bio = rb2; 1856 - if (!test_and_set_bit(MD_RECOVERY_ERR, &mddev->recovery)) 1847 + if (!test_and_set_bit(MD_RECOVERY_INTR, 1848 + &mddev->recovery)) 1857 1849 printk(KERN_INFO "raid10: %s: insufficient working devices for recovery.\n", 1858 1850 mdname(mddev)); 1859 1851 break;

+9 -1

drivers/md/raid5.c

··· 1268 1268 /* 1269 1269 * if recovery was running, make sure it aborts. 1270 1270 */ 1271 - set_bit(MD_RECOVERY_ERR, &mddev->recovery); 1271 + set_bit(MD_RECOVERY_INTR, &mddev->recovery); 1272 1272 } 1273 1273 set_bit(Faulty, &rdev->flags); 1274 1274 printk (KERN_ALERT ··· 4571 4571 if (rdev) { 4572 4572 if (test_bit(In_sync, &rdev->flags) || 4573 4573 atomic_read(&rdev->nr_pending)) { 4574 + err = -EBUSY; 4575 + goto abort; 4576 + } 4577 + /* Only remove non-faulty devices if recovery 4578 + * isn't possible. 4579 + */ 4580 + if (!test_bit(Faulty, &rdev->flags) && 4581 + mddev->degraded <= conf->max_degraded) { 4574 4582 err = -EBUSY; 4575 4583 goto abort; 4576 4584 }

+1 -3

include/linux/raid/md_k.h

··· 188 188 * NEEDED: we might need to start a resync/recover 189 189 * RUNNING: a thread is running, or about to be started 190 190 * SYNC: actually doing a resync, not a recovery 191 - * ERR: and IO error was detected - abort the resync/recovery 192 - * INTR: someone requested a (clean) early abort. 191 + * INTR: resync needs to be aborted for some reason 193 192 * DONE: thread is done and is waiting to be reaped 194 193 * REQUEST: user-space has requested a sync (used with SYNC) 195 194 * CHECK: user-space request for for check-only, no repair ··· 198 199 */ 199 200 #define MD_RECOVERY_RUNNING 0 200 201 #define MD_RECOVERY_SYNC 1 201 - #define MD_RECOVERY_ERR 2 202 202 #define MD_RECOVERY_INTR 3 203 203 #define MD_RECOVERY_DONE 4 204 204 #define MD_RECOVERY_NEEDED 5