Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux

[PATCH] drivers/md/dm-raid1.c: Fix inconsistent mirroring after interrupted recovery

dm-mirror has potential data corruption problem: while on-disk log shows
that all disk contents are in-sync, actual contents of the disks are not
synchronized. This problem occurs if initial recovery (synching) is
interrupted and resumed.

Attached patch fixes this problem.

Background:

rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN
(in-sync), which results in the corresponding bit of clean_bits being set.

This is harmful if on-disk log is used and the map is removed/suspended
before the initial sync is completed. The clean_bits is written down to
the on-disk log at the map removal, and, upon resume, it's read and copied
to sync_bits. Since the recovery process refers to the sync_bits to find a
region to be recovered, the region whose state was changed from RH_NOSYNC
to RH_CLEAN is no longer recovered.

If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel
sometimes ago, the contents of the mirrored disk just corrupt silently. If
you have, balanced read may get bogus data from out-of-sync disks.

The patch keeps RH_NOSYNC state unchanged. It will be changed to
RH_RECOVERING when recovery starts and get reclaimed when the recovery
completes. So it doesn't leak the region hash entry.

Description:

Keep RH_NOSYNC state unchanged when I/O on the region completes.

rh_dec() changes the region state from RH_NOSYNC (out-of-sync) to RH_CLEAN
(in-sync), which results in the corresponding bit of clean_bits being set.

This is harmful if on-disk log is used and the map is removed/suspended
before the initial sync is completed. The clean_bits is written down to
the on-disk log at the map removal, and, upon resume, it's read and copied
to sync_bits. Since the recovery process refers to the sync_bits to find a
region to be recovered, the region whose state was changed from RH_NOSYNC
to RH_CLEAN is no longer recovered.

If you haven't applied dm-raid1-read-balancing.patch proposed in dm-devel
sometimes ago, the contents of the mirrored disk just corrupt silently. If
you have, balanced read may get bogus data from out-of-sync disks.

The RH_NOSYNC region will be changed to RH_RECOVERING when recovery starts
on the region and get reclaimed when the recovery completes. So it doesn't
leak the region hash entry.

Alasdair said:

I've analysed the relevant part of the state machine and I believe that
the patch is correct.

(Further work on this code is still needed - this patch has the
side-effect of holding onto memory unnecessarily for long periods of time
under certain workloads - but better that than corrupting data.)

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

authored by

Jun'ichi Nomura and committed by
Linus Torvalds
930d332a 76df1c65

+13 -1
+13 -1
drivers/md/dm-raid1.c
··· 402 402 403 403 spin_lock_irqsave(&rh->region_lock, flags); 404 404 if (atomic_dec_and_test(&reg->pending)) { 405 + /* 406 + * There is no pending I/O for this region. 407 + * We can move the region to corresponding list for next action. 408 + * At this point, the region is not yet connected to any list. 409 + * 410 + * If the state is RH_NOSYNC, the region should be kept off 411 + * from clean list. 412 + * The hash entry for RH_NOSYNC will remain in memory 413 + * until the region is recovered or the map is reloaded. 414 + */ 415 + 416 + /* do nothing for RH_NOSYNC */ 405 417 if (reg->state == RH_RECOVERING) { 406 418 list_add_tail(&reg->list, &rh->quiesced_regions); 407 - } else { 419 + } else if (reg->state == RH_DIRTY) { 408 420 reg->state = RH_CLEAN; 409 421 list_add(&reg->list, &rh->clean_regions); 410 422 }