Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

MD RAID10: Improve redundancy for 'far' and 'offset' algorithms (part 1)

The MD RAID10 'far' and 'offset' algorithms make copies of entire stripe
widths, copying them to a different location on the same devices after
shifting the stripe. An example layout of each follows:

"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
F A B C D E --> Copy of stripe0, but shifted by 1
L G H I J K
...

"offset" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
F A B C D E --> Copy of stripe0, but shifted by 1
G H I J K L
L G H I J K
...

Redundancy for these algorithms is gained by shifting the copied stripes
one device to the right. This patch proposes that the array be divided into
sets of adjacent devices and that, when the stripe copies are shifted, they
wrap on set boundaries rather than on the array-size boundary. That is, for
the purposes of shifting, the copies are confined to their sets within the
array. The sets are 'near_copies * far_copies' in size.
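The set-confined shift described above can be sketched in user-space C. The helper name `far_copy_dev` is hypothetical (not from the patch); it mirrors the arithmetic the patch introduces, with devices numbered from 0:

```c
#include <assert.h>

/*
 * Hypothetical user-space sketch of the set-confined shift: the far
 * copy of the chunk on device 'd' moves 'near_copies' devices to the
 * right, wrapping within its own set of 'far_set_size' devices rather
 * than around the whole array.  Passing far_set_size == raid_disks
 * reproduces the original (whole-array) wrapping.
 */
static int far_copy_dev(int d, int near_copies, int far_set_size)
{
	int set = d / far_set_size;             /* which set 'd' belongs to */
	d = (d + near_copies) % far_set_size;   /* shift, wrapping in-set   */
	return d + set * far_set_size;          /* back to array numbering  */
}
```

For the 6-device example with near_copies = 1 and 2-device sets, the copy of the chunk on device 0 lands on device 1, and the copy on device 5 wraps back to device 4 within its set, matching the diagram below.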

The above "far" algorithm example would change to:
"far" algorithm
dev1 dev2 dev3 dev4 dev5 dev6
==== ==== ==== ==== ==== ====
A B C D E F
G H I J K L
...
B A D C F E --> Copy of stripe0, shifted 1, 2-dev sets
H G J I L K Dev sets are 1-2, 3-4, 5-6
...

This has the effect of improving the redundancy of the array. We can
always sustain at least one failure, but sometimes more than one can
be handled. In the first examples, the pairs of devices that CANNOT fail
together are:
(1,2) (2,3) (3,4) (4,5) (5,6) (1,6) [40% of possible pairs]
In the example where the copies are confined to sets, the pairs of
devices that cannot fail together are:
(1,2) (3,4) (5,6) [20% of possible pairs]
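The 40% and 20% figures can be checked by enumerating, for each device, the device holding its far copy. A sketch assuming near_copies = 1 and far_copies = 2 (one far copy per chunk, shifted one device right); `count_unsafe_pairs` is a hypothetical helper, not from the patch:

```c
#include <assert.h>

/* Count the device pairs that hold two copies of some chunk and
 * therefore cannot fail together.  far_set_size == disks reproduces
 * the original layout; smaller values confine the shift to sets. */
static int count_unsafe_pairs(int disks, int far_set_size)
{
	int unsafe[8][8] = { { 0 } };
	int count = 0;

	for (int d = 0; d < disks; d++) {
		int set = d / far_set_size;
		int copy = (d + 1) % far_set_size + set * far_set_size;
		int lo = d < copy ? d : copy;
		int hi = d < copy ? copy : d;
		unsafe[lo][hi] = 1;   /* losing both lo and hi loses a chunk */
	}
	for (int a = 0; a < disks; a++)
		for (int b = a + 1; b < disks; b++)
			count += unsafe[a][b];
	return count;
}
```

With 6 devices there are 15 possible pairs: whole-array wrapping gives 6 unsafe pairs (40%), 2-device sets give 3 (20%).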

We cannot simply replace the old algorithms, so the 17th bit of the 'layout'
variable is used to indicate whether we use the old or new method of computing
the shift. (This is similar to the way the 16th bit indicates whether the
"far" algorithm or the "offset" algorithm is being used.)
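The bit packing of the 'layout' variable described above can be sketched as a few user-space accessors (hypothetical helper names, following the fields the commit message and setup_geo() describe):

```c
#include <assert.h>

/* Sketch of the RAID10 'layout' bit packing: near_copies in the low
 * byte, far_copies in the second byte, bit 16 selects the "offset"
 * variant, bit 17 selects set-confined shifting ("use_far_sets"). */
static int layout_near_copies(int layout)  { return layout & 0xff; }
static int layout_far_copies(int layout)   { return (layout >> 8) & 0xff; }
static int layout_far_offset(int layout)   { return (layout >> 16) & 1; }
static int layout_use_far_sets(int layout) { return (layout >> 17) & 1; }
```

For example, a "far" layout with one near copy, two far copies, and set-confined shifting would be (1 << 17) | (2 << 8) | 1.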

This patch only handles the cases where the number of total raid disks is
a multiple of 'far_copies'. A follow-on patch addresses the condition where
this is not true.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>


drivers/md/raid10.c
···
  * near_copies (stored in low byte of layout)
  * far_copies (stored in second byte of layout)
  * far_offset (stored in bit 16 of layout )
+ * use_far_sets (stored in bit 17 of layout )
  *
- * The data to be stored is divided into chunks using chunksize.
- * Each device is divided into far_copies sections.
- * In each section, chunks are laid out in a style similar to raid0, but
- * near_copies copies of each chunk is stored (each on a different drive).
- * The starting device for each section is offset near_copies from the starting
- * device of the previous section.
- * Thus they are (near_copies*far_copies) of each chunk, and each is on a different
- * drive.
- * near_copies and far_copies must be at least one, and their product is at most
- * raid_disks.
+ * The data to be stored is divided into chunks using chunksize.  Each device
+ * is divided into far_copies sections.  In each section, chunks are laid out
+ * in a style similar to raid0, but near_copies copies of each chunk is stored
+ * (each on a different drive).  The starting device for each section is offset
+ * near_copies from the starting device of the previous section.  Thus there
+ * are (near_copies * far_copies) of each chunk, and each is on a different
+ * drive.  near_copies and far_copies must be at least one, and their product
+ * is at most raid_disks.
  *
  * If far_offset is true, then the far_copies are handled a bit differently.
- * The copies are still in different stripes, but instead of be very far apart
- * on disk, there are adjacent stripes.
+ * The copies are still in different stripes, but instead of being very far
+ * apart on disk, there are adjacent stripes.
+ *
+ * The far and offset algorithms are handled slightly differently if
+ * 'use_far_sets' is true.  In this case, the array's devices are grouped into
+ * sets that are (near_copies * far_copies) in size.  The far copied stripes
+ * are still shifted by 'near_copies' devices, but this shifting stays confined
+ * to the set rather than the entire array.  This is done to improve the number
+ * of device combinations that can fail without causing the array to fail.
+ * Example 'far' algorithm w/o 'use_far_sets' (each letter represents a chunk
+ * on a device):
+ *    A B C D    A B C D E
+ *      ...         ...
+ *    D A B C    E A B C D
+ * Example 'far' algorithm w/ 'use_far_sets' enabled (sets illustrated w/ []'s):
+ *    [A B] [C D]    [A B] [C D E]
+ *    |...| |...|    |...| | ... |
+ *    [B A] [D C]    [B A] [E C D]
  */
···
 	/* and calculate all the others */
 	for (n = 0; n < geo->near_copies; n++) {
 		int d = dev;
+		int set;
 		sector_t s = sector;
 		r10bio->devs[slot].devnum = d;
 		r10bio->devs[slot].addr = s;
 		slot++;

 		for (f = 1; f < geo->far_copies; f++) {
+			set = d / geo->far_set_size;
 			d += geo->near_copies;
-			d %= geo->raid_disks;
+			d %= geo->far_set_size;
+			d += geo->far_set_size * set;
+
 			s += geo->stride;
 			r10bio->devs[slot].devnum = d;
 			r10bio->devs[slot].addr = s;
···
 	 * or recovery, so reshape isn't happening
 	 */
 	struct geom *geo = &conf->geo;
+	int far_set_start = (dev / geo->far_set_size) * geo->far_set_size;
+	int far_set_size = geo->far_set_size;

 	offset = sector & geo->chunk_mask;
 	if (geo->far_offset) {
···
 		chunk = sector >> geo->chunk_shift;
 		fc = sector_div(chunk, geo->far_copies);
 		dev -= fc * geo->near_copies;
-		if (dev < 0)
-			dev += geo->raid_disks;
+		if (dev < far_set_start)
+			dev += far_set_size;
 	} else {
 		while (sector >= geo->stride) {
 			sector -= geo->stride;
-			if (dev < geo->near_copies)
-				dev += geo->raid_disks - geo->near_copies;
+			if (dev < (geo->near_copies + far_set_start))
+				dev += far_set_size - geo->near_copies;
 			else
 				dev -= geo->near_copies;
 		}
···
 		disks = mddev->raid_disks + mddev->delta_disks;
 		break;
 	}
-	if (layout >> 17)
+	if (layout >> 18)
 		return -1;
 	if (chunk < (PAGE_SIZE >> 9) ||
 	    !is_power_of_2(chunk))
···
 	geo->near_copies = nc;
 	geo->far_copies = fc;
 	geo->far_offset = fo;
+	geo->far_set_size = (layout & (1<<17)) ? disks / fc : disks;
 	geo->chunk_mask = chunk - 1;
 	geo->chunk_shift = ffz(~chunk);
 	return nc*fc;
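The forward shift (as in __raid10_find_phys) and the reverse adjustment (as in raid10_find_virt) should be inverses of each other. A user-space sketch of that round trip, assuming near_copies = 1 and 0 <= f < far_set_size for the far-copy index; the helper names are hypothetical:

```c
#include <assert.h>

/* Forward: device holding far copy 'f' of the chunk on device 'd',
 * shifting one device per copy, confined to the set. */
static int phys_dev(int d, int f, int far_set_size)
{
	int set = d / far_set_size;
	return (d + f) % far_set_size + set * far_set_size;
}

/* Reverse: given the device of far copy 'f', recover the original
 * device, mirroring the far_set_start adjustment in the patch. */
static int virt_dev(int d, int f, int far_set_size)
{
	int far_set_start = (d / far_set_size) * far_set_size;

	d -= f;                   /* undo the per-copy shift       */
	if (d < far_set_start)
		d += far_set_size;    /* wrap back into the same set   */
	return d;
}
```

For every device in a 6-device array with 2-device sets, virt_dev(phys_dev(d, 1, 2), 1, 2) returns d again, and far_set_size = 6 reproduces the original whole-array behaviour.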
drivers/md/raid10.h
···
 					 * far_offset, in which case it is
 					 * 1 stripe.
 					 */
+	int far_set_size;         /* The number of devices in a set,
+				   * where a 'set' are devices that
+				   * contain far/offset copies of
+				   * each other.
+				   */
 	int chunk_shift; /* shift from chunks to sectors */
 	sector_t chunk_mask;
 } prev, geo;