Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'md/4.1' of git://neil.brown.name/md

Pull md updates from Neil Brown:
"More updates that usual this time. A few have performance impacts
which hould mostly be positive, but RAID5 (in particular) can be very
work-load ensitive... We'll have to wait and see.

Highlights:

- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if
used. However it is looking good and mostly done, and having it in
mainline will help co-ordinate development.

- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.

- RAID6 can now perform read-modify-write cycles which should help
performance on larger arrays: 6 or more devices.

- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.

- Resync is now allowed to go a little faster than the 'minimum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
md/raid5: don't do chunk aligned read on degraded array.
md/raid5: allow the stripe_cache to grow and shrink.
md/raid5: change ->inactive_blocked to a bit-flag.
md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
md/raid5: pass gfp_t arg to grow_one_stripe()
md/raid5: introduce configuration option rmw_level
md/raid5: activate raid6 rmw feature
md/raid6 algorithms: xor_syndrome() for SSE2
md/raid6 algorithms: xor_syndrome() for generic int
md/raid6 algorithms: improve test program
md/raid6 algorithms: delta syndrome functions
raid5: handle expansion/resync case with stripe batching
raid5: handle io error of batch list
RAID5: batch adjacent full stripe write
raid5: track overwrite disk count
raid5: add a new flag to track if a stripe can be batched
raid5: use flex_array for scribble data
md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
md: allow resync to go faster when there is competing IO.
md: remove 'go_faster' option from ->sync_request()
...

+2858 -303
+176
Documentation/md-cluster.txt
The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                     12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits  |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]    |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits   |
| bm bits [3, contd]  |                     |                      |

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
joins the cluster, it acquires the lock in PW mode and it stays so
during the lifetime the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.

3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and for metadata superblock updates.

3.1 Message Types

There are three types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.2 Communication mechanism

The DLM LVB is used to communicate between the nodes of the cluster. There
are three resources used for the purpose:

3.2.1 Token: The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource, acquiring which means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to communicate.

The algorithm is:

1. receive status

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE

   sender                  receiver                receiver
   TOKEN:EX                ACK:CR                  ACK:CR
   MESSAGE:EX
   ACK:CR

   Sender checks that it still needs to send a message. Messages received
   or other events that happened while waiting for the TOKEN may have made
   this message inappropriate or redundant.

3. sender write LVB.
   sender down-convert MESSAGE from EX to CR
   sender try to get EX of ACK
   [ wait until all receivers have *processed* the MESSAGE ]

   [ triggered by bast of ACK ]
   receiver get CR of MESSAGE
   receiver read LVB
   receiver processes the message
   [ wait finish ]
   receiver release ACK

   sender                  receiver                receiver
   TOKEN:EX                MESSAGE:CR              MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. triggered by grant of EX on ACK (indicating all receivers have processed
   the message)
   sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
   receiver upconvert to EX of MESSAGE
   receiver get CR of ACK
   receiver release MESSAGE

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR


4. Handling Failures

4.1 Node Failure
When a node fails, the DLM informs the cluster with the slot number. A
surviving node starts a cluster recovery thread. The cluster recovery thread:
 - acquires the bitmap<number> lock of the failed node
 - opens the bitmap
 - reads the bitmap of the failed node
 - copies the set bits to the local node
 - cleans the bitmap of the failed node
 - releases bitmap<number> lock of the failed node
 - initiates resync of the bitmap on the current node

The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing resync finishes, it sends RESYNC_FINISHED
to other nodes and other nodes remove the corresponding entry from
the suspend_list.

A helper function, should_suspend(), can be used to check if a particular
I/O range should be suspended or not.

4.2 Device Failure
Device failures are handled and communicated with the metadata update
routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
 2. Node 1 sends NEWDISK with uuid and slot number
 3. Other nodes issue kobject_uevent_env with uuid and slot number
    (Steps 4,5 could be a udev rule)
 4. In userspace, the node searches for the disk, perhaps
    using blkid -t SUB_UUID=""
 5. Other nodes issue either of the following depending on whether the disk
    was found:
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
          disc.number set to slot number)
    ioctl(CLUSTERED_DISK_NACK)
 6. Other nodes drop lock on no-new-devs (CR) if device is found
 7. Node 1 attempts EX lock on no-new-devs
 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
    as SpareLocal
 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
 10. Other nodes get the information whether a disk is added or not
     by the following METADATA_UPDATED.
+16 -3
crypto/async_tx/async_pq.c
···
 {
 	void **srcs;
 	int i;
+	int start = -1, stop = disks - 3;
 
 	if (submit->scribble)
 		srcs = submit->scribble;
···
 		if (blocks[i] == NULL) {
 			BUG_ON(i > disks - 3); /* P or Q can't be zero */
 			srcs[i] = (void*)raid6_empty_zero_page;
-		} else
+		} else {
 			srcs[i] = page_address(blocks[i]) + offset;
+			if (i < disks - 2) {
+				stop = i;
+				if (start == -1)
+					start = i;
+			}
+		}
 	}
-	raid6_call.gen_syndrome(disks, len, srcs);
+	if (submit->flags & ASYNC_TX_PQ_XOR_DST) {
+		BUG_ON(!raid6_call.xor_syndrome);
+		if (start >= 0)
+			raid6_call.xor_syndrome(disks, start, stop, len, srcs);
+	} else
+		raid6_call.gen_syndrome(disks, len, srcs);
 	async_tx_sync_epilog(submit);
 }
···
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO);
 
-	if (unmap &&
+	/* XORing P/Q is only implemented in software */
+	if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
 	    (src_cnt <= dma_maxpq(device, 0) ||
 	     dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
 	    is_dma_pq_aligned(device, offset, 0, len)) {
+16
drivers/md/Kconfig
···
 
 	  In unsure, say N.
 
+
+config MD_CLUSTER
+	tristate "Cluster Support for MD (EXPERIMENTAL)"
+	depends on BLK_DEV_MD
+	depends on DLM
+	default n
+	---help---
+	  Clustering support for MD devices. This enables locking and
+	  synchronization across multiple systems on the cluster, so all
+	  nodes in the cluster can access the MD devices simultaneously.
+
+	  This brings the redundancy (and uptime) of RAID levels across the
+	  nodes of the cluster.
+
+	  If unsure, say N.
+
 source "drivers/md/bcache/Kconfig"
 
 config BLK_DEV_DM_BUILTIN
+1
drivers/md/Makefile
···
 obj-$(CONFIG_MD_RAID456)	+= raid456.o
 obj-$(CONFIG_MD_MULTIPATH)	+= multipath.o
 obj-$(CONFIG_MD_FAULTY)		+= faulty.o
+obj-$(CONFIG_MD_CLUSTER)	+= md-cluster.o
 obj-$(CONFIG_BCACHE)		+= bcache/
 obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
+167 -22
drivers/md/bitmap.c
···
 	struct block_device *bdev;
 	struct mddev *mddev = bitmap->mddev;
 	struct bitmap_storage *store = &bitmap->storage;
+	int node_offset = 0;
+
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * store->file_pages;
 
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 		int size = PAGE_SIZE;
···
 	/* This might have been changed by a reshape */
 	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
 	sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize);
+	sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes);
 	sb->sectors_reserved = cpu_to_le32(bitmap->mddev->
 					   bitmap_info.space);
 	kunmap_atomic(sb);
···
 	bitmap_super_t *sb;
 	unsigned long chunksize, daemon_sleep, write_behind;
 	unsigned long long events;
+	int nodes = 0;
 	unsigned long sectors_reserved = 0;
 	int err = -EINVAL;
 	struct page *sb_page;
···
 		return -ENOMEM;
 	bitmap->storage.sb_page = sb_page;
 
+re_read:
+	/* If cluster_slot is set, the cluster is setup */
+	if (bitmap->cluster_slot >= 0) {
+		sector_t bm_blocks = bitmap->mddev->resync_max_sectors;
+
+		sector_div(bm_blocks,
+			   bitmap->mddev->bitmap_info.chunksize >> 9);
+		/* bits to bytes */
+		bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t);
+		/* to 4k blocks */
+		bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096);
+		bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3);
+		pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__,
+			bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset);
+	}
+
 	if (bitmap->storage.file) {
 		loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
 		int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
···
 	if (err)
 		return err;
 
+	err = -EINVAL;
 	sb = kmap_atomic(sb_page);
 
 	chunksize = le32_to_cpu(sb->chunksize);
 	daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ;
 	write_behind = le32_to_cpu(sb->write_behind);
 	sectors_reserved = le32_to_cpu(sb->sectors_reserved);
+	nodes = le32_to_cpu(sb->nodes);
+	strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
 
 	/* verify that the bitmap-specific fields are valid */
 	if (sb->magic != cpu_to_le32(BITMAP_MAGIC))
···
 		goto out;
 	}
 	events = le64_to_cpu(sb->events);
-	if (events < bitmap->mddev->events) {
+	if (!nodes && (events < bitmap->mddev->events)) {
 		printk(KERN_INFO
 		       "%s: bitmap file is out of date (%llu < %llu) "
 		       "-- forcing full recovery\n",
···
 	if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN)
 		set_bit(BITMAP_HOSTENDIAN, &bitmap->flags);
 	bitmap->events_cleared = le64_to_cpu(sb->events_cleared);
+	strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
 	err = 0;
+
 out:
 	kunmap_atomic(sb);
+	/* Assiging chunksize is required for "re_read" */
+	bitmap->mddev->bitmap_info.chunksize = chunksize;
+	if (nodes && (bitmap->cluster_slot < 0)) {
+		err = md_setup_cluster(bitmap->mddev, nodes);
+		if (err) {
+			pr_err("%s: Could not setup cluster service (%d)\n",
+					bmname(bitmap), err);
+			goto out_no_sb;
+		}
+		bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev);
+		goto re_read;
+	}
+
+
 out_no_sb:
 	if (test_bit(BITMAP_STALE, &bitmap->flags))
 		bitmap->events_cleared = bitmap->mddev->events;
 	bitmap->mddev->bitmap_info.chunksize = chunksize;
 	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
 	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
+	bitmap->mddev->bitmap_info.nodes = nodes;
 	if (bitmap->mddev->bitmap_info.space == 0 ||
 	    bitmap->mddev->bitmap_info.space > sectors_reserved)
 		bitmap->mddev->bitmap_info.space = sectors_reserved;
-	if (err)
+	if (err) {
 		bitmap_print_sb(bitmap);
+		if (bitmap->cluster_slot < 0)
+			md_cluster_stop(bitmap->mddev);
+	}
 	return err;
 }
···
 }
 
 static int bitmap_storage_alloc(struct bitmap_storage *store,
-				unsigned long chunks, int with_super)
+				unsigned long chunks, int with_super,
+				int slot_number)
 {
-	int pnum;
+	int pnum, offset = 0;
 	unsigned long num_pages;
 	unsigned long bytes;
···
 		bytes += sizeof(bitmap_super_t);
 
 	num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE);
+	offset = slot_number * (num_pages - 1);
 
 	store->filemap = kmalloc(sizeof(struct page *)
 				 * num_pages, GFP_KERNEL);
···
 		store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (store->sb_page == NULL)
 			return -ENOMEM;
-		store->sb_page->index = 0;
 	}
+
 	pnum = 0;
 	if (store->sb_page) {
 		store->filemap[0] = store->sb_page;
 		pnum = 1;
+		store->sb_page->index = offset;
 	}
+
 	for ( ; pnum < num_pages; pnum++) {
 		store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!store->filemap[pnum]) {
 			store->file_pages = pnum;
 			return -ENOMEM;
 		}
-		store->filemap[pnum]->index = pnum;
+		store->filemap[pnum]->index = pnum + offset;
 	}
 	store->file_pages = pnum;
···
 	}
 }
 
+static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
+{
+	unsigned long bit;
+	struct page *page;
+	void *paddr;
+	unsigned long chunk = block >> bitmap->counts.chunkshift;
+	int set = 0;
+
+	page = filemap_get_page(&bitmap->storage, chunk);
+	if (!page)
+		return -EINVAL;
+	bit = file_page_offset(&bitmap->storage, chunk);
+	paddr = kmap_atomic(page);
+	if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags))
+		set = test_bit(bit, paddr);
+	else
+		set = test_bit_le(bit, paddr);
+	kunmap_atomic(paddr);
+	return set;
+}
+
+
 /* this gets called when the md device is ready to unplug its underlying
  * (slave) device queues -- before we let any writes go down, we need to
  * sync the dirty pages of the bitmap file to disk */
···
  */
 static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 {
-	unsigned long i, chunks, index, oldindex, bit;
+	unsigned long i, chunks, index, oldindex, bit, node_offset = 0;
 	struct page *page = NULL;
 	unsigned long bit_cnt = 0;
 	struct file *file;
···
 	if (!bitmap->mddev->bitmap_info.external)
 		offset = sizeof(bitmap_super_t);
 
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
+
 	for (i = 0; i < chunks; i++) {
 		int b;
 		index = file_page_index(&bitmap->storage, i);
···
 					bitmap->mddev,
 					bitmap->mddev->bitmap_info.offset,
 					page,
-					index, count);
+					index + node_offset, count);
 
 				if (ret)
 					goto err;
···
 		     j < bitmap->storage.file_pages
 			     && !test_bit(BITMAP_STALE, &bitmap->flags);
 		     j++) {
-
 			if (test_page_attr(bitmap, j,
 					   BITMAP_PAGE_DIRTY))
 				/* bitmap_unplug will handle the rest */
···
 		return;
 	}
 	if (!*bmc) {
-		*bmc = 2 | (needed ? NEEDED_MASK : 0);
+		*bmc = 2;
 		bitmap_count_page(&bitmap->counts, offset, 1);
 		bitmap_set_pending(&bitmap->counts, offset);
 		bitmap->allclean = 0;
 	}
+	if (needed)
+		*bmc |= NEEDED_MASK;
 	spin_unlock_irq(&bitmap->counts.lock);
 }
···
 	if (!bitmap) /* there was no bitmap */
 		return;
 
+	if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info &&
+		bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev))
+		md_cluster_stop(bitmap->mddev);
+
 	/* Shouldn't be needed - but just in case.... */
 	wait_event(bitmap->write_wait,
 		   atomic_read(&bitmap->pending_writes) == 0);
···
  * initialize the bitmap structure
  * if this returns an error, bitmap_destroy must be called to do clean up
  */
-int bitmap_create(struct mddev *mddev)
+struct bitmap *bitmap_create(struct mddev *mddev, int slot)
 {
 	struct bitmap *bitmap;
 	sector_t blocks = mddev->resync_max_sectors;
···
 	bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
 	if (!bitmap)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	spin_lock_init(&bitmap->counts.lock);
 	atomic_set(&bitmap->pending_writes, 0);
···
 	init_waitqueue_head(&bitmap->behind_wait);
 
 	bitmap->mddev = mddev;
+	bitmap->cluster_slot = slot;
 
 	if (mddev->kobj.sd)
 		bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap");
···
 	printk(KERN_INFO "created bitmap (%lu pages) for device %s\n",
 	       bitmap->counts.pages, bmname(bitmap));
 
-	mddev->bitmap = bitmap;
-	return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0;
+	err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0;
+	if (err)
+		goto error;
 
+	return bitmap;
 error:
 	bitmap_free(bitmap);
-	return err;
+	return ERR_PTR(err);
 }
···
 	return err;
 }
 EXPORT_SYMBOL_GPL(bitmap_load);
+
+/* Loads the bitmap associated with slot and copies the resync information
+ * to our bitmap
+ */
+int bitmap_copy_from_slot(struct mddev *mddev, int slot,
+		sector_t *low, sector_t *high, bool clear_bits)
+{
+	int rv = 0, i, j;
+	sector_t block, lo = 0, hi = 0;
+	struct bitmap_counts *counts;
+	struct bitmap *bitmap = bitmap_create(mddev, slot);
+
+	if (IS_ERR(bitmap))
+		return PTR_ERR(bitmap);
+
+	rv = bitmap_read_sb(bitmap);
+	if (rv)
+		goto err;
+
+	rv = bitmap_init_from_disk(bitmap, 0);
+	if (rv)
+		goto err;
+
+	counts = &bitmap->counts;
+	for (j = 0; j < counts->chunks; j++) {
+		block = (sector_t)j << counts->chunkshift;
+		if (bitmap_file_test_bit(bitmap, block)) {
+			if (!lo)
+				lo = block;
+			hi = block;
+			bitmap_file_clear_bit(bitmap, block);
+			bitmap_set_memory_bits(mddev->bitmap, block, 1);
+			bitmap_file_set_bit(mddev->bitmap, block);
+		}
+	}
+
+	if (clear_bits) {
+		bitmap_update_sb(bitmap);
+		/* Setting this for the ev_page should be enough.
+		 * And we do not require both write_all and PAGE_DIRT either
+		 */
+		for (i = 0; i < bitmap->storage.file_pages; i++)
+			set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY);
+		bitmap_write_all(bitmap);
+		bitmap_unplug(bitmap);
+	}
+	*low = lo;
+	*high = hi;
+err:
+	bitmap_free(bitmap);
+	return rv;
+}
+EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
+
 
 void bitmap_status(struct seq_file *seq, struct bitmap *bitmap)
 {
···
 	memset(&store, 0, sizeof(store));
 	if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
 		ret = bitmap_storage_alloc(&store, chunks,
-					   !bitmap->mddev->bitmap_info.external);
+					   !bitmap->mddev->bitmap_info.external,
+					   bitmap->cluster_slot);
 	if (ret)
 		goto err;
···
 		return -EINVAL;
 	mddev->bitmap_info.offset = offset;
 	if (mddev->pers) {
+		struct bitmap *bitmap;
 		mddev->pers->quiesce(mddev, 1);
-		rv = bitmap_create(mddev);
-		if (!rv)
+		bitmap = bitmap_create(mddev, -1);
+		if (IS_ERR(bitmap))
+			rv = PTR_ERR(bitmap);
+		else {
+			mddev->bitmap = bitmap;
 			rv = bitmap_load(mddev);
-		if (rv) {
-			bitmap_destroy(mddev);
-			mddev->bitmap_info.offset = 0;
+			if (rv) {
+				bitmap_destroy(mddev);
+				mddev->bitmap_info.offset = 0;
+			}
 		}
 		mddev->pers->quiesce(mddev, 0);
 		if (rv)
···
 
 static ssize_t metadata_show(struct mddev *mddev, char *page)
 {
+	if (mddev_is_clustered(mddev))
+		return sprintf(page, "clustered\n");
 	return sprintf(page, "%s\n", (mddev->bitmap_info.external
 				      ? "external" : "internal"));
 }
···
 		return -EBUSY;
 	if (strncmp(buf, "external", 8) == 0)
 		mddev->bitmap_info.external = 1;
-	else if (strncmp(buf, "internal", 8) == 0)
+	else if ((strncmp(buf, "internal", 8) == 0) ||
+			(strncmp(buf, "clustered", 9) == 0))
 		mddev->bitmap_info.external = 0;
 	else
 		return -EINVAL;
+7 -3
drivers/md/bitmap.h
···
 	__le32 write_behind;     /* 60 number of outstanding write-behind writes */
 	__le32 sectors_reserved; /* 64 number of 512-byte sectors that are
 				  * reserved for the bitmap. */
-
-	__u8  pad[256 - 68];     /* set to zero */
+	__le32 nodes;            /* 68 the maximum number of nodes in cluster. */
+	__u8 cluster_name[64];   /* 72 cluster name to which this md belongs */
+	__u8  pad[256 - 136];    /* set to zero */
 } bitmap_super_t;
 
 /* notes:
···
 	wait_queue_head_t behind_wait;
 
 	struct kernfs_node *sysfs_can_clear;
+	int cluster_slot;		/* Slot offset for clustered env */
 };
 
 /* the bitmap API */
 
 /* these are used only by md/bitmap */
-int  bitmap_create(struct mddev *mddev);
+struct bitmap *bitmap_create(struct mddev *mddev, int slot);
 int  bitmap_load(struct mddev *mddev);
 void bitmap_flush(struct mddev *mddev);
 void bitmap_destroy(struct mddev *mddev);
···
 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		  int chunksize, int init);
+int bitmap_copy_from_slot(struct mddev *mddev, int slot,
+				sector_t *lo, sector_t *hi, bool clear_bits);
 #endif
 
 #endif
+965
drivers/md/md-cluster.c
/*
 * Copyright (C) 2015, SUSE
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2, or (at your option)
 * any later version.
 *
 */


#include <linux/module.h>
#include <linux/dlm.h>
#include <linux/sched.h>
#include <linux/raid/md_p.h>
#include "md.h"
#include "bitmap.h"
#include "md-cluster.h"

#define LVB_SIZE	64
#define NEW_DEV_TIMEOUT 5000

struct dlm_lock_resource {
	dlm_lockspace_t *ls;
	struct dlm_lksb lksb;
	char *name; /* lock name. */
	uint32_t flags; /* flags to pass to dlm_lock() */
	struct completion completion; /* completion for synchronized locking */
	void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
	struct mddev *mddev; /* pointing back to mddev. */
};

struct suspend_info {
	int slot;
	sector_t lo;
	sector_t hi;
	struct list_head list;
};

struct resync_info {
	__le64 lo;
	__le64 hi;
};

/* md_cluster_info flags */
#define		MD_CLUSTER_WAITING_FOR_NEWDISK		1


struct md_cluster_info {
	/* dlm lock space and resources for clustered raid. */
	dlm_lockspace_t *lockspace;
	int slot_number;
	struct completion completion;
	struct dlm_lock_resource *sb_lock;
	struct mutex sb_mutex;
	struct dlm_lock_resource *bitmap_lockres;
	struct list_head suspend_list;
	spinlock_t suspend_lock;
	struct md_thread *recovery_thread;
	unsigned long recovery_map;
	/* communication loc resources */
	struct dlm_lock_resource *ack_lockres;
	struct dlm_lock_resource *message_lockres;
	struct dlm_lock_resource *token_lockres;
	struct dlm_lock_resource *no_new_dev_lockres;
	struct md_thread *recv_thread;
	struct completion newdisk_completion;
	unsigned long state;
};

enum msg_type {
	METADATA_UPDATED = 0,
	RESYNCING,
	NEWDISK,
	REMOVE,
	RE_ADD,
};

struct cluster_msg {
	int type;
	int slot;
	/* TODO: Unionize this for smaller footprint */
	sector_t low;
	sector_t high;
	char uuid[16];
	int raid_slot;
};

static void sync_ast(void *arg)
{
	struct dlm_lock_resource *res;

	res = (struct dlm_lock_resource *) arg;
	complete(&res->completion);
}

static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
{
	int ret = 0;

	init_completion(&res->completion);
	ret = dlm_lock(res->ls, mode, &res->lksb,
			res->flags, res->name, strlen(res->name),
			0, sync_ast, res, res->bast);
	if (ret)
		return ret;
	wait_for_completion(&res->completion);
	return res->lksb.sb_status;
}

static int dlm_unlock_sync(struct dlm_lock_resource *res)
{
	return dlm_lock_sync(res, DLM_LOCK_NL);
}

static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
		char *name, void (*bastfn)(void *arg, int mode), int with_lvb)
{
	struct dlm_lock_resource *res = NULL;
	int ret, namelen;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL);
	if (!res)
		return NULL;
	res->ls = cinfo->lockspace;
	res->mddev = mddev;
	namelen = strlen(name);
	res->name = kzalloc(namelen + 1, GFP_KERNEL);
	if (!res->name) {
		pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name);
		goto out_err;
	}
	strlcpy(res->name, name, namelen + 1);
	if (with_lvb) {
		res->lksb.sb_lvbptr = kzalloc(LVB_SIZE, GFP_KERNEL);
		if (!res->lksb.sb_lvbptr) {
			pr_err("md-cluster: Unable to allocate LVB for resource %s\n", name);
			goto out_err;
		}
		res->flags = DLM_LKF_VALBLK;
	}

	if (bastfn)
		res->bast = bastfn;

	res->flags |= DLM_LKF_EXPEDITE;

	ret = dlm_lock_sync(res, DLM_LOCK_NL);
	if (ret) {
		pr_err("md-cluster: Unable to lock NL on new lock resource %s\n", name);
		goto out_err;
	}
	res->flags &= ~DLM_LKF_EXPEDITE;
	res->flags |= DLM_LKF_CONVERT;

	return res;
out_err:
	kfree(res->lksb.sb_lvbptr);
	kfree(res->name);
	kfree(res);
	return NULL;
}

static void lockres_free(struct dlm_lock_resource *res)
{
	if (!res)
		return;

	init_completion(&res->completion);
	dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res);
	wait_for_completion(&res->completion);

	kfree(res->name);
	kfree(res->lksb.sb_lvbptr);
	kfree(res);
}

static char *pretty_uuid(char *dest, char *src)
{
	int i, len = 0;

	for (i = 0; i < 16; i++) {
		if (i == 4 || i == 6 || i == 8 || i == 10)
			len += sprintf(dest + len, "-");
		len += sprintf(dest + len, "%02x", (__u8)src[i]);
	}
	return dest;
}

static void add_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres,
		sector_t lo, sector_t hi)
{
	struct resync_info *ri;

	ri = (struct resync_info *)lockres->lksb.sb_lvbptr;
	ri->lo = cpu_to_le64(lo);
	ri->hi = cpu_to_le64(hi);
}

static struct suspend_info *read_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres)
{
	struct resync_info ri;
	struct suspend_info *s = NULL;
	sector_t hi = 0;

	dlm_lock_sync(lockres, DLM_LOCK_CR);
	memcpy(&ri, lockres->lksb.sb_lvbptr, sizeof(struct resync_info));
	hi = le64_to_cpu(ri.hi);
	if (ri.hi > 0) {
		s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
		if (!s)
			goto out;
		s->hi = hi;
		s->lo = le64_to_cpu(ri.lo);
	}
	dlm_unlock_sync(lockres);
out:
	return s;
}

static void recover_bitmaps(struct md_thread *thread)
{
	struct mddev *mddev = thread->mddev;
	struct md_cluster_info *cinfo = mddev->cluster_info;
	struct dlm_lock_resource *bm_lockres;
	char str[64];
	int slot, ret;
	struct suspend_info *s, *tmp;
	sector_t lo, hi;

	while (cinfo->recovery_map) {
		slot = fls64((u64)cinfo->recovery_map) - 1;

		/* Clear suspend_area associated with the bitmap */
		spin_lock_irq(&cinfo->suspend_lock);
		list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
			if (slot == s->slot) {
				list_del(&s->list);
				kfree(s);
			}
		spin_unlock_irq(&cinfo->suspend_lock);

		snprintf(str, 64, "bitmap%04d", slot);
		bm_lockres = lockres_init(mddev, str, NULL, 1);
		if (!bm_lockres) {
			pr_err("md-cluster: Cannot initialize bitmaps\n");
			goto clear_bit;
		}

		ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
		if (ret) {
			pr_err("md-cluster: Could not DLM lock %s: %d\n",
					str, ret);
			goto clear_bit;
		}
		ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi, true);
		if (ret) {
			pr_err("md-cluster: Could not copy data from bitmap %d\n", slot);
			goto dlm_unlock;
		}
		if (hi > 0) {
			/* TODO:Wait for current resync to get over */
			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			if (lo < mddev->recovery_cp)
				mddev->recovery_cp = lo;
			md_check_recovery(mddev);
		}
dlm_unlock:
		dlm_unlock_sync(bm_lockres);
clear_bit:
		clear_bit(slot, &cinfo->recovery_map);
	}
}

static void recover_prep(void *arg)
{
}

static void recover_slot(void *arg, struct dlm_slot *slot)
{
	struct mddev *mddev = arg;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	pr_info("md-cluster: %s Node %d/%d down. My slot: %d. Initiating recovery.\n",
			mddev->bitmap_info.cluster_name,
			slot->nodeid, slot->slot,
			cinfo->slot_number);
	set_bit(slot->slot - 1, &cinfo->recovery_map);
	if (!cinfo->recovery_thread) {
		cinfo->recovery_thread = md_register_thread(recover_bitmaps,
				mddev, "recover");
		if (!cinfo->recovery_thread) {
			pr_warn("md-cluster: Could not create recovery thread\n");
			return;
		}
	}
	md_wakeup_thread(cinfo->recovery_thread);
}

static void recover_done(void *arg, struct dlm_slot *slots,
		int num_slots, int our_slot,
		uint32_t generation)
{
	struct mddev *mddev = arg;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	cinfo->slot_number = our_slot;
	complete(&cinfo->completion);
}

static const struct dlm_lockspace_ops md_ls_ops = {
	.recover_prep = recover_prep,
	.recover_slot = recover_slot,
	.recover_done = recover_done,
};

/*
 * The BAST function for the ack lock resource
 * This function wakes up the receive thread in
 * order to receive and process the message.
 */
static void ack_bast(void *arg, int mode)
{
	struct dlm_lock_resource *res = (struct dlm_lock_resource *)arg;
	struct md_cluster_info *cinfo = res->mddev->cluster_info;

	if (mode == DLM_LOCK_EX)
		md_wakeup_thread(cinfo->recv_thread);
}

static void __remove_suspend_info(struct md_cluster_info *cinfo, int slot)
{
	struct suspend_info *s, *tmp;

	list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
		if (slot == s->slot) {
			pr_info("%s:%d Deleting suspend_info: %d\n",
					__func__, __LINE__, slot);
			list_del(&s->list);
			kfree(s);
			break;
		}
}

static void remove_suspend_info(struct md_cluster_info *cinfo, int slot)
{
	spin_lock_irq(&cinfo->suspend_lock);
	__remove_suspend_info(cinfo, slot);
	spin_unlock_irq(&cinfo->suspend_lock);
}


static void process_suspend_info(struct md_cluster_info *cinfo,
		int slot, sector_t lo, sector_t hi)
{
	struct suspend_info *s;

	if (!hi) {
		remove_suspend_info(cinfo, slot);
		return;
	}
	s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
	if (!s)
		return;
	s->slot = slot;
	s->lo = lo;
	s->hi = hi;
	spin_lock_irq(&cinfo->suspend_lock);
	/* Remove existing entry (if exists) before adding */
	__remove_suspend_info(cinfo, slot);
	list_add(&s->list, &cinfo->suspend_list);
	spin_unlock_irq(&cinfo->suspend_lock);
}

static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
{
	char disk_uuid[64];
	struct md_cluster_info *cinfo = mddev->cluster_info;
	char event_name[] = "EVENT=ADD_DEVICE";
	char raid_slot[16];
	char *envp[] = {event_name, disk_uuid, raid_slot, NULL};
	int len;

	len = snprintf(disk_uuid, 64, "DEVICE_UUID=");
	pretty_uuid(disk_uuid + len, cmsg->uuid);
snprintf(raid_slot, 16, "RAID_DISK=%d", cmsg->raid_slot); 388 + pr_info("%s:%d Sending kobject change with %s and %s\n", __func__, __LINE__, disk_uuid, raid_slot); 389 + init_completion(&cinfo->newdisk_completion); 390 + set_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); 391 + kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp); 392 + wait_for_completion_timeout(&cinfo->newdisk_completion, 393 + NEW_DEV_TIMEOUT); 394 + clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); 395 + } 396 + 397 + 398 + static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) 399 + { 400 + struct md_cluster_info *cinfo = mddev->cluster_info; 401 + 402 + md_reload_sb(mddev); 403 + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); 404 + } 405 + 406 + static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) 407 + { 408 + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); 409 + 410 + if (rdev) 411 + md_kick_rdev_from_array(rdev); 412 + else 413 + pr_warn("%s: %d Could not find disk(%d) to REMOVE\n", __func__, __LINE__, msg->raid_slot); 414 + } 415 + 416 + static void process_readd_disk(struct mddev *mddev, struct cluster_msg *msg) 417 + { 418 + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); 419 + 420 + if (rdev && test_bit(Faulty, &rdev->flags)) 421 + clear_bit(Faulty, &rdev->flags); 422 + else 423 + pr_warn("%s: %d Could not find disk(%d) which is faulty", __func__, __LINE__, msg->raid_slot); 424 + } 425 + 426 + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) 427 + { 428 + switch (msg->type) { 429 + case METADATA_UPDATED: 430 + pr_info("%s: %d Received message: METADATA_UPDATE from %d\n", 431 + __func__, __LINE__, msg->slot); 432 + process_metadata_update(mddev, msg); 433 + break; 434 + case RESYNCING: 435 + pr_info("%s: %d Received message: RESYNCING from %d\n", 436 + __func__, __LINE__, msg->slot); 437 + process_suspend_info(mddev->cluster_info, 
msg->slot, 438 + msg->low, msg->high); 439 + break; 440 + case NEWDISK: 441 + pr_info("%s: %d Received message: NEWDISK from %d\n", 442 + __func__, __LINE__, msg->slot); 443 + process_add_new_disk(mddev, msg); 444 + break; 445 + case REMOVE: 446 + pr_info("%s: %d Received REMOVE from %d\n", 447 + __func__, __LINE__, msg->slot); 448 + process_remove_disk(mddev, msg); 449 + break; 450 + case RE_ADD: 451 + pr_info("%s: %d Received RE_ADD from %d\n", 452 + __func__, __LINE__, msg->slot); 453 + process_readd_disk(mddev, msg); 454 + break; 455 + default: 456 + pr_warn("%s:%d Received unknown message from %d\n", 457 + __func__, __LINE__, msg->slot); 458 + } 459 + } 460 + 461 + /* 462 + * thread for receiving message 463 + */ 464 + static void recv_daemon(struct md_thread *thread) 465 + { 466 + struct md_cluster_info *cinfo = thread->mddev->cluster_info; 467 + struct dlm_lock_resource *ack_lockres = cinfo->ack_lockres; 468 + struct dlm_lock_resource *message_lockres = cinfo->message_lockres; 469 + struct cluster_msg msg; 470 + 471 + /*get CR on Message*/ 472 + if (dlm_lock_sync(message_lockres, DLM_LOCK_CR)) { 473 + pr_err("md/raid1:failed to get CR on MESSAGE\n"); 474 + return; 475 + } 476 + 477 + /* read lvb and wake up thread to process this message_lockres */ 478 + memcpy(&msg, message_lockres->lksb.sb_lvbptr, sizeof(struct cluster_msg)); 479 + process_recvd_msg(thread->mddev, &msg); 480 + 481 + /*release CR on ack_lockres*/ 482 + dlm_unlock_sync(ack_lockres); 483 + /*up-convert to EX on message_lockres*/ 484 + dlm_lock_sync(message_lockres, DLM_LOCK_EX); 485 + /*get CR on ack_lockres again*/ 486 + dlm_lock_sync(ack_lockres, DLM_LOCK_CR); 487 + /*release CR on message_lockres*/ 488 + dlm_unlock_sync(message_lockres); 489 + } 490 + 491 + /* lock_comm() 492 + * Takes the lock on the TOKEN lock resource so no other 493 + * node can communicate while the operation is underway. 
494 + */ 495 + static int lock_comm(struct md_cluster_info *cinfo) 496 + { 497 + int error; 498 + 499 + error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); 500 + if (error) 501 + pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n", 502 + __func__, __LINE__, error); 503 + return error; 504 + } 505 + 506 + static void unlock_comm(struct md_cluster_info *cinfo) 507 + { 508 + dlm_unlock_sync(cinfo->token_lockres); 509 + } 510 + 511 + /* __sendmsg() 512 + * This function performs the actual sending of the message. This function is 513 + * usually called after performing the encompassing operation 514 + * The function: 515 + * 1. Grabs the message lockresource in EX mode 516 + * 2. Copies the message to the message LVB 517 + * 3. Downconverts message lockresource to CR 518 + * 4. Upconverts ack lock resource from CR to EX. This forces the BAST on other nodes 519 + * and the other nodes read the message. The thread will wait here until all other 520 + * nodes have released ack lock resource. 521 + * 5. 
Downconvert ack lockresource to CR 522 + */ 523 + static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) 524 + { 525 + int error; 526 + int slot = cinfo->slot_number - 1; 527 + 528 + cmsg->slot = cpu_to_le32(slot); 529 + /*get EX on Message*/ 530 + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); 531 + if (error) { 532 + pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); 533 + goto failed_message; 534 + } 535 + 536 + memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg, 537 + sizeof(struct cluster_msg)); 538 + /*down-convert EX to CR on Message*/ 539 + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CR); 540 + if (error) { 541 + pr_err("md-cluster: failed to convert EX to CR on MESSAGE(%d)\n", 542 + error); 543 + goto failed_message; 544 + } 545 + 546 + /*up-convert CR to EX on Ack*/ 547 + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); 548 + if (error) { 549 + pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", 550 + error); 551 + goto failed_ack; 552 + } 553 + 554 + /*down-convert EX to CR on Ack*/ 555 + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); 556 + if (error) { 557 + pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", 558 + error); 559 + goto failed_ack; 560 + } 561 + 562 + failed_ack: 563 + dlm_unlock_sync(cinfo->message_lockres); 564 + failed_message: 565 + return error; 566 + } 567 + 568 + static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) 569 + { 570 + int ret; 571 + 572 + lock_comm(cinfo); 573 + ret = __sendmsg(cinfo, cmsg); 574 + unlock_comm(cinfo); 575 + return ret; 576 + } 577 + 578 + static int gather_all_resync_info(struct mddev *mddev, int total_slots) 579 + { 580 + struct md_cluster_info *cinfo = mddev->cluster_info; 581 + int i, ret = 0; 582 + struct dlm_lock_resource *bm_lockres; 583 + struct suspend_info *s; 584 + char str[64]; 585 + 586 + 587 + for (i = 0; i < total_slots; i++) { 588 + memset(str, '\0', 64); 589 + 
snprintf(str, 64, "bitmap%04d", i); 590 + bm_lockres = lockres_init(mddev, str, NULL, 1); 591 + if (!bm_lockres) 592 + return -ENOMEM; 593 + if (i == (cinfo->slot_number - 1)) 594 + continue; 595 + 596 + bm_lockres->flags |= DLM_LKF_NOQUEUE; 597 + ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW); 598 + if (ret == -EAGAIN) { 599 + memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE); 600 + s = read_resync_info(mddev, bm_lockres); 601 + if (s) { 602 + pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n", 603 + __func__, __LINE__, 604 + (unsigned long long) s->lo, 605 + (unsigned long long) s->hi, i); 606 + spin_lock_irq(&cinfo->suspend_lock); 607 + s->slot = i; 608 + list_add(&s->list, &cinfo->suspend_list); 609 + spin_unlock_irq(&cinfo->suspend_lock); 610 + } 611 + ret = 0; 612 + lockres_free(bm_lockres); 613 + continue; 614 + } 615 + if (ret) 616 + goto out; 617 + /* TODO: Read the disk bitmap sb and check if it needs recovery */ 618 + dlm_unlock_sync(bm_lockres); 619 + lockres_free(bm_lockres); 620 + } 621 + out: 622 + return ret; 623 + } 624 + 625 + static int join(struct mddev *mddev, int nodes) 626 + { 627 + struct md_cluster_info *cinfo; 628 + int ret, ops_rv; 629 + char str[64]; 630 + 631 + if (!try_module_get(THIS_MODULE)) 632 + return -ENOENT; 633 + 634 + cinfo = kzalloc(sizeof(struct md_cluster_info), GFP_KERNEL); 635 + if (!cinfo) 636 + return -ENOMEM; 637 + 638 + init_completion(&cinfo->completion); 639 + 640 + mutex_init(&cinfo->sb_mutex); 641 + mddev->cluster_info = cinfo; 642 + 643 + memset(str, 0, 64); 644 + pretty_uuid(str, mddev->uuid); 645 + ret = dlm_new_lockspace(str, mddev->bitmap_info.cluster_name, 646 + DLM_LSFL_FS, LVB_SIZE, 647 + &md_ls_ops, mddev, &ops_rv, &cinfo->lockspace); 648 + if (ret) 649 + goto err; 650 + wait_for_completion(&cinfo->completion); 651 + if (nodes < cinfo->slot_number) { 652 + pr_err("md-cluster: Slot allotted(%d) is greater than available slots(%d).", 653 + cinfo->slot_number, nodes); 654 + ret = -ERANGE; 655 + goto err; 
656 + } 657 + cinfo->sb_lock = lockres_init(mddev, "cmd-super", 658 + NULL, 0); 659 + if (!cinfo->sb_lock) { 660 + ret = -ENOMEM; 661 + goto err; 662 + } 663 + /* Initiate the communication resources */ 664 + ret = -ENOMEM; 665 + cinfo->recv_thread = md_register_thread(recv_daemon, mddev, "cluster_recv"); 666 + if (!cinfo->recv_thread) { 667 + pr_err("md-cluster: cannot allocate memory for recv_thread!\n"); 668 + goto err; 669 + } 670 + cinfo->message_lockres = lockres_init(mddev, "message", NULL, 1); 671 + if (!cinfo->message_lockres) 672 + goto err; 673 + cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0); 674 + if (!cinfo->token_lockres) 675 + goto err; 676 + cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0); 677 + if (!cinfo->ack_lockres) 678 + goto err; 679 + cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0); 680 + if (!cinfo->no_new_dev_lockres) 681 + goto err; 682 + 683 + /* get sync CR lock on ACK. */ 684 + if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR)) 685 + pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n", 686 + ret); 687 + /* get sync CR lock on no-new-dev. 
*/ 688 + if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR)) 689 + pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret); 690 + 691 + 692 + pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number); 693 + snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1); 694 + cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1); 695 + if (!cinfo->bitmap_lockres) 696 + goto err; 697 + if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) { 698 + pr_err("Failed to get bitmap lock\n"); 699 + ret = -EINVAL; 700 + goto err; 701 + } 702 + 703 + INIT_LIST_HEAD(&cinfo->suspend_list); 704 + spin_lock_init(&cinfo->suspend_lock); 705 + 706 + ret = gather_all_resync_info(mddev, nodes); 707 + if (ret) 708 + goto err; 709 + 710 + return 0; 711 + err: 712 + lockres_free(cinfo->message_lockres); 713 + lockres_free(cinfo->token_lockres); 714 + lockres_free(cinfo->ack_lockres); 715 + lockres_free(cinfo->no_new_dev_lockres); 716 + lockres_free(cinfo->bitmap_lockres); 717 + lockres_free(cinfo->sb_lock); 718 + if (cinfo->lockspace) 719 + dlm_release_lockspace(cinfo->lockspace, 2); 720 + mddev->cluster_info = NULL; 721 + kfree(cinfo); 722 + module_put(THIS_MODULE); 723 + return ret; 724 + } 725 + 726 + static int leave(struct mddev *mddev) 727 + { 728 + struct md_cluster_info *cinfo = mddev->cluster_info; 729 + 730 + if (!cinfo) 731 + return 0; 732 + md_unregister_thread(&cinfo->recovery_thread); 733 + md_unregister_thread(&cinfo->recv_thread); 734 + lockres_free(cinfo->message_lockres); 735 + lockres_free(cinfo->token_lockres); 736 + lockres_free(cinfo->ack_lockres); 737 + lockres_free(cinfo->no_new_dev_lockres); 738 + lockres_free(cinfo->sb_lock); 739 + lockres_free(cinfo->bitmap_lockres); 740 + dlm_release_lockspace(cinfo->lockspace, 2); 741 + return 0; 742 + } 743 + 744 + /* slot_number(): Returns the MD slot number to use 745 + * DLM starts the slot numbers from 1, whereas cluster-md 746 + * wants the number to be from zero, so we deduct
one 747 + */ 748 + static int slot_number(struct mddev *mddev) 749 + { 750 + struct md_cluster_info *cinfo = mddev->cluster_info; 751 + 752 + return cinfo->slot_number - 1; 753 + } 754 + 755 + static void resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi) 756 + { 757 + struct md_cluster_info *cinfo = mddev->cluster_info; 758 + 759 + add_resync_info(mddev, cinfo->bitmap_lockres, lo, hi); 760 + /* Re-acquire the lock to refresh LVB */ 761 + dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW); 762 + } 763 + 764 + static int metadata_update_start(struct mddev *mddev) 765 + { 766 + return lock_comm(mddev->cluster_info); 767 + } 768 + 769 + static int metadata_update_finish(struct mddev *mddev) 770 + { 771 + struct md_cluster_info *cinfo = mddev->cluster_info; 772 + struct cluster_msg cmsg; 773 + int ret; 774 + 775 + memset(&cmsg, 0, sizeof(cmsg)); 776 + cmsg.type = cpu_to_le32(METADATA_UPDATED); 777 + ret = __sendmsg(cinfo, &cmsg); 778 + unlock_comm(cinfo); 779 + return ret; 780 + } 781 + 782 + static int metadata_update_cancel(struct mddev *mddev) 783 + { 784 + struct md_cluster_info *cinfo = mddev->cluster_info; 785 + 786 + return dlm_unlock_sync(cinfo->token_lockres); 787 + } 788 + 789 + static int resync_send(struct mddev *mddev, enum msg_type type, 790 + sector_t lo, sector_t hi) 791 + { 792 + struct md_cluster_info *cinfo = mddev->cluster_info; 793 + struct cluster_msg cmsg; 794 + int slot = cinfo->slot_number - 1; 795 + 796 + pr_info("%s:%d lo: %llu hi: %llu\n", __func__, __LINE__, 797 + (unsigned long long)lo, 798 + (unsigned long long)hi); 799 + resync_info_update(mddev, lo, hi); 800 + cmsg.type = cpu_to_le32(type); 801 + cmsg.slot = cpu_to_le32(slot); 802 + cmsg.low = cpu_to_le64(lo); 803 + cmsg.high = cpu_to_le64(hi); 804 + return sendmsg(cinfo, &cmsg); 805 + } 806 + 807 + static int resync_start(struct mddev *mddev, sector_t lo, sector_t hi) 808 + { 809 + pr_info("%s:%d\n", __func__, __LINE__); 810 + return resync_send(mddev, RESYNCING, lo, hi); 
811 + } 812 + 813 + static void resync_finish(struct mddev *mddev) 814 + { 815 + pr_info("%s:%d\n", __func__, __LINE__); 816 + resync_send(mddev, RESYNCING, 0, 0); 817 + } 818 + 819 + static int area_resyncing(struct mddev *mddev, sector_t lo, sector_t hi) 820 + { 821 + struct md_cluster_info *cinfo = mddev->cluster_info; 822 + int ret = 0; 823 + struct suspend_info *s; 824 + 825 + spin_lock_irq(&cinfo->suspend_lock); 826 + if (list_empty(&cinfo->suspend_list)) 827 + goto out; 828 + list_for_each_entry(s, &cinfo->suspend_list, list) 829 + if (hi > s->lo && lo < s->hi) { 830 + ret = 1; 831 + break; 832 + } 833 + out: 834 + spin_unlock_irq(&cinfo->suspend_lock); 835 + return ret; 836 + } 837 + 838 + static int add_new_disk_start(struct mddev *mddev, struct md_rdev *rdev) 839 + { 840 + struct md_cluster_info *cinfo = mddev->cluster_info; 841 + struct cluster_msg cmsg; 842 + int ret = 0; 843 + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); 844 + char *uuid = sb->device_uuid; 845 + 846 + memset(&cmsg, 0, sizeof(cmsg)); 847 + cmsg.type = cpu_to_le32(NEWDISK); 848 + memcpy(cmsg.uuid, uuid, 16); 849 + cmsg.raid_slot = rdev->desc_nr; 850 + lock_comm(cinfo); 851 + ret = __sendmsg(cinfo, &cmsg); 852 + if (ret) 853 + return ret; 854 + cinfo->no_new_dev_lockres->flags |= DLM_LKF_NOQUEUE; 855 + ret = dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_EX); 856 + cinfo->no_new_dev_lockres->flags &= ~DLM_LKF_NOQUEUE; 857 + /* Some node does not "see" the device */ 858 + if (ret == -EAGAIN) 859 + ret = -ENOENT; 860 + else 861 + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); 862 + return ret; 863 + } 864 + 865 + static int add_new_disk_finish(struct mddev *mddev) 866 + { 867 + struct cluster_msg cmsg; 868 + struct md_cluster_info *cinfo = mddev->cluster_info; 869 + int ret; 870 + /* Write sb and inform others */ 871 + md_update_sb(mddev, 1); 872 + cmsg.type = METADATA_UPDATED; 873 + ret = __sendmsg(cinfo, &cmsg); 874 + unlock_comm(cinfo); 875 + return ret; 876 + } 
877 + 878 + static int new_disk_ack(struct mddev *mddev, bool ack) 879 + { 880 + struct md_cluster_info *cinfo = mddev->cluster_info; 881 + 882 + if (!test_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state)) { 883 + pr_warn("md-cluster(%s): Spurious cluster confirmation\n", mdname(mddev)); 884 + return -EINVAL; 885 + } 886 + 887 + if (ack) 888 + dlm_unlock_sync(cinfo->no_new_dev_lockres); 889 + complete(&cinfo->newdisk_completion); 890 + return 0; 891 + } 892 + 893 + static int remove_disk(struct mddev *mddev, struct md_rdev *rdev) 894 + { 895 + struct cluster_msg cmsg; 896 + struct md_cluster_info *cinfo = mddev->cluster_info; 897 + cmsg.type = REMOVE; 898 + cmsg.raid_slot = rdev->desc_nr; 899 + return __sendmsg(cinfo, &cmsg); 900 + } 901 + 902 + static int gather_bitmaps(struct md_rdev *rdev) 903 + { 904 + int sn, err; 905 + sector_t lo, hi; 906 + struct cluster_msg cmsg; 907 + struct mddev *mddev = rdev->mddev; 908 + struct md_cluster_info *cinfo = mddev->cluster_info; 909 + 910 + cmsg.type = RE_ADD; 911 + cmsg.raid_slot = rdev->desc_nr; 912 + err = sendmsg(cinfo, &cmsg); 913 + if (err) 914 + goto out; 915 + 916 + for (sn = 0; sn < mddev->bitmap_info.nodes; sn++) { 917 + if (sn == (cinfo->slot_number - 1)) 918 + continue; 919 + err = bitmap_copy_from_slot(mddev, sn, &lo, &hi, false); 920 + if (err) { 921 + pr_warn("md-cluster: Could not gather bitmaps from slot %d", sn); 922 + goto out; 923 + } 924 + if ((hi > 0) && (lo < mddev->recovery_cp)) 925 + mddev->recovery_cp = lo; 926 + } 927 + out: 928 + return err; 929 + } 930 + 931 + static struct md_cluster_operations cluster_ops = { 932 + .join = join, 933 + .leave = leave, 934 + .slot_number = slot_number, 935 + .resync_info_update = resync_info_update, 936 + .resync_start = resync_start, 937 + .resync_finish = resync_finish, 938 + .metadata_update_start = metadata_update_start, 939 + .metadata_update_finish = metadata_update_finish, 940 + .metadata_update_cancel = metadata_update_cancel, 941 + .area_resyncing = 
area_resyncing, 942 + .add_new_disk_start = add_new_disk_start, 943 + .add_new_disk_finish = add_new_disk_finish, 944 + .new_disk_ack = new_disk_ack, 945 + .remove_disk = remove_disk, 946 + .gather_bitmaps = gather_bitmaps, 947 + }; 948 + 949 + static int __init cluster_init(void) 950 + { 951 + pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n"); 952 + pr_info("Registering Cluster MD functions\n"); 953 + register_md_cluster_operations(&cluster_ops, THIS_MODULE); 954 + return 0; 955 + } 956 + 957 + static void cluster_exit(void) 958 + { 959 + unregister_md_cluster_operations(); 960 + } 961 + 962 + module_init(cluster_init); 963 + module_exit(cluster_exit); 964 + MODULE_LICENSE("GPL"); 965 + MODULE_DESCRIPTION("Clustering support for MD");
drivers/md/md-cluster.h
··· 1 + 2 + 3 + #ifndef _MD_CLUSTER_H 4 + #define _MD_CLUSTER_H 5 + 6 + #include "md.h" 7 + 8 + struct mddev; 9 + struct md_rdev; 10 + 11 + struct md_cluster_operations { 12 + int (*join)(struct mddev *mddev, int nodes); 13 + int (*leave)(struct mddev *mddev); 14 + int (*slot_number)(struct mddev *mddev); 15 + void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi); 16 + int (*resync_start)(struct mddev *mddev, sector_t lo, sector_t hi); 17 + void (*resync_finish)(struct mddev *mddev); 18 + int (*metadata_update_start)(struct mddev *mddev); 19 + int (*metadata_update_finish)(struct mddev *mddev); 20 + int (*metadata_update_cancel)(struct mddev *mddev); 21 + int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi); 22 + int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev); 23 + int (*add_new_disk_finish)(struct mddev *mddev); 24 + int (*new_disk_ack)(struct mddev *mddev, bool ack); 25 + int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev); 26 + int (*gather_bitmaps)(struct md_rdev *rdev); 27 + }; 28 + 29 + #endif /* _MD_CLUSTER_H */
drivers/md/md.c
··· 53 53 #include <linux/slab.h> 54 54 #include "md.h" 55 55 #include "bitmap.h" 56 + #include "md-cluster.h" 56 57 57 58 #ifndef MODULE 58 59 static void autostart_arrays(int part); ··· 66 65 */ 67 66 static LIST_HEAD(pers_list); 68 67 static DEFINE_SPINLOCK(pers_lock); 68 + 69 + struct md_cluster_operations *md_cluster_ops; 70 + EXPORT_SYMBOL(md_cluster_ops); 71 + struct module *md_cluster_mod; 72 + EXPORT_SYMBOL(md_cluster_mod); 69 73 70 74 static DECLARE_WAIT_QUEUE_HEAD(resync_wait); 71 75 static struct workqueue_struct *md_wq; ··· 646 640 } 647 641 EXPORT_SYMBOL_GPL(mddev_unlock); 648 642 649 - static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr) 643 + struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr) 650 644 { 651 645 struct md_rdev *rdev; 652 646 ··· 656 650 657 651 return NULL; 658 652 } 653 + EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu); 659 654 660 655 static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev) 661 656 { ··· 2054 2047 int choice = 0; 2055 2048 if (mddev->pers) 2056 2049 choice = mddev->raid_disks; 2057 - while (find_rdev_nr_rcu(mddev, choice)) 2050 + while (md_find_rdev_nr_rcu(mddev, choice)) 2058 2051 choice++; 2059 2052 rdev->desc_nr = choice; 2060 2053 } else { 2061 - if (find_rdev_nr_rcu(mddev, rdev->desc_nr)) { 2054 + if (md_find_rdev_nr_rcu(mddev, rdev->desc_nr)) { 2062 2055 rcu_read_unlock(); 2063 2056 return -EBUSY; 2064 2057 } ··· 2173 2166 kobject_put(&rdev->kobj); 2174 2167 } 2175 2168 2176 - static void kick_rdev_from_array(struct md_rdev *rdev) 2169 + void md_kick_rdev_from_array(struct md_rdev *rdev) 2177 2170 { 2178 2171 unbind_rdev_from_array(rdev); 2179 2172 export_rdev(rdev); 2180 2173 } 2174 + EXPORT_SYMBOL_GPL(md_kick_rdev_from_array); 2181 2175 2182 2176 static void export_array(struct mddev *mddev) 2183 2177 { ··· 2187 2179 while (!list_empty(&mddev->disks)) { 2188 2180 rdev = list_first_entry(&mddev->disks, struct md_rdev, 2189 2181 same_set); 2190 - kick_rdev_from_array(rdev); 2182 + 
md_kick_rdev_from_array(rdev); 2191 2183 } 2192 2184 mddev->raid_disks = 0; 2193 2185 mddev->major_version = 0; ··· 2216 2208 } 2217 2209 } 2218 2210 2219 - static void md_update_sb(struct mddev *mddev, int force_change) 2211 + void md_update_sb(struct mddev *mddev, int force_change) 2220 2212 { 2221 2213 struct md_rdev *rdev; 2222 2214 int sync_req; ··· 2377 2369 wake_up(&rdev->blocked_wait); 2378 2370 } 2379 2371 } 2372 + EXPORT_SYMBOL(md_update_sb); 2373 + 2374 + static int add_bound_rdev(struct md_rdev *rdev) 2375 + { 2376 + struct mddev *mddev = rdev->mddev; 2377 + int err = 0; 2378 + 2379 + if (!mddev->pers->hot_remove_disk) { 2380 + /* If there is hot_add_disk but no hot_remove_disk 2381 + * then added disks for geometry changes, 2382 + * and should be added immediately. 2383 + */ 2384 + super_types[mddev->major_version]. 2385 + validate_super(mddev, rdev); 2386 + err = mddev->pers->hot_add_disk(mddev, rdev); 2387 + if (err) { 2388 + unbind_rdev_from_array(rdev); 2389 + export_rdev(rdev); 2390 + return err; 2391 + } 2392 + } 2393 + sysfs_notify_dirent_safe(rdev->sysfs_state); 2394 + 2395 + set_bit(MD_CHANGE_DEVS, &mddev->flags); 2396 + if (mddev->degraded) 2397 + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 2398 + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 2399 + md_new_event(mddev); 2400 + md_wakeup_thread(mddev->thread); 2401 + return 0; 2402 + } 2380 2403 2381 2404 /* words written to sysfs files may, or may not, be \n terminated. 2382 2405 * We want to accept with case. For this we use cmd_match. 
··· 2510 2471 err = -EBUSY; 2511 2472 else { 2512 2473 struct mddev *mddev = rdev->mddev; 2513 - kick_rdev_from_array(rdev); 2474 + if (mddev_is_clustered(mddev)) 2475 + md_cluster_ops->remove_disk(mddev, rdev); 2476 + md_kick_rdev_from_array(rdev); 2477 + if (mddev_is_clustered(mddev)) 2478 + md_cluster_ops->metadata_update_start(mddev); 2514 2479 if (mddev->pers) 2515 2480 md_update_sb(mddev, 1); 2516 2481 md_new_event(mddev); 2482 + if (mddev_is_clustered(mddev)) 2483 + md_cluster_ops->metadata_update_finish(mddev); 2517 2484 err = 0; 2518 2485 } 2519 2486 } else if (cmd_match(buf, "writemostly")) { ··· 2598 2553 clear_bit(Replacement, &rdev->flags); 2599 2554 err = 0; 2600 2555 } 2556 + } else if (cmd_match(buf, "re-add")) { 2557 + if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) { 2558 + /* clear_bit is performed _after_ all the devices 2559 + * have their local Faulty bit cleared. If any writes 2560 + * happen in the meantime in the local node, they 2561 + * will land in the local bitmap, which will be synced 2562 + * by this node eventually 2563 + */ 2564 + if (!mddev_is_clustered(rdev->mddev) || 2565 + (err = md_cluster_ops->gather_bitmaps(rdev)) == 0) { 2566 + clear_bit(Faulty, &rdev->flags); 2567 + err = add_bound_rdev(rdev); 2568 + } 2569 + } else 2570 + err = -EBUSY; 2601 2571 } 2602 2572 if (!err) 2603 2573 sysfs_notify_dirent_safe(rdev->sysfs_state); ··· 3187 3127 "md: fatal superblock inconsistency in %s" 3188 3128 " -- removing from array\n", 3189 3129 bdevname(rdev->bdev,b)); 3190 - kick_rdev_from_array(rdev); 3130 + md_kick_rdev_from_array(rdev); 3191 3131 } 3192 3132 3193 3133 super_types[mddev->major_version]. 
··· 3202 3142 "md: %s: %s: only %d devices permitted\n", 3203 3143 mdname(mddev), bdevname(rdev->bdev, b), 3204 3144 mddev->max_disks); 3205 - kick_rdev_from_array(rdev); 3145 + md_kick_rdev_from_array(rdev); 3206 3146 continue; 3207 3147 } 3208 - if (rdev != freshest) 3148 + if (rdev != freshest) { 3209 3149 if (super_types[mddev->major_version]. 3210 3150 validate_super(mddev, rdev)) { 3211 3151 printk(KERN_WARNING "md: kicking non-fresh %s" 3212 3152 " from array!\n", 3213 3153 bdevname(rdev->bdev,b)); 3214 - kick_rdev_from_array(rdev); 3154 + md_kick_rdev_from_array(rdev); 3215 3155 continue; 3216 3156 } 3157 + /* No device should have a Candidate flag 3158 + * when reading devices 3159 + */ 3160 + if (test_bit(Candidate, &rdev->flags)) { 3161 + pr_info("md: kicking Cluster Candidate %s from array!\n", 3162 + bdevname(rdev->bdev, b)); 3163 + md_kick_rdev_from_array(rdev); 3164 + } 3165 + } 3217 3166 if (mddev->level == LEVEL_MULTIPATH) { 3218 3167 rdev->desc_nr = i++; 3219 3168 rdev->raid_disk = rdev->desc_nr; ··· 4077 4008 if (err) 4078 4009 return err; 4079 4010 if (mddev->pers) { 4011 + if (mddev_is_clustered(mddev)) 4012 + md_cluster_ops->metadata_update_start(mddev); 4080 4013 err = update_size(mddev, sectors); 4081 4014 md_update_sb(mddev, 1); 4015 + if (mddev_is_clustered(mddev)) 4016 + md_cluster_ops->metadata_update_finish(mddev); 4082 4017 } else { 4083 4018 if (mddev->dev_sectors == 0 || 4084 4019 mddev->dev_sectors > sectors) ··· 4427 4354 { 4428 4355 unsigned long long min; 4429 4356 int err; 4430 - int chunk; 4431 4357 4432 4358 if (kstrtoull(buf, 10, &min)) 4433 4359 return -EINVAL; ··· 4440 4368 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 4441 4369 goto out_unlock; 4442 4370 4443 - /* Must be a multiple of chunk_size */ 4444 - chunk = mddev->chunk_sectors; 4445 - if (chunk) { 4446 - sector_t temp = min; 4447 - 4448 - err = -EINVAL; 4449 - if (sector_div(temp, chunk)) 4450 - goto out_unlock; 4451 - } 4452 - mddev->resync_min = min; 4371 
+ /* Round down to multiple of 4K for safety */ 4372 + mddev->resync_min = round_down(min, 8); 4453 4373 err = 0; 4454 4374 4455 4375 out_unlock: ··· 5141 5077 } 5142 5078 if (err == 0 && pers->sync_request && 5143 5079 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { 5144 - err = bitmap_create(mddev); 5145 - if (err) 5080 + struct bitmap *bitmap; 5081 + 5082 + bitmap = bitmap_create(mddev, -1); 5083 + if (IS_ERR(bitmap)) { 5084 + err = PTR_ERR(bitmap); 5146 5085 printk(KERN_ERR "%s: failed to create bitmap (%d)\n", 5147 5086 mdname(mddev), err); 5087 + } else 5088 + mddev->bitmap = bitmap; 5089 + 5148 5090 } 5149 5091 if (err) { 5150 5092 mddev_detach(mddev); ··· 5302 5232 5303 5233 static void __md_stop_writes(struct mddev *mddev) 5304 5234 { 5235 + if (mddev_is_clustered(mddev)) 5236 + md_cluster_ops->metadata_update_start(mddev); 5305 5237 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5306 5238 flush_workqueue(md_misc_wq); 5307 5239 if (mddev->sync_thread) { ··· 5322 5250 mddev->in_sync = 1; 5323 5251 md_update_sb(mddev, 1); 5324 5252 } 5253 + if (mddev_is_clustered(mddev)) 5254 + md_cluster_ops->metadata_update_finish(mddev); 5325 5255 } 5326 5256 5327 5257 void md_stop_writes(struct mddev *mddev) ··· 5710 5636 info.state = (1<<MD_SB_CLEAN); 5711 5637 if (mddev->bitmap && mddev->bitmap_info.offset) 5712 5638 info.state |= (1<<MD_SB_BITMAP_PRESENT); 5639 + if (mddev_is_clustered(mddev)) 5640 + info.state |= (1<<MD_SB_CLUSTERED); 5713 5641 info.active_disks = insync; 5714 5642 info.working_disks = working; 5715 5643 info.failed_disks = failed; ··· 5767 5691 return -EFAULT; 5768 5692 5769 5693 rcu_read_lock(); 5770 - rdev = find_rdev_nr_rcu(mddev, info.number); 5694 + rdev = md_find_rdev_nr_rcu(mddev, info.number); 5771 5695 if (rdev) { 5772 5696 info.major = MAJOR(rdev->bdev->bd_dev); 5773 5697 info.minor = MINOR(rdev->bdev->bd_dev); ··· 5799 5723 char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE]; 5800 5724 struct md_rdev *rdev; 5801 5725 dev_t dev = 
MKDEV(info->major,info->minor); 5726 + 5727 + if (mddev_is_clustered(mddev) && 5728 + !(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) { 5729 + pr_err("%s: Cannot add to clustered mddev.\n", 5730 + mdname(mddev)); 5731 + return -EINVAL; 5732 + } 5802 5733 5803 5734 if (info->major != MAJOR(dev) || info->minor != MINOR(dev)) 5804 5735 return -EOVERFLOW; ··· 5893 5810 else 5894 5811 clear_bit(WriteMostly, &rdev->flags); 5895 5812 5813 + /* 5814 + * check whether the device shows up in other nodes 5815 + */ 5816 + if (mddev_is_clustered(mddev)) { 5817 + if (info->state & (1 << MD_DISK_CANDIDATE)) { 5818 + /* Through --cluster-confirm */ 5819 + set_bit(Candidate, &rdev->flags); 5820 + err = md_cluster_ops->new_disk_ack(mddev, true); 5821 + if (err) { 5822 + export_rdev(rdev); 5823 + return err; 5824 + } 5825 + } else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) { 5826 + /* --add initiated by this node */ 5827 + err = md_cluster_ops->add_new_disk_start(mddev, rdev); 5828 + if (err) { 5829 + md_cluster_ops->add_new_disk_finish(mddev); 5830 + export_rdev(rdev); 5831 + return err; 5832 + } 5833 + } 5834 + } 5835 + 5896 5836 rdev->raid_disk = -1; 5897 5837 err = bind_rdev_to_array(rdev, mddev); 5898 - if (!err && !mddev->pers->hot_remove_disk) { 5899 - /* If there is hot_add_disk but no hot_remove_disk 5900 - * then added disks for geometry changes, 5901 - * and should be added immediately. 5902 - */ 5903 - super_types[mddev->major_version]. 
5904 - validate_super(mddev, rdev); 5905 - err = mddev->pers->hot_add_disk(mddev, rdev); 5906 - if (err) 5907 - unbind_rdev_from_array(rdev); 5908 - } 5909 5838 if (err) 5910 5839 export_rdev(rdev); 5911 5840 else 5912 - sysfs_notify_dirent_safe(rdev->sysfs_state); 5913 - 5914 - set_bit(MD_CHANGE_DEVS, &mddev->flags); 5915 - if (mddev->degraded) 5916 - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 5917 - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 5918 - if (!err) 5919 - md_new_event(mddev); 5920 - md_wakeup_thread(mddev->thread); 5841 + err = add_bound_rdev(rdev); 5842 + if (mddev_is_clustered(mddev) && 5843 + (info->state & (1 << MD_DISK_CLUSTER_ADD))) 5844 + md_cluster_ops->add_new_disk_finish(mddev); 5921 5845 return err; 5922 5846 } 5923 5847 ··· 5985 5895 if (!rdev) 5986 5896 return -ENXIO; 5987 5897 5898 + if (mddev_is_clustered(mddev)) 5899 + md_cluster_ops->metadata_update_start(mddev); 5900 + 5988 5901 clear_bit(Blocked, &rdev->flags); 5989 5902 remove_and_add_spares(mddev, rdev); 5990 5903 5991 5904 if (rdev->raid_disk >= 0) 5992 5905 goto busy; 5993 5906 5994 - kick_rdev_from_array(rdev); 5907 + if (mddev_is_clustered(mddev)) 5908 + md_cluster_ops->remove_disk(mddev, rdev); 5909 + 5910 + md_kick_rdev_from_array(rdev); 5995 5911 md_update_sb(mddev, 1); 5996 5912 md_new_event(mddev); 5997 5913 5914 + if (mddev_is_clustered(mddev)) 5915 + md_cluster_ops->metadata_update_finish(mddev); 5916 + 5998 5917 return 0; 5999 5918 busy: 5919 + if (mddev_is_clustered(mddev)) 5920 + md_cluster_ops->metadata_update_cancel(mddev); 6000 5921 printk(KERN_WARNING "md: cannot remove active disk %s from %s ...\n", 6001 5922 bdevname(rdev->bdev,b), mdname(mddev)); 6002 5923 return -EBUSY; ··· 6057 5956 err = -EINVAL; 6058 5957 goto abort_export; 6059 5958 } 5959 + 5960 + if (mddev_is_clustered(mddev)) 5961 + md_cluster_ops->metadata_update_start(mddev); 6060 5962 clear_bit(In_sync, &rdev->flags); 6061 5963 rdev->desc_nr = -1; 6062 5964 rdev->saved_raid_disk = -1; 6063 
5965 err = bind_rdev_to_array(rdev, mddev); 6064 5966 if (err) 6065 - goto abort_export; 5967 + goto abort_clustered; 6066 5968 6067 5969 /* 6068 5970 * The rest should better be atomic, we can have disk failures ··· 6076 5972 6077 5973 md_update_sb(mddev, 1); 6078 5974 5975 + if (mddev_is_clustered(mddev)) 5976 + md_cluster_ops->metadata_update_finish(mddev); 6079 5977 /* 6080 5978 * Kick recovery, maybe this spare has to be added to the 6081 5979 * array immediately. ··· 6087 5981 md_new_event(mddev); 6088 5982 return 0; 6089 5983 5984 + abort_clustered: 5985 + if (mddev_is_clustered(mddev)) 5986 + md_cluster_ops->metadata_update_cancel(mddev); 6090 5987 abort_export: 6091 5988 export_rdev(rdev); 6092 5989 return err; ··· 6147 6038 if (mddev->pers) { 6148 6039 mddev->pers->quiesce(mddev, 1); 6149 6040 if (fd >= 0) { 6150 - err = bitmap_create(mddev); 6151 - if (!err) 6041 + struct bitmap *bitmap; 6042 + 6043 + bitmap = bitmap_create(mddev, -1); 6044 + if (!IS_ERR(bitmap)) { 6045 + mddev->bitmap = bitmap; 6152 6046 err = bitmap_load(mddev); 6047 + } else 6048 + err = PTR_ERR(bitmap); 6153 6049 } 6154 6050 if (fd < 0 || err) { 6155 6051 bitmap_destroy(mddev); ··· 6407 6293 return rv; 6408 6294 } 6409 6295 } 6296 + if (mddev_is_clustered(mddev)) 6297 + md_cluster_ops->metadata_update_start(mddev); 6410 6298 if (info->size >= 0 && mddev->dev_sectors / 2 != info->size) 6411 6299 rv = update_size(mddev, (sector_t)info->size * 2); 6412 6300 ··· 6416 6300 rv = update_raid_disks(mddev, info->raid_disks); 6417 6301 6418 6302 if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) { 6419 - if (mddev->pers->quiesce == NULL || mddev->thread == NULL) 6420 - return -EINVAL; 6421 - if (mddev->recovery || mddev->sync_thread) 6422 - return -EBUSY; 6303 + if (mddev->pers->quiesce == NULL || mddev->thread == NULL) { 6304 + rv = -EINVAL; 6305 + goto err; 6306 + } 6307 + if (mddev->recovery || mddev->sync_thread) { 6308 + rv = -EBUSY; 6309 + goto err; 6310 + } 6423 6311 if (info->state 
& (1<<MD_SB_BITMAP_PRESENT)) { 6312 + struct bitmap *bitmap; 6424 6313 /* add the bitmap */ 6425 - if (mddev->bitmap) 6426 - return -EEXIST; 6427 - if (mddev->bitmap_info.default_offset == 0) 6428 - return -EINVAL; 6314 + if (mddev->bitmap) { 6315 + rv = -EEXIST; 6316 + goto err; 6317 + } 6318 + if (mddev->bitmap_info.default_offset == 0) { 6319 + rv = -EINVAL; 6320 + goto err; 6321 + } 6429 6322 mddev->bitmap_info.offset = 6430 6323 mddev->bitmap_info.default_offset; 6431 6324 mddev->bitmap_info.space = 6432 6325 mddev->bitmap_info.default_space; 6433 6326 mddev->pers->quiesce(mddev, 1); 6434 - rv = bitmap_create(mddev); 6435 - if (!rv) 6327 + bitmap = bitmap_create(mddev, -1); 6328 + if (!IS_ERR(bitmap)) { 6329 + mddev->bitmap = bitmap; 6436 6330 rv = bitmap_load(mddev); 6331 + } else 6332 + rv = PTR_ERR(bitmap); 6437 6333 if (rv) 6438 6334 bitmap_destroy(mddev); 6439 6335 mddev->pers->quiesce(mddev, 0); 6440 6336 } else { 6441 6337 /* remove the bitmap */ 6442 - if (!mddev->bitmap) 6443 - return -ENOENT; 6444 - if (mddev->bitmap->storage.file) 6445 - return -EINVAL; 6338 + if (!mddev->bitmap) { 6339 + rv = -ENOENT; 6340 + goto err; 6341 + } 6342 + if (mddev->bitmap->storage.file) { 6343 + rv = -EINVAL; 6344 + goto err; 6345 + } 6446 6346 mddev->pers->quiesce(mddev, 1); 6447 6347 bitmap_destroy(mddev); 6448 6348 mddev->pers->quiesce(mddev, 0); ··· 6466 6334 } 6467 6335 } 6468 6336 md_update_sb(mddev, 1); 6337 + if (mddev_is_clustered(mddev)) 6338 + md_cluster_ops->metadata_update_finish(mddev); 6339 + return rv; 6340 + err: 6341 + if (mddev_is_clustered(mddev)) 6342 + md_cluster_ops->metadata_update_cancel(mddev); 6469 6343 return rv; 6470 6344 } 6471 6345 ··· 6531 6393 case SET_DISK_FAULTY: 6532 6394 case STOP_ARRAY: 6533 6395 case STOP_ARRAY_RO: 6396 + case CLUSTERED_DISK_NACK: 6534 6397 return true; 6535 6398 default: 6536 6399 return false; ··· 6803 6664 err = add_new_disk(mddev, &info); 6804 6665 goto unlock; 6805 6666 } 6667 + 6668 + case 
CLUSTERED_DISK_NACK: 6669 + if (mddev_is_clustered(mddev)) 6670 + md_cluster_ops->new_disk_ack(mddev, false); 6671 + else 6672 + err = -EINVAL; 6673 + goto unlock; 6806 6674 6807 6675 case HOT_ADD_DISK: 6808 6676 err = hot_add_disk(mddev, new_decode_dev(arg)); ··· 7384 7238 } 7385 7239 EXPORT_SYMBOL(unregister_md_personality); 7386 7240 7241 + int register_md_cluster_operations(struct md_cluster_operations *ops, struct module *module) 7242 + { 7243 + if (md_cluster_ops != NULL) 7244 + return -EALREADY; 7245 + spin_lock(&pers_lock); 7246 + md_cluster_ops = ops; 7247 + md_cluster_mod = module; 7248 + spin_unlock(&pers_lock); 7249 + return 0; 7250 + } 7251 + EXPORT_SYMBOL(register_md_cluster_operations); 7252 + 7253 + int unregister_md_cluster_operations(void) 7254 + { 7255 + spin_lock(&pers_lock); 7256 + md_cluster_ops = NULL; 7257 + spin_unlock(&pers_lock); 7258 + return 0; 7259 + } 7260 + EXPORT_SYMBOL(unregister_md_cluster_operations); 7261 + 7262 + int md_setup_cluster(struct mddev *mddev, int nodes) 7263 + { 7264 + int err; 7265 + 7266 + err = request_module("md-cluster"); 7267 + if (err) { 7268 + pr_err("md-cluster module not found.\n"); 7269 + return err; 7270 + } 7271 + 7272 + spin_lock(&pers_lock); 7273 + if (!md_cluster_ops || !try_module_get(md_cluster_mod)) { 7274 + spin_unlock(&pers_lock); 7275 + return -ENOENT; 7276 + } 7277 + spin_unlock(&pers_lock); 7278 + 7279 + return md_cluster_ops->join(mddev, nodes); 7280 + } 7281 + 7282 + void md_cluster_stop(struct mddev *mddev) 7283 + { 7284 + if (!md_cluster_ops) 7285 + return; 7286 + md_cluster_ops->leave(mddev); 7287 + module_put(md_cluster_mod); 7288 + } 7289 + 7387 7290 static int is_mddev_idle(struct mddev *mddev, int init) 7388 7291 { 7389 7292 struct md_rdev *rdev; ··· 7570 7375 mddev->safemode == 0) 7571 7376 mddev->safemode = 1; 7572 7377 spin_unlock(&mddev->lock); 7378 + if (mddev_is_clustered(mddev)) 7379 + md_cluster_ops->metadata_update_start(mddev); 7573 7380 md_update_sb(mddev, 0); 7381 + if 
(mddev_is_clustered(mddev)) 7382 + md_cluster_ops->metadata_update_finish(mddev); 7574 7383 sysfs_notify_dirent_safe(mddev->sysfs_state); 7575 7384 } else 7576 7385 spin_unlock(&mddev->lock); ··· 7775 7576 md_new_event(mddev); 7776 7577 update_time = jiffies; 7777 7578 7579 + if (mddev_is_clustered(mddev)) 7580 + md_cluster_ops->resync_start(mddev, j, max_sectors); 7581 + 7778 7582 blk_start_plug(&plug); 7779 7583 while (j < max_sectors) { 7780 7584 sector_t sectors; ··· 7820 7618 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 7821 7619 break; 7822 7620 7823 - sectors = mddev->pers->sync_request(mddev, j, &skipped, 7824 - currspeed < speed_min(mddev)); 7621 + sectors = mddev->pers->sync_request(mddev, j, &skipped); 7825 7622 if (sectors == 0) { 7826 7623 set_bit(MD_RECOVERY_INTR, &mddev->recovery); 7827 7624 break; ··· 7837 7636 j += sectors; 7838 7637 if (j > 2) 7839 7638 mddev->curr_resync = j; 7639 + if (mddev_is_clustered(mddev)) 7640 + md_cluster_ops->resync_info_update(mddev, j, max_sectors); 7840 7641 mddev->curr_mark_cnt = io_sectors; 7841 7642 if (last_check == 0) 7842 7643 /* this is the earliest that rebuild will be ··· 7880 7677 /((jiffies-mddev->resync_mark)/HZ +1) +1; 7881 7678 7882 7679 if (currspeed > speed_min(mddev)) { 7883 - if ((currspeed > speed_max(mddev)) || 7884 - !is_mddev_idle(mddev, 0)) { 7680 + if (currspeed > speed_max(mddev)) { 7885 7681 msleep(500); 7886 7682 goto repeat; 7683 + } 7684 + if (!is_mddev_idle(mddev, 0)) { 7685 + /* 7686 + * Give other IO more of a chance. 7687 + * The faster the devices, the less we wait. 
7688 + */ 7689 + wait_event(mddev->recovery_wait, 7690 + !atomic_read(&mddev->recovery_active)); 7887 7691 } 7888 7692 } 7889 7693 } ··· 7904 7694 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); 7905 7695 7906 7696 /* tell personality that we are finished */ 7907 - mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); 7697 + mddev->pers->sync_request(mddev, max_sectors, &skipped); 7698 + 7699 + if (mddev_is_clustered(mddev)) 7700 + md_cluster_ops->resync_finish(mddev); 7908 7701 7909 7702 if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 7910 7703 mddev->curr_resync > 2) { ··· 8138 7925 sysfs_notify_dirent_safe(mddev->sysfs_state); 8139 7926 } 8140 7927 8141 - if (mddev->flags & MD_UPDATE_SB_FLAGS) 7928 + if (mddev->flags & MD_UPDATE_SB_FLAGS) { 7929 + if (mddev_is_clustered(mddev)) 7930 + md_cluster_ops->metadata_update_start(mddev); 8142 7931 md_update_sb(mddev, 0); 7932 + if (mddev_is_clustered(mddev)) 7933 + md_cluster_ops->metadata_update_finish(mddev); 7934 + } 8143 7935 8144 7936 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && 8145 7937 !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { ··· 8242 8024 set_bit(MD_CHANGE_DEVS, &mddev->flags); 8243 8025 } 8244 8026 } 8027 + if (mddev_is_clustered(mddev)) 8028 + md_cluster_ops->metadata_update_start(mddev); 8245 8029 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 8246 8030 mddev->pers->finish_reshape) 8247 8031 mddev->pers->finish_reshape(mddev); ··· 8256 8036 rdev->saved_raid_disk = -1; 8257 8037 8258 8038 md_update_sb(mddev, 1); 8039 + if (mddev_is_clustered(mddev)) 8040 + md_cluster_ops->metadata_update_finish(mddev); 8259 8041 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); 8260 8042 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); 8261 8043 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); ··· 8877 8655 err_wq: 8878 8656 return ret; 8879 8657 } 8658 + 8659 + void md_reload_sb(struct mddev *mddev) 8660 + { 8661 + struct md_rdev *rdev, *tmp; 8662 + 8663 + 
rdev_for_each_safe(rdev, tmp, mddev) { 8664 + rdev->sb_loaded = 0; 8665 + ClearPageUptodate(rdev->sb_page); 8666 + } 8667 + mddev->raid_disks = 0; 8668 + analyze_sbs(mddev); 8669 + rdev_for_each_safe(rdev, tmp, mddev) { 8670 + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); 8671 + /* since we don't write to faulty devices, we figure out if the 8672 + * disk is faulty by comparing events 8673 + */ 8674 + if (mddev->events > sb->events) 8675 + set_bit(Faulty, &rdev->flags); 8676 + } 8677 + 8678 + } 8679 + EXPORT_SYMBOL(md_reload_sb); 8880 8680 8881 8681 #ifndef MODULE 8882 8682
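The resync_min hunk in md.c above replaces the old chunk-multiple check with `round_down(min, 8)`, pinning `resync_min` to a 4K boundary (8 × 512-byte sectors), as its new comment says. For readers unfamiliar with the kernel macro, here is a minimal userspace sketch of its power-of-two case; the helper name is ours, not the kernel symbol:

```c
#include <assert.h>

/* Userspace sketch of the kernel's round_down() for power-of-two
 * alignment: clearing the low bits leaves the nearest lower multiple
 * of n. With n = 8 sectors (512 bytes each) this is the 4K rounding
 * applied to resync_min in the hunk above. Only valid when n is a
 * power of two, as in the kernel macro's intended use. */
static unsigned long long round_down_pow2(unsigned long long x,
					  unsigned long long n)
{
	return x & ~(n - 1ULL);
}
```

The old code rejected values not aligned to the chunk size; the new code silently fixes the alignment instead, which is friendlier to scripts writing arbitrary sector counts.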
+25 -1
drivers/md/md.h
··· 23 23 #include <linux/timer.h> 24 24 #include <linux/wait.h> 25 25 #include <linux/workqueue.h> 26 + #include "md-cluster.h" 26 27 27 28 #define MaxSector (~(sector_t)0) 28 29 ··· 171 170 * a want_replacement device with same 172 171 * raid_disk number. 173 172 */ 173 + Candidate, /* For clustered environments only: 174 + * This device is seen locally but not 175 + * by the whole cluster 176 + */ 174 177 }; 175 178 176 179 #define BB_LEN_MASK (0x00000000000001FFULL) ··· 206 201 extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 207 202 int is_new); 208 203 extern void md_ack_all_badblocks(struct badblocks *bb); 204 + 205 + struct md_cluster_info; 209 206 210 207 struct mddev { 211 208 void *private; ··· 437 430 unsigned long daemon_sleep; /* how many jiffies between updates? */ 438 431 unsigned long max_write_behind; /* write-behind mode */ 439 432 int external; 433 + int nodes; /* Maximum number of nodes in the cluster */ 434 + char cluster_name[64]; /* Name of the cluster */ 440 435 } bitmap_info; 441 436 442 437 atomic_t max_corr_read_errors; /* max read retries */ ··· 457 448 struct work_struct flush_work; 458 449 struct work_struct event_work; /* used by dm to report failure event */ 459 450 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); 451 + struct md_cluster_info *cluster_info; 460 452 }; 461 453 462 454 static inline int __must_check mddev_lock(struct mddev *mddev) ··· 506 496 int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); 507 497 int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); 508 498 int (*spare_active) (struct mddev *mddev); 509 - sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster); 499 + sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped); 510 500 int (*resize) (struct mddev *mddev, sector_t sectors); 511 501 sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); 512 502 int 
(*check_reshape) (struct mddev *mddev); ··· 618 608 619 609 extern int register_md_personality(struct md_personality *p); 620 610 extern int unregister_md_personality(struct md_personality *p); 611 + extern int register_md_cluster_operations(struct md_cluster_operations *ops, 612 + struct module *module); 613 + extern int unregister_md_cluster_operations(void); 614 + extern int md_setup_cluster(struct mddev *mddev, int nodes); 615 + extern void md_cluster_stop(struct mddev *mddev); 621 616 extern struct md_thread *md_register_thread( 622 617 void (*run)(struct md_thread *thread), 623 618 struct mddev *mddev, ··· 669 654 struct mddev *mddev); 670 655 671 656 extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); 657 + extern void md_reload_sb(struct mddev *mddev); 658 + extern void md_update_sb(struct mddev *mddev, int force); 659 + extern void md_kick_rdev_from_array(struct md_rdev * rdev); 660 + struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); 672 661 static inline int mddev_check_plugged(struct mddev *mddev) 673 662 { 674 663 return !!blk_check_plugged(md_unplug, mddev, ··· 688 669 } 689 670 } 690 671 672 + extern struct md_cluster_operations *md_cluster_ops; 673 + static inline int mddev_is_clustered(struct mddev *mddev) 674 + { 675 + return mddev->cluster_info && mddev->bitmap_info.nodes > 1; 676 + } 691 677 #endif /* _MD_MD_H */
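The `mddev_is_clustered()` inline added at the bottom of md.h above gates every cluster hook in md.c. A standalone sketch of the same predicate (with a reduced stand-in struct, not the real `struct mddev`) shows the two conditions that must both hold:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the mddev_is_clustered() test added to md.h above: an
 * array counts as clustered only when cluster_info has been set up
 * (by md_setup_cluster()) AND the bitmap is configured for more than
 * one node. The struct below keeps just the two fields the inline
 * helper reads. */
struct mddev_sketch {
	void *cluster_info;	/* stands in for mddev->cluster_info */
	int nodes;		/* stands in for bitmap_info.nodes */
};

static int mddev_is_clustered_sketch(const struct mddev_sketch *m)
{
	return m->cluster_info != NULL && m->nodes > 1;
}
```

This is why a single-node array with md-cluster loaded still takes none of the `metadata_update_start()`/`finish()` paths: `nodes > 1` fails even though `cluster_info` may exist.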
+26 -20
drivers/md/raid0.c
··· 271 271 goto abort; 272 272 } 273 273 274 - blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); 275 - blk_queue_io_opt(mddev->queue, 276 - (mddev->chunk_sectors << 9) * mddev->raid_disks); 274 + if (mddev->queue) { 275 + blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); 276 + blk_queue_io_opt(mddev->queue, 277 + (mddev->chunk_sectors << 9) * mddev->raid_disks); 277 278 278 - if (!discard_supported) 279 - queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 280 - else 281 - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 279 + if (!discard_supported) 280 + queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 281 + else 282 + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 283 + } 282 284 283 285 pr_debug("md/raid0:%s: done.\n", mdname(mddev)); 284 286 *private_conf = conf; ··· 431 429 } 432 430 if (md_check_no_bitmap(mddev)) 433 431 return -EINVAL; 434 - blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); 435 - blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); 436 - blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); 432 + 433 + if (mddev->queue) { 434 + blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); 435 + blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); 436 + blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); 437 + } 437 438 438 439 /* if private is not null, we are here after takeover */ 439 440 if (mddev->private == NULL) { ··· 453 448 printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n", 454 449 mdname(mddev), 455 450 (unsigned long long)mddev->array_sectors); 456 - /* calculate the max read-ahead size. 457 - * For read-ahead of large files to be effective, we need to 458 - * readahead at least twice a whole stripe. i.e. number of devices 459 - * multiplied by chunk size times 2. 
460 - * If an individual device has an ra_pages greater than the 461 - * chunk size, then we will not drive that device as hard as it 462 - * wants. We consider this a configuration error: a larger 463 - * chunksize should be used in that case. 464 - */ 465 - { 451 + 452 + if (mddev->queue) { 453 + /* calculate the max read-ahead size. 454 + * For read-ahead of large files to be effective, we need to 455 + * readahead at least twice a whole stripe. i.e. number of devices 456 + * multiplied by chunk size times 2. 457 + * If an individual device has an ra_pages greater than the 458 + * chunk size, then we will not drive that device as hard as it 459 + * wants. We consider this a configuration error: a larger 460 + * chunksize should be used in that case. 461 + */ 466 462 int stripe = mddev->raid_disks * 467 463 (mddev->chunk_sectors << 9) / PAGE_SIZE; 468 464 if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
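The read-ahead rule in raid0_run() above is unchanged by this commit, but is now guarded by `if (mddev->queue)` so queue-less stacks (dm-raid) skip it. The arithmetic it describes — read ahead at least two full stripes — can be sketched as follows, assuming 4K pages; the helper name is ours:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4K page size */

/* Sketch of raid0's read-ahead target from the comment above: two
 * whole stripes, i.e. number of devices times chunk bytes, times two,
 * expressed in pages. chunk_sectors are 512-byte sectors, hence the
 * << 9 to get bytes. */
static unsigned long long raid0_min_ra_pages(int raid_disks,
					     unsigned long long chunk_sectors)
{
	unsigned long long stripe =
		raid_disks * (chunk_sectors << 9) / SKETCH_PAGE_SIZE;
	return 2 * stripe;
}
```

E.g. four devices with 512K chunks give a 2M stripe, so the driver wants at least 1024 pages of read-ahead before sequential reads keep every spindle busy.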
+17 -12
drivers/md/raid1.c
··· 539 539 has_nonrot_disk = 0; 540 540 choose_next_idle = 0; 541 541 542 - choose_first = (conf->mddev->recovery_cp < this_sector + sectors); 542 + if ((conf->mddev->recovery_cp < this_sector + sectors) || 543 + (mddev_is_clustered(conf->mddev) && 544 + md_cluster_ops->area_resyncing(conf->mddev, this_sector, 545 + this_sector + sectors))) 546 + choose_first = 1; 547 + else 548 + choose_first = 0; 543 549 544 550 for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { 545 551 sector_t dist; ··· 1108 1102 md_write_start(mddev, bio); /* wait on superblock update early */ 1109 1103 1110 1104 if (bio_data_dir(bio) == WRITE && 1111 - bio_end_sector(bio) > mddev->suspend_lo && 1112 - bio->bi_iter.bi_sector < mddev->suspend_hi) { 1105 + ((bio_end_sector(bio) > mddev->suspend_lo && 1106 + bio->bi_iter.bi_sector < mddev->suspend_hi) || 1107 + (mddev_is_clustered(mddev) && 1108 + md_cluster_ops->area_resyncing(mddev, bio->bi_iter.bi_sector, bio_end_sector(bio))))) { 1113 1109 /* As the suspend_* range is controlled by 1114 1110 * userspace, we want an interruptible 1115 1111 * wait. ··· 1122 1114 prepare_to_wait(&conf->wait_barrier, 1123 1115 &w, TASK_INTERRUPTIBLE); 1124 1116 if (bio_end_sector(bio) <= mddev->suspend_lo || 1125 - bio->bi_iter.bi_sector >= mddev->suspend_hi) 1117 + bio->bi_iter.bi_sector >= mddev->suspend_hi || 1118 + (mddev_is_clustered(mddev) && 1119 + !md_cluster_ops->area_resyncing(mddev, 1120 + bio->bi_iter.bi_sector, bio_end_sector(bio)))) 1126 1121 break; 1127 1122 schedule(); 1128 1123 } ··· 1572 1561 struct md_rdev *rdev = conf->mirrors[i].rdev; 1573 1562 struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev; 1574 1563 if (repl 1564 + && !test_bit(Candidate, &repl->flags) 1575 1565 && repl->recovery_offset == MaxSector 1576 1566 && !test_bit(Faulty, &repl->flags) 1577 1567 && !test_and_set_bit(In_sync, &repl->flags)) { ··· 2480 2468 * that can be installed to exclude normal IO requests. 
2481 2469 */ 2482 2470 2483 - static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) 2471 + static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) 2484 2472 { 2485 2473 struct r1conf *conf = mddev->private; 2486 2474 struct r1bio *r1_bio; ··· 2533 2521 *skipped = 1; 2534 2522 return sync_blocks; 2535 2523 } 2536 - /* 2537 - * If there is non-resync activity waiting for a turn, 2538 - * and resync is going fast enough, 2539 - * then let it though before starting on this new sync request. 2540 - */ 2541 - if (!go_faster && conf->nr_waiting) 2542 - msleep_interruptible(1000); 2543 2524 2544 2525 bitmap_cond_end_sync(mddev->bitmap, sector_nr); 2545 2526 r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
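The widened `choose_first` test in raid1's read_balance() above forces reads to the first in-sync disk not only behind the local resync frontier but also, on a clustered array, when another node reports the area as resyncing. A condensed sketch of that decision, with `area_resyncing` as a precomputed flag standing in for the `md_cluster_ops->area_resyncing()` call:

```c
#include <assert.h>

/* Sketch of the extended choose_first decision in read_balance()
 * above: prefer the first disk when the read range overlaps the
 * local resync frontier (recovery_cp), or when the cluster layer
 * says a peer node is resyncing that area. The boolean parameters
 * replace the mddev_is_clustered()/area_resyncing() calls of the
 * real code. */
static int choose_first_sketch(unsigned long long recovery_cp,
			       unsigned long long this_sector,
			       unsigned long long sectors,
			       int clustered, int area_resyncing)
{
	if (recovery_cp < this_sector + sectors)
		return 1;
	return clustered && area_resyncing;
}
```

The same pattern repeats in make_request() above, where writes into a peer's resyncing window are made to wait just like writes into the local suspend_lo/suspend_hi range.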
+1 -7
drivers/md/raid10.c
··· 2889 2889 */
2890 2890
2891 2891 static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
2892 - int *skipped, int go_faster)
2892 + int *skipped)
2893 2893 {
2894 2894 struct r10conf *conf = mddev->private;
2895 2895 struct r10bio *r10_bio;
··· 2994 2994 if (conf->geo.near_copies < conf->geo.raid_disks &&
2995 2995 max_sector > (sector_nr | chunk_mask))
2996 2996 max_sector = (sector_nr | chunk_mask) + 1;
2997 - /*
2998 - * If there is non-resync activity waiting for us then
2999 - * put in a delay to throttle resync.
3000 - */
3001 - if (!go_faster && conf->nr_waiting)
3002 - msleep_interruptible(1000);
3003 2997
3004 2998 /* Again, very different code for resync and recovery.
3005 2999 * Both must result in an r10bio with a list of bios that
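With `go_faster` removed from sync_request() here and in raid1.c above, throttling decisions live entirely in md_do_sync(), which estimates the current resync speed from the last mark. The divisor `(jiffies - mddev->resync_mark)/HZ + 1` is visible in the md.c context above; the numerator is elided there, so this sketch assumes the usual md_do_sync() form and takes elapsed time already converted from jiffies to seconds:

```c
#include <assert.h>

/* Sketch of md_do_sync()'s resync speed estimate: sectors completed
 * since the last mark, halved to KiB, divided by elapsed seconds.
 * The +1 terms guard against division by zero and a reported speed
 * of zero. Assumed form; the kernel works in jiffies and HZ. */
static unsigned long currspeed_kib_per_sec(unsigned long long io_sectors,
					   unsigned long long mark_cnt,
					   unsigned long elapsed_sec)
{
	return (unsigned long)((io_sectors - mark_cnt) / 2)
		/ (elapsed_sec + 1) + 1;
}
```

This is the value compared against speed_min()/speed_max() above: per the merge summary, resync above the minimum now backs off by waiting for in-flight recovery IO to drain rather than by a fixed one-second sleep, so faster devices wait less.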
+690 -134
drivers/md/raid5.c
··· 54 54 #include <linux/slab.h> 55 55 #include <linux/ratelimit.h> 56 56 #include <linux/nodemask.h> 57 + #include <linux/flex_array.h> 57 58 #include <trace/events/block.h> 58 59 59 60 #include "md.h" ··· 497 496 } 498 497 } 499 498 500 - static int grow_buffers(struct stripe_head *sh) 499 + static int grow_buffers(struct stripe_head *sh, gfp_t gfp) 501 500 { 502 501 int i; 503 502 int num = sh->raid_conf->pool_size; ··· 505 504 for (i = 0; i < num; i++) { 506 505 struct page *page; 507 506 508 - if (!(page = alloc_page(GFP_KERNEL))) { 507 + if (!(page = alloc_page(gfp))) { 509 508 return 1; 510 509 } 511 510 sh->dev[i].page = page; ··· 526 525 BUG_ON(atomic_read(&sh->count) != 0); 527 526 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); 528 527 BUG_ON(stripe_operations_active(sh)); 528 + BUG_ON(sh->batch_head); 529 529 530 530 pr_debug("init_stripe called, stripe %llu\n", 531 531 (unsigned long long)sector); ··· 554 552 } 555 553 if (read_seqcount_retry(&conf->gen_lock, seq)) 556 554 goto retry; 555 + sh->overwrite_disks = 0; 557 556 insert_hash(conf, sh); 558 557 sh->cpu = smp_processor_id(); 558 + set_bit(STRIPE_BATCH_READY, &sh->state); 559 559 } 560 560 561 561 static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector, ··· 672 668 *(conf->hash_locks + hash)); 673 669 sh = __find_stripe(conf, sector, conf->generation - previous); 674 670 if (!sh) { 675 - if (!conf->inactive_blocked) 671 + if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) { 676 672 sh = get_free_stripe(conf, hash); 673 + if (!sh && llist_empty(&conf->released_stripes) && 674 + !test_bit(R5_DID_ALLOC, &conf->cache_state)) 675 + set_bit(R5_ALLOC_MORE, 676 + &conf->cache_state); 677 + } 677 678 if (noblock && sh == NULL) 678 679 break; 679 680 if (!sh) { 680 - conf->inactive_blocked = 1; 681 + set_bit(R5_INACTIVE_BLOCKED, 682 + &conf->cache_state); 681 683 wait_event_lock_irq( 682 684 conf->wait_for_stripe, 683 685 !list_empty(conf->inactive_list + hash) && 684 686 
(atomic_read(&conf->active_stripes) 685 687 < (conf->max_nr_stripes * 3 / 4) 686 - || !conf->inactive_blocked), 688 + || !test_bit(R5_INACTIVE_BLOCKED, 689 + &conf->cache_state)), 687 690 *(conf->hash_locks + hash)); 688 - conf->inactive_blocked = 0; 691 + clear_bit(R5_INACTIVE_BLOCKED, 692 + &conf->cache_state); 689 693 } else { 690 694 init_stripe(sh, sector, previous); 691 695 atomic_inc(&sh->count); ··· 718 706 719 707 spin_unlock_irq(conf->hash_locks + hash); 720 708 return sh; 709 + } 710 + 711 + static bool is_full_stripe_write(struct stripe_head *sh) 712 + { 713 + BUG_ON(sh->overwrite_disks > (sh->disks - sh->raid_conf->max_degraded)); 714 + return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded); 715 + } 716 + 717 + static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) 718 + { 719 + local_irq_disable(); 720 + if (sh1 > sh2) { 721 + spin_lock(&sh2->stripe_lock); 722 + spin_lock_nested(&sh1->stripe_lock, 1); 723 + } else { 724 + spin_lock(&sh1->stripe_lock); 725 + spin_lock_nested(&sh2->stripe_lock, 1); 726 + } 727 + } 728 + 729 + static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) 730 + { 731 + spin_unlock(&sh1->stripe_lock); 732 + spin_unlock(&sh2->stripe_lock); 733 + local_irq_enable(); 734 + } 735 + 736 + /* Only freshly new full stripe normal write stripe can be added to a batch list */ 737 + static bool stripe_can_batch(struct stripe_head *sh) 738 + { 739 + return test_bit(STRIPE_BATCH_READY, &sh->state) && 740 + is_full_stripe_write(sh); 741 + } 742 + 743 + /* we only do back search */ 744 + static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh) 745 + { 746 + struct stripe_head *head; 747 + sector_t head_sector, tmp_sec; 748 + int hash; 749 + int dd_idx; 750 + 751 + if (!stripe_can_batch(sh)) 752 + return; 753 + /* Don't cross chunks, so stripe pd_idx/qd_idx is the same */ 754 + tmp_sec = sh->sector; 755 + if (!sector_div(tmp_sec, conf->chunk_sectors)) 756 
+ return; 757 + head_sector = sh->sector - STRIPE_SECTORS; 758 + 759 + hash = stripe_hash_locks_hash(head_sector); 760 + spin_lock_irq(conf->hash_locks + hash); 761 + head = __find_stripe(conf, head_sector, conf->generation); 762 + if (head && !atomic_inc_not_zero(&head->count)) { 763 + spin_lock(&conf->device_lock); 764 + if (!atomic_read(&head->count)) { 765 + if (!test_bit(STRIPE_HANDLE, &head->state)) 766 + atomic_inc(&conf->active_stripes); 767 + BUG_ON(list_empty(&head->lru) && 768 + !test_bit(STRIPE_EXPANDING, &head->state)); 769 + list_del_init(&head->lru); 770 + if (head->group) { 771 + head->group->stripes_cnt--; 772 + head->group = NULL; 773 + } 774 + } 775 + atomic_inc(&head->count); 776 + spin_unlock(&conf->device_lock); 777 + } 778 + spin_unlock_irq(conf->hash_locks + hash); 779 + 780 + if (!head) 781 + return; 782 + if (!stripe_can_batch(head)) 783 + goto out; 784 + 785 + lock_two_stripes(head, sh); 786 + /* clear_batch_ready clear the flag */ 787 + if (!stripe_can_batch(head) || !stripe_can_batch(sh)) 788 + goto unlock_out; 789 + 790 + if (sh->batch_head) 791 + goto unlock_out; 792 + 793 + dd_idx = 0; 794 + while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx) 795 + dd_idx++; 796 + if (head->dev[dd_idx].towrite->bi_rw != sh->dev[dd_idx].towrite->bi_rw) 797 + goto unlock_out; 798 + 799 + if (head->batch_head) { 800 + spin_lock(&head->batch_head->batch_lock); 801 + /* This batch list is already running */ 802 + if (!stripe_can_batch(head)) { 803 + spin_unlock(&head->batch_head->batch_lock); 804 + goto unlock_out; 805 + } 806 + 807 + /* 808 + * at this point, head's BATCH_READY could be cleared, but we 809 + * can still add the stripe to batch list 810 + */ 811 + list_add(&sh->batch_list, &head->batch_list); 812 + spin_unlock(&head->batch_head->batch_lock); 813 + 814 + sh->batch_head = head->batch_head; 815 + } else { 816 + head->batch_head = head; 817 + sh->batch_head = head->batch_head; 818 + spin_lock(&head->batch_lock); 819 + 
list_add_tail(&sh->batch_list, &head->batch_list); 820 + spin_unlock(&head->batch_lock); 821 + } 822 + 823 + if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) 824 + if (atomic_dec_return(&conf->preread_active_stripes) 825 + < IO_THRESHOLD) 826 + md_wakeup_thread(conf->mddev->thread); 827 + 828 + atomic_inc(&sh->count); 829 + unlock_out: 830 + unlock_two_stripes(head, sh); 831 + out: 832 + release_stripe(head); 721 833 } 722 834 723 835 /* Determine if 'data_offset' or 'new_data_offset' should be used ··· 874 738 { 875 739 struct r5conf *conf = sh->raid_conf; 876 740 int i, disks = sh->disks; 741 + struct stripe_head *head_sh = sh; 877 742 878 743 might_sleep(); 879 744 ··· 883 746 int replace_only = 0; 884 747 struct bio *bi, *rbi; 885 748 struct md_rdev *rdev, *rrdev = NULL; 749 + 750 + sh = head_sh; 886 751 if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) { 887 752 if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags)) 888 753 rw = WRITE_FUA; ··· 903 764 if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags)) 904 765 rw |= REQ_SYNC; 905 766 767 + again: 906 768 bi = &sh->dev[i].req; 907 769 rbi = &sh->dev[i].rreq; /* For writing to replacement */ 908 770 ··· 922 782 /* We raced and saw duplicates */ 923 783 rrdev = NULL; 924 784 } else { 925 - if (test_bit(R5_ReadRepl, &sh->dev[i].flags) && rrdev) 785 + if (test_bit(R5_ReadRepl, &head_sh->dev[i].flags) && rrdev) 926 786 rdev = rrdev; 927 787 rrdev = NULL; 928 788 } ··· 993 853 __func__, (unsigned long long)sh->sector, 994 854 bi->bi_rw, i); 995 855 atomic_inc(&sh->count); 856 + if (sh != head_sh) 857 + atomic_inc(&head_sh->count); 996 858 if (use_new_offset(conf, sh)) 997 859 bi->bi_iter.bi_sector = (sh->sector 998 860 + rdev->new_data_offset); 999 861 else 1000 862 bi->bi_iter.bi_sector = (sh->sector 1001 863 + rdev->data_offset); 1002 - if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) 864 + if (test_bit(R5_ReadNoMerge, &head_sh->dev[i].flags)) 1003 865 bi->bi_rw |= REQ_NOMERGE; 1004 866 1005 867 
 	if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
···
 				__func__, (unsigned long long)sh->sector,
 				rbi->bi_rw, i);
 			atomic_inc(&sh->count);
+			if (sh != head_sh)
+				atomic_inc(&head_sh->count);
 			if (use_new_offset(conf, sh))
 				rbi->bi_iter.bi_sector = (sh->sector
 						  + rrdev->new_data_offset);
···
 			pr_debug("skip op %ld on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
 			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			if (sh->batch_head)
+				set_bit(STRIPE_BATCH_ERR,
+					&sh->batch_head->state);
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
+
+		if (!head_sh->batch_head)
+			continue;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		if (sh != head_sh)
+			goto again;
 	}
 }
 
···
 	struct async_submit_ctl submit;
 	int i;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
···
 
 /* return a pointer to the address conversion region of the scribble buffer */
 static addr_conv_t *to_addr_conv(struct stripe_head *sh,
-				 struct raid5_percpu *percpu)
+				 struct raid5_percpu *percpu, int i)
 {
-	return percpu->scribble + sizeof(struct page *) * (sh->disks + 2);
+	void *addr;
+
+	addr = flex_array_get(percpu->scribble, i);
+	return addr + sizeof(struct page *) * (sh->disks + 2);
+}
+
+/* return a pointer to the address conversion region of the scribble buffer */
+static struct page **to_addr_page(struct raid5_percpu *percpu, int i)
+{
+	void *addr;
+
+	addr = flex_array_get(percpu->scribble, i);
+	return addr;
 }
 
 static struct dma_async_tx_descriptor *
 ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	int target = sh->ops.target;
 	struct r5dev *tgt = &sh->dev[target];
 	struct page *xor_dest = tgt->page;
···
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int i;
+
+	BUG_ON(sh->batch_head);
 
 	pr_debug("%s: stripe %llu block: %d\n",
 		__func__, (unsigned long long)sh->sector, target);
···
 	atomic_inc(&sh->count);
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
-			  ops_complete_compute, sh, to_addr_conv(sh, percpu));
+			  ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
 	if (unlikely(count == 1))
 		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
 	else
···
  * destination buffer is recorded in srcs[count] and the Q destination
  * is recorded in srcs[count+1]].
  */
-static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh)
+static int set_syndrome_sources(struct page **srcs,
+				struct stripe_head *sh,
+				int srctype)
 {
 	int disks = sh->disks;
 	int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
···
 	i = d0_idx;
 	do {
 		int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
+		struct r5dev *dev = &sh->dev[i];
 
-		srcs[slot] = sh->dev[i].page;
+		if (i == sh->qd_idx || i == sh->pd_idx ||
+		    (srctype == SYNDROME_SRC_ALL) ||
+		    (srctype == SYNDROME_SRC_WANT_DRAIN &&
+		     test_bit(R5_Wantdrain, &dev->flags)) ||
+		    (srctype == SYNDROME_SRC_WRITTEN &&
+		     dev->written))
+			srcs[slot] = sh->dev[i].page;
 		i = raid6_next_disk(i, disks);
 	} while (i != d0_idx);
 
···
 ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
 	int disks = sh->disks;
-	struct page **blocks = percpu->scribble;
+	struct page **blocks = to_addr_page(percpu, 0);
 	int target;
 	int qd_idx = sh->qd_idx;
 	struct dma_async_tx_descriptor *tx;
···
 	int i;
 	int count;
 
+	BUG_ON(sh->batch_head);
 	if (sh->ops.target < 0)
 		target = sh->ops.target2;
 	else if (sh->ops.target2 < 0)
···
 	atomic_inc(&sh->count);
 
 	if (target == qd_idx) {
-		count = set_syndrome_sources(blocks, sh);
+		count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
 		blocks[count] = NULL; /* regenerating p is not necessary */
 		BUG_ON(blocks[count+1] != dest); /* q should already be set */
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
 	} else {
 		/* Compute any data- or p-drive using XOR */
···
 
 		init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
 				  NULL, ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit);
 	}
 
···
 	struct r5dev *tgt = &sh->dev[target];
 	struct r5dev *tgt2 = &sh->dev[target2];
 	struct dma_async_tx_descriptor *tx;
-	struct page **blocks = percpu->scribble;
+	struct page **blocks = to_addr_page(percpu, 0);
 	struct async_submit_ctl submit;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu block1: %d block2: %d\n",
 		 __func__, (unsigned long long)sh->sector, target, target2);
 	BUG_ON(target < 0 || target2 < 0);
···
 		/* Missing P+Q, just recompute */
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		return async_gen_syndrome(blocks, 0, syndrome_disks+2,
 					  STRIPE_SIZE, &submit);
 	} else {
···
 			init_async_submit(&submit,
 					  ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
 					  NULL, NULL, NULL,
-					  to_addr_conv(sh, percpu));
+					  to_addr_conv(sh, percpu, 0));
 			tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE,
 				       &submit);
 
-			count = set_syndrome_sources(blocks, sh);
+			count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
 			init_async_submit(&submit, ASYNC_TX_FENCE, tx,
 					  ops_complete_compute, sh,
-					  to_addr_conv(sh, percpu));
+					  to_addr_conv(sh, percpu, 0));
 			return async_gen_syndrome(blocks, 0, count+2,
 						  STRIPE_SIZE, &submit);
 		}
 	} else {
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		if (failb == syndrome_disks) {
 			/* We're missing D+P. */
 			return async_raid6_datap_recov(syndrome_disks+2,
···
 	}
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu,
-	       struct dma_async_tx_descriptor *tx)
+ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
+		struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct async_submit_ctl submit;
 
 	/* existing parity data subtracted */
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
···
 	}
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
-			  ops_complete_prexor, sh, to_addr_conv(sh, percpu));
+			  ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
 	tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+
+	return tx;
+}
+
+static struct dma_async_tx_descriptor *
+ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
+		struct dma_async_tx_descriptor *tx)
+{
+	struct page **blocks = to_addr_page(percpu, 0);
+	int count;
+	struct async_submit_ctl submit;
+
+	pr_debug("%s: stripe %llu\n", __func__,
+		(unsigned long long)sh->sector);
+
+	count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
+
+	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
+			  ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
+	tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
 
 	return tx;
 }
···
 {
 	int disks = sh->disks;
 	int i;
+	struct stripe_head *head_sh = sh;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
 	for (i = disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
+		struct r5dev *dev;
 		struct bio *chosen;
 
-		if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
+		sh = head_sh;
+		if (test_and_clear_bit(R5_Wantdrain, &head_sh->dev[i].flags)) {
 			struct bio *wbi;
 
+again:
+			dev = &sh->dev[i];
 			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
+			sh->overwrite_disks = 0;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
 			spin_unlock_irq(&sh->stripe_lock);
···
 				}
 			}
 			wbi = r5_next_bio(wbi, dev->sector);
+		}
+
+		if (head_sh->batch_head) {
+			sh = list_first_entry(&sh->batch_list,
+					      struct stripe_head,
+					      batch_list);
+			if (sh == head_sh)
+				continue;
+			goto again;
 		}
 	}
 }
···
 			  struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs;
 	struct async_submit_ctl submit;
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = 0;
 	unsigned long flags;
+	int j = 0;
+	struct stripe_head *head_sh = sh;
+	int last_stripe;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
···
 		ops_complete_reconstruct(sh);
 		return;
 	}
+again:
+	count = 0;
+	xor_srcs = to_addr_page(percpu, j);
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (written)
 	 */
-	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+	if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
 		prexor = 1;
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (dev->written)
+			if (head_sh->dev[i].written)
 				xor_srcs[count++] = dev->page;
 		}
 	} else {
···
 	 * set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST
 	 * for the synchronous xor case
 	 */
-	flags = ASYNC_TX_ACK |
-		(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
+	last_stripe = !head_sh->batch_head ||
+		list_first_entry(&sh->batch_list,
+				 struct stripe_head, batch_list) == head_sh;
+	if (last_stripe) {
+		flags = ASYNC_TX_ACK |
+			(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
 
-	atomic_inc(&sh->count);
+		atomic_inc(&head_sh->count);
+		init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh,
+				  to_addr_conv(sh, percpu, j));
+	} else {
+		flags = prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST;
+		init_async_submit(&submit, flags, tx, NULL, NULL,
+				  to_addr_conv(sh, percpu, j));
+	}
 
-	init_async_submit(&submit, flags, tx, ops_complete_reconstruct, sh,
-			  to_addr_conv(sh, percpu));
 	if (unlikely(count == 1))
 		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
 	else
 		tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+	if (!last_stripe) {
+		j++;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		goto again;
+	}
 }
 
 static void
···
 			  struct dma_async_tx_descriptor *tx)
 {
 	struct async_submit_ctl submit;
-	struct page **blocks = percpu->scribble;
-	int count, i;
+	struct page **blocks;
+	int count, i, j = 0;
+	struct stripe_head *head_sh = sh;
+	int last_stripe;
+	int synflags;
+	unsigned long txflags;
 
 	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
 
···
 		return;
 	}
 
-	count = set_syndrome_sources(blocks, sh);
+again:
+	blocks = to_addr_page(percpu, j);
 
-	atomic_inc(&sh->count);
+	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+		synflags = SYNDROME_SRC_WRITTEN;
+		txflags = ASYNC_TX_ACK | ASYNC_TX_PQ_XOR_DST;
+	} else {
+		synflags = SYNDROME_SRC_ALL;
+		txflags = ASYNC_TX_ACK;
+	}
 
-	init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct,
-			  sh, to_addr_conv(sh, percpu));
+	count = set_syndrome_sources(blocks, sh, synflags);
+	last_stripe = !head_sh->batch_head ||
+		list_first_entry(&sh->batch_list,
+				 struct stripe_head, batch_list) == head_sh;
+
+	if (last_stripe) {
+		atomic_inc(&head_sh->count);
+		init_async_submit(&submit, txflags, tx, ops_complete_reconstruct,
+				  head_sh, to_addr_conv(sh, percpu, j));
+	} else
+		init_async_submit(&submit, 0, tx, NULL, NULL,
+				  to_addr_conv(sh, percpu, j));
 	async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
+	if (!last_stripe) {
+		j++;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		goto again;
+	}
 }
 
 static void ops_complete_check(void *stripe_head_ref)
···
 	int pd_idx = sh->pd_idx;
 	int qd_idx = sh->qd_idx;
 	struct page *xor_dest;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int count;
···
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	BUG_ON(sh->batch_head);
 	count = 0;
 	xor_dest = sh->dev[pd_idx].page;
 	xor_srcs[count++] = xor_dest;
···
 	}
 
 	init_async_submit(&submit, 0, NULL, NULL, NULL,
-			  to_addr_conv(sh, percpu));
+			  to_addr_conv(sh, percpu, 0));
 	tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
 			   &sh->ops.zero_sum_result, &submit);
···
 
 static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp)
 {
-	struct page **srcs = percpu->scribble;
+	struct page **srcs = to_addr_page(percpu, 0);
 	struct async_submit_ctl submit;
 	int count;
 
 	pr_debug("%s: stripe %llu checkp: %d\n", __func__,
 		(unsigned long long)sh->sector, checkp);
 
+	BUG_ON(sh->batch_head);
-	count = set_syndrome_sources(srcs, sh);
+	count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
 	if (!checkp)
 		srcs[count] = NULL;
 
 	atomic_inc(&sh->count);
 	init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
-			  sh, to_addr_conv(sh, percpu));
+			  sh, to_addr_conv(sh, percpu, 0));
 	async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }
···
 		async_tx_ack(tx);
 	}
 
-	if (test_bit(STRIPE_OP_PREXOR, &ops_request))
-		tx = ops_run_prexor(sh, percpu, tx);
+	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
+		if (level < 6)
+			tx = ops_run_prexor5(sh, percpu, tx);
+		else
+			tx = ops_run_prexor6(sh, percpu, tx);
+	}
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
 		tx = ops_run_biodrain(sh, tx);
···
 		BUG();
 	}
 
-	if (overlap_clear)
+	if (overlap_clear && !sh->batch_head)
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
···
 	put_cpu();
 }
 
-static int grow_one_stripe(struct r5conf *conf, int hash)
+static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
-	sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
+	sh = kmem_cache_zalloc(conf->slab_cache, gfp);
 	if (!sh)
 		return 0;
···
 
 	spin_lock_init(&sh->stripe_lock);
 
-	if (grow_buffers(sh)) {
+	if (grow_buffers(sh, gfp)) {
 		shrink_buffers(sh);
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
-	sh->hash_lock_index = hash;
+	sh->hash_lock_index =
+		conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
+
+	spin_lock_init(&sh->batch_lock);
+	INIT_LIST_HEAD(&sh->batch_list);
+	sh->batch_head = NULL;
 	release_stripe(sh);
+	conf->max_nr_stripes++;
 	return 1;
 }
 
···
 {
 	struct kmem_cache *sc;
 	int devs = max(conf->raid_disks, conf->previous_raid_disks);
-	int hash;
 
 	if (conf->mddev->gendisk)
 		sprintf(conf->cache_name[0],
···
 		return 1;
 	conf->slab_cache = sc;
 	conf->pool_size = devs;
-	hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
-	while (num--) {
-		if (!grow_one_stripe(conf, hash))
+	while (num--)
+		if (!grow_one_stripe(conf, GFP_KERNEL))
 			return 1;
-		conf->max_nr_stripes++;
-		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
-	}
+
 	return 0;
 }
 
···
  * calculate over all devices (not just the data blocks), using zeros in place
  * of the P and Q blocks.
  */
-static size_t scribble_len(int num)
+static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags)
 {
+	struct flex_array *ret;
 	size_t len;
 
 	len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2);
-
-	return len;
+	ret = flex_array_alloc(len, cnt, flags);
+	if (!ret)
+		return NULL;
+	/* always prealloc all elements, so no locking is required */
+	if (flex_array_prealloc(ret, 0, cnt, flags)) {
+		flex_array_free(ret);
+		return NULL;
+	}
+	return ret;
 }
 
 static int resize_stripes(struct r5conf *conf, int newsize)
···
 	err = -ENOMEM;
 
 	get_online_cpus();
-	conf->scribble_len = scribble_len(newsize);
 	for_each_present_cpu(cpu) {
 		struct raid5_percpu *percpu;
-		void *scribble;
+		struct flex_array *scribble;
 
 		percpu = per_cpu_ptr(conf->percpu, cpu);
-		scribble = kmalloc(conf->scribble_len, GFP_NOIO);
+		scribble = scribble_alloc(newsize, conf->chunk_sectors /
+					  STRIPE_SECTORS, GFP_NOIO);
 
 		if (scribble) {
-			kfree(percpu->scribble);
+			flex_array_free(percpu->scribble);
 			percpu->scribble = scribble;
 		} else {
 			err = -ENOMEM;
···
 	return err;
 }
 
-static int drop_one_stripe(struct r5conf *conf, int hash)
+static int drop_one_stripe(struct r5conf *conf)
 {
 	struct stripe_head *sh;
+	int hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS;
 
 	spin_lock_irq(conf->hash_locks + hash);
 	sh = get_free_stripe(conf, hash);
···
 	shrink_buffers(sh);
 	kmem_cache_free(conf->slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
+	conf->max_nr_stripes--;
 	return 1;
 }
 
 static void shrink_stripes(struct r5conf *conf)
 {
-	int hash;
-	for (hash = 0; hash < NR_STRIPE_HASH_LOCKS; hash++)
-		while (drop_one_stripe(conf, hash))
-			;
+	while (conf->max_nr_stripes &&
+		drop_one_stripe(conf))
+		;
 
 	if (conf->slab_cache)
 		kmem_cache_destroy(conf->slab_cache);
···
 	}
 	rdev_dec_pending(rdev, conf->mddev);
 
+	if (sh->batch_head && !uptodate)
+		set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state);
+
 	if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
 		clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
+
+	if (sh->batch_head && sh != sh->batch_head)
+		release_stripe(sh->batch_head);
 }
 
 static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous);
···
 schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			 int rcw, int expand)
 {
-	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
 	struct r5conf *conf = sh->raid_conf;
 	int level = conf->level;
···
 		if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
 			atomic_inc(&conf->pending_full_writes);
 	} else {
-		BUG_ON(level == 6);
 		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
 			test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+		BUG_ON(level == 6 &&
+			(!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
+			   test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (i == pd_idx)
+			if (i == pd_idx || i == qd_idx)
 				continue;
 
 			if (dev->towrite &&
···
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
+			  int forwrite, int previous)
 {
 	struct bio **bip;
 	struct r5conf *conf = sh->raid_conf;
···
 	 * protect it.
 	 */
 	spin_lock_irq(&sh->stripe_lock);
+	/* Don't allow new IO added to stripes in batch list */
+	if (sh->batch_head)
+		goto overlap;
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL)
···
 	}
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
+
+	if (!forwrite || previous)
+		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
 	BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
 	if (*bip)
···
 			sector = bio_end_sector(bi);
 		}
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
-			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+			if (!test_and_set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags))
+				sh->overwrite_disks++;
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
···
 		sh->bm_seq = conf->seq_flush+1;
 		set_bit(STRIPE_BIT_DELAY, &sh->state);
 	}
+
+	if (stripe_can_batch(sh))
+		stripe_add_to_batch_list(conf, sh);
 	return 1;
 
  overlap:
···
 				struct bio **return_bi)
 {
 	int i;
+	BUG_ON(sh->batch_head);
 	for (i = disks; i--; ) {
 		struct bio *bi;
 		int bitmap_end = 0;
···
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
+		sh->overwrite_disks = 0;
 		spin_unlock_irq(&sh->stripe_lock);
 		if (bi)
 			bitmap_end = 1;
···
 	int abort = 0;
 	int i;
 
+	BUG_ON(sh->batch_head);
 	clear_bit(STRIPE_SYNCING, &sh->state);
 	if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags))
 		wake_up(&conf->wait_for_overlap);
···
 {
 	int i;
 
+	BUG_ON(sh->batch_head);
 	/* look for blocks to read/compute, skip this if a compute
 	 * is already in flight, or if the stripe contents are in the
 	 * midst of changing due to a write
···
 	int i;
 	struct r5dev *dev;
 	int discard_pending = 0;
+	struct stripe_head *head_sh = sh;
+	bool do_endio = false;
+	int wakeup_nr = 0;
 
 	for (i = disks; i--; )
 		if (sh->dev[i].written) {
···
 				clear_bit(R5_UPTODATE, &dev->flags);
 				if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) {
 					WARN_ON(test_bit(R5_UPTODATE, &dev->flags));
-					dev->page = dev->orig_page;
 				}
+				do_endio = true;
+
+returnbi:
+				dev->page = dev->orig_page;
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_iter.bi_sector <
···
 						STRIPE_SECTORS,
 					 !test_bit(STRIPE_DEGRADED, &sh->state),
 						0);
+				if (head_sh->batch_head) {
+					sh = list_first_entry(&sh->batch_list,
+							      struct stripe_head,
+							      batch_list);
+					if (sh != head_sh) {
+						dev = &sh->dev[i];
+						goto returnbi;
+					}
+				}
+				sh = head_sh;
+				dev = &sh->dev[i];
 			} else if (test_bit(R5_Discard, &dev->flags))
 				discard_pending = 1;
 			WARN_ON(test_bit(R5_SkipCopy, &dev->flags));
···
 		 * will be reinitialized
 		 */
 		spin_lock_irq(&conf->device_lock);
+unhash:
 		remove_hash(sh);
+		if (head_sh->batch_head) {
+			sh = list_first_entry(&sh->batch_list,
+					      struct stripe_head, batch_list);
+			if (sh != head_sh)
+				goto unhash;
+		}
 		spin_unlock_irq(&conf->device_lock);
+		sh = head_sh;
+
 		if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state))
 			set_bit(STRIPE_HANDLE, &sh->state);
 
···
 	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
 		if (atomic_dec_and_test(&conf->pending_full_writes))
 			md_wakeup_thread(conf->mddev->thread);
+
+	if (!head_sh->batch_head || !do_endio)
+		return;
+	for (i = 0; i < head_sh->disks; i++) {
+		if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
+			wakeup_nr++;
+	}
+	while (!list_empty(&head_sh->batch_list)) {
+		int i;
+		sh = list_first_entry(&head_sh->batch_list,
+				      struct stripe_head, batch_list);
+		list_del_init(&sh->batch_list);
+
+		set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
+			      head_sh->state & ~((1 << STRIPE_ACTIVE) |
+						 (1 << STRIPE_PREREAD_ACTIVE) |
+						 STRIPE_EXPAND_SYNC_FLAG));
+		sh->check_state = head_sh->check_state;
+		sh->reconstruct_state = head_sh->reconstruct_state;
+		for (i = 0; i < sh->disks; i++) {
+			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				wakeup_nr++;
+			sh->dev[i].flags = head_sh->dev[i].flags;
+		}
+
+		spin_lock_irq(&sh->stripe_lock);
+		sh->batch_head = NULL;
+		spin_unlock_irq(&sh->stripe_lock);
+		if (sh->state & STRIPE_EXPAND_SYNC_FLAG)
+			set_bit(STRIPE_HANDLE, &sh->state);
+		release_stripe(sh);
+	}
+
+	spin_lock_irq(&head_sh->stripe_lock);
+	head_sh->batch_head = NULL;
+	spin_unlock_irq(&head_sh->stripe_lock);
+	wake_up_nr(&conf->wait_for_overlap, wakeup_nr);
+	if (head_sh->state & STRIPE_EXPAND_SYNC_FLAG)
+		set_bit(STRIPE_HANDLE, &head_sh->state);
 }
 
 static void handle_stripe_dirtying(struct r5conf *conf,
···
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
 
-	/* RAID6 requires 'rcw' in current implementation.
-	 * Otherwise, check whether resync is now happening or should start.
+	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
 	 * initial creation), so parity in some stripes might be inconsistent.
 	 * In this case, we need to always do reconstruct-write, to ensure
 	 * that in case of drive failure or read-error correction, we
 	 * generate correct data from the parity.
 	 */
-	if (conf->max_degraded == 2 ||
+	if (conf->rmw_level == PARITY_DISABLE_RMW ||
 	    (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
 	     s->failed == 0)) {
 		/* Calculate the real rcw later - for now make it
 		 * look like rcw is cheaper
 		 */
 		rcw = 1; rmw = 2;
-		pr_debug("force RCW max_degraded=%u, recovery_cp=%llu sh->sector=%llu\n",
-			 conf->max_degraded, (unsigned long long)recovery_cp,
+		pr_debug("force RCW rmw_level=%u, recovery_cp=%llu sh->sector=%llu\n",
+			 conf->rmw_level, (unsigned long long)recovery_cp,
 			 (unsigned long long)sh->sector);
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx) &&
+		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
···
 				rmw += 2*disks;  /* cannot read it */
 		}
 		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+		if (!test_bit(R5_OVERWRITE, &dev->flags) &&
+		    i != sh->pd_idx && i != sh->qd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
···
 	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
 		 (unsigned long long)sh->sector, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
-	if (rmw < rcw && rmw > 0) {
+	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_ENABLE_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
 		if (conf->mddev->queue)
 			blk_add_trace_msg(conf->mddev->queue,
···
 				  (unsigned long long)sh->sector, rmw);
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx) &&
+			if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
···
 			}
 		}
 	}
-	if (rcw <= rmw && rcw > 0) {
+	if ((rcw < rmw || (rcw == rmw && conf->rmw_level != PARITY_ENABLE_RMW)) && rcw > 0) {
 		/* want reconstruct write, but need to get some data */
 		int qread =0;
 		rcw = 0;
···
 {
 	struct r5dev *dev = NULL;
 
+	BUG_ON(sh->batch_head);
 	set_bit(STRIPE_HANDLE, &sh->state);
 
 	switch (sh->check_state) {
···
 	int qd_idx = sh->qd_idx;
 	struct r5dev *dev;
 
+	BUG_ON(sh->batch_head);
 	set_bit(STRIPE_HANDLE, &sh->state);
 
 	BUG_ON(s->failed > 2);
···
 	 * copy some of them into a target stripe for expand.
 	 */
 	struct dma_async_tx_descriptor *tx = NULL;
+	BUG_ON(sh->batch_head);
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	for (i = 0; i < sh->disks; i++)
 		if (i != sh->pd_idx && i != sh->qd_idx) {
···
 
 	memset(s, 0, sizeof(*s));
 
-	s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+	s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state) && !sh->batch_head;
+	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
 	s->failed_num[0] = -1;
 	s->failed_num[1] = -1;
···
 	rcu_read_unlock();
 }
 
+static int clear_batch_ready(struct stripe_head *sh)
+{
+	struct stripe_head *tmp;
+	if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
+		return 0;
+	spin_lock(&sh->stripe_lock);
+	if (!sh->batch_head) {
+		spin_unlock(&sh->stripe_lock);
+		return 0;
+	}
+
+	/*
+	 * this stripe could be added to a batch list before we check
+	 * BATCH_READY, skips it
+	 */
+	if (sh->batch_head != sh) {
+		spin_unlock(&sh->stripe_lock);
+		return 1;
+	}
+	spin_lock(&sh->batch_lock);
+	list_for_each_entry(tmp, &sh->batch_list, batch_list)
+		clear_bit(STRIPE_BATCH_READY, &tmp->state);
+	spin_unlock(&sh->batch_lock);
+	spin_unlock(&sh->stripe_lock);
+
+	/*
+	 * BATCH_READY is cleared, no new stripes can be added.
+	 * batch_list can be accessed without lock
+	 */
+	return 0;
+}
+
+static void check_break_stripe_batch_list(struct stripe_head *sh)
+{
+	struct stripe_head *head_sh, *next;
+	int i;
+
+	if (!test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
+		return;
+
+	head_sh = sh;
+	do {
+		sh = list_first_entry(&sh->batch_list,
+				      struct stripe_head, batch_list);
+		BUG_ON(sh == head_sh);
+	} while (!test_bit(STRIPE_DEGRADED, &sh->state));
+
+	while (sh != head_sh) {
+		next = list_first_entry(&sh->batch_list,
+					struct stripe_head, batch_list);
+		list_del_init(&sh->batch_list);
+
+		set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
+			      head_sh->state & ~((1 << STRIPE_ACTIVE) |
+						 (1 << STRIPE_PREREAD_ACTIVE) |
+						 (1 << STRIPE_DEGRADED) |
+						 STRIPE_EXPAND_SYNC_FLAG));
+		sh->check_state = head_sh->check_state;
+		sh->reconstruct_state = head_sh->reconstruct_state;
+		for (i = 0; i < sh->disks; i++)
+			sh->dev[i].flags = head_sh->dev[i].flags &
+				(~((1 << R5_WriteError) | (1 << R5_Overlap)));
+
+		spin_lock_irq(&sh->stripe_lock);
+		sh->batch_head = NULL;
+		spin_unlock_irq(&sh->stripe_lock);
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		release_stripe(sh);
+
+		sh = next;
+	}
+}
+
 static void handle_stripe(struct stripe_head *sh)
 {
 	struct stripe_head_state s;
···
 		return;
 	}
 
-	if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+	if (clear_batch_ready(sh) ) {
+		clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
+		return;
+	}
+
+	check_break_stripe_batch_list(sh);
+
+	if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
 		spin_lock(&sh->stripe_lock);
 		/* Cannot process 'sync' concurrently with 'discard' */
 		if (!test_bit(STRIPE_DISCARD, &sh->state) &&
···
 	 * how busy the stripe_cache is
 	 */
 
-	if (conf->inactive_blocked)
+	if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
 		return 1;
 	if (conf->quiesce)
 		return 1;
···
 	unsigned int chunk_sectors = mddev->chunk_sectors;
 	unsigned int bio_sectors = bvm->bi_size >> 9;
 
-	if ((bvm->bi_rw & 1) == WRITE)
-		return biovec->bv_len; /* always allow writes to be mergeable */
+	/*
+	 * always allow writes to be mergeable, read as well if array
+	 * is degraded as we'll go through stripe cache anyway.
+	 */
+	if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
+		return biovec->bv_len;
 
 	if (mddev->new_chunk_sectors < mddev->chunk_sectors)
 		chunk_sectors = mddev->new_chunk_sectors;
···
 		}
 		set_bit(STRIPE_DISCARD, &sh->state);
 		finish_wait(&conf->wait_for_overlap, &w);
+		sh->overwrite_disks = 0;
 		for (d = 0; d < conf->raid_disks; d++) {
 			if (d == sh->pd_idx || d == sh->qd_idx)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
 			raid5_inc_bi_active_stripes(bi);
+			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
 		if (conf->mddev->bitmap) {
···
 
 	md_write_start(mddev, bi);
 
-	if (rw == READ &&
+	/*
+	 * If array is degraded, better not do chunk aligned read because
+	 * later we might have to read it again in order to reconstruct
+	 * data on failed drives.
4663 + */ 4664 + if (rw == READ && mddev->degraded == 0 && 5118 4665 mddev->reshape_position == MaxSector && 5119 4666 chunk_aligned_read(mddev,bi)) 5120 4667 return; ··· 5235 4772 } 5236 4773 5237 4774 if (test_bit(STRIPE_EXPANDING, &sh->state) || 5238 - !add_stripe_bio(sh, bi, dd_idx, rw)) { 4775 + !add_stripe_bio(sh, bi, dd_idx, rw, previous)) { 5239 4776 /* Stripe is busy expanding or 5240 4777 * add failed due to overlap. Flush everything 5241 4778 * and wait a while ··· 5248 4785 } 5249 4786 set_bit(STRIPE_HANDLE, &sh->state); 5250 4787 clear_bit(STRIPE_DELAYED, &sh->state); 5251 - if ((bi->bi_rw & REQ_SYNC) && 4788 + if ((!sh->batch_head || sh == sh->batch_head) && 4789 + (bi->bi_rw & REQ_SYNC) && 5252 4790 !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) 5253 4791 atomic_inc(&conf->preread_active_stripes); 5254 4792 release_stripe_plug(mddev, sh); ··· 5514 5050 return reshape_sectors; 5515 5051 } 5516 5052 5517 - /* FIXME go_faster isn't used */ 5518 - static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) 5053 + static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) 5519 5054 { 5520 5055 struct r5conf *conf = mddev->private; 5521 5056 struct stripe_head *sh; ··· 5649 5186 return handled; 5650 5187 } 5651 5188 5652 - if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { 5189 + if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) { 5653 5190 release_stripe(sh); 5654 5191 raid5_set_bi_processed_stripes(raid_bio, scnt); 5655 5192 conf->retry_read_aligned = raid_bio; ··· 5775 5312 int batch_size, released; 5776 5313 5777 5314 released = release_stripe_list(conf, conf->temp_inactive_list); 5315 + if (released) 5316 + clear_bit(R5_DID_ALLOC, &conf->cache_state); 5778 5317 5779 5318 if ( 5780 5319 !list_empty(&conf->bitmap_list)) { ··· 5815 5350 pr_debug("%d stripes handled\n", handled); 5816 5351 5817 5352 spin_unlock_irq(&conf->device_lock); 5353 + if 
(test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state)) { 5354 + grow_one_stripe(conf, __GFP_NOWARN); 5355 + /* Set flag even if allocation failed. This helps 5356 + * slow down allocation requests when mem is short 5357 + */ 5358 + set_bit(R5_DID_ALLOC, &conf->cache_state); 5359 + } 5818 5360 5819 5361 async_tx_issue_pending_all(); 5820 5362 blk_finish_plug(&plug); ··· 5837 5365 spin_lock(&mddev->lock); 5838 5366 conf = mddev->private; 5839 5367 if (conf) 5840 - ret = sprintf(page, "%d\n", conf->max_nr_stripes); 5368 + ret = sprintf(page, "%d\n", conf->min_nr_stripes); 5841 5369 spin_unlock(&mddev->lock); 5842 5370 return ret; 5843 5371 } ··· 5847 5375 { 5848 5376 struct r5conf *conf = mddev->private; 5849 5377 int err; 5850 - int hash; 5851 5378 5852 5379 if (size <= 16 || size > 32768) 5853 5380 return -EINVAL; 5854 - hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS; 5855 - while (size < conf->max_nr_stripes) { 5856 - if (drop_one_stripe(conf, hash)) 5857 - conf->max_nr_stripes--; 5858 - else 5859 - break; 5860 - hash--; 5861 - if (hash < 0) 5862 - hash = NR_STRIPE_HASH_LOCKS - 1; 5863 - } 5381 + 5382 + conf->min_nr_stripes = size; 5383 + while (size < conf->max_nr_stripes && 5384 + drop_one_stripe(conf)) 5385 + ; 5386 + 5387 + 5864 5388 err = md_allow_write(mddev); 5865 5389 if (err) 5866 5390 return err; 5867 - hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; 5868 - while (size > conf->max_nr_stripes) { 5869 - if (grow_one_stripe(conf, hash)) 5870 - conf->max_nr_stripes++; 5871 - else break; 5872 - hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; 5873 - } 5391 + 5392 + while (size > conf->max_nr_stripes) 5393 + if (!grow_one_stripe(conf, GFP_KERNEL)) 5394 + break; 5395 + 5874 5396 return 0; 5875 5397 } 5876 5398 EXPORT_SYMBOL(raid5_set_cache_size); ··· 5899 5433 raid5_store_stripe_cache_size); 5900 5434 5901 5435 static ssize_t 5436 + raid5_show_rmw_level(struct mddev *mddev, char *page) 5437 + { 5438 + struct r5conf *conf = mddev->private; 5439 + if 
(conf) 5440 + return sprintf(page, "%d\n", conf->rmw_level); 5441 + else 5442 + return 0; 5443 + } 5444 + 5445 + static ssize_t 5446 + raid5_store_rmw_level(struct mddev *mddev, const char *page, size_t len) 5447 + { 5448 + struct r5conf *conf = mddev->private; 5449 + unsigned long new; 5450 + 5451 + if (!conf) 5452 + return -ENODEV; 5453 + 5454 + if (len >= PAGE_SIZE) 5455 + return -EINVAL; 5456 + 5457 + if (kstrtoul(page, 10, &new)) 5458 + return -EINVAL; 5459 + 5460 + if (new != PARITY_DISABLE_RMW && !raid6_call.xor_syndrome) 5461 + return -EINVAL; 5462 + 5463 + if (new != PARITY_DISABLE_RMW && 5464 + new != PARITY_ENABLE_RMW && 5465 + new != PARITY_PREFER_RMW) 5466 + return -EINVAL; 5467 + 5468 + conf->rmw_level = new; 5469 + return len; 5470 + } 5471 + 5472 + static struct md_sysfs_entry 5473 + raid5_rmw_level = __ATTR(rmw_level, S_IRUGO | S_IWUSR, 5474 + raid5_show_rmw_level, 5475 + raid5_store_rmw_level); 5476 + 5477 + 5478 + static ssize_t 5902 5479 raid5_show_preread_threshold(struct mddev *mddev, char *page) 5903 5480 { 5904 5481 struct r5conf *conf; ··· 5972 5463 conf = mddev->private; 5973 5464 if (!conf) 5974 5465 err = -ENODEV; 5975 - else if (new > conf->max_nr_stripes) 5466 + else if (new > conf->min_nr_stripes) 5976 5467 err = -EINVAL; 5977 5468 else 5978 5469 conf->bypass_threshold = new; ··· 6127 5618 &raid5_preread_bypass_threshold.attr, 6128 5619 &raid5_group_thread_cnt.attr, 6129 5620 &raid5_skip_copy.attr, 5621 + &raid5_rmw_level.attr, 6130 5622 NULL, 6131 5623 }; 6132 5624 static struct attribute_group raid5_attrs_group = { ··· 6209 5699 static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu) 6210 5700 { 6211 5701 safe_put_page(percpu->spare_page); 6212 - kfree(percpu->scribble); 5702 + if (percpu->scribble) 5703 + flex_array_free(percpu->scribble); 6213 5704 percpu->spare_page = NULL; 6214 5705 percpu->scribble = NULL; 6215 5706 } ··· 6220 5709 if (conf->level == 6 && !percpu->spare_page) 6221 5710 
percpu->spare_page = alloc_page(GFP_KERNEL); 6222 5711 if (!percpu->scribble) 6223 - percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL); 5712 + percpu->scribble = scribble_alloc(max(conf->raid_disks, 5713 + conf->previous_raid_disks), conf->chunk_sectors / 5714 + STRIPE_SECTORS, GFP_KERNEL); 6224 5715 6225 5716 if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) { 6226 5717 free_scratch_buffer(conf, percpu); ··· 6253 5740 6254 5741 static void free_conf(struct r5conf *conf) 6255 5742 { 5743 + if (conf->shrinker.seeks) 5744 + unregister_shrinker(&conf->shrinker); 6256 5745 free_thread_groups(conf); 6257 5746 shrink_stripes(conf); 6258 5747 raid5_free_percpu(conf); ··· 6320 5805 put_online_cpus(); 6321 5806 6322 5807 return err; 5808 + } 5809 + 5810 + static unsigned long raid5_cache_scan(struct shrinker *shrink, 5811 + struct shrink_control *sc) 5812 + { 5813 + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); 5814 + int ret = 0; 5815 + while (ret < sc->nr_to_scan) { 5816 + if (drop_one_stripe(conf) == 0) 5817 + return SHRINK_STOP; 5818 + ret++; 5819 + } 5820 + return ret; 5821 + } 5822 + 5823 + static unsigned long raid5_cache_count(struct shrinker *shrink, 5824 + struct shrink_control *sc) 5825 + { 5826 + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); 5827 + 5828 + if (conf->max_nr_stripes < conf->min_nr_stripes) 5829 + /* unlikely, but not impossible */ 5830 + return 0; 5831 + return conf->max_nr_stripes - conf->min_nr_stripes; 6323 5832 } 6324 5833 6325 5834 static struct r5conf *setup_conf(struct mddev *mddev) ··· 6418 5879 else 6419 5880 conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks; 6420 5881 max_disks = max(conf->raid_disks, conf->previous_raid_disks); 6421 - conf->scribble_len = scribble_len(max_disks); 6422 5882 6423 5883 conf->disks = kzalloc(max_disks * sizeof(struct disk_info), 6424 5884 GFP_KERNEL); ··· 6445 5907 INIT_LIST_HEAD(conf->temp_inactive_list + i); 6446 5908 
6447 5909 conf->level = mddev->new_level; 5910 + conf->chunk_sectors = mddev->new_chunk_sectors; 6448 5911 if (raid5_alloc_percpu(conf) != 0) 6449 5912 goto abort; 6450 5913 ··· 6478 5939 conf->fullsync = 1; 6479 5940 } 6480 5941 6481 - conf->chunk_sectors = mddev->new_chunk_sectors; 6482 5942 conf->level = mddev->new_level; 6483 - if (conf->level == 6) 5943 + if (conf->level == 6) { 6484 5944 conf->max_degraded = 2; 6485 - else 5945 + if (raid6_call.xor_syndrome) 5946 + conf->rmw_level = PARITY_ENABLE_RMW; 5947 + else 5948 + conf->rmw_level = PARITY_DISABLE_RMW; 5949 + } else { 6486 5950 conf->max_degraded = 1; 5951 + conf->rmw_level = PARITY_ENABLE_RMW; 5952 + } 6487 5953 conf->algorithm = mddev->new_layout; 6488 5954 conf->reshape_progress = mddev->reshape_position; 6489 5955 if (conf->reshape_progress != MaxSector) { ··· 6496 5952 conf->prev_algo = mddev->layout; 6497 5953 } 6498 5954 6499 - memory = conf->max_nr_stripes * (sizeof(struct stripe_head) + 5955 + conf->min_nr_stripes = NR_STRIPES; 5956 + memory = conf->min_nr_stripes * (sizeof(struct stripe_head) + 6500 5957 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; 6501 5958 atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS); 6502 - if (grow_stripes(conf, NR_STRIPES)) { 5959 + if (grow_stripes(conf, conf->min_nr_stripes)) { 6503 5960 printk(KERN_ERR 6504 5961 "md/raid:%s: couldn't allocate %dkB for buffers\n", 6505 5962 mdname(mddev), memory); ··· 6508 5963 } else 6509 5964 printk(KERN_INFO "md/raid:%s: allocated %dkB\n", 6510 5965 mdname(mddev), memory); 5966 + /* 5967 + * Losing a stripe head costs more than the time to refill it, 5968 + * it reduces the queue depth and so can hurt throughput. 5969 + * So set it rather large, scaled by number of devices. 
5970 + */ 5971 + conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4; 5972 + conf->shrinker.scan_objects = raid5_cache_scan; 5973 + conf->shrinker.count_objects = raid5_cache_count; 5974 + conf->shrinker.batch = 128; 5975 + conf->shrinker.flags = 0; 5976 + register_shrinker(&conf->shrinker); 6511 5977 6512 5978 sprintf(pers_name, "raid%d", mddev->new_level); 6513 5979 conf->thread = md_register_thread(raid5d, mddev, pers_name); ··· 7160 6604 */ 7161 6605 struct r5conf *conf = mddev->private; 7162 6606 if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4 7163 - > conf->max_nr_stripes || 6607 + > conf->min_nr_stripes || 7164 6608 ((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4 7165 - > conf->max_nr_stripes) { 6609 + > conf->min_nr_stripes) { 7166 6610 printk(KERN_WARNING "md/raid:%s: reshape: not enough stripes. Needed %lu\n", 7167 6611 mdname(mddev), 7168 6612 ((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9)
+50 -9
drivers/md/raid5.h
··· 210 210 atomic_t count; /* nr of active thread/requests */ 211 211 int bm_seq; /* sequence number for bitmap flushes */ 212 212 int disks; /* disks in stripe */ 213 + int overwrite_disks; /* total overwrite disks in stripe, 214 + * this is only checked when stripe 215 + * has STRIPE_BATCH_READY 216 + */ 213 217 enum check_states check_state; 214 218 enum reconstruct_states reconstruct_state; 215 219 spinlock_t stripe_lock; 216 220 int cpu; 217 221 struct r5worker_group *group; 222 + 223 + struct stripe_head *batch_head; /* protected by stripe lock */ 224 + spinlock_t batch_lock; /* only header's lock is useful */ 225 + struct list_head batch_list; /* protected by head's batch lock*/ 218 226 /** 219 227 * struct stripe_operations 220 228 * @target - STRIPE_OP_COMPUTE_BLK target ··· 335 327 STRIPE_ON_UNPLUG_LIST, 336 328 STRIPE_DISCARD, 337 329 STRIPE_ON_RELEASE_LIST, 330 + STRIPE_BATCH_READY, 331 + STRIPE_BATCH_ERR, 338 332 }; 339 333 334 + #define STRIPE_EXPAND_SYNC_FLAG \ 335 + ((1 << STRIPE_EXPAND_SOURCE) |\ 336 + (1 << STRIPE_EXPAND_READY) |\ 337 + (1 << STRIPE_EXPANDING) |\ 338 + (1 << STRIPE_SYNC_REQUESTED)) 340 339 /* 341 340 * Operation request flags 342 341 */ ··· 354 339 STRIPE_OP_BIODRAIN, 355 340 STRIPE_OP_RECONSTRUCT, 356 341 STRIPE_OP_CHECK, 342 + }; 343 + 344 + /* 345 + * RAID parity calculation preferences 346 + */ 347 + enum { 348 + PARITY_DISABLE_RMW = 0, 349 + PARITY_ENABLE_RMW, 350 + PARITY_PREFER_RMW, 351 + }; 352 + 353 + /* 354 + * Pages requested from set_syndrome_sources() 355 + */ 356 + enum { 357 + SYNDROME_SRC_ALL, 358 + SYNDROME_SRC_WANT_DRAIN, 359 + SYNDROME_SRC_WRITTEN, 357 360 }; 358 361 /* 359 362 * Plugging: ··· 429 396 spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS]; 430 397 struct mddev *mddev; 431 398 int chunk_sectors; 432 - int level, algorithm; 399 + int level, algorithm, rmw_level; 433 400 int max_degraded; 434 401 int raid_disks; 435 402 int max_nr_stripes; 403 + int min_nr_stripes; 436 404 437 405 /* reshape_progress is the 
leading edge of a 'reshape' 438 406 * It has value MaxSector when no reshape is happening ··· 492 458 /* per cpu variables */ 493 459 struct raid5_percpu { 494 460 struct page *spare_page; /* Used when checking P/Q in raid6 */ 495 - void *scribble; /* space for constructing buffer 461 + struct flex_array *scribble; /* space for constructing buffer 496 462 * lists and performing address 497 463 * conversions 498 464 */ 499 465 } __percpu *percpu; 500 - size_t scribble_len; /* size of scribble region must be 501 - * associated with conf to handle 502 - * cpu hotplug while reshaping 503 - */ 504 466 #ifdef CONFIG_HOTPLUG_CPU 505 467 struct notifier_block cpu_notify; 506 468 #endif ··· 510 480 struct llist_head released_stripes; 511 481 wait_queue_head_t wait_for_stripe; 512 482 wait_queue_head_t wait_for_overlap; 513 - int inactive_blocked; /* release of inactive stripes blocked, 514 - * waiting for 25% to be free 515 - */ 483 + unsigned long cache_state; 484 + #define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked, 485 + * waiting for 25% to be free 486 + */ 487 + #define R5_ALLOC_MORE 2 /* It might help to allocate another 488 + * stripe. 489 + */ 490 + #define R5_DID_ALLOC 4 /* A stripe was allocated, don't allocate 491 + * more until at least one has been 492 + * released. This avoids flooding 493 + * the cache. 494 + */ 495 + struct shrinker shrinker; 516 496 int pool_size; /* number of disks in stripeheads in pool */ 517 497 spinlock_t device_lock; 518 498 struct disk_info *disks; ··· 536 496 int group_cnt; 537 497 int worker_cnt_per_group; 538 498 }; 499 + 539 500 540 501 /* 541 502 * Our supported algorithms
+3
include/linux/async_tx.h
··· 60 60 * dependency chain 61 61 * @ASYNC_TX_FENCE: specify that the next operation in the dependency 62 62 * chain uses this operation's result as an input 63 + * @ASYNC_TX_PQ_XOR_DST: do not overwrite the syndrome but XOR it with the 64 + * input data. Required for rmw case. 63 65 */ 64 66 enum async_tx_flags { 65 67 ASYNC_TX_XOR_ZERO_DST = (1 << 0), 66 68 ASYNC_TX_XOR_DROP_DST = (1 << 1), 67 69 ASYNC_TX_ACK = (1 << 2), 68 70 ASYNC_TX_FENCE = (1 << 3), 71 + ASYNC_TX_PQ_XOR_DST = (1 << 4), 69 72 }; 70 73 71 74 /**
+1
include/linux/raid/pq.h
··· 72 72 /* Routine choices */ 73 73 struct raid6_calls { 74 74 void (*gen_syndrome)(int, size_t, void **); 75 + void (*xor_syndrome)(int, int, int, size_t, void **); 75 76 int (*valid)(void); /* Returns 1 if this routine set is usable */ 76 77 const char *name; /* Name of this routine set */ 77 78 int prefer; /* Has special performance attribute */
+7
include/uapi/linux/raid/md_p.h
··· 78 78 #define MD_DISK_ACTIVE 1 /* disk is running or spare disk */ 79 79 #define MD_DISK_SYNC 2 /* disk is in sync with the raid set */ 80 80 #define MD_DISK_REMOVED 3 /* disk is in sync with the raid set */ 81 + #define MD_DISK_CLUSTER_ADD 4 /* Initiate a disk add across the cluster 82 + * For clustered environments only. 83 + */ 84 + #define MD_DISK_CANDIDATE 5 /* disk is added as spare (local) until confirmed 85 + * For clustered environments only. 86 + */ 81 87 82 88 #define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config. 83 89 * read requests will only be sent here in ··· 107 101 #define MD_SB_CLEAN 0 108 102 #define MD_SB_ERRORS 1 109 103 104 + #define MD_SB_CLUSTERED 5 /* MD is clustered */ 110 105 #define MD_SB_BITMAP_PRESENT 8 /* bitmap may be present nearby */ 111 106 112 107 /*
+1
include/uapi/linux/raid/md_u.h
··· 62 62 #define STOP_ARRAY _IO (MD_MAJOR, 0x32) 63 63 #define STOP_ARRAY_RO _IO (MD_MAJOR, 0x33) 64 64 #define RESTART_ARRAY_RW _IO (MD_MAJOR, 0x34) 65 + #define CLUSTERED_DISK_NACK _IO (MD_MAJOR, 0x35) 65 66 66 67 /* 63 partitions with the alternate major number (mdp) */ 67 68 #define MdpMinorShift 6
+34 -7
lib/raid6/algos.c
··· 131 131 static inline const struct raid6_calls *raid6_choose_gen( 132 132 void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks) 133 133 { 134 - unsigned long perf, bestperf, j0, j1; 134 + unsigned long perf, bestgenperf, bestxorperf, j0, j1; 135 + int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */ 135 136 const struct raid6_calls *const *algo; 136 137 const struct raid6_calls *best; 137 138 138 - for (bestperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { 139 + for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { 139 140 if (!best || (*algo)->prefer >= best->prefer) { 140 141 if ((*algo)->valid && !(*algo)->valid()) 141 142 continue; ··· 154 153 } 155 154 preempt_enable(); 156 155 157 - if (perf > bestperf) { 158 - bestperf = perf; 156 + if (perf > bestgenperf) { 157 + bestgenperf = perf; 159 158 best = *algo; 160 159 } 161 - pr_info("raid6: %-8s %5ld MB/s\n", (*algo)->name, 160 + pr_info("raid6: %-8s gen() %5ld MB/s\n", (*algo)->name, 162 161 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 162 + 163 + if (!(*algo)->xor_syndrome) 164 + continue; 165 + 166 + perf = 0; 167 + 168 + preempt_disable(); 169 + j0 = jiffies; 170 + while ((j1 = jiffies) == j0) 171 + cpu_relax(); 172 + while (time_before(jiffies, 173 + j1 + (1<<RAID6_TIME_JIFFIES_LG2))) { 174 + (*algo)->xor_syndrome(disks, start, stop, 175 + PAGE_SIZE, *dptrs); 176 + perf++; 177 + } 178 + preempt_enable(); 179 + 180 + if (best == *algo) 181 + bestxorperf = perf; 182 + 183 + pr_info("raid6: %-8s xor() %5ld MB/s\n", (*algo)->name, 184 + (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); 163 185 } 164 186 } 165 187 166 188 if (best) { 167 - pr_info("raid6: using algorithm %s (%ld MB/s)\n", 189 + pr_info("raid6: using algorithm %s gen() %ld MB/s\n", 168 190 best->name, 169 - (bestperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 191 + (bestgenperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 192 + if (best->xor_syndrome) 193 + 
pr_info("raid6: .... xor() %ld MB/s, rmw enabled\n", 194 + (bestxorperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); 170 195 raid6_call = *best; 171 196 } else 172 197 pr_err("raid6: Yikes! No algorithm found!\n");
+1
lib/raid6/altivec.uc
··· 119 119 120 120 const struct raid6_calls raid6_altivec$# = { 121 121 raid6_altivec$#_gen_syndrome, 122 + NULL, /* XOR not yet implemented */ 122 123 raid6_have_altivec, 123 124 "altivecx$#", 124 125 0
+3
lib/raid6/avx2.c
··· 89 89 90 90 const struct raid6_calls raid6_avx2x1 = { 91 91 raid6_avx21_gen_syndrome, 92 + NULL, /* XOR not yet implemented */ 92 93 raid6_have_avx2, 93 94 "avx2x1", 94 95 1 /* Has cache hints */ ··· 151 150 152 151 const struct raid6_calls raid6_avx2x2 = { 153 152 raid6_avx22_gen_syndrome, 153 + NULL, /* XOR not yet implemented */ 154 154 raid6_have_avx2, 155 155 "avx2x2", 156 156 1 /* Has cache hints */ ··· 244 242 245 243 const struct raid6_calls raid6_avx2x4 = { 246 244 raid6_avx24_gen_syndrome, 245 + NULL, /* XOR not yet implemented */ 247 246 raid6_have_avx2, 248 247 "avx2x4", 249 248 1 /* Has cache hints */
+40 -1
lib/raid6/int.uc
··· 107 107 } 108 108 } 109 109 110 + static void raid6_int$#_xor_syndrome(int disks, int start, int stop, 111 + size_t bytes, void **ptrs) 112 + { 113 + u8 **dptr = (u8 **)ptrs; 114 + u8 *p, *q; 115 + int d, z, z0; 116 + 117 + unative_t wd$$, wq$$, wp$$, w1$$, w2$$; 118 + 119 + z0 = stop; /* P/Q right side optimization */ 120 + p = dptr[disks-2]; /* XOR parity */ 121 + q = dptr[disks-1]; /* RS syndrome */ 122 + 123 + for ( d = 0 ; d < bytes ; d += NSIZE*$# ) { 124 + /* P/Q data pages */ 125 + wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; 126 + for ( z = z0-1 ; z >= start ; z-- ) { 127 + wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; 128 + wp$$ ^= wd$$; 129 + w2$$ = MASK(wq$$); 130 + w1$$ = SHLBYTE(wq$$); 131 + w2$$ &= NBYTES(0x1d); 132 + w1$$ ^= w2$$; 133 + wq$$ = w1$$ ^ wd$$; 134 + } 135 + /* P/Q left side optimization */ 136 + for ( z = start-1 ; z >= 0 ; z-- ) { 137 + w2$$ = MASK(wq$$); 138 + w1$$ = SHLBYTE(wq$$); 139 + w2$$ &= NBYTES(0x1d); 140 + wq$$ = w1$$ ^ w2$$; 141 + } 142 + *(unative_t *)&p[d+NSIZE*$$] ^= wp$$; 143 + *(unative_t *)&q[d+NSIZE*$$] ^= wq$$; 144 + } 145 + 146 + } 147 + 110 148 const struct raid6_calls raid6_intx$# = { 111 149 raid6_int$#_gen_syndrome, 112 - NULL, /* always valid */ 150 + raid6_int$#_xor_syndrome, 151 + NULL, /* always valid */ 113 152 "int" NSTRING "x$#", 114 153 0 115 154 };
+2
lib/raid6/mmx.c
··· 76 76 77 77 const struct raid6_calls raid6_mmxx1 = { 78 78 raid6_mmx1_gen_syndrome, 79 + NULL, /* XOR not yet implemented */ 79 80 raid6_have_mmx, 80 81 "mmxx1", 81 82 0 ··· 135 134 136 135 const struct raid6_calls raid6_mmxx2 = { 137 136 raid6_mmx2_gen_syndrome, 137 + NULL, /* XOR not yet implemented */ 138 138 raid6_have_mmx, 139 139 "mmxx2", 140 140 0
+1
lib/raid6/neon.c
··· 42 42 } \ 43 43 struct raid6_calls const raid6_neonx ## _n = { \ 44 44 raid6_neon ## _n ## _gen_syndrome, \ 45 + NULL, /* XOR not yet implemented */ \ 45 46 raid6_have_neon, \ 46 47 "neonx" #_n, \ 47 48 0 \
+2
lib/raid6/sse1.c
··· 92 92 93 93 const struct raid6_calls raid6_sse1x1 = { 94 94 raid6_sse11_gen_syndrome, 95 + NULL, /* XOR not yet implemented */ 95 96 raid6_have_sse1_or_mmxext, 96 97 "sse1x1", 97 98 1 /* Has cache hints */ ··· 155 154 156 155 const struct raid6_calls raid6_sse1x2 = { 157 156 raid6_sse12_gen_syndrome, 157 + NULL, /* XOR not yet implemented */ 158 158 raid6_have_sse1_or_mmxext, 159 159 "sse1x2", 160 160 1 /* Has cache hints */
+227
lib/raid6/sse2.c
··· 88 88 kernel_fpu_end(); 89 89 } 90 90 91 + 92 + static void raid6_sse21_xor_syndrome(int disks, int start, int stop, 93 + size_t bytes, void **ptrs) 94 + { 95 + u8 **dptr = (u8 **)ptrs; 96 + u8 *p, *q; 97 + int d, z, z0; 98 + 99 + z0 = stop; /* P/Q right side optimization */ 100 + p = dptr[disks-2]; /* XOR parity */ 101 + q = dptr[disks-1]; /* RS syndrome */ 102 + 103 + kernel_fpu_begin(); 104 + 105 + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); 106 + 107 + for ( d = 0 ; d < bytes ; d += 16 ) { 108 + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); 109 + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); 110 + asm volatile("pxor %xmm4,%xmm2"); 111 + /* P/Q data pages */ 112 + for ( z = z0-1 ; z >= start ; z-- ) { 113 + asm volatile("pxor %xmm5,%xmm5"); 114 + asm volatile("pcmpgtb %xmm4,%xmm5"); 115 + asm volatile("paddb %xmm4,%xmm4"); 116 + asm volatile("pand %xmm0,%xmm5"); 117 + asm volatile("pxor %xmm5,%xmm4"); 118 + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); 119 + asm volatile("pxor %xmm5,%xmm2"); 120 + asm volatile("pxor %xmm5,%xmm4"); 121 + } 122 + /* P/Q left side optimization */ 123 + for ( z = start-1 ; z >= 0 ; z-- ) { 124 + asm volatile("pxor %xmm5,%xmm5"); 125 + asm volatile("pcmpgtb %xmm4,%xmm5"); 126 + asm volatile("paddb %xmm4,%xmm4"); 127 + asm volatile("pand %xmm0,%xmm5"); 128 + asm volatile("pxor %xmm5,%xmm4"); 129 + } 130 + asm volatile("pxor %0,%%xmm4" : : "m" (q[d])); 131 + /* Don't use movntdq for r/w memory area < cache line */ 132 + asm volatile("movdqa %%xmm4,%0" : "=m" (q[d])); 133 + asm volatile("movdqa %%xmm2,%0" : "=m" (p[d])); 134 + } 135 + 136 + asm volatile("sfence" : : : "memory"); 137 + kernel_fpu_end(); 138 + } 139 + 91 140 const struct raid6_calls raid6_sse2x1 = { 92 141 raid6_sse21_gen_syndrome, 142 + raid6_sse21_xor_syndrome, 93 143 raid6_have_sse2, 94 144 "sse2x1", 95 145 1 /* Has cache hints */ ··· 200 150 kernel_fpu_end(); 201 151 } 202 152 153 + static void 
raid6_sse22_xor_syndrome(int disks, int start, int stop, 154 + size_t bytes, void **ptrs) 155 + { 156 + u8 **dptr = (u8 **)ptrs; 157 + u8 *p, *q; 158 + int d, z, z0; 159 + 160 + z0 = stop; /* P/Q right side optimization */ 161 + p = dptr[disks-2]; /* XOR parity */ 162 + q = dptr[disks-1]; /* RS syndrome */ 163 + 164 + kernel_fpu_begin(); 165 + 166 + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); 167 + 168 + for ( d = 0 ; d < bytes ; d += 32 ) { 169 + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); 170 + asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16])); 171 + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); 172 + asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16])); 173 + asm volatile("pxor %xmm4,%xmm2"); 174 + asm volatile("pxor %xmm6,%xmm3"); 175 + /* P/Q data pages */ 176 + for ( z = z0-1 ; z >= start ; z-- ) { 177 + asm volatile("pxor %xmm5,%xmm5"); 178 + asm volatile("pxor %xmm7,%xmm7"); 179 + asm volatile("pcmpgtb %xmm4,%xmm5"); 180 + asm volatile("pcmpgtb %xmm6,%xmm7"); 181 + asm volatile("paddb %xmm4,%xmm4"); 182 + asm volatile("paddb %xmm6,%xmm6"); 183 + asm volatile("pand %xmm0,%xmm5"); 184 + asm volatile("pand %xmm0,%xmm7"); 185 + asm volatile("pxor %xmm5,%xmm4"); 186 + asm volatile("pxor %xmm7,%xmm6"); 187 + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); 188 + asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16])); 189 + asm volatile("pxor %xmm5,%xmm2"); 190 + asm volatile("pxor %xmm7,%xmm3"); 191 + asm volatile("pxor %xmm5,%xmm4"); 192 + asm volatile("pxor %xmm7,%xmm6"); 193 + } 194 + /* P/Q left side optimization */ 195 + for ( z = start-1 ; z >= 0 ; z-- ) { 196 + asm volatile("pxor %xmm5,%xmm5"); 197 + asm volatile("pxor %xmm7,%xmm7"); 198 + asm volatile("pcmpgtb %xmm4,%xmm5"); 199 + asm volatile("pcmpgtb %xmm6,%xmm7"); 200 + asm volatile("paddb %xmm4,%xmm4"); 201 + asm volatile("paddb %xmm6,%xmm6"); 202 + asm volatile("pand %xmm0,%xmm5"); 203 + asm volatile("pand %xmm0,%xmm7"); 204 + asm volatile("pxor 
+		asm volatile("pxor %xmm5,%xmm4");
+		asm volatile("pxor %xmm7,%xmm6");
+		}
+		asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
+		asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
+		/* Don't use movntdq for r/w memory area < cache line */
+		asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
+		asm volatile("movdqa %%xmm6,%0" : "=m" (q[d+16]));
+		asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
+		asm volatile("movdqa %%xmm3,%0" : "=m" (p[d+16]));
+	}
+
+	asm volatile("sfence" : : : "memory");
+	kernel_fpu_end();
+}
+
 const struct raid6_calls raid6_sse2x2 = {
 	raid6_sse22_gen_syndrome,
+	raid6_sse22_xor_syndrome,
 	raid6_have_sse2,
 	"sse2x2",
 	1		/* Has cache hints */
···
 	kernel_fpu_end();
 }
 
+static void raid6_sse24_xor_syndrome(int disks, int start, int stop,
+				     size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = stop;		/* P/Q right side optimization */
+	p = dptr[disks-2];	/* XOR parity */
+	q = dptr[disks-1];	/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	asm volatile("movdqa %0,%%xmm0" :: "m" (raid6_sse_constants.x1d[0]));
+
+	for ( d = 0 ; d < bytes ; d += 64 ) {
+		asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
+		asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
+		asm volatile("movdqa %0,%%xmm12" :: "m" (dptr[z0][d+32]));
+		asm volatile("movdqa %0,%%xmm14" :: "m" (dptr[z0][d+48]));
+		asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
+		asm volatile("movdqa %0,%%xmm10" : : "m" (p[d+32]));
+		asm volatile("movdqa %0,%%xmm11" : : "m" (p[d+48]));
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm6,%xmm3");
+		asm volatile("pxor %xmm12,%xmm10");
+		asm volatile("pxor %xmm14,%xmm11");
+		/* P/Q data pages */
+		for ( z = z0-1 ; z >= start ; z-- ) {
+			asm volatile("prefetchnta %0" :: "m" (dptr[z][d]));
+			asm volatile("prefetchnta %0" :: "m" (dptr[z][d+32]));
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm7,%xmm7");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pxor %xmm15,%xmm15");
+			asm volatile("pcmpgtb %xmm4,%xmm5");
+			asm volatile("pcmpgtb %xmm6,%xmm7");
+			asm volatile("pcmpgtb %xmm12,%xmm13");
+			asm volatile("pcmpgtb %xmm14,%xmm15");
+			asm volatile("paddb %xmm4,%xmm4");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("paddb %xmm12,%xmm12");
+			asm volatile("paddb %xmm14,%xmm14");
+			asm volatile("pand %xmm0,%xmm5");
+			asm volatile("pand %xmm0,%xmm7");
+			asm volatile("pand %xmm0,%xmm13");
+			asm volatile("pand %xmm0,%xmm15");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+			asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
+			asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
+			asm volatile("movdqa %0,%%xmm13" :: "m" (dptr[z][d+32]));
+			asm volatile("movdqa %0,%%xmm15" :: "m" (dptr[z][d+48]));
+			asm volatile("pxor %xmm5,%xmm2");
+			asm volatile("pxor %xmm7,%xmm3");
+			asm volatile("pxor %xmm13,%xmm10");
+			asm volatile("pxor %xmm15,%xmm11");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+		}
+		asm volatile("prefetchnta %0" :: "m" (q[d]));
+		asm volatile("prefetchnta %0" :: "m" (q[d+32]));
+		/* P/Q left side optimization */
+		for ( z = start-1 ; z >= 0 ; z-- ) {
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm7,%xmm7");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pxor %xmm15,%xmm15");
+			asm volatile("pcmpgtb %xmm4,%xmm5");
+			asm volatile("pcmpgtb %xmm6,%xmm7");
+			asm volatile("pcmpgtb %xmm12,%xmm13");
+			asm volatile("pcmpgtb %xmm14,%xmm15");
+			asm volatile("paddb %xmm4,%xmm4");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("paddb %xmm12,%xmm12");
+			asm volatile("paddb %xmm14,%xmm14");
+			asm volatile("pand %xmm0,%xmm5");
+			asm volatile("pand %xmm0,%xmm7");
+			asm volatile("pand %xmm0,%xmm13");
+			asm volatile("pand %xmm0,%xmm15");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+		}
+		asm volatile("movntdq %%xmm2,%0" : "=m" (p[d]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (p[d+16]));
+		asm volatile("movntdq %%xmm10,%0" : "=m" (p[d+32]));
+		asm volatile("movntdq %%xmm11,%0" : "=m" (p[d+48]));
+		asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
+		asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
+		asm volatile("pxor %0,%%xmm12" : : "m" (q[d+32]));
+		asm volatile("pxor %0,%%xmm14" : : "m" (q[d+48]));
+		asm volatile("movntdq %%xmm4,%0" : "=m" (q[d]));
+		asm volatile("movntdq %%xmm6,%0" : "=m" (q[d+16]));
+		asm volatile("movntdq %%xmm12,%0" : "=m" (q[d+32]));
+		asm volatile("movntdq %%xmm14,%0" : "=m" (q[d+48]));
+	}
+	asm volatile("sfence" : : : "memory");
+	kernel_fpu_end();
+}
+
 const struct raid6_calls raid6_sse2x4 = {
 	raid6_sse24_gen_syndrome,
+	raid6_sse24_xor_syndrome,
 	raid6_have_sse2,
 	"sse2x4",
 	1		/* Has cache hints */
+36 -15
lib/raid6/test/test.c
···
 char data[NDISKS][PAGE_SIZE];
 char recovi[PAGE_SIZE], recovj[PAGE_SIZE];
 
-static void makedata(void)
+static void makedata(int start, int stop)
 {
 	int i, j;
 
-	for (i = 0; i < NDISKS; i++) {
+	for (i = start; i <= stop; i++) {
 		for (j = 0; j < PAGE_SIZE; j++)
 			data[i][j] = rand();
 
···
 {
 	const struct raid6_calls *const *algo;
 	const struct raid6_recov_calls *const *ra;
-	int i, j;
+	int i, j, p1, p2;
 	int err = 0;
 
-	makedata();
+	makedata(0, NDISKS-1);
 
 	for (ra = raid6_recov_algos; *ra; ra++) {
 		if ((*ra)->valid && !(*ra)->valid())
 			continue;
+
 		raid6_2data_recov = (*ra)->data2;
 		raid6_datap_recov = (*ra)->datap;
 
 		printf("using recovery %s\n", (*ra)->name);
 
 		for (algo = raid6_algos; *algo; algo++) {
-			if (!(*algo)->valid || (*algo)->valid()) {
-				raid6_call = **algo;
+			if ((*algo)->valid && !(*algo)->valid())
+				continue;
 
-				/* Nuke syndromes */
-				memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
+			raid6_call = **algo;
 
-				/* Generate assumed good syndrome */
-				raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
-							(void **)&dataptrs);
+			/* Nuke syndromes */
+			memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
 
-				for (i = 0; i < NDISKS-1; i++)
-					for (j = i+1; j < NDISKS; j++)
-						err += test_disks(i, j);
-			}
+			/* Generate assumed good syndrome */
+			raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
+						(void **)&dataptrs);
+
+			for (i = 0; i < NDISKS-1; i++)
+				for (j = i+1; j < NDISKS; j++)
+					err += test_disks(i, j);
+
+			if (!raid6_call.xor_syndrome)
+				continue;
+
+			for (p1 = 0; p1 < NDISKS-2; p1++)
+				for (p2 = p1; p2 < NDISKS-2; p2++) {
+
+					/* Simulate rmw run */
+					raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+								(void **)&dataptrs);
+					makedata(p1, p2);
+					raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+								(void **)&dataptrs);
+
+					for (i = 0; i < NDISKS-1; i++)
+						for (j = i+1; j < NDISKS; j++)
+							err += test_disks(i, j);
+				}
+
 		}
 		printf("\n");
 	}
+1
lib/raid6/tilegx.uc
···
 
 const struct raid6_calls raid6_tilegx$# = {
 	raid6_tilegx$#_gen_syndrome,
+	NULL,			/* XOR not yet implemented */
 	NULL,
 	"tilegx$#",
 	0