Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'md/4.1' of git://neil.brown.name/md

Pull md updates from Neil Brown:
"More updates that usual this time. A few have performance impacts
which hould mostly be positive, but RAID5 (in particular) can be very
work-load ensitive... We'll have to wait and see.

Highlights:

- "experimental" code for managing md/raid1 across a cluster using
DLM. Code is not ready for general use and triggers a WARNING if
used. However it is looking good and mostly done, and having it in
mainline will help co-ordinate development.

- RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
handle a full (chunk wide) stripe as a single unit.

- RAID6 can now perform read-modify-write cycles which should help
performance on larger arrays: 6 or more devices.

- RAID5/6 stripe cache now grows and shrinks dynamically. The value
set is used as a minimum.

- Resync is now allowed to go a little faster than the 'minimum' when
there is competing IO. How much faster depends on the speed of the
devices, so the effective minimum should scale with device speed to
some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
md/raid5: don't do chunk aligned read on degraded array.
md/raid5: allow the stripe_cache to grow and shrink.
md/raid5: change ->inactive_blocked to a bit-flag.
md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
md/raid5: pass gfp_t arg to grow_one_stripe()
md/raid5: introduce configuration option rmw_level
md/raid5: activate raid6 rmw feature
md/raid6 algorithms: xor_syndrome() for SSE2
md/raid6 algorithms: xor_syndrome() for generic int
md/raid6 algorithms: improve test program
md/raid6 algorithms: delta syndrome functions
raid5: handle expansion/resync case with stripe batching
raid5: handle io error of batch list
RAID5: batch adjacent full stripe write
raid5: track overwrite disk count
raid5: add a new flag to track if a stripe can be batched
raid5: use flex_array for scribble data
md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
md: allow resync to go faster when there is competing IO.
md: remove 'go_faster' option from ->sync_request()
...

+2858 -303
+176
Documentation/md-cluster.txt
The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                     12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits  |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]    |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits   |
| bm bits [3, contd]  |                     |                      |

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

The bm_lockres protects individual node bitmaps. They are named in the
form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
joins the cluster, it acquires the lock in PW mode and it stays so
during the lifetime the node is part of the cluster. The lock resource
number is based on the slot number returned by the DLM subsystem. Since
DLM starts node count from one and bitmap slots start from zero, one is
subtracted from the DLM slot number to arrive at the bitmap slot number.

3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and for metadata superblock updates.

3.1 Message Types

There are three types of messages which are passed:

3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
updated, and the node must re-read the md superblock. This is performed
synchronously.

3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
so that each node may suspend or resume the region.

3.2 Communication mechanism

The DLM LVB is used to communicate between the nodes of the cluster. There
are three resources used for the purpose:

3.2.1 Token: The resource which protects the entire communication
system. The node having the token resource is allowed to
communicate.

3.2.2 Message: The lock resource which carries the data to
communicate.

3.2.3 Ack: The resource, acquiring which means the message has been
acknowledged by all nodes in the cluster. The BAST of the resource
is used to inform the receiving node that a node wants to communicate.

The algorithm is:

1. receive status

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR

2. sender get EX of TOKEN
   sender get EX of MESSAGE

   sender                  receiver                receiver
   TOKEN:EX                ACK:CR                  ACK:CR
   MESSAGE:EX
   ACK:CR

   Sender checks that it still needs to send a message. Messages received
   or other events that happened while waiting for the TOKEN may have made
   this message inappropriate or redundant.

3. sender write LVB.
   sender down-convert MESSAGE from EX to CR
   sender try to get EX of ACK
   [ wait until all receivers have *processed* the MESSAGE ]

   [ triggered by bast of ACK ]
   receiver get CR of MESSAGE
   receiver read LVB
   receiver processes the message
   [ wait finish ]
   receiver release ACK

   sender                  receiver                receiver
   TOKEN:EX                MESSAGE:CR              MESSAGE:CR
   MESSAGE:CR
   ACK:EX

4. triggered by grant of EX on ACK (indicating all receivers have processed
   the message)
   sender down-convert ACK from EX to CR
   sender release MESSAGE
   sender release TOKEN
   receiver upconvert to EX of MESSAGE
   receiver get CR of ACK
   receiver release MESSAGE

   sender                  receiver                receiver
   ACK:CR                  ACK:CR                  ACK:CR


4. Handling Failures

4.1 Node Failure
When a node fails, the DLM informs the cluster with the slot number. A
surviving node starts a cluster recovery thread. The cluster recovery thread:
 - acquires the bitmap<number> lock of the failed node
 - opens the bitmap
 - reads the bitmap of the failed node
 - copies the set bits to the local node
 - cleans the bitmap of the failed node
 - releases bitmap<number> lock of the failed node
 - initiates resync of the bitmap on the current node

The resync process is the regular md resync. However, in a clustered
environment, when a resync is performed, it needs to tell other nodes
of the areas which are suspended. Before a resync starts, the node
sends out RESYNC_START with the (lo,hi) range of the area which needs
to be suspended. Each node maintains a suspend_list, which contains
the list of ranges which are currently suspended. On receiving
RESYNC_START, the node adds the range to the suspend_list. Similarly,
when the node performing resync finishes, it sends RESYNC_FINISHED
to other nodes and other nodes remove the corresponding entry from
the suspend_list.

A helper function, should_suspend(), can be used to check if a particular
I/O range should be suspended or not.

4.2 Device Failure
Device failures are handled and communicated with the metadata update
routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
 2. Node 1 sends NEWDISK with uuid and slot number
 3. Other nodes issue kobject_uevent_env with uuid and slot number
    (Steps 4,5 could be a udev rule)
 4. In userspace, the node searches for the disk, perhaps
    using blkid -t SUB_UUID=""
 5. Other nodes issue either of the following depending on whether the disk
    was found:
    ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
          disc.number set to slot number)
    ioctl(CLUSTERED_DISK_NACK)
 6. Other nodes drop lock on no-new-devs (CR) if device is found
 7. Node 1 attempts EX lock on no-new-devs
 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
    as SpareLocal
 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED
 10. Other nodes get the information whether a disk is added or not
     by the following METADATA_UPDATED.
+16 -3
crypto/async_tx/async_pq.c
···
 {
 	void **srcs;
 	int i;
+	int start = -1, stop = disks - 3;
 
 	if (submit->scribble)
 		srcs = submit->scribble;
···
 		if (blocks[i] == NULL) {
 			BUG_ON(i > disks - 3); /* P or Q can't be zero */
 			srcs[i] = (void*)raid6_empty_zero_page;
-		} else
+		} else {
 			srcs[i] = page_address(blocks[i]) + offset;
+			if (i < disks - 2) {
+				stop = i;
+				if (start == -1)
+					start = i;
+			}
+		}
 	}
-	raid6_call.gen_syndrome(disks, len, srcs);
+	if (submit->flags & ASYNC_TX_PQ_XOR_DST) {
+		BUG_ON(!raid6_call.xor_syndrome);
+		if (start >= 0)
+			raid6_call.xor_syndrome(disks, start, stop, len, srcs);
+	} else
+		raid6_call.gen_syndrome(disks, len, srcs);
 	async_tx_sync_epilog(submit);
 }
···
 	if (device)
 		unmap = dmaengine_get_unmap_data(device->dev, disks, GFP_NOIO);
 
-	if (unmap &&
+	/* XORing P/Q is only implemented in software */
+	if (unmap && !(submit->flags & ASYNC_TX_PQ_XOR_DST) &&
 	    (src_cnt <= dma_maxpq(device, 0) ||
 	     dma_maxpq(device, DMA_PREP_CONTINUE) > 0) &&
 	    is_dma_pq_aligned(device, offset, 0, len)) {
+16
drivers/md/Kconfig
···
 
 	  In unsure, say N.
 
+
+config MD_CLUSTER
+	tristate "Cluster Support for MD (EXPERIMENTAL)"
+	depends on BLK_DEV_MD
+	depends on DLM
+	default n
+	---help---
+	  Clustering support for MD devices. This enables locking and
+	  synchronization across multiple systems on the cluster, so all
+	  nodes in the cluster can access the MD devices simultaneously.
+
+	  This brings the redundancy (and uptime) of RAID levels across the
+	  nodes of the cluster.
+
+	  If unsure, say N.
+
 source "drivers/md/bcache/Kconfig"
 
 config BLK_DEV_DM_BUILTIN
+1
drivers/md/Makefile
···
 obj-$(CONFIG_MD_RAID456)	+= raid456.o
 obj-$(CONFIG_MD_MULTIPATH)	+= multipath.o
 obj-$(CONFIG_MD_FAULTY)		+= faulty.o
+obj-$(CONFIG_MD_CLUSTER)	+= md-cluster.o
 obj-$(CONFIG_BCACHE)		+= bcache/
 obj-$(CONFIG_BLK_DEV_MD)	+= md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)	+= dm-mod.o
+167 -22
drivers/md/bitmap.c
···
 	struct block_device *bdev;
 	struct mddev *mddev = bitmap->mddev;
 	struct bitmap_storage *store = &bitmap->storage;
+	int node_offset = 0;
+
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * store->file_pages;
 
 	while ((rdev = next_active_rdev(rdev, mddev)) != NULL) {
 		int size = PAGE_SIZE;
···
 	/* This might have been changed by a reshape */
 	sb->sync_size = cpu_to_le64(bitmap->mddev->resync_max_sectors);
 	sb->chunksize = cpu_to_le32(bitmap->mddev->bitmap_info.chunksize);
+	sb->nodes = cpu_to_le32(bitmap->mddev->bitmap_info.nodes);
 	sb->sectors_reserved = cpu_to_le32(bitmap->mddev->
 					   bitmap_info.space);
 	kunmap_atomic(sb);
···
 	bitmap_super_t *sb;
 	unsigned long chunksize, daemon_sleep, write_behind;
 	unsigned long long events;
+	int nodes = 0;
 	unsigned long sectors_reserved = 0;
 	int err = -EINVAL;
 	struct page *sb_page;
···
 		return -ENOMEM;
 	bitmap->storage.sb_page = sb_page;
 
+re_read:
+	/* If cluster_slot is set, the cluster is setup */
+	if (bitmap->cluster_slot >= 0) {
+		sector_t bm_blocks = bitmap->mddev->resync_max_sectors;
+
+		sector_div(bm_blocks,
+			   bitmap->mddev->bitmap_info.chunksize >> 9);
+		/* bits to bytes */
+		bm_blocks = ((bm_blocks+7) >> 3) + sizeof(bitmap_super_t);
+		/* to 4k blocks */
+		bm_blocks = DIV_ROUND_UP_SECTOR_T(bm_blocks, 4096);
+		bitmap->mddev->bitmap_info.offset += bitmap->cluster_slot * (bm_blocks << 3);
+		pr_info("%s:%d bm slot: %d offset: %llu\n", __func__, __LINE__,
+			bitmap->cluster_slot, (unsigned long long)bitmap->mddev->bitmap_info.offset);
+	}
+
 	if (bitmap->storage.file) {
 		loff_t isize = i_size_read(bitmap->storage.file->f_mapping->host);
 		int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
···
 	if (err)
 		return err;
 
+	err = -EINVAL;
 	sb = kmap_atomic(sb_page);
 
 	chunksize = le32_to_cpu(sb->chunksize);
 	daemon_sleep = le32_to_cpu(sb->daemon_sleep) * HZ;
 	write_behind = le32_to_cpu(sb->write_behind);
 	sectors_reserved = le32_to_cpu(sb->sectors_reserved);
+	nodes = le32_to_cpu(sb->nodes);
+	strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
 
 	/* verify that the bitmap-specific fields are valid */
 	if (sb->magic != cpu_to_le32(BITMAP_MAGIC))
···
 		goto out;
 	}
 	events = le64_to_cpu(sb->events);
-	if (events < bitmap->mddev->events) {
+	if (!nodes && (events < bitmap->mddev->events)) {
 		printk(KERN_INFO
 		       "%s: bitmap file is out of date (%llu < %llu) "
 		       "-- forcing full recovery\n",
···
 	if (le32_to_cpu(sb->version) == BITMAP_MAJOR_HOSTENDIAN)
 		set_bit(BITMAP_HOSTENDIAN, &bitmap->flags);
 	bitmap->events_cleared = le64_to_cpu(sb->events_cleared);
+	strlcpy(bitmap->mddev->bitmap_info.cluster_name, sb->cluster_name, 64);
 	err = 0;
+
 out:
 	kunmap_atomic(sb);
+	/* Assiging chunksize is required for "re_read" */
+	bitmap->mddev->bitmap_info.chunksize = chunksize;
+	if (nodes && (bitmap->cluster_slot < 0)) {
+		err = md_setup_cluster(bitmap->mddev, nodes);
+		if (err) {
+			pr_err("%s: Could not setup cluster service (%d)\n",
+					bmname(bitmap), err);
+			goto out_no_sb;
+		}
+		bitmap->cluster_slot = md_cluster_ops->slot_number(bitmap->mddev);
+		goto re_read;
+	}
+
+
 out_no_sb:
 	if (test_bit(BITMAP_STALE, &bitmap->flags))
 		bitmap->events_cleared = bitmap->mddev->events;
 	bitmap->mddev->bitmap_info.chunksize = chunksize;
 	bitmap->mddev->bitmap_info.daemon_sleep = daemon_sleep;
 	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
+	bitmap->mddev->bitmap_info.nodes = nodes;
 	if (bitmap->mddev->bitmap_info.space == 0 ||
 	    bitmap->mddev->bitmap_info.space > sectors_reserved)
 		bitmap->mddev->bitmap_info.space = sectors_reserved;
-	if (err)
+	if (err) {
 		bitmap_print_sb(bitmap);
+		if (bitmap->cluster_slot < 0)
+			md_cluster_stop(bitmap->mddev);
+	}
 	return err;
 }
···
 }
 
 static int bitmap_storage_alloc(struct bitmap_storage *store,
-				unsigned long chunks, int with_super)
+				unsigned long chunks, int with_super,
+				int slot_number)
 {
-	int pnum;
+	int pnum, offset = 0;
 	unsigned long num_pages;
 	unsigned long bytes;
···
 		bytes += sizeof(bitmap_super_t);
 
 	num_pages = DIV_ROUND_UP(bytes, PAGE_SIZE);
+	offset = slot_number * (num_pages - 1);
 
 	store->filemap = kmalloc(sizeof(struct page *)
 				 * num_pages, GFP_KERNEL);
···
 		store->sb_page = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (store->sb_page == NULL)
 			return -ENOMEM;
-		store->sb_page->index = 0;
 	}
+
 	pnum = 0;
 	if (store->sb_page) {
 		store->filemap[0] = store->sb_page;
 		pnum = 1;
+		store->sb_page->index = offset;
 	}
+
 	for ( ; pnum < num_pages; pnum++) {
 		store->filemap[pnum] = alloc_page(GFP_KERNEL|__GFP_ZERO);
 		if (!store->filemap[pnum]) {
 			store->file_pages = pnum;
 			return -ENOMEM;
 		}
-		store->filemap[pnum]->index = pnum;
+		store->filemap[pnum]->index = pnum + offset;
 	}
 	store->file_pages = pnum;
···
 	}
 }
 
+static int bitmap_file_test_bit(struct bitmap *bitmap, sector_t block)
+{
+	unsigned long bit;
+	struct page *page;
+	void *paddr;
+	unsigned long chunk = block >> bitmap->counts.chunkshift;
+	int set = 0;
+
+	page = filemap_get_page(&bitmap->storage, chunk);
+	if (!page)
+		return -EINVAL;
+	bit = file_page_offset(&bitmap->storage, chunk);
+	paddr = kmap_atomic(page);
+	if (test_bit(BITMAP_HOSTENDIAN, &bitmap->flags))
+		set = test_bit(bit, paddr);
+	else
+		set = test_bit_le(bit, paddr);
+	kunmap_atomic(paddr);
+	return set;
+}
+
+
 /* this gets called when the md device is ready to unplug its underlying
  * (slave) device queues -- before we let any writes go down, we need to
  * sync the dirty pages of the bitmap file to disk */
···
  */
 static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 {
-	unsigned long i, chunks, index, oldindex, bit;
+	unsigned long i, chunks, index, oldindex, bit, node_offset = 0;
 	struct page *page = NULL;
 	unsigned long bit_cnt = 0;
 	struct file *file;
···
 	if (!bitmap->mddev->bitmap_info.external)
 		offset = sizeof(bitmap_super_t);
 
+	if (mddev_is_clustered(bitmap->mddev))
+		node_offset = bitmap->cluster_slot * (DIV_ROUND_UP(store->bytes, PAGE_SIZE));
+
 	for (i = 0; i < chunks; i++) {
 		int b;
 		index = file_page_index(&bitmap->storage, i);
···
 					bitmap->mddev,
 					bitmap->mddev->bitmap_info.offset,
 					page,
-					index, count);
+					index + node_offset, count);
 
 				if (ret)
 					goto err;
···
 		     j < bitmap->storage.file_pages
 			     && !test_bit(BITMAP_STALE, &bitmap->flags);
 		     j++) {
-
 			if (test_page_attr(bitmap, j,
 					   BITMAP_PAGE_DIRTY))
 				/* bitmap_unplug will handle the rest */
···
 		return;
 	}
 	if (!*bmc) {
-		*bmc = 2 | (needed ? NEEDED_MASK : 0);
+		*bmc = 2;
 		bitmap_count_page(&bitmap->counts, offset, 1);
 		bitmap_set_pending(&bitmap->counts, offset);
 		bitmap->allclean = 0;
 	}
+	if (needed)
+		*bmc |= NEEDED_MASK;
 	spin_unlock_irq(&bitmap->counts.lock);
 }
···
 	if (!bitmap) /* there was no bitmap */
 		return;
 
+	if (mddev_is_clustered(bitmap->mddev) && bitmap->mddev->cluster_info &&
+		bitmap->cluster_slot == md_cluster_ops->slot_number(bitmap->mddev))
+		md_cluster_stop(bitmap->mddev);
+
 	/* Shouldn't be needed - but just in case.... */
 	wait_event(bitmap->write_wait,
 		   atomic_read(&bitmap->pending_writes) == 0);
···
  * initialize the bitmap structure
  * if this returns an error, bitmap_destroy must be called to do clean up
  */
-int bitmap_create(struct mddev *mddev)
+struct bitmap *bitmap_create(struct mddev *mddev, int slot)
 {
 	struct bitmap *bitmap;
 	sector_t blocks = mddev->resync_max_sectors;
···
 	bitmap = kzalloc(sizeof(*bitmap), GFP_KERNEL);
 	if (!bitmap)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	spin_lock_init(&bitmap->counts.lock);
 	atomic_set(&bitmap->pending_writes, 0);
···
 	init_waitqueue_head(&bitmap->behind_wait);
 
 	bitmap->mddev = mddev;
+	bitmap->cluster_slot = slot;
 
 	if (mddev->kobj.sd)
 		bm = sysfs_get_dirent(mddev->kobj.sd, "bitmap");
···
 	printk(KERN_INFO "created bitmap (%lu pages) for device %s\n",
 	       bitmap->counts.pages, bmname(bitmap));
 
-	mddev->bitmap = bitmap;
-	return test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0;
+	err = test_bit(BITMAP_WRITE_ERROR, &bitmap->flags) ? -EIO : 0;
+	if (err)
+		goto error;
 
+	return bitmap;
 error:
 	bitmap_free(bitmap);
-	return err;
+	return ERR_PTR(err);
 }
···
 	return err;
 }
 EXPORT_SYMBOL_GPL(bitmap_load);
+
+/* Loads the bitmap associated with slot and copies the resync information
+ * to our bitmap
+ */
+int bitmap_copy_from_slot(struct mddev *mddev, int slot,
+		sector_t *low, sector_t *high, bool clear_bits)
+{
+	int rv = 0, i, j;
+	sector_t block, lo = 0, hi = 0;
+	struct bitmap_counts *counts;
+	struct bitmap *bitmap = bitmap_create(mddev, slot);
+
+	if (IS_ERR(bitmap))
+		return PTR_ERR(bitmap);
+
+	rv = bitmap_read_sb(bitmap);
+	if (rv)
+		goto err;
+
+	rv = bitmap_init_from_disk(bitmap, 0);
+	if (rv)
+		goto err;
+
+	counts = &bitmap->counts;
+	for (j = 0; j < counts->chunks; j++) {
+		block = (sector_t)j << counts->chunkshift;
+		if (bitmap_file_test_bit(bitmap, block)) {
+			if (!lo)
+				lo = block;
+			hi = block;
+			bitmap_file_clear_bit(bitmap, block);
+			bitmap_set_memory_bits(mddev->bitmap, block, 1);
+			bitmap_file_set_bit(mddev->bitmap, block);
+		}
+	}
+
+	if (clear_bits) {
+		bitmap_update_sb(bitmap);
+		/* Setting this for the ev_page should be enough.
+		 * And we do not require both write_all and PAGE_DIRT either
+		 */
+		for (i = 0; i < bitmap->storage.file_pages; i++)
+			set_page_attr(bitmap, i, BITMAP_PAGE_DIRTY);
+		bitmap_write_all(bitmap);
+		bitmap_unplug(bitmap);
+	}
+	*low = lo;
+	*high = hi;
+err:
+	bitmap_free(bitmap);
+	return rv;
+}
+EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);
+
 
 void bitmap_status(struct seq_file *seq, struct bitmap *bitmap)
 {
···
 	memset(&store, 0, sizeof(store));
 	if (bitmap->mddev->bitmap_info.offset || bitmap->mddev->bitmap_info.file)
 		ret = bitmap_storage_alloc(&store, chunks,
-					   !bitmap->mddev->bitmap_info.external);
+					   !bitmap->mddev->bitmap_info.external,
+					   bitmap->cluster_slot);
 	if (ret)
 		goto err;
···
 		return -EINVAL;
 	mddev->bitmap_info.offset = offset;
 	if (mddev->pers) {
+		struct bitmap *bitmap;
 		mddev->pers->quiesce(mddev, 1);
-		rv = bitmap_create(mddev);
-		if (!rv)
+		bitmap = bitmap_create(mddev, -1);
+		if (IS_ERR(bitmap))
+			rv = PTR_ERR(bitmap);
+		else {
+			mddev->bitmap = bitmap;
 			rv = bitmap_load(mddev);
-		if (rv) {
-			bitmap_destroy(mddev);
-			mddev->bitmap_info.offset = 0;
+			if (rv) {
+				bitmap_destroy(mddev);
+				mddev->bitmap_info.offset = 0;
+			}
 		}
 		mddev->pers->quiesce(mddev, 0);
 		if (rv)
···
 
 static ssize_t metadata_show(struct mddev *mddev, char *page)
 {
+	if (mddev_is_clustered(mddev))
+		return sprintf(page, "clustered\n");
 	return sprintf(page, "%s\n", (mddev->bitmap_info.external
 				      ? "external" : "internal"));
 }
···
 		return -EBUSY;
 	if (strncmp(buf, "external", 8) == 0)
 		mddev->bitmap_info.external = 1;
-	else if (strncmp(buf, "internal", 8) == 0)
+	else if ((strncmp(buf, "internal", 8) == 0) ||
+			(strncmp(buf, "clustered", 9) == 0))
 		mddev->bitmap_info.external = 0;
 	else
 		return -EINVAL;
+7 -3
drivers/md/bitmap.h
···
 	__le32 write_behind;     /* 60 number of outstanding write-behind writes */
 	__le32 sectors_reserved; /* 64 number of 512-byte sectors that are
 				  * reserved for the bitmap. */
-
-	__u8  pad[256 - 68];     /* set to zero */
+	__le32 nodes;            /* 68 the maximum number of nodes in cluster. */
+	__u8 cluster_name[64];   /* 72 cluster name to which this md belongs */
+	__u8  pad[256 - 136];    /* set to zero */
 } bitmap_super_t;
 
 /* notes:
···
 	wait_queue_head_t behind_wait;
 
 	struct kernfs_node *sysfs_can_clear;
+	int cluster_slot;		/* Slot offset for clustered env */
 };
 
 /* the bitmap API */
 
 /* these are used only by md/bitmap */
-int  bitmap_create(struct mddev *mddev);
+struct bitmap *bitmap_create(struct mddev *mddev, int slot);
 int  bitmap_load(struct mddev *mddev);
 void bitmap_flush(struct mddev *mddev);
 void bitmap_destroy(struct mddev *mddev);
···
 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		  int chunksize, int init);
+int bitmap_copy_from_slot(struct mddev *mddev, int slot,
+				sector_t *lo, sector_t *hi, bool clear_bits);
 #endif
 
 #endif
+965
drivers/md/md-cluster.c
/*
 * Copyright (C) 2015, SUSE
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2, or (at your option)
 * any later version.
 *
 */


#include <linux/module.h>
#include <linux/dlm.h>
#include <linux/sched.h>
#include <linux/raid/md_p.h>
#include "md.h"
#include "bitmap.h"
#include "md-cluster.h"

#define LVB_SIZE	64
#define NEW_DEV_TIMEOUT 5000

struct dlm_lock_resource {
	dlm_lockspace_t *ls;
	struct dlm_lksb lksb;
	char *name; /* lock name. */
	uint32_t flags; /* flags to pass to dlm_lock() */
	struct completion completion; /* completion for synchronized locking */
	void (*bast)(void *arg, int mode); /* blocking AST function pointer*/
	struct mddev *mddev; /* pointing back to mddev. */
};

struct suspend_info {
	int slot;
	sector_t lo;
	sector_t hi;
	struct list_head list;
};

struct resync_info {
	__le64 lo;
	__le64 hi;
};

/* md_cluster_info flags */
#define		MD_CLUSTER_WAITING_FOR_NEWDISK		1


struct md_cluster_info {
	/* dlm lock space and resources for clustered raid. */
	dlm_lockspace_t *lockspace;
	int slot_number;
	struct completion completion;
	struct dlm_lock_resource *sb_lock;
	struct mutex sb_mutex;
	struct dlm_lock_resource *bitmap_lockres;
	struct list_head suspend_list;
	spinlock_t suspend_lock;
	struct md_thread *recovery_thread;
	unsigned long recovery_map;
	/* communication loc resources */
	struct dlm_lock_resource *ack_lockres;
	struct dlm_lock_resource *message_lockres;
	struct dlm_lock_resource *token_lockres;
	struct dlm_lock_resource *no_new_dev_lockres;
	struct md_thread *recv_thread;
	struct completion newdisk_completion;
	unsigned long state;
};

enum msg_type {
	METADATA_UPDATED = 0,
	RESYNCING,
	NEWDISK,
	REMOVE,
	RE_ADD,
};

struct cluster_msg {
	int type;
	int slot;
	/* TODO: Unionize this for smaller footprint */
	sector_t low;
	sector_t high;
	char uuid[16];
	int raid_slot;
};

static void sync_ast(void *arg)
{
	struct dlm_lock_resource *res;

	res = (struct dlm_lock_resource *) arg;
	complete(&res->completion);
}

static int dlm_lock_sync(struct dlm_lock_resource *res, int mode)
{
	int ret = 0;

	init_completion(&res->completion);
	ret = dlm_lock(res->ls, mode, &res->lksb,
			res->flags, res->name, strlen(res->name),
			0, sync_ast, res, res->bast);
	if (ret)
		return ret;
	wait_for_completion(&res->completion);
	return res->lksb.sb_status;
}

static int dlm_unlock_sync(struct dlm_lock_resource *res)
{
	return dlm_lock_sync(res, DLM_LOCK_NL);
}

static struct dlm_lock_resource *lockres_init(struct mddev *mddev,
		char *name, void (*bastfn)(void *arg, int mode), int with_lvb)
{
	struct dlm_lock_resource *res = NULL;
	int ret, namelen;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	res = kzalloc(sizeof(struct dlm_lock_resource), GFP_KERNEL);
	if (!res)
		return NULL;
	res->ls = cinfo->lockspace;
	res->mddev = mddev;
	namelen = strlen(name);
	res->name = kzalloc(namelen + 1, GFP_KERNEL);
	if (!res->name) {
		pr_err("md-cluster: Unable to allocate resource name for resource %s\n", name);
		goto out_err;
	}
	strlcpy(res->name, name, namelen + 1);
	if (with_lvb) {
		res->lksb.sb_lvbptr = kzalloc(LVB_SIZE, GFP_KERNEL);
		if (!res->lksb.sb_lvbptr) {
			pr_err("md-cluster: Unable to allocate LVB for resource %s\n", name);
			goto out_err;
		}
		res->flags = DLM_LKF_VALBLK;
	}

	if (bastfn)
		res->bast = bastfn;

	res->flags |= DLM_LKF_EXPEDITE;

	ret = dlm_lock_sync(res, DLM_LOCK_NL);
	if (ret) {
		pr_err("md-cluster: Unable to lock NL on new lock resource %s\n", name);
		goto out_err;
	}
	res->flags &= ~DLM_LKF_EXPEDITE;
	res->flags |= DLM_LKF_CONVERT;

	return res;
out_err:
	kfree(res->lksb.sb_lvbptr);
	kfree(res->name);
	kfree(res);
	return NULL;
}

static void lockres_free(struct dlm_lock_resource *res)
{
	if (!res)
		return;

	init_completion(&res->completion);
	dlm_unlock(res->ls, res->lksb.sb_lkid, 0, &res->lksb, res);
	wait_for_completion(&res->completion);

	kfree(res->name);
	kfree(res->lksb.sb_lvbptr);
	kfree(res);
}

static char *pretty_uuid(char *dest, char *src)
{
	int i, len = 0;

	for (i = 0; i < 16; i++) {
		if (i == 4 || i == 6 || i == 8 || i == 10)
			len += sprintf(dest + len, "-");
		len += sprintf(dest + len, "%02x", (__u8)src[i]);
	}
	return dest;
}

static void add_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres,
		sector_t lo, sector_t hi)
{
	struct resync_info *ri;

	ri = (struct resync_info *)lockres->lksb.sb_lvbptr;
	ri->lo = cpu_to_le64(lo);
	ri->hi = cpu_to_le64(hi);
}

static struct suspend_info *read_resync_info(struct mddev *mddev, struct dlm_lock_resource *lockres)
{
	struct resync_info ri;
	struct suspend_info *s = NULL;
	sector_t hi = 0;

	dlm_lock_sync(lockres, DLM_LOCK_CR);
	memcpy(&ri, lockres->lksb.sb_lvbptr, sizeof(struct resync_info));
	hi = le64_to_cpu(ri.hi);
	if (ri.hi > 0) {
		s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
		if (!s)
			goto out;
		s->hi = hi;
		s->lo = le64_to_cpu(ri.lo);
	}
	dlm_unlock_sync(lockres);
out:
	return s;
}

static void recover_bitmaps(struct md_thread *thread)
{
	struct mddev *mddev = thread->mddev;
	struct md_cluster_info *cinfo = mddev->cluster_info;
	struct dlm_lock_resource *bm_lockres;
	char str[64];
	int slot, ret;
	struct suspend_info *s, *tmp;
	sector_t lo, hi;

	while (cinfo->recovery_map) {
		slot = fls64((u64)cinfo->recovery_map) - 1;

		/* Clear suspend_area associated with the bitmap */
		spin_lock_irq(&cinfo->suspend_lock);
		list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
			if (slot == s->slot) {
				list_del(&s->list);
				kfree(s);
			}
		spin_unlock_irq(&cinfo->suspend_lock);

		snprintf(str, 64, "bitmap%04d", slot);
		bm_lockres = lockres_init(mddev, str, NULL, 1);
		if (!bm_lockres) {
			pr_err("md-cluster: Cannot initialize bitmaps\n");
			goto clear_bit;
		}

		ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW);
		if (ret) {
			pr_err("md-cluster: Could not DLM lock %s: %d\n",
					str, ret);
			goto clear_bit;
		}
		ret = bitmap_copy_from_slot(mddev, slot, &lo, &hi, true);
		if (ret) {
			pr_err("md-cluster: Could not copy data from bitmap %d\n", slot);
			goto dlm_unlock;
		}
		if (hi > 0) {
			/* TODO:Wait for current resync to get over */
			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			if (lo < mddev->recovery_cp)
				mddev->recovery_cp = lo;
			md_check_recovery(mddev);
		}
dlm_unlock:
		dlm_unlock_sync(bm_lockres);
clear_bit:
		clear_bit(slot, &cinfo->recovery_map);
	}
}

static void recover_prep(void *arg)
{
}

static void recover_slot(void *arg, struct dlm_slot *slot)
{
	struct mddev *mddev = arg;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	pr_info("md-cluster: %s Node %d/%d down. My slot: %d. Initiating recovery.\n",
			mddev->bitmap_info.cluster_name,
			slot->nodeid, slot->slot,
			cinfo->slot_number);
	set_bit(slot->slot - 1, &cinfo->recovery_map);
	if (!cinfo->recovery_thread) {
		cinfo->recovery_thread = md_register_thread(recover_bitmaps,
				mddev, "recover");
		if (!cinfo->recovery_thread) {
			pr_warn("md-cluster: Could not create recovery thread\n");
			return;
		}
	}
	md_wakeup_thread(cinfo->recovery_thread);
}

static void recover_done(void *arg, struct dlm_slot *slots,
		int num_slots, int our_slot,
		uint32_t generation)
{
	struct mddev *mddev = arg;
	struct md_cluster_info *cinfo = mddev->cluster_info;

	cinfo->slot_number = our_slot;
	complete(&cinfo->completion);
}

static const struct dlm_lockspace_ops md_ls_ops = {
	.recover_prep = recover_prep,
	.recover_slot = recover_slot,
	.recover_done = recover_done,
};

/*
 * The BAST function for the ack lock resource
 * This function wakes up the receive thread in
 * order to receive and process the message.
 */
static void ack_bast(void *arg, int mode)
{
	struct dlm_lock_resource *res = (struct dlm_lock_resource *)arg;
	struct md_cluster_info *cinfo = res->mddev->cluster_info;

	if (mode == DLM_LOCK_EX)
		md_wakeup_thread(cinfo->recv_thread);
}

static void __remove_suspend_info(struct md_cluster_info *cinfo, int slot)
{
	struct suspend_info *s, *tmp;

	list_for_each_entry_safe(s, tmp, &cinfo->suspend_list, list)
		if (slot == s->slot) {
			pr_info("%s:%d Deleting suspend_info: %d\n",
					__func__, __LINE__, slot);
			list_del(&s->list);
			kfree(s);
			break;
		}
}

static void remove_suspend_info(struct md_cluster_info *cinfo, int slot)
{
	spin_lock_irq(&cinfo->suspend_lock);
	__remove_suspend_info(cinfo, slot);
	spin_unlock_irq(&cinfo->suspend_lock);
}


static void process_suspend_info(struct md_cluster_info *cinfo,
		int slot, sector_t lo, sector_t hi)
{
	struct suspend_info *s;

	if (!hi) {
		remove_suspend_info(cinfo, slot);
		return;
	}
	s = kzalloc(sizeof(struct suspend_info), GFP_KERNEL);
	if (!s)
		return;
	s->slot = slot;
	s->lo = lo;
	s->hi = hi;
	spin_lock_irq(&cinfo->suspend_lock);
	/* Remove existing entry (if exists) before adding */
	__remove_suspend_info(cinfo, slot);
	list_add(&s->list, &cinfo->suspend_list);
	spin_unlock_irq(&cinfo->suspend_lock);
}

static void process_add_new_disk(struct mddev *mddev, struct cluster_msg *cmsg)
{
	char disk_uuid[64];
	struct md_cluster_info *cinfo = mddev->cluster_info;
	char event_name[] = "EVENT=ADD_DEVICE";
	char raid_slot[16];
	char *envp[] = {event_name, disk_uuid, raid_slot, NULL};
	int len;

	len = snprintf(disk_uuid, 64, "DEVICE_UUID=");
	pretty_uuid(disk_uuid + len, cmsg->uuid);
snprintf(raid_slot, 16, "RAID_DISK=%d", cmsg->raid_slot); 388 + pr_info("%s:%d Sending kobject change with %s and %s\n", __func__, __LINE__, disk_uuid, raid_slot); 389 + init_completion(&cinfo->newdisk_completion); 390 + set_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); 391 + kobject_uevent_env(&disk_to_dev(mddev->gendisk)->kobj, KOBJ_CHANGE, envp); 392 + wait_for_completion_timeout(&cinfo->newdisk_completion, 393 + NEW_DEV_TIMEOUT); 394 + clear_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state); 395 + } 396 + 397 + 398 + static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg) 399 + { 400 + struct md_cluster_info *cinfo = mddev->cluster_info; 401 + 402 + md_reload_sb(mddev); 403 + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); 404 + } 405 + 406 + static void process_remove_disk(struct mddev *mddev, struct cluster_msg *msg) 407 + { 408 + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); 409 + 410 + if (rdev) 411 + md_kick_rdev_from_array(rdev); 412 + else 413 + pr_warn("%s: %d Could not find disk(%d) to REMOVE\n", __func__, __LINE__, msg->raid_slot); 414 + } 415 + 416 + static void process_readd_disk(struct mddev *mddev, struct cluster_msg *msg) 417 + { 418 + struct md_rdev *rdev = md_find_rdev_nr_rcu(mddev, msg->raid_slot); 419 + 420 + if (rdev && test_bit(Faulty, &rdev->flags)) 421 + clear_bit(Faulty, &rdev->flags); 422 + else 423 + pr_warn("%s: %d Could not find disk(%d) which is faulty", __func__, __LINE__, msg->raid_slot); 424 + } 425 + 426 + static void process_recvd_msg(struct mddev *mddev, struct cluster_msg *msg) 427 + { 428 + switch (msg->type) { 429 + case METADATA_UPDATED: 430 + pr_info("%s: %d Received message: METADATA_UPDATE from %d\n", 431 + __func__, __LINE__, msg->slot); 432 + process_metadata_update(mddev, msg); 433 + break; 434 + case RESYNCING: 435 + pr_info("%s: %d Received message: RESYNCING from %d\n", 436 + __func__, __LINE__, msg->slot); 437 + process_suspend_info(mddev->cluster_info, 
msg->slot, 438 + msg->low, msg->high); 439 + break; 440 + case NEWDISK: 441 + pr_info("%s: %d Received message: NEWDISK from %d\n", 442 + __func__, __LINE__, msg->slot); 443 + process_add_new_disk(mddev, msg); 444 + break; 445 + case REMOVE: 446 + pr_info("%s: %d Received REMOVE from %d\n", 447 + __func__, __LINE__, msg->slot); 448 + process_remove_disk(mddev, msg); 449 + break; 450 + case RE_ADD: 451 + pr_info("%s: %d Received RE_ADD from %d\n", 452 + __func__, __LINE__, msg->slot); 453 + process_readd_disk(mddev, msg); 454 + break; 455 + default: 456 + pr_warn("%s:%d Received unknown message from %d\n", 457 + __func__, __LINE__, msg->slot); 458 + } 459 + } 460 + 461 + /* 462 + * thread for receiving message 463 + */ 464 + static void recv_daemon(struct md_thread *thread) 465 + { 466 + struct md_cluster_info *cinfo = thread->mddev->cluster_info; 467 + struct dlm_lock_resource *ack_lockres = cinfo->ack_lockres; 468 + struct dlm_lock_resource *message_lockres = cinfo->message_lockres; 469 + struct cluster_msg msg; 470 + 471 + /*get CR on Message*/ 472 + if (dlm_lock_sync(message_lockres, DLM_LOCK_CR)) { 473 + pr_err("md/raid1:failed to get CR on MESSAGE\n"); 474 + return; 475 + } 476 + 477 + /* read lvb and wake up thread to process this message_lockres */ 478 + memcpy(&msg, message_lockres->lksb.sb_lvbptr, sizeof(struct cluster_msg)); 479 + process_recvd_msg(thread->mddev, &msg); 480 + 481 + /*release CR on ack_lockres*/ 482 + dlm_unlock_sync(ack_lockres); 483 + /*up-convert to EX on message_lockres*/ 484 + dlm_lock_sync(message_lockres, DLM_LOCK_EX); 485 + /*get CR on ack_lockres again*/ 486 + dlm_lock_sync(ack_lockres, DLM_LOCK_CR); 487 + /*release CR on message_lockres*/ 488 + dlm_unlock_sync(message_lockres); 489 + } 490 + 491 + /* lock_comm() 492 + * Takes the lock on the TOKEN lock resource so no other 493 + * node can communicate while the operation is underway. 
494 + */ 495 + static int lock_comm(struct md_cluster_info *cinfo) 496 + { 497 + int error; 498 + 499 + error = dlm_lock_sync(cinfo->token_lockres, DLM_LOCK_EX); 500 + if (error) 501 + pr_err("md-cluster(%s:%d): failed to get EX on TOKEN (%d)\n", 502 + __func__, __LINE__, error); 503 + return error; 504 + } 505 + 506 + static void unlock_comm(struct md_cluster_info *cinfo) 507 + { 508 + dlm_unlock_sync(cinfo->token_lockres); 509 + } 510 + 511 + /* __sendmsg() 512 + * This function performs the actual sending of the message. This function is 513 + * usually called after performing the encompassing operation 514 + * The function: 515 + * 1. Grabs the message lockresource in EX mode 516 + * 2. Copies the message to the message LVB 517 + * 3. Downconverts message lockresource to CR 518 + * 4. Upconverts ack lock resource from CR to EX. This forces the BAST on other nodes 519 + * and the other nodes read the message. The thread will wait here until all other 520 + * nodes have released ack lock resource. 521 + * 5. 
Downconvert ack lockresource to CR 522 + */ 523 + static int __sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) 524 + { 525 + int error; 526 + int slot = cinfo->slot_number - 1; 527 + 528 + cmsg->slot = cpu_to_le32(slot); 529 + /*get EX on Message*/ 530 + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_EX); 531 + if (error) { 532 + pr_err("md-cluster: failed to get EX on MESSAGE (%d)\n", error); 533 + goto failed_message; 534 + } 535 + 536 + memcpy(cinfo->message_lockres->lksb.sb_lvbptr, (void *)cmsg, 537 + sizeof(struct cluster_msg)); 538 + /*down-convert EX to CR on Message*/ 539 + error = dlm_lock_sync(cinfo->message_lockres, DLM_LOCK_CR); 540 + if (error) { 541 + pr_err("md-cluster: failed to convert EX to CR on MESSAGE(%d)\n", 542 + error); 543 + goto failed_message; 544 + } 545 + 546 + /*up-convert CR to EX on Ack*/ 547 + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_EX); 548 + if (error) { 549 + pr_err("md-cluster: failed to convert CR to EX on ACK(%d)\n", 550 + error); 551 + goto failed_ack; 552 + } 553 + 554 + /*down-convert EX to CR on Ack*/ 555 + error = dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR); 556 + if (error) { 557 + pr_err("md-cluster: failed to convert EX to CR on ACK(%d)\n", 558 + error); 559 + goto failed_ack; 560 + } 561 + 562 + failed_ack: 563 + dlm_unlock_sync(cinfo->message_lockres); 564 + failed_message: 565 + return error; 566 + } 567 + 568 + static int sendmsg(struct md_cluster_info *cinfo, struct cluster_msg *cmsg) 569 + { 570 + int ret; 571 + 572 + lock_comm(cinfo); 573 + ret = __sendmsg(cinfo, cmsg); 574 + unlock_comm(cinfo); 575 + return ret; 576 + } 577 + 578 + static int gather_all_resync_info(struct mddev *mddev, int total_slots) 579 + { 580 + struct md_cluster_info *cinfo = mddev->cluster_info; 581 + int i, ret = 0; 582 + struct dlm_lock_resource *bm_lockres; 583 + struct suspend_info *s; 584 + char str[64]; 585 + 586 + 587 + for (i = 0; i < total_slots; i++) { 588 + memset(str, '\0', 64); 589 + 
snprintf(str, 64, "bitmap%04d", i); 590 + bm_lockres = lockres_init(mddev, str, NULL, 1); 591 + if (!bm_lockres) 592 + return -ENOMEM; 593 + if (i == (cinfo->slot_number - 1)) 594 + continue; 595 + 596 + bm_lockres->flags |= DLM_LKF_NOQUEUE; 597 + ret = dlm_lock_sync(bm_lockres, DLM_LOCK_PW); 598 + if (ret == -EAGAIN) { 599 + memset(bm_lockres->lksb.sb_lvbptr, '\0', LVB_SIZE); 600 + s = read_resync_info(mddev, bm_lockres); 601 + if (s) { 602 + pr_info("%s:%d Resync[%llu..%llu] in progress on %d\n", 603 + __func__, __LINE__, 604 + (unsigned long long) s->lo, 605 + (unsigned long long) s->hi, i); 606 + spin_lock_irq(&cinfo->suspend_lock); 607 + s->slot = i; 608 + list_add(&s->list, &cinfo->suspend_list); 609 + spin_unlock_irq(&cinfo->suspend_lock); 610 + } 611 + ret = 0; 612 + lockres_free(bm_lockres); 613 + continue; 614 + } 615 + if (ret) 616 + goto out; 617 + /* TODO: Read the disk bitmap sb and check if it needs recovery */ 618 + dlm_unlock_sync(bm_lockres); 619 + lockres_free(bm_lockres); 620 + } 621 + out: 622 + return ret; 623 + } 624 + 625 + static int join(struct mddev *mddev, int nodes) 626 + { 627 + struct md_cluster_info *cinfo; 628 + int ret, ops_rv; 629 + char str[64]; 630 + 631 + if (!try_module_get(THIS_MODULE)) 632 + return -ENOENT; 633 + 634 + cinfo = kzalloc(sizeof(struct md_cluster_info), GFP_KERNEL); 635 + if (!cinfo) 636 + return -ENOMEM; 637 + 638 + init_completion(&cinfo->completion); 639 + 640 + mutex_init(&cinfo->sb_mutex); 641 + mddev->cluster_info = cinfo; 642 + 643 + memset(str, 0, 64); 644 + pretty_uuid(str, mddev->uuid); 645 + ret = dlm_new_lockspace(str, mddev->bitmap_info.cluster_name, 646 + DLM_LSFL_FS, LVB_SIZE, 647 + &md_ls_ops, mddev, &ops_rv, &cinfo->lockspace); 648 + if (ret) 649 + goto err; 650 + wait_for_completion(&cinfo->completion); 651 + if (nodes < cinfo->slot_number) { 652 + pr_err("md-cluster: Slot allotted(%d) is greater than available slots(%d).", 653 + cinfo->slot_number, nodes); 654 + ret = -ERANGE; 655 + goto err; 
656 + } 657 + cinfo->sb_lock = lockres_init(mddev, "cmd-super", 658 + NULL, 0); 659 + if (!cinfo->sb_lock) { 660 + ret = -ENOMEM; 661 + goto err; 662 + } 663 + /* Initiate the communication resources */ 664 + ret = -ENOMEM; 665 + cinfo->recv_thread = md_register_thread(recv_daemon, mddev, "cluster_recv"); 666 + if (!cinfo->recv_thread) { 667 + pr_err("md-cluster: cannot allocate memory for recv_thread!\n"); 668 + goto err; 669 + } 670 + cinfo->message_lockres = lockres_init(mddev, "message", NULL, 1); 671 + if (!cinfo->message_lockres) 672 + goto err; 673 + cinfo->token_lockres = lockres_init(mddev, "token", NULL, 0); 674 + if (!cinfo->token_lockres) 675 + goto err; 676 + cinfo->ack_lockres = lockres_init(mddev, "ack", ack_bast, 0); 677 + if (!cinfo->ack_lockres) 678 + goto err; 679 + cinfo->no_new_dev_lockres = lockres_init(mddev, "no-new-dev", NULL, 0); 680 + if (!cinfo->no_new_dev_lockres) 681 + goto err; 682 + 683 + /* get sync CR lock on ACK. */ 684 + if (dlm_lock_sync(cinfo->ack_lockres, DLM_LOCK_CR)) 685 + pr_err("md-cluster: failed to get a sync CR lock on ACK!(%d)\n", 686 + ret); 687 + /* get sync CR lock on no-new-dev. 
*/ 688 + if (dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR)) 689 + pr_err("md-cluster: failed to get a sync CR lock on no-new-dev!(%d)\n", ret); 690 + 691 + 692 + pr_info("md-cluster: Joined cluster %s slot %d\n", str, cinfo->slot_number); 693 + snprintf(str, 64, "bitmap%04d", cinfo->slot_number - 1); 694 + cinfo->bitmap_lockres = lockres_init(mddev, str, NULL, 1); 695 + if (!cinfo->bitmap_lockres) 696 + goto err; 697 + if (dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW)) { 698 + pr_err("Failed to get bitmap lock\n"); 699 + ret = -EINVAL; 700 + goto err; 701 + } 702 + 703 + INIT_LIST_HEAD(&cinfo->suspend_list); 704 + spin_lock_init(&cinfo->suspend_lock); 705 + 706 + ret = gather_all_resync_info(mddev, nodes); 707 + if (ret) 708 + goto err; 709 + 710 + return 0; 711 + err: 712 + lockres_free(cinfo->message_lockres); 713 + lockres_free(cinfo->token_lockres); 714 + lockres_free(cinfo->ack_lockres); 715 + lockres_free(cinfo->no_new_dev_lockres); 716 + lockres_free(cinfo->bitmap_lockres); 717 + lockres_free(cinfo->sb_lock); 718 + if (cinfo->lockspace) 719 + dlm_release_lockspace(cinfo->lockspace, 2); 720 + mddev->cluster_info = NULL; 721 + kfree(cinfo); 722 + module_put(THIS_MODULE); 723 + return ret; 724 + } 725 + 726 + static int leave(struct mddev *mddev) 727 + { 728 + struct md_cluster_info *cinfo = mddev->cluster_info; 729 + 730 + if (!cinfo) 731 + return 0; 732 + md_unregister_thread(&cinfo->recovery_thread); 733 + md_unregister_thread(&cinfo->recv_thread); 734 + lockres_free(cinfo->message_lockres); 735 + lockres_free(cinfo->token_lockres); 736 + lockres_free(cinfo->ack_lockres); 737 + lockres_free(cinfo->no_new_dev_lockres); 738 + lockres_free(cinfo->sb_lock); 739 + lockres_free(cinfo->bitmap_lockres); 740 + dlm_release_lockspace(cinfo->lockspace, 2); 741 + return 0; 742 + } 743 + 744 + /* slot_number(): Returns the MD slot number to use 745 + * DLM starts the slot numbers from 1, whereas cluster-md 746 + * wants the number to be from zero, so we deduct
one 747 + */ 748 + static int slot_number(struct mddev *mddev) 749 + { 750 + struct md_cluster_info *cinfo = mddev->cluster_info; 751 + 752 + return cinfo->slot_number - 1; 753 + } 754 + 755 + static void resync_info_update(struct mddev *mddev, sector_t lo, sector_t hi) 756 + { 757 + struct md_cluster_info *cinfo = mddev->cluster_info; 758 + 759 + add_resync_info(mddev, cinfo->bitmap_lockres, lo, hi); 760 + /* Re-acquire the lock to refresh LVB */ 761 + dlm_lock_sync(cinfo->bitmap_lockres, DLM_LOCK_PW); 762 + } 763 + 764 + static int metadata_update_start(struct mddev *mddev) 765 + { 766 + return lock_comm(mddev->cluster_info); 767 + } 768 + 769 + static int metadata_update_finish(struct mddev *mddev) 770 + { 771 + struct md_cluster_info *cinfo = mddev->cluster_info; 772 + struct cluster_msg cmsg; 773 + int ret; 774 + 775 + memset(&cmsg, 0, sizeof(cmsg)); 776 + cmsg.type = cpu_to_le32(METADATA_UPDATED); 777 + ret = __sendmsg(cinfo, &cmsg); 778 + unlock_comm(cinfo); 779 + return ret; 780 + } 781 + 782 + static int metadata_update_cancel(struct mddev *mddev) 783 + { 784 + struct md_cluster_info *cinfo = mddev->cluster_info; 785 + 786 + return dlm_unlock_sync(cinfo->token_lockres); 787 + } 788 + 789 + static int resync_send(struct mddev *mddev, enum msg_type type, 790 + sector_t lo, sector_t hi) 791 + { 792 + struct md_cluster_info *cinfo = mddev->cluster_info; 793 + struct cluster_msg cmsg; 794 + int slot = cinfo->slot_number - 1; 795 + 796 + pr_info("%s:%d lo: %llu hi: %llu\n", __func__, __LINE__, 797 + (unsigned long long)lo, 798 + (unsigned long long)hi); 799 + resync_info_update(mddev, lo, hi); 800 + cmsg.type = cpu_to_le32(type); 801 + cmsg.slot = cpu_to_le32(slot); 802 + cmsg.low = cpu_to_le64(lo); 803 + cmsg.high = cpu_to_le64(hi); 804 + return sendmsg(cinfo, &cmsg); 805 + } 806 + 807 + static int resync_start(struct mddev *mddev, sector_t lo, sector_t hi) 808 + { 809 + pr_info("%s:%d\n", __func__, __LINE__); 810 + return resync_send(mddev, RESYNCING, lo, hi); 
811 + } 812 + 813 + static void resync_finish(struct mddev *mddev) 814 + { 815 + pr_info("%s:%d\n", __func__, __LINE__); 816 + resync_send(mddev, RESYNCING, 0, 0); 817 + } 818 + 819 + static int area_resyncing(struct mddev *mddev, sector_t lo, sector_t hi) 820 + { 821 + struct md_cluster_info *cinfo = mddev->cluster_info; 822 + int ret = 0; 823 + struct suspend_info *s; 824 + 825 + spin_lock_irq(&cinfo->suspend_lock); 826 + if (list_empty(&cinfo->suspend_list)) 827 + goto out; 828 + list_for_each_entry(s, &cinfo->suspend_list, list) 829 + if (hi > s->lo && lo < s->hi) { 830 + ret = 1; 831 + break; 832 + } 833 + out: 834 + spin_unlock_irq(&cinfo->suspend_lock); 835 + return ret; 836 + } 837 + 838 + static int add_new_disk_start(struct mddev *mddev, struct md_rdev *rdev) 839 + { 840 + struct md_cluster_info *cinfo = mddev->cluster_info; 841 + struct cluster_msg cmsg; 842 + int ret = 0; 843 + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); 844 + char *uuid = sb->device_uuid; 845 + 846 + memset(&cmsg, 0, sizeof(cmsg)); 847 + cmsg.type = cpu_to_le32(NEWDISK); 848 + memcpy(cmsg.uuid, uuid, 16); 849 + cmsg.raid_slot = rdev->desc_nr; 850 + lock_comm(cinfo); 851 + ret = __sendmsg(cinfo, &cmsg); 852 + if (ret) 853 + return ret; 854 + cinfo->no_new_dev_lockres->flags |= DLM_LKF_NOQUEUE; 855 + ret = dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_EX); 856 + cinfo->no_new_dev_lockres->flags &= ~DLM_LKF_NOQUEUE; 857 + /* Some node does not "see" the device */ 858 + if (ret == -EAGAIN) 859 + ret = -ENOENT; 860 + else 861 + dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR); 862 + return ret; 863 + } 864 + 865 + static int add_new_disk_finish(struct mddev *mddev) 866 + { 867 + struct cluster_msg cmsg; 868 + struct md_cluster_info *cinfo = mddev->cluster_info; 869 + int ret; 870 + /* Write sb and inform others */ 871 + md_update_sb(mddev, 1); 872 + cmsg.type = METADATA_UPDATED; 873 + ret = __sendmsg(cinfo, &cmsg); 874 + unlock_comm(cinfo); 875 + return ret; 876 + } 
877 + 878 + static int new_disk_ack(struct mddev *mddev, bool ack) 879 + { 880 + struct md_cluster_info *cinfo = mddev->cluster_info; 881 + 882 + if (!test_bit(MD_CLUSTER_WAITING_FOR_NEWDISK, &cinfo->state)) { 883 + pr_warn("md-cluster(%s): Spurious cluster confirmation\n", mdname(mddev)); 884 + return -EINVAL; 885 + } 886 + 887 + if (ack) 888 + dlm_unlock_sync(cinfo->no_new_dev_lockres); 889 + complete(&cinfo->newdisk_completion); 890 + return 0; 891 + } 892 + 893 + static int remove_disk(struct mddev *mddev, struct md_rdev *rdev) 894 + { 895 + struct cluster_msg cmsg; 896 + struct md_cluster_info *cinfo = mddev->cluster_info; 897 + cmsg.type = REMOVE; 898 + cmsg.raid_slot = rdev->desc_nr; 899 + return __sendmsg(cinfo, &cmsg); 900 + } 901 + 902 + static int gather_bitmaps(struct md_rdev *rdev) 903 + { 904 + int sn, err; 905 + sector_t lo, hi; 906 + struct cluster_msg cmsg; 907 + struct mddev *mddev = rdev->mddev; 908 + struct md_cluster_info *cinfo = mddev->cluster_info; 909 + 910 + cmsg.type = RE_ADD; 911 + cmsg.raid_slot = rdev->desc_nr; 912 + err = sendmsg(cinfo, &cmsg); 913 + if (err) 914 + goto out; 915 + 916 + for (sn = 0; sn < mddev->bitmap_info.nodes; sn++) { 917 + if (sn == (cinfo->slot_number - 1)) 918 + continue; 919 + err = bitmap_copy_from_slot(mddev, sn, &lo, &hi, false); 920 + if (err) { 921 + pr_warn("md-cluster: Could not gather bitmaps from slot %d", sn); 922 + goto out; 923 + } 924 + if ((hi > 0) && (lo < mddev->recovery_cp)) 925 + mddev->recovery_cp = lo; 926 + } 927 + out: 928 + return err; 929 + } 930 + 931 + static struct md_cluster_operations cluster_ops = { 932 + .join = join, 933 + .leave = leave, 934 + .slot_number = slot_number, 935 + .resync_info_update = resync_info_update, 936 + .resync_start = resync_start, 937 + .resync_finish = resync_finish, 938 + .metadata_update_start = metadata_update_start, 939 + .metadata_update_finish = metadata_update_finish, 940 + .metadata_update_cancel = metadata_update_cancel, 941 + .area_resyncing = 
area_resyncing, 942 + .add_new_disk_start = add_new_disk_start, 943 + .add_new_disk_finish = add_new_disk_finish, 944 + .new_disk_ack = new_disk_ack, 945 + .remove_disk = remove_disk, 946 + .gather_bitmaps = gather_bitmaps, 947 + }; 948 + 949 + static int __init cluster_init(void) 950 + { 951 + pr_warn("md-cluster: EXPERIMENTAL. Use with caution\n"); 952 + pr_info("Registering Cluster MD functions\n"); 953 + register_md_cluster_operations(&cluster_ops, THIS_MODULE); 954 + return 0; 955 + } 956 + 957 + static void cluster_exit(void) 958 + { 959 + unregister_md_cluster_operations(); 960 + } 961 + 962 + module_init(cluster_init); 963 + module_exit(cluster_exit); 964 + MODULE_LICENSE("GPL"); 965 + MODULE_DESCRIPTION("Clustering support for MD");
drivers/md/md-cluster.h
··· 1 + 2 + 3 + #ifndef _MD_CLUSTER_H 4 + #define _MD_CLUSTER_H 5 + 6 + #include "md.h" 7 + 8 + struct mddev; 9 + struct md_rdev; 10 + 11 + struct md_cluster_operations { 12 + int (*join)(struct mddev *mddev, int nodes); 13 + int (*leave)(struct mddev *mddev); 14 + int (*slot_number)(struct mddev *mddev); 15 + void (*resync_info_update)(struct mddev *mddev, sector_t lo, sector_t hi); 16 + int (*resync_start)(struct mddev *mddev, sector_t lo, sector_t hi); 17 + void (*resync_finish)(struct mddev *mddev); 18 + int (*metadata_update_start)(struct mddev *mddev); 19 + int (*metadata_update_finish)(struct mddev *mddev); 20 + int (*metadata_update_cancel)(struct mddev *mddev); 21 + int (*area_resyncing)(struct mddev *mddev, sector_t lo, sector_t hi); 22 + int (*add_new_disk_start)(struct mddev *mddev, struct md_rdev *rdev); 23 + int (*add_new_disk_finish)(struct mddev *mddev); 24 + int (*new_disk_ack)(struct mddev *mddev, bool ack); 25 + int (*remove_disk)(struct mddev *mddev, struct md_rdev *rdev); 26 + int (*gather_bitmaps)(struct md_rdev *rdev); 27 + }; 28 + 29 + #endif /* _MD_CLUSTER_H */
drivers/md/md.c
··· 53 53 #include <linux/slab.h> 54 54 #include "md.h" 55 55 #include "bitmap.h" 56 + #include "md-cluster.h" 56 57 57 58 #ifndef MODULE 58 59 static void autostart_arrays(int part); ··· 66 65 */ 67 66 static LIST_HEAD(pers_list); 68 67 static DEFINE_SPINLOCK(pers_lock); 68 + 69 + struct md_cluster_operations *md_cluster_ops; 70 + EXPORT_SYMBOL(md_cluster_ops); 71 + struct module *md_cluster_mod; 72 + EXPORT_SYMBOL(md_cluster_mod); 69 73 70 74 static DECLARE_WAIT_QUEUE_HEAD(resync_wait); 71 75 static struct workqueue_struct *md_wq; ··· 646 640 } 647 641 EXPORT_SYMBOL_GPL(mddev_unlock); 648 642 649 - static struct md_rdev *find_rdev_nr_rcu(struct mddev *mddev, int nr) 643 + struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr) 650 644 { 651 645 struct md_rdev *rdev; 652 646 ··· 656 650 657 651 return NULL; 658 652 } 653 + EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu); 659 654 660 655 static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev) 661 656 { ··· 2054 2047 int choice = 0; 2055 2048 if (mddev->pers) 2056 2049 choice = mddev->raid_disks; 2057 - while (find_rdev_nr_rcu(mddev, choice)) 2050 + while (md_find_rdev_nr_rcu(mddev, choice)) 2058 2051 choice++; 2059 2052 rdev->desc_nr = choice; 2060 2053 } else { 2061 - if (find_rdev_nr_rcu(mddev, rdev->desc_nr)) { 2054 + if (md_find_rdev_nr_rcu(mddev, rdev->desc_nr)) { 2062 2055 rcu_read_unlock(); 2063 2056 return -EBUSY; 2064 2057 } ··· 2173 2166 kobject_put(&rdev->kobj); 2174 2167 } 2175 2168 2176 - static void kick_rdev_from_array(struct md_rdev *rdev) 2169 + void md_kick_rdev_from_array(struct md_rdev *rdev) 2177 2170 { 2178 2171 unbind_rdev_from_array(rdev); 2179 2172 export_rdev(rdev); 2180 2173 } 2174 + EXPORT_SYMBOL_GPL(md_kick_rdev_from_array); 2181 2175 2182 2176 static void export_array(struct mddev *mddev) 2183 2177 { ··· 2187 2179 while (!list_empty(&mddev->disks)) { 2188 2180 rdev = list_first_entry(&mddev->disks, struct md_rdev, 2189 2181 same_set); 2190 - kick_rdev_from_array(rdev); 2182 + 
md_kick_rdev_from_array(rdev); 2191 2183 } 2192 2184 mddev->raid_disks = 0; 2193 2185 mddev->major_version = 0; ··· 2216 2208 } 2217 2209 } 2218 2210 2219 - static void md_update_sb(struct mddev *mddev, int force_change) 2211 + void md_update_sb(struct mddev *mddev, int force_change) 2220 2212 { 2221 2213 struct md_rdev *rdev; 2222 2214 int sync_req; ··· 2377 2369 wake_up(&rdev->blocked_wait); 2378 2370 } 2379 2371 } 2372 + EXPORT_SYMBOL(md_update_sb); 2373 + 2374 + static int add_bound_rdev(struct md_rdev *rdev) 2375 + { 2376 + struct mddev *mddev = rdev->mddev; 2377 + int err = 0; 2378 + 2379 + if (!mddev->pers->hot_remove_disk) { 2380 + /* If there is hot_add_disk but no hot_remove_disk 2381 + * then added disks for geometry changes, 2382 + * and should be added immediately. 2383 + */ 2384 + super_types[mddev->major_version]. 2385 + validate_super(mddev, rdev); 2386 + err = mddev->pers->hot_add_disk(mddev, rdev); 2387 + if (err) { 2388 + unbind_rdev_from_array(rdev); 2389 + export_rdev(rdev); 2390 + return err; 2391 + } 2392 + } 2393 + sysfs_notify_dirent_safe(rdev->sysfs_state); 2394 + 2395 + set_bit(MD_CHANGE_DEVS, &mddev->flags); 2396 + if (mddev->degraded) 2397 + set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 2398 + set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 2399 + md_new_event(mddev); 2400 + md_wakeup_thread(mddev->thread); 2401 + return 0; 2402 + } 2380 2403 2381 2404 /* words written to sysfs files may, or may not, be \n terminated. 2382 2405 * We want to accept with case. For this we use cmd_match. 
··· 2510 2471 err = -EBUSY; 2511 2472 else { 2512 2473 struct mddev *mddev = rdev->mddev; 2513 - kick_rdev_from_array(rdev); 2474 + if (mddev_is_clustered(mddev)) 2475 + md_cluster_ops->remove_disk(mddev, rdev); 2476 + md_kick_rdev_from_array(rdev); 2477 + if (mddev_is_clustered(mddev)) 2478 + md_cluster_ops->metadata_update_start(mddev); 2514 2479 if (mddev->pers) 2515 2480 md_update_sb(mddev, 1); 2516 2481 md_new_event(mddev); 2482 + if (mddev_is_clustered(mddev)) 2483 + md_cluster_ops->metadata_update_finish(mddev); 2517 2484 err = 0; 2518 2485 } 2519 2486 } else if (cmd_match(buf, "writemostly")) { ··· 2598 2553 clear_bit(Replacement, &rdev->flags); 2599 2554 err = 0; 2600 2555 } 2556 + } else if (cmd_match(buf, "re-add")) { 2557 + if (test_bit(Faulty, &rdev->flags) && (rdev->raid_disk == -1)) { 2558 + /* clear_bit is performed _after_ all the devices 2559 + * have their local Faulty bit cleared. If any writes 2560 + * happen in the meantime in the local node, they 2561 + * will land in the local bitmap, which will be synced 2562 + * by this node eventually 2563 + */ 2564 + if (!mddev_is_clustered(rdev->mddev) || 2565 + (err = md_cluster_ops->gather_bitmaps(rdev)) == 0) { 2566 + clear_bit(Faulty, &rdev->flags); 2567 + err = add_bound_rdev(rdev); 2568 + } 2569 + } else 2570 + err = -EBUSY; 2601 2571 } 2602 2572 if (!err) 2603 2573 sysfs_notify_dirent_safe(rdev->sysfs_state); ··· 3187 3127 "md: fatal superblock inconsistency in %s" 3188 3128 " -- removing from array\n", 3189 3129 bdevname(rdev->bdev,b)); 3190 - kick_rdev_from_array(rdev); 3130 + md_kick_rdev_from_array(rdev); 3191 3131 } 3192 3132 3193 3133 super_types[mddev->major_version]. 
··· 3202 3142 "md: %s: %s: only %d devices permitted\n", 3203 3143 mdname(mddev), bdevname(rdev->bdev, b), 3204 3144 mddev->max_disks); 3205 - kick_rdev_from_array(rdev); 3145 + md_kick_rdev_from_array(rdev); 3206 3146 continue; 3207 3147 } 3208 - if (rdev != freshest) 3148 + if (rdev != freshest) { 3209 3149 if (super_types[mddev->major_version]. 3210 3150 validate_super(mddev, rdev)) { 3211 3151 printk(KERN_WARNING "md: kicking non-fresh %s" 3212 3152 " from array!\n", 3213 3153 bdevname(rdev->bdev,b)); 3214 - kick_rdev_from_array(rdev); 3154 + md_kick_rdev_from_array(rdev); 3215 3155 continue; 3216 3156 } 3157 + /* No device should have a Candidate flag 3158 + * when reading devices 3159 + */ 3160 + if (test_bit(Candidate, &rdev->flags)) { 3161 + pr_info("md: kicking Cluster Candidate %s from array!\n", 3162 + bdevname(rdev->bdev, b)); 3163 + md_kick_rdev_from_array(rdev); 3164 + } 3165 + } 3217 3166 if (mddev->level == LEVEL_MULTIPATH) { 3218 3167 rdev->desc_nr = i++; 3219 3168 rdev->raid_disk = rdev->desc_nr; ··· 4077 4008 if (err) 4078 4009 return err; 4079 4010 if (mddev->pers) { 4011 + if (mddev_is_clustered(mddev)) 4012 + md_cluster_ops->metadata_update_start(mddev); 4080 4013 err = update_size(mddev, sectors); 4081 4014 md_update_sb(mddev, 1); 4015 + if (mddev_is_clustered(mddev)) 4016 + md_cluster_ops->metadata_update_finish(mddev); 4082 4017 } else { 4083 4018 if (mddev->dev_sectors == 0 || 4084 4019 mddev->dev_sectors > sectors) ··· 4427 4354 { 4428 4355 unsigned long long min; 4429 4356 int err; 4430 - int chunk; 4431 4357 4432 4358 if (kstrtoull(buf, 10, &min)) 4433 4359 return -EINVAL; ··· 4440 4368 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 4441 4369 goto out_unlock; 4442 4370 4443 - /* Must be a multiple of chunk_size */ 4444 - chunk = mddev->chunk_sectors; 4445 - if (chunk) { 4446 - sector_t temp = min; 4447 - 4448 - err = -EINVAL; 4449 - if (sector_div(temp, chunk)) 4450 - goto out_unlock; 4451 - } 4452 - mddev->resync_min = min; 4371 
+ /* Round down to multiple of 4K for safety */ 4372 + mddev->resync_min = round_down(min, 8); 4453 4373 err = 0; 4454 4374 4455 4375 out_unlock: ··· 5141 5077 } 5142 5078 if (err == 0 && pers->sync_request && 5143 5079 (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { 5144 - err = bitmap_create(mddev); 5145 - if (err) 5080 + struct bitmap *bitmap; 5081 + 5082 + bitmap = bitmap_create(mddev, -1); 5083 + if (IS_ERR(bitmap)) { 5084 + err = PTR_ERR(bitmap); 5146 5085 printk(KERN_ERR "%s: failed to create bitmap (%d)\n", 5147 5086 mdname(mddev), err); 5087 + } else 5088 + mddev->bitmap = bitmap; 5089 + 5148 5090 } 5149 5091 if (err) { 5150 5092 mddev_detach(mddev); ··· 5302 5232 5303 5233 static void __md_stop_writes(struct mddev *mddev) 5304 5234 { 5235 + if (mddev_is_clustered(mddev)) 5236 + md_cluster_ops->metadata_update_start(mddev); 5305 5237 set_bit(MD_RECOVERY_FROZEN, &mddev->recovery); 5306 5238 flush_workqueue(md_misc_wq); 5307 5239 if (mddev->sync_thread) { ··· 5322 5250 mddev->in_sync = 1; 5323 5251 md_update_sb(mddev, 1); 5324 5252 } 5253 + if (mddev_is_clustered(mddev)) 5254 + md_cluster_ops->metadata_update_finish(mddev); 5325 5255 } 5326 5256 5327 5257 void md_stop_writes(struct mddev *mddev) ··· 5710 5636 info.state = (1<<MD_SB_CLEAN); 5711 5637 if (mddev->bitmap && mddev->bitmap_info.offset) 5712 5638 info.state |= (1<<MD_SB_BITMAP_PRESENT); 5639 + if (mddev_is_clustered(mddev)) 5640 + info.state |= (1<<MD_SB_CLUSTERED); 5713 5641 info.active_disks = insync; 5714 5642 info.working_disks = working; 5715 5643 info.failed_disks = failed; ··· 5767 5691 return -EFAULT; 5768 5692 5769 5693 rcu_read_lock(); 5770 - rdev = find_rdev_nr_rcu(mddev, info.number); 5694 + rdev = md_find_rdev_nr_rcu(mddev, info.number); 5771 5695 if (rdev) { 5772 5696 info.major = MAJOR(rdev->bdev->bd_dev); 5773 5697 info.minor = MINOR(rdev->bdev->bd_dev); ··· 5799 5723 char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE]; 5800 5724 struct md_rdev *rdev; 5801 5725 dev_t dev = 
MKDEV(info->major,info->minor); 5726 + 5727 + if (mddev_is_clustered(mddev) && 5728 + !(info->state & ((1 << MD_DISK_CLUSTER_ADD) | (1 << MD_DISK_CANDIDATE)))) { 5729 + pr_err("%s: Cannot add to clustered mddev.\n", 5730 + mdname(mddev)); 5731 + return -EINVAL; 5732 + } 5802 5733 5803 5734 if (info->major != MAJOR(dev) || info->minor != MINOR(dev)) 5804 5735 return -EOVERFLOW; ··· 5893 5810 else 5894 5811 clear_bit(WriteMostly, &rdev->flags); 5895 5812 5813 + /* 5814 + * check whether the device shows up in other nodes 5815 + */ 5816 + if (mddev_is_clustered(mddev)) { 5817 + if (info->state & (1 << MD_DISK_CANDIDATE)) { 5818 + /* Through --cluster-confirm */ 5819 + set_bit(Candidate, &rdev->flags); 5820 + err = md_cluster_ops->new_disk_ack(mddev, true); 5821 + if (err) { 5822 + export_rdev(rdev); 5823 + return err; 5824 + } 5825 + } else if (info->state & (1 << MD_DISK_CLUSTER_ADD)) { 5826 + /* --add initiated by this node */ 5827 + err = md_cluster_ops->add_new_disk_start(mddev, rdev); 5828 + if (err) { 5829 + md_cluster_ops->add_new_disk_finish(mddev); 5830 + export_rdev(rdev); 5831 + return err; 5832 + } 5833 + } 5834 + } 5835 + 5896 5836 rdev->raid_disk = -1; 5897 5837 err = bind_rdev_to_array(rdev, mddev); 5898 - if (!err && !mddev->pers->hot_remove_disk) { 5899 - /* If there is hot_add_disk but no hot_remove_disk 5900 - * then added disks for geometry changes, 5901 - * and should be added immediately. 5902 - */ 5903 - super_types[mddev->major_version]. 
5904 - validate_super(mddev, rdev); 5905 - err = mddev->pers->hot_add_disk(mddev, rdev); 5906 - if (err) 5907 - unbind_rdev_from_array(rdev); 5908 - } 5909 5838 if (err) 5910 5839 export_rdev(rdev); 5911 5840 else 5912 - sysfs_notify_dirent_safe(rdev->sysfs_state); 5913 - 5914 - set_bit(MD_CHANGE_DEVS, &mddev->flags); 5915 - if (mddev->degraded) 5916 - set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); 5917 - set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); 5918 - if (!err) 5919 - md_new_event(mddev); 5920 - md_wakeup_thread(mddev->thread); 5841 + err = add_bound_rdev(rdev); 5842 + if (mddev_is_clustered(mddev) && 5843 + (info->state & (1 << MD_DISK_CLUSTER_ADD))) 5844 + md_cluster_ops->add_new_disk_finish(mddev); 5921 5845 return err; 5922 5846 } 5923 5847 ··· 5985 5895 if (!rdev) 5986 5896 return -ENXIO; 5987 5897 5898 + if (mddev_is_clustered(mddev)) 5899 + md_cluster_ops->metadata_update_start(mddev); 5900 + 5988 5901 clear_bit(Blocked, &rdev->flags); 5989 5902 remove_and_add_spares(mddev, rdev); 5990 5903 5991 5904 if (rdev->raid_disk >= 0) 5992 5905 goto busy; 5993 5906 5994 - kick_rdev_from_array(rdev); 5907 + if (mddev_is_clustered(mddev)) 5908 + md_cluster_ops->remove_disk(mddev, rdev); 5909 + 5910 + md_kick_rdev_from_array(rdev); 5995 5911 md_update_sb(mddev, 1); 5996 5912 md_new_event(mddev); 5997 5913 5914 + if (mddev_is_clustered(mddev)) 5915 + md_cluster_ops->metadata_update_finish(mddev); 5916 + 5998 5917 return 0; 5999 5918 busy: 5919 + if (mddev_is_clustered(mddev)) 5920 + md_cluster_ops->metadata_update_cancel(mddev); 6000 5921 printk(KERN_WARNING "md: cannot remove active disk %s from %s ...\n", 6001 5922 bdevname(rdev->bdev,b), mdname(mddev)); 6002 5923 return -EBUSY; ··· 6057 5956 err = -EINVAL; 6058 5957 goto abort_export; 6059 5958 } 5959 + 5960 + if (mddev_is_clustered(mddev)) 5961 + md_cluster_ops->metadata_update_start(mddev); 6060 5962 clear_bit(In_sync, &rdev->flags); 6061 5963 rdev->desc_nr = -1; 6062 5964 rdev->saved_raid_disk = -1; 6063 
5965 err = bind_rdev_to_array(rdev, mddev); 6064 5966 if (err) 6065 - goto abort_export; 5967 + goto abort_clustered; 6066 5968 6067 5969 /* 6068 5970 * The rest should better be atomic, we can have disk failures ··· 6076 5972 6077 5973 md_update_sb(mddev, 1); 6078 5974 5975 + if (mddev_is_clustered(mddev)) 5976 + md_cluster_ops->metadata_update_finish(mddev); 6079 5977 /* 6080 5978 * Kick recovery, maybe this spare has to be added to the 6081 5979 * array immediately. ··· 6087 5981 md_new_event(mddev); 6088 5982 return 0; 6089 5983 5984 + abort_clustered: 5985 + if (mddev_is_clustered(mddev)) 5986 + md_cluster_ops->metadata_update_cancel(mddev); 6090 5987 abort_export: 6091 5988 export_rdev(rdev); 6092 5989 return err; ··· 6147 6038 if (mddev->pers) { 6148 6039 mddev->pers->quiesce(mddev, 1); 6149 6040 if (fd >= 0) { 6150 - err = bitmap_create(mddev); 6151 - if (!err) 6041 + struct bitmap *bitmap; 6042 + 6043 + bitmap = bitmap_create(mddev, -1); 6044 + if (!IS_ERR(bitmap)) { 6045 + mddev->bitmap = bitmap; 6152 6046 err = bitmap_load(mddev); 6047 + } else 6048 + err = PTR_ERR(bitmap); 6153 6049 } 6154 6050 if (fd < 0 || err) { 6155 6051 bitmap_destroy(mddev); ··· 6407 6293 return rv; 6408 6294 } 6409 6295 } 6296 + if (mddev_is_clustered(mddev)) 6297 + md_cluster_ops->metadata_update_start(mddev); 6410 6298 if (info->size >= 0 && mddev->dev_sectors / 2 != info->size) 6411 6299 rv = update_size(mddev, (sector_t)info->size * 2); 6412 6300 ··· 6416 6300 rv = update_raid_disks(mddev, info->raid_disks); 6417 6301 6418 6302 if ((state ^ info->state) & (1<<MD_SB_BITMAP_PRESENT)) { 6419 - if (mddev->pers->quiesce == NULL || mddev->thread == NULL) 6420 - return -EINVAL; 6421 - if (mddev->recovery || mddev->sync_thread) 6422 - return -EBUSY; 6303 + if (mddev->pers->quiesce == NULL || mddev->thread == NULL) { 6304 + rv = -EINVAL; 6305 + goto err; 6306 + } 6307 + if (mddev->recovery || mddev->sync_thread) { 6308 + rv = -EBUSY; 6309 + goto err; 6310 + } 6423 6311 if (info->state 
& (1<<MD_SB_BITMAP_PRESENT)) { 6312 + struct bitmap *bitmap; 6424 6313 /* add the bitmap */ 6425 - if (mddev->bitmap) 6426 - return -EEXIST; 6427 - if (mddev->bitmap_info.default_offset == 0) 6428 - return -EINVAL; 6314 + if (mddev->bitmap) { 6315 + rv = -EEXIST; 6316 + goto err; 6317 + } 6318 + if (mddev->bitmap_info.default_offset == 0) { 6319 + rv = -EINVAL; 6320 + goto err; 6321 + } 6429 6322 mddev->bitmap_info.offset = 6430 6323 mddev->bitmap_info.default_offset; 6431 6324 mddev->bitmap_info.space = 6432 6325 mddev->bitmap_info.default_space; 6433 6326 mddev->pers->quiesce(mddev, 1); 6434 - rv = bitmap_create(mddev); 6435 - if (!rv) 6327 + bitmap = bitmap_create(mddev, -1); 6328 + if (!IS_ERR(bitmap)) { 6329 + mddev->bitmap = bitmap; 6436 6330 rv = bitmap_load(mddev); 6331 + } else 6332 + rv = PTR_ERR(bitmap); 6437 6333 if (rv) 6438 6334 bitmap_destroy(mddev); 6439 6335 mddev->pers->quiesce(mddev, 0); 6440 6336 } else { 6441 6337 /* remove the bitmap */ 6442 - if (!mddev->bitmap) 6443 - return -ENOENT; 6444 - if (mddev->bitmap->storage.file) 6445 - return -EINVAL; 6338 + if (!mddev->bitmap) { 6339 + rv = -ENOENT; 6340 + goto err; 6341 + } 6342 + if (mddev->bitmap->storage.file) { 6343 + rv = -EINVAL; 6344 + goto err; 6345 + } 6446 6346 mddev->pers->quiesce(mddev, 1); 6447 6347 bitmap_destroy(mddev); 6448 6348 mddev->pers->quiesce(mddev, 0); ··· 6466 6334 } 6467 6335 } 6468 6336 md_update_sb(mddev, 1); 6337 + if (mddev_is_clustered(mddev)) 6338 + md_cluster_ops->metadata_update_finish(mddev); 6339 + return rv; 6340 + err: 6341 + if (mddev_is_clustered(mddev)) 6342 + md_cluster_ops->metadata_update_cancel(mddev); 6469 6343 return rv; 6470 6344 } 6471 6345 ··· 6531 6393 case SET_DISK_FAULTY: 6532 6394 case STOP_ARRAY: 6533 6395 case STOP_ARRAY_RO: 6396 + case CLUSTERED_DISK_NACK: 6534 6397 return true; 6535 6398 default: 6536 6399 return false; ··· 6803 6664 err = add_new_disk(mddev, &info); 6804 6665 goto unlock; 6805 6666 } 6667 + 6668 + case 
CLUSTERED_DISK_NACK: 6669 + if (mddev_is_clustered(mddev)) 6670 + md_cluster_ops->new_disk_ack(mddev, false); 6671 + else 6672 + err = -EINVAL; 6673 + goto unlock; 6806 6674 6807 6675 case HOT_ADD_DISK: 6808 6676 err = hot_add_disk(mddev, new_decode_dev(arg)); ··· 7384 7238 } 7385 7239 EXPORT_SYMBOL(unregister_md_personality); 7386 7240 7241 + int register_md_cluster_operations(struct md_cluster_operations *ops, struct module *module) 7242 + { 7243 + if (md_cluster_ops != NULL) 7244 + return -EALREADY; 7245 + spin_lock(&pers_lock); 7246 + md_cluster_ops = ops; 7247 + md_cluster_mod = module; 7248 + spin_unlock(&pers_lock); 7249 + return 0; 7250 + } 7251 + EXPORT_SYMBOL(register_md_cluster_operations); 7252 + 7253 + int unregister_md_cluster_operations(void) 7254 + { 7255 + spin_lock(&pers_lock); 7256 + md_cluster_ops = NULL; 7257 + spin_unlock(&pers_lock); 7258 + return 0; 7259 + } 7260 + EXPORT_SYMBOL(unregister_md_cluster_operations); 7261 + 7262 + int md_setup_cluster(struct mddev *mddev, int nodes) 7263 + { 7264 + int err; 7265 + 7266 + err = request_module("md-cluster"); 7267 + if (err) { 7268 + pr_err("md-cluster module not found.\n"); 7269 + return err; 7270 + } 7271 + 7272 + spin_lock(&pers_lock); 7273 + if (!md_cluster_ops || !try_module_get(md_cluster_mod)) { 7274 + spin_unlock(&pers_lock); 7275 + return -ENOENT; 7276 + } 7277 + spin_unlock(&pers_lock); 7278 + 7279 + return md_cluster_ops->join(mddev, nodes); 7280 + } 7281 + 7282 + void md_cluster_stop(struct mddev *mddev) 7283 + { 7284 + if (!md_cluster_ops) 7285 + return; 7286 + md_cluster_ops->leave(mddev); 7287 + module_put(md_cluster_mod); 7288 + } 7289 + 7387 7290 static int is_mddev_idle(struct mddev *mddev, int init) 7388 7291 { 7389 7292 struct md_rdev *rdev; ··· 7570 7375 mddev->safemode == 0) 7571 7376 mddev->safemode = 1; 7572 7377 spin_unlock(&mddev->lock); 7378 + if (mddev_is_clustered(mddev)) 7379 + md_cluster_ops->metadata_update_start(mddev); 7573 7380 md_update_sb(mddev, 0); 7381 + if 
(mddev_is_clustered(mddev)) 7382 + md_cluster_ops->metadata_update_finish(mddev); 7574 7383 sysfs_notify_dirent_safe(mddev->sysfs_state); 7575 7384 } else 7576 7385 spin_unlock(&mddev->lock); ··· 7775 7576 md_new_event(mddev); 7776 7577 update_time = jiffies; 7777 7578 7579 + if (mddev_is_clustered(mddev)) 7580 + md_cluster_ops->resync_start(mddev, j, max_sectors); 7581 + 7778 7582 blk_start_plug(&plug); 7779 7583 while (j < max_sectors) { 7780 7584 sector_t sectors; ··· 7820 7618 if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) 7821 7619 break; 7822 7620 7823 - sectors = mddev->pers->sync_request(mddev, j, &skipped, 7824 - currspeed < speed_min(mddev)); 7621 + sectors = mddev->pers->sync_request(mddev, j, &skipped); 7825 7622 if (sectors == 0) { 7826 7623 set_bit(MD_RECOVERY_INTR, &mddev->recovery); 7827 7624 break; ··· 7837 7636 j += sectors; 7838 7637 if (j > 2) 7839 7638 mddev->curr_resync = j; 7639 + if (mddev_is_clustered(mddev)) 7640 + md_cluster_ops->resync_info_update(mddev, j, max_sectors); 7840 7641 mddev->curr_mark_cnt = io_sectors; 7841 7642 if (last_check == 0) 7842 7643 /* this is the earliest that rebuild will be ··· 7880 7677 /((jiffies-mddev->resync_mark)/HZ +1) +1; 7881 7678 7882 7679 if (currspeed > speed_min(mddev)) { 7883 - if ((currspeed > speed_max(mddev)) || 7884 - !is_mddev_idle(mddev, 0)) { 7680 + if (currspeed > speed_max(mddev)) { 7885 7681 msleep(500); 7886 7682 goto repeat; 7683 + } 7684 + if (!is_mddev_idle(mddev, 0)) { 7685 + /* 7686 + * Give other IO more of a chance. 7687 + * The faster the devices, the less we wait. 
7688 + */ 7689 + wait_event(mddev->recovery_wait, 7690 + !atomic_read(&mddev->recovery_active)); 7887 7691 } 7888 7692 } 7889 7693 } ··· 7904 7694 wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); 7905 7695 7906 7696 /* tell personality that we are finished */ 7907 - mddev->pers->sync_request(mddev, max_sectors, &skipped, 1); 7697 + mddev->pers->sync_request(mddev, max_sectors, &skipped); 7698 + 7699 + if (mddev_is_clustered(mddev)) 7700 + md_cluster_ops->resync_finish(mddev); 7908 7701 7909 7702 if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) && 7910 7703 mddev->curr_resync > 2) { ··· 8138 7925 sysfs_notify_dirent_safe(mddev->sysfs_state); 8139 7926 } 8140 7927 8141 - if (mddev->flags & MD_UPDATE_SB_FLAGS) 7928 + if (mddev->flags & MD_UPDATE_SB_FLAGS) { 7929 + if (mddev_is_clustered(mddev)) 7930 + md_cluster_ops->metadata_update_start(mddev); 8142 7931 md_update_sb(mddev, 0); 7932 + if (mddev_is_clustered(mddev)) 7933 + md_cluster_ops->metadata_update_finish(mddev); 7934 + } 8143 7935 8144 7936 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) && 8145 7937 !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) { ··· 8242 8024 set_bit(MD_CHANGE_DEVS, &mddev->flags); 8243 8025 } 8244 8026 } 8027 + if (mddev_is_clustered(mddev)) 8028 + md_cluster_ops->metadata_update_start(mddev); 8245 8029 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) && 8246 8030 mddev->pers->finish_reshape) 8247 8031 mddev->pers->finish_reshape(mddev); ··· 8256 8036 rdev->saved_raid_disk = -1; 8257 8037 8258 8038 md_update_sb(mddev, 1); 8039 + if (mddev_is_clustered(mddev)) 8040 + md_cluster_ops->metadata_update_finish(mddev); 8259 8041 clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); 8260 8042 clear_bit(MD_RECOVERY_SYNC, &mddev->recovery); 8261 8043 clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery); ··· 8877 8655 err_wq: 8878 8656 return ret; 8879 8657 } 8658 + 8659 + void md_reload_sb(struct mddev *mddev) 8660 + { 8661 + struct md_rdev *rdev, *tmp; 8662 + 8663 + 
rdev_for_each_safe(rdev, tmp, mddev) { 8664 + rdev->sb_loaded = 0; 8665 + ClearPageUptodate(rdev->sb_page); 8666 + } 8667 + mddev->raid_disks = 0; 8668 + analyze_sbs(mddev); 8669 + rdev_for_each_safe(rdev, tmp, mddev) { 8670 + struct mdp_superblock_1 *sb = page_address(rdev->sb_page); 8671 + /* since we don't write to faulty devices, we figure out if the 8672 + * disk is faulty by comparing events 8673 + */ 8674 + if (mddev->events > sb->events) 8675 + set_bit(Faulty, &rdev->flags); 8676 + } 8677 + 8678 + } 8679 + EXPORT_SYMBOL(md_reload_sb); 8880 8680 8881 8681 #ifndef MODULE 8882 8682
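The resync_min hunk in md.c above replaces the old chunk-multiple check with `round_down(min, 8)`, pinning `resync_min` to a 4K boundary (8 × 512-byte sectors), as its new comment says. For readers unfamiliar with the kernel macro, here is a minimal userspace sketch of its power-of-two case; the helper name is ours, not the kernel symbol:

```c
#include <assert.h>

/* Userspace sketch of the kernel's round_down() for power-of-two
 * alignment: clearing the low bits leaves the nearest lower multiple
 * of n. With n = 8 sectors (512 bytes each) this is the 4K rounding
 * applied to resync_min in the hunk above. Only valid when n is a
 * power of two, as in the kernel macro's intended use. */
static unsigned long long round_down_pow2(unsigned long long x,
					  unsigned long long n)
{
	return x & ~(n - 1ULL);
}
```

The old code rejected values not aligned to the chunk size; the new code silently fixes the alignment instead, which is friendlier to scripts writing arbitrary sector counts.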
+25 -1
drivers/md/md.h
··· 23 23 #include <linux/timer.h> 24 24 #include <linux/wait.h> 25 25 #include <linux/workqueue.h> 26 + #include "md-cluster.h" 26 27 27 28 #define MaxSector (~(sector_t)0) 28 29 ··· 171 170 * a want_replacement device with same 172 171 * raid_disk number. 173 172 */ 173 + Candidate, /* For clustered environments only: 174 + * This device is seen locally but not 175 + * by the whole cluster 176 + */ 174 177 }; 175 178 176 179 #define BB_LEN_MASK (0x00000000000001FFULL) ··· 206 201 extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors, 207 202 int is_new); 208 203 extern void md_ack_all_badblocks(struct badblocks *bb); 204 + 205 + struct md_cluster_info; 209 206 210 207 struct mddev { 211 208 void *private; ··· 437 430 unsigned long daemon_sleep; /* how many jiffies between updates? */ 438 431 unsigned long max_write_behind; /* write-behind mode */ 439 432 int external; 433 + int nodes; /* Maximum number of nodes in the cluster */ 434 + char cluster_name[64]; /* Name of the cluster */ 440 435 } bitmap_info; 441 436 442 437 atomic_t max_corr_read_errors; /* max read retries */ ··· 457 448 struct work_struct flush_work; 458 449 struct work_struct event_work; /* used by dm to report failure event */ 459 450 void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev); 451 + struct md_cluster_info *cluster_info; 460 452 }; 461 453 462 454 static inline int __must_check mddev_lock(struct mddev *mddev) ··· 506 496 int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev); 507 497 int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev); 508 498 int (*spare_active) (struct mddev *mddev); 509 - sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster); 499 + sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int *skipped); 510 500 int (*resize) (struct mddev *mddev, sector_t sectors); 511 501 sector_t (*size) (struct mddev *mddev, sector_t sectors, int raid_disks); 512 502 int 
(*check_reshape) (struct mddev *mddev); ··· 618 608 619 609 extern int register_md_personality(struct md_personality *p); 620 610 extern int unregister_md_personality(struct md_personality *p); 611 + extern int register_md_cluster_operations(struct md_cluster_operations *ops, 612 + struct module *module); 613 + extern int unregister_md_cluster_operations(void); 614 + extern int md_setup_cluster(struct mddev *mddev, int nodes); 615 + extern void md_cluster_stop(struct mddev *mddev); 621 616 extern struct md_thread *md_register_thread( 622 617 void (*run)(struct md_thread *thread), 623 618 struct mddev *mddev, ··· 669 654 struct mddev *mddev); 670 655 671 656 extern void md_unplug(struct blk_plug_cb *cb, bool from_schedule); 657 + extern void md_reload_sb(struct mddev *mddev); 658 + extern void md_update_sb(struct mddev *mddev, int force); 659 + extern void md_kick_rdev_from_array(struct md_rdev * rdev); 660 + struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr); 672 661 static inline int mddev_check_plugged(struct mddev *mddev) 673 662 { 674 663 return !!blk_check_plugged(md_unplug, mddev, ··· 688 669 } 689 670 } 690 671 672 + extern struct md_cluster_operations *md_cluster_ops; 673 + static inline int mddev_is_clustered(struct mddev *mddev) 674 + { 675 + return mddev->cluster_info && mddev->bitmap_info.nodes > 1; 676 + } 691 677 #endif /* _MD_MD_H */
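The `mddev_is_clustered()` inline added at the bottom of md.h above gates every cluster hook in md.c. A standalone sketch of the same predicate (with a reduced stand-in struct, not the real `struct mddev`) shows the two conditions that must both hold:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the mddev_is_clustered() test added to md.h above: an
 * array counts as clustered only when cluster_info has been set up
 * (by md_setup_cluster()) AND the bitmap is configured for more than
 * one node. The struct below keeps just the two fields the inline
 * helper reads. */
struct mddev_sketch {
	void *cluster_info;	/* stands in for mddev->cluster_info */
	int nodes;		/* stands in for bitmap_info.nodes */
};

static int mddev_is_clustered_sketch(const struct mddev_sketch *m)
{
	return m->cluster_info != NULL && m->nodes > 1;
}
```

This is why a single-node array with md-cluster loaded still takes none of the `metadata_update_start()`/`finish()` paths: `nodes > 1` fails even though `cluster_info` may exist.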
+26 -20
drivers/md/raid0.c
··· 271 271 goto abort; 272 272 } 273 273 274 - blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); 275 - blk_queue_io_opt(mddev->queue, 276 - (mddev->chunk_sectors << 9) * mddev->raid_disks); 274 + if (mddev->queue) { 275 + blk_queue_io_min(mddev->queue, mddev->chunk_sectors << 9); 276 + blk_queue_io_opt(mddev->queue, 277 + (mddev->chunk_sectors << 9) * mddev->raid_disks); 277 278 278 - if (!discard_supported) 279 - queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 280 - else 281 - queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 279 + if (!discard_supported) 280 + queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 281 + else 282 + queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, mddev->queue); 283 + } 282 284 283 285 pr_debug("md/raid0:%s: done.\n", mdname(mddev)); 284 286 *private_conf = conf; ··· 431 429 } 432 430 if (md_check_no_bitmap(mddev)) 433 431 return -EINVAL; 434 - blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); 435 - blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); 436 - blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); 432 + 433 + if (mddev->queue) { 434 + blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors); 435 + blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors); 436 + blk_queue_max_discard_sectors(mddev->queue, mddev->chunk_sectors); 437 + } 437 438 438 439 /* if private is not null, we are here after takeover */ 439 440 if (mddev->private == NULL) { ··· 453 448 printk(KERN_INFO "md/raid0:%s: md_size is %llu sectors.\n", 454 449 mdname(mddev), 455 450 (unsigned long long)mddev->array_sectors); 456 - /* calculate the max read-ahead size. 457 - * For read-ahead of large files to be effective, we need to 458 - * readahead at least twice a whole stripe. i.e. number of devices 459 - * multiplied by chunk size times 2. 
460 - * If an individual device has an ra_pages greater than the 461 - * chunk size, then we will not drive that device as hard as it 462 - * wants. We consider this a configuration error: a larger 463 - * chunksize should be used in that case. 464 - */ 465 - { 451 + 452 + if (mddev->queue) { 453 + /* calculate the max read-ahead size. 454 + * For read-ahead of large files to be effective, we need to 455 + * readahead at least twice a whole stripe. i.e. number of devices 456 + * multiplied by chunk size times 2. 457 + * If an individual device has an ra_pages greater than the 458 + * chunk size, then we will not drive that device as hard as it 459 + * wants. We consider this a configuration error: a larger 460 + * chunksize should be used in that case. 461 + */ 466 462 int stripe = mddev->raid_disks * 467 463 (mddev->chunk_sectors << 9) / PAGE_SIZE; 468 464 if (mddev->queue->backing_dev_info.ra_pages < 2* stripe)
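The read-ahead rule in raid0_run() above is unchanged by this commit, but is now guarded by `if (mddev->queue)` so queue-less stacks (dm-raid) skip it. The arithmetic it describes — read ahead at least two full stripes — can be sketched as follows, assuming 4K pages; the helper name is ours:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4K page size */

/* Sketch of raid0's read-ahead target from the comment above: two
 * whole stripes, i.e. number of devices times chunk bytes, times two,
 * expressed in pages. chunk_sectors are 512-byte sectors, hence the
 * << 9 to get bytes. */
static unsigned long long raid0_min_ra_pages(int raid_disks,
					     unsigned long long chunk_sectors)
{
	unsigned long long stripe =
		raid_disks * (chunk_sectors << 9) / SKETCH_PAGE_SIZE;
	return 2 * stripe;
}
```

E.g. four devices with 512K chunks give a 2M stripe, so the driver wants at least 1024 pages of read-ahead before sequential reads keep every spindle busy.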
+17 -12
drivers/md/raid1.c
··· 539 539 has_nonrot_disk = 0; 540 540 choose_next_idle = 0; 541 541 542 - choose_first = (conf->mddev->recovery_cp < this_sector + sectors); 542 + if ((conf->mddev->recovery_cp < this_sector + sectors) || 543 + (mddev_is_clustered(conf->mddev) && 544 + md_cluster_ops->area_resyncing(conf->mddev, this_sector, 545 + this_sector + sectors))) 546 + choose_first = 1; 547 + else 548 + choose_first = 0; 543 549 544 550 for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) { 545 551 sector_t dist; ··· 1108 1102 md_write_start(mddev, bio); /* wait on superblock update early */ 1109 1103 1110 1104 if (bio_data_dir(bio) == WRITE && 1111 - bio_end_sector(bio) > mddev->suspend_lo && 1112 - bio->bi_iter.bi_sector < mddev->suspend_hi) { 1105 + ((bio_end_sector(bio) > mddev->suspend_lo && 1106 + bio->bi_iter.bi_sector < mddev->suspend_hi) || 1107 + (mddev_is_clustered(mddev) && 1108 + md_cluster_ops->area_resyncing(mddev, bio->bi_iter.bi_sector, bio_end_sector(bio))))) { 1113 1109 /* As the suspend_* range is controlled by 1114 1110 * userspace, we want an interruptible 1115 1111 * wait. ··· 1122 1114 prepare_to_wait(&conf->wait_barrier, 1123 1115 &w, TASK_INTERRUPTIBLE); 1124 1116 if (bio_end_sector(bio) <= mddev->suspend_lo || 1125 - bio->bi_iter.bi_sector >= mddev->suspend_hi) 1117 + bio->bi_iter.bi_sector >= mddev->suspend_hi || 1118 + (mddev_is_clustered(mddev) && 1119 + !md_cluster_ops->area_resyncing(mddev, 1120 + bio->bi_iter.bi_sector, bio_end_sector(bio)))) 1126 1121 break; 1127 1122 schedule(); 1128 1123 } ··· 1572 1561 struct md_rdev *rdev = conf->mirrors[i].rdev; 1573 1562 struct md_rdev *repl = conf->mirrors[conf->raid_disks + i].rdev; 1574 1563 if (repl 1564 + && !test_bit(Candidate, &repl->flags) 1575 1565 && repl->recovery_offset == MaxSector 1576 1566 && !test_bit(Faulty, &repl->flags) 1577 1567 && !test_and_set_bit(In_sync, &repl->flags)) { ··· 2480 2468 * that can be installed to exclude normal IO requests. 
2481 2469 */ 2482 2470 2483 - static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) 2471 + static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) 2484 2472 { 2485 2473 struct r1conf *conf = mddev->private; 2486 2474 struct r1bio *r1_bio; ··· 2533 2521 *skipped = 1; 2534 2522 return sync_blocks; 2535 2523 } 2536 - /* 2537 - * If there is non-resync activity waiting for a turn, 2538 - * and resync is going fast enough, 2539 - * then let it though before starting on this new sync request. 2540 - */ 2541 - if (!go_faster && conf->nr_waiting) 2542 - msleep_interruptible(1000); 2543 2524 2544 2525 bitmap_cond_end_sync(mddev->bitmap, sector_nr); 2545 2526 r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
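The widened `choose_first` test in raid1's read_balance() above forces reads to the first in-sync disk not only behind the local resync frontier but also, on a clustered array, when another node reports the area as resyncing. A condensed sketch of that decision, with `area_resyncing` as a precomputed flag standing in for the `md_cluster_ops->area_resyncing()` call:

```c
#include <assert.h>

/* Sketch of the extended choose_first decision in read_balance()
 * above: prefer the first disk when the read range overlaps the
 * local resync frontier (recovery_cp), or when the cluster layer
 * says a peer node is resyncing that area. The boolean parameters
 * replace the mddev_is_clustered()/area_resyncing() calls of the
 * real code. */
static int choose_first_sketch(unsigned long long recovery_cp,
			       unsigned long long this_sector,
			       unsigned long long sectors,
			       int clustered, int area_resyncing)
{
	if (recovery_cp < this_sector + sectors)
		return 1;
	return clustered && area_resyncing;
}
```

The same pattern repeats in make_request() above, where writes into a peer's resyncing window are made to wait just like writes into the local suspend_lo/suspend_hi range.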
+1 -7
drivers/md/raid10.c
··· 2889 2889 */
2890 2890
2891 2891 static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
2892 - int *skipped, int go_faster)
2892 + int *skipped)
2893 2893 {
2894 2894 struct r10conf *conf = mddev->private;
2895 2895 struct r10bio *r10_bio;
··· 2994 2994 if (conf->geo.near_copies < conf->geo.raid_disks &&
2995 2995 max_sector > (sector_nr | chunk_mask))
2996 2996 max_sector = (sector_nr | chunk_mask) + 1;
2997 - /*
2998 - * If there is non-resync activity waiting for us then
2999 - * put in a delay to throttle resync.
3000 - */
3001 - if (!go_faster && conf->nr_waiting)
3002 - msleep_interruptible(1000);
3003 2997
3004 2998 /* Again, very different code for resync and recovery.
3005 2999 * Both must result in an r10bio with a list of bios that
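With `go_faster` removed from sync_request() here and in raid1.c above, throttling decisions live entirely in md_do_sync(), which estimates the current resync speed from the last mark. The divisor `(jiffies - mddev->resync_mark)/HZ + 1` is visible in the md.c context above; the numerator is elided there, so this sketch assumes the usual md_do_sync() form and takes elapsed time already converted from jiffies to seconds:

```c
#include <assert.h>

/* Sketch of md_do_sync()'s resync speed estimate: sectors completed
 * since the last mark, halved to KiB, divided by elapsed seconds.
 * The +1 terms guard against division by zero and a reported speed
 * of zero. Assumed form; the kernel works in jiffies and HZ. */
static unsigned long currspeed_kib_per_sec(unsigned long long io_sectors,
					   unsigned long long mark_cnt,
					   unsigned long elapsed_sec)
{
	return (unsigned long)((io_sectors - mark_cnt) / 2)
		/ (elapsed_sec + 1) + 1;
}
```

This is the value compared against speed_min()/speed_max() above: per the merge summary, resync above the minimum now backs off by waiting for in-flight recovery IO to drain rather than by a fixed one-second sleep, so faster devices wait less.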
+690 -134
drivers/md/raid5.c
··· 54 54 #include <linux/slab.h> 55 55 #include <linux/ratelimit.h> 56 56 #include <linux/nodemask.h> 57 + #include <linux/flex_array.h> 57 58 #include <trace/events/block.h> 58 59 59 60 #include "md.h" ··· 497 496 } 498 497 } 499 498 500 - static int grow_buffers(struct stripe_head *sh) 499 + static int grow_buffers(struct stripe_head *sh, gfp_t gfp) 501 500 { 502 501 int i; 503 502 int num = sh->raid_conf->pool_size; ··· 505 504 for (i = 0; i < num; i++) { 506 505 struct page *page; 507 506 508 - if (!(page = alloc_page(GFP_KERNEL))) { 507 + if (!(page = alloc_page(gfp))) { 509 508 return 1; 510 509 } 511 510 sh->dev[i].page = page; ··· 526 525 BUG_ON(atomic_read(&sh->count) != 0); 527 526 BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); 528 527 BUG_ON(stripe_operations_active(sh)); 528 + BUG_ON(sh->batch_head); 529 529 530 530 pr_debug("init_stripe called, stripe %llu\n", 531 531 (unsigned long long)sector); ··· 554 552 } 555 553 if (read_seqcount_retry(&conf->gen_lock, seq)) 556 554 goto retry; 555 + sh->overwrite_disks = 0; 557 556 insert_hash(conf, sh); 558 557 sh->cpu = smp_processor_id(); 558 + set_bit(STRIPE_BATCH_READY, &sh->state); 559 559 } 560 560 561 561 static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector, ··· 672 668 *(conf->hash_locks + hash)); 673 669 sh = __find_stripe(conf, sector, conf->generation - previous); 674 670 if (!sh) { 675 - if (!conf->inactive_blocked) 671 + if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) { 676 672 sh = get_free_stripe(conf, hash); 673 + if (!sh && llist_empty(&conf->released_stripes) && 674 + !test_bit(R5_DID_ALLOC, &conf->cache_state)) 675 + set_bit(R5_ALLOC_MORE, 676 + &conf->cache_state); 677 + } 677 678 if (noblock && sh == NULL) 678 679 break; 679 680 if (!sh) { 680 - conf->inactive_blocked = 1; 681 + set_bit(R5_INACTIVE_BLOCKED, 682 + &conf->cache_state); 681 683 wait_event_lock_irq( 682 684 conf->wait_for_stripe, 683 685 !list_empty(conf->inactive_list + hash) && 684 686 
(atomic_read(&conf->active_stripes) 685 687 < (conf->max_nr_stripes * 3 / 4) 686 - || !conf->inactive_blocked), 688 + || !test_bit(R5_INACTIVE_BLOCKED, 689 + &conf->cache_state)), 687 690 *(conf->hash_locks + hash)); 688 - conf->inactive_blocked = 0; 691 + clear_bit(R5_INACTIVE_BLOCKED, 692 + &conf->cache_state); 689 693 } else { 690 694 init_stripe(sh, sector, previous); 691 695 atomic_inc(&sh->count); ··· 718 706 719 707 spin_unlock_irq(conf->hash_locks + hash); 720 708 return sh; 709 + } 710 + 711 + static bool is_full_stripe_write(struct stripe_head *sh) 712 + { 713 + BUG_ON(sh->overwrite_disks > (sh->disks - sh->raid_conf->max_degraded)); 714 + return sh->overwrite_disks == (sh->disks - sh->raid_conf->max_degraded); 715 + } 716 + 717 + static void lock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) 718 + { 719 + local_irq_disable(); 720 + if (sh1 > sh2) { 721 + spin_lock(&sh2->stripe_lock); 722 + spin_lock_nested(&sh1->stripe_lock, 1); 723 + } else { 724 + spin_lock(&sh1->stripe_lock); 725 + spin_lock_nested(&sh2->stripe_lock, 1); 726 + } 727 + } 728 + 729 + static void unlock_two_stripes(struct stripe_head *sh1, struct stripe_head *sh2) 730 + { 731 + spin_unlock(&sh1->stripe_lock); 732 + spin_unlock(&sh2->stripe_lock); 733 + local_irq_enable(); 734 + } 735 + 736 + /* Only freshly new full stripe normal write stripe can be added to a batch list */ 737 + static bool stripe_can_batch(struct stripe_head *sh) 738 + { 739 + return test_bit(STRIPE_BATCH_READY, &sh->state) && 740 + is_full_stripe_write(sh); 741 + } 742 + 743 + /* we only do back search */ 744 + static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh) 745 + { 746 + struct stripe_head *head; 747 + sector_t head_sector, tmp_sec; 748 + int hash; 749 + int dd_idx; 750 + 751 + if (!stripe_can_batch(sh)) 752 + return; 753 + /* Don't cross chunks, so stripe pd_idx/qd_idx is the same */ 754 + tmp_sec = sh->sector; 755 + if (!sector_div(tmp_sec, conf->chunk_sectors)) 756 
+ return; 757 + head_sector = sh->sector - STRIPE_SECTORS; 758 + 759 + hash = stripe_hash_locks_hash(head_sector); 760 + spin_lock_irq(conf->hash_locks + hash); 761 + head = __find_stripe(conf, head_sector, conf->generation); 762 + if (head && !atomic_inc_not_zero(&head->count)) { 763 + spin_lock(&conf->device_lock); 764 + if (!atomic_read(&head->count)) { 765 + if (!test_bit(STRIPE_HANDLE, &head->state)) 766 + atomic_inc(&conf->active_stripes); 767 + BUG_ON(list_empty(&head->lru) && 768 + !test_bit(STRIPE_EXPANDING, &head->state)); 769 + list_del_init(&head->lru); 770 + if (head->group) { 771 + head->group->stripes_cnt--; 772 + head->group = NULL; 773 + } 774 + } 775 + atomic_inc(&head->count); 776 + spin_unlock(&conf->device_lock); 777 + } 778 + spin_unlock_irq(conf->hash_locks + hash); 779 + 780 + if (!head) 781 + return; 782 + if (!stripe_can_batch(head)) 783 + goto out; 784 + 785 + lock_two_stripes(head, sh); 786 + /* clear_batch_ready clear the flag */ 787 + if (!stripe_can_batch(head) || !stripe_can_batch(sh)) 788 + goto unlock_out; 789 + 790 + if (sh->batch_head) 791 + goto unlock_out; 792 + 793 + dd_idx = 0; 794 + while (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx) 795 + dd_idx++; 796 + if (head->dev[dd_idx].towrite->bi_rw != sh->dev[dd_idx].towrite->bi_rw) 797 + goto unlock_out; 798 + 799 + if (head->batch_head) { 800 + spin_lock(&head->batch_head->batch_lock); 801 + /* This batch list is already running */ 802 + if (!stripe_can_batch(head)) { 803 + spin_unlock(&head->batch_head->batch_lock); 804 + goto unlock_out; 805 + } 806 + 807 + /* 808 + * at this point, head's BATCH_READY could be cleared, but we 809 + * can still add the stripe to batch list 810 + */ 811 + list_add(&sh->batch_list, &head->batch_list); 812 + spin_unlock(&head->batch_head->batch_lock); 813 + 814 + sh->batch_head = head->batch_head; 815 + } else { 816 + head->batch_head = head; 817 + sh->batch_head = head->batch_head; 818 + spin_lock(&head->batch_lock); 819 + 
list_add_tail(&sh->batch_list, &head->batch_list); 820 + spin_unlock(&head->batch_lock); 821 + } 822 + 823 + if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) 824 + if (atomic_dec_return(&conf->preread_active_stripes) 825 + < IO_THRESHOLD) 826 + md_wakeup_thread(conf->mddev->thread); 827 + 828 + atomic_inc(&sh->count); 829 + unlock_out: 830 + unlock_two_stripes(head, sh); 831 + out: 832 + release_stripe(head); 721 833 } 722 834 723 835 /* Determine if 'data_offset' or 'new_data_offset' should be used ··· 874 738 { 875 739 struct r5conf *conf = sh->raid_conf; 876 740 int i, disks = sh->disks; 741 + struct stripe_head *head_sh = sh; 877 742 878 743 might_sleep(); 879 744 ··· 883 746 int replace_only = 0; 884 747 struct bio *bi, *rbi; 885 748 struct md_rdev *rdev, *rrdev = NULL; 749 + 750 + sh = head_sh; 886 751 if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) { 887 752 if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags)) 888 753 rw = WRITE_FUA; ··· 903 764 if (test_and_clear_bit(R5_SyncIO, &sh->dev[i].flags)) 904 765 rw |= REQ_SYNC; 905 766 767 + again: 906 768 bi = &sh->dev[i].req; 907 769 rbi = &sh->dev[i].rreq; /* For writing to replacement */ 908 770 ··· 922 782 /* We raced and saw duplicates */ 923 783 rrdev = NULL; 924 784 } else { 925 - if (test_bit(R5_ReadRepl, &sh->dev[i].flags) && rrdev) 785 + if (test_bit(R5_ReadRepl, &head_sh->dev[i].flags) && rrdev) 926 786 rdev = rrdev; 927 787 rrdev = NULL; 928 788 } ··· 993 853 __func__, (unsigned long long)sh->sector, 994 854 bi->bi_rw, i); 995 855 atomic_inc(&sh->count); 856 + if (sh != head_sh) 857 + atomic_inc(&head_sh->count); 996 858 if (use_new_offset(conf, sh)) 997 859 bi->bi_iter.bi_sector = (sh->sector 998 860 + rdev->new_data_offset); 999 861 else 1000 862 bi->bi_iter.bi_sector = (sh->sector 1001 863 + rdev->data_offset); 1002 - if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags)) 864 + if (test_bit(R5_ReadNoMerge, &head_sh->dev[i].flags)) 1003 865 bi->bi_rw |= REQ_NOMERGE; 1004 866 1005 867 
 	if (test_bit(R5_SkipCopy, &sh->dev[i].flags))
···
 				__func__, (unsigned long long)sh->sector,
 				rbi->bi_rw, i);
 			atomic_inc(&sh->count);
+			if (sh != head_sh)
+				atomic_inc(&head_sh->count);
 			if (use_new_offset(conf, sh))
 				rbi->bi_iter.bi_sector = (sh->sector
 						  + rrdev->new_data_offset);
···
 			pr_debug("skip op %ld on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
 			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			if (sh->batch_head)
+				set_bit(STRIPE_BATCH_ERR,
+					&sh->batch_head->state);
 			set_bit(STRIPE_HANDLE, &sh->state);
 		}
+
+		if (!head_sh->batch_head)
+			continue;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		if (sh != head_sh)
+			goto again;
 	}
 }
 
···
 	struct async_submit_ctl submit;
 	int i;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
···
 
 /* return a pointer to the address conversion region of the scribble buffer */
 static addr_conv_t *to_addr_conv(struct stripe_head *sh,
-				 struct raid5_percpu *percpu)
+				 struct raid5_percpu *percpu, int i)
 {
-	return percpu->scribble + sizeof(struct page *) * (sh->disks + 2);
+	void *addr;
+
+	addr = flex_array_get(percpu->scribble, i);
+	return addr + sizeof(struct page *) * (sh->disks + 2);
+}
+
+/* return a pointer to the address conversion region of the scribble buffer */
+static struct page **to_addr_page(struct raid5_percpu *percpu, int i)
+{
+	void *addr;
+
+	addr = flex_array_get(percpu->scribble, i);
+	return addr;
 }
 
 static struct dma_async_tx_descriptor *
 ops_run_compute5(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	int target = sh->ops.target;
 	struct r5dev *tgt = &sh->dev[target];
 	struct page *xor_dest = tgt->page;
···
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int i;
+
+	BUG_ON(sh->batch_head);
 
 	pr_debug("%s: stripe %llu block: %d\n",
 		__func__, (unsigned long long)sh->sector, target);
···
 	atomic_inc(&sh->count);
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST, NULL,
-			  ops_complete_compute, sh, to_addr_conv(sh, percpu));
+			  ops_complete_compute, sh, to_addr_conv(sh, percpu, 0));
 	if (unlikely(count == 1))
 		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
 	else
···
  * destination buffer is recorded in srcs[count] and the Q destination
  * is recorded in srcs[count+1]].
  */
-static int set_syndrome_sources(struct page **srcs, struct stripe_head *sh)
+static int set_syndrome_sources(struct page **srcs,
+				struct stripe_head *sh,
+				int srctype)
 {
 	int disks = sh->disks;
 	int syndrome_disks = sh->ddf_layout ? disks : (disks - 2);
···
 	i = d0_idx;
 	do {
 		int slot = raid6_idx_to_slot(i, sh, &count, syndrome_disks);
+		struct r5dev *dev = &sh->dev[i];
 
-		srcs[slot] = sh->dev[i].page;
+		if (i == sh->qd_idx || i == sh->pd_idx ||
+		    (srctype == SYNDROME_SRC_ALL) ||
+		    (srctype == SYNDROME_SRC_WANT_DRAIN &&
+		     test_bit(R5_Wantdrain, &dev->flags)) ||
+		    (srctype == SYNDROME_SRC_WRITTEN &&
+		     dev->written))
+			srcs[slot] = sh->dev[i].page;
 		i = raid6_next_disk(i, disks);
 	} while (i != d0_idx);
 
···
 ops_run_compute6_1(struct stripe_head *sh, struct raid5_percpu *percpu)
 {
 	int disks = sh->disks;
-	struct page **blocks = percpu->scribble;
+	struct page **blocks = to_addr_page(percpu, 0);
 	int target;
 	int qd_idx = sh->qd_idx;
 	struct dma_async_tx_descriptor *tx;
···
 	int i;
 	int count;
 
+	BUG_ON(sh->batch_head);
 	if (sh->ops.target < 0)
 		target = sh->ops.target2;
 	else if (sh->ops.target2 < 0)
···
 	atomic_inc(&sh->count);
 
 	if (target == qd_idx) {
-		count = set_syndrome_sources(blocks, sh);
+		count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
 		blocks[count] = NULL; /* regenerating p is not necessary */
 		BUG_ON(blocks[count+1] != dest); /* q should already be set */
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
 	} else {
 		/* Compute any data- or p-drive using XOR */
···
 
 		init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
 				  NULL, ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE, &submit);
 	}
 
···
 	struct r5dev *tgt = &sh->dev[target];
 	struct r5dev *tgt2 = &sh->dev[target2];
 	struct dma_async_tx_descriptor *tx;
-	struct page **blocks = percpu->scribble;
+	struct page **blocks = to_addr_page(percpu, 0);
 	struct async_submit_ctl submit;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu block1: %d block2: %d\n",
 		 __func__, (unsigned long long)sh->sector, target, target2);
 	BUG_ON(target < 0 || target2 < 0);
···
 		/* Missing P+Q, just recompute */
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		return async_gen_syndrome(blocks, 0, syndrome_disks+2,
 					  STRIPE_SIZE, &submit);
 	} else {
···
 			init_async_submit(&submit,
 					  ASYNC_TX_FENCE|ASYNC_TX_XOR_ZERO_DST,
 					  NULL, NULL, NULL,
-					  to_addr_conv(sh, percpu));
+					  to_addr_conv(sh, percpu, 0));
 			tx = async_xor(dest, blocks, 0, count, STRIPE_SIZE,
 				       &submit);
 
-			count = set_syndrome_sources(blocks, sh);
+			count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_ALL);
 			init_async_submit(&submit, ASYNC_TX_FENCE, tx,
 					  ops_complete_compute, sh,
-					  to_addr_conv(sh, percpu));
+					  to_addr_conv(sh, percpu, 0));
 			return async_gen_syndrome(blocks, 0, count+2,
 						  STRIPE_SIZE, &submit);
 		}
 	} else {
 		init_async_submit(&submit, ASYNC_TX_FENCE, NULL,
 				  ops_complete_compute, sh,
-				  to_addr_conv(sh, percpu));
+				  to_addr_conv(sh, percpu, 0));
 		if (failb == syndrome_disks) {
 			/* We're missing D+P. */
 			return async_raid6_datap_recov(syndrome_disks+2,
···
 	}
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_prexor(struct stripe_head *sh, struct raid5_percpu *percpu,
-	       struct dma_async_tx_descriptor *tx)
+ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
+		struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct async_submit_ctl submit;
 
 	/* existing parity data subtracted */
 	struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 
+	BUG_ON(sh->batch_head);
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
···
 	}
 
 	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_XOR_DROP_DST, tx,
-			  ops_complete_prexor, sh, to_addr_conv(sh, percpu));
+			  ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
 	tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+
+	return tx;
+}
+
+static struct dma_async_tx_descriptor *
+ops_run_prexor6(struct stripe_head *sh, struct raid5_percpu *percpu,
+		struct dma_async_tx_descriptor *tx)
+{
+	struct page **blocks = to_addr_page(percpu, 0);
+	int count;
+	struct async_submit_ctl submit;
+
+	pr_debug("%s: stripe %llu\n", __func__,
+		(unsigned long long)sh->sector);
+
+	count = set_syndrome_sources(blocks, sh, SYNDROME_SRC_WANT_DRAIN);
+
+	init_async_submit(&submit, ASYNC_TX_FENCE|ASYNC_TX_PQ_XOR_DST, tx,
+			  ops_complete_prexor, sh, to_addr_conv(sh, percpu, 0));
+	tx = async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
 
 	return tx;
 }
···
 {
 	int disks = sh->disks;
 	int i;
+	struct stripe_head *head_sh = sh;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
 	for (i = disks; i--; ) {
-		struct r5dev *dev = &sh->dev[i];
+		struct r5dev *dev;
 		struct bio *chosen;
 
-		if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
+		sh = head_sh;
+		if (test_and_clear_bit(R5_Wantdrain, &head_sh->dev[i].flags)) {
 			struct bio *wbi;
 
+again:
+			dev = &sh->dev[i];
 			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
+			sh->overwrite_disks = 0;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
 			spin_unlock_irq(&sh->stripe_lock);
···
 				}
 			}
 			wbi = r5_next_bio(wbi, dev->sector);
+		}
+
+		if (head_sh->batch_head) {
+			sh = list_first_entry(&sh->batch_list,
+					      struct stripe_head,
+					      batch_list);
+			if (sh == head_sh)
+				continue;
+			goto again;
 		}
 	}
 }
···
 			  struct dma_async_tx_descriptor *tx)
 {
 	int disks = sh->disks;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs;
 	struct async_submit_ctl submit;
-	int count = 0, pd_idx = sh->pd_idx, i;
+	int count, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
 	int prexor = 0;
 	unsigned long flags;
+	int j = 0;
+	struct stripe_head *head_sh = sh;
+	int last_stripe;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
···
 		ops_complete_reconstruct(sh);
 		return;
 	}
+again:
+	count = 0;
+	xor_srcs = to_addr_page(percpu, j);
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (written)
 	 */
-	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+	if (head_sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
 		prexor = 1;
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (dev->written)
+			if (head_sh->dev[i].written)
 				xor_srcs[count++] = dev->page;
 		}
 	} else {
···
 	 * set ASYNC_TX_XOR_DROP_DST and ASYNC_TX_XOR_ZERO_DST
 	 * for the synchronous xor case
 	 */
-	flags = ASYNC_TX_ACK |
-		(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
+	last_stripe = !head_sh->batch_head ||
+		list_first_entry(&sh->batch_list,
+				 struct stripe_head, batch_list) == head_sh;
+	if (last_stripe) {
+		flags = ASYNC_TX_ACK |
+			(prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST);
 
-	atomic_inc(&sh->count);
+		atomic_inc(&head_sh->count);
+		init_async_submit(&submit, flags, tx, ops_complete_reconstruct, head_sh,
+				  to_addr_conv(sh, percpu, j));
+	} else {
+		flags = prexor ? ASYNC_TX_XOR_DROP_DST : ASYNC_TX_XOR_ZERO_DST;
+		init_async_submit(&submit, flags, tx, NULL, NULL,
+				  to_addr_conv(sh, percpu, j));
+	}
 
-	init_async_submit(&submit, flags, tx, ops_complete_reconstruct, sh,
-			  to_addr_conv(sh, percpu));
 	if (unlikely(count == 1))
 		tx = async_memcpy(xor_dest, xor_srcs[0], 0, 0, STRIPE_SIZE, &submit);
 	else
 		tx = async_xor(xor_dest, xor_srcs, 0, count, STRIPE_SIZE, &submit);
+	if (!last_stripe) {
+		j++;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		goto again;
+	}
 }
 
 static void
···
 			  struct dma_async_tx_descriptor *tx)
 {
 	struct async_submit_ctl submit;
-	struct page **blocks = percpu->scribble;
-	int count, i;
+	struct page **blocks;
+	int count, i, j = 0;
+	struct stripe_head *head_sh = sh;
+	int last_stripe;
+	int synflags;
+	unsigned long txflags;
 
 	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
 
···
 		return;
 	}
 
-	count = set_syndrome_sources(blocks, sh);
+again:
+	blocks = to_addr_page(percpu, j);
 
-	atomic_inc(&sh->count);
+	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+		synflags = SYNDROME_SRC_WRITTEN;
+		txflags = ASYNC_TX_ACK | ASYNC_TX_PQ_XOR_DST;
+	} else {
+		synflags = SYNDROME_SRC_ALL;
+		txflags = ASYNC_TX_ACK;
+	}
 
-	init_async_submit(&submit, ASYNC_TX_ACK, tx, ops_complete_reconstruct,
-			  sh, to_addr_conv(sh, percpu));
+	count = set_syndrome_sources(blocks, sh, synflags);
+	last_stripe = !head_sh->batch_head ||
+		list_first_entry(&sh->batch_list,
+				 struct stripe_head, batch_list) == head_sh;
+
+	if (last_stripe) {
+		atomic_inc(&head_sh->count);
+		init_async_submit(&submit, txflags, tx, ops_complete_reconstruct,
+				  head_sh, to_addr_conv(sh, percpu, j));
+	} else
+		init_async_submit(&submit, 0, tx, NULL, NULL,
+				  to_addr_conv(sh, percpu, j));
 	async_gen_syndrome(blocks, 0, count+2, STRIPE_SIZE, &submit);
+	if (!last_stripe) {
+		j++;
+		sh = list_first_entry(&sh->batch_list, struct stripe_head,
+				      batch_list);
+		goto again;
+	}
 }
 
 static void ops_complete_check(void *stripe_head_ref)
···
 	int pd_idx = sh->pd_idx;
 	int qd_idx = sh->qd_idx;
 	struct page *xor_dest;
-	struct page **xor_srcs = percpu->scribble;
+	struct page **xor_srcs = to_addr_page(percpu, 0);
 	struct dma_async_tx_descriptor *tx;
 	struct async_submit_ctl submit;
 	int count;
···
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
+	BUG_ON(sh->batch_head);
 	count = 0;
 	xor_dest = sh->dev[pd_idx].page;
 	xor_srcs[count++] = xor_dest;
···
 	}
 
 	init_async_submit(&submit, 0, NULL, NULL, NULL,
-			  to_addr_conv(sh, percpu));
+			  to_addr_conv(sh, percpu, 0));
 	tx = async_xor_val(xor_dest, xor_srcs, 0, count, STRIPE_SIZE,
 			   &sh->ops.zero_sum_result, &submit);
···
 
 static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu, int checkp)
 {
-	struct page **srcs = percpu->scribble;
+	struct page **srcs = to_addr_page(percpu, 0);
 	struct async_submit_ctl submit;
 	int count;
 
 	pr_debug("%s: stripe %llu checkp: %d\n", __func__,
 		(unsigned long long)sh->sector, checkp);
 
+	BUG_ON(sh->batch_head);
-	count = set_syndrome_sources(srcs, sh);
+	count = set_syndrome_sources(srcs, sh, SYNDROME_SRC_ALL);
 	if (!checkp)
 		srcs[count] = NULL;
 
 	atomic_inc(&sh->count);
 	init_async_submit(&submit, ASYNC_TX_ACK, NULL, ops_complete_check,
-			  sh, to_addr_conv(sh, percpu));
+			  sh, to_addr_conv(sh, percpu, 0));
 	async_syndrome_val(srcs, 0, count+2, STRIPE_SIZE,
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }
···
 		async_tx_ack(tx);
 	}
 
-	if (test_bit(STRIPE_OP_PREXOR, &ops_request))
-		tx = ops_run_prexor(sh, percpu, tx);
+	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
+		if (level < 6)
+			tx = ops_run_prexor5(sh, percpu, tx);
+		else
+			tx = ops_run_prexor6(sh, percpu, tx);
+	}
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &ops_request)) {
 		tx = ops_run_biodrain(sh, tx);
···
 		BUG();
 	}
 
-	if (overlap_clear)
+	if (overlap_clear && !sh->batch_head)
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
···
 	put_cpu();
 }
 
-static int grow_one_stripe(struct r5conf *conf, int hash)
+static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
 {
 	struct stripe_head *sh;
-	sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);
+	sh = kmem_cache_zalloc(conf->slab_cache, gfp);
 	if (!sh)
 		return 0;
···
 
 	spin_lock_init(&sh->stripe_lock);
 
-	if (grow_buffers(sh)) {
+	if (grow_buffers(sh, gfp)) {
 		shrink_buffers(sh);
 		kmem_cache_free(conf->slab_cache, sh);
 		return 0;
 	}
-	sh->hash_lock_index = hash;
+	sh->hash_lock_index =
+		conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
 	/* we just created an active stripe so... */
 	atomic_set(&sh->count, 1);
 	atomic_inc(&conf->active_stripes);
 	INIT_LIST_HEAD(&sh->lru);
+
+	spin_lock_init(&sh->batch_lock);
+	INIT_LIST_HEAD(&sh->batch_list);
+	sh->batch_head = NULL;
 	release_stripe(sh);
+	conf->max_nr_stripes++;
 	return 1;
 }
 
···
 {
 	struct kmem_cache *sc;
 	int devs = max(conf->raid_disks, conf->previous_raid_disks);
-	int hash;
 
 	if (conf->mddev->gendisk)
 		sprintf(conf->cache_name[0],
···
 		return 1;
 	conf->slab_cache = sc;
 	conf->pool_size = devs;
-	hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS;
-	while (num--) {
-		if (!grow_one_stripe(conf, hash))
+	while (num--)
+		if (!grow_one_stripe(conf, GFP_KERNEL))
 			return 1;
-		conf->max_nr_stripes++;
-		hash = (hash + 1) % NR_STRIPE_HASH_LOCKS;
-	}
+
 	return 0;
 }
 
···
  * calculate over all devices (not just the data blocks), using zeros in place
  * of the P and Q blocks.
  */
-static size_t scribble_len(int num)
+static struct flex_array *scribble_alloc(int num, int cnt, gfp_t flags)
 {
+	struct flex_array *ret;
 	size_t len;
 
 	len = sizeof(struct page *) * (num+2) + sizeof(addr_conv_t) * (num+2);
-
-	return len;
+	ret = flex_array_alloc(len, cnt, flags);
+	if (!ret)
+		return NULL;
+	/* always prealloc all elements, so no locking is required */
+	if (flex_array_prealloc(ret, 0, cnt, flags)) {
+		flex_array_free(ret);
+		return NULL;
+	}
+	return ret;
 }
 
 static int resize_stripes(struct r5conf *conf, int newsize)
···
 	err = -ENOMEM;
 
 	get_online_cpus();
-	conf->scribble_len = scribble_len(newsize);
 	for_each_present_cpu(cpu) {
 		struct raid5_percpu *percpu;
-		void *scribble;
+		struct flex_array *scribble;
 
 		percpu = per_cpu_ptr(conf->percpu, cpu);
-		scribble = kmalloc(conf->scribble_len, GFP_NOIO);
+		scribble = scribble_alloc(newsize, conf->chunk_sectors /
+					  STRIPE_SECTORS, GFP_NOIO);
 
 		if (scribble) {
-			kfree(percpu->scribble);
+			flex_array_free(percpu->scribble);
 			percpu->scribble = scribble;
 		} else {
 			err = -ENOMEM;
···
 	return err;
 }
 
-static int drop_one_stripe(struct r5conf *conf, int hash)
+static int drop_one_stripe(struct r5conf *conf)
 {
 	struct stripe_head *sh;
+	int hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS;
 
 	spin_lock_irq(conf->hash_locks + hash);
 	sh = get_free_stripe(conf, hash);
···
 	shrink_buffers(sh);
 	kmem_cache_free(conf->slab_cache, sh);
 	atomic_dec(&conf->active_stripes);
+	conf->max_nr_stripes--;
 	return 1;
 }
 
 static void shrink_stripes(struct r5conf *conf)
 {
-	int hash;
-	for (hash = 0; hash < NR_STRIPE_HASH_LOCKS; hash++)
-		while (drop_one_stripe(conf, hash))
-			;
+	while (conf->max_nr_stripes &&
+		drop_one_stripe(conf))
+		;
 
 	if (conf->slab_cache)
 		kmem_cache_destroy(conf->slab_cache);
···
 	}
 	rdev_dec_pending(rdev, conf->mddev);
 
+	if (sh->batch_head && !uptodate)
+		set_bit(STRIPE_BATCH_ERR, &sh->batch_head->state);
+
 	if (!test_and_clear_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags))
 		clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	release_stripe(sh);
+
+	if (sh->batch_head && sh != sh->batch_head)
+		release_stripe(sh->batch_head);
 }
 
 static sector_t compute_blocknr(struct stripe_head *sh, int i, int previous);
···
 schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 			 int rcw, int expand)
 {
-	int i, pd_idx = sh->pd_idx, disks = sh->disks;
+	int i, pd_idx = sh->pd_idx, qd_idx = sh->qd_idx, disks = sh->disks;
 	struct r5conf *conf = sh->raid_conf;
 	int level = conf->level;
···
 		if (!test_and_set_bit(STRIPE_FULL_WRITE, &sh->state))
 			atomic_inc(&conf->pending_full_writes);
 	} else {
-		BUG_ON(level == 6);
 		BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
 			test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+		BUG_ON(level == 6 &&
+			(!(test_bit(R5_UPTODATE, &sh->dev[qd_idx].flags) ||
+			   test_bit(R5_Wantcompute, &sh->dev[qd_idx].flags))));
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (i == pd_idx)
+			if (i == pd_idx || i == qd_idx)
 				continue;
 
 			if (dev->towrite &&
···
  * toread/towrite point to the first in a chain.
  * The bi_next chain must be in order.
  */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite)
+static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
+			  int forwrite, int previous)
 {
 	struct bio **bip;
 	struct r5conf *conf = sh->raid_conf;
···
 	 * protect it.
 	 */
 	spin_lock_irq(&sh->stripe_lock);
+	/* Don't allow new IO added to stripes in batch list */
+	if (sh->batch_head)
+		goto overlap;
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL)
···
 	}
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
+
+	if (!forwrite || previous)
+		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
 	BUG_ON(*bip && bi->bi_next && (*bip) != bi->bi_next);
 	if (*bip)
···
 			sector = bio_end_sector(bi);
 		}
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
-			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+			if (!test_and_set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags))
+				sh->overwrite_disks++;
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
···
 		sh->bm_seq = conf->seq_flush+1;
 		set_bit(STRIPE_BIT_DELAY, &sh->state);
 	}
+
+	if (stripe_can_batch(sh))
+		stripe_add_to_batch_list(conf, sh);
 	return 1;
 
  overlap:
···
 				struct bio **return_bi)
 {
 	int i;
+	BUG_ON(sh->batch_head);
 	for (i = disks; i--; ) {
 		struct bio *bi;
 		int bitmap_end = 0;
···
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
+		sh->overwrite_disks = 0;
 		spin_unlock_irq(&sh->stripe_lock);
 		if (bi)
 			bitmap_end = 1;
···
 	int abort = 0;
 	int i;
 
+	BUG_ON(sh->batch_head);
 	clear_bit(STRIPE_SYNCING, &sh->state);
 	if (test_and_clear_bit(R5_Overlap, &sh->dev[sh->pd_idx].flags))
 		wake_up(&conf->wait_for_overlap);
···
 {
 	int i;
 
+	BUG_ON(sh->batch_head);
 	/* look for blocks to read/compute, skip this if a compute
 	 * is already in flight, or if the stripe contents are in the
 	 * midst of changing due to a write
···
 	int i;
 	struct r5dev *dev;
 	int discard_pending = 0;
+	struct stripe_head *head_sh = sh;
+	bool do_endio = false;
+	int wakeup_nr = 0;
 
 	for (i = disks; i--; )
 		if (sh->dev[i].written) {
···
 				clear_bit(R5_UPTODATE, &dev->flags);
 				if (test_and_clear_bit(R5_SkipCopy, &dev->flags)) {
 					WARN_ON(test_bit(R5_UPTODATE, &dev->flags));
-					dev->page = dev->orig_page;
 				}
+				do_endio = true;
+
+returnbi:
+				dev->page = dev->orig_page;
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_iter.bi_sector <
···
 						STRIPE_SECTORS,
 					 !test_bit(STRIPE_DEGRADED, &sh->state),
 						0);
+				if (head_sh->batch_head) {
+					sh = list_first_entry(&sh->batch_list,
+							      struct stripe_head,
+							      batch_list);
+					if (sh != head_sh) {
+						dev = &sh->dev[i];
+						goto returnbi;
+					}
+				}
+				sh = head_sh;
+				dev = &sh->dev[i];
 			} else if (test_bit(R5_Discard, &dev->flags))
 				discard_pending = 1;
 			WARN_ON(test_bit(R5_SkipCopy, &dev->flags));
···
 		 * will be reinitialized
 		 */
 		spin_lock_irq(&conf->device_lock);
+unhash:
 		remove_hash(sh);
+		if (head_sh->batch_head) {
+			sh = list_first_entry(&sh->batch_list,
+					      struct stripe_head, batch_list);
+			if (sh != head_sh)
+				goto unhash;
+		}
 		spin_unlock_irq(&conf->device_lock);
+		sh = head_sh;
+
 		if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state))
 			set_bit(STRIPE_HANDLE, &sh->state);
 
···
 	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
 		if (atomic_dec_and_test(&conf->pending_full_writes))
 			md_wakeup_thread(conf->mddev->thread);
+
+	if (!head_sh->batch_head || !do_endio)
+		return;
+	for (i = 0; i < head_sh->disks; i++) {
+		if (test_and_clear_bit(R5_Overlap, &head_sh->dev[i].flags))
+			wakeup_nr++;
+	}
+	while (!list_empty(&head_sh->batch_list)) {
+		int i;
+		sh = list_first_entry(&head_sh->batch_list,
+				      struct stripe_head, batch_list);
+		list_del_init(&sh->batch_list);
+
+		set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
+			      head_sh->state & ~((1 << STRIPE_ACTIVE) |
+						 (1 << STRIPE_PREREAD_ACTIVE) |
+						 STRIPE_EXPAND_SYNC_FLAG));
+		sh->check_state = head_sh->check_state;
+		sh->reconstruct_state = head_sh->reconstruct_state;
+		for (i = 0; i < sh->disks; i++) {
+			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+				wakeup_nr++;
+			sh->dev[i].flags = head_sh->dev[i].flags;
+		}
+
+		spin_lock_irq(&sh->stripe_lock);
+		sh->batch_head = NULL;
+		spin_unlock_irq(&sh->stripe_lock);
+		if (sh->state & STRIPE_EXPAND_SYNC_FLAG)
+			set_bit(STRIPE_HANDLE, &sh->state);
+		release_stripe(sh);
+	}
+
+	spin_lock_irq(&head_sh->stripe_lock);
+	head_sh->batch_head = NULL;
+	spin_unlock_irq(&head_sh->stripe_lock);
+	wake_up_nr(&conf->wait_for_overlap, wakeup_nr);
+	if (head_sh->state & STRIPE_EXPAND_SYNC_FLAG)
+		set_bit(STRIPE_HANDLE, &head_sh->state);
 }
 
 static void handle_stripe_dirtying(struct r5conf *conf,
···
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
 
-	/* RAID6 requires 'rcw' in current implementation.
-	 * Otherwise, check whether resync is now happening or should start.
+	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
 	 * initial creation), so parity in some stripes might be inconsistent.
 	 * In this case, we need to always do reconstruct-write, to ensure
 	 * that in case of drive failure or read-error correction, we
 	 * generate correct data from the parity.
 	 */
-	if (conf->max_degraded == 2 ||
+	if (conf->rmw_level == PARITY_DISABLE_RMW ||
 	    (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
 	     s->failed == 0)) {
 		/* Calculate the real rcw later - for now make it
 		 * look like rcw is cheaper
 		 */
 		rcw = 1; rmw = 2;
-		pr_debug("force RCW max_degraded=%u, recovery_cp=%llu sh->sector=%llu\n",
-			 conf->max_degraded, (unsigned long long)recovery_cp,
+		pr_debug("force RCW rmw_level=%u, recovery_cp=%llu sh->sector=%llu\n",
+			 conf->rmw_level, (unsigned long long)recovery_cp,
 			 (unsigned long long)sh->sector);
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx) &&
+		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
···
 				rmw += 2*disks;  /* cannot read it */
 		}
 		/* Would I have to read this buffer for reconstruct_write */
-		if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
+		if (!test_bit(R5_OVERWRITE, &dev->flags) &&
+		    i != sh->pd_idx && i != sh->qd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
···
 	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
 		 (unsigned long long)sh->sector, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
-	if (rmw < rcw && rmw > 0) {
+	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_ENABLE_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
 		if (conf->mddev->queue)
 			blk_add_trace_msg(conf->mddev->queue,
···
 				  (unsigned long long)sh->sector, rmw);
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx) &&
+			if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
 			    test_bit(R5_Wantcompute, &dev->flags)) &&
···
 			}
 		}
 	}
-	if (rcw <= rmw && rcw > 0) {
+	if ((rcw < rmw || (rcw == rmw && conf->rmw_level != PARITY_ENABLE_RMW)) && rcw > 0) {
 		/* want reconstruct write, but need to get some data */
 		int qread =0;
 		rcw = 0;
···
 {
 	struct r5dev *dev = NULL;
 
+	BUG_ON(sh->batch_head);
 	set_bit(STRIPE_HANDLE, &sh->state);
 
 	switch (sh->check_state) {
···
 	int qd_idx = sh->qd_idx;
 	struct r5dev *dev;
 
+	BUG_ON(sh->batch_head);
 	set_bit(STRIPE_HANDLE, &sh->state);
 
 	BUG_ON(s->failed > 2);
···
 	 * copy some of them into a target stripe for expand.
 	 */
 	struct dma_async_tx_descriptor *tx = NULL;
+	BUG_ON(sh->batch_head);
 	clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	for (i = 0; i < sh->disks; i++)
 		if (i != sh->pd_idx && i != sh->qd_idx) {
···
 
 	memset(s, 0, sizeof(*s));
 
-	s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
-	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
+	s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state) && !sh->batch_head;
+	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
 	s->failed_num[0] = -1;
 	s->failed_num[1] = -1;
···
 	rcu_read_unlock();
 }
 
+static int clear_batch_ready(struct stripe_head *sh)
+{
+	struct stripe_head *tmp;
+	if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
+		return 0;
+	spin_lock(&sh->stripe_lock);
+	if (!sh->batch_head) {
+		spin_unlock(&sh->stripe_lock);
+		return 0;
+	}
+
+	/*
+	 * this stripe could be added to a batch list before we check
+	 * BATCH_READY, skips it
+	 */
+	if (sh->batch_head != sh) {
+		spin_unlock(&sh->stripe_lock);
+		return 1;
+	}
+	spin_lock(&sh->batch_lock);
+	list_for_each_entry(tmp, &sh->batch_list, batch_list)
+		clear_bit(STRIPE_BATCH_READY, &tmp->state);
+	spin_unlock(&sh->batch_lock);
+	spin_unlock(&sh->stripe_lock);
+
+	/*
+	 * BATCH_READY is cleared, no new stripes can be added.
+	 * batch_list can be accessed without lock
+	 */
+	return 0;
+}
+
+static void check_break_stripe_batch_list(struct stripe_head *sh)
+{
+	struct stripe_head *head_sh, *next;
+	int i;
+
+	if (!test_and_clear_bit(STRIPE_BATCH_ERR, &sh->state))
+		return;
+
+	head_sh = sh;
+	do {
+		sh = list_first_entry(&sh->batch_list,
+				      struct stripe_head, batch_list);
+		BUG_ON(sh == head_sh);
+	} while (!test_bit(STRIPE_DEGRADED, &sh->state));
+
+	while (sh != head_sh) {
+		next = list_first_entry(&sh->batch_list,
+					struct stripe_head, batch_list);
+		list_del_init(&sh->batch_list);
+
+		set_mask_bits(&sh->state, ~STRIPE_EXPAND_SYNC_FLAG,
+			      head_sh->state & ~((1 << STRIPE_ACTIVE) |
+						 (1 << STRIPE_PREREAD_ACTIVE) |
+						 (1 << STRIPE_DEGRADED) |
+						 STRIPE_EXPAND_SYNC_FLAG));
+		sh->check_state = head_sh->check_state;
+		sh->reconstruct_state = head_sh->reconstruct_state;
+		for (i = 0; i < sh->disks; i++)
+			sh->dev[i].flags = head_sh->dev[i].flags &
+				(~((1 << R5_WriteError) | (1 << R5_Overlap)));
+
+		spin_lock_irq(&sh->stripe_lock);
+		sh->batch_head = NULL;
+		spin_unlock_irq(&sh->stripe_lock);
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		release_stripe(sh);
+
+		sh = next;
+	}
+}
+
 static void handle_stripe(struct stripe_head *sh)
 {
 	struct stripe_head_state s;
···
 		return;
 	}
 
-	if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
+	if (clear_batch_ready(sh) ) {
+		clear_bit_unlock(STRIPE_ACTIVE, &sh->state);
+		return;
+	}
+
+	check_break_stripe_batch_list(sh);
+
+	if (test_bit(STRIPE_SYNC_REQUESTED, &sh->state) && !sh->batch_head) {
 		spin_lock(&sh->stripe_lock);
 		/* Cannot process 'sync' concurrently with 'discard' */
 		if (!test_bit(STRIPE_DISCARD, &sh->state) &&
···
 	 * how busy the stripe_cache is
 	 */
 
-	if (conf->inactive_blocked)
+	if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
 		return 1;
 	if (conf->quiesce)
 		return 1;
···
 	unsigned int chunk_sectors = mddev->chunk_sectors;
 	unsigned int bio_sectors = bvm->bi_size >> 9;
 
-	if ((bvm->bi_rw & 1) == WRITE)
-		return biovec->bv_len; /* always allow writes to be mergeable */
+	/*
+	 * always allow writes to be mergeable, read as well if array
+	 * is degraded as we'll go through stripe cache anyway.
+	 */
+	if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
+		return biovec->bv_len;
 
 	if (mddev->new_chunk_sectors < mddev->chunk_sectors)
 		chunk_sectors = mddev->new_chunk_sectors;
···
 		}
 		set_bit(STRIPE_DISCARD, &sh->state);
 		finish_wait(&conf->wait_for_overlap, &w);
+		sh->overwrite_disks = 0;
 		for (d = 0; d < conf->raid_disks; d++) {
 			if (d == sh->pd_idx || d == sh->qd_idx)
 				continue;
 			sh->dev[d].towrite = bi;
 			set_bit(R5_OVERWRITE, &sh->dev[d].flags);
 			raid5_inc_bi_active_stripes(bi);
+			sh->overwrite_disks++;
 		}
 		spin_unlock_irq(&sh->stripe_lock);
 		if (conf->mddev->bitmap) {
···
 
 	md_write_start(mddev, bi);
 
-	if (rw == READ &&
+	/*
+	 * If array is degraded, better not do chunk aligned read because
+	 * later we might have to read it again in order to reconstruct
+	 * data on failed drives.
4663 + */ 4664 + if (rw == READ && mddev->degraded == 0 && 5118 4665 mddev->reshape_position == MaxSector && 5119 4666 chunk_aligned_read(mddev,bi)) 5120 4667 return; ··· 5235 4772 } 5236 4773 5237 4774 if (test_bit(STRIPE_EXPANDING, &sh->state) || 5238 - !add_stripe_bio(sh, bi, dd_idx, rw)) { 4775 + !add_stripe_bio(sh, bi, dd_idx, rw, previous)) { 5239 4776 /* Stripe is busy expanding or 5240 4777 * add failed due to overlap. Flush everything 5241 4778 * and wait a while ··· 5248 4785 } 5249 4786 set_bit(STRIPE_HANDLE, &sh->state); 5250 4787 clear_bit(STRIPE_DELAYED, &sh->state); 5251 - if ((bi->bi_rw & REQ_SYNC) && 4788 + if ((!sh->batch_head || sh == sh->batch_head) && 4789 + (bi->bi_rw & REQ_SYNC) && 5252 4790 !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) 5253 4791 atomic_inc(&conf->preread_active_stripes); 5254 4792 release_stripe_plug(mddev, sh); ··· 5514 5050 return reshape_sectors; 5515 5051 } 5516 5052 5517 - /* FIXME go_faster isn't used */ 5518 - static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped, int go_faster) 5053 + static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipped) 5519 5054 { 5520 5055 struct r5conf *conf = mddev->private; 5521 5056 struct stripe_head *sh; ··· 5649 5186 return handled; 5650 5187 } 5651 5188 5652 - if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { 5189 + if (!add_stripe_bio(sh, raid_bio, dd_idx, 0, 0)) { 5653 5190 release_stripe(sh); 5654 5191 raid5_set_bi_processed_stripes(raid_bio, scnt); 5655 5192 conf->retry_read_aligned = raid_bio; ··· 5775 5312 int batch_size, released; 5776 5313 5777 5314 released = release_stripe_list(conf, conf->temp_inactive_list); 5315 + if (released) 5316 + clear_bit(R5_DID_ALLOC, &conf->cache_state); 5778 5317 5779 5318 if ( 5780 5319 !list_empty(&conf->bitmap_list)) { ··· 5815 5350 pr_debug("%d stripes handled\n", handled); 5816 5351 5817 5352 spin_unlock_irq(&conf->device_lock); 5353 + if 
(test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state)) { 5354 + grow_one_stripe(conf, __GFP_NOWARN); 5355 + /* Set flag even if allocation failed. This helps 5356 + * slow down allocation requests when mem is short 5357 + */ 5358 + set_bit(R5_DID_ALLOC, &conf->cache_state); 5359 + } 5818 5360 5819 5361 async_tx_issue_pending_all(); 5820 5362 blk_finish_plug(&plug); ··· 5837 5365 spin_lock(&mddev->lock); 5838 5366 conf = mddev->private; 5839 5367 if (conf) 5840 - ret = sprintf(page, "%d\n", conf->max_nr_stripes); 5368 + ret = sprintf(page, "%d\n", conf->min_nr_stripes); 5841 5369 spin_unlock(&mddev->lock); 5842 5370 return ret; 5843 5371 } ··· 5847 5375 { 5848 5376 struct r5conf *conf = mddev->private; 5849 5377 int err; 5850 - int hash; 5851 5378 5852 5379 if (size <= 16 || size > 32768) 5853 5380 return -EINVAL; 5854 - hash = (conf->max_nr_stripes - 1) % NR_STRIPE_HASH_LOCKS; 5855 - while (size < conf->max_nr_stripes) { 5856 - if (drop_one_stripe(conf, hash)) 5857 - conf->max_nr_stripes--; 5858 - else 5859 - break; 5860 - hash--; 5861 - if (hash < 0) 5862 - hash = NR_STRIPE_HASH_LOCKS - 1; 5863 - } 5381 + 5382 + conf->min_nr_stripes = size; 5383 + while (size < conf->max_nr_stripes && 5384 + drop_one_stripe(conf)) 5385 + ; 5386 + 5387 + 5864 5388 err = md_allow_write(mddev); 5865 5389 if (err) 5866 5390 return err; 5867 - hash = conf->max_nr_stripes % NR_STRIPE_HASH_LOCKS; 5868 - while (size > conf->max_nr_stripes) { 5869 - if (grow_one_stripe(conf, hash)) 5870 - conf->max_nr_stripes++; 5871 - else break; 5872 - hash = (hash + 1) % NR_STRIPE_HASH_LOCKS; 5873 - } 5391 + 5392 + while (size > conf->max_nr_stripes) 5393 + if (!grow_one_stripe(conf, GFP_KERNEL)) 5394 + break; 5395 + 5874 5396 return 0; 5875 5397 } 5876 5398 EXPORT_SYMBOL(raid5_set_cache_size); ··· 5899 5433 raid5_store_stripe_cache_size); 5900 5434 5901 5435 static ssize_t 5436 + raid5_show_rmw_level(struct mddev *mddev, char *page) 5437 + { 5438 + struct r5conf *conf = mddev->private; 5439 + if 
(conf) 5440 + return sprintf(page, "%d\n", conf->rmw_level); 5441 + else 5442 + return 0; 5443 + } 5444 + 5445 + static ssize_t 5446 + raid5_store_rmw_level(struct mddev *mddev, const char *page, size_t len) 5447 + { 5448 + struct r5conf *conf = mddev->private; 5449 + unsigned long new; 5450 + 5451 + if (!conf) 5452 + return -ENODEV; 5453 + 5454 + if (len >= PAGE_SIZE) 5455 + return -EINVAL; 5456 + 5457 + if (kstrtoul(page, 10, &new)) 5458 + return -EINVAL; 5459 + 5460 + if (new != PARITY_DISABLE_RMW && !raid6_call.xor_syndrome) 5461 + return -EINVAL; 5462 + 5463 + if (new != PARITY_DISABLE_RMW && 5464 + new != PARITY_ENABLE_RMW && 5465 + new != PARITY_PREFER_RMW) 5466 + return -EINVAL; 5467 + 5468 + conf->rmw_level = new; 5469 + return len; 5470 + } 5471 + 5472 + static struct md_sysfs_entry 5473 + raid5_rmw_level = __ATTR(rmw_level, S_IRUGO | S_IWUSR, 5474 + raid5_show_rmw_level, 5475 + raid5_store_rmw_level); 5476 + 5477 + 5478 + static ssize_t 5902 5479 raid5_show_preread_threshold(struct mddev *mddev, char *page) 5903 5480 { 5904 5481 struct r5conf *conf; ··· 5972 5463 conf = mddev->private; 5973 5464 if (!conf) 5974 5465 err = -ENODEV; 5975 - else if (new > conf->max_nr_stripes) 5466 + else if (new > conf->min_nr_stripes) 5976 5467 err = -EINVAL; 5977 5468 else 5978 5469 conf->bypass_threshold = new; ··· 6127 5618 &raid5_preread_bypass_threshold.attr, 6128 5619 &raid5_group_thread_cnt.attr, 6129 5620 &raid5_skip_copy.attr, 5621 + &raid5_rmw_level.attr, 6130 5622 NULL, 6131 5623 }; 6132 5624 static struct attribute_group raid5_attrs_group = { ··· 6209 5699 static void free_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu) 6210 5700 { 6211 5701 safe_put_page(percpu->spare_page); 6212 - kfree(percpu->scribble); 5702 + if (percpu->scribble) 5703 + flex_array_free(percpu->scribble); 6213 5704 percpu->spare_page = NULL; 6214 5705 percpu->scribble = NULL; 6215 5706 } ··· 6220 5709 if (conf->level == 6 && !percpu->spare_page) 6221 5710 
percpu->spare_page = alloc_page(GFP_KERNEL); 6222 5711 if (!percpu->scribble) 6223 - percpu->scribble = kmalloc(conf->scribble_len, GFP_KERNEL); 5712 + percpu->scribble = scribble_alloc(max(conf->raid_disks, 5713 + conf->previous_raid_disks), conf->chunk_sectors / 5714 + STRIPE_SECTORS, GFP_KERNEL); 6224 5715 6225 5716 if (!percpu->scribble || (conf->level == 6 && !percpu->spare_page)) { 6226 5717 free_scratch_buffer(conf, percpu); ··· 6253 5740 6254 5741 static void free_conf(struct r5conf *conf) 6255 5742 { 5743 + if (conf->shrinker.seeks) 5744 + unregister_shrinker(&conf->shrinker); 6256 5745 free_thread_groups(conf); 6257 5746 shrink_stripes(conf); 6258 5747 raid5_free_percpu(conf); ··· 6320 5805 put_online_cpus(); 6321 5806 6322 5807 return err; 5808 + } 5809 + 5810 + static unsigned long raid5_cache_scan(struct shrinker *shrink, 5811 + struct shrink_control *sc) 5812 + { 5813 + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); 5814 + int ret = 0; 5815 + while (ret < sc->nr_to_scan) { 5816 + if (drop_one_stripe(conf) == 0) 5817 + return SHRINK_STOP; 5818 + ret++; 5819 + } 5820 + return ret; 5821 + } 5822 + 5823 + static unsigned long raid5_cache_count(struct shrinker *shrink, 5824 + struct shrink_control *sc) 5825 + { 5826 + struct r5conf *conf = container_of(shrink, struct r5conf, shrinker); 5827 + 5828 + if (conf->max_nr_stripes < conf->min_nr_stripes) 5829 + /* unlikely, but not impossible */ 5830 + return 0; 5831 + return conf->max_nr_stripes - conf->min_nr_stripes; 6323 5832 } 6324 5833 6325 5834 static struct r5conf *setup_conf(struct mddev *mddev) ··· 6418 5879 else 6419 5880 conf->previous_raid_disks = mddev->raid_disks - mddev->delta_disks; 6420 5881 max_disks = max(conf->raid_disks, conf->previous_raid_disks); 6421 - conf->scribble_len = scribble_len(max_disks); 6422 5882 6423 5883 conf->disks = kzalloc(max_disks * sizeof(struct disk_info), 6424 5884 GFP_KERNEL); ··· 6445 5907 INIT_LIST_HEAD(conf->temp_inactive_list + i); 6446 5908 
6447 5909 conf->level = mddev->new_level; 5910 + conf->chunk_sectors = mddev->new_chunk_sectors; 6448 5911 if (raid5_alloc_percpu(conf) != 0) 6449 5912 goto abort; 6450 5913 ··· 6478 5939 conf->fullsync = 1; 6479 5940 } 6480 5941 6481 - conf->chunk_sectors = mddev->new_chunk_sectors; 6482 5942 conf->level = mddev->new_level; 6483 - if (conf->level == 6) 5943 + if (conf->level == 6) { 6484 5944 conf->max_degraded = 2; 6485 - else 5945 + if (raid6_call.xor_syndrome) 5946 + conf->rmw_level = PARITY_ENABLE_RMW; 5947 + else 5948 + conf->rmw_level = PARITY_DISABLE_RMW; 5949 + } else { 6486 5950 conf->max_degraded = 1; 5951 + conf->rmw_level = PARITY_ENABLE_RMW; 5952 + } 6487 5953 conf->algorithm = mddev->new_layout; 6488 5954 conf->reshape_progress = mddev->reshape_position; 6489 5955 if (conf->reshape_progress != MaxSector) { ··· 6496 5952 conf->prev_algo = mddev->layout; 6497 5953 } 6498 5954 6499 - memory = conf->max_nr_stripes * (sizeof(struct stripe_head) + 5955 + conf->min_nr_stripes = NR_STRIPES; 5956 + memory = conf->min_nr_stripes * (sizeof(struct stripe_head) + 6500 5957 max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024; 6501 5958 atomic_set(&conf->empty_inactive_list_nr, NR_STRIPE_HASH_LOCKS); 6502 - if (grow_stripes(conf, NR_STRIPES)) { 5959 + if (grow_stripes(conf, conf->min_nr_stripes)) { 6503 5960 printk(KERN_ERR 6504 5961 "md/raid:%s: couldn't allocate %dkB for buffers\n", 6505 5962 mdname(mddev), memory); ··· 6508 5963 } else 6509 5964 printk(KERN_INFO "md/raid:%s: allocated %dkB\n", 6510 5965 mdname(mddev), memory); 5966 + /* 5967 + * Losing a stripe head costs more than the time to refill it, 5968 + * it reduces the queue depth and so can hurt throughput. 5969 + * So set it rather large, scaled by number of devices. 
5970 + */ 5971 + conf->shrinker.seeks = DEFAULT_SEEKS * conf->raid_disks * 4; 5972 + conf->shrinker.scan_objects = raid5_cache_scan; 5973 + conf->shrinker.count_objects = raid5_cache_count; 5974 + conf->shrinker.batch = 128; 5975 + conf->shrinker.flags = 0; 5976 + register_shrinker(&conf->shrinker); 6511 5977 6512 5978 sprintf(pers_name, "raid%d", mddev->new_level); 6513 5979 conf->thread = md_register_thread(raid5d, mddev, pers_name); ··· 7160 6604 */ 7161 6605 struct r5conf *conf = mddev->private; 7162 6606 if (((mddev->chunk_sectors << 9) / STRIPE_SIZE) * 4 7163 - > conf->max_nr_stripes || 6607 + > conf->min_nr_stripes || 7164 6608 ((mddev->new_chunk_sectors << 9) / STRIPE_SIZE) * 4 7165 - > conf->max_nr_stripes) { 6609 + > conf->min_nr_stripes) { 7166 6610 printk(KERN_WARNING "md/raid:%s: reshape: not enough stripes. Needed %lu\n", 7167 6611 mdname(mddev), 7168 6612 ((max(mddev->chunk_sectors, mddev->new_chunk_sectors) << 9)
+50 -9
drivers/md/raid5.h
··· 210 210 atomic_t count; /* nr of active thread/requests */ 211 211 int bm_seq; /* sequence number for bitmap flushes */ 212 212 int disks; /* disks in stripe */ 213 + int overwrite_disks; /* total overwrite disks in stripe, 214 + * this is only checked when stripe 215 + * has STRIPE_BATCH_READY 216 + */ 213 217 enum check_states check_state; 214 218 enum reconstruct_states reconstruct_state; 215 219 spinlock_t stripe_lock; 216 220 int cpu; 217 221 struct r5worker_group *group; 222 + 223 + struct stripe_head *batch_head; /* protected by stripe lock */ 224 + spinlock_t batch_lock; /* only header's lock is useful */ 225 + struct list_head batch_list; /* protected by head's batch lock*/ 218 226 /** 219 227 * struct stripe_operations 220 228 * @target - STRIPE_OP_COMPUTE_BLK target ··· 335 327 STRIPE_ON_UNPLUG_LIST, 336 328 STRIPE_DISCARD, 337 329 STRIPE_ON_RELEASE_LIST, 330 + STRIPE_BATCH_READY, 331 + STRIPE_BATCH_ERR, 338 332 }; 339 333 334 + #define STRIPE_EXPAND_SYNC_FLAG \ 335 + ((1 << STRIPE_EXPAND_SOURCE) |\ 336 + (1 << STRIPE_EXPAND_READY) |\ 337 + (1 << STRIPE_EXPANDING) |\ 338 + (1 << STRIPE_SYNC_REQUESTED)) 340 339 /* 341 340 * Operation request flags 342 341 */ ··· 354 339 STRIPE_OP_BIODRAIN, 355 340 STRIPE_OP_RECONSTRUCT, 356 341 STRIPE_OP_CHECK, 342 + }; 343 + 344 + /* 345 + * RAID parity calculation preferences 346 + */ 347 + enum { 348 + PARITY_DISABLE_RMW = 0, 349 + PARITY_ENABLE_RMW, 350 + PARITY_PREFER_RMW, 351 + }; 352 + 353 + /* 354 + * Pages requested from set_syndrome_sources() 355 + */ 356 + enum { 357 + SYNDROME_SRC_ALL, 358 + SYNDROME_SRC_WANT_DRAIN, 359 + SYNDROME_SRC_WRITTEN, 357 360 }; 358 361 /* 359 362 * Plugging: ··· 429 396 spinlock_t hash_locks[NR_STRIPE_HASH_LOCKS]; 430 397 struct mddev *mddev; 431 398 int chunk_sectors; 432 - int level, algorithm; 399 + int level, algorithm, rmw_level; 433 400 int max_degraded; 434 401 int raid_disks; 435 402 int max_nr_stripes; 403 + int min_nr_stripes; 436 404 437 405 /* reshape_progress is the 
leading edge of a 'reshape' 438 406 * It has value MaxSector when no reshape is happening ··· 492 458 /* per cpu variables */ 493 459 struct raid5_percpu { 494 460 struct page *spare_page; /* Used when checking P/Q in raid6 */ 495 - void *scribble; /* space for constructing buffer 461 + struct flex_array *scribble; /* space for constructing buffer 496 462 * lists and performing address 497 463 * conversions 498 464 */ 499 465 } __percpu *percpu; 500 - size_t scribble_len; /* size of scribble region must be 501 - * associated with conf to handle 502 - * cpu hotplug while reshaping 503 - */ 504 466 #ifdef CONFIG_HOTPLUG_CPU 505 467 struct notifier_block cpu_notify; 506 468 #endif ··· 510 480 struct llist_head released_stripes; 511 481 wait_queue_head_t wait_for_stripe; 512 482 wait_queue_head_t wait_for_overlap; 513 - int inactive_blocked; /* release of inactive stripes blocked, 514 - * waiting for 25% to be free 515 - */ 483 + unsigned long cache_state; 484 + #define R5_INACTIVE_BLOCKED 1 /* release of inactive stripes blocked, 485 + * waiting for 25% to be free 486 + */ 487 + #define R5_ALLOC_MORE 2 /* It might help to allocate another 488 + * stripe. 489 + */ 490 + #define R5_DID_ALLOC 4 /* A stripe was allocated, don't allocate 491 + * more until at least one has been 492 + * released. This avoids flooding 493 + * the cache. 494 + */ 495 + struct shrinker shrinker; 516 496 int pool_size; /* number of disks in stripeheads in pool */ 517 497 spinlock_t device_lock; 518 498 struct disk_info *disks; ··· 536 496 int group_cnt; 537 497 int worker_cnt_per_group; 538 498 }; 499 + 539 500 540 501 /* 541 502 * Our supported algorithms
+3
include/linux/async_tx.h
··· 60 60 * dependency chain 61 61 * @ASYNC_TX_FENCE: specify that the next operation in the dependency 62 62 * chain uses this operation's result as an input 63 + * @ASYNC_TX_PQ_XOR_DST: do not overwrite the syndrome but XOR it with the 64 + * input data. Required for rmw case. 63 65 */ 64 66 enum async_tx_flags { 65 67 ASYNC_TX_XOR_ZERO_DST = (1 << 0), 66 68 ASYNC_TX_XOR_DROP_DST = (1 << 1), 67 69 ASYNC_TX_ACK = (1 << 2), 68 70 ASYNC_TX_FENCE = (1 << 3), 71 + ASYNC_TX_PQ_XOR_DST = (1 << 4), 69 72 }; 70 73 71 74 /**
+1
include/linux/raid/pq.h
··· 72 72 /* Routine choices */ 73 73 struct raid6_calls { 74 74 void (*gen_syndrome)(int, size_t, void **); 75 + void (*xor_syndrome)(int, int, int, size_t, void **); 75 76 int (*valid)(void); /* Returns 1 if this routine set is usable */ 76 77 const char *name; /* Name of this routine set */ 77 78 int prefer; /* Has special performance attribute */
+7
include/uapi/linux/raid/md_p.h
··· 78 78 #define MD_DISK_ACTIVE 1 /* disk is running or spare disk */ 79 79 #define MD_DISK_SYNC 2 /* disk is in sync with the raid set */ 80 80 #define MD_DISK_REMOVED 3 /* disk is in sync with the raid set */ 81 + #define MD_DISK_CLUSTER_ADD 4 /* Initiate a disk add across the cluster 82 + * For clustered environments only. 83 + */ 84 + #define MD_DISK_CANDIDATE 5 /* disk is added as spare (local) until confirmed 85 + * For clustered environments only. 86 + */ 81 87 82 88 #define MD_DISK_WRITEMOSTLY 9 /* disk is "write-mostly" is RAID1 config. 83 89 * read requests will only be sent here in ··· 107 101 #define MD_SB_CLEAN 0 108 102 #define MD_SB_ERRORS 1 109 103 104 + #define MD_SB_CLUSTERED 5 /* MD is clustered */ 110 105 #define MD_SB_BITMAP_PRESENT 8 /* bitmap may be present nearby */ 111 106 112 107 /*
+1
include/uapi/linux/raid/md_u.h
··· 62 62 #define STOP_ARRAY _IO (MD_MAJOR, 0x32) 63 63 #define STOP_ARRAY_RO _IO (MD_MAJOR, 0x33) 64 64 #define RESTART_ARRAY_RW _IO (MD_MAJOR, 0x34) 65 + #define CLUSTERED_DISK_NACK _IO (MD_MAJOR, 0x35) 65 66 66 67 /* 63 partitions with the alternate major number (mdp) */ 67 68 #define MdpMinorShift 6
+34 -7
lib/raid6/algos.c
··· 131 131 static inline const struct raid6_calls *raid6_choose_gen( 132 132 void *(*const dptrs)[(65536/PAGE_SIZE)+2], const int disks) 133 133 { 134 - unsigned long perf, bestperf, j0, j1; 134 + unsigned long perf, bestgenperf, bestxorperf, j0, j1; 135 + int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */ 135 136 const struct raid6_calls *const *algo; 136 137 const struct raid6_calls *best; 137 138 138 - for (bestperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { 139 + for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { 139 140 if (!best || (*algo)->prefer >= best->prefer) { 140 141 if ((*algo)->valid && !(*algo)->valid()) 141 142 continue; ··· 154 153 } 155 154 preempt_enable(); 156 155 157 - if (perf > bestperf) { 158 - bestperf = perf; 156 + if (perf > bestgenperf) { 157 + bestgenperf = perf; 159 158 best = *algo; 160 159 } 161 - pr_info("raid6: %-8s %5ld MB/s\n", (*algo)->name, 160 + pr_info("raid6: %-8s gen() %5ld MB/s\n", (*algo)->name, 162 161 (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 162 + 163 + if (!(*algo)->xor_syndrome) 164 + continue; 165 + 166 + perf = 0; 167 + 168 + preempt_disable(); 169 + j0 = jiffies; 170 + while ((j1 = jiffies) == j0) 171 + cpu_relax(); 172 + while (time_before(jiffies, 173 + j1 + (1<<RAID6_TIME_JIFFIES_LG2))) { 174 + (*algo)->xor_syndrome(disks, start, stop, 175 + PAGE_SIZE, *dptrs); 176 + perf++; 177 + } 178 + preempt_enable(); 179 + 180 + if (best == *algo) 181 + bestxorperf = perf; 182 + 183 + pr_info("raid6: %-8s xor() %5ld MB/s\n", (*algo)->name, 184 + (perf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); 163 185 } 164 186 } 165 187 166 188 if (best) { 167 - pr_info("raid6: using algorithm %s (%ld MB/s)\n", 189 + pr_info("raid6: using algorithm %s gen() %ld MB/s\n", 168 190 best->name, 169 - (bestperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 191 + (bestgenperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2)); 192 + if (best->xor_syndrome) 193 + 
pr_info("raid6: .... xor() %ld MB/s, rmw enabled\n", 194 + (bestxorperf*HZ) >> (20-16+RAID6_TIME_JIFFIES_LG2+1)); 170 195 raid6_call = *best; 171 196 } else 172 197 pr_err("raid6: Yikes! No algorithm found!\n");
+1
lib/raid6/altivec.uc
··· 119 119 120 120 const struct raid6_calls raid6_altivec$# = { 121 121 raid6_altivec$#_gen_syndrome, 122 + NULL, /* XOR not yet implemented */ 122 123 raid6_have_altivec, 123 124 "altivecx$#", 124 125 0
+3
lib/raid6/avx2.c
··· 89 89 90 90 const struct raid6_calls raid6_avx2x1 = { 91 91 raid6_avx21_gen_syndrome, 92 + NULL, /* XOR not yet implemented */ 92 93 raid6_have_avx2, 93 94 "avx2x1", 94 95 1 /* Has cache hints */ ··· 151 150 152 151 const struct raid6_calls raid6_avx2x2 = { 153 152 raid6_avx22_gen_syndrome, 153 + NULL, /* XOR not yet implemented */ 154 154 raid6_have_avx2, 155 155 "avx2x2", 156 156 1 /* Has cache hints */ ··· 244 242 245 243 const struct raid6_calls raid6_avx2x4 = { 246 244 raid6_avx24_gen_syndrome, 245 + NULL, /* XOR not yet implemented */ 247 246 raid6_have_avx2, 248 247 "avx2x4", 249 248 1 /* Has cache hints */
+40 -1
lib/raid6/int.uc
··· 107 107 } 108 108 } 109 109 110 + static void raid6_int$#_xor_syndrome(int disks, int start, int stop, 111 + size_t bytes, void **ptrs) 112 + { 113 + u8 **dptr = (u8 **)ptrs; 114 + u8 *p, *q; 115 + int d, z, z0; 116 + 117 + unative_t wd$$, wq$$, wp$$, w1$$, w2$$; 118 + 119 + z0 = stop; /* P/Q right side optimization */ 120 + p = dptr[disks-2]; /* XOR parity */ 121 + q = dptr[disks-1]; /* RS syndrome */ 122 + 123 + for ( d = 0 ; d < bytes ; d += NSIZE*$# ) { 124 + /* P/Q data pages */ 125 + wq$$ = wp$$ = *(unative_t *)&dptr[z0][d+$$*NSIZE]; 126 + for ( z = z0-1 ; z >= start ; z-- ) { 127 + wd$$ = *(unative_t *)&dptr[z][d+$$*NSIZE]; 128 + wp$$ ^= wd$$; 129 + w2$$ = MASK(wq$$); 130 + w1$$ = SHLBYTE(wq$$); 131 + w2$$ &= NBYTES(0x1d); 132 + w1$$ ^= w2$$; 133 + wq$$ = w1$$ ^ wd$$; 134 + } 135 + /* P/Q left side optimization */ 136 + for ( z = start-1 ; z >= 0 ; z-- ) { 137 + w2$$ = MASK(wq$$); 138 + w1$$ = SHLBYTE(wq$$); 139 + w2$$ &= NBYTES(0x1d); 140 + wq$$ = w1$$ ^ w2$$; 141 + } 142 + *(unative_t *)&p[d+NSIZE*$$] ^= wp$$; 143 + *(unative_t *)&q[d+NSIZE*$$] ^= wq$$; 144 + } 145 + 146 + } 147 + 110 148 const struct raid6_calls raid6_intx$# = { 111 149 raid6_int$#_gen_syndrome, 112 - NULL, /* always valid */ 150 + raid6_int$#_xor_syndrome, 151 + NULL, /* always valid */ 113 152 "int" NSTRING "x$#", 114 153 0 115 154 };
+2
lib/raid6/mmx.c
··· 76 76 77 77 const struct raid6_calls raid6_mmxx1 = { 78 78 raid6_mmx1_gen_syndrome, 79 + NULL, /* XOR not yet implemented */ 79 80 raid6_have_mmx, 80 81 "mmxx1", 81 82 0 ··· 135 134 136 135 const struct raid6_calls raid6_mmxx2 = { 137 136 raid6_mmx2_gen_syndrome, 137 + NULL, /* XOR not yet implemented */ 138 138 raid6_have_mmx, 139 139 "mmxx2", 140 140 0
+1
lib/raid6/neon.c
··· 42 42 } \ 43 43 struct raid6_calls const raid6_neonx ## _n = { \ 44 44 raid6_neon ## _n ## _gen_syndrome, \ 45 + NULL, /* XOR not yet implemented */ \ 45 46 raid6_have_neon, \ 46 47 "neonx" #_n, \ 47 48 0 \
+2
lib/raid6/sse1.c
··· 92 92 93 93 const struct raid6_calls raid6_sse1x1 = { 94 94 raid6_sse11_gen_syndrome, 95 + NULL, /* XOR not yet implemented */ 95 96 raid6_have_sse1_or_mmxext, 96 97 "sse1x1", 97 98 1 /* Has cache hints */ ··· 155 154 156 155 const struct raid6_calls raid6_sse1x2 = { 157 156 raid6_sse12_gen_syndrome, 157 + NULL, /* XOR not yet implemented */ 158 158 raid6_have_sse1_or_mmxext, 159 159 "sse1x2", 160 160 1 /* Has cache hints */
+227
lib/raid6/sse2.c
··· 88 88 kernel_fpu_end(); 89 89 } 90 90 91 + 92 + static void raid6_sse21_xor_syndrome(int disks, int start, int stop, 93 + size_t bytes, void **ptrs) 94 + { 95 + u8 **dptr = (u8 **)ptrs; 96 + u8 *p, *q; 97 + int d, z, z0; 98 + 99 + z0 = stop; /* P/Q right side optimization */ 100 + p = dptr[disks-2]; /* XOR parity */ 101 + q = dptr[disks-1]; /* RS syndrome */ 102 + 103 + kernel_fpu_begin(); 104 + 105 + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); 106 + 107 + for ( d = 0 ; d < bytes ; d += 16 ) { 108 + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); 109 + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); 110 + asm volatile("pxor %xmm4,%xmm2"); 111 + /* P/Q data pages */ 112 + for ( z = z0-1 ; z >= start ; z-- ) { 113 + asm volatile("pxor %xmm5,%xmm5"); 114 + asm volatile("pcmpgtb %xmm4,%xmm5"); 115 + asm volatile("paddb %xmm4,%xmm4"); 116 + asm volatile("pand %xmm0,%xmm5"); 117 + asm volatile("pxor %xmm5,%xmm4"); 118 + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); 119 + asm volatile("pxor %xmm5,%xmm2"); 120 + asm volatile("pxor %xmm5,%xmm4"); 121 + } 122 + /* P/Q left side optimization */ 123 + for ( z = start-1 ; z >= 0 ; z-- ) { 124 + asm volatile("pxor %xmm5,%xmm5"); 125 + asm volatile("pcmpgtb %xmm4,%xmm5"); 126 + asm volatile("paddb %xmm4,%xmm4"); 127 + asm volatile("pand %xmm0,%xmm5"); 128 + asm volatile("pxor %xmm5,%xmm4"); 129 + } 130 + asm volatile("pxor %0,%%xmm4" : : "m" (q[d])); 131 + /* Don't use movntdq for r/w memory area < cache line */ 132 + asm volatile("movdqa %%xmm4,%0" : "=m" (q[d])); 133 + asm volatile("movdqa %%xmm2,%0" : "=m" (p[d])); 134 + } 135 + 136 + asm volatile("sfence" : : : "memory"); 137 + kernel_fpu_end(); 138 + } 139 + 91 140 const struct raid6_calls raid6_sse2x1 = { 92 141 raid6_sse21_gen_syndrome, 142 + raid6_sse21_xor_syndrome, 93 143 raid6_have_sse2, 94 144 "sse2x1", 95 145 1 /* Has cache hints */ ··· 200 150 kernel_fpu_end(); 201 151 } 202 152 153 + static void 
raid6_sse22_xor_syndrome(int disks, int start, int stop, 154 + size_t bytes, void **ptrs) 155 + { 156 + u8 **dptr = (u8 **)ptrs; 157 + u8 *p, *q; 158 + int d, z, z0; 159 + 160 + z0 = stop; /* P/Q right side optimization */ 161 + p = dptr[disks-2]; /* XOR parity */ 162 + q = dptr[disks-1]; /* RS syndrome */ 163 + 164 + kernel_fpu_begin(); 165 + 166 + asm volatile("movdqa %0,%%xmm0" : : "m" (raid6_sse_constants.x1d[0])); 167 + 168 + for ( d = 0 ; d < bytes ; d += 32 ) { 169 + asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d])); 170 + asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16])); 171 + asm volatile("movdqa %0,%%xmm2" : : "m" (p[d])); 172 + asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16])); 173 + asm volatile("pxor %xmm4,%xmm2"); 174 + asm volatile("pxor %xmm6,%xmm3"); 175 + /* P/Q data pages */ 176 + for ( z = z0-1 ; z >= start ; z-- ) { 177 + asm volatile("pxor %xmm5,%xmm5"); 178 + asm volatile("pxor %xmm7,%xmm7"); 179 + asm volatile("pcmpgtb %xmm4,%xmm5"); 180 + asm volatile("pcmpgtb %xmm6,%xmm7"); 181 + asm volatile("paddb %xmm4,%xmm4"); 182 + asm volatile("paddb %xmm6,%xmm6"); 183 + asm volatile("pand %xmm0,%xmm5"); 184 + asm volatile("pand %xmm0,%xmm7"); 185 + asm volatile("pxor %xmm5,%xmm4"); 186 + asm volatile("pxor %xmm7,%xmm6"); 187 + asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d])); 188 + asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16])); 189 + asm volatile("pxor %xmm5,%xmm2"); 190 + asm volatile("pxor %xmm7,%xmm3"); 191 + asm volatile("pxor %xmm5,%xmm4"); 192 + asm volatile("pxor %xmm7,%xmm6"); 193 + } 194 + /* P/Q left side optimization */ 195 + for ( z = start-1 ; z >= 0 ; z-- ) { 196 + asm volatile("pxor %xmm5,%xmm5"); 197 + asm volatile("pxor %xmm7,%xmm7"); 198 + asm volatile("pcmpgtb %xmm4,%xmm5"); 199 + asm volatile("pcmpgtb %xmm6,%xmm7"); 200 + asm volatile("paddb %xmm4,%xmm4"); 201 + asm volatile("paddb %xmm6,%xmm6"); 202 + asm volatile("pand %xmm0,%xmm5"); 203 + asm volatile("pand %xmm0,%xmm7"); 204 + asm volatile("pxor 
+		asm volatile("pxor %xmm5,%xmm4");
+		asm volatile("pxor %xmm7,%xmm6");
+		}
+		asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
+		asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
+		/* Don't use movntdq for r/w memory area < cache line */
+		asm volatile("movdqa %%xmm4,%0" : "=m" (q[d]));
+		asm volatile("movdqa %%xmm6,%0" : "=m" (q[d+16]));
+		asm volatile("movdqa %%xmm2,%0" : "=m" (p[d]));
+		asm volatile("movdqa %%xmm3,%0" : "=m" (p[d+16]));
+	}
+
+	asm volatile("sfence" : : : "memory");
+	kernel_fpu_end();
+}
+
 const struct raid6_calls raid6_sse2x2 = {
 	raid6_sse22_gen_syndrome,
+	raid6_sse22_xor_syndrome,
 	raid6_have_sse2,
 	"sse2x2",
 	1		/* Has cache hints */
···
 	kernel_fpu_end();
 }
 
+static void raid6_sse24_xor_syndrome(int disks, int start, int stop,
+				     size_t bytes, void **ptrs)
+{
+	u8 **dptr = (u8 **)ptrs;
+	u8 *p, *q;
+	int d, z, z0;
+
+	z0 = stop;		/* P/Q right side optimization */
+	p = dptr[disks-2];	/* XOR parity */
+	q = dptr[disks-1];	/* RS syndrome */
+
+	kernel_fpu_begin();
+
+	asm volatile("movdqa %0,%%xmm0" :: "m" (raid6_sse_constants.x1d[0]));
+
+	for ( d = 0 ; d < bytes ; d += 64 ) {
+		asm volatile("movdqa %0,%%xmm4" :: "m" (dptr[z0][d]));
+		asm volatile("movdqa %0,%%xmm6" :: "m" (dptr[z0][d+16]));
+		asm volatile("movdqa %0,%%xmm12" :: "m" (dptr[z0][d+32]));
+		asm volatile("movdqa %0,%%xmm14" :: "m" (dptr[z0][d+48]));
+		asm volatile("movdqa %0,%%xmm2" : : "m" (p[d]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (p[d+16]));
+		asm volatile("movdqa %0,%%xmm10" : : "m" (p[d+32]));
+		asm volatile("movdqa %0,%%xmm11" : : "m" (p[d+48]));
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm6,%xmm3");
+		asm volatile("pxor %xmm12,%xmm10");
+		asm volatile("pxor %xmm14,%xmm11");
+		/* P/Q data pages */
+		for ( z = z0-1 ; z >= start ; z-- ) {
+			asm volatile("prefetchnta %0" :: "m" (dptr[z][d]));
+			asm volatile("prefetchnta %0" :: "m" (dptr[z][d+32]));
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm7,%xmm7");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pxor %xmm15,%xmm15");
+			asm volatile("pcmpgtb %xmm4,%xmm5");
+			asm volatile("pcmpgtb %xmm6,%xmm7");
+			asm volatile("pcmpgtb %xmm12,%xmm13");
+			asm volatile("pcmpgtb %xmm14,%xmm15");
+			asm volatile("paddb %xmm4,%xmm4");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("paddb %xmm12,%xmm12");
+			asm volatile("paddb %xmm14,%xmm14");
+			asm volatile("pand %xmm0,%xmm5");
+			asm volatile("pand %xmm0,%xmm7");
+			asm volatile("pand %xmm0,%xmm13");
+			asm volatile("pand %xmm0,%xmm15");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+			asm volatile("movdqa %0,%%xmm5" :: "m" (dptr[z][d]));
+			asm volatile("movdqa %0,%%xmm7" :: "m" (dptr[z][d+16]));
+			asm volatile("movdqa %0,%%xmm13" :: "m" (dptr[z][d+32]));
+			asm volatile("movdqa %0,%%xmm15" :: "m" (dptr[z][d+48]));
+			asm volatile("pxor %xmm5,%xmm2");
+			asm volatile("pxor %xmm7,%xmm3");
+			asm volatile("pxor %xmm13,%xmm10");
+			asm volatile("pxor %xmm15,%xmm11");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+		}
+		asm volatile("prefetchnta %0" :: "m" (q[d]));
+		asm volatile("prefetchnta %0" :: "m" (q[d+32]));
+		/* P/Q left side optimization */
+		for ( z = start-1 ; z >= 0 ; z-- ) {
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm7,%xmm7");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pxor %xmm15,%xmm15");
+			asm volatile("pcmpgtb %xmm4,%xmm5");
+			asm volatile("pcmpgtb %xmm6,%xmm7");
+			asm volatile("pcmpgtb %xmm12,%xmm13");
+			asm volatile("pcmpgtb %xmm14,%xmm15");
+			asm volatile("paddb %xmm4,%xmm4");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("paddb %xmm12,%xmm12");
+			asm volatile("paddb %xmm14,%xmm14");
+			asm volatile("pand %xmm0,%xmm5");
+			asm volatile("pand %xmm0,%xmm7");
+			asm volatile("pand %xmm0,%xmm13");
+			asm volatile("pand %xmm0,%xmm15");
+			asm volatile("pxor %xmm5,%xmm4");
+			asm volatile("pxor %xmm7,%xmm6");
+			asm volatile("pxor %xmm13,%xmm12");
+			asm volatile("pxor %xmm15,%xmm14");
+		}
+		asm volatile("movntdq %%xmm2,%0" : "=m" (p[d]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (p[d+16]));
+		asm volatile("movntdq %%xmm10,%0" : "=m" (p[d+32]));
+		asm volatile("movntdq %%xmm11,%0" : "=m" (p[d+48]));
+		asm volatile("pxor %0,%%xmm4" : : "m" (q[d]));
+		asm volatile("pxor %0,%%xmm6" : : "m" (q[d+16]));
+		asm volatile("pxor %0,%%xmm12" : : "m" (q[d+32]));
+		asm volatile("pxor %0,%%xmm14" : : "m" (q[d+48]));
+		asm volatile("movntdq %%xmm4,%0" : "=m" (q[d]));
+		asm volatile("movntdq %%xmm6,%0" : "=m" (q[d+16]));
+		asm volatile("movntdq %%xmm12,%0" : "=m" (q[d+32]));
+		asm volatile("movntdq %%xmm14,%0" : "=m" (q[d+48]));
+	}
+	asm volatile("sfence" : : : "memory");
+	kernel_fpu_end();
+}
+
 const struct raid6_calls raid6_sse2x4 = {
 	raid6_sse24_gen_syndrome,
+	raid6_sse24_xor_syndrome,
 	raid6_have_sse2,
 	"sse2x4",
 	1		/* Has cache hints */
+36 -15
lib/raid6/test/test.c
···
 char data[NDISKS][PAGE_SIZE];
 char recovi[PAGE_SIZE], recovj[PAGE_SIZE];
 
-static void makedata(void)
+static void makedata(int start, int stop)
 {
 	int i, j;
 
-	for (i = 0; i < NDISKS; i++) {
+	for (i = start; i <= stop; i++) {
 		for (j = 0; j < PAGE_SIZE; j++)
 			data[i][j] = rand();
 
···
 {
 	const struct raid6_calls *const *algo;
 	const struct raid6_recov_calls *const *ra;
-	int i, j;
+	int i, j, p1, p2;
 	int err = 0;
 
-	makedata();
+	makedata(0, NDISKS-1);
 
 	for (ra = raid6_recov_algos; *ra; ra++) {
 		if ((*ra)->valid && !(*ra)->valid())
 			continue;
+
 		raid6_2data_recov = (*ra)->data2;
 		raid6_datap_recov = (*ra)->datap;
 
 		printf("using recovery %s\n", (*ra)->name);
 
 		for (algo = raid6_algos; *algo; algo++) {
-			if (!(*algo)->valid || (*algo)->valid()) {
-				raid6_call = **algo;
+			if ((*algo)->valid && !(*algo)->valid())
+				continue;
 
-				/* Nuke syndromes */
-				memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
+			raid6_call = **algo;
 
-				/* Generate assumed good syndrome */
-				raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
-							(void **)&dataptrs);
+			/* Nuke syndromes */
+			memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
 
-				for (i = 0; i < NDISKS-1; i++)
-					for (j = i+1; j < NDISKS; j++)
-						err += test_disks(i, j);
-			}
+			/* Generate assumed good syndrome */
+			raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
+						(void **)&dataptrs);
+
+			for (i = 0; i < NDISKS-1; i++)
+				for (j = i+1; j < NDISKS; j++)
+					err += test_disks(i, j);
+
+			if (!raid6_call.xor_syndrome)
+				continue;
+
+			for (p1 = 0; p1 < NDISKS-2; p1++)
+				for (p2 = p1; p2 < NDISKS-2; p2++) {
+
+					/* Simulate rmw run */
+					raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+								(void **)&dataptrs);
+					makedata(p1, p2);
+					raid6_call.xor_syndrome(NDISKS, p1, p2, PAGE_SIZE,
+								(void **)&dataptrs);
+
+					for (i = 0; i < NDISKS-1; i++)
+						for (j = i+1; j < NDISKS; j++)
+							err += test_disks(i, j);
+				}
+
 		}
 		printf("\n");
 	}
+1
lib/raid6/tilegx.uc
···
 
 const struct raid6_calls raid6_tilegx$# = {
 	raid6_tilegx$#_gen_syndrome,
+	NULL,			/* XOR not yet implemented */
 	NULL,
 	"tilegx$#",
 	0